[mvapich-discuss] unrecognized protocol for send/recv over 8KB (fwd)
Brian Budge
brian.budge at gmail.com
Tue Jan 8 11:27:57 EST 2008
Hi Matt -
ibv_rc_pingpong worked, and I decided to try a clean install from scratch; it
seems to be working quite a bit better now. I must have somehow introduced
some nasty stuff into the Makefile during my previous attempts.
Here is the output:
# OSU MPI Bandwidth Test v3.0
# Size          Bandwidth (MB/s)
1               1.18
2               2.59
4               4.92
8               10.38
16              20.31
32              40.12
64              77.14
128             144.37
256             241.72
512             362.12
1024            471.01
2048            546.45
4096            581.47
8192            600.65
16384           611.52
32768           632.87
65536           642.27
131072          646.30
262144          644.22
524288          644.15
1048576         649.36
2097152         662.55
4194304         672.55
How do these numbers look for a 10 Gb SDR HCA?
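(If my arithmetic is right, SDR signals at 10 Gb/s, and 8b/10b encoding
leaves 8 Gb/s of payload, so the theoretical ceiling is roughly 1 GB/s
before PCI and protocol overhead; ~650 MB/s at least seems to be in the
right ballpark, but I'd appreciate confirmation.)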
Thanks for your help!
Brian
On Jan 7, 2008 5:12 PM, Matthew Koop <koop at cse.ohio-state.edu> wrote:
> Brian,
>
> Can you try the ibv_rc_pingpong program, which is a low-level (non-MPI)
> test that ships with OFED? This will make sure that your basic InfiniBand
> setup is working properly.
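>
> For example (hostnames below are placeholders; run the bare command on one
> node, then point the second node at it):
>
>   node1$ ibv_rc_pingpong
>   node2$ ibv_rc_pingpong node1
>
> If that completes and reports a bandwidth figure, the verbs layer itself
> is fine.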
>
> Did any other error message print out other than the one you gave?
>
> Matt
>
> On Mon, 7 Jan 2008, Brian Budge wrote:
>
> > Hi Matt -
> >
> > I have now done the install from the ofa build file, and I can boot and
> > run the ring test, but now when I run the osu_bw.c benchmark, the
> > executable dies in MPI_Init().
> >
> > The things I altered in make.mvapich2.ofa were:
> >
> > OPEN_IB_HOME=${OPEN_IB_HOME:-/usr}
> > SHARED_LIBS=${SHARED_LIBS:-yes}
> >
> > and on the configure line I added:
> > --disable-f77 --disable-f90
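> >
> > (In other words, effectively OPEN_IB_HOME=/usr SHARED_LIBS=yes
> > ./make.mvapich2.ofa, with those two flags appended to the script's
> > configure line; that invocation is my shorthand, since the ${VAR:-default}
> > syntax above also accepts environment overrides, not a copy of the script.)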
> >
> > Here is the error message that I am getting:
> >
> > rank 1 in job 1 burn_60139 caused collective abort of all ranks
> > exit status of rank 1: killed by signal 9
> >
> > Thanks,
> > Brian
> >
> > On Jan 7, 2008 1:21 PM, Matthew Koop <koop at cse.ohio-state.edu> wrote:
> >
> > > Brian,
> > >
> > > The make.mvapich2.detect script is just a helper script (not meant to
> > > be executed directly). You need to use the make.mvapich2.ofa script,
> > > which will call configure and make for you with the correct arguments.
> > >
> > > More information can be found in our MVAPICH2 user guide under
> > > "4.4.1 Build MVAPICH2 with OpenFabrics Gen2-IB and iWARP"
> > >
> > > https://mvapich.cse.ohio-state.edu/support/
> > >
> > > Let us know if you have any other problems.
> > >
> > > Matt
> > >
> > > On Mon, 7 Jan 2008, Brian Budge wrote:
> > >
> > > > Hi Wei -
> > > >
> > > > I changed from SMALL_CLUSTER to MEDIUM_CLUSTER, but it made no
> > > > difference.
> > > >
> > > > When I build with rdma, this adds the following:
> > > > export LIBS="${LIBS} -lrdmacm"
> > > > export CFLAGS="${CFLAGS} -DADAPTIVE_RDMA_FAST_PATH -DRDMA_CM"
> > > >
> > > > It seems that I am using the make.mvapich2.detect script to build. It
> > > > asks me for my interface, and gives me the option for the mellanox
> > > > interface, which I choose.
> > > >
> > > > I just tried a fresh install directly from the tarball instead of
> > > > using the gentoo package. Now the program completes (goes beyond the
> > > > 8K message), but my bandwidth isn't very good. Running the osu_bw.c
> > > > test, I get about 250 MB/s maximum. It seems like IB isn't being used.
> > > >
> > > > I did the following:
> > > > ./make.mvapich2.detect    # chose the mellanox option
> > > > ./configure --enable-threads=multiple
> > > > make
> > > > make install
> > > >
> > > > So it seems that the package is doing something to enable infiniband
> > > > that I am not doing with the tarball. Conversely, the tarball can run
> > > > without crashing.
> > > >
> > > > Advice?
> > > >
> > > > Thanks,
> > > > Brian
> > > >
> > > > On Jan 6, 2008 6:38 AM, wei huang <huanwei at cse.ohio-state.edu> wrote:
> > > >
> > > > > Hi Brian,
> > > > >
> > > > > > I am using the openib-mvapich2-1.0.1 package in the gentoo-science
> > > > > > overlay in addition to the standard gentoo packages. I have also
> > > > > > tried 1.0 with the same results.
> > > > > >
> > > > > > I compiled with multithreading turned on (haven't tried without
> > > > > > this, but the sample codes I am initially testing are not
> > > > > > multithreaded, although my application is). I also tried with or
> > > > > > without rdma with no change. The script seems to be setting the
> > > > > > build for SMALL_CLUSTER.
> > > > >
> > > > > So you are using make.mvapich2.ofa to compile the package? I am a
> > > > > bit confused about ''I also tried with or without rdma with no
> > > > > change''. What exact change did you make here? Also, SMALL_CLUSTER
> > > > > is obsolete for the ofa stack...
> > > > >
> > > > > -- Wei
> > > > >
> > > > > >
> > > > > > Let me know what other information would be useful.
> > > > > >
> > > > > > Thanks,
> > > > > > Brian
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Jan 4, 2008 6:12 PM, wei huang <huanwei at cse.ohio-state.edu> wrote:
> > > > > >
> > > > > > > Hi Brian,
> > > > > > >
> > > > > > > Thanks for letting us know about this problem. Would you please
> > > > > > > let us know some more details to help us locate the issue.
> > > > > > >
> > > > > > > 1) More details on your platform.
> > > > > > >
> > > > > > > 2) Exact version of mvapich2 you are using. Is it from the OFED
> > > > > > > package, or some version from our website?
> > > > > > >
> > > > > > > 3) If it is from our website, did you change anything from the
> > > > > > > default compiling scripts?
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > > -- Wei
> > > > > > > > I'm new to the list here... hi! I have been using OpenMPI for
> > > > > > > > a while, and LAM before that, but new requirements keep pushing
> > > > > > > > me to new implementations. In particular, I was interested in
> > > > > > > > using infiniband (using OFED 1.2.5.1) in a multi-threaded
> > > > > > > > environment. It seems that MVAPICH is the library for that
> > > > > > > > particular combination :)
> > > > > > > >
> > > > > > > > In any case, I installed MVAPICH, and I can boot the daemons,
> > > > > > > > and run the ring speed test with no problems. When I run any
> > > > > > > > programs with mpirun, however, I get an error when sending or
> > > > > > > > receiving more than 8192 bytes.
> > > > > > > >
> > > > > > > > For example, if I run the bandwidth test from the benchmarks
> > > > > > > > page (osu_bw.c), I get the following:
> > > > > > > >
> > > > > > > > ---------------------------------------------------------------
> > > > > > > > budge at burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> > > > > > > > Thursday 06:16:00
> > > > > > > > burn
> > > > > > > > burn-3
> > > > > > > > # OSU MPI Bandwidth Test v3.0
> > > > > > > > # Size          Bandwidth (MB/s)
> > > > > > > > 1               1.24
> > > > > > > > 2               2.72
> > > > > > > > 4               5.44
> > > > > > > > 8               10.18
> > > > > > > > 16              19.09
> > > > > > > > 32              29.69
> > > > > > > > 64              65.01
> > > > > > > > 128             147.31
> > > > > > > > 256             244.61
> > > > > > > > 512             354.32
> > > > > > > > 1024            367.91
> > > > > > > > 2048            451.96
> > > > > > > > 4096            550.66
> > > > > > > > 8192            598.35
> > > > > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv
> > > > > > > > req to send
> > > > > > > > Internal Error: invalid error code ffffffff (Ring Index out of
> > > > > > > > range) in MPIDI_CH3_RndvSend:263
> > > > > > > > Fatal error in MPI_Waitall:
> > > > > > > > Other MPI error, error stack:
> > > > > > > > MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0,
> > > > > > > > status_array=0xdb3140) failed
> > > > > > > > (unknown)(): Other MPI error
> > > > > > > > rank 1 in job 4 burn_37156 caused collective abort of all ranks
> > > > > > > > exit status of rank 1: killed by signal 9
> > > > > > > > ---------------------------------------------------------------
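> > > > > > > >
> > > > > > > > (For context on the failure mode: the core of osu_bw is
> > > > > > > > essentially a windowed Isend/Waitall loop, something like the
> > > > > > > > sketch below. It is my own paraphrase, not the actual OSU
> > > > > > > > source. The window of 64 matches the count=64 in the
> > > > > > > > MPI_Waitall error above, and messages past the ~8KB eager
> > > > > > > > threshold switch to the rendezvous path that is failing.)
> > > > > > > >
> > > > > > > > #include <mpi.h>
> > > > > > > > #include <stdlib.h>
> > > > > > > >
> > > > > > > > #define WINDOW 64  /* matches count=64 in the Waitall error */
> > > > > > > >
> > > > > > > > int main(int argc, char **argv) {
> > > > > > > >     int rank, i, size;
> > > > > > > >     char *buf;
> > > > > > > >     MPI_Request req[WINDOW];
> > > > > > > >
> > > > > > > >     MPI_Init(&argc, &argv);
> > > > > > > >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> > > > > > > >     buf = malloc(4 * 1024 * 1024);  /* largest size tested */
> > > > > > > >
> > > > > > > >     for (size = 1; size <= 4 * 1024 * 1024; size *= 2) {
> > > > > > > >         if (rank == 0) {   /* sender: post a window of Isends */
> > > > > > > >             for (i = 0; i < WINDOW; i++)
> > > > > > > >                 MPI_Isend(buf, size, MPI_CHAR, 1, 1,
> > > > > > > >                           MPI_COMM_WORLD, &req[i]);
> > > > > > > >             MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
> > > > > > > >             MPI_Recv(buf, 1, MPI_CHAR, 1, 2, MPI_COMM_WORLD,
> > > > > > > >                      MPI_STATUS_IGNORE);  /* ack per size */
> > > > > > > >         } else if (rank == 1) {  /* receiver: pre-post Irecvs */
> > > > > > > >             for (i = 0; i < WINDOW; i++)
> > > > > > > >                 MPI_Irecv(buf, size, MPI_CHAR, 0, 1,
> > > > > > > >                           MPI_COMM_WORLD, &req[i]);
> > > > > > > >             MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
> > > > > > > >             MPI_Send(buf, 1, MPI_CHAR, 0, 2, MPI_COMM_WORLD);
> > > > > > > >         }
> > > > > > > >     }
> > > > > > > >     free(buf);
> > > > > > > >     MPI_Finalize();
> > > > > > > >     return 0;
> > > > > > > > }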
> > > > > > > >
> > > > > > > > I get a similar problem with the latency test; however, the
> > > > > > > > protocol that is complained about is different:
> > > > > > > >
> > > > > > > > --------------------------------------------------------------------
> > > > > > > > budge at burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> > > > > > > > Thursday 09:21:20
> > > > > > > > # OSU MPI Latency Test v3.0
> > > > > > > > # Size          Latency (us)
> > > > > > > > 0               3.93
> > > > > > > > 1               4.07
> > > > > > > > 2               4.06
> > > > > > > > 4               3.82
> > > > > > > > 8               3.98
> > > > > > > > 16              4.03
> > > > > > > > 32              4.00
> > > > > > > > 64              4.28
> > > > > > > > 128             5.22
> > > > > > > > 256             5.88
> > > > > > > > 512             8.65
> > > > > > > > 1024            9.11
> > > > > > > > 2048            11.53
> > > > > > > > 4096            16.17
> > > > > > > > 8192            25.67
> > > > > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from
> > > > > > > > rndv req to send
> > > > > > > > Internal Error: invalid error code ffffffff (Ring Index out of
> > > > > > > > range) in MPIDI_CH3_RndvSend:263
> > > > > > > > Fatal error in MPI_Recv:
> > > > > > > > Other MPI error, error stack:
> > > > > > > > MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR,
> > > > > > > > src=0, tag=1, MPI_COMM_WORLD, status=0x7fff14c7bde0) failed
> > > > > > > > (unknown)(): Other MPI error
> > > > > > > > rank 1 in job 5 burn_37156 caused collective abort of all ranks
> > > > > > > >
> > > > > > > > --------------------------------------------------------------------
> > > > > > > >
> > > > > > > > The protocols (0 and 8126589) are consistent if I run the
> > > > > > > > program multiple times.
> > > > > > > >
> > > > > > > > Anyone have any ideas? If you need more info, please let me know.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Brian
> > > > > > > >