[mvapich-discuss] unrecognized protocol for send/recv over 8KB (fwd)

Brian Budge brian.budge at gmail.com
Tue Jan 8 11:27:57 EST 2008


Hi Matt -

ibv_rc_pingpong worked, so I decided to do a fresh, clean install, and things
seem to be working quite a bit better now.  I must have somehow introduced some
nasty stuff into the Makefile during my previous attempts.

Here is the output:

# OSU MPI Bandwidth Test v3.0
# Size        Bandwidth (MB/s)
1                         1.18
2                         2.59
4                         4.92
8                        10.38
16                       20.31
32                       40.12
64                       77.14
128                     144.37
256                     241.72
512                     362.12
1024                    471.01
2048                    546.45
4096                    581.47
8192                    600.65
16384                   611.52
32768                   632.87
65536                   642.27
131072                  646.30
262144                  644.22
524288                  644.15
1048576                 649.36
2097152                 662.55
4194304                 672.55

How do these numbers look for a 10 Gb/s SDR HCA?

Thanks for your help!
  Brian

On Jan 7, 2008 5:12 PM, Matthew Koop <koop at cse.ohio-state.edu> wrote:

> Brian,
>
> Can you try the ibv_rc_pingpong program, which is a low-level (non-MPI)
> test that ships with OFED? This will make sure that your basic InfiniBand
> setup is working properly.
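
For reference, a minimal sketch of running that OFED-level test between the two
hosts that appear later in this thread (burn and burn-3); on some systems the
HCA device and port may also have to be selected with the tool's options:

  # on one node, start ibv_rc_pingpong with no arguments (it acts as the server)
  budge@burn:~> ibv_rc_pingpong
  # on the other node, point the client at the server's hostname
  budge@burn-3:~> ibv_rc_pingpong burn
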
>
> Did any error messages print out other than the one you gave?
>
> Matt
>
> On Mon, 7 Jan 2008, Brian Budge wrote:
>
> > Hi Matt -
> >
> > I have now done the install from the ofa build file, and I can boot and
> > run the ring test, but now when I run the osu_bw.c benchmark, the
> > executable dies in MPI_Init().
> >
> > The things I altered in make.mvapich2.ofa were:
> >
> > OPEN_IB_HOME=${OPEN_IB_HOME:-/usr}
> > SHARED_LIBS=${SHARED_LIBS:-yes}
> >
> > and on the configure line I added:
> >   --disable-f77 --disable-f90
> >
> > Here is the error message that I am getting:
> >
> > rank 1 in job 1  burn_60139   caused collective abort of all ranks
> >   exit status of rank 1: killed by signal 9
> >
> > Thanks,
> >   Brian
> >
> > On Jan 7, 2008 1:21 PM, Matthew Koop <koop at cse.ohio-state.edu> wrote:
> >
> > > Brian,
> > >
> > > The make.mvapich2.detect script is just a helper script (not meant to
> > > be executed directly). You need to use the make.mvapich2.ofa script, which
> > > will call configure and make for you with the correct arguments.
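
For illustration, one way to drive that script with the two overrides Brian
mentions above; the script reads them via ${VAR:-default} shell expansions, so
exported values take precedence (the source path here is only a placeholder,
and extra configure flags such as --disable-f77 --disable-f90 still have to be
added to the configure line inside the script itself):

  # placeholder path to the unpacked mvapich2-1.0.1 tarball
  cd ~/src/mvapich2-1.0.1
  # environment values override the script's OPEN_IB_HOME/SHARED_LIBS defaults
  OPEN_IB_HOME=/usr SHARED_LIBS=yes ./make.mvapich2.ofa
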
> > >
> > > More information can be found in our MVAPICH2 user guide under
> > > "4.4.1 Build MVAPICH2 with OpenFabrics Gen2-IB and iWARP"
> > >
> > > https://mvapich.cse.ohio-state.edu/support/
> > >
> > > Let us know if you have any other problems.
> > >
> > > Matt
> > >
> > >
> > >
> > >
> > > On Mon, 7 Jan 2008, Brian Budge wrote:
> > >
> > > > Hi Wei -
> > > >
> > > > I changed from SMALL_CLUSTER to MEDIUM_CLUSTER, but it made no
> > > > difference.
> > > >
> > > > When I build with rdma, this adds the following:
> > > >         export LIBS="${LIBS} -lrdmacm"
> > > >         export CFLAGS="${CFLAGS} -DADAPTIVE_RDMA_FAST_PATH -DRDMA_CM"
> > > >
> > > > It seems that I am using the make.mvapich2.detect script to build.  It
> > > > asks me for my interface, and gives me the option for the Mellanox
> > > > interface, which I choose.
> > > >
> > > > I just tried a fresh install directly from the tarball instead of using
> > > > the Gentoo package.  Now the program completes (it goes beyond the 8 KB
> > > > message size), but my bandwidth isn't very good.  Running the osu_bw.c
> > > > test, I get about 250 MB/s maximum.  It seems like IB isn't being used.
> > > >
> > > > I did the following:
> > > > ./make.mvapich2.detect    # and chose the Mellanox option
> > > > ./configure --enable-threads=multiple
> > > > make
> > > > make install
> > > >
> > > > So it seems that the package is doing something to enable InfiniBand
> > > > that I am not doing with the tarball.  Conversely, the tarball build runs
> > > > without crashing.
> > > >
> > > > Advice?
> > > >
> > > > Thanks,
> > > >   Brian
> > > >
> > > > On Jan 6, 2008 6:38 AM, wei huang <huanwei at cse.ohio-state.edu> wrote:
> > > >
> > > > > Hi Brian,
> > > > >
> > > > > > I am using the openib-mvapich2-1.0.1 package in the gentoo-science
> > > > > > overlay, in addition to the standard Gentoo packages.  I have also
> > > > > > tried 1.0 with the same results.
> > > > > >
> > > > > > I compiled with multithreading turned on (I haven't tried without
> > > > > > this, but the sample codes I am initially testing are not
> > > > > > multithreaded, although my application is).  I also tried with and
> > > > > > without rdma, with no change.  The script seems to be setting the
> > > > > > build for SMALL_CLUSTER.
> > > > >
> > > > > So you are using make.mvapich2.ofa to compile the package? I am a bit
> > > > > confused about "I also tried with and without rdma, with no change".
> > > > > What exact change did you make here? Also, SMALL_CLUSTER is obsolete
> > > > > for the OFA stack...
> > > > >
> > > > > -- Wei
> > > > >
> > > > > >
> > > > > > Let me know what other information would be useful.
> > > > > >
> > > > > > Thanks,
> > > > > >   Brian
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Jan 4, 2008 6:12 PM, wei huang <huanwei at cse.ohio-state.edu> wrote:
> > > > > >
> > > > > > > Hi Brian,
> > > > > > >
> > > > > > > Thanks for letting us know about this problem. Would you please
> > > > > > > let us know some more details to help us locate the issue?
> > > > > > >
> > > > > > > 1) More details on your platform.
> > > > > > >
> > > > > > > 2) The exact version of mvapich2 you are using. Is it from the
> > > > > > > OFED package, or a version from our website?
> > > > > > >
> > > > > > > 3) If it is from our website, did you change anything in the
> > > > > > > default build scripts?
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > > -- Wei
> > > > > > > > I'm new to the list here... hi!  I have been using OpenMPI for a
> > > > > > > > while, and LAM before that, but new requirements keep pushing me to
> > > > > > > > new implementations.  In particular, I was interested in using
> > > > > > > > InfiniBand (via OFED 1.2.5.1) in a multi-threaded environment.  It
> > > > > > > > seems that MVAPICH is the library for that particular combination :)
> > > > > > > >
> > > > > > > > In any case, I installed MVAPICH, and I can boot the daemons and
> > > > > > > > run the ring speed test with no problems.  When I run any program
> > > > > > > > with mpirun, however, I get an error when sending or receiving more
> > > > > > > > than 8192 bytes.
> > > > > > > >
> > > > > > > > For example, if I run the bandwidth test from the benchmarks page
> > > > > > > > (osu_bw.c), I get the following:
> > > > > > > > ---------------------------------------------------------------
> > > > > > > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> > > > > > > > Thursday 06:16:00
> > > > > > > > burn
> > > > > > > > burn-3
> > > > > > > > # OSU MPI Bandwidth Test v3.0
> > > > > > > > # Size        Bandwidth (MB/s)
> > > > > > > > 1                         1.24
> > > > > > > > 2                         2.72
> > > > > > > > 4                         5.44
> > > > > > > > 8                         10.18
> > > > > > > > 16                       19.09
> > > > > > > > 32                       29.69
> > > > > > > > 64                       65.01
> > > > > > > > 128                     147.31
> > > > > > > > 256                     244.61
> > > > > > > > 512                     354.32
> > > > > > > > 1024                    367.91
> > > > > > > > 2048                     451.96
> > > > > > > > 4096                     550.66
> > > > > > > > 8192                    598.35
> > > > > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv req to send
> > > > > > > > Internal Error: invalid error code ffffffff (Ring Index out of range) in MPIDI_CH3_RndvSend:263
> > > > > > > > Fatal error in MPI_Waitall:
> > > > > > > > Other MPI error, error stack:
> > > > > > > > MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0, status_array=0xdb3140) failed
> > > > > > > > (unknown)(): Other MPI error
> > > > > > > > rank 1 in job 4  burn_37156   caused collective abort of all ranks
> > > > > > > >   exit status of rank 1: killed by signal 9
> > > > > > > > ---------------------------------------------------------------
> > > > > > > >
> > > > > > > > I get a similar problem with the latency test; however, the
> > > > > > > > protocol complained about is different:
> > > > > > > >
> > > > > > > > --------------------------------------------------------------------
> > > > > > > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> > > > > > > > Thursday 09:21:20
> > > > > > > > # OSU MPI Latency Test v3.0
> > > > > > > > # Size            Latency (us)
> > > > > > > > 0                         3.93
> > > > > > > > 1                         4.07
> > > > > > > > 2                         4.06
> > > > > > > > 4                         3.82
> > > > > > > > 8                         3.98
> > > > > > > > 16                         4.03
> > > > > > > > 32                        4.00
> > > > > > > > 64                        4.28
> > > > > > > > 128                       5.22
> > > > > > > > 256                       5.88
> > > > > > > > 512                       8.65
> > > > > > > > 1024                      9.11
> > > > > > > > 2048                     11.53
> > > > > > > > 4096                     16.17
> > > > > > > > 8192                     25.67
> > > > > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from rndv req to send
> > > > > > > > Internal Error: invalid error code ffffffff (Ring Index out of range) in MPIDI_CH3_RndvSend:263
> > > > > > > > Fatal error in MPI_Recv:
> > > > > > > > Other MPI error, error stack:
> > > > > > > > MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, src=0, tag=1, MPI_COMM_WORLD, status=0x7fff14c7bde0) failed
> > > > > > > > (unknown)(): Other MPI error
> > > > > > > > rank 1 in job 5  burn_37156   caused collective abort of all ranks
> > > > > > > >
> > > > > > > > --------------------------------------------------------------------
> > > > > > > >
> > > > > > > > The protocols (0 and 8126589) are consistent if I run the program
> > > > > > > > multiple times.
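
The failure point just above 8 KB, together with the ch3_rndvtransfer.c /
MPIDI_CH3_RndvSend messages, suggests the crash happens where MVAPICH2 switches
from the eager protocol to the rendezvous protocol.  A hypothetical way to check
that, assuming this 1.0.x build honors the runtime parameters documented in the
MVAPICH2 user guide and that the launcher forwards them to all ranks:

  # raise the eager/rendezvous switch point well past 8 KB; if the failure
  # moves with it, the rendezvous path is the culprit
  mpiexec -np 2 -env MV2_IBA_EAGER_THRESHOLD 65536 ./a.out
  # or force the copy-based rendezvous (R3) instead of RDMA put/get
  mpiexec -np 2 -env MV2_RNDV_PROTOCOL R3 ./a.out
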
> > > > > > > >
> > > > > > > > Anyone have any ideas?  If you need more info, please let me know.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >   Brian
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> >
>
>