[mvapich-discuss] unrecognized protocol for send/recv over 8KB (fwd)

Brian Budge brian.budge at gmail.com
Mon Jan 7 12:30:24 EST 2008


Hi Wei -

I changed from SMALL_CLUSTER to MEDIUM_CLUSTER, but it made no difference.

When I build with the rdma option enabled, the build adds the following:
        export LIBS="${LIBS} -lrdmacm"
        export CFLAGS="${CFLAGS} -DADAPTIVE_RDMA_FAST_PATH -DRDMA_CM"
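
(For what it's worth, one way to see whether a given install actually picked up
those additions is to ask the compiler wrapper what it links. This is only a
sketch; it assumes the MPICH2-style mpicc wrapper is on the PATH and that the
extra LIBS end up on its link line, which is my understanding of the standard
MPICH2/autoconf behavior.)

        # MPICH2-derived wrappers can echo the underlying compile/link command:
        mpicc -show                      # an rdma-enabled build should list -lrdmacm here
        mpicc -show | grep -c rdmacm     # prints 1 if -lrdmacm is in the link line, 0 otherwise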

It seems that I am using the make.mvapich2.detect script to build.  It asks me
for my interface and gives me the option for the Mellanox interface, which I
choose.

I just tried a fresh install directly from the tarball instead of using the
Gentoo package.  Now the program completes (it gets past the 8K message size),
but my bandwidth isn't very good.  Running the osu_bw.c test, I get about
250 MB/s maximum.  It seems like IB isn't being used.
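
I suppose two quick sanity checks for whether IB is in the picture at all would
be something like the following (a sketch only; it assumes the usual OFED
userspace tools are installed, and a.out is the binary I compiled from
osu_bw.c):

# Is the HCA port actually up?  The state line should read PORT_ACTIVE.
ibv_devinfo | grep -E 'hca_id|state'
# Is the benchmark binary even linked against the verbs/rdma_cm libraries?
ldd ./a.out | grep -i -e ibverbs -e rdmacm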

To build, I did the following:
./make.mvapich2.detect   # chose the Mellanox option
./configure --enable-threads=multiple
make
make install

So it seems that the package is doing something to enable InfiniBand that I am
not doing with the tarball.  On the other hand, the tarball build runs without
crashing.
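
If it is just a matter of the extra flags, I suppose a by-hand tarball build
that mirrors what the ebuild adds would look roughly like this (a sketch only;
it assumes configure picks up LIBS/CFLAGS from the environment, which is
standard autoconf behavior, and I don't know whether that alone is enough to
switch the IB transport on):

export LIBS="${LIBS} -lrdmacm"                                 # what the ebuild's rdma option adds
export CFLAGS="${CFLAGS} -DADAPTIVE_RDMA_FAST_PATH -DRDMA_CM"
./make.mvapich2.detect   # choose the Mellanox interface again
./configure --enable-threads=multiple
make
make install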

Advice?

Thanks,
  Brian

On Jan 6, 2008 6:38 AM, wei huang <huanwei at cse.ohio-state.edu> wrote:

> Hi Brian,
>
> > I am using the openib-mvapich2-1.0.1 package in the gentoo-science overlay,
> > in addition to the standard Gentoo packages.  I have also tried 1.0 with
> > the same results.
> >
> > I compiled with multithreading turned on (haven't tried without this, but
> > the sample codes I am initially testing are not multithreaded, although my
> > application is).  I also tried with or without rdma with no change.  The
> > script seems to be setting the build for SMALL_CLUSTER.
>
> So you are using make.mvapich2.ofa to compile the package? I am a bit
> confused by "I also tried with or without rdma with no change". What exact
> change did you make here? Also, SMALL_CLUSTER is obsolete for the OFA
> stack...
>
> -- Wei
>
> >
> > Let me know what other information would be useful.
> >
> > Thanks,
> >   Brian
> >
> >
> >
> > On Jan 4, 2008 6:12 PM, wei huang <huanwei at cse.ohio-state.edu> wrote:
> >
> > > Hi Brian,
> > >
> > > Thanks for letting us know about this problem. Would you please let us
> > > know some more details to help us locate the issue?
> > >
> > > 1) More details on your platform.
> > >
> > > 2) The exact version of mvapich2 you are using. Is it from the OFED
> > > package, or some version from our website?
> > >
> > > 3) If it is from our website, did you change anything from the default
> > > compiling scripts?
> > >
> > > Thanks.
> > >
> > > -- Wei
> > > > I'm new to the list here... hi!  I have been using OpenMPI for a while,
> > > > and LAM before that, but new requirements keep pushing me to new
> > > > implementations.  In particular, I was interested in using InfiniBand
> > > > (with OFED 1.2.5.1) in a multi-threaded environment.  It seems that
> > > > MVAPICH is the library for that particular combination :)
> > > >
> > > > In any case, I installed MVAPICH, and I can boot the daemons and run
> > > > the ring speed test with no problems.  When I run any programs with
> > > > mpirun, however, I get an error when sending or receiving more than
> > > > 8192 bytes.
> > > >
> > > > For example, if I run the bandwidth test from the benchmarks page
> > > > (osu_bw.c), I get the following:
> > > > ---------------------------------------------------------------
> > > > budge at burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> > > > Thursday 06:16:00
> > > > burn
> > > > burn-3
> > > > # OSU MPI Bandwidth Test v3.0
> > > > # Size        Bandwidth (MB/s)
> > > > 1                         1.24
> > > > 2                         2.72
> > > > 4                         5.44
> > > > 8                        10.18
> > > > 16                       19.09
> > > > 32                       29.69
> > > > 64                       65.01
> > > > 128                     147.31
> > > > 256                     244.61
> > > > 512                     354.32
> > > > 1024                    367.91
> > > > 2048                    451.96
> > > > 4096                    550.66
> > > > 8192                    598.35
> > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv req to send
> > > > Internal Error: invalid error code ffffffff (Ring Index out of range) in MPIDI_CH3_RndvSend:263
> > > > Fatal error in MPI_Waitall:
> > > > Other MPI error, error stack:
> > > > MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0, status_array=0xdb3140) failed
> > > > (unknown)(): Other MPI error
> > > > rank 1 in job 4  burn_37156   caused collective abort of all ranks
> > > >   exit status of rank 1: killed by signal 9
> > > > ---------------------------------------------------------------
> > > >
> > > > I get a similar problem with the latency test; however, the protocol
> > > > that is complained about is different:
> > > > --------------------------------------------------------------------
> > > > budge at burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> > > > Thursday 09:21:20
> > > > # OSU MPI Latency Test v3.0
> > > > # Size            Latency (us)
> > > > 0                         3.93
> > > > 1                         4.07
> > > > 2                         4.06
> > > > 4                         3.82
> > > > 8                         3.98
> > > > 16                        4.03
> > > > 32                        4.00
> > > > 64                        4.28
> > > > 128                       5.22
> > > > 256                       5.88
> > > > 512                       8.65
> > > > 1024                      9.11
> > > > 2048                     11.53
> > > > 4096                     16.17
> > > > 8192                     25.67
> > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from rndv req to send
> > > > Internal Error: invalid error code ffffffff (Ring Index out of range) in MPIDI_CH3_RndvSend:263
> > > > Fatal error in MPI_Recv:
> > > > Other MPI error, error stack:
> > > > MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, src=0, tag=1, MPI_COMM_WORLD, status=0x7fff14c7bde0) failed
> > > > (unknown)(): Other MPI error
> > > > rank 1 in job 5  burn_37156   caused collective abort of all ranks
> > > > --------------------------------------------------------------------
> > > >
> > > > The protocols (0 and 8126589) are consistent if I run the program
> > > > multiple times.
> > > >
> > > > Anyone have any ideas?  If you need more info, please let me know.
> > > >
> > > > Thanks,
> > > >   Brian
> > > >
> > >
> > >
> >
>
>