[mvapich-discuss] unrecognized protocol for send/recv over 8KB (fwd)

Matthew Koop koop at cse.ohio-state.edu
Mon Jan 7 16:21:24 EST 2008


Brian,

The make.mvapich2.detect script is just a helper script (it is not meant to
be executed directly). You need to use the make.mvapich2.ofa script, which
will call configure and make for you with the correct arguments.
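
For example, from the top of the unpacked source tree the whole build is
roughly (the tarball name and paths here are only illustrative):

    tar xzf mvapich2-1.0.1.tar.gz
    cd mvapich2-1.0.1
    ./make.mvapich2.ofa    # runs configure and make with the Gen2-IB arguments

A plain ./configure, as in the steps you listed, does not pass the device
arguments the script would, which is most likely why your tarball build runs
but only reaches ~250 MB/s.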

More information can be found in our MVAPICH2 user guide under
"4.4.1 Build MVAPICH2 with OpenFabrics Gen2-IB and iWARP"

https://mvapich.cse.ohio-state.edu/support/
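
One quick way to double-check whether the build you end up running actually
has the Gen2/IB device compiled in (a rough check; it assumes the mpicc in
your PATH is the MVAPICH2 compiler wrapper) is to look at the link line it
uses:

    mpicc -show | grep ibverbs

If -libverbs (and -lrdmacm, when RDMA CM is enabled) does not show up there,
that build was compiled without OpenFabrics support, which would be
consistent with the ~250 MB/s numbers you reported.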

Let us know if you have any other problems.

Matt




On Mon, 7 Jan 2008, Brian Budge wrote:

> Hi Wei -
>
> I changed from SMALL_CLUSTER to MEDIUM_CLUSTER, but it made no difference.
>
> When I build with rdma, this adds the following:
>         export LIBS="${LIBS} -lrdmacm"
>         export CFLAGS="${CFLAGS} -DADAPTIVE_RDMA_FAST_PATH -DRDMA_CM"
>
> It seems that I am using the make.mvapich2.detect script to build.  It asks
> me for my interface, and gives me the option for the Mellanox interface,
> which I choose.
>
> I just tried a fresh install directly from the tarball instead of using the
> gentoo package.  Now the program completes (goes beyond 8K message), but my
> bandwidth isn't very good.  Running the osu_bw.c test, I get about 250 MB/s
> maximum.  It seems like IB isn't being used.
>
> I did the following:
> ./make.mvapich2.detect  # and chose the Mellanox option
> ./configure --enable-threads=multiple
> make
> make install
>
> So it seems that the package is doing something to enable InfiniBand that I
> am not doing with the tarball.  Conversely, the tarball can run without
> crashing.
>
> Advice?
>
> Thanks,
>   Brian
>
> On Jan 6, 2008 6:38 AM, wei huang <huanwei at cse.ohio-state.edu> wrote:
>
> > Hi Brian,
> >
> > > I am using the openib-mvapich2-1.0.1 package in the gentoo-science
> > > overlay in addition to the standard gentoo packages.  I have also tried
> > > 1.0 with the same results.
> > >
> > > I compiled with multithreading turned on (haven't tried without this,
> > > but the sample codes I am initially testing are not multithreaded,
> > > although my application is).  I also tried with or without rdma with no
> > > change.  The script seems to be setting the build for SMALL_CLUSTER.
> >
> > So you are using make.mvapich2.ofa to compile the package? I am a bit
> > confused by "I also tried with or without rdma with no change". What
> > exact change did you make here? Also, SMALL_CLUSTER is obsolete for the
> > ofa stack...
> >
> > -- Wei
> >
> > >
> > > Let me know what other information would be useful.
> > >
> > > Thanks,
> > >   Brian
> > >
> > >
> > >
> > > On Jan 4, 2008 6:12 PM, wei huang <huanwei at cse.ohio-state.edu> wrote:
> > >
> > > > Hi Brian,
> > > >
> > > > Thanks for letting us know about this problem. Would you please let
> > > > us know some more details to help us locate the issue.
> > > >
> > > > 1) More details on your platform.
> > > >
> > > > 2) The exact version of mvapich2 you are using. Is it from the OFED
> > > > package or some version from our website?
> > > >
> > > > 3) If it is from our website, did you change anything from the
> > > > default compiling scripts?
> > > >
> > > > Thanks.
> > > >
> > > > -- Wei
> > > > > I'm new to the list here... hi!  I have been using OpenMPI for a
> > > > > while, and LAM before that, but new requirements keep pushing me to
> > > > > new implementations.  In particular, I was interested in using
> > > > > InfiniBand (using OFED 1.2.5.1) in a multi-threaded environment.  It
> > > > > seems that MVAPICH is the library for that particular combination :)
> > > > >
> > > > > In any case, I installed MVAPICH, and I can boot the daemons, and
> > > > > run the ring speed test with no problems.  When I run any programs
> > > > > with mpirun, however, I get an error when sending or receiving more
> > > > > than 8192 bytes.
> > > > >
> > > > > For example, if I run the bandwidth test from the benchmarks page
> > > > > (osu_bw.c), I get the following:
> > > > > ---------------------------------------------------------------
> > > > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> > > > > Thursday 06:16:00
> > > > > burn
> > > > > burn-3
> > > > > # OSU MPI Bandwidth Test v3.0
> > > > > # Size        Bandwidth (MB/s)
> > > > > 1                         1.24
> > > > > 2                         2.72
> > > > > 4                         5.44
> > > > > 8                        10.18
> > > > > 16                       19.09
> > > > > 32                       29.69
> > > > > 64                       65.01
> > > > > 128                     147.31
> > > > > 256                     244.61
> > > > > 512                     354.32
> > > > > 1024                    367.91
> > > > > 2048                    451.96
> > > > > 4096                    550.66
> > > > > 8192                    598.35
> > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv req to send
> > > > > Internal Error: invalid error code ffffffff (Ring Index out of range) in
> > > > > MPIDI_CH3_RndvSend:263
> > > > > Fatal error in MPI_Waitall:
> > > > > Other MPI error, error stack:
> > > > > MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0,
> > > > > status_array=0xdb3140) failed
> > > > > (unknown)(): Other MPI error
> > > > > rank 1 in job 4  burn_37156   caused collective abort of all ranks
> > > > >   exit status of rank 1: killed by signal 9
> > > > > ---------------------------------------------------------------
> > > > >
> > > > > I get a similar problem with the latency test; however, the protocol
> > > > > that is complained about is different:
> > > > > --------------------------------------------------------------------
> > > > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> > > > > Thursday 09:21:20
> > > > > # OSU MPI Latency Test v3.0
> > > > > # Size            Latency (us)
> > > > > 0                         3.93
> > > > > 1                         4.07
> > > > > 2                         4.06
> > > > > 4                         3.82
> > > > > 8                         3.98
> > > > > 16                        4.03
> > > > > 32                        4.00
> > > > > 64                        4.28
> > > > > 128                       5.22
> > > > > 256                       5.88
> > > > > 512                       8.65
> > > > > 1024                      9.11
> > > > > 2048                     11.53
> > > > > 4096                     16.17
> > > > > 8192                     25.67
> > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from rndv req to send
> > > > > Internal Error: invalid error code ffffffff (Ring Index out of range) in
> > > > > MPIDI_CH3_RndvSend:263
> > > > > Fatal error in MPI_Recv:
> > > > > Other MPI error, error stack:
> > > > > MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, src=0, tag=1,
> > > > > MPI_COMM_WORLD, status=0x7fff14c7bde0) failed
> > > > > (unknown)(): Other MPI error
> > > > > rank 1 in job 5  burn_37156   caused collective abort of all ranks
> > > > > --------------------------------------------------------------------
> > > > >
> > > > > The protocols (0 and 8126589) are consistent if I run the program
> > > > > multiple times.
> > > > >
> > > > > Anyone have any ideas?  If you need more info, please let me know.
> > > > >
> > > > > Thanks,
> > > > >   Brian
> > > > >
> > > >
> > > >
> > >
> >
> >
>


