[mvapich-discuss] unrecognized protocol for send/recv over 8KB (fwd)

Brian Budge brian.budge at gmail.com
Mon Jan 7 19:15:09 EST 2008


Hi Matt -

I have now done the install from the ofa build file, and I can boot and run
the ring test, but now when I run the osu_bw.c benchmark, the executable
dies in MPI_Init().

The things I altered in make.mvapich2.ofa were:

OPEN_IB_HOME=${OPEN_IB_HOME:-/usr}
SHARED_LIBS=${SHARED_LIBS:-yes}

and on the configure line I added:
 --disable-f77 --disable-f90
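
Incidentally, with the lines in that ${VAR:-default} form, the two values can
also be supplied from the environment at build time instead of being edited in
place; a minimal sketch of an equivalent invocation:

  OPEN_IB_HOME=/usr SHARED_LIBS=yes ./make.mvapich2.ofa

(the --disable-f77 --disable-f90 flags still have to go on the configure line
inside the script).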

Here is the error message that I am getting:

rank 1 in job 1  burn_60139   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9

Thanks,
  Brian

On Jan 7, 2008 1:21 PM, Matthew Koop <koop at cse.ohio-state.edu> wrote:

> Brian,
>
> The make.mvapich2.detect script is just a helper script (not meant to be
> executed directly). You need to use the make.mvapich2.ofa script, which
> will call configure and make for you with the correct arguments.
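>
> In other words, from the top of the MVAPICH2 source tree the build is roughly
> just (a minimal sketch):
>
>    ./make.mvapich2.ofa    # calls configure and make with the OFA/Gen2 settings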
>
> More information can be found in our MVAPICH2 user guide under
> "4.4.1 Build MVAPICH2 with OpenFabrics Gen2-IB and iWARP"
>
> https://mvapich.cse.ohio-state.edu/support/
>
> Let us know if you have any other problems.
>
> Matt
>
>
>
>
> On Mon, 7 Jan 2008, Brian Budge wrote:
>
> > Hi Wei -
> >
> > I changed from SMALL_CLUSTER to MEDIUM_CLUSTER, but it made no difference.
> >
> > When I build with rdma, this adds the following:
> >         export LIBS="${LIBS} -lrdmacm"
> >         export CFLAGS="${CFLAGS} -DADAPTIVE_RDMA_FAST_PATH -DRDMA_CM"
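> >         # (as I understand it, -lrdmacm links the RDMA connection-manager
> >         #  library, and the two defines enable the adaptive RDMA fast path
> >         #  and RDMA CM support)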
> >
> > It seems that I am using the make.mvapich2.detect script to build.  It asks
> > me for my interface, and gives me the option for the mellanox interface,
> > which I choose.
> >
> > I just tried a fresh install directly from the tarball instead of using the
> > gentoo package.  Now the program completes (goes beyond the 8K message size),
> > but my bandwidth isn't very good.  Running the osu_bw.c test, I get about
> > 250 MB/s maximum.  It seems like IB isn't being used.
> >
> > I did the following:
> > ./make.mvapich2.detect    # and chose the mellanox option
> > ./configure --enable-threads=multiple
> > make
> > make install
> >
> > So it seems that the package is doing something to enable InfiniBand that I
> > am not doing with the tarball.  Conversely, the tarball can run without
> > crashing.
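> >
> > One thing I can compare between the two installs (assuming the MPICH2-derived
> > mpich2version utility got installed under each prefix; the path below is just
> > a placeholder) is which device and configure options each build ended up with:
> >
> >     /path/to/install/bin/mpich2version
> >     # the output should list the device and the configure arguments used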
> >
> > Advice?
> >
> > Thanks,
> >   Brian
> >
> > On Jan 6, 2008 6:38 AM, wei huang <huanwei at cse.ohio-state.edu> wrote:
> >
> > > Hi Brian,
> > >
> > > > I am using the openib-mvapich2-1.0.1 package in the gentoo-science overlay
> > > > in addition to the standard gentoo packages.  I have also tried 1.0 with
> > > > the same results.
> > > >
> > > > I compiled with multithreading turned on (haven't tried without this, but
> > > > the sample codes I am initially testing are not multithreaded, although my
> > > > application is).  I also tried with or without rdma with no change.  The
> > > > script seems to be setting the build for SMALL_CLUSTER.
> > >
> > > So you are using make.mvapich2.ofa to compile the package? I am a bit
> > > confused about ''I also tried with or without rdma with no change''. What
> > > exact change did you make here? Also, SMALL_CLUSTER is obsolete for the ofa
> > > stack...
> > >
> > > -- Wei
> > >
> > > >
> > > > Let me know what other information would be useful.
> > > >
> > > > Thanks,
> > > >   Brian
> > > >
> > > >
> > > >
> > > > On Jan 4, 2008 6:12 PM, wei huang <huanwei at cse.ohio-state.edu> wrote:
> > > >
> > > > > Hi Brian,
> > > > >
> > > > > Thanks for letting us know about this problem. Would you please let us
> > > > > know some more details to help us locate the issue:
> > > > >
> > > > > 1) More details on your platform.
> > > > >
> > > > > 2) The exact version of mvapich2 you are using. Is it from the OFED
> > > > > package, or some version from our website?
> > > > >
> > > > > 3) If it is from our website, did you change anything from the default
> > > > > compiling scripts?
> > > > >
> > > > > Thanks.
> > > > >
> > > > > -- Wei
> > > > > > I'm new to the list here... hi!  I have been using OpenMPI for a
> > > > > > while, and LAM before that, but new requirements keep pushing me to
> > > > > > new implementations.  In particular, I was interested in using
> > > > > > InfiniBand (using OFED 1.2.5.1) in a multi-threaded environment.  It
> > > > > > seems that MVAPICH is the library for that particular combination :)
> > > > > >
> > > > > > In any case, I installed MVAPICH, and I can boot the daemons, and run
> > > > > > the ring speed test with no problems.  When I run any programs with
> > > > > > mpirun, however, I get an error when sending or receiving more than
> > > > > > 8192 bytes.
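> > > > > >
> > > > > > (For concreteness, a stripped-down sketch of the kind of exchange that
> > > > > > hits the error; this is not my actual code, just a plain send/receive
> > > > > > of more than 8192 bytes between two ranks:)
> > > > > >
> > > > > >     /* hypothetical minimal example: rank 0 sends 16 KB to rank 1 */
> > > > > >     #include <mpi.h>
> > > > > >     #include <stdlib.h>
> > > > > >
> > > > > >     int main(int argc, char **argv)
> > > > > >     {
> > > > > >         int rank;
> > > > > >         char *buf = malloc(16384);   /* anything past 8192 bytes */
> > > > > >
> > > > > >         MPI_Init(&argc, &argv);
> > > > > >         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> > > > > >         if (rank == 0)
> > > > > >             MPI_Send(buf, 16384, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
> > > > > >         else if (rank == 1)
> > > > > >             MPI_Recv(buf, 16384, MPI_CHAR, 0, 1, MPI_COMM_WORLD,
> > > > > >                      MPI_STATUS_IGNORE);
> > > > > >         MPI_Finalize();
> > > > > >         free(buf);
> > > > > >         return 0;
> > > > > >     }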
> > > > > >
> > > > > > For example, if I run the bandwidth test from the benchmarks page
> > > > > > (osu_bw.c), I get the following:
> > > > > > ---------------------------------------------------------------
> > > > > > budge at burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> > > > > > Thursday 06:16:00
> > > > > > burn
> > > > > > burn-3
> > > > > > # OSU MPI Bandwidth Test v3.0
> > > > > > # Size        Bandwidth (MB/s)
> > > > > > 1                         1.24
> > > > > > 2                         2.72
> > > > > > 4                         5.44
> > > > > > 8                        10.18
> > > > > > 16                       19.09
> > > > > > 32                       29.69
> > > > > > 64                       65.01
> > > > > > 128                     147.31
> > > > > > 256                     244.61
> > > > > > 512                     354.32
> > > > > > 1024                    367.91
> > > > > > 2048                    451.96
> > > > > > 4096                    550.66
> > > > > > 8192                    598.35
> > > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv req to send
> > > > > > Internal Error: invalid error code ffffffff (Ring Index out of range) in MPIDI_CH3_RndvSend:263
> > > > > > Fatal error in MPI_Waitall:
> > > > > > Other MPI error, error stack:
> > > > > > MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0, status_array=0xdb3140) failed
> > > > > > (unknown)(): Other MPI error
> > > > > > rank 1 in job 4  burn_37156   caused collective abort of all ranks
> > > > > >   exit status of rank 1: killed by signal 9
> > > > > > ---------------------------------------------------------------
> > > > > >
> > > > > > I get a similar problem with the latency test; however, the protocol
> > > > > > complained about is different:
> > > > > >
> > > > > > --------------------------------------------------------------------
> > > > > > budge at burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> > > > > > Thursday 09:21:20
> > > > > > # OSU MPI Latency Test v3.0
> > > > > > # Size            Latency (us)
> > > > > > 0                         3.93
> > > > > > 1                         4.07
> > > > > > 2                         4.06
> > > > > > 4                         3.82
> > > > > > 8                         3.98
> > > > > > 16                        4.03
> > > > > > 32                        4.00
> > > > > > 64                        4.28
> > > > > > 128                       5.22
> > > > > > 256                       5.88
> > > > > > 512                       8.65
> > > > > > 1024                      9.11
> > > > > > 2048                     11.53
> > > > > > 4096                     16.17
> > > > > > 8192                     25.67
> > > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from rndv req to send
> > > > > > Internal Error: invalid error code ffffffff (Ring Index out of range) in MPIDI_CH3_RndvSend:263
> > > > > > Fatal error in MPI_Recv:
> > > > > > Other MPI error, error stack:
> > > > > > MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, src=0, tag=1, MPI_COMM_WORLD, status=0x7fff14c7bde0) failed
> > > > > > (unknown)(): Other MPI error
> > > > > > rank 1 in job 5  burn_37156   caused collective abort of all ranks
> > > > > >
> > > > > > --------------------------------------------------------------------
> > > > > >
> > > > > > The protocols (0 and 8126589) are consistent if I run the program
> > > > > > multiple times.
> > > > > >
> > > > > > Anyone have any ideas?  If you need more info, please let me know.
> > > > > >
> > > > > > Thanks,
> > > > > >   Brian
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> >
>
>