[mvapich-discuss] unrecognized protocol for send/recv over 8KB (fwd)

Matthew Koop koop at cse.ohio-state.edu
Mon Jan 7 20:12:26 EST 2008


Brian,

Can you try the ibv_rc_pingpong program, which is a low-level (non-MPI)
test that ships with OFED? This will make sure that your basic InfiniBand
setup is working properly.
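
For example, something along these lines should work (the hostname is just a
placeholder for whichever node you start the server side on; the binaries
ship with the OFED/libibverbs examples):

  ibv_rc_pingpong             # on the first node (server side)
  ibv_rc_pingpong <hostname>  # on the second node, pointing at the first

If that fails or hangs, the problem is below MPI. Running ibv_devinfo on each
node should also show your HCA with its port in the active state.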

Did any other error messages print out besides the one you gave?

Matt

On Mon, 7 Jan 2008, Brian Budge wrote:

> Hi Matt -
>
> I have now done the install from the ofa build file, and I can boot and run
> the ring test, but now when I run the osu_bw.c benchmark, the executable
> dies in MPI_Init().
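>
> Roughly, the sequence that triggers it is (the host file name is just a
> placeholder, and a.out is osu_bw.c compiled against this install):
>
>   mpdboot -n 2 -f mpd.hosts
>   mpdringtest              # completes fine
>   mpirun -np 2 ./a.out     # dies in MPI_Init()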
>
> The things I altered in make.mvapich2.ofa were:
>
> OPEN_IB_HOME=${OPEN_IB_HOME:-/usr}
> SHARED_LIBS=${SHARED_LIBS:-yes}
>
> and on the configure line I added:
>  --disable-f77 --disable-f90
>
> Here is the error message that I am getting:
>
> rank 1 in job 1  burn_60139   caused collective abort of all ranks
>   exit status of rank 1: killed by signal 9
>
> Thanks,
>   Brian
>
> On Jan 7, 2008 1:21 PM, Matthew Koop <koop at cse.ohio-state.edu> wrote:
>
> > Brian,
> >
> > The make.mvapich2.detect script is just a helper script (not meant to be
> > executed directly). You need to use the make.mvapich2.ofa script, which
> > will call configure and make for you with the correct arguments.
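> >
> > For example, from the top of the MVAPICH2 source tree, something like
> > (the path below is only an example; point OPEN_IB_HOME at your OFED
> > installation, or edit the default near the top of the script):
> >
> >   OPEN_IB_HOME=/path/to/ofed ./make.mvapich2.ofa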
> >
> > More information can be found in our MVAPICH2 user guide under
> > "4.4.1 Build MVAPICH2 with OpenFabrics Gen2-IB and iWARP"
> >
> > https://mvapich.cse.ohio-state.edu/support/
> >
> > Let us know if you have any other problems.
> >
> > Matt
> >
> >
> >
> >
> > On Mon, 7 Jan 2008, Brian Budge wrote:
> >
> > > Hi Wei -
> > >
> > > I changed from SMALL_CLUSTER to MEDIUM_CLUSTER, but it made no difference.
> > >
> > > When I build with rdma, this adds the following:
> > >         export LIBS="${LIBS} -lrdmacm"
> > >         export CFLAGS="${CFLAGS} -DADAPTIVE_RDMA_FAST_PATH -DRDMA_CM"
> > >
> > > It seems that I am using the make.mvapich2.detect script to build.  It asks
> > > me for my interface, and gives me the option for the mellanox interface,
> > > which I choose.
> > >
> > > I just tried a fresh install directly from the tarball instead of using the
> > > gentoo package.  Now the program completes (goes beyond 8K message), but my
> > > bandwidth isn't very good.  Running the osu_bw.c test, I get about 250 MB/s
> > > maximum.  It seems like IB isn't being used.
> > >
> > > I did the following:
> > > ./make.mvapich2.detect #, and chose the mellanox option
> > > ./configure --enable-threads=multiple
> > > make
> > > make install
> > >
> > > So it seems that the package is doing something to enable infiniband that I
> > > am not doing with the tarball.  Conversely, the tarball can run without
> > > crashing.
> > >
> > > Advice?
> > >
> > > Thanks,
> > >   Brian
> > >
> > > On Jan 6, 2008 6:38 AM, wei huang <huanwei at cse.ohio-state.edu> wrote:
> > >
> > > > Hi Brian,
> > > >
> > > > > I am using the openib-mvapich2-1.0.1 package in the gentoo-science overlay,
> > > > > in addition to the standard gentoo packages.  I have also tried 1.0 with the
> > > > > same results.
> > > > >
> > > > > I compiled with multithreading turned on (haven't tried without this, but
> > > > > the sample codes I am initially testing are not multithreaded, although my
> > > > > application is).  I also tried with or without rdma with no change.  The
> > > > > script seems to be setting the build for SMALL_CLUSTER.
> > > >
> > > > So you are using make.mvapich2.ofa to compile the package? I am a bit
> > > > confused about "I also tried with or without rdma with no change". What
> > > > exact change did you make here? Also, SMALL_CLUSTER is obsolete for the
> > > > ofa stack...
> > > >
> > > > -- Wei
> > > >
> > > > >
> > > > > Let me know what other information would be useful.
> > > > >
> > > > > Thanks,
> > > > >   Brian
> > > > >
> > > > >
> > > > >
> > > > > On Jan 4, 2008 6:12 PM, wei huang <huanwei at cse.ohio-state.edu> wrote:
> > > > >
> > > > > > Hi Brian,
> > > > > >
> > > > > > Thanks for letting us know about this problem. Would you please let us
> > > > > > know some more details to help us locate the issue.
> > > > > >
> > > > > > 1) More details on your platform.
> > > > > >
> > > > > > 2) Exact version of mvapich2 you are using. Is it from the OFED package,
> > > > > > or some version from our website?
> > > > > >
> > > > > > 3) If it is from our website, did you change anything from the default
> > > > > > compiling scripts?
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > -- Wei
> > > > > > > I'm new to the list here... hi!  I have been using OpenMPI for a while,
> > > > > > > and LAM before that, but new requirements keep pushing me to new
> > > > > > > implementations.  In particular, I was interested in using infiniband
> > > > > > > (using OFED 1.2.5.1) in a multi-threaded environment.  It seems that
> > > > > > > MVAPICH is the library for that particular combination :)
> > > > > > >
> > > > > > > In any case, I installed MVAPICH, and I can boot the daemons, and run
> > > > > > > the ring speed test with no problems.  When I run any programs with
> > > > > > > mpirun, however, I get an error when sending or receiving more than
> > > > > > > 8192 bytes.
> > > > > > >
> > > > > > > For example, if I run the bandwidth test from the benchmarks page
> > > > > > > (osu_bw.c), I get the following:
> > > > > > > ---------------------------------------------------------------
> > > > > > > budge at burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> > > > > > > Thursday 06:16:00
> > > > > > > burn
> > > > > > > burn-3
> > > > > > > # OSU MPI Bandwidth Test v3.0
> > > > > > > # Size        Bandwidth (MB/s)
> > > > > > > 1                         1.24
> > > > > > > 2                         2.72
> > > > > > > 4                         5.44
> > > > > > > 8                        10.18
> > > > > > > 16                       19.09
> > > > > > > 32                       29.69
> > > > > > > 64                       65.01
> > > > > > > 128                     147.31
> > > > > > > 256                     244.61
> > > > > > > 512                     354.32
> > > > > > > 1024                    367.91
> > > > > > > 2048                    451.96
> > > > > > > 4096                    550.66
> > > > > > > 8192                    598.35
> > > > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv req to send
> > > > > > > Internal Error: invalid error code ffffffff (Ring Index out of range) in
> > > > > > > MPIDI_CH3_RndvSend:263
> > > > > > > Fatal error in MPI_Waitall:
> > > > > > > Other MPI error, error stack:
> > > > > > > MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0,
> > > > > > > status_array=0xdb3140) failed
> > > > > > > (unknown)(): Other MPI error
> > > > > > > rank 1 in job 4  burn_37156   caused collective abort of all ranks
> > > > > > >   exit status of rank 1: killed by signal 9
> > > > > > > ---------------------------------------------------------------
> > > > > > >
> > > > > > > I get a similar problem with the latency test, however, the protocol
> > > > > > > that is complained about is different:
> > > > > > > --------------------------------------------------------------------
> > > > > > > budge at burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> > > > > > > Thursday 09:21:20
> > > > > > > # OSU MPI Latency Test v3.0
> > > > > > > # Size            Latency (us)
> > > > > > > 0                         3.93
> > > > > > > 1                         4.07
> > > > > > > 2                         4.06
> > > > > > > 4                         3.82
> > > > > > > 8                         3.98
> > > > > > > 16                        4.03
> > > > > > > 32                        4.00
> > > > > > > 64                        4.28
> > > > > > > 128                       5.22
> > > > > > > 256                       5.88
> > > > > > > 512                       8.65
> > > > > > > 1024                      9.11
> > > > > > > 2048                     11.53
> > > > > > > 4096                     16.17
> > > > > > > 8192                     25.67
> > > > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from rndv req to send
> > > > > > > Internal Error: invalid error code ffffffff (Ring Index out of range) in
> > > > > > > MPIDI_CH3_RndvSend:263
> > > > > > > Fatal error in MPI_Recv:
> > > > > > > Other MPI error, error stack:
> > > > > > > MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, src=0, tag=1,
> > > > > > > MPI_COMM_WORLD, status=0x7fff14c7bde0) failed
> > > > > > > (unknown)(): Other MPI error
> > > > > > > rank 1 in job 5  burn_37156   caused collective abort of all ranks
> > > > > > >
> > > > > > > --------------------------------------------------------------------
> > > > > > >
> > > > > > > The protocols (0 and 8126589) are consistent if I run the program
> > > > > > > multiple times.
> > > > > > >
> > > > > > > Anyone have any ideas?  If you need more info, please let me know.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >   Brian
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > >
> >
> >
>


