[mvapich-discuss] unrecognized protocol for send/recv over 8KB (fwd)

wei huang huanwei at cse.ohio-state.edu
Sun Jan 6 09:38:20 EST 2008


Hi Brian,

> I am using the openib-mvapich2-1.0.1 package in the gentoo-science overlay
> in addition to the standard gentoo packages.  I have also tried 1.0 with the
> same results.
>
> I compiled with multithreading turned on (haven't tried without this, but
> the sample codes I am initially testing are not multithreaded, although my
> application is).  I also tried with or without rdma with no change.  The
> script seems to be setting the build for SMALL_CLUSTER.

So you are using make.mvapich2.ofa to compile the package? I am a bit
confused about ''I also tried with or without rdma with no change''. What
exact change did you make here? Also, SMALL_CLUSTER is obsolete for the
ofa stack...

-- Wei

>
> Let me know what other information would be useful.
>
> Thanks,
>   Brian
>
>
>
> On Jan 4, 2008 6:12 PM, wei huang <huanwei at cse.ohio-state.edu> wrote:
>
> > Hi Brian,
> >
> > Thanks for letting us know this problem. Would you please let us know some
> > more details to help us locate the issue.
> >
> > 1) More details on your platform.
> >
> > 2) Exact version of mvapich2 you are using. Is it from the OFED package, or
> > some version from our website?
> >
> > 3) If it is from our website, did you change anything from the default
> > compiling scripts?
> >
> > Thanks.
> >
> > -- Wei
> > > I'm new to the list here... hi!  I have been using OpenMPI for a while,
> > > and LAM before that, but new requirements keep pushing me to new
> > > implementations.  In particular, I was interested in using infiniband
> > > (using OFED 1.2.5.1) in a multi-threaded environment.  It seems that
> > > MVAPICH is the library for that particular combination :)
> > >
> > > In any case, I installed MVAPICH, and I can boot the daemons and run the
> > > ring speed test with no problems.  When I run any program with mpirun,
> > > however, I get an error when sending or receiving more than 8192 bytes.
> > >
> > > For example, if I run the bandwidth test from the benchmarks page
> > > (osu_bw.c), I get the following:
> > > ---------------------------------------------------------------
> > > budge at burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> > > Thursday 06:16:00
> > > burn
> > > burn-3
> > > # OSU MPI Bandwidth Test v3.0
> > > # Size        Bandwidth (MB/s)
> > > 1                         1.24
> > > 2                         2.72
> > > 4                         5.44
> > > 8                        10.18
> > > 16                       19.09
> > > 32                       29.69
> > > 64                       65.01
> > > 128                     147.31
> > > 256                     244.61
> > > 512                     354.32
> > > 1024                    367.91
> > > 2048                    451.96
> > > 4096                    550.66
> > > 8192                    598.35
> > > [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv req to send
> > > Internal Error: invalid error code ffffffff (Ring Index out of range) in
> > > MPIDI_CH3_RndvSend:263
> > > Fatal error in MPI_Waitall:
> > > Other MPI error, error stack:
> > > MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0,
> > > status_array=0xdb3140) failed
> > > (unknown)(): Other MPI error
> > > rank 1 in job 4  burn_37156   caused collective abort of all ranks
> > >   exit status of rank 1: killed by signal 9
> > > ---------------------------------------------------------------
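> > >
> > > As far as I can tell this is not specific to the benchmark; any pattern
> > > that pushes a window of non-blocking sends above 8 KB hits it.  A minimal
> > > sketch of that pattern (sizes and window chosen to mirror the
> > > MPI_Waitall(count=64) above; illustrative only, not the actual osu_bw
> > > source):
> > > ---------------------------------------------------------------
> > > #include <mpi.h>
> > > #include <stdlib.h>
> > > #include <string.h>
> > >
> > > /* run with: mpirun -np 2 ./a.out */
> > > #define MSG_SIZE 16384  /* just above the 8192-byte point where it fails */
> > > #define WINDOW   64     /* mirrors the count=64 in the MPI_Waitall error */
> > >
> > > int main(int argc, char **argv)
> > > {
> > >     int rank, i;
> > >     char *buf;
> > >     MPI_Request reqs[WINDOW];
> > >
> > >     MPI_Init(&argc, &argv);
> > >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> > >
> > >     buf = malloc((size_t)WINDOW * MSG_SIZE);
> > >     memset(buf, 'a', (size_t)WINDOW * MSG_SIZE);
> > >
> > >     if (rank == 0) {
> > >         /* post a window of non-blocking sends, then wait on all of them */
> > >         for (i = 0; i < WINDOW; i++)
> > >             MPI_Isend(buf + (size_t)i * MSG_SIZE, MSG_SIZE, MPI_CHAR,
> > >                       1, 100, MPI_COMM_WORLD, &reqs[i]);
> > >         MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
> > >     } else if (rank == 1) {
> > >         /* matching window of non-blocking receives */
> > >         for (i = 0; i < WINDOW; i++)
> > >             MPI_Irecv(buf + (size_t)i * MSG_SIZE, MSG_SIZE, MPI_CHAR,
> > >                       0, 100, MPI_COMM_WORLD, &reqs[i]);
> > >         MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
> > >     }
> > >
> > >     free(buf);
> > >     MPI_Finalize();
> > >     return 0;
> > > }
> > > ---------------------------------------------------------------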
> > >
> > > I get a similar problem with the latency test; however, the protocol
> > > that is complained about is different:
> > > --------------------------------------------------------------------
> > > budge at burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> > > Thursday 09:21:20
> > > # OSU MPI Latency Test v3.0
> > > # Size            Latency (us)
> > > 0                         3.93
> > > 1                         4.07
> > > 2                         4.06
> > > 4                         3.82
> > > 8                         3.98
> > > 16                        4.03
> > > 32                        4.00
> > > 64                        4.28
> > > 128                       5.22
> > > 256                       5.88
> > > 512                       8.65
> > > 1024                      9.11
> > > 2048                     11.53
> > > 4096                     16.17
> > > 8192                     25.67
> > > [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from rndv req to send
> > > Internal Error: invalid error code ffffffff (Ring Index out of range) in
> > > MPIDI_CH3_RndvSend:263
> > > Fatal error in MPI_Recv:
> > > Other MPI error, error stack:
> > > MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, src=0, tag=1,
> > > MPI_COMM_WORLD, status=0x7fff14c7bde0) failed
> > > (unknown)(): Other MPI error
> > > rank 1 in job 5  burn_37156   caused collective abort of all ranks
> > > --------------------------------------------------------------------
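> > >
> > > The pattern behind that MPI_Recv error stack is just a blocking ping-pong
> > > at 16384 bytes (count=16384, MPI_CHAR, tag=1 in the stack above).  A
> > > simplified sketch of it, again illustrative only and not the actual
> > > osu_latency source:
> > > ---------------------------------------------------------------
> > > #include <mpi.h>
> > > #include <stdlib.h>
> > >
> > > /* run with: mpirun -np 2 ./a.out */
> > > #define MSG_SIZE 16384  /* the size where the latency test dies, per the
> > >                            error stack above */
> > >
> > > int main(int argc, char **argv)
> > > {
> > >     int rank;
> > >     char *buf;
> > >     MPI_Status status;
> > >
> > >     MPI_Init(&argc, &argv);
> > >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> > >
> > >     buf = calloc(MSG_SIZE, 1);
> > >
> > >     if (rank == 0) {
> > >         /* blocking ping-pong: send, then wait for the echo */
> > >         MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
> > >         MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 1, MPI_COMM_WORLD, &status);
> > >     } else if (rank == 1) {
> > >         MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &status);
> > >         MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
> > >     }
> > >
> > >     free(buf);
> > >     MPI_Finalize();
> > >     return 0;
> > > }
> > > ---------------------------------------------------------------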
> > >
> > > The protocols (0 and 8126589) are consistent if I run the program
> > > multiple times.
> > >
> > > Anyone have any ideas?  If you need more info, please let me know.
> > >
> > > Thanks,
> > >   Brian
> > >
> >
> >
>


