[mvapich-discuss] unrecognized protocol for send/recv over 8KB (fwd)

Brian Budge brian.budge at gmail.com
Fri Jan 4 21:23:58 EST 2008


Hi Wei -

I am running Gentoo Linux on amd64, with 2 or 4 Opteron 8216 CPUs per node.
The kernel is 2.6.23-gentoo-r4 SMP, and I have InfiniBand built in:

CONFIG_INFINIBAND=y
CONFIG_INFINIBAND_USER_MAD=y
CONFIG_INFINIBAND_USER_ACCESS=y
CONFIG_INFINIBAND_USER_MEM=y
CONFIG_INFINIBAND_ADDR_TRANS=y
CONFIG_INFINIBAND_MTHCA=y
CONFIG_INFINIBAND_MTHCA_DEBUG=y
CONFIG_INFINIBAND_AMSO1100=y
CONFIG_MLX4_INFINIBAND=y
CONFIG_INFINIBAND_IPOIB=y
CONFIG_INFINIBAND_IPOIB_DEBUG=y

I am using the openib-mvapich2-1.0.1 package from the gentoo-science overlay,
in addition to the standard Gentoo packages.  I have also tried 1.0, with the
same results.

I compiled with multithreading turned on (I haven't tried without it, but the
sample codes I am testing first are not multithreaded, although my application
is).  I have also tried with and without RDMA, with no change.  The build
script seems to be configuring for SMALL_CLUSTER.
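
For reference, the "multithreading" I care about on the application side just
means requesting full thread support at init time; a stripped-down sketch (not
my actual code) of what my application does is:

/* Minimal sketch, not my real application: request MPI_THREAD_MULTIPLE the
 * way my multithreaded code does.  The benchmarks that fail below are plain
 * single-threaded MPI programs and do not do this. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        printf("warning: only got thread level %d\n", provided);
    MPI_Finalize();
    return 0;
}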

Let me know what other information would be useful.

Thanks,
  Brian



On Jan 4, 2008 6:12 PM, wei huang <huanwei at cse.ohio-state.edu> wrote:

> Hi Brian,
>
> Thanks for letting us know about this problem. Would you please give us some
> more details to help us locate the issue?
>
> 1) More details on your platform.
>
> 2) The exact version of mvapich2 you are using. Is it from the OFED package,
> or from a version on our website?
>
> 3) If it is from our website, did you change anything from the default
> compiling scripts?
>
> Thanks.
>
> -- Wei
> > I'm new to the list here... hi!  I have been using OpenMPI for a while,
> > and LAM before that, but new requirements keep pushing me to new
> > implementations.  In particular, I was interested in using infiniband
> > (using OFED 1.2.5.1) in a multi-threaded environment.  It seems that
> > MVAPICH is the library for that particular combination :)
> >
> > In any case, I installed MVAPICH, and I can boot the daemons, and run
> > the ring speed test with no problems.  When I run any programs with
> > mpirun, however, I get an error when sending or receiving more than
> > 8192 bytes.
> >
> > For example, if I run the bandwidth test from the benchmarks page
> > (osu_bw.c), I get the following:
> > ---------------------------------------------------------------
> > budge at burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> > Thursday 06:16:00
> > burn
> > burn-3
> > # OSU MPI Bandwidth Test v3.0
> > # Size        Bandwidth (MB/s)
> > 1                         1.24
> > 2                         2.72
> > 4                         5.44
> > 8                        10.18
> > 16                       19.09
> > 32                       29.69
> > 64                       65.01
> > 128                     147.31
> > 256                     244.61
> > 512                     354.32
> > 1024                    367.91
> > 2048                    451.96
> > 4096                    550.66
> > 8192                    598.35
> > [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv req to send
> > Internal Error: invalid error code ffffffff (Ring Index out of range) in
> > MPIDI_CH3_RndvSend:263
> > Fatal error in MPI_Waitall:
> > Other MPI error, error stack:
> > MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0,
> > status_array=0xdb3140) failed
> > (unknown)(): Other MPI error
> > rank 1 in job 4  burn_37156   caused collective abort of all ranks
> >   exit status of rank 1: killed by signal 9
> > ---------------------------------------------------------------
> >
> > I get a similar problem with the latency test; however, the protocol
> > that is complained about is different:
> > --------------------------------------------------------------------
> > budge at burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> > Thursday 09:21:20
> > # OSU MPI Latency Test v3.0
> > # Size            Latency (us)
> > 0                         3.93
> > 1                         4.07
> > 2                         4.06
> > 4                         3.82
> > 8                         3.98
> > 16                        4.03
> > 32                        4.00
> > 64                        4.28
> > 128                       5.22
> > 256                       5.88
> > 512                       8.65
> > 1024                      9.11
> > 2048                     11.53
> > 4096                     16.17
> > 8192                     25.67
> > [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from rndv req to send
> > Internal Error: invalid error code ffffffff (Ring Index out of range) in
> > MPIDI_CH3_RndvSend:263
> > Fatal error in MPI_Recv:
> > Other MPI error, error stack:
> > MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, src=0, tag=1,
> > MPI_COMM_WORLD, status=0x7fff14c7bde0) failed
> > (unknown)(): Other MPI error
> > rank 1 in job 5  burn_37156   caused collective abort of all ranks
> > --------------------------------------------------------------------
> >
> > The protocols (0 and 8126589) are consistent if I run the program
> > multiple times.
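> >
> > For what it's worth, the failure does not seem specific to the benchmark
> > code; a bare send/receive past the 8192-byte point is the kind of thing
> > that dies for me.  A rough sketch (not a program I have run exactly as
> > written; 16384 matches the count in the MPI_Recv error above):
> >
> > #include <mpi.h>
> > #include <stdlib.h>
> > #include <string.h>
> >
> > /* Rough sketch of a point-to-point transfer big enough to leave the
> >  * eager path.  Run with: mpirun -np 2 ./a.out */
> > int main(int argc, char **argv)
> > {
> >     const int n = 16384;
> >     int rank;
> >     char *buf;
> >
> >     MPI_Init(&argc, &argv);
> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >     buf = malloc(n);
> >     memset(buf, rank, n);
> >     if (rank == 0)
> >         MPI_Send(buf, n, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
> >     else if (rank == 1)
> >         MPI_Recv(buf, n, MPI_CHAR, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
> >     free(buf);
> >     MPI_Finalize();
> >     return 0;
> > }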
> >
> > Anyone have any ideas?  If you need more info, please let me know.
> >
> > Thanks,
> >   Brian
> >
>
>