[mvapich-discuss] unrecognized protocol for send/recv over 8KB (fwd)
Brian Budge
brian.budge at gmail.com
Fri Jan 4 21:23:58 EST 2008
Hi Wei -
I am running Gentoo Linux on amd64, with 2 or 4 Opteron 8216 CPUs per node. The
kernel is 2.6.23-gentoo-r4 SMP. I have InfiniBand built into the kernel:
CONFIG_INFINIBAND=y
CONFIG_INFINIBAND_USER_MAD=y
CONFIG_INFINIBAND_USER_ACCESS=y
CONFIG_INFINIBAND_USER_MEM=y
CONFIG_INFINIBAND_ADDR_TRANS=y
CONFIG_INFINIBAND_MTHCA=y
CONFIG_INFINIBAND_MTHCA_DEBUG=y
CONFIG_INFINIBAND_AMSO1100=y
CONFIG_MLX4_INFINIBAND=y
CONFIG_INFINIBAND_IPOIB=y
CONFIG_INFINIBAND_IPOIB_DEBUG=y
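For completeness, this is the kind of minimal libibverbs check that can confirm
the HCA is visible from userspace (an illustrative sketch only, not code from my
setup):

/* sketch: list the InfiniBand devices visible through userspace verbs */
/* build (illustrative): gcc list_hca.c -o list_hca -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0, i;
    struct ibv_device **devs = ibv_get_device_list(&num);

    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }
    for (i = 0; i < num; i++)
        printf("device %d: %s\n", i, ibv_get_device_name(devs[i]));
    ibv_free_device_list(devs);
    return 0;
}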
I am using the openib-mvapich2-1.0.1 package from the gentoo-science overlay (in
addition to the standard Gentoo packages). I have also tried 1.0 with the
same results.
I compiled with multithreading turned on (I haven't tried without it; the
sample codes I am initially testing are not multithreaded, although my
application is). I also tried with and without RDMA, with no change. The
build script seems to be configuring the build for SMALL_CLUSTER.
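For reference, by multithreading I mean the usual MPI_THREAD_MULTIPLE style of
initialization; a minimal sketch of that kind of startup (illustrative only,
not the code of the failing tests) is:

/* sketch: request full thread support from the MPI library */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "warning: only thread level %d provided\n", provided);

    /* ... application threads would make MPI calls here ... */

    MPI_Finalize();
    return 0;
}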
Let me know what other information would be useful.
Thanks,
Brian
On Jan 4, 2008 6:12 PM, wei huang <huanwei at cse.ohio-state.edu> wrote:
> Hi Brian,
>
> Thanks for letting us know about this problem. Would you please give us some
> more details to help us locate the issue?
>
> 1) More details on your platform.
>
> 2) Exact version of mvapich2 you are using. Is it from the OFED package, or
> some version from our website?
>
> 3) If it is from our website, did you change anything from the default
> compiling scripts?
>
> Thanks.
>
> -- Wei
> > I'm new to the list here... hi! I have been using OpenMPI for a while,
> > and LAM before that, but new requirements keep pushing me to new
> > implementations. In particular, I was interested in using InfiniBand
> > (using OFED 1.2.5.1) in a multi-threaded environment. It seems that
> > MVAPICH is the library for that particular combination :)
> >
> > In any case, I installed MVAPICH, and I can boot the daemons and run
> > the ring speed test with no problems. When I run any program with
> > mpirun, however, I get an error when sending or receiving more than
> > 8192 bytes.
> >
> > For example, if I run the bandwidth test from the benchmarks page
> > (osu_bw.c), I get the following:
> > ---------------------------------------------------------------
> > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> > Thursday 06:16:00
> > burn
> > burn-3
> > # OSU MPI Bandwidth Test v3.0
> > # Size Bandwidth (MB/s)
> > 1 1.24
> > 2 2.72
> > 4 5.44
> > 8 10.18
> > 16 19.09
> > 32 29.69
> > 64 65.01
> > 128 147.31
> > 256 244.61
> > 512 354.32
> > 1024 367.91
> > 2048 451.96
> > 4096 550.66
> > 8192 598.35
> > [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv req to send
> > Internal Error: invalid error code ffffffff (Ring Index out of range) in
> > MPIDI_CH3_RndvSend:263
> > Fatal error in MPI_Waitall:
> > Other MPI error, error stack:
> > MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0,
> > status_array=0xdb3140) failed
> > (unknown)(): Other MPI error
> > rank 1 in job 4 burn_37156 caused collective abort of all ranks
> > exit status of rank 1: killed by signal 9
> > ---------------------------------------------------------------
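> > (For reference, by "sending or receiving more than 8192 bytes" I mean any
> > single point-to-point transfer over that size; a minimal illustrative
> > sketch, with arbitrary buffer size and tag, would look like this:)
> >
> > /* sketch: one send/recv pair just over the 8192-byte mark */
> > #include <string.h>
> > #include <mpi.h>
> >
> > int main(int argc, char **argv)
> > {
> >     char buf[16384];
> >     int rank;
> >
> >     MPI_Init(&argc, &argv);
> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >     if (rank == 0) {
> >         memset(buf, 'a', sizeof(buf));
> >         MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 1, MPI_COMM_WORLD);
> >     } else if (rank == 1) {
> >         MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 1, MPI_COMM_WORLD,
> >                  MPI_STATUS_IGNORE);
> >     }
> >     MPI_Finalize();
> >     return 0;
> > }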
> >
> > I get a similar problem with the latency test; however, the protocol
> > that is complained about is different:
> > --------------------------------------------------------------------
> > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> > Thursday 09:21:20
> > # OSU MPI Latency Test v3.0
> > # Size Latency (us)
> > 0 3.93
> > 1 4.07
> > 2 4.06
> > 4 3.82
> > 8 3.98
> > 16 4.03
> > 32 4.00
> > 64 4.28
> > 128 5.22
> > 256 5.88
> > 512 8.65
> > 1024 9.11
> > 2048 11.53
> > 4096 16.17
> > 8192 25.67
> > [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from rndv req to send
> > Internal Error: invalid error code ffffffff (Ring Index out of range) in
> > MPIDI_CH3_RndvSend:263
> > Fatal error in MPI_Recv:
> > Other MPI error, error stack:
> > MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, src=0, tag=1,
> > MPI_COMM_WORLD, status=0x7fff14c7bde0) failed
> > (unknown)(): Other MPI error
> > rank 1 in job 5 burn_37156 caused collective abort of all ranks
> > --------------------------------------------------------------------
> >
> > The protocols (0 and 8126589) are consistent if I run the program
> > multiple times.
> >
> > Anyone have any ideas? If you need more info, please let me know.
> >
> > Thanks,
> > Brian
> >
>
>