[mvapich-discuss] Re: unrecognized protocol for send/recv over 8KB
Brian Budge
brian.budge at gmail.com
Fri Jan 4 18:04:33 EST 2008
Hi again -
I noticed this in the benchmark code:
int large_message_size = 8192;
Does MVAPICH internally treat messages larger than 8192 bytes differently
from those under 8 KB? Could something be wrong with how I've configured
InfiniBand? I already had a program running over IB on this system with
OpenMPI, but maybe I need to configure something special for MVAPICH?
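In case it helps to take the OSU benchmarks out of the picture, here is a
minimal repro sketch I put together (repro.c is just my own throwaway name
for it). My assumption, which I have not verified in the MVAPICH source, is
that ~8 KB is where the library switches from an eager protocol to a
rendezvous protocol, which would line up with the ch3_rndvtransfer.c errors
below. It simply sends one message at 8192 bytes and one just over:
---------------------------------------------------------------
/* repro.c -- compile with: mpicc repro.c -o repro
 * Sends one message at the 8192-byte boundary and one just past it,
 * to check whether the failure is purely size-dependent (my guess). */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char buf[16384];
    int sizes[2] = { 8192, 8193 };  /* at and just over the boundary */
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 2; i++) {
        if (rank == 0) {
            /* rank 0 sends sizes[i] bytes to rank 1 */
            memset(buf, 'a', sizes[i]);
            MPI_Send(buf, sizes[i], MPI_CHAR, 1, 1, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* rank 1 receives and reports success */
            MPI_Recv(buf, sizes[i], MPI_CHAR, 0, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received %d bytes OK\n", sizes[i]);
            fflush(stdout);
        }
    }

    MPI_Finalize();
    return 0;
}
---------------------------------------------------------------
Run it the same way as the benchmarks (mpirun -np 2 ./repro). If the
8192-byte send succeeds and the 8193-byte one dies with the same "Unknown
protocol" error, that would point at the large-message path in the library
rather than at the benchmarks themselves.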
Sorry if I appear to be grasping at straws... but I am ;)
Thanks,
Brian
On Jan 3, 2008 5:46 PM, Brian Budge <brian.budge at gmail.com> wrote:
> Hi all -
>
> I'm new to the list here... hi! I have been using OpenMPI for a while,
> and LAM before that, but new requirements keep pushing me to new
> implementations. In particular, I was interested in using InfiniBand
> (with OFED 1.2.5.1) in a multi-threaded environment. It seems that
> MVAPICH is the library for that particular combination :)
>
> In any case, I installed MVAPICH, and I can boot the daemons, and run the
> ring speed test with no problems. When I run any programs with mpirun,
> however, I get an error when sending or receiving more than 8192 bytes.
>
> For example, if I run the bandwidth test from the benchmarks page
> (osu_bw.c), I get the following:
> ---------------------------------------------------------------
> budge at burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> Thursday 06:16:00
> burn
> burn-3
> # OSU MPI Bandwidth Test v3.0
> # Size Bandwidth (MB/s)
> 1 1.24
> 2 2.72
> 4 5.44
> 8 10.18
> 16 19.09
> 32 29.69
> 64 65.01
> 128 147.31
> 256 244.61
> 512 354.32
> 1024 367.91
> 2048 451.96
> 4096 550.66
> 8192 598.35
> [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv req to send
> Internal Error: invalid error code ffffffff (Ring Index out of range) in
> MPIDI_CH3_RndvSend:263
> Fatal error in MPI_Waitall:
> Other MPI error, error stack:
> MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0,
> status_array=0xdb3140) failed
> (unknown)(): Other MPI error
> rank 1 in job 4 burn_37156 caused collective abort of all ranks
> exit status of rank 1: killed by signal 9
> ---------------------------------------------------------------
>
> I get a similar problem with the latency test; however, the protocol
> complained about is different:
> --------------------------------------------------------------------
> budge at burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> Thursday 09:21:20
> # OSU MPI Latency Test v3.0
> # Size Latency (us)
> 0 3.93
> 1 4.07
> 2 4.06
> 4 3.82
> 8 3.98
> 16 4.03
> 32 4.00
> 64 4.28
> 128 5.22
> 256 5.88
> 512 8.65
> 1024 9.11
> 2048 11.53
> 4096 16.17
> 8192 25.67
> [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from rndv req to
> send
> Internal Error: invalid error code ffffffff (Ring Index out of range) in
> MPIDI_CH3_RndvSend:263
> Fatal error in MPI_Recv:
> Other MPI error, error stack:
> MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, src=0, tag=1,
> MPI_COMM_WORLD, status=0x7fff14c7bde0) failed
> (unknown)(): Other MPI error
> rank 1 in job 5 burn_37156 caused collective abort of all ranks
> --------------------------------------------------------------------
>
> The protocol numbers (0 and 8126589) are the same every time I run the
> programs.
>
> Anyone have any ideas? If you need more info, please let me know.
>
> Thanks,
> Brian
>
>