[mvapich-discuss] unrecognized protocol for send/recv over 8KB (fwd)

wei huang huanwei at cse.ohio-state.edu
Fri Jan 4 21:12:46 EST 2008


Hi Brian,

Thanks for letting us know about this problem. Could you please give us some
more details to help us locate the issue?

1) More details on your platform.

2) The exact version of MVAPICH2 you are using. Is it from the OFED package,
or a version from our website?

3) If it is from our website, did you change anything in the default build
scripts?

Thanks.

-- Wei
> I'm new to the list here... hi!  I have been using Open MPI for a while, and
> LAM before that, but new requirements keep pushing me to new
> implementations.  In particular, I am interested in using InfiniBand (with
> OFED 1.2.5.1) in a multi-threaded environment.  It seems that MVAPICH is the
> library for that particular combination :)
>
> In any case, I installed MVAPICH, and I can boot the daemons and run the
> ring speed test with no problems.  When I run any program with mpirun,
> however, I get an error when sending or receiving more than 8192 bytes.
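>
> A minimal sketch of the kind of send/recv that triggers this is below; the
> 16384-byte size and the file name are only illustrative, since any transfer
> over 8192 bytes fails the same way:
> ---------------------------------------------------------------
> /* min_sendrecv.c: send one message larger than 8192 bytes from
>  * rank 0 to rank 1.
>  * Build: mpicc min_sendrecv.c -o min_sendrecv
>  * Run:   mpirun -np 2 ./min_sendrecv
>  */
> #include <mpi.h>
> #include <stdio.h>
> #include <string.h>
>
> #define MSG_SIZE 16384      /* anything over 8192 bytes */
>
> int main(int argc, char *argv[])
> {
>     int rank;
>     char buf[MSG_SIZE];
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     if (rank == 0) {
>         memset(buf, 'a', MSG_SIZE);
>         MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
>     } else if (rank == 1) {
>         MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 1, MPI_COMM_WORLD,
>                  MPI_STATUS_IGNORE);
>         printf("rank 1 received %d bytes\n", MSG_SIZE);
>     }
>
>     MPI_Finalize();
>     return 0;
> }
> ---------------------------------------------------------------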
>
> For example, if I run the bandwidth test from the benchmarks page
> (osu_bw.c), I get the following:
> ---------------------------------------------------------------
> budge at burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> Thursday 06:16:00
> burn
> burn-3
> # OSU MPI Bandwidth Test v3.0
> # Size        Bandwidth (MB/s)
> 1                         1.24
> 2                         2.72
> 4                         5.44
> 8                        10.18
> 16                       19.09
> 32                       29.69
> 64                       65.01
> 128                     147.31
> 256                     244.61
> 512                     354.32
> 1024                    367.91
> 2048                    451.96
> 4096                    550.66
> 8192                    598.35
> [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv req to send
> Internal Error: invalid error code ffffffff (Ring Index out of range) in
> MPIDI_CH3_RndvSend:263
> Fatal error in MPI_Waitall:
> Other MPI error, error stack:
> MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0,
> status_array=0xdb3140) failed
> (unknown)(): Other MPI error
> rank 1 in job 4  burn_37156   caused collective abort of all ranks
>   exit status of rank 1: killed by signal 9
> ---------------------------------------------------------------
>
> I get a similar problem with the latency test; however, the protocol that is
> complained about is different:
> --------------------------------------------------------------------
> budge at burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> Thursday 09:21:20
> # OSU MPI Latency Test v3.0
> # Size            Latency (us)
> 0                         3.93
> 1                         4.07
> 2                         4.06
> 4                         3.82
> 8                         3.98
> 16                        4.03
> 32                        4.00
> 64                        4.28
> 128                       5.22
> 256                       5.88
> 512                       8.65
> 1024                      9.11
> 2048                     11.53
> 4096                     16.17
> 8192                     25.67
> [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from rndv req to
> send
> Internal Error: invalid error code ffffffff (Ring Index out of range) in
> MPIDI_CH3_RndvSend:263
> Fatal error in MPI_Recv:
> Other MPI error, error stack:
> MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, src=0, tag=1,
> MPI_COMM_WORLD, status=0x7fff14c7bde0) failed
> (unknown)(): Other MPI error
> rank 1 in job 5  burn_37156   caused collective abort of all ranks
> --------------------------------------------------------------------
>
> The protocol numbers (0 and 8126589) are consistent across multiple runs of
> the programs.
>
> Anyone have any ideas?  If you need more info, please let me know.
>
> Thanks,
>   Brian
>


