[mvapich-discuss] Re: unrecognized protocol for send/recv over 8KB

Brian Budge brian.budge at gmail.com
Fri Jan 4 18:04:33 EST 2008


Hi again -

I noticed this in the benchmark code:

int large_message_size = 8192;


Does MVAPICH internally treat messages larger than 8192 bytes differently
from those at or below 8 KB?  Could this be something wrong with how I've
configured InfiniBand?  I already had a program running over IB with OpenMPI
on this system, but maybe I need to configure something special for MVAPICH?
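For what it's worth, here is a minimal send/recv sketch I put together to
bisect the failure size (my own test, not from the benchmark suite; the
sizes straddling 8192 are a guess based on where osu_bw.c dies).  If the
eager-to-rendezvous switch is the trigger, as the "rndv req" in the error
output suggests, the small transfers should succeed and the larger ones
should fail:
---------------------------------------------------------------
/* repro.c: straddle the suspected 8192-byte protocol boundary.
 * Build: mpicc repro.c -o repro
 * Run:   mpirun -np 2 ./repro */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, i;
    char buf[16384];
    /* sizes chosen to land just below, at, and above the suspected
     * threshold (an assumption on my part, not a documented constant) */
    int sizes[] = { 8191, 8192, 8193, 16384 };

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 4; i++) {
        if (rank == 0) {
            memset(buf, 'x', sizes[i]);
            MPI_Send(buf, sizes[i], MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, sizes[i], MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received %d bytes OK\n", sizes[i]);
        }
    }

    MPI_Finalize();
    return 0;
}
---------------------------------------------------------------
(If the MV2_IBA_EAGER_THRESHOLD environment variable applies to this build,
and I'm only going by the MVAPICH2 parameter docs here, then raising it past
the test sizes should move the failure point with it.)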

Sorry if I appear to be grasping at straws... but I am ;)

Thanks,
  Brian

On Jan 3, 2008 5:46 PM, Brian Budge <brian.budge at gmail.com> wrote:

> Hi all -
>
> I'm new to the list here... hi!  I have been using OpenMPI for a while,
> and LAM before that, but new requirements keep pushing me to new
> implementations.  In particular, I'm interested in using InfiniBand (via
> OFED 1.2.5.1) in a multi-threaded environment.  It seems that MVAPICH is
> the library for that particular combination :)
>
> In any case, I installed MVAPICH; I can boot the daemons and run the ring
> speed test with no problems.  When I run any program with mpirun, however,
> I get an error when sending or receiving more than 8192 bytes.
>
> For example, if I run the bandwidth test from the benchmarks page
> (osu_bw.c), I get the following:
> ---------------------------------------------------------------
> budge at burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> Thursday 06:16:00
> burn
> burn-3
> # OSU MPI Bandwidth Test v3.0
> # Size        Bandwidth (MB/s)
> 1                         1.24
> 2                         2.72
> 4                         5.44
> 8                        10.18
> 16                       19.09
> 32                       29.69
> 64                       65.01
> 128                     147.31
> 256                     244.61
> 512                     354.32
> 1024                    367.91
> 2048                    451.96
> 4096                    550.66
> 8192                    598.35
> [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv req to send
> Internal Error: invalid error code ffffffff (Ring Index out of range) in MPIDI_CH3_RndvSend:263
> Fatal error in MPI_Waitall:
> Other MPI error, error stack:
> MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0, status_array=0xdb3140) failed
> (unknown)(): Other MPI error
> rank 1 in job 4  burn_37156   caused collective abort of all ranks
>   exit status of rank 1: killed by signal 9
> ---------------------------------------------------------------
>
> I get a similar problem with the latency test; however, the protocol it
> complains about is different:
> --------------------------------------------------------------------
> budge at burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
> Thursday 09:21:20
> # OSU MPI Latency Test v3.0
> # Size            Latency (us)
> 0                         3.93
> 1                         4.07
> 2                         4.06
> 4                         3.82
> 8                         3.98
> 16                        4.03
> 32                        4.00
> 64                        4.28
> 128                       5.22
> 256                       5.88
> 512                       8.65
> 1024                      9.11
> 2048                     11.53
> 4096                     16.17
> 8192                     25.67
> [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from rndv req to send
> Internal Error: invalid error code ffffffff (Ring Index out of range) in MPIDI_CH3_RndvSend:263
> Fatal error in MPI_Recv:
> Other MPI error, error stack:
> MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, src=0, tag=1, MPI_COMM_WORLD, status=0x7fff14c7bde0) failed
> (unknown)(): Other MPI error
> rank 1 in job 5  burn_37156   caused collective abort of all ranks
> --------------------------------------------------------------------
>
> The reported protocol numbers (0 and 8126589) stay the same when I run
> the programs multiple times.
>
> Anyone have any ideas?  If you need more info, please let me know.
>
> Thanks,
>   Brian
>
>