[mvapich-discuss] Re: Rndv Receiver is receiving less than as expected

Dhabaleswar Panda panda at cse.ohio-state.edu
Tue Jun 29 10:49:54 EDT 2010


Thanks for the update here. The RGET protocol (MV2_RNDV_PROTOCOL=RGET)
lets a process A get data from a process B while B is computing, which
overlaps computation with communication. If you are seeing good
performance with this option, you can use it for this application without
any harm. MVAPICH2 supports both the RPUT and RGET protocols; based on the
computation-communication characteristics of an application, you can use
whichever of the two gives the maximum performance.
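
To illustrate the kind of overlap RGET enables, here is a minimal sketch
(not from this thread; the 4M-float buffer, the compute() placeholder, and
the srun launch line are illustrative assumptions): the sender posts a
large nonblocking send and keeps computing while, under RGET, the receiver
pulls the data via RDMA read without tying up the sender's CPU.

    /* Sketch only. Launch with, e.g.:
     *   MV2_RNDV_PROTOCOL=RGET srun -n 2 ./overlap
     */
    #include <mpi.h>
    #include <stdlib.h>

    /* Placeholder for the application's computation phase. */
    static void compute(void) { /* ... */ }

    int main(int argc, char **argv)
    {
        int rank;
        int n = 4 * 1024 * 1024;  /* 4M floats: well into the rendezvous range */
        float *buf;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = malloc((size_t)n * sizeof(float)); /* contents don't matter here */

        if (rank == 0) {
            MPI_Isend(buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &req);
            compute();            /* overlapped: receiver is pulling the data */
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }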

We are taking a look at the error you reported yesterday. Let us know
whether you see a similar error with the RGET protocol.

You also indicated in yesterday's e-mail that you are using 1.4rc2 (even
though you also see the error in 1.5rc2). 1.4rc2 was released in August
2009, and many feature enhancements and bug fixes have gone in since then.
You can use either the 1.4.1 branch version (with all bug fixes) or update
to the latest 1.5rc2 version.

Thanks,

DK

> For what it's worth, I just set MV2_RNDV_PROTOCOL=RGET and so far the
> performance is on par with, and even a little better than, that of OpenMPI
> with this application. I'll post back once I find out whether it runs to
> completion or crashes.
>
> One question, though: is there any harm in setting this variable
> permanently, particularly in terms of performance?
>
> On Mon, Jun 28, 2010 at 6:11 PM, Aaron Knister <aaron.knister at gmail.com> wrote:
>
> > Hi,
> >
> > I'm running mvapich2-1.4rc2 using SLURM as the PMI and having some
> > difficulties with gromacs-4.0.7. I can't pin down the exact threshold,
> > but at processor counts somewhere above 40 (definitely at 80 and higher)
> > the gromacs application terminates after some time (the amount of time
> > varies slightly between runs) with this error:
> >
> >
> > Warning! Rndv Receiver is receiving (13760 < 24768) less than as expected
> > Fatal error in MPI_Alltoall:
> > Message truncated, error stack:
> > MPI_Alltoall(734)......................: MPI_Alltoall(sbuf=0x1672840,
> > scount=344, MPI_FLOAT, rbuf=0x2aaaad349360, rcount=344, MPI_FLOAT,
> > comm=0xc4000000) failed
> > MPIR_Alltoall(193).....................:
> > MPIDI_CH3U_Post_data_receive_found(445): Message from rank 21 and tag 9
> > truncated; 24768 bytes received but buffer size is 13760
> > Warning! Rndv Receiver is receiving (22016 < 27520) less than as expected
> > Fatal error in MPI_Alltoall:
> > Message truncated, error stack:
> > MPI_Alltoall(734)......................: MPI_Alltoall(sbuf=0x2aaaad3ce4e0,
> > scount=344, MPI_FLOAT, rbuf=0x1e6af900, rcount=344, MPI_FLOAT,
> > comm=0xc4000004) failed
> > MPIR_Alltoall(193).....................:
> > MPIDI_CH3U_Post_data_receive_found(445): Message from rank 17 and tag 9
> > truncated; 27520 bytes received but buffer size is 22016
> >
> > The sizes of the buffers aren't identical each time, but the rank numbers
> > that throw the errors seem to be consistent. The error doesn't occur with
> > OpenMPI, which interestingly runs the code significantly faster than
> > mvapich2, though I don't know why. I've also tried mvapich2-1.5rc2 and the
> > error is still present. Please let me know if you need any additional
> > information from me.
> >
> > Thanks in advance!
> >
> > -Aaron
> >
>


