[mvapich-discuss] Re: Rndv Receiver is receiving less than as expected

Dhabaleswar Panda panda at cse.ohio-state.edu
Tue Jun 29 13:00:23 EDT 2010


Hi Aaron,

> Using the RGET protocol the application ran to completion and was several
> orders of magnitude faster than RPUT. I'm considering setting
> MV2_RNDV_PROTOCOL to RGET as a system default-- would you advise against
> this?

I think it will be OK. Please note that both RGET and RPUT can be selected
at run-time with the MV2_RNDV_PROTOCOL parameter. Thus, irrespective of
which protocol (RGET/RPUT) you select as the default, you will be able to
switch it for any application at run-time. More details on this parameter
are provided here:

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.5rc2.html#x1-13500011.47
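
For example (just a sketch; the executable name and process count below are
only illustrative), the protocol can be switched per job either on the
mpirun_rsh command line or, since you are launching through SLURM, by
exporting the variable before srun:

    # mpirun_rsh: run-time parameters are listed before the executable
    mpirun_rsh -np 80 -hostfile hosts MV2_RNDV_PROTOCOL=RGET ./mdrun_mpi

    # SLURM: srun propagates the submission environment by default
    export MV2_RNDV_PROTOCOL=RGET
    srun -n 80 ./mdrun_mpi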

For applications with irregular communication patterns, we have seen good
performance benefits with RGET, which is consistent with what you are
observing. However, we have not done an exhaustive performance analysis
across all applications. Please feel free to share your experience with us
once you change the protocol to RGET.
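
On making RGET the system-wide default: one simple approach (again only a
sketch; the profile path is hypothetical and assumes your users pick up a
site-wide bash profile) is to export the variable globally and let any
individual job override it at run-time:

    # /etc/profile.d/mvapich2.sh   (hypothetical site-wide default)
    export MV2_RNDV_PROTOCOL=RGET

    # a single job can still switch back to RPUT at run-time
    MV2_RNDV_PROTOCOL=RPUT srun -n 80 ./mdrun_mpi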

> It's on the to-do list to move to 1.4.1 or 1.5rc2, though it's never an
> easy task to get users to recompile code and move them forward to a new MPI
> version.

I completely understand.

Thanks,

DK

> Thanks for looking into this!
>
> -Aaron
>
> On Tue, Jun 29, 2010 at 10:49 AM, Dhabaleswar Panda
> <panda at cse.ohio-state.edu> wrote:
>
> > Thanks for the update here. The RGET protocol (MV2_RNDV_PROTOCOL=RGET)
> > allows a process A to get data from a process B while B is computing,
> > which overlaps computation and communication. If you are seeing good
> > performance with this option, you can use it for this application without
> > any harm. MVAPICH2 supports both the RPUT and RGET protocols. Based on the
> > computation-communication characteristics of an application, you can use
> > one of these two protocols to obtain the maximum performance.
> >
> > We are taking a look at the error you reported yesterday. Let us know
> > whether or not you see a similar error with the RGET protocol.
> >
> > You also indicated in yesterday's e-mail that you are using 1.4rc2 (even
> > though you also see the error in 1.5rc2). 1.4rc2 was released in August
> > 2009, and many feature enhancements and bug fixes have been made since
> > then. You can use either the 1.4.1 branch version (with all bug fixes) or
> > update to the latest 1.5rc2 version.
> >
> > Thanks,
> >
> > DK
> >
> > > For what it's worth I just set MV2_RNDV_PROTOCOL=RGET and so far the
> > > performance is on par with, and a little better than, that of OpenMPI
> > > with this application. I'll post back once I find out if it runs until
> > > completion or crashes.
> > >
> > > One question though-- is there any harm in setting this variable on a
> > > permanent basis, particularly in terms of performance?
> > >
> > > On Mon, Jun 28, 2010 at 6:11 PM, Aaron Knister
> > > <aaron.knister at gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm running mvapich2-1.4rc2 using SLURM as the PMI and having some
> > > > difficulties with gromacs-4.0.7. I can't find the exact number, but
> > > > with processor counts somewhere above 40 (definitely at 80 and
> > > > higher), the gromacs application terminates after some time (the
> > > > amount of time varies slightly between runs) with this error:
> > > >
> > > >
> > > > Warning! Rndv Receiver is receiving (13760 < 24768) less than as expected
> > > > Fatal error in MPI_Alltoall:
> > > > Message truncated, error stack:
> > > > MPI_Alltoall(734)......................: MPI_Alltoall(sbuf=0x1672840,
> > > > scount=344, MPI_FLOAT, rbuf=0x2aaaad349360, rcount=344, MPI_FLOAT,
> > > > comm=0xc4000000) failed
> > > > MPIR_Alltoall(193).....................:
> > > > MPIDI_CH3U_Post_data_receive_found(445): Message from rank 21 and tag 9
> > > > truncated; 24768 bytes received but buffer size is 13760
> > > > Warning! Rndv Receiver is receiving (22016 < 27520) less than as expected
> > > > Fatal error in MPI_Alltoall:
> > > > Message truncated, error stack:
> > > > MPI_Alltoall(734)......................: MPI_Alltoall(sbuf=0x2aaaad3ce4e0,
> > > > scount=344, MPI_FLOAT, rbuf=0x1e6af900, rcount=344, MPI_FLOAT,
> > > > comm=0xc4000004) failed
> > > > MPIR_Alltoall(193).....................:
> > > > MPIDI_CH3U_Post_data_receive_found(445): Message from rank 17 and tag 9
> > > > truncated; 27520 bytes received but buffer size is 22016
> > > >
> > > > The sizes of the buffers aren't identical each time, but the rank
> > > > numbers that throw the errors seem to be consistent. The error doesn't
> > > > occur with OpenMPI, which interestingly runs the code significantly
> > > > faster than mvapich2, although I don't know why. I've also tried
> > > > mvapich2-1.5rc2 and the error is still present. Please let me know if
> > > > you need any additional information from me.
> > > >
> > > > Thanks in advance!
> > > >
> > > > -Aaron
> > > >
> > >
> >
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >
> >
>
>
> --
> Aaron Knister
> Systems Administrator
> JCET/DoIT
> University of Maryland, Baltimore County
> aaronk at umbc.edu
>


