[mvapich-discuss] Odd behavior with memory registration / dreg / MVAPICH and MVAPICH2

Eric A. Borisch eborisch at ieee.org
Thu Jan 3 10:24:08 EST 2008


Lei,

Thanks for the information. I would suggest that, if this can't be
fixed in the vapi version, then the LAZY_MEM_UNREGISTER define should
be removed from the default compile options for the versions where it
is (apparently) not fully supported.

This is a very nasty bug. The MPI layer reports back no errors, but
the data isn't actually transferred successfully. In addition, it
presents as a timing / waiting error to the user, as all of the local
(shared mem) peers transfer data successfully, so significant time can
be spent chasing down a suspected user oversight for what is actually
an error within the MPI layer.
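
Since MPI itself reports no error, one way to catch this kind of silent
short transfer in a test harness is to poison the receive buffer before
posting the receive and then look for an untouched tail afterwards. The
following is only a minimal sketch under stated assumptions: the helper
names (poison_buffer, untouched_tail_offset, is_page_aligned) and the
sentinel value are hypothetical and not part of MVAPICH, and the 4096-byte
page size is taken from the 4k page boundary observed in the original
report. A payload that legitimately ends in the sentinel byte would show up
as a false positive.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fill value; pick one unlikely to end the real payload. */
#define SENTINEL_BYTE 0xA5u
/* Assumed page size, matching the 4k boundary seen in the report. */
#define PAGE_SIZE_BYTES 4096u

/* Fill the receive buffer with the sentinel before posting the receive. */
static void poison_buffer(void *buf, size_t len)
{
    memset(buf, SENTINEL_BYTE, len);
}

/*
 * After the wait completes, return the offset at which the trailing run
 * of sentinel bytes begins. A return value equal to len means the last
 * byte was overwritten (no untouched tail); 0 means the whole buffer
 * still holds the sentinel.
 */
static size_t untouched_tail_offset(const void *buf, size_t len)
{
    const unsigned char *p = (const unsigned char *)buf;
    size_t i = len;

    while (i > 0 && p[i - 1] == SENTINEL_BYTE) {
        i--;
    }
    return i;
}

/* Report whether an address falls exactly on an assumed 4k page boundary. */
static int is_page_aligned(const void *addr)
{
    return ((uintptr_t)addr % PAGE_SIZE_BYTES) == 0;
}
```

In use, poison_buffer() would be called on each receive buffer before
MPI_Irecv, and after MPI_Waitall the offset from untouched_tail_offset()
would be compared against the expected message length. A shorter offset
whose address (buffer start plus offset) passes is_page_aligned() would
match the symptom described in the original report.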

This would apply to both MVAPICH and MVAPICH2, in the vapi and
vapi_multirail makefiles.

In addition, it should be documented that the LAZY_MEM_UNREGISTER
switch is NOT compatible with vapi-based channels.

Thanks,
 Eric

On Dec 21, 2007 5:29 PM, LEI CHAI <chai.15 at osu.edu> wrote:
> Hi Eric,
>
> Thanks for using mvapich/mvapich2. The problem you reported can be solved by using the PTMALLOC feature, which is supported by the gen2 device but not by vapi/vapi_multirail. Not many features have been added to the vapi/vapi_multirail devices over the last few releases because not many people use them. Since you cannot move to gen2, we would suggest you disable LAZY_MEM_UNREGISTER for your tests.
>
> Thanks,
> Lei
>
>
>
> ----- Original Message -----
> From: "Eric A. Borisch" <eborisch at ieee.org>
> Date: Friday, December 21, 2007 10:23 am
> Subject: [mvapich-discuss] Odd behavior with memory registration / dreg / MVAPICH and MVAPICH2
>
> > I seem to be running into a memory registration issue.
> >
> > Observations:
> >
> > 1) During some transfers (MPI_Isend / MPI_Irecv / MPI_Waitall) into a
> > local buffer on the root rank, I receive all of the data from any
> > ranks that are running on the same machine, but only part (or none at
> > all) of the data from ranks running on external machines. The transfer
> > length is above the eager/rendezvous threshold.
> > 2) Once the problem occurs, it is persistent. However, if I force
> > MVAPICH to re-register by calling "while(dreg_evict())" at this point
> > and then re-transfer, the correct data is received. (Same memory being
> > transferred from / to.)
> > 3) I've only witnessed problems occurring above the 4G (as returned by
> > malloc()) memory range.
> > 4) When I receive partial data from ranks, the data ends on a (4k)
> > page boundary. Data past this boundary (which should have been updated) is
> > unchanged during the transfer, yet both the sender and receiver report
> > no errors. (This is very bad!)
> > 5) Stepping through the code on both ends of the transfer shows the
> > software agreeing on the (correct) length and location as far down as
> > I can follow it.
> > 6) Running against a compilation with no -DLAZY_MEM_UNREGISTER shows
> > no issues. (Other than the expected performance hit.)
> > 7) Occurs on both MVAPICH-1.0-beta (vapi_multirail) and
> > mvapich2-1.0 (vapi).
> > 8) The user code is also sending data out (from a different buffer)
> > over Ethernet to a remote GUI from the root node.
> >
> > I can't move to gen2 at this point -- we are using a vendor library
> > for interfacing to another system, and this library uses VAPI.
> >
> > uname -a output:
> > Linux rt2.mayo.edu 2.6.9-42.0.2.ELsmp #1 SMP Wed Aug 23 13:38:27 BST
> > 2006 x86_64 x86_64 x86_64 GNU/Linux
> >
> > Intel SE7520JR2 motherboards. 4G physical ram on each node.
> >
> > It appears (perhaps this is obvious) that the assumption that memory
> > registered (by the dreg.c code) remains registered until explicitly
> > unregistered (again, by the dreg.c code) is being violated in some
> > way. This, however, is wading into uncharted (for me, at least) Linux
> > memory management waters. The user code is doing nothing to fiddle
> > with registration in any explicit way (with the exception of the
> > dreg_evict() call mentioned in (2)).
> >
> > Please let me know what other information I can provide to resolve
> > this. I'm still trying to put together a small test program to cause
> > the problem, but have been unsuccessful so far.
> >
> > Thanks,
> > Eric
> > --
> > Eric A. Borisch
> > eborisch at ieee.org
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >
>
>



-- 
Eric A. Borisch
eborisch at ieee.org

