[mvapich-discuss] Odd behavior with memory registration / dreg / MVAPICH and MVAPICH2

Matthew Koop koop at cse.ohio-state.edu
Fri Jan 4 14:03:07 EST 2008


Alexei,

ptmalloc2 is being used in our case to provide both enhanced performance
and correctness. To speed up communication we cache registrations of
memory regions (a costly operation) that are used for communication. To
provide correct behavior we need to intercept malloc/free and friends so
stale registrations can be flushed (otherwise the virtual->physical
mapping can change, leading to incorrect results).

Matt

On Fri, 4 Jan 2008, Alexei I. Adamovich wrote:

> Eric,
>
> what is the version of glibc you are using?
>
> I've found the following message on Wolfram Gloger's malloc homepage
> (http://www.malloc.de/en/index.html):
>
> WG> ...
> WG> New ptmalloc2 release Jun 5th, 2006!
> WG>
> WG> Here you can download the current snapshot of ptmalloc2 (C source
> WG> code), the second version of ptmalloc based on Doug Lea's
> WG> malloc-2.7.x. This code has already been included in
> WG> glibc-2.3.x. In multi-thread Applications, ptmalloc2 is currently
> WG> slightly more memory-efficient than ptmalloc3.
> WG>
> WG> ..
>
> So, I guess, using a newer glibc could be a solution.
>
> Please, inform me if you have evaluated this possibility already.
>
> In case you have an RPM-based Linux distribution, you can find
> your current glibc version with the command
>
>  'rpm -qa | grep -i libc'
>
>
> Lei,
>
> am I wrong? Is ptmalloc2 being used only as a thread-safe version of
> malloc, or is there a more substantial reason for shipping the
> ptmalloc2 source code?
>
>  Sincerely,
>
> Alexei I. Adamovich
>
> On Thu, Jan 03, 2008 at 09:24:08AM -0600, Eric A. Borisch wrote:
> > Lei,
> >
> > Thanks for the information. I would suggest that, if this can't be
> > fixed in the vapi version, then the LAZY_MEM_UNREGISTER define should
> > be removed from the default compile options for the versions where it
> > is (apparently) not fully supported.
> >
> > This is a very nasty bug. The MPI layer reports back no errors, but
> > the data isn't actually transferred successfully. In addition, it
> > presents as a timing / waiting error to the user, as all of the local
> > (shared mem) peers transfer data successfully, so significant time can
> > be spent chasing down a suspected user oversight for what is actually
> > an error within the MPI layer.
> >
> > This would apply to the MVAPICH and MVAPICH2, in both the vapi and
> > vapi_multirail makefiles.
> >
> > In addition, it should be documented that the LAZY_MEM_UNREGISTER
> > switch is NOT compatible with vapi-based channels.
> >
> > Thanks,
> >  Eric
> >
> > On Dec 21, 2007 5:29 PM, LEI CHAI <chai.15 at osu.edu> wrote:
> > > Hi Eric,
> > >
> > > Thanks for using mvapich/mvapich2. The problem you reported can be solved by the PTMALLOC feature, which is supported by the gen2 device but not by vapi/vapi_multirail. Not many features have been added to the vapi/vapi_multirail devices over the last few releases because not many people use them. Since you cannot move to gen2, we would suggest disabling LAZY_MEM_UNREGISTER for your tests.
> > >
> > > Thanks,
> > > Lei
> > >
> > >
> > >
> > > ----- Original Message -----
> > > From: "Eric A. Borisch" <eborisch at ieee.org>
> > > Date: Friday, December 21, 2007 10:23 am
> > > Subject: [mvapich-discuss] Odd behavior with memory registration / dreg / MVAPICH and MVAPICH2
> > >
> > > > I seem to be running into a memory registration issue.
> > > >
> > > > Observations:
> > > >
> > > > 1) During some transfers (MPI_Isend / MPI_Irecv / MPI_Waitall)
> > > > into a
> > > > local buffer on the root rank, I receive all of the data from any
> > > > ranks that are running on the same machine, but only part (or none at
> > > > all) of the data from ranks running on external machines. The transfer
> > > > length is above the eager/rendezvous threshold.
> > > > 2) Once the problem occurs, it is persistent. However, if I force
> > > > MVAPICH to re-register by calling "while(dreg_evict())" at this point
> > > > and then re-transfer, the correct data is received. (Same memory being
> > > > transferred from / to.)
> > > > 3) I've only witnessed problems occurring above the 4G (as
> > > > returned by
> > > > malloc()) memory range.
> > > > 4) When I receive partial data from ranks, the data ends on a (4k)
> > > > page boundary. Data past this boundary (which should have been updated) is
> > > > unchanged during the transfer, yet both the sender and receiver report
> > > > no errors. (This is very bad!)
> > > > 5) Stepping through the code on both ends of the transfer shows the
> > > > software agreeing on the (correct) length and location as far down as
> > > > I can follow it.
> > > > 6) Running against a compilation with no -DLAZY_MEM_UNREGISTER shows
> > > > no issues. (Other than the expected performance hit.)
> > > > 7) Occurs on both MVAPICH-1.0-beta (vapi_multirail) and
> > > > mvapich2-1.0 (vapi)
> > > > 8) The user code is also sending data out (from a different buffer)
> > > > over ethernet to a remote gui from the root node.
> > > >
> > > > I can't move to gen2 at this point -- we are using a vendor library
> > > > for interfacing to another system, and this library uses VAPI.
> > > >
> > > > uname -a output:
> > > > Linux rt2.mayo.edu 2.6.9-42.0.2.ELsmp #1 SMP Wed Aug 23 13:38:27 BST
> > > > 2006 x86_64 x86_64 x86_64 GNU/Linux
> > > >
> > > > Intel SE7520JR2 motherboards. 4G physical ram on each node.
> > > >
> > > > It appears (perhaps this is obvious) that the assumption that memory
> > > > registered (by the dreg.c code) remains registered until explicitly
> > > > unregistered (again, by the dreg.c code) is being violated in some
> > > > way. This, however, is wading into uncharted (for me, at least) Linux
> > > > memory management waters. The user code is doing nothing to fiddle
> > > > with registration in any explicit way (with the exception of the
> > > > dreg_evict() workaround mentioned in (2)).
> > > >
> > > > Please let me know what other information I can provide to resolve
> > > > this. I'm still trying to put together a small test program to cause
> > > > the problem, but have been unsuccessful so far.
> > > >
> > > > Thanks,
> > > > Eric
> > > > --
> > > > Eric A. Borisch
> > > > eborisch at ieee.org
> > > > _______________________________________________
> > > > mvapich-discuss mailing list
> > > > mvapich-discuss at cse.ohio-state.edu
> > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > > >
> > >
> > >
> >
> >
> >
> > --
> > Eric A. Borisch
> > eborisch at ieee.org
>
