[mvapich-discuss] (no subject)

Sayantan Sur surs at cse.ohio-state.edu
Tue Mar 21 18:40:41 EST 2006


Hello Troy,

> When running HPL, hpl.dat can contain multiple problem sizes.
> xhpl
> * reads the config file, and runs one problem size (until it completes the  
> problem)
> * After the previous problem is finished, hpl will then start execution on  
> the next problem size.

Thanks for the explanation. I understand what you are saying.

> However, with MVAPICH (and -DLAZY_MEM_UNREGISTER), memory allocation
> follows a pattern more like 70%-71%-141%.  When the hpl process finishes
> one problem size and moves onto the next, no memory is freed -- but it  
> /is/ allocated.  I'd say it's something like a memory leak, because memory  
> is allocated and never freed until the process exits; but I suspect  
> 'memory leak' is not the correct term.  (And it appears that  
> -DLAZY_MEM_UNREGISTER has something to do with the behavior)

You are right, it does have something to do with -DLAZY_MEM_UNREGISTER.
This macro controls the registration cache mechanism of MVAPICH. This
`cache' is used to minimize the cost of registration/deregistration (an
expensive operation for InfiniBand/other HPC interconnects).

In order to implement this `caching' functionality properly, we need to
be able to guarantee that a virtual address (say 0xabc) corresponds to
ONE registration cache entry. Every time the user program uses MPI to
transfer the buffer (0xabc), this entry is consulted to find out
whether the buffer was previously registered. If it was, there is no
need to re-register it.
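
To make the mechanism concrete, here is a minimal sketch of such a cache
in C. This is not MVAPICH's actual code: the names (reg_entry, reg_cache,
get_registration, register_and_cache) are made up for illustration, and
the expensive registration call (e.g. ibv_reg_mr in the verbs API) is
replaced by a stub.

#include <stdlib.h>

/* Hypothetical cache entry: ONE registration per buffer. */
struct reg_entry {
    void             *addr;  /* start of the registered buffer     */
    size_t            len;   /* length of the registered region    */
    unsigned int      rkey;  /* remote key the HCA would hand back */
    struct reg_entry *next;
};

static struct reg_entry *reg_cache;

/* Stub for the expensive registration call the cache tries to avoid. */
static struct reg_entry *register_and_cache(void *addr, size_t len)
{
    struct reg_entry *e = malloc(sizeof *e);
    e->addr = addr;
    e->len  = len;
    e->rkey = 0;             /* would come from the HCA */
    e->next = reg_cache;
    reg_cache = e;
    return e;
}

/* Called on every transfer: reuse a cached registration if the buffer
 * is covered by one, otherwise register it (and pay the cost) once. */
struct reg_entry *get_registration(void *addr, size_t len)
{
    struct reg_entry *e;

    for (e = reg_cache; e != NULL; e = e->next)
        if ((char *)addr >= (char *)e->addr &&
            (char *)addr + len <= (char *)e->addr + e->len)
            return e;        /* cache hit: no re-registration */

    return register_and_cache(addr, len);  /* cache miss */
}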

After the user program is done with buffer 0xabc, it may call `free' to
release this buffer. However, if this memory is freed, a subsequent call
to malloc (and friends) may return the same buffer address (0xabc).
Unfortunately, this virtual address may now map to different physical
memory pages, so an RDMA to those pages may not be reflected in the
"expected" user buffer.
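
The address re-use is easy to see with a toy program (the behavior is
allocator-dependent, so treat this as an illustration, not a guarantee):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    void *a = malloc(65536);
    printf("first  malloc: %p\n", a);
    free(a);                  /* buffer goes back to the allocator */

    void *b = malloc(65536);  /* often the very same virtual address */
    printf("second malloc: %p\n", b);

    /* If b == a, a stale cache entry for 'a' would match 'b', even
     * though the underlying physical pages may have changed. */
    free(b);
    return 0;
}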

There is no way for MPI (at least in userland) to tell whether these
mappings have changed. The solution adopted is instead to instruct
malloc not to return memory to the system. Thus, even if the user
application calls free, the buffer is not really returned (for re-use)
to the system. Hence, the memory utilization (i.e. of the entire
process) can only grow. This results in malloc _always_ returning
unique virtual buffer addresses. The instruction to malloc is given
through mallopt calls (in viainit.c):

#include <malloc.h>

mallopt(M_TRIM_THRESHOLD, -1); /* never trim: keep free'd memory in the heap */
mallopt(M_MMAP_MAX, 0);        /* never mmap: mmap'd blocks are munmap'd on free */
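
Disabling mmap'd allocations matters because glibc hands mmap'd blocks
straight back to the kernel on free (via munmap), which is exactly the
remapping hazard described above; disabling trimming keeps the top of
the heap from being returned to the system as well.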

Really, this is not anything special to MVAPICH; all MPIs which cache
registered buffers need to do it in pretty much the same way. Alternative
solutions involve intercepting malloc/free calls (a sketch follows
below), which is not a very portable approach either (IMHO). If only
InfiniBand memory registration costs were lower ...
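
For completeness, such an interceptor usually looks something like the
following. This is my own illustration, not code from any real MPI;
flush_registration_cache is a made-up hook standing in for the actual
deregistration logic.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>

/* Made-up hook: would deregister and drop cache entries covering ptr. */
static void flush_registration_cache(void *ptr)
{
    (void)ptr;
}

/* Override free(): invalidate the cache before the allocator can hand
 * the same virtual address out again. */
void free(void *ptr)
{
    static void (*real_free)(void *);

    if (!real_free)
        real_free = (void (*)(void *))dlsym(RTLD_NEXT, "free");

    flush_registration_cache(ptr);
    real_free(ptr);
}

One would build this as a shared object (something like `gcc -shared
-fPIC hook.c -ldl -o hook.so') and load it with LD_PRELOAD, which is
precisely why it is not portable: LD_PRELOAD and RTLD_NEXT are ELF/glibc
conventions, and statically linked applications bypass the wrapper
entirely.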

If you have an application which continuously allocates/frees buffers
(like the HPL configuration you describe), then you may be better off just
disabling -DLAZY_MEM_UNREGISTER. If you choose to run HPL as three
separate jobs (instead of one job consisting of three problem sizes), then
you will not face this problem with -DLAZY_MEM_UNREGISTER.

> >Right. As long as you are aware of the performance implications of
> >turning registration cache off, it should be fine. There will be no
> >other side effects
> 
> That I can live with; although I do have one final question:  Can
> LAZY_MEM_UNREGISTER be tuned at run-time, or only at compile-time?
> (i.e. can I set MVAPICH to be less... lazy... at unregistering memory with
> a command-line option to mpirun?)

This is a good point; we haven't had this kind of request before. Thanks
for bringing it up. We will work on it, and support should be available
in our trunk sometime soon.

Thanks,
Sayantan.

> -- 
> Troy Telford

-- 
http://www.cse.ohio-state.edu/~surs

