[mvapich-discuss] (no subject)

Roland Fehrenbacher Roland.Fehrenbacher at transtec.de
Thu Mar 23 04:13:27 EST 2006


>>>>> "Sayantan" == Sayantan Sur <surs at cse.ohio-state.edu> writes:

Hi Sayantan,

    >> When running HPL, hpl.dat can contain multiple problem sizes.
    >> xhpl
    >>   * reads the config file, and runs one problem size (until it
    >>     completes the problem)
    >>   * after the previous problem is finished, hpl will then start
    >>     execution on the next problem size.

    Sayantan> Thanks for the explanation. I understand what you are
    Sayantan> saying.

I'm facing the same problem here with xhpl and mvapich 0.9.7. The
strange thing is that this problem didn't occur with versions up to at
least 0.9.5 (I haven't tested 0.9.6). Is the mechanism mentioned below
a performance optimization that entered the code after 0.9.5? I just
checked, and I had already set -DLAZY_MEM_UNREGISTER in my 0.9.5
version.

Roland


    >> However, with MVAPICH (and -DLAZY_MEM_UNREGISTER), memory
    >> allocation follows a pattern more like 70%-71%-141%.  When the
    >> hpl process finishes one problem size and moves on to the next,
    >> no memory is freed -- but it /is/ allocated.  I'd say it's
    >> something like a memory leak, because memory is allocated and
    >> never freed until the process exits; but I suspect 'memory
    >> leak' is not the correct term.  (And it appears that
    >> -DLAZY_MEM_UNREGISTER has something to do with the behavior.)

    Sayantan> You are right, it does have something to do with
    Sayantan> -DLAZY_MEM_UNREGISTER.  This macro controls the
    Sayantan> registration cache mechanism of MVAPICH. This `cache' is
    Sayantan> used to minimize the cost of registration/deregistration
    Sayantan> (expensive operations for InfiniBand and other HPC
    Sayantan> interconnects).

    Sayantan> In order to implement this `caching' functionality
    Sayantan> properly, we need to be able to guarantee that a virtual
    Sayantan> address (say 0xabc) corresponds to ONE registration
    Sayantan> cache entry. Every time the user program uses MPI to
    Sayantan> transfer the buffer (0xabc), this entry is consulted to
    Sayantan> find out whether this buffer was previously registered
    Sayantan> or not. If it was, there is no need to re-register it.

    Sayantan> After the user program is done with buffer 0xabc, it may
    Sayantan> call `free' to release this buffer. However, if this
    Sayantan> memory is freed, a subsequent call to malloc (and
    Sayantan> friends) may return the same buffer address (0xabc).
    Sayantan> Unfortunately, this virtual address may now map to
    Sayantan> different physical memory pages. RDMA to these pages may
    Sayantan> not be reflected in the "expected" user buffer.
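
A tiny stand-alone program (mine, not from the MVAPICH sources) shows
why an address-keyed cache gets into trouble here: the second malloc
will often return exactly the address that was just freed, even though
the kernel is free to back it with different physical pages:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        void *a = malloc(64 * 1024);    /* heap allocation, e.g. 0xabc */
        printf("first  buffer: %p\n", a);
        free(a);

        void *b = malloc(64 * 1024);    /* frequently the very same    */
        printf("second buffer: %p\n", b);  /* address as 'a'           */
        free(b);
        return 0;
    }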

    Sayantan> There is no way for MPI (at least in userland) to tell
    Sayantan> whether these mappings have changed. The solution
    Sayantan> adopted is instead to instruct malloc not to return
    Sayantan> memory to the system. Thus, even if the user application
    Sayantan> calls free, the buffer is not really returned (for
    Sayantan> re-use) to the system. Hence, the memory utilization
    Sayantan> (i.e. of the entire process) can only grow. This results
    Sayantan> in malloc _always_ returning unique virtual buffer
    Sayantan> addresses. The instruction to malloc is given via the
    Sayantan> mallopt calls (viainit.c):

    Sayantan> mallopt(M_TRIM_THRESHOLD, -1);
    Sayantan> mallopt(M_MMAP_MAX, 0);
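
In a stand-alone C program the same tuning looks roughly like this
(the wrapper function is mine; only the two mallopt calls come from
viainit.c):

    /* Keep freed memory inside the process, so virtual addresses are
     * never handed back to (and remapped by) the kernel. */
    #include <malloc.h>

    static void keep_freed_memory_in_process(void)
    {
        /* Never trim the heap, i.e. never shrink it via sbrk(). */
        mallopt(M_TRIM_THRESHOLD, -1);
        /* Never use mmap() for large allocations, since mmap'ed
         * blocks would be unmapped again on free(). */
        mallopt(M_MMAP_MAX, 0);
    }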

    Sayantan> Really, this is not anything special to MVAPICH; rather,
    Sayantan> all MPIs which do caching of registered buffers need to
    Sayantan> do it in pretty much the same way. Alternative solutions
    Sayantan> involve intercepting malloc/free calls, which is not a
    Sayantan> very portable approach either (IMHO). If only InfiniBand
    Sayantan> memory registration costs were lower ...

    Sayantan> If you have an application which continuously
    Sayantan> allocates/frees buffers (like this HPL config you talk
    Sayantan> about), then you may be better off just disabling
    Sayantan> -DLAZY_MEM_UNREGISTER. If you choose to run HPL in three
    Sayantan> separate jobs (instead of one job consisting of three
    Sayantan> problems), then you will not face this problem with
    Sayantan> -DLAZY_MEM_UNREGISTER.

    >> > Right. As long as you are aware of the performance
    >> > implications of turning registration cache off, it should be
    >> > fine. There will be no other side effects.
    >> 
    >> That I can live with; although I do have one final question:
    >> Can LAZY_MEM_UNREGISTER be tuned at run-time, or only at
    >> compile-time?  (i.e. can I set MVAPICH to be less... lazy... at
    >> unregistering memory with a command-line option to mpirun?)

    Sayantan> This is a good point. We haven't had this kind of
    Sayantan> request before. Thanks for bringing this up. We will
    Sayantan> work on this and support for this should be available
    Sayantan> from our trunk sometime soon.

    Sayantan> Thanks, Sayantan.

    >> -- Troy Telford

    Sayantan> -- http://www.cse.ohio-state.edu/~surs


