[mvapich-discuss] Infinite loop in ptmalloc

Jonathan Perkins perkinjo at cse.ohio-state.edu
Tue Dec 16 20:26:05 EST 2014


Sure thing.
On Dec 16, 2014 8:10 PM, "Adam T. Moody" <moody20 at llnl.gov> wrote:

> Hi Jonathan,
> I commented out the #if _LIBC check in ptmalloc_unlock_all2(), around line
> 267 in mpid/ch3/channels/common/src/memory/ptmalloc2/arena.c to force the
> following three lines to be compiled into the library:
>
> tsd_setspecific(arena_key, save_arena);
> __malloc_hook = save_malloc_hook;
> __free_hook = save_free_hook;
>
> This seems to work around the problem.
>
> Can you take a closer look at this #if to double-check whether it should
> be there?
> -Adam
>
>
>
> Jonathan Perkins wrote:
>
>> Hi Adam.  Thanks for the additional information.  We've received some
>> reports about problems with our internal ptmalloc2 implementation and
>> Python applications misbehaving when used together.
>>
>> I'm glad that you have a workaround for the time being.  I'll touch
>> base with you again once we have more info on what is happening in this
>> case, after we're able to create a reproducer and have more insight into
>> this problem.
>>
>> On Tue, Dec 16, 2014 at 02:44:39PM -0800, Adam T. Moody wrote:
>>
>>
>>> Hi Jonathan,
>>> I'll look into a reproducer.  Right now, it's not trivial to reproduce.
>>> It's a Python app that uses MPI.  The Python process uses Popen to fork
>>> and exec "cat" to read "/proc/cpuinfo".  It's this child process that
>>> then gets stuck in the infinite recursion loop.  We found that a
>>> workaround is to use Popen to start a shell, which then cats the
>>> file... ugh.
>>>
>>> One thing I can see under TotalView is that ptmalloc_unlock_all2 is
>>> defined and therefore apparently used; however, the code inside the #if
>>> below was not compiled into the library:
>>>
>>> #if defined _LIBC || defined MALLOC_HOOKS
>>> tsd_setspecific(arena_key, save_arena);
>>> __malloc_hook = save_malloc_hook;
>>> __free_hook = save_free_hook;
>>> #endif
>>>
>>> These lines look to be responsible for restoring the original hooks in
>>> the child process.  I'm guessing that's important.  Apparently, neither
>>> _LIBC nor MALLOC_HOOKS is defined.  It looks like this is the only place
>>> MALLOC_HOOKS is referenced in all of the source code, which leads me to
>>> believe it is deprecated.  I'm guessing _LIBC is the critical one here.
>>> Should this macro be defined?
>>> -Adam
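To make the role of those three lines concrete, here is a minimal sketch of fork handlers that swap a hook pointer and restore it afterwards. It assumes the prepare handler installs a fork-time hook in the way stock ptmalloc2 does; every name beginning with demo_ is hypothetical and none of this is the MVAPICH2 source. The flag demo_restore_in_child stands in for the #if: when it is 0, the child keeps the fork-time hook.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef void *(*hook_fn)(size_t);

/* Hypothetical stand-ins for the normal hook and for malloc_atfork(). */
void *demo_normal_hook(size_t sz) { return malloc(sz); }
void *demo_atfork_hook(size_t sz) { (void)sz; return NULL; }

static hook_fn demo_hook = demo_normal_hook;  /* role of __malloc_hook */
static hook_fn demo_saved_hook;               /* role of save_malloc_hook */
static int demo_restore_in_child = 1;         /* 0 mimics the #if being false */

/* prepare: runs in the parent before fork(); installs the fork-time hook. */
static void demo_prepare(void)
{
    demo_saved_hook = demo_hook;
    demo_hook = demo_atfork_hook;
}

/* parent: runs in the parent after fork(); always restores the hook. */
static void demo_parent(void) { demo_hook = demo_saved_hook; }

/* child: runs in the child after fork(); restores only if "compiled in". */
static void demo_child(void)
{
    if (demo_restore_in_child)
        demo_hook = demo_saved_hook;
}

/* Forks once. Returns 1 if the child saw the original hook, 0 if it saw
   the fork-time hook, and -1 on error. */
int demo_child_sees_original_hook(int restore_in_child)
{
    static int registered = 0;
    if (!registered) {
        if (pthread_atfork(demo_prepare, demo_parent, demo_child) != 0)
            return -1;
        registered = 1;
    }
    demo_restore_in_child = restore_in_child;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(demo_hook == demo_normal_hook ? 0 : 1);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

Calling demo_child_sees_original_hook(1) models the library with the restore compiled in; calling it with 0 models the build described above, where the child is left with the fork-time hook still installed.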
>>>
>>>
>>> Jonathan Perkins wrote:
>>>
>>>
>>>
>>>> Hi Adam.  Thanks for the report and debugging info.  We're inspecting
>>>> this code path.  In the meantime, can you provide us with a simple
>>>> reproducer to help us investigate this further?
>>>>
>>>> On Mon, Dec 15, 2014 at 05:53:53PM -0800, Adam T. Moody wrote:
>>>>
>>>>
>>>>
>>>>> Hello MVAPICH team,
>>>>> We have a code using MVAPICH2-1.9 that forks a process whose child
>>>>> then dies after it eventually consumes all available memory.  If I
>>>>> SIGSTOP the child and attach to it before it dies, I can see from its
>>>>> stack trace that it's apparently in an infinite recursion loop
>>>>> consisting of calls to:
>>>>>
>>>>> malloc_atfork()
>>>>> malloc() at mvapich_malloc.c:3403
>>>>>
>>>>> I can see that mvapich_malloc.c:3403 is the last line of the following,
>>>>> which invokes the __malloc_hook function pointer:
>>>>>
>>>>>   __malloc_ptr_t (*hook) __MALLOC_P ((size_t, __const __malloc_ptr_t)) =
>>>>>     __malloc_hook;
>>>>>   if (hook != NULL)
>>>>>     return (*hook)(bytes, RETURN_ADDRESS (0));
>>>>>
>>>>>
>>>>> From the stack trace, I can deduce that __malloc_hook must be pointing
>>>>> to malloc_atfork().
>>>>>
>>>>> Then looking at the malloc_atfork() implementation, I can see that it
>>>>> calls public_mALLOc() in its else clause, which seems like it may be
>>>>> the code path leading to the loop:
>>>>>
>>>>>   } else {
>>>>>     /* Suspend the thread until the `atfork' handlers have completed.
>>>>>        By that time, the hooks will have been reset as well, so that
>>>>>        mALLOc() can be used again. */
>>>>>     (void)mutex_lock(&list_lock);
>>>>>     (void)mutex_unlock(&list_lock);
>>>>>     return public_mALLOc(sz);
>>>>>   }
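The lock/unlock pair in that else clause protects no data of its own; going by the comment, it acts as a wait. If the fork handlers hold list_lock for their whole duration (an assumption based on that comment, not verified here), then acquiring and immediately releasing the lock blocks the caller until the handlers are done. A small pthread sketch of the idiom, with hypothetical demo_ names:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t demo_list_lock = PTHREAD_MUTEX_INITIALIZER;
static int demo_handlers_done = 0;  /* written only while the lock is held */

/* Waiter: lock then unlock, purely to wait for the current lock holder. */
static void *demo_waiter(void *arg)
{
    int *seen = arg;
    pthread_mutex_lock(&demo_list_lock);
    pthread_mutex_unlock(&demo_list_lock);
    *seen = demo_handlers_done;  /* the holder's work is visible by now */
    return NULL;
}

/* Returns what the waiter observed: 1 if it ran after the holder finished,
   -1 if the thread could not be created. */
int demo_waiter_sees_completion(void)
{
    pthread_t tid;
    int seen = -1;

    demo_handlers_done = 0;
    pthread_mutex_lock(&demo_list_lock);    /* "handlers" start */
    if (pthread_create(&tid, NULL, demo_waiter, &seen) != 0) {
        pthread_mutex_unlock(&demo_list_lock);
        return -1;
    }
    demo_handlers_done = 1;                 /* work done under the lock */
    pthread_mutex_unlock(&demo_list_lock);  /* "handlers" finish */

    pthread_join(tid, NULL);
    return seen;
}
```

The waiter cannot get past its lock call until the holder unlocks, so it always observes the completed state. The idiom only helps if the state it waits for (here, the hooks being reset) really is updated before the lock is released.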
>>>>>
>>>>> Do you have ideas how this might happen?  Can you imagine a case that
>>>>> would lead to a loop here?
>>>>>
>>>>> I see a lock followed immediately by an unlock.  Does this lock really
>>>>> protect anything?
>>>>> Thanks,
>>>>> -Adam
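The cycle in the stack trace can be modelled in a few lines. This is only a sketch of the mechanism described above, with hypothetical demo_ names rather than the MVAPICH2 code: the public entry point dispatches through a hook, and the hook calls back into the public entry point. A depth guard, which the real code does not have, is added so that the example terminates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef void *(*hook_fn)(size_t);

static hook_fn demo_hook = NULL;  /* plays the role of __malloc_hook */
static int demo_depth = 0;        /* number of dispatches through the hook */
enum { DEMO_MAX_DEPTH = 8 };      /* illustration-only guard */

void *demo_malloc(size_t sz);

/* Plays the role of malloc_atfork()'s else clause: call the public
   entry point again, on the assumption that the hook has been reset. */
void *demo_malloc_atfork(size_t sz)
{
    return demo_malloc(sz);
}

/* Plays the role of the public malloc(): use the hook if one is set. */
void *demo_malloc(size_t sz)
{
    hook_fn hook = demo_hook;
    if (hook != NULL) {
        if (++demo_depth >= DEMO_MAX_DEPTH)
            return NULL;          /* the real code has no such exit */
        return hook(sz);
    }
    return malloc(sz);
}

/* Allocates once with the given hook installed and reports how many
   times the hook path was taken. */
int demo_depth_with_hook(hook_fn hook)
{
    demo_hook = hook;
    demo_depth = 0;
    void *p = demo_malloc(16);
    free(p);                      /* free(NULL) is a no-op */
    demo_hook = NULL;
    return demo_depth;
}
```

With no hook installed the allocation goes straight to malloc() and the depth stays at 0. With demo_malloc_atfork left installed, every call re-enters the hook until the guard trips, which matches the alternating malloc_atfork()/malloc() frames in the trace.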