[mvapich-discuss] Infinite loop in ptmalloc

Jonathan Perkins perkinjo at cse.ohio-state.edu
Tue Dec 16 18:11:53 EST 2014


Hi Adam.  Thanks for the additional information.  We've received some
reports about problems with our internal ptmalloc2 implementation and
python applications misbehaving when used together.

I'm glad that you have a workaround for the time being.  I'll touch
base with you again once we've been able to create a reproducer and
have more insight into what is happening in this case.

On Tue, Dec 16, 2014 at 02:44:39PM -0800, Adam T. Moody wrote:
> Hi Jonathan,
> I'll look into a reproducer.  Right now, it's not trivial to reproduce.
> It's a python app that uses MPI.  The python process uses Popen to fork and
> exec "cat" to read "/proc/cpuinfo".  It's this child process that then gets
> stuck in the infinite recursion loop.  We found that a workaround is to use
> Popen to start a shell, which then cats the file... ugh.
> 
> One thing I can see under Totalview is that ptmalloc_unlock_all2 is defined
> and therefore apparently used; however, the code inside the #if below was
> not compiled into the library:
> 
> #if defined _LIBC || defined MALLOC_HOOKS
>  tsd_setspecific(arena_key, save_arena);
>  __malloc_hook = save_malloc_hook;
>  __free_hook = save_free_hook;
> #endif
> 
> These lines look to be responsible for restoring the original hooks in the
> child process.  I'm guessing that's important.  Apparently, neither _LIBC
> nor MALLOC_HOOKS is defined.  It looks like this is the only place
> MALLOC_HOOKS is referenced in all of the source code, which leads me to believe
> it is deprecated.  I'm guessing _LIBC is the critical one here.  Should this
> macro be defined?
> -Adam
> 
> 
> Jonathan Perkins wrote:
> 
> >Hi Adam.  Thanks for the report and debugging info.  We're inspecting
> >this code path.  In the meantime, can you provide us with a simple
> >reproducer to help us investigate this further?
> >
> >On Mon, Dec 15, 2014 at 05:53:53PM -0800, Adam T. Moody wrote:
> >
> >>Hello MVAPICH team,
> >>We have a code using MVAPICH2-1.9 that forks a process whose child then dies
> >>after it eventually consumes all available memory.  If I SIGSTOP the child
> >>and attach to it before it dies, I can see from its stack trace that it's
> >>apparently in an infinite recursion loop consisting of calls to:
> >>
> >>malloc_atfork()
> >>malloc() at mvapich_malloc.c:3403
> >>
> >>I can see that mvapich_malloc.c:3403 is the last line of the following,
> >>which invokes the __malloc_hook function pointer:
> >>
> >>__malloc_ptr_t (*hook) __MALLOC_P ((size_t, __const __malloc_ptr_t)) =
> >>  __malloc_hook;
> >>if (hook != NULL)
> >>  return (*hook)(bytes, RETURN_ADDRESS (0));
> >>
> >>From the stack trace, I can deduce that __malloc_hook must be pointing to
> >>malloc_atfork().
> >>
> >>Then looking at the malloc_atfork() implementation, I can see that it calls
> >>public_mALLOc() in its else clause, which seems like it may be the code
> >>path leading to the loop:
> >>
> >>} else {
> >>  /* Suspend the thread until the `atfork' handlers have completed.
> >>     By that time, the hooks will have been reset as well, so that
> >>     mALLOc() can be used again. */
> >>  (void)mutex_lock(&list_lock);
> >>  (void)mutex_unlock(&list_lock);
> >>  return public_mALLOc(sz);
> >>}
> >>
> >>Do you have ideas how this might happen?  Can you imagine a case that would
> >>lead to a loop here?
> >>
> >>I see a lock followed immediately by an unlock.  Does this lock really
> >>protect anything?
> >>Thanks,
> >>-Adam
> >>_______________________________________________
> >>mvapich-discuss mailing list
> >>mvapich-discuss at cse.ohio-state.edu
> >>http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>
> >
> >
> 

-- 
Jonathan Perkins

