[mvapich-discuss] Infinite loop in ptmalloc
Adam T. Moody
moody20 at llnl.gov
Tue Dec 16 21:12:45 EST 2014
Hi Jonathan,
I've come up with a simple reproducer in C:
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    pid_t pid = fork();
    if (pid == 0) {
        /* child: realloc followed by malloc is the sequence that
           triggers the recursion */
        void* buf = malloc(2);
        buf = realloc(buf, 4);
        void* buf2 = malloc(6);
        free(buf);
        free(buf2);
        return 0;
    } else {
        /* parent: keep the job alive while the child runs */
        sleep(300);
    }

    MPI_Finalize();
    return 0;
}
You can run this as a single-task MPI job. In my testing, the child
process ends up in infinite recursion, as I described before. There
are potentially two separate bugs:
1) The malloc hooks are not being restored after fork in
ptmalloc_unlock_all2 in the child process due to the "#if defined _LIBC ||
defined MALLOC_HOOKS" guard at arena.c:269.
2) An application call to realloc updates the last arena id via
tsd_setspecific at mvapich_malloc.c:3594. If the app calls realloc
followed by malloc all while the malloc_atfork routine is in place,
MVAPICH enters the recursion loop. It seems realloc is the only wrapper
to call tsd_setspecific, so perhaps it shouldn't or perhaps it needs a
realloc_atfork hook?
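To make the interaction of the two points above easier to check, here is a toy model of the sequence in the reproducer. It is only a sketch and not the MVAPICH code: every toy_* name is made up for illustration, the system malloc stands in for the real allocator, and a depth guard replaces the unbounded recursion so the demo terminates. It also assumes that the "if" branch of malloc_atfork (not quoted in this thread) allocates directly for as long as the arena key still holds a special at-fork marker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_MAX_DEPTH 8                 /* guard so the demo terminates */

typedef void *(*toy_hook_fn)(size_t);

static toy_hook_fn toy_malloc_hook;     /* stands in for __malloc_hook */
static int toy_atfork_marker;           /* address used as the at-fork marker (assumption) */
static void *toy_arena_key;             /* stands in for the arena_key slot */
static int toy_depth;
static int toy_loop_detected;

static void *toy_malloc(size_t n);

/* Hook left installed while fork handling is in progress. */
static void *toy_malloc_atfork(size_t n)
{
    assert(toy_malloc_hook == toy_malloc_atfork); /* only reachable via the hook */
    if (toy_arena_key == &toy_atfork_marker)
        return malloc(n);               /* assumed safe path: allocate directly */
    /* else clause: go back through the public entry point */
    if (++toy_depth >= TOY_MAX_DEPTH) {
        toy_loop_detected = 1;          /* the real code would recurse forever */
        return NULL;
    }
    return toy_malloc(n);
}

/* Public entry point: dispatch through the hook if one is set. */
static void *toy_malloc(size_t n)
{
    toy_hook_fn hook = toy_malloc_hook;
    if (hook != NULL)
        return hook(n);
    return malloc(n);
}

/* realloc has no at-fork hook and always records the last arena. */
static void *toy_realloc(void *p, size_t n)
{
    static int toy_some_arena;
    toy_arena_key = &toy_some_arena;    /* overwrites the at-fork marker */
    return realloc(p, n);
}

/* Replays malloc/realloc/malloc; returns 1 if the loop was hit. */
int toy_replay_reproducer(int restore_hooks_in_child)
{
    /* state left behind by the fork prepare step */
    toy_malloc_hook = toy_malloc_atfork;
    toy_arena_key = &toy_atfork_marker;
    toy_depth = 0;
    toy_loop_detected = 0;

    if (restore_hooks_in_child) {       /* what the guarded lines would do */
        toy_malloc_hook = NULL;
        toy_arena_key = NULL;
    }

    void *buf = toy_malloc(2);
    buf = toy_realloc(buf, 4);
    void *buf2 = toy_malloc(6);         /* loops if the hook is still installed */
    free(buf);
    free(buf2);
    return toy_loop_detected;
}
```

In this model, toy_replay_reproducer(0) hits the loop on the second malloc, while toy_replay_reproducer(1) does not. If the model is faithful, either restoring the hooks in the child or keeping realloc from touching the arena key would avoid the recursion.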
Please double check me on this.
-Adam
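For reference, the save-and-restore pattern that the three guarded lines in ptmalloc_unlock_all2 (quoted below) appear to implement can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: the demo_* names are hypothetical, only one hook is modeled, and the handlers are called by hand around fork() to keep the example self-contained. A real library would typically register such handlers through a mechanism like pthread_atfork.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef void *(*demo_hook_fn)(size_t);

/* Placeholder for the hook installed while fork is in progress. */
static void *demo_atfork_hook(size_t n) { (void)n; return NULL; }

static demo_hook_fn demo_malloc_hook;   /* stands in for __malloc_hook */
static demo_hook_fn demo_saved_hook;    /* stands in for save_malloc_hook */

/* Prepare step: save the current hook and install the at-fork hook. */
static void demo_prepare(void)
{
    demo_saved_hook = demo_malloc_hook;
    demo_malloc_hook = demo_atfork_hook;
}

/* Restore step: meant to run after fork in the parent AND in the child. */
static void demo_restore(void)
{
    demo_malloc_hook = demo_saved_hook;
}

/*
 * Forks once and reports what the child observed:
 *   1  the child saw the original (NULL) hook
 *   0  the child still saw the at-fork hook
 *  -1  fork or wait failed
 */
int demo_child_hook_restored(int restore_in_child)
{
    demo_malloc_hook = NULL;            /* no user hook installed */
    demo_prepare();

    pid_t pid = fork();
    if (pid < 0) {
        demo_restore();
        return -1;
    }
    if (pid == 0) {
        if (restore_in_child)
            demo_restore();             /* the step skipped by the #if guard */
        _exit(demo_malloc_hook == NULL ? 0 : 1);
    }

    demo_restore();                     /* parent side */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

Calling demo_child_hook_restored(0) models a build where the restore is compiled out for the child, so the child inherits the at-fork hook; demo_child_hook_restored(1) models the workaround of forcing the restore to run.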
Jonathan Perkins wrote:
>Sure thing.
>On Dec 16, 2014 8:10 PM, "Adam T. Moody" <moody20 at llnl.gov> wrote:
>
>
>
>>Hi Jonathan,
>>I commented out the #if _LIBC check in ptmalloc_unlock_all2(), around line
>>267 in mpid/ch3/channels/common/src/memory/ptmalloc2/arena.c to force the
>>following three lines to be compiled into the library:
>>
>>tsd_setspecific(arena_key, save_arena);
>>__malloc_hook = save_malloc_hook;
>>__free_hook = save_free_hook;
>>
>>This seems to work around the problem.
>>
>>Can you take a closer look at this #if to double-check whether it should
>>be there?
>>-Adam
>>
>>
>>
>>Jonathan Perkins wrote:
>>
>>>Hi Adam. Thanks for the additional information. We've received some
>>>reports about problems with our internal ptmalloc2 implementation and
>>>python applications misbehaving when used together.
>>>
>>>I'm glad that you have a workaround for the time being. I'll touch
>>>base with you again once we have more info on what is happening in this
>>>case, after we're able to create a reproducer and have more insight into
>>>this problem.
>>>
>>>On Tue, Dec 16, 2014 at 02:44:39PM -0800, Adam T. Moody wrote:
>>>
>>>
>>>
>>>
>>>>Hi Jonathan,
>>>>I'll look into a reproducer. Right now, it's not trivial to reproduce.
>>>>It's a python app that uses MPI. The python process uses Popen to fork and
>>>>exec "cat" to read "/proc/cpuinfo". It's this child process that then gets
>>>>stuck in the infinite recursion loop. We found that a workaround is to use
>>>>Popen to start a shell, which then cats the file... ugh.
>>>>
>>>>One thing I can see under Totalview is that ptmalloc_unlock_all2 is defined
>>>>and therefore apparently used; however, the #if block below was not compiled
>>>>into the library:
>>>>
>>>>#if defined _LIBC || defined MALLOC_HOOKS
>>>>tsd_setspecific(arena_key, save_arena);
>>>>__malloc_hook = save_malloc_hook;
>>>>__free_hook = save_free_hook;
>>>>#endif
>>>>
>>>>These lines look to be responsible for restoring the original hooks in the
>>>>child process. I'm guessing that's important. Apparently, neither _LIBC
>>>>nor MALLOC_HOOKS is defined. It looks like this is the only place
>>>>MALLOC_HOOKS appears in all of the source code, which leads me to believe
>>>>this is deprecated. I'm guessing _LIBC is the critical one here. Should this
>>>>macro be defined?
>>>>-Adam
>>>>
>>>>
>>>>Jonathan Perkins wrote:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>Hi Adam. Thanks for the report and debugging info. We're inspecting
>>>>>this code path. In the meantime, can you provide us with a simple
>>>>>reproducer to help us investigate this further?
>>>>>
>>>>>On Mon, Dec 15, 2014 at 05:53:53PM -0800, Adam T. Moody wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>Hello MVAPICH team,
>>>>>>We have a code using MVAPICH2-1.9 that forks a process whose child then
>>>>>>dies after it eventually consumes all available memory. If I SIGSTOP the
>>>>>>child and attach to it before it dies, I can see from its stack trace that
>>>>>>it's apparently in an infinite recursion loop consisting of calls to:
>>>>>>
>>>>>>malloc_atfork()
>>>>>>malloc() at mvapich_malloc.c:3403
>>>>>>
>>>>>>I can see that mvapich_malloc.c:3403 is the last line of the following,
>>>>>>which invokes the __malloc_hook function pointer:
>>>>>>
>>>>>>__malloc_ptr_t (*hook) __MALLOC_P ((size_t, __const __malloc_ptr_t)) =
>>>>>>__malloc_hook;
>>>>>>if (hook != NULL)
>>>>>>return (*hook)(bytes, RETURN_ADDRESS (0));
>>>>>>
>>>>>>
>>>>>>From the stack trace, I can deduce that __malloc_hook must be pointing to
>>>>>>malloc_atfork().
>>>>>>
>>>>>>Then looking at the malloc_atfork() implementation, I can see that it calls
>>>>>>public_mALLOc() in its else clause, which seems like it may be the code
>>>>>>path leading to the loop:
>>>>>>
>>>>>>} else {
>>>>>>/* Suspend the thread until the `atfork' handlers have completed.
>>>>>> By that time, the hooks will have been reset as well, so that
>>>>>> mALLOc() can be used again. */
>>>>>>(void)mutex_lock(&list_lock);
>>>>>>(void)mutex_unlock(&list_lock);
>>>>>>return public_mALLOc(sz);
>>>>>>}
>>>>>>
>>>>>>Do you have ideas how this might happen? Can you imagine a case that would
>>>>>>lead to a loop here?
>>>>>>
>>>>>>I see a lock followed immediately by an unlock. Does this lock really
>>>>>>protect anything?
>>>>>>Thanks,
>>>>>>-Adam
>>>>>>_______________________________________________
>>>>>>mvapich-discuss mailing list
>>>>>>mvapich-discuss at cse.ohio-state.edu
>>>>>>http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss