[mvapich-discuss] Deadlock while calling malloc?

Martin Cuma martin.cuma at utah.edu
Tue Nov 10 15:35:11 EST 2015


Hi Jonathan,

good points. I tried that with "mpirun -genv MV2_ENABLE_AFFINITY 0 -bind-to 
numa -map-by numa", as I am also trying to keep each process's threads on 
one socket. I still get the deadlock. This is with 240 processes, 4 threads 
each. When going down to 2 threads per process it seems to run through, but 
that's probably just due to the decreased chance of the threads stepping on 
each other.

Any other thoughts on this?

BTW, the above mpirun options (inherited from MPICH) seemed to be the 
simplest way to achieve socket affinity with a multithreaded MVAPICH2 
program - any objections to that?
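
For reference, the full launch line is roughly of this form (the executable 
name is just a placeholder, not the actual binary):

    mpirun -np 240 -genv MV2_ENABLE_AFFINITY 0 -genv OMP_NUM_THREADS 4 \
        -bind-to numa -map-by numa ./a.out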

Thanks,
MC

-- 
Martin Cuma
Center for High Performance Computing
Department of Geology and Geophysics
University of Utah

On Tue, 10 Nov 2015, Jonathan Perkins wrote:

> Hello Martin.  Can you try modifying your program to call MPI_Init_thread and
> request MPI_THREAD_FUNNELED?  When running your program, also set MV2_ENABLE_AFFINITY=0.
> I think this may resolve your issue, since each thread is actually entering the MPI library
> during the malloc calls.
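> 
> Something along these lines near the top of main (just a sketch of the init
> change; the error check is only illustrative):
> 
>     #include <mpi.h>
>     #include <stdio.h>
> 
>     int main(int argc, char **argv)
>     {
>         int provided;
>         /* request FUNNELED: threads exist, but only the main thread makes MPI calls */
>         MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
>         if (provided < MPI_THREAD_FUNNELED) {
>             fprintf(stderr, "requested MPI_THREAD_FUNNELED, got %d\n", provided);
>             MPI_Abort(MPI_COMM_WORLD, 1);
>         }
>         /* ... rest of the program unchanged ... */
>         MPI_Finalize();
>         return 0;
>     }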
> 
> On Mon, Nov 9, 2015 at 4:27 PM Martin Cuma <martin.cuma at utah.edu> wrote:
>       Hello everyone,
>
>       I am seeing an occasional deadlock in a code which mallocs some memory in
>       an OpenMP threaded region. The MPI code is OpenMP threaded but does not
>       communicate from the threads, so MPI is initialized with plain MPI_Init,
>       i.e. in MPI_THREAD_SINGLE mode.
>
>       Everything seems to run fine until I hit about 200 MPI processes with 4 or
>       more threads each. Then the program fairly reliably deadlocks on an MPI
>       collective or a barrier, and when I investigate the cause I see one or
>       more processes not reaching the barrier. These processes are stuck inside
>       a malloc call, e.g. as in the backtrace here:
>       Backtrace:
>       #0  0x00007fc405360fe6 in find_and_free_dregs_inside ()
>           from /uufs/chpc.utah.edu/sys/installdir/mvapich2/2.1p/lib/libmpi.so.12
>       #1  0x00007fc405391555 in mvapich2_mem_unhook ()
>           at ../../../srcdir/mvapich2/2.1/src/mpid/ch3/channels/common/src/memory/mem_hooks.c:148
>       #2  0x00007fc40539174d in mvapich2_munmap ()
>           at ../../../srcdir/mvapich2/2.1/src/mpid/ch3/channels/common/src/memory/mem_hooks.c:270
>       #3  0x00007fc405395661 in new_heap ()
>           at ../../../srcdir/mvapich2/2.1/src/mpid/ch3/channels/common/src/memory/ptmalloc2/arena.c:542
>       #4  0x00007fc405391b80 in _int_new_arena ()
>           at ../../../srcdir/mvapich2/2.1/src/mpid/ch3/channels/common/src/memory/ptmalloc2/arena.c:762
>       #5  0x00007fc4053958ff in arena_get2 ()
>           at ../../../srcdir/mvapich2/2.1/src/mpid/ch3/channels/common/src/memory/ptmalloc2/arena.c:717
>       #6  0x00007fc405392c26 in malloc ()
>           at ../../../srcdir/mvapich2/2.1/src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:3405
>       #7  0x000000000043b470 in fillerp (Ebr=value has been optimized out) at ./fillerp.c:39
>
>       The typical allocation is done for temporary arrays inside the threaded
>       region, such as:
>         22 #pragma omp parallel private(iif,iis,ii)
>         23 {
>         24  double _Complex *ne,*nh,*na;
>       ...
>         39  ne = (double _Complex *)malloc(sizeof(double _Complex)*3*invd->Nrlmax*irx);
>       ...
>
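>       In case it's useful, the pattern boils down to roughly this stripped-down
>       sketch (not the actual code; it is compiled with mpicc -fopenmp, and the
>       array size and iteration count here are made up):
>
>       #include <mpi.h>
>       #include <stdlib.h>
>
>       int main(int argc, char **argv)
>       {
>           MPI_Init(&argc, &argv);            /* plain MPI_Init, i.e. MPI_THREAD_SINGLE */
>           for (int it = 0; it < 1000; it++) {
>               #pragma omp parallel
>               {
>                   /* each thread allocates and frees its own scratch array */
>                   double _Complex *ne = malloc(sizeof(double _Complex) * 1000000);
>                   ne[0] = 0;
>                   free(ne);
>               }
>               MPI_Barrier(MPI_COMM_WORLD);   /* the stuck ranks never reach this */
>           }
>           MPI_Finalize();
>           return 0;
>       }
>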
>       The code should be fairly clean (checked with memory checkers such as
>       Intel Inspector XE and used on a variety of systems/data sets, though not
>       typically with this high a process count). Also, with MPICH2 or Intel
>       MPI I am not seeing this deadlock, which makes me suspect an issue with
>       MVAPICH2.
>
>       Before I dig further, I'd like to ask the forum whether this issue rings
>       a bell with anyone. Also, is it possible to modify the allocation
>       behavior using environment variables, configure options, etc.? Any other
>       thoughts/suggestions?
>
>       Thanks,
>       MC
>
>       --
>       Martin Cuma
>       Center for High Performance Computing
>       Department of Geology and Geophysics
>       University of Utah
>       _______________________________________________
>       mvapich-discuss mailing list
>       mvapich-discuss at cse.ohio-state.edu
>       http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

