[mvapich-discuss] Deadlock while calling malloc?

Jonathan Perkins perkinjo at cse.ohio-state.edu
Tue Nov 10 14:03:55 EST 2015


Hello Martin.  Can you try modifying your program to call MPI_Init_thread
and request MPI_THREAD_FUNNELED?  When running your program, also
set MV2_ENABLE_AFFINITY=0.

I think this may resolve your issue since each thread is actually entering
the MPI library during the malloc calls.
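
For reference, here is a minimal sketch of the suggested initialization (the
error handling and the launch command below are illustrative, not required):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;

        /* Request FUNNELED instead of calling plain MPI_Init: only the
           main thread makes explicit MPI calls, but the other threads
           can still enter the library through the memory hooks. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        if (provided < MPI_THREAD_FUNNELED) {
            fprintf(stderr, "MPI_THREAD_FUNNELED not available\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        /* ... OpenMP-threaded application code ... */

        MPI_Finalize();
        return 0;
    }

Then launch with affinity disabled, e.g. (process count and hostfile are
placeholders):

    mpirun_rsh -np 200 -hostfile hosts MV2_ENABLE_AFFINITY=0 ./your_app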

On Mon, Nov 9, 2015 at 4:27 PM Martin Cuma <martin.cuma at utah.edu> wrote:

> Hello everyone,
>
> I am seeing an occasional deadlock in a code which mallocs some memory in
> an OpenMP threaded region. The MPI code is OpenMP threaded but does not
> communicate from the threads, so MPI is initialized with plain MPI_Init,
> i.e. in MPI_THREAD_SINGLE mode.
>
> Everything seems to run fine until I hit about 200 MPI processes with 4 or
> more threads each. Then the program fairly reliably deadlocks on an MPI
> collective or a barrier and when I investigate the cause, I see one or
> more processes not reaching the barrier. These processes are stuck inside
> a malloc call, e.g. as in the backtrace here:
> Backtrace:
> #0  0x00007fc405360fe6 in find_and_free_dregs_inside ()
>     from /uufs/chpc.utah.edu/sys/installdir/mvapich2/2.1p/lib/libmpi.so.12
> #1  0x00007fc405391555 in mvapich2_mem_unhook ()
>     at ../../../srcdir/mvapich2/2.1/src/mpid/ch3/channels/common/src/memory/mem_hooks.c:148
> #2  0x00007fc40539174d in mvapich2_munmap ()
>     at ../../../srcdir/mvapich2/2.1/src/mpid/ch3/channels/common/src/memory/mem_hooks.c:270
> #3  0x00007fc405395661 in new_heap ()
>     at ../../../srcdir/mvapich2/2.1/src/mpid/ch3/channels/common/src/memory/ptmalloc2/arena.c:542
> #4  0x00007fc405391b80 in _int_new_arena ()
>     at ../../../srcdir/mvapich2/2.1/src/mpid/ch3/channels/common/src/memory/ptmalloc2/arena.c:762
> #5  0x00007fc4053958ff in arena_get2 ()
>     at ../../../srcdir/mvapich2/2.1/src/mpid/ch3/channels/common/src/memory/ptmalloc2/arena.c:717
> #6  0x00007fc405392c26 in malloc ()
>     at ../../../srcdir/mvapich2/2.1/src/mpid/ch3/channels/common/src/memory/ptmalloc2/mvapich_malloc.c:3405
> #7  0x000000000043b470 in fillerp (Ebr=<value has been optimized out>)
>     at ./fillerp.c:39
>
> The typical allocation is for temporary arrays inside the threaded
> region, such as:
>   22 #pragma omp parallel private(iif,iis,ii)
>   23 {
>   24  double _Complex *ne,*nh,*na;
> ...
>   39  ne = (double _Complex *)malloc(sizeof(double _Complex)*3*invd->Nrlmax*irx);
> ...
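>
> Stripped down, the pattern is essentially this (a minimal sketch;
> fillerp_like and the size n are placeholders for the real routine and
> array sizes):
>
>   #include <complex.h>
>   #include <stdlib.h>
>
>   void fillerp_like(int n)
>   {
>   #pragma omp parallel
>     {
>       /* Each thread allocates its own scratch array. With 4+ threads
>          per rank, this malloc is where some processes get stuck. */
>       double _Complex *ne = malloc(sizeof(double _Complex) * n);
>       /* ... fill and use ne ... */
>       free(ne);
>     }
>   }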
>
> The code should be fairly clean (checked with memory checkers such as
> Intel Inspector XE and used on a variety of systems/data sets, though not
> typically at this high a process count). Also, with MPICH2 or Intel MPI I
> am not seeing this deadlock, which makes me suspect an issue with
> MVAPICH2.
>
> Before I dig further, I'd like to ask the forum whether this issue rings
> a bell for anyone. Also, is it possible to modify the allocation
> behavior using environment variables, configure options, etc.? Any other
> thoughts/suggestions?
>
> Thanks,
> MC
>
> --
> Martin Cuma
> Center for High Performance Computing
> Department of Geology and Geophysics
> University of Utah