[mvapich-discuss] Stuck on a free() upon exit

Rustico, Eugenio eugenio.rustico at baw.de
Thu Jan 14 05:53:06 EST 2016


Hello,

Unfortunately it is not possible to update the cluster (its administration is
outsourced, and a migration to another cluster is underway), so I cannot try
newer library versions. If anyone has an idea of what else I could try, please
let me know. Otherwise I will keep terminating the processes manually.

Best regards,
Eugenio Rustico

> khaled hamidouche <hamidouc at cse.ohio-state.edu> wrote on 24 September 2015
> at 23:01:
> 
> 
> Hi Eugenio,
> 
> Please try with the latest MV2-GDR (2.1RC2) and let us know if you face the
> same issue.
> 
> Thanks
> 
> On Thu, Sep 24, 2015 at 8:15 AM, Rustico, Eugenio <eugenio.rustico at baw.de>
> wrote:
> 
> > Hello,
> >
> > I am using mvapich2.1a-gdr in a multi-process CUDA application. Each
> > process allocates host memory (some with new, some with calloc, some with
> > cudaHostAlloc), spawns a pthread (which uses the GPU and ends with
> > pthread_exit), waits for the pthread to finish (via pthread barriers),
> > deallocates the memory (delete, free and cudaFreeHost) and exits; a
> > minimal sketch of this structure follows below.
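> >
> > For illustration only, this is roughly the structure (all names here, such
> > as worker and the buffer names, are made up and not the actual GPUSPH
> > code):
> >
> > #include <cstdlib>
> > #include <pthread.h>
> > #include <cuda_runtime.h>
> > #include <mpi.h>
> >
> > static pthread_barrier_t barrier;
> >
> > static void *worker(void *arg) {
> >     (void)arg;
> >     // GPU work happens here (kernels, cudaMemcpy, ...)
> >     pthread_barrier_wait(&barrier);  // tell the main thread we are done
> >     pthread_exit(NULL);
> > }
> >
> > int main(int argc, char **argv) {
> >     int provided;
> >     MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided);
> >
> >     double *a = new double[1024];                        // new
> >     double *b = (double *)calloc(1024, sizeof(double));  // calloc
> >     double *c;                                           // pinned memory
> >     cudaHostAlloc((void **)&c, 1024 * sizeof(double), cudaHostAllocDefault);
> >
> >     pthread_barrier_init(&barrier, NULL, 2);
> >     pthread_t tid;
> >     pthread_create(&tid, NULL, worker, NULL);
> >     pthread_barrier_wait(&barrier);  // wait for the worker to finish
> >     pthread_join(tid, NULL);
> >
> >     delete[] a;       // <-- the hang happens on the first deallocation
> >     free(b);
> >     cudaFreeHost(c);
> >     pthread_barrier_destroy(&barrier);
> >
> >     MPI_Finalize();
> >     return 0;
> > }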
> >
> > The software behaves correctly (e.g. memory transfers work), but the
> > processes do not exit. I did some debugging: they are all stuck on a
> > deallocation instruction (a delete operator) on a host buffer. It is one
> > of the very last lines of code; the worker threads have already exited and
> > only the main thread remains. There is also another thread, run by the MPI
> > environment (or by the CUDA runtime), which I did not create explicitly.
> > The call stacks at the moment of the hang are:
> >
> > #0  0x00007fcbe2abefe2 in ?? () from
> > /sw/mpi/mvapich/mvapich2.1a-gdr/lib64/libmpi.so.12
> > #1  0x00007fcbe2abf678 in _int_free () from
> > /sw/mpi/mvapich/mvapich2.1a-gdr/lib64/libmpi.so.12
> > #2  0x00007fcbe2ac31fb in free () from
> > /sw/mpi/mvapich/mvapich2.1a-gdr/lib64/libmpi.so.12
> > #3  0x0000000000411591 in GPUSPH::deallocateGlobalHostBuffers() ()
> > #4  0x0000000000411789 in GPUSPH::finalize() ()
> > #5  0x000000000042c71e in main ()
> >
> > #0  0x0000003b64c0e75d in read () from /lib64/libpthread.so.0
> > #1  0x000000301fa0876f in ibv_get_async_event () from
> > /usr/lib64/libibverbs.so.1
> > #2  0x00007f686f80d3c9 in async_thread () from
> > /sw/mpi/mvapich/mvapich2.1a-gdr/lib64/libmpi.so.12
> > #3  0x0000003b64c079d1 in start_thread () from /lib64/libpthread.so.0
> > #4  0x0000003b648e8b6d in clone () from /lib64/libc.so.6
> >
> > The last one looks suspicious to me. What are the read() and the "async"
> > event for?
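> >
> > My guess is that this is the library's InfiniBand async event handler
> > thread: it blocks in read() (inside ibv_get_async_event) until the HCA
> > reports an asynchronous event such as a port state change or a QP error.
> > A rough sketch of such a loop, based on the libibverbs API (the function
> > name is made up):
> >
> > #include <infiniband/verbs.h>
> >
> > static void poll_async_events(struct ibv_context *ctx) {
> >     struct ibv_async_event event;
> >     for (;;) {
> >         // blocks in read() on ctx->async_fd until an event arrives
> >         if (ibv_get_async_event(ctx, &event))
> >             break;
> >         // ... handle event.event_type here ...
> >         ibv_ack_async_event(&event);
> >     }
> > }
> >
> > If that is the case, the read() itself is probably harmless and the thread
> > is simply waiting for events.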
> >
> > Notes:
> > - All asynchronous transfers are currently disabled (I only use MPI_Send)
> > - All processes hang at the same delete (which I believe is the first free
> > performed in the main thread); if I comment it out, they all stop at one
> > of the following deallocations (but not the immediately next one!)
> > - The arrays being deallocated are read-only for the worker thread (which,
> > in any case, terminates before the hang)
> > - The initialization is performed with MPI_Init_thread(NULL, NULL,
> > MPI_THREAD_MULTIPLE, &result) and succeeds (see the check sketched after
> > this list)
> >
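> > For completeness, the check on the granted threading level looks like this
> > (minimal sketch; the MPI thread-level constants are ordered, so a simple
> > comparison works):
> >
> > #include <mpi.h>
> > #include <cstdio>
> >
> > int main(int argc, char **argv) {
> >     int result;
> >     MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &result);
> >     // the library may grant less than requested, so verify it
> >     if (result < MPI_THREAD_MULTIPLE)
> >         fprintf(stderr, "provided level %d < MPI_THREAD_MULTIPLE\n", result);
> >     MPI_Finalize();
> >     return 0;
> > }
> >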
> > Unfortunately, I cannot easily update the MVAPICH libs (I can make a
> > request, but it will take at least 1-2 weeks).
> > Any suggestion would be appreciated.
> >
> > Best regards,
> > Eugenio Rustico

