[mvapich-discuss] (no subject)

khaled hamidouche hamidouc at cse.ohio-state.edu
Thu Sep 24 17:01:08 EDT 2015


Hi Eugenio,

Please try the latest MVAPICH2-GDR release (2.1RC2) and let us know if you
still see the same issue.

Thanks

On Thu, Sep 24, 2015 at 8:15 AM, Rustico, Eugenio <eugenio.rustico at baw.de>
wrote:

> Hello,
>
> I am using mvapich2.1a-gdr in a multi-process application based on CUDA. The
> process allocates host memory (some with new, some with calloc, some with
> cudaHostAlloc), starts a pthread (which uses the GPU and ends with
> pthread_exit), waits for the pthread to finish (using pthread barriers),
> deallocates the memory (delete, free and cudaFreeHost) and exits.
>
> The software behaves correctly (e.g. memory transfers work) but the processes
> do not terminate. I did some debugging and they are all stuck on a
> deallocation (a delete operator) of a host buffer. It is one of the very last
> lines of code; the threads have already exited and only the main thread
> remains. There is another thread, started by the MPI environment (or by the
> CUDA runtime), which I did not create explicitly. The call stacks at the
> moment of the hang are:
>
> #0  0x00007fcbe2abefe2 in ?? () from /sw/mpi/mvapich/mvapich2.1a-gdr/lib64/libmpi.so.12
> #1  0x00007fcbe2abf678 in _int_free () from /sw/mpi/mvapich/mvapich2.1a-gdr/lib64/libmpi.so.12
> #2  0x00007fcbe2ac31fb in free () from /sw/mpi/mvapich/mvapich2.1a-gdr/lib64/libmpi.so.12
> #3  0x0000000000411591 in GPUSPH::deallocateGlobalHostBuffers() ()
> #4  0x0000000000411789 in GPUSPH::finalize() ()
> #5  0x000000000042c71e in main ()
>
> #0  0x0000003b64c0e75d in read () from /lib64/libpthread.so.0
> #1  0x000000301fa0876f in ibv_get_async_event () from /usr/lib64/libibverbs.so.1
> #2  0x00007f686f80d3c9 in async_thread () from /sw/mpi/mvapich/mvapich2.1a-gdr/lib64/libmpi.so.12
> #3  0x0000003b64c079d1 in start_thread () from /lib64/libpthread.so.0
> #4  0x0000003b648e8b6d in clone () from /lib64/libc.so.6
>
> The last one looks suspicious to me. What are the read and the "async" event
> for?
>
> Notes:
> - All asynchronous transfers are currently disabled (I only use MPI_Send)
> - All processes hang on the same delete (which I believe is the first free
>   performed in the main thread), and if I comment it out, they all stop at
>   one of the following ones (but not the immediately following one!)
> - The arrays being deallocated are read-only for the thread (which in any
>   case terminates before the hang)
> - The initialization is performed with MPI_Init_thread(NULL, NULL,
>   MPI_THREAD_MULTIPLE, &result), which is successful
>
> Unfortunately, I cannot easily update the MVAPICH libraries (I can make a
> request, but it will take at least 1-2 weeks).
> Any suggestions would be appreciated.
>
> Best regards,
> Eugenio Rustico
>
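
For reference, the process lifecycle described in the quoted message can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses plain pthreads with no MPI or CUDA (so it does not go through the free() provided by libmpi and is not expected to reproduce the hang), calloc stands in for cudaHostAlloc, and every name in it (WorkerArgs, worker, run_lifecycle) is hypothetical rather than taken from GPUSPH.

```cpp
#include <pthread.h>
#include <cstddef>
#include <cstdlib>

// Hypothetical shared state; none of these names come from GPUSPH.
struct WorkerArgs {
    const float *input;       // read-only for the worker, as in the report
    std::size_t count;        // number of elements in input
    double sum;               // result written by the worker
    pthread_barrier_t *done;  // barrier the main thread waits on
};

// Stand-in for the GPU thread: it only reads the host buffer.
static void *worker(void *p) {
    WorkerArgs *args = static_cast<WorkerArgs *>(p);
    double s = 0.0;
    for (std::size_t i = 0; i < args->count; ++i) s += args->input[i];
    args->sum = s;
    pthread_barrier_wait(args->done);  // tell the main thread we are finished
    pthread_exit(nullptr);             // the thread ends with pthread_exit
    return nullptr;                    // not reached
}

// Allocate, run one worker, wait, deallocate. Returns 0 on success.
int run_lifecycle(std::size_t count) {
    float *a = new float[count];  // buffer allocated with new
    float *b = static_cast<float *>(std::calloc(count, sizeof(float)));
    if (b == nullptr) { delete[] a; return 1; }
    for (std::size_t i = 0; i < count; ++i) a[i] = 1.0f;

    pthread_barrier_t done;
    if (pthread_barrier_init(&done, nullptr, 2) != 0) {
        delete[] a; std::free(b); return 2;
    }

    WorkerArgs args = {a, count, 0.0, &done};
    pthread_t tid;
    if (pthread_create(&tid, nullptr, worker, &args) != 0) {
        pthread_barrier_destroy(&done);
        delete[] a; std::free(b); return 3;
    }

    pthread_barrier_wait(&done);  // wait for the worker via the barrier
    // Assumption: joining as well guarantees the worker has left the barrier
    // before the barrier and the buffers are destroyed.
    pthread_join(tid, nullptr);
    pthread_barrier_destroy(&done);

    bool ok = (args.sum == static_cast<double>(count));

    delete[] a;    // the deallocation step where the reported hang occurs
    std::free(b);
    return ok ? 0 : 4;
}
```

Calling run_lifecycle(1024) from main and building with -pthread exercises the same order of operations (allocate, start thread, barrier, deallocate) outside of MPI, which may help to check whether the hang depends on the MPI library being linked in.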