[mvapich-discuss] deadlock behavior

Sreeram Potluri potluri at cse.ohio-state.edu
Tue May 21 12:51:32 EDT 2013


Hi Brody,

Thanks for reporting the issue. Can you send us a reproducer, if
possible?

Sreeram Potluri


On Tue, May 21, 2013 at 4:27 AM, Brody Huval <brodyh at stanford.edu> wrote:

> Hi,
>
> I'm currently getting a deadlock from a CUDA+MPI program and I think it's
> coming from MVAPICH2. The behavior is the same whether sending from device
> memory or sending from the host (after a transfer from the GPU). Running
> the same code with MPICH2 does not produce the deadlock.
>
> The deadlock seems to occur when I try to perform the equivalent of an
> asynchronous MPI_Allreduce on a single float, using MPI_Isend and
> MPI_Irecv calls. However, one of the ranks never receives its message,
> despite it being sent. When I run with 4 processes, the deadlock does not
> happen on a single node, only when the job is split across 2 nodes, and
> even then it depends on the rank order.
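>
> [For illustration, a minimal sketch of the kind of nonblocking
> point-to-point exchange being described; the naive all-to-all pattern,
> buffer names, and tag are assumptions, since the actual reproducer is not
> shown here:]
>
>     #include <mpi.h>
>
>     /* Each rank posts an MPI_Irecv/MPI_Isend pair toward every peer,
>      * waits on all requests, and then sums the values locally. */
>     float allsum(float local, MPI_Comm comm)
>     {
>         int rank, size;
>         MPI_Comm_rank(comm, &rank);
>         MPI_Comm_size(comm, &size);
>
>         float recvbuf[size];               /* one slot per peer */
>         MPI_Request reqs[2 * size];
>         int n = 0;
>
>         for (int peer = 0; peer < size; peer++) {
>             if (peer == rank) continue;
>             MPI_Irecv(&recvbuf[peer], 1, MPI_FLOAT, peer, 0, comm, &reqs[n++]);
>             MPI_Isend(&local, 1, MPI_FLOAT, peer, 0, comm, &reqs[n++]);
>         }
>         MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);
>
>         float sum = local;
>         for (int peer = 0; peer < size; peer++)
>             if (peer != rank) sum += recvbuf[peer];
>         return sum;
>     }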
>
> The problem goes away if I replace the Isends and Irecvs with
> MPI_Allreduce and use the flag MV2_CUDA_SMP_IPC=1. Using the flag with
> Isends & Irecvs from the host causes a segfault, while using it after
> transferring to the host works.
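>
> [Again for illustration, a sketch of the workaround being described,
> assuming MVAPICH2 was built with CUDA support and that d_in/d_out are
> cudaMalloc'd device buffers; MV2_CUDA_SMP_IPC=1 would be set in the job
> environment, e.g. on the mpirun_rsh command line:]
>
>     /* Let the library perform the reduction directly on device memory. */
>     float *d_in, *d_out;   /* assumed: single floats allocated with cudaMalloc */
>     MPI_Allreduce(d_in, d_out, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);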
>
> Any idea what could be causing this? Thanks in advance for any help.
>
> Best,
> Brody Huval