[mvapich-discuss] Deadlock with CUDA and InfiniBand
Witherden, Freddie
freddie.witherden08 at imperial.ac.uk
Wed Sep 10 19:12:50 EDT 2014
Hello,
I have a somewhat obscure deadlock issue that only occurs under a specific set of circumstances. The application I develop, PyFR (http://pyfr.org/), is written in Python and uses CUDA and MPI to run on GPU clusters (although we do not use any CUDA-aware MPI functionality; all device-host copying is marshalled by us).
When using TCP, SHM, or a combination thereof as the transport layer, no problems are observed. Further, when running over IB with one rank per node, no problems are observed. However, when running over IB with some ranks on the same node, PyFR deadlocks. The deadlock can be avoided by setting MV2_USE_RDMA_FAST_PATH=0.
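For reference, this is how we apply the workaround; the launcher invocation below is illustrative only (rank count, hostfile, and application arguments are placeholders, not from an actual run):

```shell
# Disable MVAPICH2's RDMA fast path; this avoids the deadlock for us.
export MV2_USE_RDMA_FAST_PATH=0

# Equivalent one-shot form when launching with mpirun_rsh (arguments are
# placeholders):
# mpirun_rsh -np 4 -hostfile hosts MV2_USE_RDMA_FAST_PATH=0 python pyfr ...
```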
Interestingly, we observed a similar issue with Intel MPI a while back; specifically version 3 would deadlock when running over IB (but not TCP or SHM). This was resolved by upgrading to version 4. No issues have ever been observed when using Platform MPI.
Searching, I found the following mailing list post describing similar behaviour:
http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2013-May/004435.html
which, like the issue in PyFR, appears to occur when combining CUDA and IB and can be resolved by setting MV2_USE_RDMA_FAST_PATH=0.
From an API standpoint PyFR is relatively simple: all requests are persistent, point-to-point, and non-blocking. Unfortunately, my attempts to produce a reduced test case have never got very far -- only the complete application is able to reliably produce deadlocks.
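To make the communication pattern concrete, here is a minimal mpi4py-style sketch of the persistent, point-to-point, non-blocking usage described above. It is an illustration of the pattern only, not code from PyFR, and the buffer sizes, tags, and partner-rank scheme are hypothetical; running it requires an MPI launcher (e.g. two ranks).

```python
# Illustrative sketch (hypothetical names/sizes); requires an MPI runtime,
# e.g. "mpirun -np 2 python this_script.py".
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = rank ^ 1  # exchange with a partner rank (hypothetical pairing)

sbuf = np.empty(1024, dtype='d')
rbuf = np.empty(1024, dtype='d')

# Persistent requests are set up once...
sreq = comm.Send_init(sbuf, dest=peer, tag=0)
rreq = comm.Recv_init(rbuf, source=peer, tag=0)

# ...and restarted every step.  In PyFR the send buffer would be filled
# from data explicitly copied off the GPU before the start call.
for step in range(100):
    sbuf[:] = rank + step
    MPI.Prequest.Startall([rreq, sreq])
    MPI.Request.Waitall([rreq, sreq])
```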
What is the best way to proceed?
Regards, Freddie.