[mvapich-discuss] Segmentation fault when using CUDA device interop

Joshua Anderson joaander at umich.edu
Tue Sep 4 10:22:33 EDT 2012


Hi all,

I had been using mvapich2 1.8 with CUDA interop for a while without any problems, but since our cluster's software upgrade I have been getting segmentation faults. The upgrade was from RHEL 5.5 / CUDA 4.1 to RHEL 6 / CUDA 4.2, and we're currently running NVIDIA driver 295.41.

I see the behavior even with the simple benchmark applications.

I'm running on nodes with either 4 or 8 attached GPUs (in Dell C410x chassis) and no InfiniBand. Following the recommendations in the user guide, I set the following environment variables:
export MV2_USE_CUDA=1
export MV2_USE_SHARED_MEM=1 
export MV2_SMP_SEND_BUF_SIZE=262144

Any ideas?

Here is some sample output. The error message is the same regardless of the application, and the crash seems to occur on the first MPI_* call that reads or writes device memory.
--------
$ mpirun -n 2 ~/get_local_rank.sh ./osu_bw D D
CMA: no RDMA devices found
CMA: no RDMA devices found
# OSU MPI-CUDA Bandwidth Test v3.6
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
[nyx0151.engin.umich.edu:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
--------
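In case it helps, the pattern in question boils down to something like the following rough sketch (not the actual osu_bw source; the device-selection step and the MV2_COMM_WORLD_LOCAL_RANK variable name are my understanding of what the get_local_rank.sh wrapper arranges, so treat that part as an assumption):
--------
/* Minimal sketch: rank 0 sends a 1 MB device buffer to rank 1
 * with MV2_USE_CUDA=1 set. */
#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    /* Pick a GPU before MPI_Init based on the node-local rank
     * (the variable name is an assumption about what the wrapper exports). */
    const char *lrank = getenv("MV2_COMM_WORLD_LOCAL_RANK");
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (lrank && ndev > 0)
        cudaSetDevice(atoi(lrank) % ndev);

    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    void *d_buf = NULL;
    cudaMalloc(&d_buf, 1 << 20);      /* 1 MB buffer in device memory */
    cudaMemset(d_buf, 0, 1 << 20);

    /* First MPI call that touches device memory -- this is where the
     * segfault seems to hit. */
    if (rank == 0)
        MPI_Send(d_buf, 1 << 20, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, 1 << 20, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
--------
Something like this would be built with mpicc, linked against the CUDA runtime, and launched through the same get_local_rank.sh wrapper as above.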
Joshua A. Anderson, Ph.D.
Chemical Engineering Department, University of Michigan



