[mvapich-discuss] multi-threaded CUDA MPI_Allgatherv crash

Justin Luitjens jluitjens at nvidia.com
Wed Oct 30 17:01:44 EDT 2013


I wanted to follow up on this in case anyone else hits it.

This was tracked down to a bug in the R319 driver, which will be fixed in a future driver release.

Thanks,
Justin

From: mvapich-discuss [mailto:mvapich-discuss-bounces at cse.ohio-state.edu] On Behalf Of Justin Luitjens
Sent: Wednesday, October 16, 2013 1:33 PM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] multi-threaded CUDA MPI_Allgatherv crash

The attached reproducer crashes in MVAPICH2 2.0a. It appears that the GPUDirect version of MPI_Allgatherv is not thread-safe.

I compiled this as follows:

%> nvcc -c -arch=sm_20 -O3 -I/shared/devtechapps/mpi/gnu-4.7.3/mvapich2-2.0a/cuda-5.5.22/include -Xcompiler -fopenmp mpialltoall.cu -o mpialltoall.o
%> mpic++ -o alltoall mpialltoall.o -L/shared/apps/cuda/CUDA-v5.5.22/lib64 -lcuda -lcudart -fopenmp

I then set the following variables:

export MV2_USE_CUDA=1
export MV2_ENABLE_AFFINITY=0

Finally I ran with this:

%> mpirun -np 2 ./alltoall

This crashes with the following error:

[dt00:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
[dt00:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

If I set the number of threads to 1, this example runs fine.
If I set the number of threads to 2 and use host memory, the example also runs fine.
It only seems to crash when the data is in device memory and multiple threads are used.
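
For readers without the attachment, the failing pattern can be sketched roughly as follows. This is not the original reproducer. The thread count, buffer size, use of device 0, and the one-communicator-per-thread layout are all assumptions chosen for illustration. It should build with the same nvcc and mpic++ lines shown above.

```cuda
// Hypothetical sketch of the pattern described above, NOT the attached reproducer.
// Each OpenMP thread calls MPI_Allgatherv on device buffers, using its own
// duplicated communicator so concurrent collectives never share a communicator.
#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        std::fprintf(stderr, "MPI_THREAD_MULTIPLE is not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int kThreads = 2;      // assumed; set to 1 for the single-thread case
    const int kCount = 1 << 16;  // assumed number of elements sent per rank

    // One communicator per thread (MPI_Comm_dup is collective, so do it serially).
    std::vector<MPI_Comm> comms(kThreads);
    for (int t = 0; t < kThreads; ++t) MPI_Comm_dup(MPI_COMM_WORLD, &comms[t]);

    // Every rank contributes kCount elements, placed contiguously by rank.
    std::vector<int> counts(size, kCount), displs(size);
    for (int r = 0; r < size; ++r) displs[r] = r * kCount;

    #pragma omp parallel num_threads(kThreads)
    {
        const int t = omp_get_thread_num();
        cudaSetDevice(0);  // assumed: all threads share device 0

        // Device buffers; swapping these for host allocations gives the host-memory case.
        double* send = nullptr;
        double* recv = nullptr;
        cudaMalloc(reinterpret_cast<void**>(&send), kCount * sizeof(double));
        cudaMalloc(reinterpret_cast<void**>(&recv),
                   static_cast<size_t>(size) * kCount * sizeof(double));
        cudaMemset(send, 0, kCount * sizeof(double));

        MPI_Allgatherv(send, kCount, MPI_DOUBLE,
                       recv, counts.data(), displs.data(), MPI_DOUBLE, comms[t]);

        cudaFree(send);
        cudaFree(recv);
    }

    for (int t = 0; t < kThreads; ++t) MPI_Comm_free(&comms[t]);
    MPI_Finalize();
    return 0;
}
```

Changing kThreads to 1, or replacing the cudaMalloc buffers with host allocations, gives the two configurations reported above as running fine.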

Thanks,
Justin