[mvapich-discuss] multi-threaded CUDA MPI_Allgatherv crash

sreeram potluri potluri at cse.ohio-state.edu
Thu Oct 17 17:40:01 EDT 2013


Justin,

Thank you for the reproducer. We are looking into this issue.

Best
Sreeram Potluri


On Wed, Oct 16, 2013 at 4:32 PM, Justin Luitjens <jluitjens at nvidia.com> wrote:

> The attached reproducer crashes in MVAPICH2 2.0a.  It appears that the
> GPU direct version of MPI_Allgatherv is not thread safe.
>
> I compiled this as follows:
>
> %> nvcc -c -arch=sm_20 -O3
> -I/shared/devtechapps/mpi/gnu-4.7.3/mvapich2-2.0a/cuda-5.5.22/include
> -Xcompiler -fopenmp mpialltoall.cu -o mpialltoall.o
>
> %> mpic++ -o alltoall mpialltoall.o -L/shared/apps/cuda/CUDA-v5.5.22/lib64
> -lcuda -lcudart -fopenmp
>
> I then set the following variables:
>
> export MV2_USE_CUDA=1
> export MV2_ENABLE_AFFINITY=0
>
> Finally I ran with this:
>
> %> mpirun -np 2 ./alltoall
>
> This crashes with the following error:
>
> [dt00:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
> [dt00:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   EXIT CODE: 11
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
> This typically refers to a problem with your application.
> Please see the FAQ page for debugging suggestions
>
> If I set the number of threads to 1, this example runs fine.
> If I set the number of threads to 2 and use host memory, the example also
> runs fine.
> This only seems to crash if the data is in device memory and we use
> multiple threads.
>
> Thanks,
> Justin
>
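
For readers without the attachment, the pattern described in the quoted
report (several OpenMP threads per rank, each calling MPI_Allgatherv on
device buffers) can be sketched roughly as below. This is a hypothetical
illustration, not the attached reproducer: the buffer size, the thread
count, and the use of one duplicated communicator per thread are all
assumptions, and CUDA/MPI error checking is omitted for brevity. It also
assumes every rank actually gets the requested number of threads.

```cuda
// Hypothetical sketch only; NOT the reproducer attached to the report.
#include <mpi.h>
#include <omp.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
    // Concurrent MPI calls from several threads require MPI_THREAD_MULTIPLE.
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        std::fprintf(stderr, "MPI_THREAD_MULTIPLE not provided\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int kThreads = 2;    // assumed thread count
    const int kCount = 1024;   // assumed elements contributed per rank

    // One communicator per thread (assumption), so that concurrent
    // collectives never share a communicator.
    std::vector<MPI_Comm> comms(kThreads);
    for (int t = 0; t < kThreads; ++t) MPI_Comm_dup(MPI_COMM_WORLD, &comms[t]);

    // Every rank contributes kCount ints, laid out contiguously by rank.
    std::vector<int> counts(size, kCount), displs(size);
    for (int r = 0; r < size; ++r) displs[r] = r * kCount;

    #pragma omp parallel num_threads(kThreads)
    {
        const int t = omp_get_thread_num();

        // Send and receive buffers live in device memory.
        int *send = nullptr, *recv = nullptr;
        cudaMalloc((void **)&send, kCount * sizeof(int));
        cudaMalloc((void **)&recv, (size_t)size * kCount * sizeof(int));
        cudaMemset(send, rank, kCount * sizeof(int));
        cudaDeviceSynchronize();

        // Device pointers are passed directly to MPI (MV2_USE_CUDA=1).
        MPI_Allgatherv(send, kCount, MPI_INT,
                       recv, counts.data(), displs.data(), MPI_INT, comms[t]);

        cudaFree(send);
        cudaFree(recv);
    }

    for (int t = 0; t < kThreads; ++t) MPI_Comm_free(&comms[t]);
    MPI_Finalize();
    return 0;
}
```

Under these assumptions it would be built and launched with the same
commands and environment variables given in the quoted message.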