[mvapich-discuss] Segmentation fault at some MPI functions after MPI_Put
khaled hamidouche
hamidouc at cse.ohio-state.edu
Tue Nov 3 15:58:08 EST 2015
Hi Akihiro,

Thanks for your report, and sorry to hear that you are facing an issue with
MV2-GDR. However, we are not able to reproduce it: your reproducer passes
on our local test bed.

Could you please give us more details on your setup:
1) How are you able to run without explicitly disabling GDRCOPY?
2) Can you run the OMB benchmark osu_put_latency and let us know whether it
passes for you?

Thanks
../install/bin/mpirun_rsh -np 3 ivy1 ivy1 ivy1 MV2_CUDA_IPC=1
MV2_NUM_PORTS=2 MV2_USE_CUDA=1 MV2_USE_GPUDIRECT_GDRCOPY=1
MV2_GPUDIRECT_GDRCOPY_LIB=/opt/gdrcopy7.5/libgdrapi.so MV2_IBA_HCA=mlx5_1
./test_put 131073
put size=131073
Warning *** The GPU and IB selected are not on the same socket.
*** This configuration may not deliver the best performance.
Warning *** The GPU and IB selected are not on the same socket.
*** This configuration may not deliver the best performance.
Warning *** The GPU and IB selected are not on the same socket.
*** This configuration may not deliver the best performance.
On Tue, Nov 3, 2015 at 6:32 AM, Akihiro Tabuchi <
tabuchi at hpcs.cs.tsukuba.ac.jp> wrote:
> Dear MVAPICH developers,
>
> I use MVAPICH2-GDR 2.1 on a GPU cluster which has four GPUs on each node.
> Under the following conditions, MPI_Win_free or MPI_Barrier causes a
> segmentation fault after an MPI_Put to a GPU on another MPI process in the
> same node:
> 1. synchronization by MPI_Win_lock and MPI_Win_unlock
> 2. (128*N) KB < (MPI_Put transfer size) <= (128*N+8) KB, for N >= 1
> 3-a. when MV2_CUDA_IPC=1 and there are three or more processes on the
> node
> 3-b. when MV2_CUDA_IPC=0 and there are two or more processes on the
> node
>
> A test program and a backtrace from it are below.
>
> A test program
>
> ###################################################################################################
> #include <stdio.h>
> #include <stdlib.h>
> #include <mpi.h>
> #include <cuda_runtime.h>
> #define MAXSIZE (4*1024*1024)
>
> int main(int argc, char **argv){
>     MPI_Init(&argc, &argv);
>
>     if(argc != 2){
>         printf("few arguments\n");
>         return 1;
>     }
>     int size = atoi(argv[1]);
>     if(size > MAXSIZE){
>         printf("too large size\n");
>         return 1;
>     }
>     int rank, nranks;
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &nranks);
>
>     if(nranks < 2){
>         printf("# of processes must be more than 1\n");
>         return 1;
>     }
>     if(rank == 0){
>         printf("put size=%d\n", size);
>     }
>
>     char *buf;
>     cudaMalloc((void**)&buf, MAXSIZE*sizeof(char));
>     MPI_Win win;
>     MPI_Win_create((void*)buf, MAXSIZE*sizeof(char), sizeof(char),
>                    MPI_INFO_NULL, MPI_COMM_WORLD, &win);
>
>     if(rank == 0){
>         int target_rank = 1;
>         MPI_Win_lock(MPI_LOCK_SHARED, target_rank, 0, win);
>         MPI_Put((void*)buf, size, MPI_BYTE, target_rank, 0, size,
>                 MPI_BYTE, win);
>         MPI_Win_unlock(target_rank, win);
>     }
>
>     //MPI_Barrier(MPI_COMM_WORLD);
>     MPI_Win_free(&win);
>     cudaFree(buf);
>     MPI_Finalize();
>     return 0;
> }
>
> ###################################################################################################
>
>
> A backtrace when the program was run with
> "mpirun_rsh -np 3 -hostfile $PBS_NODEFILE MV2_NUM_PORTS=2 MV2_USE_CUDA=1
> MV2_CUDA_IPC=1 ./put_test 131073"
> (all three processes running on the same node):
>
> ###################################################################################################
> [tcag-0001:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
> [tcag-0001:mpi_rank_1][print_backtrace] 0: /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(print_backtrace+0x23) [0x2b49628c7753]
> [tcag-0001:mpi_rank_1][print_backtrace] 1: /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(error_sighandler+0x5e) [0x2b49628c786e]
> [tcag-0001:mpi_rank_1][print_backtrace] 2: /lib64/libc.so.6(+0x326b0) [0x2b4962c7b6b0]
> [tcag-0001:mpi_rank_1][print_backtrace] 3: /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(dreg_decr_refcount+0x27) [0x2b4962888447]
> [tcag-0001:mpi_rank_1][print_backtrace] 4: /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(dreg_unregister+0x11) [0x2b4962888a61]
> [tcag-0001:mpi_rank_1][print_backtrace] 5: /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIDI_CH3I_MRAILI_self_cq_poll+0x143) [0x2b4962895973]
> [tcag-0001:mpi_rank_1][print_backtrace] 6: /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIDI_CH3I_Progress+0x337) [0x2b4962866117]
> [tcag-0001:mpi_rank_1][print_backtrace] 7: /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIC_Wait+0x47) [0x2b496280bad7]
> [tcag-0001:mpi_rank_1][print_backtrace] 8: /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIC_Recv+0xb7) [0x2b496280c737]
> [tcag-0001:mpi_rank_1][print_backtrace] 9: /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIR_Reduce_scatter_block_intra+0x1fc8) [0x2b49625e38d8]
> [tcag-0001:mpi_rank_1][print_backtrace] 10: /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIR_Reduce_scatter_block_impl+0x4a) [0x2b49625e3d3a]
> [tcag-0001:mpi_rank_1][print_backtrace] 11: /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIDI_Win_free+0x25e) [0x2b496283ebfe]
> [tcag-0001:mpi_rank_1][print_backtrace] 12: /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPI_Win_free+0x23a) [0x2b49627ec62a]
> [tcag-0001:mpi_rank_1][print_backtrace] 13: ./put_test() [0x400ac8]
> [tcag-0001:mpi_rank_1][print_backtrace] 14: /lib64/libc.so.6(__libc_start_main+0xfd) [0x2b4962c67d5d]
> [tcag-0001:mpi_rank_1][print_backtrace] 15: ./put_test() [0x400919]
> [tcag-0001:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
> [tcag-0001:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
> [tcag-0001:mpispawn_0][child_handler] MPI process (rank: 1, pid: 25550) terminated with signal 11 -> abort job
> [tcag-0001:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node tcag-0001 aborted: Error while reading a PMI socket (4)
>
> ###################################################################################################
>
>
> Do you know the cause of this problem?
>
> Best regards,
> Akihiro Tabuchi
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>