[mvapich-discuss] Segmentation fault at some MPI functions after MPI_Put

Akihiro Tabuchi tabuchi at hpcs.cs.tsukuba.ac.jp
Wed Nov 4 01:47:43 EST 2015


Dear Khaled and Jiri,

Thank you for your reply.
I forgot to mention that I set MV2_USE_GPUDIRECT_GDRCOPY=0 because GDRCOPY
for CUDA 7.5 is not installed on the cluster.
osu_put_latency passed, but the results are unreasonable when
MV2_CUDA_IPC=1: the latency stays flat at about 3.2 us regardless of message size.

When MV2_CUDA_IPC=1:
("mpirun_rsh -np 2 -hostfile $PBS_NODEFILE MV2_NUM_PORTS=2 
MV2_USE_CUDA=1 MV2_CUDA_IPC=1 MV2_USE_GPUDIRECT_GDRCOPY=0 
./local_rank.sh osu_put_latency -d cuda -w create -s lock D D")
(local_rank.sh sets LOCAL_RANK=$MV2_COMM_WORLD_LOCAL_RANK for each process
so that it selects its own GPU)
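
For reference, a minimal in-program equivalent of what local_rank.sh does,
assuming mpirun_rsh exports MV2_COMM_WORLD_LOCAL_RANK; this is only a sketch,
not the script itself:
###################################################################################################
#include <stdlib.h>
#include <cuda_runtime.h>

/* Pick the GPU from the local rank exported by mpirun_rsh.
   With MV2_USE_CUDA=1 the device is usually selected before MPI_Init,
   so this would be called at the very start of main(). */
static void select_gpu_by_local_rank(void)
{
    const char *lrank = getenv("MV2_COMM_WORLD_LOCAL_RANK");
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (lrank != NULL && ndev > 0) {
        cudaSetDevice(atoi(lrank) % ndev);
    }
}
###################################################################################################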
###################################################################################################
# OSU MPI_Put-CUDA Latency Test v5.0
# Window creation: MPI_Win_create
# Synchronization: MPI_Win_lock/unlock
# Rank 0 Memory on DEVICE (D) and Rank 1 Memory on DEVICE (D)
# Size          Latency (us)
0                       0.05
1                       3.30
2                       3.29
4                       3.29
8                       3.30
16                      3.31
32                      3.30
64                      3.26
128                     3.30
256                     3.29
512                     3.27
1024                    3.34
2048                    3.25
4096                    3.26
8192                    3.46
16384                   3.25
32768                   3.21
65536                   3.18
131072                  3.33
262144                  3.22
524288                  3.14
1048576                 3.20
2097152                 3.17
4194304                 3.21
###################################################################################################

When MV2_CUDA_IPC=0:
###################################################################################################
# OSU MPI_Put-CUDA Latency Test v5.0
# Window creation: MPI_Win_create
# Synchronization: MPI_Win_lock/unlock
# Rank 0 Memory on DEVICE (D) and Rank 1 Memory on DEVICE (D)
# Size          Latency (us)
0                       0.05
1                       4.41
2                       4.40
4                       4.41
8                       4.41
16                      4.40
32                      4.41
64                      4.48
128                     4.80
256                     5.38
512                     6.47
1024                    8.94
2048                   13.58
4096                   21.33
8192                   36.63
16384                  38.95
32768                  55.44
65536                  82.53
131072                 65.37
262144                 94.06
524288                143.40
1048576               252.99
2097152               493.56
4194304               976.52
###################################################################################################
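
One way to see whether the Put under MV2_CUDA_IPC=1 actually moves any data
(the flat latencies above make that doubtful) would be a check along the lines
below, added to the reproducer quoted later in this mail just before
MPI_Win_free, for sizes where the run completes. This is only a sketch that
reuses the reproducer's buf/size/win variables; the 0xAB pattern is arbitrary.
###################################################################################################
  if(rank == 0){
    cudaMemset(buf, 0xAB, size);  /* fill the origin device buffer with a known pattern */
    MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
    MPI_Put(buf, size, MPI_BYTE, 1, 0, size, MPI_BYTE, win);
    MPI_Win_unlock(1, win);
  }
  MPI_Barrier(MPI_COMM_WORLD);
  if(rank == 1){
    /* lock/unlock the local window so the remotely written data becomes visible
       (required under the separate memory model, harmless under the unified one) */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
    MPI_Win_unlock(1, win);
    char *host = (char*)malloc(size);
    cudaMemcpy(host, buf, size, cudaMemcpyDeviceToHost);
    int ok = 1;
    for(int i = 0; i < size; i++){
      if(host[i] != (char)0xAB){ ok = 0; break; }
    }
    printf("put data %s\n", ok ? "verified" : "NOT transferred");
    free(host);
  }
###################################################################################################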


nvidia-smi topo -m
###################################################################################################
         GPU0    GPU1    GPU2    GPU3    mlx4_0  CPU Affinity
GPU0     X      PHB     SOC     SOC     SOC     0-9
GPU1    PHB      X      SOC     SOC     SOC     0-9
GPU2    SOC     SOC      X      PHB     PHB     10-19
GPU3    SOC     SOC     PHB      X      PHB     10-19
mlx4_0  SOC     SOC     PHB     PHB      X

Legend:

   X   = Self
   SOC = Path traverses a socket-level link (e.g. QPI)
   PHB = Path traverses a PCIe host bridge
   PXB = Path traverses multiple PCIe internal switches
   PIX = Path traverses a PCIe internal switch
###################################################################################################


The system configuration is as follows.
######################################
CPU: Intel Xeon-E5 2680v2 x 2socket
GPU: NVIDIA K20X x 4
IB:  Mellanox Connect-X3 Dual-port QDR
######################################

Best regards,
Akihiro Tabuchi

On November 4, 2015 at 06:18, Jiri Kraus wrote:
> Hi Akihiro,
>
> can you provide the output of
>
> $ nvidia-smi topo -m
>
> on the machine where this happens?
>
> Thanks
>
> Jiri
>
> Sent from my smartphone. Please excuse autocorrect typos.
>
>
>
> ---- Akihiro Tabuchi schrieb ----
>
> Dear MVAPICH developers,
>
> I use MVAPICH2-GDR 2.1 on a GPU cluster that has four GPUs on each node.
> Under the following conditions, MPI_Win_free or MPI_Barrier causes a
> segmentation fault after an MPI_Put to a GPU owned by another MPI process
> on the same node:
>   1.  synchronization is done with MPI_Win_lock and MPI_Win_unlock
>   2.  (128*N) KB < (MPI_Put transfer size) <= (128*N+8) KB, for some N >= 1
>        (see the check sketched below)
>   3-a. with MV2_CUDA_IPC=1, three or more processes run on a node
>   3-b. with MV2_CUDA_IPC=0, two or more processes run on a node
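>
> For example, 131073 bytes (128 KB + 1 B) satisfies condition 2 with N=1,
> while 131072 bytes does not. An illustrative helper for testing a size
> against this range (hypothetical, just to make the range concrete; size_t
> comes from <stddef.h>):
>
> /* returns 1 if `bytes` falls in the reported faulting range:
>    128*N KB < bytes <= (128*N + 8) KB for some N >= 1 */
> static int in_faulting_range(size_t bytes)
> {
>     size_t n = bytes / (128 * 1024);
>     return n >= 1 && bytes > n * (128 * 1024)
>                   && bytes <= n * (128 * 1024) + 8 * 1024;
> }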
>
> A test program and its backtrace are below.
>
> A test program
> ###################################################################################################
> #include <stdio.h>
> #include <stdlib.h>
> #include <mpi.h>
> #include <cuda_runtime.h>
> #define MAXSIZE (4*1024*1024)
>
> int main(int argc, char **argv){
>    MPI_Init(&argc, &argv);
>
>    if(argc != 2){
>      printf("too few arguments\n");
>      MPI_Finalize();
>      return 1;
>    }
>    int size = atoi(argv[1]);
>    if(size > MAXSIZE){
>      printf("too large size\n");
>      MPI_Finalize();
>      return 1;
>    }
>    int rank, nranks;
>    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
>
>    if(nranks < 2){
>      printf("# of processes must be at least 2\n");
>      MPI_Finalize();
>      return 1;
>    }
>    if(rank == 0){
>      printf("put size=%d\n", size);
>    }
>
>    char *buf;
>    cudaMalloc((void**)&buf, MAXSIZE*sizeof(char));
>    MPI_Win win;
>    MPI_Win_create((void*)buf, MAXSIZE*sizeof(char), sizeof(char),
> MPI_INFO_NULL, MPI_COMM_WORLD, &win);
>
>    if(rank == 0){
>      int target_rank = 1;
>      MPI_Win_lock(MPI_LOCK_SHARED, target_rank, 0, win);
>      MPI_Put((void*)buf, size, MPI_BYTE, target_rank, 0, size, MPI_BYTE,
> win);
>      MPI_Win_unlock(target_rank, win);
>    }
>
>    //MPI_Barrier(MPI_COMM_WORLD);
>    MPI_Win_free(&win);
>    cudaFree(buf);
>    MPI_Finalize();
>    return 0;
> }
> ###################################################################################################
>
>
> A backtrace when the program is run with
> "mpirun_rsh -np 3 -hostfile $PBS_NODEFILE MV2_NUM_PORTS=2 MV2_USE_CUDA=1
> MV2_CUDA_IPC=1 ./put_test 131073"
> (three processes run on the same node)
> ###################################################################################################
> [tcag-0001:mpi_rank_1][error_sighandler] Caught error: Segmentation
> fault (signal 11)
> [tcag-0001:mpi_rank_1][print_backtrace]   0:
> /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(print_backtrace+0x23)
> [0x2b49628c7753]
> [tcag-0001:mpi_rank_1][print_backtrace]   1:
> /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(error_sighandler+0x5e)
> [0x2b49628c786e]
> [tcag-0001:mpi_rank_1][print_backtrace]   2: /lib64/libc.so.6(+0x326b0)
> [0x2b4962c7b6b0]
> [tcag-0001:mpi_rank_1][print_backtrace]   3:
> /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(dreg_decr_refcount+0x27)
> [0x2b4962888447]
> [tcag-0001:mpi_rank_1][print_backtrace]   4:
> /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(dreg_unregister+0x11)
> [0x2b4962888a61]
> [tcag-0001:mpi_rank_1][print_backtrace]   5:
> /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIDI_CH3I_MRAILI_self_cq_poll+0x143)
> [0x2b4962895973]
> [tcag-0001:mpi_rank_1][print_backtrace]   6:
> /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIDI_CH3I_Progress+0x337)
> [0x2b4962866117]
> [tcag-0001:mpi_rank_1][print_backtrace]   7:
> /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIC_Wait+0x47)
> [0x2b496280bad7]
> [tcag-0001:mpi_rank_1][print_backtrace]   8:
> /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIC_Recv+0xb7)
> [0x2b496280c737]
> [tcag-0001:mpi_rank_1][print_backtrace]   9:
> /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIR_Reduce_scatter_block_intra+0x1fc8)
> [0x2b49625e38d8]
> [tcag-0001:mpi_rank_1][print_backtrace]  10:
> /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIR_Reduce_scatter_block_impl+0x4a)
> [0x2b49625e3d3a]
> [tcag-0001:mpi_rank_1][print_backtrace]  11:
> /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIDI_Win_free+0x25e)
> [0x2b496283ebfe]
> [tcag-0001:mpi_rank_1][print_backtrace]  12:
> /work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPI_Win_free+0x23a)
> [0x2b49627ec62a]
> [tcag-0001:mpi_rank_1][print_backtrace]  13: ./put_test() [0x400ac8]
> [tcag-0001:mpi_rank_1][print_backtrace]  14:
> /lib64/libc.so.6(__libc_start_main+0xfd) [0x2b4962c67d5d]
> [tcag-0001:mpi_rank_1][print_backtrace]  15: ./put_test() [0x400919]
> [tcag-0001:mpispawn_0][readline] Unexpected End-Of-File on file
> descriptor 6. MPI process died?
> [tcag-0001:mpispawn_0][mtpmi_processops] Error while reading PMI socket.
> MPI process died?
> [tcag-0001:mpispawn_0][child_handler] MPI process (rank: 1, pid: 25550)
> terminated with signal 11 -> abort job
> [tcag-0001:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node
> tcag-0001 aborted: Error while reading a PMI socket (4)
> ###################################################################################################
>
>
> Do you know the cause of this problem?
>
> Best regards,
> Akihiro Tabuchi
>

-- 
Akihiro Tabuchi
tabuchi at hpcs.cs.tsukuba.ac.jp

