[mvapich-discuss] Segmentation fault at some MPI functions after MPI_Put

Akihiro Tabuchi tabuchi at hpcs.cs.tsukuba.ac.jp
Tue Nov 3 06:32:25 EST 2015


Dear MVAPICH developers,

I use MVAPICH2-GDR 2.1 on a GPU cluster which has four GPUs on each node.
Under the following conditions, MPI_Win_free or MPI_Barrier causes a
segmentation fault after an MPI_Put to a GPU buffer owned by another MPI
process on the same node.
 1.  Synchronization is done with MPI_Win_lock and MPI_Win_unlock.
 2.  (128*N) KB < (MPI_Put transfer size) <= (128*N+8) KB, for some N >= 1
      (see the size-check sketch after this list).
 3-a. With MV2_CUDA_IPC=1, three or more processes run on the node.
 3-b. With MV2_CUDA_IPC=0, two or more processes run on the node.
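
To make condition 2 concrete, here is a small stand-alone helper (my own
illustration, not part of the reproducer or of MVAPICH) that checks whether a
byte count falls into the failing window. For example, 131073 bytes
(128 KB + 1, the size used in the run below) is inside the window, while
131072 bytes (exactly 128 KB) is not.
###################################################################################################
#include <stdio.h>

/* Returns 1 if 'size' satisfies condition 2:
 * (128*N) KB < size <= (128*N + 8) KB for some integer N >= 1. */
static int in_failing_range(long size)
{
  long n    = size / (128L * 1024);    /* number of complete 128 KB blocks */
  long over = size - n * 128L * 1024;  /* bytes beyond the 128*N KB boundary */
  return n >= 1 && over > 0 && over <= 8 * 1024;
}

int main(void)
{
  printf("%d\n", in_failing_range(131072));  /* exactly 128 KB        -> 0 */
  printf("%d\n", in_failing_range(131073));  /* 128 KB + 1 byte       -> 1 */
  printf("%d\n", in_failing_range(139264));  /* 128 KB + 8 KB         -> 1 */
  printf("%d\n", in_failing_range(139265));  /* one byte beyond that  -> 0 */
  return 0;
}
###################################################################################################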

A test program and a backtrace from it are shown below.

A test program
###################################################################################################
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>
#define MAXSIZE (4*1024*1024)

int main(int argc, char **argv){
  MPI_Init(&argc, &argv);

  if(argc != 2){
    printf("few arguments\n");
    MPI_Finalize();
    return 1;
  }
  int size = atoi(argv[1]);
  if(size > MAXSIZE){
    printf("too large size\n");
    MPI_Finalize();
    return 1;
  }
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  if(nranks < 2){
    printf("# of processes must be more than 1\n");
    MPI_Finalize();
    return 1;
  }
  if(rank == 0){
    printf("put size=%d\n", size);
  }

  char *buf;
  cudaMalloc((void**)&buf, MAXSIZE*sizeof(char));  /* window memory lives on the GPU */
  MPI_Win win;
  MPI_Win_create((void*)buf, MAXSIZE*sizeof(char), sizeof(char),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &win);

  if(rank == 0){
    /* Put 'size' bytes into rank 1's GPU window under lock/unlock
       synchronization (condition 1). */
    int target_rank = 1;
    MPI_Win_lock(MPI_LOCK_SHARED, target_rank, 0, win);
    MPI_Put((void*)buf, size, MPI_BYTE, target_rank, 0, size, MPI_BYTE, win);
    MPI_Win_unlock(target_rank, win);
  }

  //MPI_Barrier(MPI_COMM_WORLD);  /* if enabled, the fault occurs here instead */
  MPI_Win_free(&win);             /* segmentation fault occurs here (see backtrace) */
  cudaFree(buf);
  MPI_Finalize();
  return 0;
}
###################################################################################################


A backtrace produced when the program was run with
"mpirun_rsh -np 3 -hostfile $PBS_NODEFILE MV2_NUM_PORTS=2 MV2_USE_CUDA=1
MV2_CUDA_IPC=1 ./put_test 131073"
(all three processes running on the same node):
###################################################################################################
[tcag-0001:mpi_rank_1][error_sighandler] Caught error: Segmentation
fault (signal 11)
[tcag-0001:mpi_rank_1][print_backtrace]   0:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(print_backtrace+0x23)
[0x2b49628c7753]
[tcag-0001:mpi_rank_1][print_backtrace]   1:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(error_sighandler+0x5e)
[0x2b49628c786e]
[tcag-0001:mpi_rank_1][print_backtrace]   2: /lib64/libc.so.6(+0x326b0)
[0x2b4962c7b6b0]
[tcag-0001:mpi_rank_1][print_backtrace]   3:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(dreg_decr_refcount+0x27)
[0x2b4962888447]
[tcag-0001:mpi_rank_1][print_backtrace]   4:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(dreg_unregister+0x11)
[0x2b4962888a61]
[tcag-0001:mpi_rank_1][print_backtrace]   5:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIDI_CH3I_MRAILI_self_cq_poll+0x143)
[0x2b4962895973]
[tcag-0001:mpi_rank_1][print_backtrace]   6:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIDI_CH3I_Progress+0x337)
[0x2b4962866117]
[tcag-0001:mpi_rank_1][print_backtrace]   7:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIC_Wait+0x47)
[0x2b496280bad7]
[tcag-0001:mpi_rank_1][print_backtrace]   8:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIC_Recv+0xb7)
[0x2b496280c737]
[tcag-0001:mpi_rank_1][print_backtrace]   9:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIR_Reduce_scatter_block_intra+0x1fc8)
[0x2b49625e38d8]
[tcag-0001:mpi_rank_1][print_backtrace]  10:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIR_Reduce_scatter_block_impl+0x4a)
[0x2b49625e3d3a]
[tcag-0001:mpi_rank_1][print_backtrace]  11:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIDI_Win_free+0x25e)
[0x2b496283ebfe]
[tcag-0001:mpi_rank_1][print_backtrace]  12:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPI_Win_free+0x23a)
[0x2b49627ec62a]
[tcag-0001:mpi_rank_1][print_backtrace]  13: ./put_test() [0x400ac8]
[tcag-0001:mpi_rank_1][print_backtrace]  14:
/lib64/libc.so.6(__libc_start_main+0xfd) [0x2b4962c67d5d]
[tcag-0001:mpi_rank_1][print_backtrace]  15: ./put_test() [0x400919]
[tcag-0001:mpispawn_0][readline] Unexpected End-Of-File on file
descriptor 6. MPI process died?
[tcag-0001:mpispawn_0][mtpmi_processops] Error while reading PMI socket.
MPI process died?
[tcag-0001:mpispawn_0][child_handler] MPI process (rank: 1, pid: 25550)
terminated with signal 11 -> abort job
[tcag-0001:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node
tcag-0001 aborted: Error while reading a PMI socket (4)
###################################################################################################


Do you know the cause of this problem?

Best regards,
Akihiro Tabuchi


