[mvapich-discuss] Segmentation fault at some MPI functions after MPI_Put

Jiri Kraus jkraus at nvidia.com
Tue Nov 3 16:18:33 EST 2015


Hi Akihiro,

can you provide the output of

$ nvidia-smi topo -m

on the machine where this happens?

Thanks

Jiri

Sent from my smartphone. Please excuse autocorrect typos.


---- Akihiro Tabuchi wrote ----

Dear MVAPICH developers,

I use MVAPICH2-GDR 2.1 on a GPU cluster which has four GPUs on each node.
Under the following conditions, MPI_Win_free or MPI_Barrier causes a
segmentation fault after an MPI_Put to a GPU buffer on another MPI
process in the same node:
 1.  synchronization is done via MPI_Win_lock and MPI_Win_unlock
 2.  (128*N) KB < (MPI_Put transfer size) <= (128*N+8) KB, for N >= 1
     (see the size check sketched after this list)
 3-a. with MV2_CUDA_IPC=1, three or more processes run on the node
 3-b. with MV2_CUDA_IPC=0, two or more processes run on the node
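
For condition 2, the size window can be checked mechanically. A minimal
sketch (not from the original report; the helper name in_fault_window is
hypothetical):

###################################################################################################
#include <stdio.h>

/* Returns 1 if 'size' (in bytes) satisfies condition 2 above:
   (128*N) KB < size <= (128*N+8) KB for some N >= 1. */
static int in_fault_window(long size)
{
  const long KB = 1024;
  long n = size / (128 * KB);  /* candidate N */
  long r = size % (128 * KB);  /* bytes past the 128*N KB boundary */
  return n >= 1 && r >= 1 && r <= 8 * KB;
}

int main(void)
{
  printf("%d\n", in_fault_window(131073));  /* 1: 128 KB + 1 (used below) */
  printf("%d\n", in_fault_window(131072));  /* 0: exactly 128 KB */
  printf("%d\n", in_fault_window(139264));  /* 1: exactly 136 KB */
  printf("%d\n", in_fault_window(139265));  /* 0: 136 KB + 1 */
  return 0;
}
###################################################################################################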

A test program and its backtrace are below.

A test program
###################################################################################################
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>
#define MAXSIZE (4*1024*1024)

int main(int argc, char **argv){
  MPI_Init(&argc, &argv);

  if(argc != 2){
    printf("usage: %s <put size in bytes>\n", argv[0]);
    MPI_Finalize();
    return 1;
  }
  int size = atoi(argv[1]);
  if(size > MAXSIZE){
    printf("size is too large\n");
    MPI_Finalize();
    return 1;
  }
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  if(nranks < 2){
    printf("# of processes must be at least 2\n");
    MPI_Finalize();
    return 1;
  }
  if(rank == 0){
    printf("put size=%d\n", size);
  }

  /* expose a device buffer as an RMA window */
  char *buf;
  cudaMalloc((void**)&buf, MAXSIZE*sizeof(char));
  MPI_Win win;
  MPI_Win_create((void*)buf, MAXSIZE*sizeof(char), sizeof(char),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &win);

  /* rank 0 puts 'size' bytes into rank 1's device buffer under a
     passive-target (lock/unlock) epoch */
  if(rank == 0){
    int target_rank = 1;
    MPI_Win_lock(MPI_LOCK_SHARED, target_rank, 0, win);
    MPI_Put((void*)buf, size, MPI_BYTE, target_rank, 0, size, MPI_BYTE, win);
    MPI_Win_unlock(target_rank, win);
  }

  //MPI_Barrier(MPI_COMM_WORLD);  /* if enabled instead, the fault occurs here */
  MPI_Win_free(&win);  /* segfaults here for the sizes in condition 2 */
  cudaFree(buf);
  MPI_Finalize();
  return 0;
}
###################################################################################################
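
To build the reproducer, a command along these lines should work
(illustrative, not from the original report; assumes the MVAPICH2-GDR
mpicc wrapper is in PATH and CUDA_HOME points at the CUDA 7.5 install):

$ mpicc -o put_test put_test.c -L${CUDA_HOME}/lib64 -lcudart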


A backtrace from running the program with
"mpirun_rsh -np 3 -hostfile $PBS_NODEFILE MV2_NUM_PORTS=2 MV2_USE_CUDA=1
MV2_CUDA_IPC=1 ./put_test 131073"
(three processes on the same node)
###################################################################################################
[tcag-0001:mpi_rank_1][error_sighandler] Caught error: Segmentation
fault (signal 11)
[tcag-0001:mpi_rank_1][print_backtrace]   0:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(print_backtrace+0x23)
[0x2b49628c7753]
[tcag-0001:mpi_rank_1][print_backtrace]   1:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(error_sighandler+0x5e)
[0x2b49628c786e]
[tcag-0001:mpi_rank_1][print_backtrace]   2: /lib64/libc.so.6(+0x326b0)
[0x2b4962c7b6b0]
[tcag-0001:mpi_rank_1][print_backtrace]   3:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(dreg_decr_refcount+0x27)
[0x2b4962888447]
[tcag-0001:mpi_rank_1][print_backtrace]   4:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(dreg_unregister+0x11)
[0x2b4962888a61]
[tcag-0001:mpi_rank_1][print_backtrace]   5:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIDI_CH3I_MRAILI_self_cq_poll+0x143)
[0x2b4962895973]
[tcag-0001:mpi_rank_1][print_backtrace]   6:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIDI_CH3I_Progress+0x337)
[0x2b4962866117]
[tcag-0001:mpi_rank_1][print_backtrace]   7:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIC_Wait+0x47)
[0x2b496280bad7]
[tcag-0001:mpi_rank_1][print_backtrace]   8:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIC_Recv+0xb7)
[0x2b496280c737]
[tcag-0001:mpi_rank_1][print_backtrace]   9:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIR_Reduce_scatter_block_intra+0x1fc8)
[0x2b49625e38d8]
[tcag-0001:mpi_rank_1][print_backtrace]  10:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIR_Reduce_scatter_block_impl+0x4a)
[0x2b49625e3d3a]
[tcag-0001:mpi_rank_1][print_backtrace]  11:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIDI_Win_free+0x25e)
[0x2b496283ebfe]
[tcag-0001:mpi_rank_1][print_backtrace]  12:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPI_Win_free+0x23a)
[0x2b49627ec62a]
[tcag-0001:mpi_rank_1][print_backtrace]  13: ./put_test() [0x400ac8]
[tcag-0001:mpi_rank_1][print_backtrace]  14:
/lib64/libc.so.6(__libc_start_main+0xfd) [0x2b4962c67d5d]
[tcag-0001:mpi_rank_1][print_backtrace]  15: ./put_test() [0x400919]
[tcag-0001:mpispawn_0][readline] Unexpected End-Of-File on file
descriptor 6. MPI process died?
[tcag-0001:mpispawn_0][mtpmi_processops] Error while reading PMI socket.
MPI process died?
[tcag-0001:mpispawn_0][child_handler] MPI process (rank: 1, pid: 25550)
terminated with signal 11 -> abort job
[tcag-0001:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node
tcag-0001 aborted: Error while reading a PMI socket (4)
###################################################################################################


Do you know the cause of this problem?

Best regards,
Akihiro Tabuchi

