[mvapich-discuss] Segmentation fault at some MPI functions after MPI_Put
Akihiro Tabuchi
tabuchi at hpcs.cs.tsukuba.ac.jp
Tue Nov 3 06:32:25 EST 2015
Dear MVAPICH developers,
I use MVAPICH2-GDR 2.1 on a GPU cluster which has four GPUs on each node.
Under the following conditions, MPI_Win_free or MPI_Barrier causes a
segmentation fault after an MPI_Put to a GPU buffer owned by another MPI
process on the same node.
1. Synchronization is done with MPI_Win_lock and MPI_Win_unlock.
2. (128*N) KB < (MPI_Put transfer size) <= (128*N+8) KB, for some N >= 1.
3-a. When MV2_CUDA_IPC=1, three or more processes run on the node.
3-b. When MV2_CUDA_IPC=0, two or more processes run on the node.
A test program and its backtrace are shown below.
A test program
###################################################################################################
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

#define MAXSIZE (4*1024*1024)

int main(int argc, char **argv){
    MPI_Init(&argc, &argv);
    if(argc != 2){
        printf("too few arguments\n");
        return 1;
    }
    int size = atoi(argv[1]);
    if(size > MAXSIZE){
        printf("size is too large\n");
        return 1;
    }
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    if(nranks < 2){
        printf("# of processes must be more than 1\n");
        return 1;
    }
    if(rank == 0){
        printf("put size=%d\n", size);
    }

    /* create an RMA window over a buffer in GPU memory */
    char *buf;
    cudaMalloc((void**)&buf, MAXSIZE*sizeof(char));
    MPI_Win win;
    MPI_Win_create((void*)buf, MAXSIZE*sizeof(char), sizeof(char),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* rank 0 puts `size` bytes into rank 1's GPU buffer,
       synchronized by passive-target lock/unlock */
    if(rank == 0){
        int target_rank = 1;
        MPI_Win_lock(MPI_LOCK_SHARED, target_rank, 0, win);
        MPI_Put((void*)buf, size, MPI_BYTE, target_rank, 0, size, MPI_BYTE,
                win);
        MPI_Win_unlock(target_rank, win);
    }

    //MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_free(&win);   /* segmentation fault occurs here (or at the barrier) */

    cudaFree(buf);
    MPI_Finalize();
    return 0;
}
###################################################################################################
A backtrace produced when the program was run with
"mpirun_rsh -np 3 -hostfile $PBS_NODEFILE MV2_NUM_PORTS=2 MV2_USE_CUDA=1
MV2_CUDA_IPC=1 ./put_test 131073"
(three processes running on the same node):
###################################################################################################
[tcag-0001:mpi_rank_1][error_sighandler] Caught error: Segmentation
fault (signal 11)
[tcag-0001:mpi_rank_1][print_backtrace] 0:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(print_backtrace+0x23)
[0x2b49628c7753]
[tcag-0001:mpi_rank_1][print_backtrace] 1:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(error_sighandler+0x5e)
[0x2b49628c786e]
[tcag-0001:mpi_rank_1][print_backtrace] 2: /lib64/libc.so.6(+0x326b0)
[0x2b4962c7b6b0]
[tcag-0001:mpi_rank_1][print_backtrace] 3:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(dreg_decr_refcount+0x27)
[0x2b4962888447]
[tcag-0001:mpi_rank_1][print_backtrace] 4:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(dreg_unregister+0x11)
[0x2b4962888a61]
[tcag-0001:mpi_rank_1][print_backtrace] 5:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIDI_CH3I_MRAILI_self_cq_poll+0x143)
[0x2b4962895973]
[tcag-0001:mpi_rank_1][print_backtrace] 6:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIDI_CH3I_Progress+0x337)
[0x2b4962866117]
[tcag-0001:mpi_rank_1][print_backtrace] 7:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIC_Wait+0x47)
[0x2b496280bad7]
[tcag-0001:mpi_rank_1][print_backtrace] 8:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIC_Recv+0xb7)
[0x2b496280c737]
[tcag-0001:mpi_rank_1][print_backtrace] 9:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIR_Reduce_scatter_block_intra+0x1fc8)
[0x2b49625e38d8]
[tcag-0001:mpi_rank_1][print_backtrace] 10:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIR_Reduce_scatter_block_impl+0x4a)
[0x2b49625e3d3a]
[tcag-0001:mpi_rank_1][print_backtrace] 11:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPIDI_Win_free+0x25e)
[0x2b496283ebfe]
[tcag-0001:mpi_rank_1][print_backtrace] 12:
/work/XMPTCA/tabuchi/local/opt/mvapich2/gdr/2.1/cuda7.5/gnu/lib64/libmpi.so.12(MPI_Win_free+0x23a)
[0x2b49627ec62a]
[tcag-0001:mpi_rank_1][print_backtrace] 13: ./put_test() [0x400ac8]
[tcag-0001:mpi_rank_1][print_backtrace] 14:
/lib64/libc.so.6(__libc_start_main+0xfd) [0x2b4962c67d5d]
[tcag-0001:mpi_rank_1][print_backtrace] 15: ./put_test() [0x400919]
[tcag-0001:mpispawn_0][readline] Unexpected End-Of-File on file
descriptor 6. MPI process died?
[tcag-0001:mpispawn_0][mtpmi_processops] Error while reading PMI socket.
MPI process died?
[tcag-0001:mpispawn_0][child_handler] MPI process (rank: 1, pid: 25550)
terminated with signal 11 -> abort job
[tcag-0001:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node
tcag-0001 aborted: Error while reading a PMI socket (4)
###################################################################################################
Do you know the cause of this problem?
Best regards,
Akihiro Tabuchi