[mvapich-discuss] problem on mvapich2-1.8.1 checkpoint/restart with BLCR

Suja Ramachandran sujaram at igcar.gov.in
Fri Feb 15 00:40:35 EST 2013


Dear All,

I am trying to implement checkpoint/restart with mvapich2-1.8.1 and
BLCR-0.8.4 on an 8-node cluster. mvapich2-1.8.1 is configured with
./configure --enable-ckpt --with-blcr=/usr/local --enable-g=all
--enable-error-messages=all --enable-shared
--prefix=/share/apps/mvapich2-1.8.1/ --disable-rdma-cm
where /share/apps is shared via NFS.
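To confirm that checkpoint support is really compiled into the installed
build, the configure options can be cross-checked with mpiname; assuming
the bin directory of this installation is used, mpiname -a should echo
the configure line above:

   /share/apps/mvapich2-1.8.1/bin/mpiname -a   # prints the MVAPICH2 version and the configure options of this build

The observed behaviour is: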
1. MVAPICH2 jobs run fine across multiple nodes.
2. On a single node, an MVAPICH2 job can be successfully checkpointed
and restarted using cr_checkpoint -p <pid> and cr_restart (the exact
command sequence is sketched at the end of this mail).
3. For MPI jobs running across multiple nodes (b2c1 and b2c2 in this
case), once the job is checkpointed with cr_checkpoint -p <pid>, the
checkpoint files are created without any errors and the job runs to
completion. But once the job is over, it gives the following errors:

[b2c1.local:mpi_rank_0][error_sighandler] Caught error: Segmentation 
fault (signal 11)
[b2c1.local:mpi_rank_2][error_sighandler] Caught error: Segmentation 
fault (signal 11)
[b2c1.local:mpispawn_0][child_handler] MPI process (rank: 0, pid: 873) 
terminated with signal 11 -> abort job
[b2c1.local:mpispawn_0][readline] Unexpected End-Of-File on file 
descriptor 8. MPI process died?
[b2c1.local:mpispawn_0][mtpmi_processops] Error while reading PMI 
socket. MPI process died?
[b2c1.local:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from 
node 20.0.3.253 aborted: MPI process error (1)
[b2c2.local:mpispawn_1][read_size] Unexpected End-Of-File on file 
descriptor 8. MPI process died?
[b2c2.local:mpispawn_1][read_size] Unexpected End-Of-File on file 
descriptor 8. MPI process died?
[b2c2.local:mpispawn_1][handle_mt_peer] Error while reading PMI socket. 
MPI process died?

When I enabled backtraces with MV2_DEBUG_SHOW_BACKTRACE=1, the output
is as follows:

[b2c2.local:mpi_rank_7][error_sighandler] Caught error: Segmentation 
fault (signal 11)
[b2c2.local:mpi_rank_7][print_backtrace]   0: ./vector [0x49cc8a]
[b2c2.local:mpi_rank_7][print_backtrace]   1: ./vector [0x49cd74]
[b2c2.local:mpi_rank_7][print_backtrace]   2: /lib64/libpthread.so.0 
[0x322560ebe0]
[b2c2.local:mpi_rank_7][print_backtrace]   3: ./vector [0x43b678]
[b2c2.local:mpi_rank_7][print_backtrace]   4: ./vector(free+0xcb) [0x43e10b]
[b2c2.local:mpi_rank_7][print_backtrace]   5: ./vector [0x483b42]
[b2c2.local:mpi_rank_7][print_backtrace]   6: ./vector [0x410634]
[b2c2.local:mpi_rank_7][print_backtrace]   7: ./vector [0x49d41f]
[b2c2.local:mpi_rank_7][print_backtrace]   8: ./vector [0x457e69]
[b2c2.local:mpi_rank_7][print_backtrace]   9: ./vector [0x409659]
[b2c2.local:mpi_rank_7][print_backtrace]  10: ./vector [0x405274]
[b2c2.local:mpi_rank_7][print_backtrace]  11: 
/lib64/libc.so.6(__libc_start_main+0xf4) [0x3224e1d994]
[b2c2.local:mpi_rank_7][print_backtrace]  12: ./vector [0x404e29]
[b2c2.local:mpi_rank_1][error_sighandler] Caught error: Segmentation 
fault (signal 11)
[b2c2.local:mpi_rank_1][print_backtrace]   0: ./vector [0x49cc8a]
[b2c2.local:mpi_rank_1][print_backtrace]   1: ./vector [0x49cd74]
[b2c2.local:mpi_rank_1][print_backtrace]   2: /lib64/libpthread.so.0 
[0x322560ebe0]
[b2c2.local:mpi_rank_1][print_backtrace]   3: ./vector [0x43b678]
[b2c2.local:mpi_rank_1][print_backtrace]   4: ./vector(free+0xcb) [0x43e10b]
[b2c2.local:mpi_rank_1][print_backtrace]   5: ./vector [0x483b42]
[b2c2.local:mpi_rank_1][print_backtrace]   6: ./vector [0x410634]
[b2c2.local:mpi_rank_1][print_backtrace]   7: ./vector [0x49d41f]
[b2c2.local:mpi_rank_1][print_backtrace]   8: ./vector [0x457e69]
[b2c2.local:mpi_rank_1][print_backtrace]   9: ./vector [0x409659]
[b2c2.local:mpi_rank_1][print_backtrace]  10: ./vector [0x405274]
[b2c2.local:mpi_rank_1][print_backtrace]  11: 
/lib64/libc.so.6(__libc_start_main+0xf4) [0x3224e1d994]
[b2c2.local:mpi_rank_1][print_backtrace]  12: ./vector [0x404e29]
[b2c2.local:mpispawn_1][readline] Unexpected End-Of-File on file 
descriptor 11. MPI process died?
[b2c2.local:mpispawn_1][mtpmi_processops] Error while reading PMI 
socket. MPI process died?
[b2c2.local:mpispawn_1][child_handler] MPI process (rank: 7, pid: 506) 
terminated with signal 11 -> abort job
[b2c1.local:mpi_rank_6][error_sighandler] Caught error: Segmentation 
fault (signal 11)
[b2c1.local:mpi_rank_2][error_sighandler] Caught error: Segmentation 
fault (signal 11)
[b2c1.local:mpi_rank_6][print_backtrace]   0: ./vector [0x49cc8a]
[b2c1.local:mpi_rank_6][print_backtrace]   1: ./vector [0x49cd74]
[b2c1.local:mpi_rank_6][print_backtrace]   2: /lib64/libpthread.so.0 
[0x3b5da0ebe0]
[b2c1.local:mpi_rank_6][print_backtrace]   3: ./vector [0x43b4e7]
[b2c1.local:mpi_rank_6][print_backtrace]   4: ./vector(free+0xcb) [0x43e10b]
[b2c1.local:mpi_rank_6][print_backtrace]   5: ./vector [0x483b42]
[b2c1.local:mpi_rank_6][print_backtrace]   6: ./vector [0x410634]
[b2c1.local:mpi_rank_6][print_backtrace]   7: ./vector [0x49d41f]
[b2c1.local:mpi_rank_6][print_backtrace]   8: ./vector [0x457e69]
[b2c1.local:mpi_rank_6][print_backtrace]   9: ./vector [0x409659]
[b2c1.local:mpi_rank_6][print_backtrace]  10: ./vector [0x405274]
[b2c1.local:mpi_rank_6][print_backtrace]  11: 
/lib64/libc.so.6(__libc_start_main+0xf4) [0x3b5d21d994]
[b2c1.local:mpi_rank_6][print_backtrace]  12: ./vector [0x404e29]
[b2c1.local:mpi_rank_2][print_backtrace]   0: ./vector [0x49cc8a]
[b2c1.local:mpi_rank_2][print_backtrace]   1: ./vector [0x49cd74]
[b2c1.local:mpi_rank_2][print_backtrace]   2: /lib64/libpthread.so.0 
[0x3b5da0ebe0]
[b2c1.local:mpi_rank_2][print_backtrace]   3: ./vector [0x43b4e7]
[b2c1.local:mpi_rank_2][print_backtrace]   4: ./vector(free+0xcb) [0x43e10b]
[b2c1.local:mpi_rank_2][print_backtrace]   5: ./vector [0x483b42]
[b2c1.local:mpi_rank_2][print_backtrace]   6: ./vector [0x410634]
[b2c1.local:mpi_rank_2][print_backtrace]   7: ./vector [0x49d41f]
[b2c1.local:mpi_rank_2][print_backtrace]   8: ./vector [0x457e69]
[b2c1.local:mpi_rank_2][print_backtrace]   9: ./vector [0x409659]
[b2c1.local:mpi_rank_2][print_backtrace]  10: ./vector [0x405274]
[b2c1.local:mpi_rank_2][print_backtrace]  11: 
/lib64/libc.so.6(__libc_start_main+0xf4) [0x3b5d21d994]
[b2c1.local:mpi_rank_2][print_backtrace]  12: ./vector [0x404e29]
[b2c1.local:mpispawn_0][readline] Unexpected End-Of-File on file 
descriptor 8. MPI process died?
[b2c1.local:mpispawn_0][mtpmi_processops] Error while reading PMI 
socket. MPI process died?
[b2c1.local:mpispawn_0][child_handler] MPI process (rank: 2, pid: 1509) 
terminated with signal 11 -> abort job
[b2c1.local:mpi_rank_4][error_sighandler] Caught error: Segmentation 
fault (signal 11)
[b2c1.local:mpi_rank_4][print_backtrace]   0: ./vector [0x49cc8a]
[b2c1.local:mpi_rank_4][print_backtrace]   1: ./vector [0x49cd74]
[b2c1.local:mpi_rank_4][print_backtrace]   2: /lib64/libpthread.so.0 
[0x3b5da0ebe0]
[b2c1.local:mpi_rank_4][print_backtrace]   3: ./vector [0x43b678]
[b2c1.local:mpi_rank_4][print_backtrace]   4: ./vector(free+0xcb) [0x43e10b]
[b2c1.local:mpi_rank_4][print_backtrace]   5: ./vector [0x483b42]
[b2c1.local:mpi_rank_4][print_backtrace]   6: ./vector [0x410634]
[b2c1.local:mpi_rank_4][print_backtrace]   7: ./vector [0x49d41f]
[b2c1.local:mpi_rank_4][print_backtrace]   8: ./vector [0x457e69]
[b2c1.local:mpi_rank_4][print_backtrace]   9: ./vector [0x409659]
[b2c1.local:mpi_rank_4][print_backtrace]  10: ./vector [0x405274]
[b2c1.local:mpi_rank_4][print_backtrace]  11: 
/lib64/libc.so.6(__libc_start_main+0xf4) [0x3b5d21d994]
[b2c1.local:mpi_rank_4][print_backtrace]  12: ./vector [0x404e29]
[b2c1.local:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from 
node 20.0.3.254 aborted: Error while reading a PMI socket (4)
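
If it helps, the raw addresses in these backtraces could be mapped back
to function names with addr2line from binutils (this assumes the
./vector binary was built with debug symbols; the addresses below are
just sample frames taken from the traces above):

   addr2line -f -e ./vector 0x43b678 0x43e10b 0x483b42   # prints the function and file:line for each frame, or ?? if no symbols are available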

I am using an InfiniBand interconnect for interprocess communication;
20.0.3.254 is the IP of the ib0 interface on b2c1 and 20.0.3.253 is the
IP of the ib0 interface on b2c2. We also have a Gigabit Ethernet
interface, which gives the same results.
Following suggestions from earlier mailing list discussions, I have
also tried the MV2_IBA_HCA=mlx4_0, MV2_USE_RDMAOE=1 and
MV2_DEFAULT_PORT=1 parameters, but they do not help either. Any help on
this matter would be appreciated.
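
For reference, the command sequence looks roughly like the one below;
the hostfile, the <pid> passed to cr_checkpoint, the checkpoint context
file and the ./vector binary are placeholders for the actual job:

   # launch an 8-process job across b2c1 and b2c2 with backtraces enabled
   /share/apps/mvapich2-1.8.1/bin/mpirun_rsh -np 8 -hostfile ./hosts MV2_DEBUG_SHOW_BACKTRACE=1 ./vector

   # from another shell, checkpoint the job and restart it later
   cr_checkpoint -p <pid>        # <pid> as described in the steps above
   cr_restart context.<pid>      # BLCR writes context.<pid> in the current directory by default

   # variant suggested on the list, passing the HCA/RoCE parameters explicitly
   /share/apps/mvapich2-1.8.1/bin/mpirun_rsh -np 8 -hostfile ./hosts MV2_IBA_HCA=mlx4_0 MV2_USE_RDMAOE=1 MV2_DEFAULT_PORT=1 ./vector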

thanks and regards,
Suja
