[mvapich-discuss] problem on mvapich2-1.8.1 checkpoint/restart with BLCR
Suja Ramachandran
sujaram at igcar.gov.in
Fri Feb 15 00:40:35 EST 2013
Dear All,
I am trying to implement checkpoint/restart in mvapich2-1.8.1 with
BLCR-0.8.4 on an 8-node cluster. mvapich2-1.8.1 is configured with
./configure --enable-ckpt --with-blcr=/usr/local --enable-g=all
--enable-error-messages=all --enable-shared
--prefix=/share/apps/mvapich2-1.8.1/ --disable-rdma-cm
where /share/apps is shared via NFS. Now:
1. MVAPICH2 jobs run fine across multiple nodes.
2. On a single node, an MPI job can be successfully checkpointed
and restarted using cr_checkpoint -p <pid> and cr_restart.
3. For MPI jobs running across multiple nodes (b2c1 and b2c2 in this
case), once a job is checkpointed with cr_checkpoint -p <pid>, the
checkpoint files are created without any errors and the job runs to
completion. But once the job finishes, it gives errors as follows:
[b2c1.local:mpi_rank_0][error_sighandler] Caught error: Segmentation
fault (signal 11)
[b2c1.local:mpi_rank_2][error_sighandler] Caught error: Segmentation
fault (signal 11)
[b2c1.local:mpispawn_0][child_handler] MPI process (rank: 0, pid: 873)
terminated with signal 11 -> abort job
[b2c1.local:mpispawn_0][readline] Unexpected End-Of-File on file
descriptor 8. MPI process died?
[b2c1.local:mpispawn_0][mtpmi_processops] Error while reading PMI
socket. MPI process died?
[b2c1.local:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from
node 20.0.3.253 aborted: MPI process error (1)
[b2c2.local:mpispawn_1][read_size] Unexpected End-Of-File on file
descriptor 8. MPI process died?
[b2c2.local:mpispawn_1][read_size] Unexpected End-Of-File on file
descriptor 8. MPI process died?
[b2c2.local:mpispawn_1][handle_mt_peer] Error while reading PMI socket.
MPI process died?
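For reference, the sequence I follow to checkpoint a multi-node job is roughly the one below. This is a sketch of my workflow, not exact commands: the hostfile name, the application name (./vector), the PID placeholder, and the checkpoint file naming are illustrative; MV2_CKPT_FILE is the MVAPICH2 variable that sets the checkpoint file prefix.

```shell
# Launch the MPI job across both nodes with mpirun_rsh;
# ./hosts lists b2c1 and b2c2, and MV2_CKPT_FILE sets where
# checkpoint files are written (must be visible on restart).
MV2_CKPT_FILE=/tmp/ckpt mpirun_rsh -np 8 -hostfile ./hosts ./vector &
echo "mpirun_rsh pid: $!"

# Checkpoint the whole job by checkpointing the mpirun_rsh process
# (BLCR propagates the request to all ranks through MVAPICH2).
cr_checkpoint -p <pid-of-mpirun_rsh>

# Later, restart the job from the saved context file
# (context file name here is illustrative).
cr_restart <context-file-written-by-cr_checkpoint>
```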
When I tried to get a backtrace using MV2_DEBUG_SHOW_BACKTRACE=1, the
output is as follows:
[b2c2.local:mpi_rank_7][error_sighandler] Caught error: Segmentation
fault (signal 11)
[b2c2.local:mpi_rank_7][print_backtrace] 0: ./vector [0x49cc8a]
[b2c2.local:mpi_rank_7][print_backtrace] 1: ./vector [0x49cd74]
[b2c2.local:mpi_rank_7][print_backtrace] 2: /lib64/libpthread.so.0
[0x322560ebe0]
[b2c2.local:mpi_rank_7][print_backtrace] 3: ./vector [0x43b678]
[b2c2.local:mpi_rank_7][print_backtrace] 4: ./vector(free+0xcb) [0x43e10b]
[b2c2.local:mpi_rank_7][print_backtrace] 5: ./vector [0x483b42]
[b2c2.local:mpi_rank_7][print_backtrace] 6: ./vector [0x410634]
[b2c2.local:mpi_rank_7][print_backtrace] 7: ./vector [0x49d41f]
[b2c2.local:mpi_rank_7][print_backtrace] 8: ./vector [0x457e69]
[b2c2.local:mpi_rank_7][print_backtrace] 9: ./vector [0x409659]
[b2c2.local:mpi_rank_7][print_backtrace] 10: ./vector [0x405274]
[b2c2.local:mpi_rank_7][print_backtrace] 11:
/lib64/libc.so.6(__libc_start_main+0xf4) [0x3224e1d994]
[b2c2.local:mpi_rank_7][print_backtrace] 12: ./vector [0x404e29]
[b2c2.local:mpi_rank_1][error_sighandler] Caught error: Segmentation
fault (signal 11)
[b2c2.local:mpi_rank_1][print_backtrace] 0: ./vector [0x49cc8a]
[b2c2.local:mpi_rank_1][print_backtrace] 1: ./vector [0x49cd74]
[b2c2.local:mpi_rank_1][print_backtrace] 2: /lib64/libpthread.so.0
[0x322560ebe0]
[b2c2.local:mpi_rank_1][print_backtrace] 3: ./vector [0x43b678]
[b2c2.local:mpi_rank_1][print_backtrace] 4: ./vector(free+0xcb) [0x43e10b]
[b2c2.local:mpi_rank_1][print_backtrace] 5: ./vector [0x483b42]
[b2c2.local:mpi_rank_1][print_backtrace] 6: ./vector [0x410634]
[b2c2.local:mpi_rank_1][print_backtrace] 7: ./vector [0x49d41f]
[b2c2.local:mpi_rank_1][print_backtrace] 8: ./vector [0x457e69]
[b2c2.local:mpi_rank_1][print_backtrace] 9: ./vector [0x409659]
[b2c2.local:mpi_rank_1][print_backtrace] 10: ./vector [0x405274]
[b2c2.local:mpi_rank_1][print_backtrace] 11:
/lib64/libc.so.6(__libc_start_main+0xf4) [0x3224e1d994]
[b2c2.local:mpi_rank_1][print_backtrace] 12: ./vector [0x404e29]
[b2c2.local:mpispawn_1][readline] Unexpected End-Of-File on file
descriptor 11. MPI process died?
[b2c2.local:mpispawn_1][mtpmi_processops] Error while reading PMI
socket. MPI process died?
[b2c2.local:mpispawn_1][child_handler] MPI process (rank: 7, pid: 506)
terminated with signal 11 -> abort job
[b2c1.local:mpi_rank_6][error_sighandler] Caught error: Segmentation
fault (signal 11)
[b2c1.local:mpi_rank_2][error_sighandler] Caught error: Segmentation
fault (signal 11)
[b2c1.local:mpi_rank_6][print_backtrace] 0: ./vector [0x49cc8a]
[b2c1.local:mpi_rank_6][print_backtrace] 1: ./vector [0x49cd74]
[b2c1.local:mpi_rank_6][print_backtrace] 2: /lib64/libpthread.so.0
[0x3b5da0ebe0]
[b2c1.local:mpi_rank_6][print_backtrace] 3: ./vector [0x43b4e7]
[b2c1.local:mpi_rank_6][print_backtrace] 4: ./vector(free+0xcb) [0x43e10b]
[b2c1.local:mpi_rank_6][print_backtrace] 5: ./vector [0x483b42]
[b2c1.local:mpi_rank_6][print_backtrace] 6: ./vector [0x410634]
[b2c1.local:mpi_rank_6][print_backtrace] 7: ./vector [0x49d41f]
[b2c1.local:mpi_rank_6][print_backtrace] 8: ./vector [0x457e69]
[b2c1.local:mpi_rank_6][print_backtrace] 9: ./vector [0x409659]
[b2c1.local:mpi_rank_6][print_backtrace] 10: ./vector [0x405274]
[b2c1.local:mpi_rank_6][print_backtrace] 11:
/lib64/libc.so.6(__libc_start_main+0xf4) [0x3b5d21d994]
[b2c1.local:mpi_rank_6][print_backtrace] 12: ./vector [0x404e29]
[b2c1.local:mpi_rank_2][print_backtrace] 0: ./vector [0x49cc8a]
[b2c1.local:mpi_rank_2][print_backtrace] 1: ./vector [0x49cd74]
[b2c1.local:mpi_rank_2][print_backtrace] 2: /lib64/libpthread.so.0
[0x3b5da0ebe0]
[b2c1.local:mpi_rank_2][print_backtrace] 3: ./vector [0x43b4e7]
[b2c1.local:mpi_rank_2][print_backtrace] 4: ./vector(free+0xcb) [0x43e10b]
[b2c1.local:mpi_rank_2][print_backtrace] 5: ./vector [0x483b42]
[b2c1.local:mpi_rank_2][print_backtrace] 6: ./vector [0x410634]
[b2c1.local:mpi_rank_2][print_backtrace] 7: ./vector [0x49d41f]
[b2c1.local:mpi_rank_2][print_backtrace] 8: ./vector [0x457e69]
[b2c1.local:mpi_rank_2][print_backtrace] 9: ./vector [0x409659]
[b2c1.local:mpi_rank_2][print_backtrace] 10: ./vector [0x405274]
[b2c1.local:mpi_rank_2][print_backtrace] 11:
/lib64/libc.so.6(__libc_start_main+0xf4) [0x3b5d21d994]
[b2c1.local:mpi_rank_2][print_backtrace] 12: ./vector [0x404e29]
[b2c1.local:mpispawn_0][readline] Unexpected End-Of-File on file
descriptor 8. MPI process died?
[b2c1.local:mpispawn_0][mtpmi_processops] Error while reading PMI
socket. MPI process died?
[b2c1.local:mpispawn_0][child_handler] MPI process (rank: 2, pid: 1509)
terminated with signal 11 -> abort job
[b2c1.local:mpi_rank_4][error_sighandler] Caught error: Segmentation
fault (signal 11)
[b2c1.local:mpi_rank_4][print_backtrace] 0: ./vector [0x49cc8a]
[b2c1.local:mpi_rank_4][print_backtrace] 1: ./vector [0x49cd74]
[b2c1.local:mpi_rank_4][print_backtrace] 2: /lib64/libpthread.so.0
[0x3b5da0ebe0]
[b2c1.local:mpi_rank_4][print_backtrace] 3: ./vector [0x43b678]
[b2c1.local:mpi_rank_4][print_backtrace] 4: ./vector(free+0xcb) [0x43e10b]
[b2c1.local:mpi_rank_4][print_backtrace] 5: ./vector [0x483b42]
[b2c1.local:mpi_rank_4][print_backtrace] 6: ./vector [0x410634]
[b2c1.local:mpi_rank_4][print_backtrace] 7: ./vector [0x49d41f]
[b2c1.local:mpi_rank_4][print_backtrace] 8: ./vector [0x457e69]
[b2c1.local:mpi_rank_4][print_backtrace] 9: ./vector [0x409659]
[b2c1.local:mpi_rank_4][print_backtrace] 10: ./vector [0x405274]
[b2c1.local:mpi_rank_4][print_backtrace] 11:
/lib64/libc.so.6(__libc_start_main+0xf4) [0x3b5d21d994]
[b2c1.local:mpi_rank_4][print_backtrace] 12: ./vector [0x404e29]
[b2c1.local:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from
node 20.0.3.254 aborted: Error while reading a PMI socket (4)
I am using an InfiniBand interconnect for interprocess communication.
20.0.3.254 is the IP of the ib0 interface on b2c1 and 20.0.3.253 is the
IP of the ib0 interface on b2c2. We have a Gigabit Ethernet interface
too, which gives the same results.
As suggested in some mailing list discussions, I have tried setting
MV2_IBA_HCA=mlx4_0 MV2_USE_RDMAOE=1 MV2_DEFAULT_PORT=1, which also did
not help. Any help on this matter will be appreciated.
thanks and regards,
Suja