[mvapich-discuss] problem on mvapich2-1.8.1 checkpoint/restart with BLCR

Raghunath rajachan at cse.ohio-state.edu
Fri Feb 15 02:36:34 EST 2013


Suja,

I tried a simple test case with the same configuration options that
you used, and I am unable to reproduce this error. Can you send me the
exact command that you used to launch your MPI job, along with the
environment variables that you set? Can you also try it with the
latest version of MVAPICH2
(http://mvapich.cse.ohio-state.edu/download/mvapich2/mvapich2-1.9a2.tgz)
to see if the problem persists?
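For reference, I am assuming a launch along these lines, where the
process count, host file, and checkpoint path are placeholders for
whatever you actually use:

    mpirun_rsh -np 8 -hostfile ./hosts \
        MV2_CKPT_FILE=/path/to/ckpt ./a.out

Seeing your exact invocation will help us reproduce the failure here.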
--
Raghu


On Fri, Feb 15, 2013 at 12:40 AM, Suja Ramachandran
<sujaram at igcar.gov.in> wrote:
> Dear All,
>
> I am trying to implement checkpoint/restart in mvapich2-1.8.1 with
> BLCR-0.8.4 on an 8-node cluster. mvapich2-1.8.1 is configured with
> ./configure --enable-ckpt --with-blcr=/usr/local --enable-g=all
> --enable-error-messages=all --enable-shared
> --prefix=/share/apps/mvapich2-1.8.1/ --disable-rdma-cm
> where /share/apps is shared via NFS. Now,
> 1. MVAPICH2 jobs run fine across multiple nodes.
> 2. On a single node, an MVAPICH2 job can be successfully checkpointed
> and restarted using cr_checkpoint -p <pid> and cr_restart.
> 3. But for MPI jobs running across multiple nodes (b2c1 and b2c2 in
> this case), once a job is checkpointed with cr_checkpoint -p <pid>,
> the checkpoint files are created without any errors and the job runs
> to completion. However, once the job is over, it gives errors as
> follows (the rough launch/checkpoint sequence I use is sketched after
> this error log):
>
> [b2c1.local:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [b2c1.local:mpi_rank_2][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [b2c1.local:mpispawn_0][child_handler] MPI process (rank: 0, pid: 873)
> terminated with signal 11 -> abort job
> [b2c1.local:mpispawn_0][readline] Unexpected End-Of-File on file descriptor
> 8. MPI process died?
> [b2c1.local:mpispawn_0][mtpmi_processops] Error while reading PMI socket.
> MPI process died?
> [b2c1.local:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node
> 20.0.3.253 aborted: MPI process error (1)
> [b2c2.local:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor
> 8. MPI process died?
> [b2c2.local:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor
> 8. MPI process died?
> [b2c2.local:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI
> process died?
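>
> For clarity, the sequence I follow is roughly the following (the
> process count, host file, and file names are placeholders, not my
> exact values):
>
>     mpirun_rsh -np 8 -hostfile ./hosts ./vector
>     cr_checkpoint -p <pid>         # issued from another shell
>     cr_restart <checkpoint file>   # restart from the saved context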
>
> When I tried to get a backtrace using MV2_DEBUG_SHOW_BACKTRACE=1, the
> result is as follows:
>
> [b2c2.local:mpi_rank_7][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [b2c2.local:mpi_rank_7][print_backtrace]   0: ./vector [0x49cc8a]
> [b2c2.local:mpi_rank_7][print_backtrace]   1: ./vector [0x49cd74]
> [b2c2.local:mpi_rank_7][print_backtrace]   2: /lib64/libpthread.so.0
> [0x322560ebe0]
> [b2c2.local:mpi_rank_7][print_backtrace]   3: ./vector [0x43b678]
> [b2c2.local:mpi_rank_7][print_backtrace]   4: ./vector(free+0xcb) [0x43e10b]
> [b2c2.local:mpi_rank_7][print_backtrace]   5: ./vector [0x483b42]
> [b2c2.local:mpi_rank_7][print_backtrace]   6: ./vector [0x410634]
> [b2c2.local:mpi_rank_7][print_backtrace]   7: ./vector [0x49d41f]
> [b2c2.local:mpi_rank_7][print_backtrace]   8: ./vector [0x457e69]
> [b2c2.local:mpi_rank_7][print_backtrace]   9: ./vector [0x409659]
> [b2c2.local:mpi_rank_7][print_backtrace]  10: ./vector [0x405274]
> [b2c2.local:mpi_rank_7][print_backtrace]  11:
> /lib64/libc.so.6(__libc_start_main+0xf4) [0x3224e1d994]
> [b2c2.local:mpi_rank_7][print_backtrace]  12: ./vector [0x404e29]
> [b2c2.local:mpi_rank_1][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [b2c2.local:mpi_rank_1][print_backtrace]   0: ./vector [0x49cc8a]
> [b2c2.local:mpi_rank_1][print_backtrace]   1: ./vector [0x49cd74]
> [b2c2.local:mpi_rank_1][print_backtrace]   2: /lib64/libpthread.so.0
> [0x322560ebe0]
> [b2c2.local:mpi_rank_1][print_backtrace]   3: ./vector [0x43b678]
> [b2c2.local:mpi_rank_1][print_backtrace]   4: ./vector(free+0xcb) [0x43e10b]
> [b2c2.local:mpi_rank_1][print_backtrace]   5: ./vector [0x483b42]
> [b2c2.local:mpi_rank_1][print_backtrace]   6: ./vector [0x410634]
> [b2c2.local:mpi_rank_1][print_backtrace]   7: ./vector [0x49d41f]
> [b2c2.local:mpi_rank_1][print_backtrace]   8: ./vector [0x457e69]
> [b2c2.local:mpi_rank_1][print_backtrace]   9: ./vector [0x409659]
> [b2c2.local:mpi_rank_1][print_backtrace]  10: ./vector [0x405274]
> [b2c2.local:mpi_rank_1][print_backtrace]  11:
> /lib64/libc.so.6(__libc_start_main+0xf4) [0x3224e1d994]
> [b2c2.local:mpi_rank_1][print_backtrace]  12: ./vector [0x404e29]
> [b2c2.local:mpispawn_1][readline] Unexpected End-Of-File on file descriptor
> 11. MPI process died?
> [b2c2.local:mpispawn_1][mtpmi_processops] Error while reading PMI socket.
> MPI process died?
> [b2c2.local:mpispawn_1][child_handler] MPI process (rank: 7, pid: 506)
> terminated with signal 11 -> abort job
> [b2c1.local:mpi_rank_6][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [b2c1.local:mpi_rank_2][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [b2c1.local:mpi_rank_6][print_backtrace]   0: ./vector [0x49cc8a]
> [b2c1.local:mpi_rank_6][print_backtrace]   1: ./vector [0x49cd74]
> [b2c1.local:mpi_rank_6][print_backtrace]   2: /lib64/libpthread.so.0
> [0x3b5da0ebe0]
> [b2c1.local:mpi_rank_6][print_backtrace]   3: ./vector [0x43b4e7]
> [b2c1.local:mpi_rank_6][print_backtrace]   4: ./vector(free+0xcb) [0x43e10b]
> [b2c1.local:mpi_rank_6][print_backtrace]   5: ./vector [0x483b42]
> [b2c1.local:mpi_rank_6][print_backtrace]   6: ./vector [0x410634]
> [b2c1.local:mpi_rank_6][print_backtrace]   7: ./vector [0x49d41f]
> [b2c1.local:mpi_rank_6][print_backtrace]   8: ./vector [0x457e69]
> [b2c1.local:mpi_rank_6][print_backtrace]   9: ./vector [0x409659]
> [b2c1.local:mpi_rank_6][print_backtrace]  10: ./vector [0x405274]
> [b2c1.local:mpi_rank_6][print_backtrace]  11:
> /lib64/libc.so.6(__libc_start_main+0xf4) [0x3b5d21d994]
> [b2c1.local:mpi_rank_6][print_backtrace]  12: ./vector [0x404e29]
> [b2c1.local:mpi_rank_2][print_backtrace]   0: ./vector [0x49cc8a]
> [b2c1.local:mpi_rank_2][print_backtrace]   1: ./vector [0x49cd74]
> [b2c1.local:mpi_rank_2][print_backtrace]   2: /lib64/libpthread.so.0
> [0x3b5da0ebe0]
> [b2c1.local:mpi_rank_2][print_backtrace]   3: ./vector [0x43b4e7]
> [b2c1.local:mpi_rank_2][print_backtrace]   4: ./vector(free+0xcb) [0x43e10b]
> [b2c1.local:mpi_rank_2][print_backtrace]   5: ./vector [0x483b42]
> [b2c1.local:mpi_rank_2][print_backtrace]   6: ./vector [0x410634]
> [b2c1.local:mpi_rank_2][print_backtrace]   7: ./vector [0x49d41f]
> [b2c1.local:mpi_rank_2][print_backtrace]   8: ./vector [0x457e69]
> [b2c1.local:mpi_rank_2][print_backtrace]   9: ./vector [0x409659]
> [b2c1.local:mpi_rank_2][print_backtrace]  10: ./vector [0x405274]
> [b2c1.local:mpi_rank_2][print_backtrace]  11:
> /lib64/libc.so.6(__libc_start_main+0xf4) [0x3b5d21d994]
> [b2c1.local:mpi_rank_2][print_backtrace]  12: ./vector [0x404e29]
> [b2c1.local:mpispawn_0][readline] Unexpected End-Of-File on file descriptor
> 8. MPI process died?
> [b2c1.local:mpispawn_0][mtpmi_processops] Error while reading PMI socket.
> MPI process died?
> [b2c1.local:mpispawn_0][child_handler] MPI process (rank: 2, pid: 1509)
> terminated with signal 11 -> abort job
> [b2c1.local:mpi_rank_4][error_sighandler] Caught error: Segmentation fault
> (signal 11)
> [b2c1.local:mpi_rank_4][print_backtrace]   0: ./vector [0x49cc8a]
> [b2c1.local:mpi_rank_4][print_backtrace]   1: ./vector [0x49cd74]
> [b2c1.local:mpi_rank_4][print_backtrace]   2: /lib64/libpthread.so.0
> [0x3b5da0ebe0]
> [b2c1.local:mpi_rank_4][print_backtrace]   3: ./vector [0x43b678]
> [b2c1.local:mpi_rank_4][print_backtrace]   4: ./vector(free+0xcb) [0x43e10b]
> [b2c1.local:mpi_rank_4][print_backtrace]   5: ./vector [0x483b42]
> [b2c1.local:mpi_rank_4][print_backtrace]   6: ./vector [0x410634]
> [b2c1.local:mpi_rank_4][print_backtrace]   7: ./vector [0x49d41f]
> [b2c1.local:mpi_rank_4][print_backtrace]   8: ./vector [0x457e69]
> [b2c1.local:mpi_rank_4][print_backtrace]   9: ./vector [0x409659]
> [b2c1.local:mpi_rank_4][print_backtrace]  10: ./vector [0x405274]
> [b2c1.local:mpi_rank_4][print_backtrace]  11:
> /lib64/libc.so.6(__libc_start_main+0xf4) [0x3b5d21d994]
> [b2c1.local:mpi_rank_4][print_backtrace]  12: ./vector [0x404e29]
> [b2c1.local:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from node
> 20.0.3.254 aborted: Error while reading a PMI socket (4)
>
> I am using an InfiniBand interconnect for interprocess communication.
> 20.0.3.254 is the IP of the ib0 interface on b2c1 and 20.0.3.253 is
> the IP of the ib0 interface on b2c2. We have a Gigabit Ethernet
> interface too, which gives the same results.
> As seen in some of the mailing list discussions, I have also tried the
> MV2_IBA_HCA=mlx4_0, MV2_USE_RDMAOE=1, and MV2_DEFAULT_PORT=1
> parameters (see the sketch below), which do not help either. Any help
> on this matter will be appreciated.
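>
> The variables are passed on the mpirun_rsh command line, roughly as
> follows (again, the process count and host file are placeholders):
>
>     mpirun_rsh -np 8 -hostfile ./hosts MV2_IBA_HCA=mlx4_0 \
>         MV2_USE_RDMAOE=1 MV2_DEFAULT_PORT=1 ./vector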
>
> thanks and regards,
> Suja
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

