[mvapich-discuss] problem on mvapich2-1.8.1 checkpoint/restart with BLCR

Suja Ramachandran sujaram at igcar.gov.in
Fri Feb 15 03:00:49 EST 2013


Hi,

Thanks for the reply. I used the following command:

  /share/apps/mvapich2-1.8.1/bin/mpirun_rsh -np 8 -hostfile ./hostfile \
      MV2_IBA_HCA=mlx4_0 MV2_USE_RDMAOE=1 MV2_DEFAULT_PORT=1 \
      MV2_CKPT_FILE=~rpmaps/checkpoint/scripts/mvapichckpt \
      MV2_DEBUG_SHOW_BACKTRACE=1 ./vector

I have also tried running it without the options MV2_IBA_HCA=mlx4_0,
MV2_USE_RDMAOE=1, and MV2_DEFAULT_PORT=1.

The environment variables I have set are:
export PATH=/share/apps/mvapich2-1.8.1/bin/:$PATH
export LD_LIBRARY_PATH=/share/apps/mvapich2-1.8.1/lib:/usr/local/lib:$LD_LIBRARY_PATH

I have not yet installed the Fault Tolerance Backplane (FTB). Is it
mandatory for checkpoint/restart?

I will also try with mvapich2-1.9.

(FYI, 'vector' is the executable of the vector-addition program given
here: http://www.cs.umanitoba.ca/~comp4510/examplesDIR/vsum.c )
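
For reference, it is just a standard MPI vector-sum example. A minimal
sketch of that kind of program (illustrative only, not the exact source
behind the link above) would look like this:

/* Illustrative MPI vector-addition sketch: rank 0 scatters two input
 * vectors, every rank adds its own chunk, and rank 0 gathers the result. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define N 1024   /* total vector length; assumed divisible by the number of ranks */

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;
    double *a = NULL, *b = NULL, *c = NULL;
    double *la = malloc(chunk * sizeof(double));
    double *lb = malloc(chunk * sizeof(double));
    double *lc = malloc(chunk * sizeof(double));

    if (rank == 0) {   /* rank 0 owns the full vectors */
        a = malloc(N * sizeof(double));
        b = malloc(N * sizeof(double));
        c = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }
    }

    /* distribute the inputs, add the local pieces, collect the sum */
    MPI_Scatter(a, chunk, MPI_DOUBLE, la, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(b, chunk, MPI_DOUBLE, lb, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    for (int i = 0; i < chunk; i++)
        lc[i] = la[i] + lb[i];
    MPI_Gather(lc, chunk, MPI_DOUBLE, c, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("c[0]=%g, c[%d]=%g\n", c[0], N - 1, c[N - 1]);

    free(la); free(lb); free(lc);
    if (rank == 0) { free(a); free(b); free(c); }
    MPI_Finalize();
    return 0;
}

The program has no checkpoint-related code of its own; all checkpointing
is done externally through BLCR and mvapich2.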

thanks and regards,
suja

On Friday 15 February 2013 01:06 PM, Raghunath wrote:
> Suja,
>
> I tried a simple test case with the same configuration options that
> you used, and I am unable to reproduce this error. Can you send me the
> exact command that you used to launch your MPI job, along with the
> environment variables that you set? Can you also try it with the
> latest version of MVAPICH2
> (http://mvapich.cse.ohio-state.edu/download/mvapich2/mvapich2-1.9a2.tgz)
> to see if the problem persists?
> --
> Raghu
>
>
> On Fri, Feb 15, 2013 at 12:40 AM, Suja Ramachandran
> <sujaram at igcar.gov.in> wrote:
>> Dear All,
>>
>> I am trying to implement checkpoint/restart in mvapich2-1.8.1 with
>> BLCR-0.8.4 on an 8-node cluster. mvapich2-1.8.1 is configured with
>> ./configure --enable-ckpt --with-blcr=/usr/local --enable-g=all
>> --enable-error-messages=all --enable-shared
>> --prefix=/share/apps/mvapich2-1.8.1/ --disable-rdma-cm
>> where /share/apps is shared via NFS. Now:
>> 1. MVAPICH2 jobs run fine across multiple nodes.
>> 2. On a single node, an MVAPICH2 job can be successfully checkpointed and
>> restarted using cr_checkpoint -p <pid> and cr_restart.
>> 3. But for MPI jobs running across multiple nodes (b2c1 and b2c2 in this
>> case), once a job is checkpointed with the cr_checkpoint -p <pid> command,
>> the checkpoint files are created without any errors and the job runs to
>> completion. However, once the job is over, it gives errors as follows:
>>
>> [b2c1.local:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> [b2c1.local:mpi_rank_2][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> [b2c1.local:mpispawn_0][child_handler] MPI process (rank: 0, pid: 873)
>> terminated with signal 11 -> abort job
>> [b2c1.local:mpispawn_0][readline] Unexpected End-Of-File on file descriptor
>> 8. MPI process died?
>> [b2c1.local:mpispawn_0][mtpmi_processops] Error while reading PMI socket.
>> MPI process died?
>> [b2c1.local:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node
>> 20.0.3.253 aborted: MPI process error (1)
>> [b2c2.local:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor
>> 8. MPI process died?
>> [b2c2.local:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor
>> 8. MPI process died?
>> [b2c2.local:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI
>> process died?
>>
>> When I tried to get a backtrace using MV2_DEBUG_SHOW_BACKTRACE=1, the result
>> is as follows:
>>
>> [b2c2.local:mpi_rank_7][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> [b2c2.local:mpi_rank_7][print_backtrace]   0: ./vector [0x49cc8a]
>> [b2c2.local:mpi_rank_7][print_backtrace]   1: ./vector [0x49cd74]
>> [b2c2.local:mpi_rank_7][print_backtrace]   2: /lib64/libpthread.so.0
>> [0x322560ebe0]
>> [b2c2.local:mpi_rank_7][print_backtrace]   3: ./vector [0x43b678]
>> [b2c2.local:mpi_rank_7][print_backtrace]   4: ./vector(free+0xcb) [0x43e10b]
>> [b2c2.local:mpi_rank_7][print_backtrace]   5: ./vector [0x483b42]
>> [b2c2.local:mpi_rank_7][print_backtrace]   6: ./vector [0x410634]
>> [b2c2.local:mpi_rank_7][print_backtrace]   7: ./vector [0x49d41f]
>> [b2c2.local:mpi_rank_7][print_backtrace]   8: ./vector [0x457e69]
>> [b2c2.local:mpi_rank_7][print_backtrace]   9: ./vector [0x409659]
>> [b2c2.local:mpi_rank_7][print_backtrace]  10: ./vector [0x405274]
>> [b2c2.local:mpi_rank_7][print_backtrace]  11:
>> /lib64/libc.so.6(__libc_start_main+0xf4) [0x3224e1d994]
>> [b2c2.local:mpi_rank_7][print_backtrace]  12: ./vector [0x404e29]
>> [b2c2.local:mpi_rank_1][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> [b2c2.local:mpi_rank_1][print_backtrace]   0: ./vector [0x49cc8a]
>> [b2c2.local:mpi_rank_1][print_backtrace]   1: ./vector [0x49cd74]
>> [b2c2.local:mpi_rank_1][print_backtrace]   2: /lib64/libpthread.so.0
>> [0x322560ebe0]
>> [b2c2.local:mpi_rank_1][print_backtrace]   3: ./vector [0x43b678]
>> [b2c2.local:mpi_rank_1][print_backtrace]   4: ./vector(free+0xcb) [0x43e10b]
>> [b2c2.local:mpi_rank_1][print_backtrace]   5: ./vector [0x483b42]
>> [b2c2.local:mpi_rank_1][print_backtrace]   6: ./vector [0x410634]
>> [b2c2.local:mpi_rank_1][print_backtrace]   7: ./vector [0x49d41f]
>> [b2c2.local:mpi_rank_1][print_backtrace]   8: ./vector [0x457e69]
>> [b2c2.local:mpi_rank_1][print_backtrace]   9: ./vector [0x409659]
>> [b2c2.local:mpi_rank_1][print_backtrace]  10: ./vector [0x405274]
>> [b2c2.local:mpi_rank_1][print_backtrace]  11:
>> /lib64/libc.so.6(__libc_start_main+0xf4) [0x3224e1d994]
>> [b2c2.local:mpi_rank_1][print_backtrace]  12: ./vector [0x404e29]
>> [b2c2.local:mpispawn_1][readline] Unexpected End-Of-File on file descriptor
>> 11. MPI process died?
>> [b2c2.local:mpispawn_1][mtpmi_processops] Error while reading PMI socket.
>> MPI process died?
>> [b2c2.local:mpispawn_1][child_handler] MPI process (rank: 7, pid: 506)
>> terminated with signal 11 -> abort job
>> [b2c1.local:mpi_rank_6][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> [b2c1.local:mpi_rank_2][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> [b2c1.local:mpi_rank_6][print_backtrace]   0: ./vector [0x49cc8a]
>> [b2c1.local:mpi_rank_6][print_backtrace]   1: ./vector [0x49cd74]
>> [b2c1.local:mpi_rank_6][print_backtrace]   2: /lib64/libpthread.so.0
>> [0x3b5da0ebe0]
>> [b2c1.local:mpi_rank_6][print_backtrace]   3: ./vector [0x43b4e7]
>> [b2c1.local:mpi_rank_6][print_backtrace]   4: ./vector(free+0xcb) [0x43e10b]
>> [b2c1.local:mpi_rank_6][print_backtrace]   5: ./vector [0x483b42]
>> [b2c1.local:mpi_rank_6][print_backtrace]   6: ./vector [0x410634]
>> [b2c1.local:mpi_rank_6][print_backtrace]   7: ./vector [0x49d41f]
>> [b2c1.local:mpi_rank_6][print_backtrace]   8: ./vector [0x457e69]
>> [b2c1.local:mpi_rank_6][print_backtrace]   9: ./vector [0x409659]
>> [b2c1.local:mpi_rank_6][print_backtrace]  10: ./vector [0x405274]
>> [b2c1.local:mpi_rank_6][print_backtrace]  11:
>> /lib64/libc.so.6(__libc_start_main+0xf4) [0x3b5d21d994]
>> [b2c1.local:mpi_rank_6][print_backtrace]  12: ./vector [0x404e29]
>> [b2c1.local:mpi_rank_2][print_backtrace]   0: ./vector [0x49cc8a]
>> [b2c1.local:mpi_rank_2][print_backtrace]   1: ./vector [0x49cd74]
>> [b2c1.local:mpi_rank_2][print_backtrace]   2: /lib64/libpthread.so.0
>> [0x3b5da0ebe0]
>> [b2c1.local:mpi_rank_2][print_backtrace]   3: ./vector [0x43b4e7]
>> [b2c1.local:mpi_rank_2][print_backtrace]   4: ./vector(free+0xcb) [0x43e10b]
>> [b2c1.local:mpi_rank_2][print_backtrace]   5: ./vector [0x483b42]
>> [b2c1.local:mpi_rank_2][print_backtrace]   6: ./vector [0x410634]
>> [b2c1.local:mpi_rank_2][print_backtrace]   7: ./vector [0x49d41f]
>> [b2c1.local:mpi_rank_2][print_backtrace]   8: ./vector [0x457e69]
>> [b2c1.local:mpi_rank_2][print_backtrace]   9: ./vector [0x409659]
>> [b2c1.local:mpi_rank_2][print_backtrace]  10: ./vector [0x405274]
>> [b2c1.local:mpi_rank_2][print_backtrace]  11:
>> /lib64/libc.so.6(__libc_start_main+0xf4) [0x3b5d21d994]
>> [b2c1.local:mpi_rank_2][print_backtrace]  12: ./vector [0x404e29]
>> [b2c1.local:mpispawn_0][readline] Unexpected End-Of-File on file descriptor
>> 8. MPI process died?
>> [b2c1.local:mpispawn_0][mtpmi_processops] Error while reading PMI socket.
>> MPI process died?
>> [b2c1.local:mpispawn_0][child_handler] MPI process (rank: 2, pid: 1509)
>> terminated with signal 11 -> abort job
>> [b2c1.local:mpi_rank_4][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> [b2c1.local:mpi_rank_4][print_backtrace]   0: ./vector [0x49cc8a]
>> [b2c1.local:mpi_rank_4][print_backtrace]   1: ./vector [0x49cd74]
>> [b2c1.local:mpi_rank_4][print_backtrace]   2: /lib64/libpthread.so.0
>> [0x3b5da0ebe0]
>> [b2c1.local:mpi_rank_4][print_backtrace]   3: ./vector [0x43b678]
>> [b2c1.local:mpi_rank_4][print_backtrace]   4: ./vector(free+0xcb) [0x43e10b]
>> [b2c1.local:mpi_rank_4][print_backtrace]   5: ./vector [0x483b42]
>> [b2c1.local:mpi_rank_4][print_backtrace]   6: ./vector [0x410634]
>> [b2c1.local:mpi_rank_4][print_backtrace]   7: ./vector [0x49d41f]
>> [b2c1.local:mpi_rank_4][print_backtrace]   8: ./vector [0x457e69]
>> [b2c1.local:mpi_rank_4][print_backtrace]   9: ./vector [0x409659]
>> [b2c1.local:mpi_rank_4][print_backtrace]  10: ./vector [0x405274]
>> [b2c1.local:mpi_rank_4][print_backtrace]  11:
>> /lib64/libc.so.6(__libc_start_main+0xf4) [0x3b5d21d994]
>> [b2c1.local:mpi_rank_4][print_backtrace]  12: ./vector [0x404e29]
>> [b2c1.local:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from node
>> 20.0.3.254 aborted: Error while reading a PMI socket (4)
>>
>> I am using an InfiniBand interconnect for interprocess communication.
>> 20.0.3.254 is the IP of the ib0 interface on b2c1, and 20.0.3.253 is the IP
>> of the ib0 interface on b2c2. We also have a Gigabit Ethernet interface,
>> which gives the same results.
>> As suggested in some of the mailing list discussions, I have tried the
>> MV2_IBA_HCA=mlx4_0 MV2_USE_RDMAOE=1 MV2_DEFAULT_PORT=1 options, which also
>> do not help. Any help on this matter would be appreciated.
>>
>> thanks and regards,
>> Suja
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss


