[mvapich-discuss] problem on mvapich2-1.8.1 checkpoint/restart with BLCR

Suja Ramachandran sujaram at igcar.gov.in
Tue Feb 26 02:25:36 EST 2013


Hi,

I am running the program across two nodes with host names ib2c1 and
ib2c2 (with IB connectivity). I am using the command
/share/apps/mvapich2-1.9a2/bin/mpirun_rsh -np 8 -hostfile ./hostfile
MV2_CKPT_FILE=./mvapichckpt MV2_DEBUG_FT_VERBOSE=2 ./vector

(where both /share/apps and my current working directory are shared via NFS)
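
(For completeness: I trigger the checkpoints from the front-end node
with the mv2_checkpoint utility, and I restart a checkpointed job with
BLCR's cr_restart, roughly as

cr_restart <checkpoint file written with the ./mvapichckpt prefix>

where the name in angle brackets is only a placeholder for whichever
checkpoint file is actually written.)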

While taking a single checkpoint using the mv2_checkpoint utility, the
checkpoint files are created without any error, and the program continues
its execution till the end. However, after the job completes, the error
messages shown below appear:

[b2c2.local:mpi_rank_1][error_sighandler] Caught error: Segmentation 
fault (signal 11)
[b2c2.local:mpispawn_1][readline] Unexpected End-Of-File on file 
descriptor 11. MPI process died?
[b2c2.local:mpispawn_1][mtpmi_processops] Error while reading PMI 
socket. MPI process died?
[b2c1.local:mpi_rank_4][error_sighandler] Caught error: Segmentation 
fault (signal 11)
[b2c1.local:mpi_rank_2][error_sighandler] Caught error: Segmentation 
fault (signal 11)
[b2c1.local:mpi_rank_6][error_sighandler] Caught error: Segmentation 
fault (signal 11)
[b2c1.local:mpispawn_0][readline] Unexpected End-Of-File on file 
descriptor 11. MPI process died?
[b2c1.local:mpispawn_0][mtpmi_processops] Error while reading PMI 
socket. MPI process died?
[b2c1.local:mpi_rank_0][error_sighandler] Caught error: Segmentation 
fault (signal 11)
[b2c2.local:mpispawn_1][child_handler] MPI process (rank: 1, pid: 21164) 
terminated with signal 11 -> abort job
[b2c1.local:mpispawn_0][child_handler] MPI process (rank: 0, pid: 8026) 
terminated with signal 11 -> abort job
[b2c1.local:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from 
node ib2c2 aborted: Error while reading a PMI socket (4)

However, when I take a checkpoint of the process a second time, the
checkpoint fails and the program stops executing, as follows:

Master sent elements 0 to 7143 to rank 1 b2c1.local.
Master sent elements 7143 to 14286 to rank 2 b2c1.local.
Master sent elements 14286 to 21429 to rank 3 b2c1.local.
Slave 3 is sending partial sum 7143.000000 to master b2c2.local.
Slave 2 is sending partial sum 7143.000000 to master b2c1.local.
Slave 4 is sending partial sum 7143.000000 to master b2c1.local.
Master sent elements 21429 to 28572 to rank 4 b2c1.local.
Slave 1 is sending partial sum 7143.000000 to master b2c2.local.
[0]:  CR completed...
[7]:  CR completed...
Master sent elements 28572 to 35715 to rank 5 b2c1.local.
Master sent elements 35715 to 42858 to rank 6 b2c1.local.
Master sent elements 42858 to 50000 to rank 7 b2c1.local.
[b2c2.local:mpi_rank_7][error_sighandler] Caught error: Segmentation 
fault (signal 11)
[b2c2.local:mpi_rank_3][error_sighandler] Caught error: Segmentation 
fault (signal 11)
[b2c2.local:mpi_rank_5][error_sighandler] Caught error: Segmentation 
fault (signal 11)
[b2c2.local:mpispawn_1][readline] Unexpected End-Of-File on file 
descriptor 10. MPI process died?
[b2c2.local:mpispawn_1][mtpmi_processops] Error while reading PMI 
socket. MPI process died?
[b2c2.local:mpispawn_1][child_handler] MPI process (rank: 5, pid: 21570) 
terminated with signal 11 -> abort job
[b2c1.local:mpi_rank_2][error_sighandler] Caught error: Segmentation 
fault (signal 11)
[b2c1.local:mpi_rank_4][error_sighandler] Caught error: Segmentation 
fault (signal 11)
[b2c1.local:mpi_rank_6][error_sighandler] Caught error: Segmentation 
fault (signal 11)
[b2c1.local:mpispawn_0][readline] Unexpected End-Of-File on file 
descriptor 12. MPI process died?
[b2c1.local:mpispawn_0][mtpmi_processops] Error while reading PMI 
socket. MPI process died?
[b2c1.local:mpispawn_0][child_handler] MPI process (rank: 6, pid: 8464) 
terminated with signal 11 -> abort job
[b2c1.local:mpi_rank_0][error_sighandler] Caught error: Segmentation 
fault (signal 11)
[b2c2.local:mpi_rank_1][error_sighandler] Caught error: Segmentation 
fault (signal 11)
[b2c1.local:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from 
node ib2c2 aborted: Error while reading a PMI socket (4)
[b2c1.local:mpirun_rsh][CR_Callback] Unexpected results from 0: ""
[b2c1.local:mpirun_rsh][CR_Callback] Some processes failed to 
checkpoint. Abort checkpoint...


Here, the "[0]:  CR completed..." and "[7]:  CR completed..." statements
denote the completion of the first checkpoint. The error messages after
the line "Master sent elements 42858 to 50000 to rank 7 b2c1.local."
result from invoking the mv2_checkpoint utility a second time to take a
checkpoint.
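
For clarity, the sequence on the front-end node is roughly the
following (the path below assumes mv2_checkpoint lives in the same bin
directory as mpirun_rsh; the utility's interactive prompts are omitted):

/share/apps/mvapich2-1.9a2/bin/mv2_checkpoint   (first checkpoint: succeeds, job continues)
/share/apps/mvapich2-1.9a2/bin/mv2_checkpoint   (second checkpoint: fails with the errors above)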

thanks and regards,
Suja


On Tuesday 26 February 2013 11:32 AM, Raghunath wrote:
> Suja,
>
> I tried your test case with multiple nodes, and I still see no issues.
> At what point do you see this error: when taking a checkpoint, or
> during a restart? Also, can you send me the complete log of your run?
> The output you pasted does not provide enough information to draw any
> conclusion about why you are having this issue. Can you also try
> running again with the "MV2_DEBUG_FT_VERBOSE" environment variable
> set to 2 and send me the output?
> --
> Raghu
>
>
> On Mon, Feb 25, 2013 at 2:24 AM, Suja Ramachandran <sujaram at igcar.gov.in> wrote:
>> Hi,
>>
>> Yes, the 1.9a2 version is giving the errors... May I know if you have
>> tried the test case within one node or across multiple nodes? In my
>> case, checkpoint/restart any number of times works fine on a single
>> node. MPI jobs running across multiple nodes (connected via IB as well
>> as Ethernet) give the errors. The error messages are like:
>>
>> [b2h.hpc.igcar.in:mpi_rank_3][error_sighandler] Caught error: Segmentation
>> fault (signal 11)
>> [b2h.hpc.igcar.in:mpispawn_1][readline] Unexpected End-Of-File on file
>> descriptor 13. MPI process died?
>> [b2h.hpc.igcar.in:mpispawn_1][mtpmi_processops] Error while reading PMI
>> socket. MPI process died?
>> [b2h.hpc.igcar.in:mpispawn_1][child_handler] MPI process (rank: 3, pid:
>> 32193) terminated with signal 11 -> abort job
>> [b2c1.local:mpi_rank_8][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> [b2c1.local:mpi_rank_4][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> [b2c1.local:mpispawn_0][readline] Unexpected End-Of-File on file descriptor
>> 13. MPI process died?
>>
>> [b2c1.local:mpispawn_0][mtpmi_processops] Error while reading PMI socket.
>> MPI process died?
>> [b2c1.local:mpispawn_0][child_handler] MPI process (rank: 8, pid: 30295)
>> terminated with signal 11 -> abort job
>> [b2h.hpc.igcar.in:mpi_rank_1][error_sighandler] Caught error: Segmentation
>> fault (signal 11)
>>
>> [b2c1.local:mpi_rank_0][error_sighandler] Caught error: Segmentation fault
>> (signal 11)
>> Slave 9 is sending partial sum 5555.000000 to master b2h.hpc.igcar.in.
>> [b2c1.local:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from node
>> b2h aborted: Error while reading a PMI socket (4)
>>
>>
>> thanks and regards,
>> suja
>>
>> On Friday 22 February 2013 11:09 PM, Raghunath wrote:
>>> Hi Suja,
>>>
>>> I am copying this email to an internal developers-only list.
>>>
>>>
>>>> Yes, I have used the mv2_checkpoint tool too. One more problem I have
>>>> noticed is that the application is able to complete its execution only
>>>> after the first checkpoint. If I try to checkpoint an application a
>>>> second time, or checkpoint an application restarted using cr_restart,
>>>> the same errors cause the program to stop executing. Now that's a real
>>>> problem for me!
>>> Is this with the 1.9a2 version (with the same build options) as well?
>>> I tried a simple test case, taking back-to-back checkpoints of the
>>> vsum program you had pointed me to, and things work as expected. What
>>> is the exact error message you see when the program stops executing?
>>>
>>>> (Btw, is any option required while configuring BLCR to make it work
>>>> with MVAPICH?)
>>> No, you can build BLCR as you would normally. No special build flags
>>> are required for it to work with MVAPICH.
>>> --
>>> Raghu
>>>


