[mvapich-discuss] checkpointing failure ...

yiannis georgiou yiannis.georgiou at imag.fr
Thu Jun 12 04:33:19 EDT 2008


Hello,

I have hit the same error several times, mainly with large problem
sizes and a large number of nodes, although I have occasionally seen it
with a small number of nodes as well. Most of the time the fatal error
occurred just before the checkpointing procedure finished. The strange
thing is that the error does not appear every time. When we repeat a
run under the same configuration (same HPL problem size, same number of
cluster nodes), we sometimes get this error, while at other times the
checkpoints are taken and restarted successfully.

The explanation seems reasonable. Is this a known OFED bug? Is there
a way to avoid it?

The nodes used in my experiments consist of:
  CPU:    2 x Intel Xeon EM64T, 3 GHz, single core, 1 MB L2 cache
  Memory: 2 GB (4 x 512 MB), 400 MHz (2.5 ns)
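
In case it helps narrow things down, here is a minimal sketch for checking the locked-memory limit on a node. ibv_reg_mr() pins the pages it registers, and a low RLIMIT_MEMLOCK is one common reason for it to fail; whether that is the cause here is only an assumption on my part. The region size is the register_nbytes value from the quoted error output, and the helper names fits_in_memlock and describe are made up for illustration.

```python
import resource

# Size of the region whose re-registration failed, taken from the
# "register_nbytes" value in the quoted error output.
FAILED_REGION_BYTES = 10121216


def fits_in_memlock(nbytes, soft_limit):
    """Return True if one region of nbytes fits under the soft limit.

    This is only a lower bound: the limit applies to all memory the
    process has locked, not to a single region.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return nbytes <= soft_limit


def describe(limit):
    """Format a limit value for printing."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit} bytes"


if __name__ == "__main__":
    # Query the limits that apply to the current process.
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print("RLIMIT_MEMLOCK soft:", describe(soft))
    print("RLIMIT_MEMLOCK hard:", describe(hard))
    if fits_in_memlock(FAILED_REGION_BYTES, soft):
        print("The failed region alone fits under the soft limit.")
    else:
        print("The failed region alone exceeds the soft limit.")
```

The limit should be checked in the same environment the MPI processes actually run in (for example, started through the same launcher), because a limit seen in an interactive shell can differ from the one the ranks inherit.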

Thanks!!

regards,
Yiannis

Quoting Karthik Gopalakrishnan <gopalakk at cse.ohio-state.edu>:

> Hi Biswajit.
>
> You are seeing this error due to a memory allocation failure. MVAPICH2
> calls an OFED function ibv_reg_mr() to reallocate a memory region for
> the HCA after taking a checkpoint. This function call is returning an
> error. This is a failure in the OFED stack. Please let us know if you
> are seeing this error for all problem sizes or only large ones. Also
> let us know the amount of memory you have on your processing nodes.
> Maybe you are running low on memory which is causing the allocation
> failure.
>
> Thanks & Regards,
> Karthik
>
> On Wed, Jun 11, 2008 at 4:40 AM,  <biswajit at crlindia.com> wrote:
>>
>> While running HPL with checkpointing enabled in MVAPICH2 1.0.2, the
>> program crashed with the following errors:
>>
>>   1. 2: reregister dentry 0x642010, addr 0x2bc6578000 pagebase_low_p,
>> 10121216 register_nbytes
>>   [2] Abort: reregister fails
>>   at line 1104 in file dreg.c
>>   rank 2 in job 1  n163_32790   caused collective abort of all ranks
>>    exit status of rank 2: killed by signal 9
>>
>>    
>> [mpiexec_cr][/home/biswajit/mvapichBlcrInstall/mvapich2-1.0.2ckpt1/src/pm/mpd/mpiexec_cr.c:
>> line 196]abort: checkpoint failed
>>
>>
>>
>>
>> The restart then fails with the following error:
>>
>>      cri_syscall(CR_OP_RSTRT_PROCS): Invalid argument
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>



-- 

Yiannis Georgiou                LIG Laboratory / MESCAL Project
Yiannis.Georgiou at imag.fr        http://mescal.imag.fr/
                                 FRANCE


