[mvapich-discuss] checkpointing failure ...

Karthik Gopalakrishnan gopalakk at cse.ohio-state.edu
Wed Jun 11 17:23:24 EDT 2008


Hi Biswajit.

You are seeing this error because a memory registration failed. After
taking a checkpoint, MVAPICH2 calls the OFED function ibv_reg_mr() to
re-register its memory regions with the HCA, and this call is returning
an error, so the failure is in the OFED stack. Please let us know
whether you see this error for all problem sizes or only for large
ones, and how much memory you have on your compute nodes. You may be
running low on memory, which would cause the registration to fail.
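
If it helps to collect those numbers, here is a minimal sketch that can be
run on each node. It is not part of MVAPICH2 or OFED; the function names are
made up for illustration. It also prints the locked-memory limit, on the
assumption (not confirmed for this case) that the limit may be relevant,
since ibv_reg_mr() pins the pages it registers. The sysconf names used are
Linux-specific.

```python
import os
import resource

# Size of the failed registration, taken from "10121216 register_nbytes"
# in the error output quoted below.
REGISTER_NBYTES = 10121216


def memlock_limit():
    """Return the soft RLIMIT_MEMLOCK in bytes, or None if it is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return None if soft == resource.RLIM_INFINITY else soft


def physical_memory():
    """Return (total_bytes, available_bytes) as reported by sysconf."""
    page = os.sysconf("SC_PAGE_SIZE")
    total = os.sysconf("SC_PHYS_PAGES") * page
    avail = os.sysconf("SC_AVPHYS_PAGES") * page
    return total, avail


def fits_in_limit(nbytes, limit):
    """True if one region of nbytes fits under the given locked-memory limit.

    This ignores memory already locked by the process, so it is only a
    rough lower bound on what is needed.
    """
    return limit is None or nbytes <= limit


if __name__ == "__main__":
    limit = memlock_limit()
    total, avail = physical_memory()
    print("locked-memory limit:",
          "unlimited" if limit is None else f"{limit} bytes")
    print(f"physical memory: {total} bytes total, {avail} bytes available")
    print(f"a {REGISTER_NBYTES}-byte region fits under the limit:",
          fits_in_limit(REGISTER_NBYTES, limit))
```

The limit seen by an interactive shell can differ from the one seen by
processes started through the job launcher, so the values are most useful
when collected in the same way the MPI job is started.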

Thanks & Regards,
Karthik

On Wed, Jun 11, 2008 at 4:40 AM,  <biswajit at crlindia.com> wrote:
>
> While running HPL with checkpointing enabled in MVAPICH2 1.0.2, the program
> crashed with the
>  following errors:
>
>   1. 2: reregister dentry 0x642010, addr 0x2bc6578000 pagebase_low_p,
> 10121216 register_nbytes
>   [2] Abort: reregister fails
>   at line 1104 in file dreg.c
>   rank 2 in job 1  n163_32790   caused collective abort of all ranks
>    exit status of rank 2: killed by signal 9
>
>  [mpiexec_cr][/home/biswajit/mvapichBlcrInstall/mvapich2-1.0.2ckpt1/src/pm/mpd/mpiexec_cr.c:
> line 196]abort: checkpoint failed
>
>
>
>
> On restart, the restart fails with the following error:
>
>      cri_syscall(CR_OP_RSTRT_PROCS): Invalid argument

