[mvapich-discuss] checkpointing failure ...
Karthik Gopalakrishnan
gopalakk at cse.ohio-state.edu
Wed Jun 11 17:23:24 EDT 2008
Hi Biswajit.
You are seeing this error because a memory registration failed. After
taking a checkpoint, MVAPICH2 calls the OFED function ibv_reg_mr() to
re-register a memory region with the HCA, and that call is returning
an error, so the failure is in the OFED stack. Please let us know
whether you see this error for all problem sizes or only for large
ones, and how much memory your processing nodes have. You may be
running low on memory, which would cause the registration to fail.
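
To gather the memory figures asked about above, a few C helpers along
these lines could be compiled and called on each node. This is only a
sketch and not part of MVAPICH2 or OFED: the function names
(total_phys_bytes, avail_phys_bytes, memlock_soft_limit,
print_memory_report) are hypothetical, and it assumes Linux with glibc,
where sysconf() supports _SC_PHYS_PAGES and _SC_AVPHYS_PAGES. The
locked-memory limit is included as an assumption on top of the
low-memory theory: registration pins pages, so a small RLIMIT_MEMLOCK
may also make ibv_reg_mr() fail.

```c
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

/* Total physical memory in bytes, or -1 if it cannot be determined.
 * Assumes Linux/glibc, where _SC_PHYS_PAGES is available. */
long long total_phys_bytes(void)
{
    long pages = sysconf(_SC_PHYS_PAGES);
    long page_size = sysconf(_SC_PAGESIZE);
    if (pages < 0 || page_size < 0)
        return -1;
    return (long long)pages * page_size;
}

/* Currently available physical memory in bytes, or -1 on error.
 * Assumes Linux/glibc, where _SC_AVPHYS_PAGES is available. */
long long avail_phys_bytes(void)
{
    long pages = sysconf(_SC_AVPHYS_PAGES);
    long page_size = sysconf(_SC_PAGESIZE);
    if (pages < 0 || page_size < 0)
        return -1;
    return (long long)pages * page_size;
}

/* Reads the soft locked-memory limit of the calling process.
 * Returns 0 on success, -1 on error. *unlimited is set to 1 when the
 * limit is RLIM_INFINITY; otherwise *limit holds the limit in bytes. */
int memlock_soft_limit(unsigned long long *limit, int *unlimited)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;
    *unlimited = (rl.rlim_cur == RLIM_INFINITY);
    *limit = (unsigned long long)rl.rlim_cur;
    return 0;
}

/* Writes a short report of the three values to the given stream. */
void print_memory_report(FILE *out)
{
    unsigned long long limit = 0;
    int unlimited = 0;

    fprintf(out, "total physical memory: %lld bytes\n", total_phys_bytes());
    fprintf(out, "available memory:      %lld bytes\n", avail_phys_bytes());

    if (memlock_soft_limit(&limit, &unlimited) != 0)
        fprintf(out, "locked-memory limit:   unknown\n");
    else if (unlimited)
        fprintf(out, "locked-memory limit:   unlimited\n");
    else
        fprintf(out, "locked-memory limit:   %llu bytes\n", limit);
}
```

Calling print_memory_report(stdout) from a small main() on each node,
in the same environment the MPI ranks run in, gives numbers that can
be compared against the 10121216 bytes in the failing registration.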
Thanks & Regards,
Karthik
On Wed, Jun 11, 2008 at 4:40 AM, <biswajit at crlindia.com> wrote:
>
> While running HPL with checkpointing-enabled MVAPICH2 1.0.2, the program
> crashed, giving the following errors:
>
> 1. 2: reregister dentry 0x642010, addr 0x2bc6578000 pagebase_low_p,
> 10121216 register_nbytes
> [2] Abort: reregister fails
> at line 1104 in file dreg.c
> rank 2 in job 1 n163_32790 caused collective abort of all ranks
> exit status of rank 2: killed by signal 9
>
> [mpiexec_cr][/home/biswajit/mvapichBlcrInstall/mvapich2-1.0.2ckpt1/src/pm/mpd/mpiexec_cr.c:
> line 196]abort: checkpoint failed
>
>
>
>
> The restart then fails, giving the following error:
>
> cri_syscall(CR_OP_RSTRT_PROCS): Invalid argument