[mvapich-discuss] checkpointing failure ...

Thu Jun 12 16:05:08 EDT 2008

Hi Yiannis,

If you are operating close to the limit for memory pinning, I suspect
that fluctuations in system memory usage could cause this error when
MVAPICH2 attempts to deregister and reregister the memory it needs. Can
you try running the tests with a slightly smaller problem size, i.e. the
same number of nodes but a smaller HPL problem size?
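
As a rough first check of how close a node is to that limit, the sketch
below compares the region size from the reported log (10121216 bytes, the
register_nbytes value) against the process's locked-memory limit
(RLIMIT_MEMLOCK). Treating RLIMIT_MEMLOCK as the limit that matters here
is an assumption, and the helper names pages_needed and
fits_memlock_limit are made up for illustration; they are not part of
MVAPICH2 or OFED.

```python
import resource


def pages_needed(nbytes, page_size):
    """Pages required to cover nbytes, rounding up."""
    return (nbytes + page_size - 1) // page_size


def fits_memlock_limit(nbytes, soft_limit):
    """True if nbytes fits under the soft locked-memory limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return nbytes <= soft_limit


if __name__ == "__main__":
    # Size taken from the register_nbytes value in the reported log.
    region = 10121216
    page = resource.getpagesize()
    # Assumption: RLIMIT_MEMLOCK is the pinning limit relevant to this failure.
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print("page size:", page)
    print("pages for one region:", pages_needed(region, page))
    print("soft RLIMIT_MEMLOCK:",
          "unlimited" if soft == resource.RLIM_INFINITY else soft)
    print("single region fits:", fits_memlock_limit(region, soft))
```

Run it on a compute node in the same environment as the MPI job. It only
looks at one region; the limit applies to the total amount a process has
pinned, so a region that fits on its own can still fail to register.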

Regards.
  --Sundeep.

On Thu, 12 Jun 2008, yiannis georgiou wrote:

> Hello,
>
> I got the same error several times, mainly with large problem sizes and
> large numbers of nodes, although I have sometimes seen it with a small
> number of nodes as well. Most of the time the fatal error was
> produced just before the end of the checkpointing procedure. The
> strange thing is that the error does not appear every time. After
> repeating runs under the same configuration (HPL problem size and
> number of cluster nodes), we observe that sometimes we get this
> error, but sometimes the checkpoints are taken and restarted
> successfully.
>
> The explanation seems reasonable. Is this a known OFED bug? Is there
> a way to avoid it?
>
> Each node used in my experiments has:
>   CPU:    2 x Intel Xeon EM64T, 3 GHz, single core, 1 MB L2 cache
>   Memory: 2 GB (4 x 512 MB), 400 MHz (2.5 ns)
>
> Thanks!!
>
> regards,
> Yiannis
>
> Quoting Karthik Gopalakrishnan <gopalakk at cse.ohio-state.edu>:
>
> > Hi Biswajit.
> >
> > You are seeing this error due to a memory allocation failure. MVAPICH2
> > calls the OFED function ibv_reg_mr() to re-register a memory region with
> > the HCA after taking a checkpoint. This function call is returning an
> > error. This is a failure in the OFED stack. Please let us know if you
> > are seeing this error for all problem sizes or only large ones. Also
> > let us know the amount of memory you have on your processing nodes.
> > Maybe you are running low on memory which is causing the allocation
> > failure.
> >
> > Thanks & Regards,
> > Karthik
> >
> > On Wed, Jun 11, 2008 at 4:40 AM,  <biswajit at crlindia.com> wrote:
> >>
> >> While running HPL with checkpointing enabled in MVAPICH2 1.0.2, the
> >> program crashed with the following errors:
> >>
> >>   1. 2: reregister dentry 0x642010, addr 0x2bc6578000 pagebase_low_p,
> >> 10121216 register_nbytes
> >>   [2] Abort: reregister fails
> >>   at line 1104 in file dreg.c
> >>   rank 2 in job 1  n163_32790   caused collective abort of all ranks
> >>    exit status of rank 2: killed by signal 9
> >>
> >>
> >> [mpiexec_cr][/home/biswajit/mvapichBlcrInstall/mvapich2-1.0.2ckpt1/src/pm/mpd/mpiexec_cr.c:
> >> line 196]abort: checkpoint failed
> >>
> >> Restarting then fails with the following error:
> >>
> >>      cri_syscall(CR_OP_RSTRT_PROCS): Invalid argument
> >> _______________________________________________
> >> mvapich-discuss mailing list
> >> mvapich-discuss at cse.ohio-state.edu
> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>
> >>
> >
>
>
>
> --
>
> Yiannis Georgiou                LIG Laboratory / MESCAL Project
> Yiannis.Georgiou at imag.fr        http://mescal.imag.fr/
>                                  FRANCE
>