[mvapich-discuss] Problems BLCR-checkpointing MVAPICH2-1.5.1 application to Lustre parallel filesystem

Thomas Zeiser thomas.zeiser at rrze.uni-erlangen.de
Mon Jan 31 05:27:59 EST 2011


On Mon, Jan 31, 2011 at 10:45:15AM +0100, Thomas Zeiser wrote:
> Hello Xiangyong,
> 
> On Sun, Jan 30, 2011 at 08:16:52PM -0500, xiangyong ouyang wrote:
> > Hello Thomas,
> > 
> > The error code "-14" is the system error EFAULT, which indicates a "Bad
> > address".  We suspect it has something to do with the Lustre
> > filesystem setup.
> > I have attached a new patch that prints more informative output to
> > confirm our guess.  Please revert the previous patch before
> > applying this one.
> 
> you were right with your guess:
> 
> [0]: begin checkpoint...
> [Rank 0][cr.c: line 722]cr_checkpoint() failed: -14 (Bad address)
> [CR_Callback] Checkpoint of a Process Failed
> MPI process (rank: 0) terminated unexpectedly on l1418
> cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
> Abort
> [Rank 1][cr.c: line 722]cr_checkpoint() failed: -14 (Bad address)
> MPI process (rank: 1) terminated unexpectedly on l1407
> 
> > In our local checkpoint tests we use Lustre-1.8.1.1 and kernel 2.6.18.
> >  Can you tell me what Lustre versions you have tried?
> 
> The Lustre installation is NEC's LXFS solution ...
> 
> From the compute nodes:
> 
> [root@l1418 ~]# uname -a
> Linux l1418 2.6.18-194.11.4.el5 #1 SMP Tue Sep 21 05:04:09 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
> 
> [root@l1418 ~]# cat /proc/fs/lustre/version
> lustre: 1.8.4
> kernel: patchless_client
> build:  1.8.4-20100724012708-PRISTINE-2.6.18-194.11.4.el5
> 
> [root@l1418 ~]# cat /etc/redhat-release
> CentOS release 5.5 (Final)
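
For reference, the mapping from the raw return value to the error name can be checked on any Linux node. The following is only a minimal sketch: the helper name describe_errno is made up for illustration and is not part of BLCR or MVAPICH2. It assumes the kernel-style convention that a negative return value is the negated errno.

```python
import errno
import os


def describe_errno(ret: int) -> str:
    """Translate a negated errno (e.g. -14) into 'NAME: message'.

    Hypothetical helper for illustration; not part of BLCR or MVAPICH2.
    """
    code = abs(ret)
    # errno.errorcode maps numeric codes to symbolic names such as 'EFAULT'.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the platform's message text for the code.
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    # -14 is the value reported by cr_checkpoint() in the log above.
    print(describe_errno(-14))
```

The message text comes from the local C library, so the exact wording may differ between platforms.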

one additional note:

[root@l1418 etc]# mount |grep /lxfs
10.188.20.31@o2ib:10.188.20.32@o2ib:/lnec on /lxfs type lustre (rw,noatime,nodiratime)

i.e. the Lustre file system is accessed over the same InfiniBand HCA that
is used for MPI communication.

Adding "localflock" to the mount options on the clients did not
change the situation.
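
To verify on many clients whether an option such as "localflock" is really active, the mount output can be parsed mechanically. This is a rough sketch, not a tested site tool: the function names parse_mount_line and has_option are invented for illustration, and the sample line mirrors the mount output above, written with "@" in the NIDs as Lustre prints them.

```python
import re

# Matches one line of `mount` output: DEVICE on MOUNTPOINT type FSTYPE (OPTIONS)
MOUNT_LINE = re.compile(
    r"^(?P<device>.+) on (?P<mountpoint>\S+) type (?P<fstype>\S+) "
    r"\((?P<options>[^)]*)\)$"
)

# Sample taken from the output above.
SAMPLE = (
    "10.188.20.31@o2ib:10.188.20.32@o2ib:/lnec on /lxfs "
    "type lustre (rw,noatime,nodiratime)"
)


def parse_mount_line(line):
    """Return a dict with device, mountpoint, fstype and an options list,
    or None if the line does not look like `mount` output."""
    match = MOUNT_LINE.match(line.strip())
    if match is None:
        return None
    info = match.groupdict()
    info["options"] = info["options"].split(",")
    return info


def has_option(line, option):
    """True if the mount line lists the given option (hypothetical helper)."""
    info = parse_mount_line(line)
    return info is not None and option in info["options"]


if __name__ == "__main__":
    print(parse_mount_line(SAMPLE))
    print("localflock active:", has_option(SAMPLE, "localflock"))
```

Run against the output of "mount" on each client, this shows at a glance which nodes picked up a changed option after a remount.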

> From the Lustre servers:
> 
> [root@lmds1 ~]# uname -a
> Linux lmds1 2.6.18-164.11.1.el5_lustre.1.8.3 #1 SMP Fri Apr 9 18:00:39 MDT 2010 x86_64 x86_64 x86_64 GNU/Linux
> 
> [root@lmds1 ~]# cat /proc/fs/lustre/version
> lustre: 1.8.3
> kernel: patchless_client
> build:  1.8.3-20100409175828-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.3
> 
> [root@lmds1 ~]# cat /etc/redhat-release
> CentOS release 5.4 (Final)
> 
> [root@lmds1 ~]# rpm -qa|grep -i lustre
> kernel-devel-2.6.18-164.11.1.el5_lustre.1.8.3
> kernel-2.6.18-164.11.1.el5_lustre.1.8.3
> lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
> lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
> kernel-headers-2.6.18-164.11.1.el5_lustre.1.8.3
> lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
> lustre-tests-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
> 
> 
> Best regards,
> 
> thomas

-- 
Dr.-Ing. Thomas Zeiser, HPC Services
Friedrich-Alexander-Universitaet Erlangen-Nuernberg
Regionales Rechenzentrum Erlangen (RRZE)
Martensstrasse 1, 91058 Erlangen, Germany
Tel.: +49 9131 85-28737, Fax: +49 9131 302941
thomas.zeiser at rrze.uni-erlangen.de
http://www.rrze.uni-erlangen.de/hpc/

