[mvapich-discuss] Problems BLCR-checkpointing MVAPICH2-1.5.1 application to Lustre parallel filesystem

Thomas Zeiser thomas.zeiser at rrze.uni-erlangen.de
Mon Jan 31 04:45:15 EST 2011


Hello Xiangyong,

On Sun, Jan 30, 2011 at 08:16:52PM -0500, xiangyong ouyang wrote:
> Hello Thomas,
> 
> The error code "-14" is a system error: EFAULT which indicates a "Bad
> Address".  We suspect it has something to do with the Lustre
> filesystem setup.
> I have attached a new patch to produce more informative print out to
> consolidate our guess.  Please revert the previous patch before
> applying this one.

You were right with your guess:

[0]: begin checkpoint...
[Rank 0][cr.c: line 722]cr_checkpoint() failed: -14 (Bad address)
[CR_Callback] Checkpoint of a Process Failed
MPI process (rank: 0) terminated unexpectedly on l1418
cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
Abort
[Rank 1][cr.c: line 722]cr_checkpoint() failed: -14 (Bad address)
MPI process (rank: 1) terminated unexpectedly on l1407
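
For reference, the mapping from the return value -14 to EFAULT ("Bad address") can be checked independently of BLCR. The following is only a minimal sketch: the helper name `describe_errno` is made up for illustration and is not part of BLCR or MVAPICH2, and the message text returned by `os.strerror` depends on the platform's C library.

```python
import errno
import os


def describe_errno(ret):
    """Map a negative kernel-style return value (e.g. -14) to (name, message).

    Hypothetical helper for illustration; not part of BLCR or MVAPICH2.
    """
    code = abs(ret)
    # errno.errorcode maps numeric codes to symbolic names such as 'EFAULT'.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the C library's message for the code.
    return name, os.strerror(code)


name, message = describe_errno(-14)
print(name, "-", message)
```

On a Linux node this should report EFAULT together with the same "Bad address" text that appears in the log above.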

> In our local checkpoint tests we use Lustre-1.8.1.1 and kernel 2.6.18.
>  Can you tell me what Lustre versions you have tried?

The Lustre installation is NEC's LXFS solution ...


From the compute nodes:

[root@l1418 ~]# uname -a
Linux l1418 2.6.18-194.11.4.el5 #1 SMP Tue Sep 21 05:04:09 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

[root@l1418 ~]# cat /proc/fs/lustre/version
lustre: 1.8.4
kernel: patchless_client
build:  1.8.4-20100724012708-PRISTINE-2.6.18-194.11.4.el5

[root@l1418 ~]# cat /etc/redhat-release
CentOS release 5.5 (Final)

From the Lustre servers:

[root@lmds1 ~]# uname -a
Linux lmds1 2.6.18-164.11.1.el5_lustre.1.8.3 #1 SMP Fri Apr 9 18:00:39 MDT 2010 x86_64 x86_64 x86_64 GNU/Linux

[root@lmds1 ~]# cat /proc/fs/lustre/version
lustre: 1.8.3
kernel: patchless_client
build:  1.8.3-20100409175828-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.3

[root@lmds1 ~]# cat /etc/redhat-release
CentOS release 5.4 (Final)

[root@lmds1 ~]# rpm -qa | grep -i lustre
kernel-devel-2.6.18-164.11.1.el5_lustre.1.8.3
kernel-2.6.18-164.11.1.el5_lustre.1.8.3
lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
kernel-headers-2.6.18-164.11.1.el5_lustre.1.8.3
lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
lustre-tests-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
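
To compare the versions reported above (1.8.4 on the compute nodes, 1.8.3 on the servers, 1.8.1.1 in the local tests at OSU), the output of /proc/fs/lustre/version can be parsed mechanically. This is only a rough sketch: the function names are made up for illustration, and it assumes the plain "key: value" layout shown in the output above with purely numeric, dot-separated version strings.

```python
def parse_lustre_version(text):
    """Parse 'key: value' lines as printed by /proc/fs/lustre/version.

    Hypothetical helper; assumes the layout shown in this message.
    """
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            info[key.strip()] = value.strip()
    return info


def version_tuple(version):
    """Turn '1.8.4' into (1, 8, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Sample text copied from the compute-node and server output above.
client = parse_lustre_version(
    "lustre: 1.8.4\n"
    "kernel: patchless_client\n"
    "build:  1.8.4-20100724012708-PRISTINE-2.6.18-194.11.4.el5\n"
)
server = parse_lustre_version(
    "lustre: 1.8.3\n"
    "kernel: patchless_client\n"
    "build:  1.8.3-20100409175828-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.3\n"
)

if version_tuple(client["lustre"]) != version_tuple(server["lustre"]):
    print("client", client["lustre"], "differs from server", server["lustre"])
```

On a live node the same function could be fed the contents of /proc/fs/lustre/version instead of the sample strings.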


Best regards,

thomas
-- 
Dr.-Ing. Thomas Zeiser, HPC Services
Friedrich-Alexander-Universitaet Erlangen-Nuernberg
Regionales Rechenzentrum Erlangen (RRZE)
Martensstrasse 1, 91058 Erlangen, Germany
Tel.: +49 9131 85-28737, Fax: +49 9131 302941
thomas.zeiser at rrze.uni-erlangen.de
http://www.rrze.uni-erlangen.de/hpc/

