[mvapich-discuss] Problems BLCR-checkpointing MVAPICH2-1.5.1 application to Lustre parallel filesystem

xiangyong ouyang ouyangx at cse.ohio-state.edu
Sun Jan 30 20:16:52 EST 2011


Hello Thomas,


The error code "-14" is a system error, EFAULT, which indicates a "Bad
address".  We suspect it has something to do with the Lustre
filesystem setup.
I have attached a new patch that prints more informative output to
help confirm our guess.  Please revert the previous patch before
applying this one.
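
As a quick cross-check of the errno mapping, here is a minimal sketch
(not part of the attached patch). It assumes the return code follows the
kernel convention of a negated errno value; the helper name
describe_return_code is only illustrative.

```python
import errno
import os


def describe_return_code(rc):
    """Map a negated, kernel-style return code to (errno name, message).

    Assumption: rc is -errno, as the "-14" reported by cr_checkpoint
    appears to be. The helper name is hypothetical.
    """
    code = abs(rc)
    # errno.errorcode maps numeric errno values to their symbolic names.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the platform's message text for that errno.
    return name, os.strerror(code)


if __name__ == "__main__":
    name, message = describe_return_code(-14)
    print(name, "-", message)
```

On a Linux node this should report EFAULT for -14; the message text
comes from the local C library.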

In our local checkpoint tests we use Lustre-1.8.1.1 and kernel 2.6.18.
Can you tell me which Lustre versions you have tried?



-Xiangyong Ouyang



On Fri, Jan 28, 2011 at 8:38 AM, Thomas Zeiser
<thomas.zeiser at rrze.uni-erlangen.de> wrote:
> Hello Xiangyong,
>
> On Fri, Jan 28, 2011 at 12:12:12AM -0500, xiangyong ouyang wrote:
>> Hello Thomas,
>>
>> We re-ran the test with IMB, and we were able to checkpoint to Lustre
>> successfully.
>>
>> First of all, please make sure you have the right permissions in the
>> Lustre filesystem, enough free space available in Lustre, have not run
>> out of quota, etc.  Some users experienced checkpoint problems which
>> were caused by these filesystem issues.
>
> Permissions cannot be the problem as the first few MBs of the
> checkpoint always get written.
>
> There are no quotas on the file system and creating a 100 GB file
> with "dd" works fine:
>
> uz at l1446: 13:47 [~] $ dd if=/dev/zero of=/lxfs/unrz/uz/chk-imb-45283.ladm1.dd bs=1M count=100000
> 100000+0 records in
> 100000+0 records out
> 104857600000 bytes (105 GB) copied, 306.241 seconds, 342 MB/s
>
> uz at l1446: 13:52 [~] $ ls -lh /lxfs/unrz/uz/chk-imb-45283.ladm1.dd
> -rw------- 1 uz unrz 98G Jan 28 13:52 /lxfs/unrz/uz/chk-imb-45283.ladm1.dd
>
> And of course also normal MPI-IO works fine.
>
>> It would be helpful to collect more information about the failure you
>> encountered.  I have attached a small patch that will print some error
>> messages when checkpoint fails.  Can you apply this patch to your
>> mvapich2?  I'm assuming you are using MVAPICH2-1.5.1p1.  Please re-run
>> the checkpoint test and tell us the error print outs. Thanks!
>
> It's error code -14: (full logs are in the attachment)
>
> cr_checkpoint failed with return code=-14
>
>> We have made some improvements since MVAPICH2-1.5.1p1.   If possible
>> can you try our latest MVAPICH2-1.6RC2 which is available at:
>> http://mvapich.cse.ohio-state.edu/download/mvapich2/mvapich2-1.6rc2.tgz
>
> Basically the same story; the main differences are
> - it seems to take slightly longer (i.e. the process stalls longer) until
>  the abort message comes (but that may be subjective)
> - the way STDOUT/STDERR messages are generated must have
>  changed as the "[0]: begin checkpoint..." message does not appear
>
> With your patch from 1.5.1p1 applied to 1.6rc2 I again get error
> -14.
>
> [Rank 0][cr.c: line 955]cr_checkpoint failed with return code=-14
> [CR_Callback] Checkpoint of a Process Failed
> MPI process (rank: 0) terminated unexpectedly on l1440
> cr_core.c:244 cr_checkpoint: Unexpected return from CR_OP_HAND_ABORT
> Abort
> handle_mt_peer: fail to read...: Success
> [Rank 1][cr.c: line 955]cr_checkpoint failed with return code=-14
> MPI process (rank: 1) terminated unexpectedly on l1439
>
>
> To tell the complete story, I have to admit that in today's tests I
> had a very few IMB runs which were successfully checkpointed to the
> Lustre filesystem. However, I could not detect any systematic pattern.
> The good cases are also included in the attachment.
> To me, it looks like a timing race condition of some sort ...
>
>
>
> Thanks for your help,
>
> thomas
> --
> Dr.-Ing. Thomas Zeiser, HPC Services
> Friedrich-Alexander-Universitaet Erlangen-Nuernberg
> Regionales Rechenzentrum Erlangen (RRZE)
> Martensstrasse 1, 91058 Erlangen, Germany
> Tel.: +49 9131 85-28737, Fax: +49 9131 302941
> thomas.zeiser at rrze.uni-erlangen.de
> http://www.rrze.uni-erlangen.de/hpc/
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cr_abort_errno.patch
Type: text/x-patch
Size: 710 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20110130/94c2f914/cr_abort_errno.bin

