[mvapich-discuss] BLCR+MVAPICH2

hljgqz 15776869853 at 163.com
Thu Mar 12 07:07:35 EDT 2015


Hi, thank you for your help. Here are more details:

[root at node3 bin]# mpiname -a
MVAPICH2 2.1rc1 Thu Dec 18 20:00:00 EDT 2014 ch3:mrail

Compilation
CC: gcc    -DNDEBUG -DNVALGRIND -O2
CXX: g++   -DNDEBUG -DNVALGRIND -O2
F77: gfortran -L/lib -L/lib   -O2
FC: gfortran   -O2

Configuration
--enable-ckpt --disable-shared

Yesterday I inserted sleep(30) into cpi. Today, however, I ran the NPB3.3-MPI program bt.A.4:

[root at node3 NPB3.3-MPI]# mpirun_rsh -np 4 -hostfile hosts MV2_DEBUG_FT_VERBOSE=1 bin/bt.A.4

On another console I ran cr_checkpoint -p <PID> to try to take a checkpoint, but I can't checkpoint mpirun_rsh.
Here are the messages from the mpirun_rsh console:

[node3:mpirun_rsh][CR_Callback] Unexpected results from 0: ""
[node3:mpirun_rsh][CR_Callback] Some processes failed to checkpoint. Abort checkpoint...

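
A minimal sketch of the two-console procedure described above, assuming a POSIX shell on the launch node. The helper names is_alive and checkpoint_if_alive are hypothetical; only the cr_checkpoint -p <PID> invocation is taken from this thread. The sketch checks that the target process still exists before requesting a checkpoint, because checkpointing a job that has already exited fails.

```shell
#!/bin/sh
# Hypothetical helpers; only `cr_checkpoint -p <PID>` comes from the thread.

# Return 0 if a process with the given PID exists and can be signalled.
is_alive() {
    kill -0 "$1" 2>/dev/null
}

# Request a BLCR checkpoint only while the target is still running.
checkpoint_if_alive() {
    pid=$1
    if is_alive "$pid"; then
        cr_checkpoint -p "$pid"
    else
        echo "PID $pid is not running; nothing to checkpoint" >&2
        return 1
    fi
}
```

One possible use: start the job in the background with the mpirun_rsh command line shown above, save $! as the launcher PID, wait a few seconds, then call checkpoint_if_alive with that PID.
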
At 2015-03-12 03:53:55, "Jian Lin" <lin.2180 at osu.edu> wrote:
>Hi, 
>
>Thanks for your note.
>
>Are you using the cpi program that came with MPICH without any modification?
>This program runs very fast, so there may not be enough time to
>capture a snapshot. When you try to checkpoint a job that has already
>completed, cr_checkpoint reports errors like the ones you posted.
>
>Besides the output of "mpiname -a" and the output with
>"MV2_DEBUG_FT_VERBOSE=1", can you please also provide the last few lines
>of dmesg output after the error occurs? It will be helpful for us to
>understand what happens.
>
>On Wed, 11 Mar 2015 14:55:09 +0000
>Jonathan Perkins <perkinjo at cse.ohio-state.edu> wrote:
>
>> Hi, thanks for your note.
>> 
>> Can you provide us the output of mpiname -a?  Also can you rerun the
>> job(s) but also set MV2_DEBUG_FT_VERBOSE equal to 1?
>> 
>> On Wed, Mar 11, 2015 at 8:35 AM hljgqz <15776869853 at 163.com> wrote:
>> 
>> > Dear all,
>> >     I have a problem using Checkpoint/Restart with mvapich2-2.1.
>> > My cluster nodes run CentOS 6.6 x86_64 with Mellanox InfiniBand. BLCR
>> > is installed correctly, and I can use it to checkpoint ordinary programs.
>> > I configured mvapich2 with ./configure --enable-ckpt
>> > --disable-shared.
>> >    However, I can't checkpoint when I use mpirun_rsh -np 4 -hostfile
>> > hosts ./cpi (or another MPI program such as lu.A.4). When the program
>> > finished, this was the output:
>> > [root at node3 node0]# cr_checkpoint -p 3366
>> > - chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3367) exited with
>> > code 0 during checkpoint
>> > - chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3371) exited with
>> > code 0 during checkpoint
>> > - chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3366) exited with
>> > code 1 during checkpoint
>> > - chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3368) exited with
>> > code 1 during checkpoint
>> > - chkpt_watchdog: 'mpirun_rsh' (tgid/pid 3366/3369) exited with
>> > code 1 during checkpoint
>> > Checkpoint failed: no processes checkpointed
>> >
>> >  Also, if I use mpiexec -n 4 ./cpi, I can run cr_checkpoint to get a
>> > context file, but I can't restart from it. This is the output:
>> > [root at node3 node0]# cr_restart context.3436
>> > [mpiexec at node3] HYDT_dmxu_poll_wait_for_event
>> > (tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents &
>> > ~POLLIN & ~POLLOUT & ~POLLHUP & ~POLLERR)) failed
>> > [mpiexec at node3] HYD_pmci_wait_for_completion
>> > (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
>> > [mpiexec at node3] main (ui/mpich/mpiexec.c:344): process manager error
>> > waiting for completion
>> >
>
>
>
>-- 
>Jian Lin
>http://linjian.org
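
The diagnostics requested in the quoted replies (the output of mpiname -a and the last few lines of dmesg) can be gathered into one file. This is a sketch, assuming a POSIX shell; the function name collect_diag, the output path argument, and the choice of 20 lines are illustrative and not from the thread.

```shell
#!/bin/sh
# Hypothetical helper: write the requested diagnostics to one file.
collect_diag() {
    out=$1
    {
        echo "== mpiname -a =="
        # mpiname ships with MVAPICH2; report if it is not on PATH.
        if command -v mpiname >/dev/null 2>&1; then
            mpiname -a
        else
            echo "mpiname not found"
        fi
        echo "== dmesg (last 20 lines) =="
        # dmesg may require root on some systems; errors are suppressed.
        dmesg 2>/dev/null | tail -n 20
    } > "$out" 2>&1
}
```

Running collect_diag with a file name right after the checkpoint error occurs keeps the relevant kernel messages near the end of the captured dmesg output.
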


