[mvapich-discuss] question about checkpoint/restart

Xie Min xmxmxie at gmail.com
Sat Nov 10 03:29:00 EST 2007


Hi,
    It seems something was wrong in our configuration. I compiled
mvapich2 with checkpoint/restart enabled on another server, and that
build works fine.
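
For anyone hitting the same error, here is a rough sketch of the linkage
check Wei suggested below (making sure bt.A.4 picks up the intended OFED
libraries). It only filters the output of ldd. The helper name ofed_libs
is made up for illustration, and the list of library names (libibverbs,
libibumad, librdmacm) is my assumption about what matters here, so adjust
it for your installation.

```shell
#!/bin/sh
# Hypothetical helper: read `ldd` output on stdin and print each
# InfiniBand-related library together with the path it resolves to.
# The library name list is an assumption, not taken from MVAPICH2 docs.
ofed_libs() {
    # ldd lines look like:
    #   libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x...)
    # or, for an unresolved library:
    #   libibverbs.so.1 => not found
    awk '$1 ~ /^lib(ibverbs|ibumad|rdmacm)\.so/ && $2 == "=>" {
        print $1, ($3 == "not" ? "NOT FOUND" : $3)
    }'
}

# When called with a binary as the first argument, inspect it directly,
# e.g.:  sh check_ofed.sh ./bt.A.4
if [ "$#" -gt 0 ]; then
    ldd "$1" | ofed_libs
fi
```

If the paths printed for the working and the failing build point into
different OFED installations, that would be the first thing to look at.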

XM

2007/11/9, wei huang <huanwei at cse.ohio-state.edu>:
> Hi Xie,
>
> It looks to me that your environment is not set up properly. Can you run
> normal mvapich2 (without CR) using:
>
> mpiexec -np 4 -env MV2_ON_DEMAND_THRESHOLD 1 ./bt.A.4
>
> Also, if you have multiple OFED installations, can you make sure that
> bt.A.4 is linked with the correct OFED libraries?
>
> BTW, BLCR 0.5.6 is a bit old. We would suggest you install the newest
> version, 0.6.1, although your problem does not seem to be related to the
> BLCR version.
>
> Thanks
>
> Regards,
> Wei Huang
>
> 774 Dreese Lab, 2015 Neil Ave,
> Dept. of Computer Science and Engineering
> Ohio State University
> OH 43210
> Tel: (614)292-8501
>
>
> On Fri, 9 Nov 2007, Xie Min wrote:
>
> > Hi,
> >     We have an InfiniBand cluster and use mvapich2-1.0.1 on it.
> > Recently I wanted to try the checkpoint/restart functions, but I ran
> > into a problem at the startup of MPI programs. I am using NPB 3.2;
> > below is the error output:
> >
> > [root at node2 tmp]# /usr/local/mvapich2-ckpt/bin/mpiexec -l -np 4 ./bt.A.4
> > 1: Fatal error in MPI_Init:
> > 1: Other MPI error, error stack:
> > 1: MPIR_Init_thread(259)...: Initialization failed
> > 0: Fatal error in MPI_Init:
> > 0: Other MPI error, error stack:
> > 0: MPIR_Init_thread(259)...: Initialization failed
> > 0: MPID_Init(102)..........: channel initialization failed
> > 0: MPIDI_CH3_Init(178).....:
> > 0: MPIDI_CH3I_CM_Init(1034): MPICM_Init_UD
> > 0: MPICM_Init_UD(900)......: Couldn't create completion channel
> > 1: MPID_Init(102)..........: channel initialization failed
> > 1: MPIDI_CH3_Init(178).....:
> > 1: MPIDI_CH3I_CM_Init(1034): MPICM_Init_UD
> > 1: MPICM_Init_UD(900)......: Couldn't create completion channel
> > rank 1 in job 6  node2_33394   caused collective abort of all ranks
> >   exit status of rank 1: killed by signal 9
> > rank 0 in job 6  node2_33394   caused collective abort of all ranks
> >   exit status of rank 0: killed by signal 9
> >
> >
> > We use BLCR 0.5.6 as the checkpointer and the OFED-1.2 stack; the
> > nodes in our cluster are based on Intel Xeon EM64T.
> >
> > I compiled a new mvapich2 without checkpoint/restart in it, and it
> > does not hit the error above.
> >
> > How can I solve this problem?
> >
> > Thanks.
> >
> > XM
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >
>
>

