[mvapich-discuss] question about checkpoint/restart

wei huang huanwei at cse.ohio-state.edu
Thu Nov 8 22:05:19 EST 2007


Hi Xie,

It looks to me like your environment is not set up properly. Can you run
normal mvapich2 (without CR) using:

mpiexec -np 4 -env MV2_ON_DEMAND_THRESHOLD 1 ./bt.A.4

Also, if you have multiple OFED installations, can you make sure that
bt.A.4 is linked with the correct OFED libraries?
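
One way to check the linkage is to ask the dynamic loader which libraries the binary resolves to. The following is only a rough sketch: the helper name show_ib_libs is made up for illustration, and the library names in the pattern (libibverbs, librdmacm, libibumad) are the usual OFED user-space libraries, which may not all be present on your system.

```shell
#!/bin/sh
# Hypothetical helper (not part of mvapich2 or OFED): print the
# InfiniBand-related shared libraries a binary resolves to at run time.
show_ib_libs() {
    binary="$1"
    if [ ! -f "$binary" ]; then
        echo "no such file: $binary" >&2
        return 2
    fi
    # ldd prints each shared library together with the path the loader picks.
    # The library names below are an assumption about a typical OFED install.
    ldd "$binary" 2>/dev/null | grep -E 'libibverbs|librdmacm|libibumad' \
        || echo "no InfiniBand libraries found in $binary"
}

# Usage, with the benchmark binary from this thread:
#   show_ib_libs ./bt.A.4
```

If the paths printed point into a different OFED installation than the one mvapich2 was built against, that mismatch would be worth fixing first.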

BTW, BLCR 0.5.6 is a bit old. We would suggest you install the newest
version, 0.6.1, although your problem does not seem to be related to the
BLCR version.

Thanks

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501


On Fri, 9 Nov 2007, Xie Min wrote:

> hi,
>     We have an InfiniBand cluster and use mvapich2-1.0.1 on it.
> Recently I wanted to try the checkpoint/restart functions, but I ran into a
> problem at the startup of MPI programs. I am using NPB 3.2; below is
> the error output:
>
> [root@node2 tmp]# /usr/local/mvapich2-ckpt/bin/mpiexec -l -np 4 ./bt.A.4
> 1: Fatal error in MPI_Init:
> 1: Other MPI error, error stack:
> 1: MPIR_Init_thread(259)...: Initialization failed
> 0: Fatal error in MPI_Init:
> 0: Other MPI error, error stack:
> 0: MPIR_Init_thread(259)...: Initialization failed
> 0: MPID_Init(102)..........: channel initialization failed
> 0: MPIDI_CH3_Init(178).....:
> 0: MPIDI_CH3I_CM_Init(1034): MPICM_Init_UD
> 0: MPICM_Init_UD(900)......: Couldn't create completion channel
> 1: MPID_Init(102)..........: channel initialization failed
> 1: MPIDI_CH3_Init(178).....:
> 1: MPIDI_CH3I_CM_Init(1034): MPICM_Init_UD
> 1: MPICM_Init_UD(900)......: Couldn't create completion channel
> rank 1 in job 6  node2_33394   caused collective abort of all ranks
>   exit status of rank 1: killed by signal 9
> rank 0 in job 6  node2_33394   caused collective abort of all ranks
>   exit status of rank 0: killed by signal 9
>
>
> We use BLCR 0.5.6 as the checkpointer and the OFED-1.2 stack; the nodes in
> our cluster are based on Intel Xeon EM64T.
>
> I recompiled a new mvapich2 without checkpoint/restart in it, and
> did not hit the error above.
>
> How can I resolve this problem?
>
> Thanks.
>
> XM
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>


