[mvapich-discuss] question about checkpoint/restart
wei huang
huanwei at cse.ohio-state.edu
Thu Nov 8 22:05:19 EST 2007
Hi Xie,
It looks to me like your environment is not set up properly. Can you run
the normal mvapich2 build (without CR) using:
mpiexec -np 4 -env MV2_ON_DEMAND_THRESHOLD 1 ./bt.A.4
Also, if you have multiple OFED installations, can you make sure that
bt.A.4 is linked with the correct OFED libraries?
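
One way to check the linkage is to look at what the dynamic linker resolves for the binary. The following is only a minimal sketch: the function names are made up for illustration, and the library name pattern (libibverbs, librdmacm, libibumad) is an assumption about which OFED user-space libraries matter on your system.

```shell
#!/bin/sh
# Hypothetical helpers (not part of mvapich2 or OFED) for checking which
# InfiniBand user-space libraries a binary is linked against.

# Keep only the InfiniBand-related lines of ldd-style output read from stdin.
# Assumption: the libraries of interest are libibverbs, librdmacm and
# libibumad; adjust the pattern for your installation.
filter_ib_libs() {
    grep -E 'libibverbs|librdmacm|libibumad'
}

# Print the resolved InfiniBand libraries for the executable given as $1.
# Prints nothing (and returns non-zero) if none are found.
list_ib_libs() {
    ldd "$1" 2>/dev/null | filter_ib_libs
}

# Example invocation, run from the directory that holds the benchmark:
#   list_ib_libs ./bt.A.4
```

If the paths printed do not belong to the OFED installation that mvapich2 was built against, that would point to a library mix-up.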
BTW, BLCR 0.5.6 is a bit old. We would suggest you install the newest
version, 0.6.1, although your problem does not seem to be related to the
BLCR version.
Thanks
Regards,
Wei Huang
774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501
On Fri, 9 Nov 2007, Xie Min wrote:
> hi,
> We have an infiniband cluster and use mvapich2-1.0.1 on it.
> Recently I wanted to try the checkpoint/restart functions, but I ran into a
> problem at the startup of MPI programs. I am using NPB 3.2; below is
> the error output:
>
> [root@node2 tmp]# /usr/local/mvapich2-ckpt/bin/mpiexec -l -np 4 ./bt.A.4
> 1: Fatal error in MPI_Init:
> 1: Other MPI error, error stack:
> 1: MPIR_Init_thread(259)...: Initialization failed
> 0: Fatal error in MPI_Init:
> 0: Other MPI error, error stack:
> 0: MPIR_Init_thread(259)...: Initialization failed
> 0: MPID_Init(102)..........: channel initialization failed
> 0: MPIDI_CH3_Init(178).....:
> 0: MPIDI_CH3I_CM_Init(1034): MPICM_Init_UD
> 0: MPICM_Init_UD(900)......: Couldn't create completion channel
> 1: MPID_Init(102)..........: channel initialization failed
> 1: MPIDI_CH3_Init(178).....:
> 1: MPIDI_CH3I_CM_Init(1034): MPICM_Init_UD
> 1: MPICM_Init_UD(900)......: Couldn't create completion channel
> rank 1 in job 6 node2_33394 caused collective abort of all ranks
> exit status of rank 1: killed by signal 9
> rank 0 in job 6 node2_33394 caused collective abort of all ranks
> exit status of rank 0: killed by signal 9
>
>
> We use BLCR 0.5.6 as the checkpointer and the OFED-1.2 stack; the nodes in
> our cluster are based on Intel Xeon EM64T.
>
> I recompiled a new mvapich2 without checkpoint/restart support, and
> did not hit the error above.
>
> How can I resolve this problem?
>
> Thanks.
>
> XM
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>