[mvapich-discuss] question about checkpoint/restart

Xie Min xmxmxie at gmail.com
Thu Nov 8 21:46:44 EST 2007


Hi,
    We have an InfiniBand cluster running mvapich2-1.0.1. I recently
tried the checkpoint/restart support, but MPI programs now fail at
startup in MPI_Init. I am running NPB 3.2; the error output is shown
below:

[root@node2 tmp]# /usr/local/mvapich2-ckpt/bin/mpiexec -l -np 4 ./bt.A.4
1: Fatal error in MPI_Init:
1: Other MPI error, error stack:
1: MPIR_Init_thread(259)...: Initialization failed
0: Fatal error in MPI_Init:
0: Other MPI error, error stack:
0: MPIR_Init_thread(259)...: Initialization failed
0: MPID_Init(102)..........: channel initialization failed
0: MPIDI_CH3_Init(178).....:
0: MPIDI_CH3I_CM_Init(1034): MPICM_Init_UD
0: MPICM_Init_UD(900)......: Couldn't create completion channel
1: MPID_Init(102)..........: channel initialization failed
1: MPIDI_CH3_Init(178).....:
1: MPIDI_CH3I_CM_Init(1034): MPICM_Init_UD
1: MPICM_Init_UD(900)......: Couldn't create completion channel
rank 1 in job 6  node2_33394   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9
rank 0 in job 6  node2_33394   caused collective abort of all ranks
  exit status of rank 0: killed by signal 9


We use BLCR 0.5.6 as the checkpointer and the OFED-1.2 stack. The nodes
in our cluster are based on Intel Xeon EM64T.

I also built mvapich2 without checkpoint/restart support, and that
build does not produce the error above.

How can I resolve this problem?

Thanks.

XM

