[mvapich-discuss] question about checkpoint/restart

Xie Min xmxmxie at gmail.com
Sat Nov 10 02:08:11 EST 2007


Hi,
    I compiled a new mvapich2 with these settings:
             ENABLE_CKPT=no
             PTMALLOC=yes
             RDMA_CM_SUPPORT=no

    and ran bt.A.4 using your method, but I still hit the same problem.
Below I attach some more information; I hope it will be helpful.

[root@node1 tmp]# ldd ./bt.A.4
        libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00002aaaaabc1000)
        libibumad.so.1 => /usr/lib64/libibumad.so.1 (0x00002aaaaaccd000)
        libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x00002aaaaadd7000)
        libm.so.6 => /lib64/tls/libm.so.6 (0x00002aaaaaeec000)
        libc.so.6 => /lib64/tls/libc.so.6 (0x00002aaaab073000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002aaaab2a7000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaab3b2000)
        libibcommon.so.1 => /usr/lib64/libibcommon.so.1 (0x00002aaaab4b6000)
        /lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)
[root@node1 tmp]# rpm -qf /usr/lib64/libibverbs.so.1
libibverbs-1.1.1-0
[root@node1 tmp]# rpm -qi libibverbs-1.1.1-0
Name        : libibverbs                   Relocations: (not relocatable)
Version     : 1.1.1                             Vendor: OpenFabrics
Release     : 0                             Build Date: Mon Sep 24 22:36:37 2007
Install Date: Mon Sep 24 23:31:38 2007      Build Host: star99
Group       : System Environment/Libraries   Source RPM: ofa_user-1.2-0.src.rpm
Size        : 201033                           License: GPL/BSD
Signature   : (none)
URL         : http://www.openfabrics.org/
Summary     : A library for direct userspace use of InfiniBand
Description :
libibverbs is a library that allows userspace processes to use
InfiniBand "verbs" as described in the InfiniBand Architecture
Specification.  This includes direct hardware access for fast path
operations.
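
The ldd and rpm -qf check above covers only libibverbs. A minimal sketch
for repeating it over every InfiniBand library the binary links against
is shown here. The helper name ib_lib_paths is hypothetical, and the
sketch assumes ldd prints lines in the "name => path (address)" form
seen in the transcript.

```shell
#!/bin/sh
# ib_lib_paths: hypothetical helper, not part of OFED or mvapich2.
# Reads ldd-style output on stdin and prints the resolved path of every
# library whose name starts with "libib" (libibverbs, libibumad, ...).
ib_lib_paths() {
    awk '$1 ~ /^libib/ && $2 == "=>" { print $3 }'
}
```

Assuming an RPM-based system, "ldd ./bt.A.4 | ib_lib_paths | xargs rpm -qf"
would then list the package that owns each of those libraries, which
makes a mix of OFED installations easier to spot.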


[root@node1 tmp]# /usr/local/mvapich2-ckpt/bin/mpiexec -l -np 4 -env MV2_ON_DEMAND_THRESHOLD 1 ./bt.A.4
2: Fatal error in MPI_Init:
2: Other MPI error, error stack:
0: Fatal error in MPI_Init:
0: Other MPI error, error stack:
0: MPIR_Init_thread(259)...: Initialization failed
0: MPID_Init(102)..........: channel initialization failed
0: MPIDI_CH3_Init(178).....:
0: MPIDI_CH3I_CM_Init(1034): MPICM_Init_UD
0: MPICM_Init_UD(900)......: Couldn't create completion channel
2: MPIR_Init_thread(259)...: Initialization failed
2: MPID_Init(102)..........: channel initialization failed
2: MPIDI_CH3_Init(178).....:
2: MPIDI_CH3I_CM_Init(1034): MPICM_Init_UD
2: MPICM_Init_UD(900)......: Couldn't create completion channel
rank 0 in job 1  node1_43661   caused collective abort of all ranks
  exit status of rank 0: killed by signal 9
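
Because mpiexec -l prefixes each line with its rank, the interleaved
output from ranks 0 and 2 above can be regrouped per rank. This is only
a sketch: group_by_rank is a hypothetical helper name, and it assumes a
sort that supports the stable -s option (GNU and BSD sort do).

```shell
#!/bin/sh
# group_by_rank: hypothetical helper for reading "mpiexec -l" output.
# Keeps only lines that start with "<rank>: " and sorts them by rank.
# The stable sort (-s) preserves each rank's original line order.
group_by_rank() {
    grep -E '^[0-9]+: ' | sort -s -t: -k1,1n
}
```

For example, appending "2>&1 | group_by_rank" to the mpiexec command
would print the full error stack of rank 0 followed by that of rank 2,
dropping the unprefixed job summary lines.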


But when I run "/usr/local/mvapich2-ckpt/bin/mpiexec -l -np 4 ./bt.A.4"
(without -env MV2_ON_DEMAND_THRESHOLD 1), it works.

XM


2007/11/9, wei huang <huanwei at cse.ohio-state.edu>:
> Hi Xie,
>
> It looks to me that your environment is not set up properly. Can you run
> normal mvapich2 (without CR) using:
>
> mpiexec -np 4 -env MV2_ON_DEMAND_THRESHOLD 1 ./bt.A.4
>
> Also, if you have multiple OFED installations, can you make sure that
> bt.A.4 is linked with the correct OFED libraries?
>
> BTW, blcr 0.5.6 is a bit old. We would suggest you install the newest
> version, 0.6.1, although your problem does not seem to be related to the
> blcr version.
>
> Thanks
>
> Regards,
> Wei Huang
>
> 774 Dreese Lab, 2015 Neil Ave,
> Dept. of Computer Science and Engineering
> Ohio State University
> OH 43210
> Tel: (614)292-8501
>
>

