[mvapich-discuss] question about checkpoint/restart
Xie Min
xmxmxie at gmail.com
Sat Nov 10 02:08:11 EST 2007
Hi,
I compiled a new mvapich2 with these settings:
ENABLE_CKPT=no
PTMALLOC=yes
RDMA_CM_SUPPORT=no
and ran bt.A.4 using your method, but I still hit the problem.
Below I attach some more information; I hope it is helpful.
[root at node1 tmp]# ldd ./bt.A.4
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00002aaaaabc1000)
libibumad.so.1 => /usr/lib64/libibumad.so.1 (0x00002aaaaaccd000)
libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x00002aaaaadd7000)
libm.so.6 => /lib64/tls/libm.so.6 (0x00002aaaaaeec000)
libc.so.6 => /lib64/tls/libc.so.6 (0x00002aaaab073000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002aaaab2a7000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaab3b2000)
libibcommon.so.1 => /usr/lib64/libibcommon.so.1 (0x00002aaaab4b6000)
/lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)
[root at node1 tmp]# rpm -qf /usr/lib64/libibverbs.so.1
libibverbs-1.1.1-0
[root at node1 tmp]# rpm -qi libibverbs-1.1.1-0
Name : libibverbs Relocations: (not relocatable)
Version : 1.1.1 Vendor: OpenFabrics
Release : 0 Build Date: Mon Sep 24 22:36:37 2007
Install Date: Mon Sep 24 23:31:38 2007 Build Host: star99
Group : System Environment/Libraries Source RPM: ofa_user-1.2-0.src.rpm
Size : 201033 License: GPL/BSD
Signature : (none)
URL : http://www.openfabrics.org/
Summary : A library for direct userspace use of InfiniBand
Description :
libibverbs is a library that allows userspace processes to use
InfiniBand "verbs" as described in the InfiniBand Architecture
Specification. This includes direct hardware access for fast path
operations.
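For anyone repeating the ldd/rpm check above on a node with several OFED installations, here is a minimal sketch that pulls the InfiniBand-related libraries out of ldd output and lists the directories they resolve to. The function names ib_libs and ib_lib_dirs and the list of library name prefixes are assumptions made for illustration; they are not part of mvapich2 or OFED.

```shell
#!/bin/sh
# Sketch only: helper names and the library-name pattern are illustrative assumptions.

# Read `ldd` output on stdin; print the resolved path of each IB-related library.
ib_libs() {
    awk '$2 == "=>" && $1 ~ /^lib(ibverbs|ibumad|ibcommon|rdmacm)\.so/ { print $3 }'
}

# Read library paths on stdin; print the distinct directories they live in.
ib_lib_dirs() {
    while read -r path; do
        dirname "$path"
    done | sort -u
}

# Example (binary name taken from this thread):
#   ldd ./bt.A.4 | ib_libs | ib_lib_dirs
```

If more than one directory is printed, the binary may be picking up libraries from different installations. Piping the output of ib_libs into "xargs rpm -qf" shows which package owns each file, as done by hand above.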
[root at node1 tmp]# /usr/local/mvapich2-ckpt/bin/mpiexec -l -np 4 -env MV2_ON_DEMAND_THRESHOLD 1 ./bt.A.4
2: Fatal error in MPI_Init:
2: Other MPI error, error stack:
0: Fatal error in MPI_Init:
0: Other MPI error, error stack:
0: MPIR_Init_thread(259)...: Initialization failed
0: MPID_Init(102)..........: channel initialization failed
0: MPIDI_CH3_Init(178).....:
0: MPIDI_CH3I_CM_Init(1034): MPICM_Init_UD
0: MPICM_Init_UD(900)......: Couldn't create completion channel
2: MPIR_Init_thread(259)...: Initialization failed
2: MPID_Init(102)..........: channel initialization failed
2: MPIDI_CH3_Init(178).....:
2: MPIDI_CH3I_CM_Init(1034): MPICM_Init_UD
2: MPICM_Init_UD(900)......: Couldn't create completion channel
rank 0 in job 1 node1_43661 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
But when I run "/usr/local/mvapich2-ckpt/bin/mpiexec -l -np 4
./bt.A.4" (without MV2_ON_DEMAND_THRESHOLD), it works fine.
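In the failing run, the -l option labels every output line with its rank, so the error stacks of ranks 0 and 2 come out interleaved. A small sketch for separating them follows; the function name rank_output is a hypothetical helper, not an mpiexec feature.

```shell
#!/bin/sh
# Sketch only: rank_output is an illustrative helper name.

# Read `mpiexec -l` output on stdin; print only the lines labelled with
# rank $1, with the leading "N: " label stripped.
rank_output() {
    awk -v r="$1" 'index($0, r ": ") == 1 { print substr($0, length(r) + 3) }'
}

# Example (command taken from this thread):
#   mpiexec -l -np 4 -env MV2_ON_DEMAND_THRESHOLD 1 ./bt.A.4 2>&1 | rank_output 0
```

Lines without a rank label, such as the final "caused collective abort" message, are dropped by this filter.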
XM
2007/11/9, wei huang <huanwei at cse.ohio-state.edu>:
> Hi Xie,
>
> It looks to me that your environment is not set up properly. Can you run
> normal mvapich2 (without CR) using:
>
> mpiexec -np 4 -env MV2_ON_DEMAND_THRESHOLD 1 ./bt.A.4
>
> Also, if you have multiple ofed installations, can you make sure that
> bt.A.4 is linked with the correct ofed libraries?
>
> BTW, blcr 0.5.6 is a bit old. We would suggest you install the newest
> version, 0.6.1, although your problem does not seem to be related to the
> blcr version.
>
> Thanks
>
> Regards,
> Wei Huang
>
> 774 Dreese Lab, 2015 Neil Ave,
> Dept. of Computer Science and Engineering
> Ohio State University
> OH 43210
> Tel: (614)292-8501
>
>