[mvapich-discuss] [BUG] BLCR startup hangs

Maksym Planeta mplaneta at os.inf.tu-dresden.de
Tue Aug 18 12:28:30 EDT 2015


Hello,

I found a bug in the MVAPICH startup process with C/R enabled. It turns 
out that -ckpoint-prefix is a required argument. If I do not specify it, 
the application hangs and never finishes. If this argument is required, 
hydra should fail with a proper error message, but it doesn't.
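
To illustrate the kind of up-front check that would turn the hang into an immediate error, here is a minimal sketch in Python. It is not Hydra's code. The function name `missing_ckpoint_prefix` and the error text are hypothetical, and the sketch assumes that checkpointing counts as requested whenever another -ckpoint-* option (such as -ckpoint-interval) is present, which is an interpretation of the report rather than confirmed behavior.

```python
def missing_ckpoint_prefix(argv):
    """Return an error string if checkpointing is requested without a
    usable -ckpoint-prefix value, otherwise return None.

    Hypothetical helper for illustration only; not part of Hydra or MVAPICH.
    """
    # Assumption: any -ckpoint-* option other than -ckpoint-prefix
    # means the user wants checkpoint/restart.
    wants_ckpoint = any(
        arg.startswith("-ckpoint-") and arg != "-ckpoint-prefix"
        for arg in argv
    )

    # The prefix counts as given only if the flag is followed by a value.
    has_prefix = False
    for i, arg in enumerate(argv):
        if arg == "-ckpoint-prefix" and i + 1 < len(argv):
            if not argv[i + 1].startswith("-"):
                has_prefix = True

    if wants_ckpoint and not has_prefix:
        return "error: -ckpoint-prefix is required when checkpointing is enabled"
    return None
```

A launcher wrapper could call this on its argument list before starting any processes and exit with the returned message instead of launching a job that will never finish.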

The reason is that two different processes are involved: the one which 
interacts with the user (mpiexec) and another one (hydra_pmi_proxy), 
which is responsible for checkpointing, among other things.

When hydra_pmi_proxy finds out that the parameter is not specified, it 
prints an error message on the console (I saw it in gdb) and effectively 
stops. mpiexec is not aware of that and waits for a response from the 
proxy, which is never delivered.
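
The failure mode described above (a parent blocking forever on a child that has already given up) can be modeled with a small, self-contained Python sketch. This is only an illustration of the pattern, not how mpiexec and hydra_pmi_proxy actually communicate. The function name `run_proxy`, the status strings, and the timeout value are all hypothetical.

```python
import subprocess


def run_proxy(cmd, timeout_s=5.0):
    """Run a child process and report how it ended instead of waiting forever.

    Returns a tuple (status, returncode, message), where status is one of
    "ok", "failed", or "timeout". Illustrative only; not Hydra's protocol.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    try:
        # Bounded wait: the parent never blocks longer than timeout_s.
        out, err = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The child neither replied nor exited: stop it and report that.
        proc.kill()
        proc.communicate()  # reap the killed child
        return ("timeout", None, "")

    if proc.returncode != 0:
        # The child exited with an error: surface its message to the user.
        return ("failed", proc.returncode, err.strip())
    return ("ok", 0, out.strip())
```

The point of the sketch is that the parent handles both cases it currently misses: a child that exits with an error, and a child that stays alive but never answers.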

$ mpiname -a
MVAPICH2 2.1 Fri Apr 03 20:00:00 EDT 2015 ch3:nemesis

Compilation
CC: gcc    -DNDEBUG -DNVALGRIND -O2
CXX: g++   -DNDEBUG -DNVALGRIND -O2
F77: gfortran   -O2
FC: gfortran   -O2

Configuration
--prefix=/home/planeta/opt/apps/mvapich/2.1-blcr --enable-fortran=all 
--with-device=ch3:nemesis:ib --enable-checkpointing

Bug triggered:

$ mpiexec -ckpoint-interval 10 -np 2 -hosts 172.31.128.50,172.31.128.51 
osu_bw
# OSU MPI Bandwidth Test
# Size        Bandwidth (MB/s)
<hangs here>
Shows after some time: [mpiexec@os-dhcp017] No checkpoint prefix provided

"Normal" run, with the prefix specified:
$ mpiexec -ckpoint-interval 10 -ckpoint-prefix /home/planeta/opt/chkpt 
-np 2 -hosts 172.31.128.50,172.31.128.51 osu_bw
# OSU MPI Bandwidth Test
# Size        Bandwidth (MB/s)
1                         3.08
2                         6.19
<runs until the end>


-- 
Regards,
Maksym Planeta

