[mvapich-discuss] [BUG] BLCR startup hangs
Maksym Planeta
mplaneta at os.inf.tu-dresden.de
Tue Aug 18 12:28:30 EDT 2015
Hello,
I found a bug in the MVAPICH startup process with C/R enabled. It turns
out that -ckpoint-prefix is a required argument. If I do not specify it,
the application hangs and never finishes. If this argument is required,
hydra should fail with a proper error message, but it doesn't.
The reason is that two different processes are involved: the one that
interacts with the user (mpiexec) and another one (hydra_pmi_proxy),
which is responsible for checkpointing (among other things).
When hydra_pmi_proxy finds out that the parameter is not specified, it
prints an error message on the console (I saw it in gdb) and effectively
stops. mpiexec is not aware of that and waits for a response from the
proxy, which is never delivered.
$ mpiname -a
MVAPICH2 2.1 Fri Apr 03 20:00:00 EDT 2015 ch3:nemesis
Compilation
CC: gcc -DNDEBUG -DNVALGRIND -O2
CXX: g++ -DNDEBUG -DNVALGRIND -O2
F77: gfortran -O2
FC: gfortran -O2
Configuration
--prefix=/home/planeta/opt/apps/mvapich/2.1-blcr --enable-fortran=all
--with-device=ch3:nemesis:ib --enable-checkpointing
Bug triggered:
$ mpiexec -ckpoint-interval 10 -np 2 -hosts 172.31.128.50,172.31.128.51 osu_bw
# OSU MPI Bandwidth Test
# Size Bandwidth (MB/s)
<hangs here>
After some time it shows: [mpiexec@os-dhcp017] No checkpoint prefix provided
"Normal" work:
$ mpiexec -ckpoint-interval 10 -ckpoint-prefix /home/planeta/opt/chkpt -np 2 -hosts 172.31.128.50,172.31.128.51 osu_bw
# OSU MPI Bandwidth Test
# Size Bandwidth (MB/s)
1 3.08
2 6.19
<runs until the end>
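
Until hydra rejects the incomplete command line itself, the check can be
done before mpiexec is started. The following is only a minimal sketch of
such a guard, not something shipped with MVAPICH2 or hydra: the function
name check_ckpoint_args is hypothetical, and it assumes that the prefix is
needed only when -ckpoint-interval is given. It scans every argument, so an
application argument that happens to use the same spelling would also be
matched.

```sh
#!/bin/sh
# Hypothetical helper (not part of MVAPICH2 or hydra).
# Returns 1 if -ckpoint-interval is present without -ckpoint-prefix,
# 0 otherwise. Assumption: the prefix is only required when an
# interval is requested.
check_ckpoint_args() {
    have_interval=0
    have_prefix=0
    for arg in "$@"; do
        case "$arg" in
            -ckpoint-interval) have_interval=1 ;;
            -ckpoint-prefix)   have_prefix=1 ;;
        esac
    done
    if [ "$have_interval" -eq 1 ] && [ "$have_prefix" -eq 0 ]; then
        echo "error: -ckpoint-interval given without -ckpoint-prefix" >&2
        return 1
    fi
    return 0
}

# Possible use inside a wrapper script: validate first, then pass the
# unchanged arguments on to mpiexec.
# check_ckpoint_args "$@" && exec mpiexec "$@"
```

With this guard the first command above would fail immediately with an
error message instead of hanging, and the second one would be passed
through unchanged.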
--
Regards,
Maksym Planeta