[mvapich-discuss] [BUG] BLCR startup hangs

Sourav Chakraborty chakraborty.52 at buckeyemail.osu.edu
Tue Aug 18 14:25:58 EDT 2015


Hi Maksym,

Thanks for the report.

The -ckpoint-prefix argument is required according to the Hydra user guide.
Please refer to it for more details.
https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager#Checkpoint.2FRestart_Support
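
In the meantime, a small pre-launch check in a job script can catch the missing argument before mpiexec is started. The following is only a minimal sketch: check_ckpoint_args is a hypothetical helper name, not part of Hydra or MVAPICH2, and it assumes the only case to reject is -ckpoint-interval given without -ckpoint-prefix (the two flags from the commands quoted below).

```shell
#!/bin/sh
# Hypothetical pre-launch guard; NOT part of Hydra or MVAPICH2.
# Returns non-zero if -ckpoint-interval is given without -ckpoint-prefix.
check_ckpoint_args() {
    have_interval=0
    have_prefix=0
    # Scan the argument list for the two checkpoint flags.
    for arg in "$@"; do
        case "$arg" in
            -ckpoint-interval) have_interval=1 ;;
            -ckpoint-prefix)   have_prefix=1 ;;
        esac
    done
    if [ "$have_interval" -eq 1 ] && [ "$have_prefix" -eq 0 ]; then
        echo "error: -ckpoint-interval given without -ckpoint-prefix" >&2
        return 1
    fi
    return 0
}

# Usage (inside a wrapper script): validate first, then launch unchanged.
# check_ckpoint_args "$@" && mpiexec "$@"
```

The check looks only at the literal flag names, so it would also match an application argument spelled the same way; for a job script that is probably acceptable.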

Thanks,
Sourav


On Tue, Aug 18, 2015 at 12:28 PM, Maksym Planeta <
mplaneta at os.inf.tu-dresden.de> wrote:

> Hello,
>
> I found a bug in the MVAPICH startup process with C/R enabled. It turns out
> that -ckpoint-prefix is a required argument. If I do not specify it, the
> application hangs and never finishes. If this argument is required, Hydra
> should fail with a proper error message, but it doesn't.
>
> The reason for that is that there are two different processes: the one
> which interacts with the user (mpiexec) and another one (hydra_pmi_proxy)
> which is responsible for checkpointing (not exclusively).
>
> When hydra_pmi_proxy finds out that the parameter is not specified, it
> prints an error message on the console (I saw it in gdb) and effectively
> stops. mpiexec is not aware of that and waits for a response from the
> proxy, which is never delivered.
>
> $ mpiname -a
> MVAPICH2 2.1 Fri Apr 03 20:00:00 EDT 2015 ch3:nemesis
>
> Compilation
> CC: gcc    -DNDEBUG -DNVALGRIND -O2
> CXX: g++   -DNDEBUG -DNVALGRIND -O2
> F77: gfortran   -O2
> FC: gfortran   -O2
>
> Configuration
> --prefix=/home/planeta/opt/apps/mvapich/2.1-blcr --enable-fortran=all
> --with-device=ch3:nemesis:ib --enable-checkpointing
>
> Bug triggered:
>
> $ mpiexec -ckpoint-interval 10 -np 2 -hosts 172.31.128.50,172.31.128.51 osu_bw
> # OSU MPI Bandwidth Test
> # Size        Bandwidth (MB/s)
> <hangs here>
> Shows after some time: [mpiexec@os-dhcp017] No checkpoint prefix provided
>
> "Normal" work:
> $ mpiexec -ckpoint-interval 10 -ckpoint-prefix /home/planeta/opt/chkpt -np 2 -hosts 172.31.128.50,172.31.128.51 osu_bw
> # OSU MPI Bandwidth Test
> # Size        Bandwidth (MB/s)
> 1                         3.08
> 2                         6.19
> <runs until the end>
>
>
> --
> Regards,
> Maksym Planeta