[mvapich-discuss] problem on mvapich2-1.8.1 checkpoint/restart with BLCR

Raghunath rajachan at cse.ohio-state.edu
Fri Feb 15 03:21:16 EST 2013


Suja,

>  /share/apps/mvapich2-1.8.1/bin/mpirun_rsh -np 8 -hostfile ./hostfile
> MV2_IBA_HCA=mlx4_0  MV2_USE_RDMAOE=1 MV2_DEFAULT_PORT=1
> MV2_CKPT_FILE=~rpmaps/checkpoint/scripts/mvapichckpt
> MV2_DEBUG_SHOW_BACKTRACE=1 ./vector
>
> Also, I have tried it by avoiding the options MV2_IBA_HCA=mlx4_0
> MV2_USE_RDMAOE=1 MV2_DEFAULT_PORT=1.
>
> The environment variables set are
> export PATH=/share/apps/mvapich2-1.8.1/bin/:$PATH
> export
> LD_LIBRARY_PATH=/share/apps/mvapich2-1.8.1/lib:/usr/local/lib:$LD_LIBRARY_PATH

Thanks for sending this information. Nothing in your command line or environment settings looks unusual.

> I have not yet installed the fault tolerance backplane(FTB). Is that
> mandatory for checkpoint/restart?

No, FTB is not required to get Checkpoint/Restart (C/R) working with
MVAPICH2. The only external library that the C/R mechanism strictly
requires is BLCR.
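Since BLCR is the one hard dependency, it can help to confirm it is actually usable on every node in the hostfile before debugging MVAPICH2 itself. The following is only a minimal sketch. It assumes BLCR's kernel modules carry their default names (`blcr` and `blcr_imports`) and that BLCR's `cr_checkpoint` utility should be on PATH. The function names `module_loaded` and `check_blcr` are hypothetical.

```shell
#!/bin/sh
# Sketch: check BLCR prerequisites on the local node.
# Assumption: BLCR kernel modules use their default names blcr_imports and blcr.

# Return 0 if a kernel module with the given name is listed in /proc/modules.
module_loaded() {
    grep -q "^$1 " /proc/modules 2>/dev/null
}

# Report each prerequisite; return non-zero if any is missing.
check_blcr() {
    status=0
    for mod in blcr_imports blcr; do
        if module_loaded "$mod"; then
            echo "ok: kernel module $mod is loaded"
        else
            echo "missing: kernel module $mod is not loaded"
            status=1
        fi
    done
    # BLCR's user-level tool; needed to request checkpoints.
    if command -v cr_checkpoint >/dev/null 2>&1; then
        echo "ok: cr_checkpoint found at $(command -v cr_checkpoint)"
    else
        echo "missing: cr_checkpoint is not on PATH"
        status=1
    fi
    return $status
}
```

Running `check_blcr` on each host (for example over ssh) shows quickly whether a node is missing the modules or the tools.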

> I will also try with mvapich2-1.9
>
> (FYI,'vector' is  the executable of the vector addition program given here:
> http://www.cs.umanitoba.ca/~comp4510/examplesDIR/vsum.c )

Thanks for this pointer. I was able to checkpoint this sample
application successfully as well (again, with MVAPICH2 built using the
same configure flags that you used).
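For reference, a checkpoint/restart cycle of a job like this one can be driven with BLCR's command-line tools. The sketch below makes three assumptions. First, the checkpoint is requested by running `cr_checkpoint -p` against the PID of the `mpirun_rsh` process. Second, BLCR writes its default context file, `context.<pid>`, in the current directory. Third, the checkpoint path `/tmp/ckpt/vector` is purely illustrative. The helper names `run`, `checkpoint_job` and `restart_job` are hypothetical. By default the helpers only print the command they would run.

```shell
#!/bin/sh
# Hypothetical helpers for a BLCR checkpoint/restart cycle of an mpirun_rsh job.
# By default they only print the command; set RUN=1 to actually execute it.

run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Checkpoint the whole job via the mpirun_rsh process (its PID is $1).
checkpoint_job() {
    run cr_checkpoint -p "$1"
}

# Restart from the context file BLCR wrote for that mpirun_rsh PID
# (assumes the default name context.<pid> in the current directory).
restart_job() {
    run cr_restart "context.$1"
}

# Example cycle on a node with BLCR and MVAPICH2 installed (illustrative paths):
#   mpirun_rsh -np 8 -hostfile ./hostfile MV2_CKPT_FILE=/tmp/ckpt/vector ./vector &
#   pid=$!
#   sleep 30
#   RUN=1 checkpoint_job "$pid"
#   # ...later, once the original job is no longer running:
#   RUN=1 restart_job "$pid"
```

The dry-run default makes it easy to confirm which PID and context file will be used before anything touches the running job.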

Do let us know if you continue to face issues even after upgrading to
MVAPICH2-1.9.

--
Raghu

