[mvapich-discuss] Problem with mvapich2 + blcr
Raghu
rajachan at cse.ohio-state.edu
Fri Oct 31 08:48:08 EDT 2014
Hi Ramy,
For #1, the MVAPICH2_Sync_Checkpoint() call is exposed via mpi.h itself, so
you need not include another header, or link to any other library. I see
that you are installing to a non-standard prefix, but later using the
default mpicc in your path when running the test. Are you exporting $PATH
and $LD_LIBRARY_PATH appropriately? Can you send me the output of the
following:
$ mpiname
and
$ /home/gad/install_mvapich2-2.1a/bin/mpiname
I am guessing that your test run is picking up the system's default build
of MVAPICH2, which might not have been configured with checkpointing
support.
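A minimal sketch of what exporting the paths for the prefix install might look like, assuming a Bourne-style shell and that the libraries sit under lib/ in the prefix. The helper name use_mvapich2_prefix is hypothetical and only for illustration. Note that the LD_LIBRARY_PATH quoted below lists install_mvapich2-2.0/lib, not the 2.1a prefix.

```shell
# Hypothetical helper (not part of MVAPICH2): put one install prefix first
# on the executable and shared-library search paths.
use_mvapich2_prefix() {
    mv2_prefix=$1
    PATH="$mv2_prefix/bin${PATH:+:$PATH}"
    LD_LIBRARY_PATH="$mv2_prefix/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export PATH LD_LIBRARY_PATH
}

# Prefix taken from the configure line in the original report.
use_mvapich2_prefix /home/gad/install_mvapich2-2.1a

# Show which wrappers the shell resolves now.
for tool in mpicc mpiname; do
    command -v "$tool" || echo "$tool: not found on PATH"
done
```

If the printed paths do not start with the prefix, an earlier entry or a shell alias is probably still taking precedence.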
For #2, I quickly tested it locally, and things work as expected. I have a
feeling it could be related to the above point I made about using a build
which does not have this support enabled. Can you retry your runs
explicitly using the install in your prefix?
$ /home/gad/install_mvapich2-2.1a/bin/mpicc testcp.cpp -o testcp
$ /home/gad/install_mvapich2-2.1a/bin/mpiexec.hydra.....
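One way to verify which build is in use is to look for the checkpointing flag in the configure options that `mpiname -a` reports. The following is only a sketch: the helper name has_ckpt_flag is hypothetical, and it assumes the `--enable-ckpt` flag from the configure line in the original report appears verbatim in that output.

```shell
# Hypothetical helper: reads configuration text on stdin and reports
# whether the --enable-ckpt configure flag appears in it.
has_ckpt_flag() {
    if grep -q -e '--enable-ckpt'; then
        echo "checkpointing enabled"
    else
        echo "checkpointing flag not found"
    fi
}

# Path taken from the prefix in the original report; adjust as needed.
MPINAME=/home/gad/install_mvapich2-2.1a/bin/mpiname
if [ -x "$MPINAME" ]; then
    "$MPINAME" -a | has_ckpt_flag
else
    echo "$MPINAME not found; adjust the path"
fi
```

Running the same check against the plain `mpiname` on your path shows whether the default build differs from the one in the prefix.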
If case #2 fails even after you verify that you are using the correct
install, can you try it once with mpirun_rsh (MVAPICH2's recommended
launcher)?
Raghu
On Fri, Oct 31, 2014 at 3:53 AM, Gad, Ramy <gad at uni-mainz.de> wrote:
> Hi,
>
>
> I have installed MVAPICH2 2.0 and 2.1a with the following configuration:
>
>
> ====
>
> ./configure --prefix=/home/gad/install_mvapich2-2.1a
> --with-ib-libpath=/global/packages/libibverbs-pd/lib --enable-ckpt
> --with-blcr=/opt/blcr --enable-checkpointing --with-hydra-ckpointlib=blcr
>
> ====
>
>
> I have BLCR installed on my system and its kernel modules are loaded.
>
>
> ====
>
> gad at pandora1:/home/gad$ lsmod | grep blcr
> blcr 115465 0
> blcr_imports 10683 1 blcr
> gad at pandora1:/home/gad$
> gad at pandora1:/home/gad$ echo $LD_LIBRARY_PATH
> :/home/gad/install_mvapich2-2.0/lib:/opt/blcr/lib
> ====
>
>
> The problems are:
>
>
> 1- While compiling a program that uses application-initiated synchronous
> checkpointing (via MVAPICH2_Sync_Checkpoint()), I get the following error
> message: undefined reference to `MVAPICH2_Sync_Checkpoint'. Is there a
> header file I need to include, or a library I need to link against?
>
>
>
> ====
>
> gad at pandora1:/home/gad/mvapich2-2.1a_test$ cat testcp.cpp
> #include "mpi.h"
> #include <unistd.h>
> #include <stdio.h>
>
>
>
> int main(int argc,char *argv[])
> {
> MPI_Init(&argc,&argv);
> printf("Computation\n");
> sleep(5);
> MPI_Barrier(MPI_COMM_WORLD);
> MVAPICH2_Sync_Checkpoint();
> MPI_Barrier(MPI_COMM_WORLD);
> printf("Computation\n");
> sleep(5);
> MPI_Finalize();
> return 0;
> }
> gad at pandora1:/home/gad/mvapich2-2.1a_test$ mpicc testcp.cpp -o testcp
> testcp.cpp: In function ‘int main(int, char**)’:
> testcp.cpp:13: error: ‘MVAPICH2_Sync_Checkpoint’ was not declared in this
> scope
> ====
>
>
>
> 2- When I try to checkpoint an MPI program with cr_checkpoint and
> restore it with cr_restart, I get the following error:
>
> ====
>
> gad at pandora1:/home/gad$ cr_checkpoint -p 2120
> // 2120 is the PID of the mpirun process
>
> gad at pandora1:/home/gad$ cr_restart context.2120
> [mpiexec at pandora1] HYDT_dmxu_poll_wait_for_event
> (tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN &
> ~POLLOUT & ~POLLHUP & ~POLLERR)) failed
> [mpiexec at pandora1] HYD_pmci_wait_for_completion
> (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
> [mpiexec at pandora1] main (ui/mpich/mpiexec.c:336): process manager error
> waiting for completion
>
> ====
>
>
> I can see that a context file, "context.2120", is generated only for the
> mpirun process; no context files are generated for the MPI
> processes.
>
> Could you please help me get MVAPICH2 checkpointing working with
> BLCR?
>
>
> Best Regards,
>
> Ramy Gad
> Johannes Gutenberg - Universität Mainz
> Zentrums für Datenverarbeitung (ZDV)
>
> Anselm-Franz-von-Bentzel-Weg 12
> 55128 Mainz
> Germany
> E-Mail: gad at uni-mainz.de
> Office Phone: +49-6131-39-26437
>
>