[mvapich-discuss] Problem with mvapich2 + blcr

Gad, Ramy gad at uni-mainz.de
Mon Nov 3 09:02:36 EST 2014


Hi Raghu,

Yes, I am exporting $PATH and $LD_LIBRARY_PATH.

I use:

export PATH="$PATH:/home/gad/install_mvapich2-2.1a/bin/"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/gad/install_mvapich2-2.1a/lib:/opt/blcr/lib"
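
Note that these exports append the install directories, so any earlier PATH entry that also provides mpicc or mpiname would take precedence, and an empty or unset LD_LIBRARY_PATH yields a leading ':' (an empty entry, which the dynamic loader treats as the current directory). A minimal sketch of prepending instead, assuming a POSIX shell; prepend_dir is a hypothetical helper name, not something MVAPICH2 or BLCR provides:

```shell
# prepend_dir DIR CURRENT: print CURRENT with DIR placed first,
# without producing an empty entry when CURRENT is empty.
# (Hypothetical helper, written for illustration only.)
prepend_dir() {
    dir=$1
    current=$2
    # Drop one leading ':' such as the one visible in the echoed LD_LIBRARY_PATH.
    current=${current#:}
    printf '%s\n' "$dir${current:+:$current}"
}

PREFIX=/home/gad/install_mvapich2-2.1a   # install prefix used in this thread
PATH=$(prepend_dir "$PREFIX/bin" "$PATH")
LD_LIBRARY_PATH=$(prepend_dir "$PREFIX/lib:/opt/blcr/lib" "${LD_LIBRARY_PATH:-}")
export PATH LD_LIBRARY_PATH
```

With the prefix first, a plain "mpicc" or "mpirun" resolves to this install even if another MPI is present in a system directory.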


This is the output you requested:

====
gad at pandora1:/home/gad/mpiblast$ which mpiname
~/install_mvapich2-2.1a/bin/mpiname
gad at pandora1:/home/gad/mpiblast$ mpiname
MVAPICH2
gad at pandora1:/home/gad/mpiblast$ ~/install_mvapich2-2.1a/bin/mpiname
MVAPICH2
gad at pandora1:/home/gad/mpiblast$ echo $PATH
/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/gad/.rvm/bin:/home/gad/install_mvapich2-2.1a/bin/:/home/gad/mpiblast/mpiblast-1.6.0_mpi4.5/bin
gad at pandora1:/home/gad/mpiblast$ echo $LD_LIBRARY_PATH
:/home/gad/install_mvapich2-2.1a/lib:/opt/blcr/lib
gad at pandora1:/home/gad/mpiblast$ cd ~/mvapich2-2.1a_test/
gad at pandora1:/home/gad/mvapich2-2.1a_test$ cat testcp.cpp
#include "mpi.h"
#include <unistd.h>
#include <stdio.h>



int main(int argc,char *argv[])
{
    MPI_Init(&argc,&argv);
    printf("Computation\n");
    sleep(5);
    MPI_Barrier(MPI_COMM_WORLD);
    MVAPICH2_Sync_Checkpoint();
    MPI_Barrier(MPI_COMM_WORLD);
    printf("Computation\n");
    sleep(5);
    MPI_Finalize();
    return 0;
}

gad at pandora1:/home/gad/mvapich2-2.1a_test$ ~/install_mvapich2-2.1a/bin/mpicc testcp.cpp -o testcp
testcp.cpp: In function ‘int main(int, char**)’:
testcp.cpp:13: error: ‘MVAPICH2_Sync_Checkpoint’ was not declared in this scope
====


I have also tested manual checkpointing with mpiexec.hydra and mpiexec as the launcher, but I get the same problem.



I have also tested automatic checkpointing; see the following.

======
gad at pandora1:/home/gad/mpiblast$ mpirun   -ckpointlib  blcr  -ckpoint-interval 60  -ckpoint-prefix /home/gad/cr_openmpi/    -n 4 mpiblast -d drosoph.nt -i melano500m.20059.fa -p blastn -o results.txt
[proxy:0:0 at pandora1] requesting checkpoint
[proxy:0:0 at pandora1] checkpoint completed
[proxy:0:0 at pandora1] requesting checkpoint
[proxy:0:0 at pandora1] checkpoint completed
[proxy:0:0 at pandora1] requesting checkpoint
[proxy:0:0 at pandora1] HYDT_ckpoint_checkpoint (tools/ckpoint/ckpoint.c:115): Previous checkpoint has not completed.[proxy:0:0 at pandora1] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:931): checkpoint suspend failed
[proxy:0:0 at pandora1] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0 at pandora1] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec at pandora1] control_cb (pm/pmiserv/pmiserv_cb.c:200): assert (!closed) failed
[mpiexec at pandora1] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[mpiexec at pandora1] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec at pandora1] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
gad at pandora1:/home/gad/mpiblast$ ls -lh  /home/gad/cr_openmpi/
total 90M
-rw------- 1 gad users 65M 31. Okt 17:41 context-num0-0-0
-rw------- 1 gad users 25M 31. Okt 17:43 context-num1-0-0
gad at pandora1:/home/gad/mpiblast$ cr_restart -f /home/gad/cr_openmpi/context-num1-0-0
- cr_load_file_info: Garbage in context file! (type=0)
- Error loading file_info.
- cr_rstrt_child [3011]:  Unable to restore files!  (err=-22)
Restart failed: Invalid argument
gad at pandora1:/home/gad/mpiblast$ cr_restart -f /home/gad/cr_openmpi/context-num0-0-0
=======

You can see that automatic checkpointing works for the first two checkpoints and then fails.

Restarting also failed.

Thank you for your time.

Best Regards,

Ramy


Hi Ramy,

For #1, the MVAPICH2_Sync_Checkpoint() call is exposed via mpi.h itself, so you need not include another header, or link to any other library. I see that you are installing to a non-standard prefix, but later using the default mpicc in your path when running the test. Are you exporting $PATH and $LD_LIBRARY_PATH appropriately? Can you send me the output of the following:

$ mpiname
and
$ /home/gad/install_mvapich2-2.1a/bin/mpiname

I am guessing that your test run is picking up the system's default build of MVAPICH2, which might not have been configured with checkpointing support.
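
One way to check that guess is to look at the header itself. The sketch below assumes the install keeps its header at <prefix>/include/mpi.h (the usual layout for a --prefix install, but not confirmed in this thread); header_mentions is a hypothetical helper, and it only reports whether the symbol name appears in the file, not whether the declaration is active for a given compile:

```shell
# header_mentions PREFIX SYMBOL: report whether PREFIX/include/mpi.h
# contains SYMBOL. The include/mpi.h location is an assumption.
# (Hypothetical helper, written for illustration only.)
header_mentions() {
    prefix=$1
    symbol=$2
    header="$prefix/include/mpi.h"
    if [ ! -f "$header" ]; then
        echo "no header: $header"
        return 2
    fi
    if grep -q "$symbol" "$header"; then
        echo "declared"
    else
        echo "missing"
        return 1
    fi
}

# Example with the prefix and symbol from this thread:
# header_mentions /home/gad/install_mvapich2-2.1a MVAPICH2_Sync_Checkpoint
```

If the result is "missing" for the prefix that mpicc actually uses, the "was not declared in this scope" error is expected regardless of how the program is linked.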

For #2, I quickly tested it locally, and things work as expected. I have a feeling it could be related to the above point I made about using a build which does not have this support enabled. Can you retry your runs explicitly using the install in your prefix?

$ /home/gad/install_mvapich2-2.1a/bin/mpicc test.cpp -o testcp
$  /home/gad/install_mvapich2-2.1a/bin/mpiexec.hydra.....
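
To confirm which headers a given wrapper really compiles against, the MPICH-derived compiler wrappers accept -show, which prints the underlying compiler command without running it. A small sketch that pulls the -I directories out of such a command line, assuming a POSIX shell and paths without spaces; include_dirs is a hypothetical helper:

```shell
# include_dirs CMDLINE: print each -I directory found in a compiler
# command line, one per line. Relies on word splitting, so paths
# containing spaces are not handled.
# (Hypothetical helper, written for illustration only.)
include_dirs() {
    for word in $1; do
        case $word in
            -I?*) printf '%s\n' "${word#-I}" ;;
        esac
    done
}

# Example with the wrapper from this thread:
# include_dirs "$(/home/gad/install_mvapich2-2.1a/bin/mpicc -show)"
```

The directories printed should sit under the install prefix; if they point elsewhere, the wrapper is picking up a different build's mpi.h.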

If case #2 fails even after you verify that you are using the correct install, can you try it once with mpirun_rsh (MVAPICH2's recommended launcher)?



Raghu

On Fri, Oct 31, 2014 at 3:53 AM, Gad, Ramy <gad at uni-mainz.de> wrote:

Hi,


I have installed mvapich2 V2.0 and V2.1a with this configuration.


====

./configure --prefix=/home/gad/install_mvapich2-2.1a --with-ib-libpath=/global/packages/libibverbs-pd/lib --enable-ckpt --with-blcr=/opt/blcr --enable-checkpointing --with-hydra-ckpointlib=blcr

====


I have BLCR installed on my system and its kernel modules are loaded.


====

gad at pandora1:/home/gad$ lsmod | grep blcr
blcr                  115465  0
blcr_imports           10683  1 blcr
gad at pandora1:/home/gad$
gad at pandora1:/home/gad$ echo $LD_LIBRARY_PATH
:/home/gad/install_mvapich2-2.0/lib:/opt/blcr/lib
====


The problems are:


1- When compiling a program that uses application-initiated synchronous checkpointing (via MVAPICH2_Sync_Checkpoint()), I get the following error message: undefined reference to `MVAPICH2_Sync_Checkpoint'. Is there a header file I need to include, or a library I need to link against?



====

gad at pandora1:/home/gad/mvapich2-2.1a_test$ cat testcp.cpp
#include "mpi.h"
#include <unistd.h>
#include <stdio.h>



int main(int argc,char *argv[])
{
    MPI_Init(&argc,&argv);
    printf("Computation\n");
    sleep(5);
    MPI_Barrier(MPI_COMM_WORLD);
    MVAPICH2_Sync_Checkpoint();
    MPI_Barrier(MPI_COMM_WORLD);
    printf("Computation\n");
    sleep(5);
    MPI_Finalize();
    return 0;
}
gad at pandora1:/home/gad/mvapich2-2.1a_test$ mpicc testcp.cpp -o testcp
testcp.cpp: In function ‘int main(int, char**)’:
testcp.cpp:13: error: ‘MVAPICH2_Sync_Checkpoint’ was not declared in this scope
====



2- When I try to checkpoint an MPI program with cr_checkpoint and restore it with cr_restart, I get the following error:

====

gad at pandora1:/home/gad$ cr_checkpoint -p 2120                          # 2120 is the PID of the mpirun process

gad at pandora1:/home/gad$ cr_restart context.2120
[mpiexec at pandora1] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP & ~POLLERR)) failed
[mpiexec at pandora1] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec at pandora1] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion

====


I can see that a context file is generated only for the mpirun process ("context.2120"); no context files are generated for the MPI processes.


Could you please help me get MVAPICH2 checkpointing working with BLCR?


Best Regards,

Ramy Gad
Johannes Gutenberg - Universität Mainz
Zentrums für Datenverarbeitung (ZDV)

Anselm-Franz-von-Bentzel-Weg 12
55128 Mainz
Germany
E-Mail: gad at uni-mainz.de
Office Phone: +49-6131-39-26437


_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

