[mvapich-discuss] Problem with mvapich2 + blcr

Gad, Ramy gad at uni-mainz.de
Fri Oct 31 03:53:53 EDT 2014


Hi,


I have installed mvapich2 V2.0 and V2.1a with the following configuration:


====

./configure --prefix=/home/gad/install_mvapich2-2.1a --with-ib-libpath=/global/packages/libibverbs-pd/lib --enable-ckpt --with-blcr=/opt/blcr --enable-checkpointing --with-hydra-ckpointlib=blcr

====


I have BLCR installed on my system, and its kernel modules are loaded.


====

gad@pandora1:/home/gad$ lsmod | grep blcr
blcr                  115465  0
blcr_imports           10683  1 blcr
gad@pandora1:/home/gad$
gad@pandora1:/home/gad$ echo $LD_LIBRARY_PATH
:/home/gad/install_mvapich2-2.0/lib:/opt/blcr/lib
====
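The configure line above uses the 2.1a prefix, while the LD_LIBRARY_PATH shown lists the 2.0 install. Since both versions are installed, this may be intentional. A minimal sketch for checking which install's lib directory is actually on the path; the helper name `on_lib_path` is hypothetical, and the default prefix is simply the one from the configure line:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if directory $1 is one of the
# ':'-separated entries in the path list $2.
on_lib_path() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumption: prefix taken from the configure line; override with PREFIX=...
prefix=${PREFIX:-/home/gad/install_mvapich2-2.1a}

if on_lib_path "$prefix/lib" "${LD_LIBRARY_PATH:-}"; then
    echo "$prefix/lib is on LD_LIBRARY_PATH"
else
    echo "$prefix/lib is NOT on LD_LIBRARY_PATH"
fi
```

Running it with PREFIX set to each install in turn shows which build the runtime linker will pick up.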


The problems are:


1- When compiling a program that uses application-initiated synchronous checkpointing (via MVAPICH2_Sync_Checkpoint()), I get the following error message: undefined reference to `MVAPICH2_Sync_Checkpoint'. Is there a header file I need to include, or a library I need to link against?



====

gad@pandora1:/home/gad/mvapich2-2.1a_test$ cat testcp.cpp
#include "mpi.h"
    #include <unistd.h>
    #include <stdio.h>



    int main(int argc,char *argv[])
    {
        MPI_Init(&argc,&argv);
        printf("Computation\n");
        sleep(5);
        MPI_Barrier(MPI_COMM_WORLD);
        MVAPICH2_Sync_Checkpoint();
        MPI_Barrier(MPI_COMM_WORLD);
        printf("Computation\n");
        sleep(5);
        MPI_Finalize();
        return 0;
    }
gad@pandora1:/home/gad/mvapich2-2.1a_test$ mpicc testcp.cpp -o testcp
testcp.cpp: In function 'int main(int, char**)':
testcp.cpp:13: error: 'MVAPICH2_Sync_Checkpoint' was not declared in this scope
====
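The "was not declared in this scope" error above means the mpi.h seen by the compiler does not declare the function. A minimal sketch for checking the installed header directly; the helper name `declares_symbol` is hypothetical, the default prefix is the one from the configure line, and whether mpi.h is the header that should carry the declaration is an assumption:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if file $1 contains identifier $2
# as a whole word.
declares_symbol() {
    grep -qw -- "$2" "$1"
}

# Assumption: prefix taken from the configure line; override with PREFIX=...
prefix=${PREFIX:-/home/gad/install_mvapich2-2.1a}
header="$prefix/include/mpi.h"

if [ ! -f "$header" ]; then
    echo "no mpi.h found at $header"
elif declares_symbol "$header" MVAPICH2_Sync_Checkpoint; then
    echo "MVAPICH2_Sync_Checkpoint is mentioned in $header"
else
    echo "MVAPICH2_Sync_Checkpoint is not mentioned in $header"
fi
```

If the header does mention the function, the declaration is probably guarded by a preprocessor condition, which can be inspected with `grep -n -B3 MVAPICH2_Sync_Checkpoint` on the same file.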



2- When I try to checkpoint an MPI program with cr_checkpoint and restart it with cr_restart, I get the following error:

====

gad@pandora1:/home/gad$ cr_checkpoint -p 2120                          # 2120 is the PID of the mpirun process

gad@pandora1:/home/gad$ cr_restart context.2120
[mpiexec@pandora1] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP & ~POLLERR)) failed
[mpiexec@pandora1] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec@pandora1] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion

====


I can see that a context file, "context.2120", is generated only for the mpirun process; no context files are generated for the MPI processes.


Could you please help me resolve these problems so that MVAPICH2 checkpointing works with BLCR?


Best Regards,

Ramy Gad
Johannes Gutenberg - Universität Mainz
Zentrums für Datenverarbeitung (ZDV)

Anselm-Franz-von-Bentzel-Weg 12
55128 Mainz
Germany
E-Mail: gad at uni-mainz.de
Office Phone: +49-6131-39-26437
