[mvapich-discuss] Problems with migration in mvapich2-1.8rc1

Iván Cores González ivan.coresg at udc.es
Mon Apr 16 05:21:54 EDT 2012


Hi,
Sorry for the delay in replying to your answer. I've changed the configuration options; the configuration of MVAPICH is now:

./configure --prefix=$HOME/mvapich/mvapich2-1.8rc1/build --with-device=ch3:mrail 
--with-rdma=gen2 --enable-shared --enable-ckpt --with-blcr=$HOME/blcr-0.8.4/build 
--enable-ckpt-migration --enable-checkpointing --with-hydra-ckpointlib=blcr 
--with-ftb=$HOME/ftb/ftb-0.6.2/build --disable-ckpt-aggregation 

I tested the BLCR installation and it works fine with the cr_run and cr_restart commands.
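For reference, the sanity check was roughly the following; the PID and the context file name are just placeholders from one run:

cr_run ./a.out &                      # start the program under BLCR control
cr_checkpoint -f context.1234 1234    # checkpoint PID 1234 into context.1234
cr_restart context.1234               # restart the process from the context file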
But when I run the application with:

mpirun_rsh -np 2 -hostfile mpd.hosts -sparehosts mpd.hostsMIG MV2_CKPT_FILE=$HOME/ckpt/check ./a.out 

where mpd.hosts is: 
compute-0-10 
compute-0-10 

and mpd.hostsMIG is: 
compute-0-9 
compute-0-9 

and I execute:
mv2_trigger compute-0-10 

the checkpoint files are created, but something is wrong: 
check.0.0 100% 9591KB 9.4MB/s 00:00 
check.0.1 100% 9507KB 9.3MB/s 00:01 
Connection to compute-0-10 closed. 
[compute-0-9.local:mpispawn_1][child_handler] MPI process (rank: 0, pid: 29671) terminated with signal 11 -> abort job 
[pluton.des.udc.es:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from node compute-0-9 aborted: MPI process error (1) 

I think it could be a problem with the restart, but I cannot find the mistake.
I cannot execute the "prelink --undo --all" command because I don't have root privileges.
Could this be the cause of the problem? Am I doing something wrong?

Thanks, 
Iván Cores. 



De: "Raghunath" <rajachan at cse.ohio-state.edu> 
Para: "Iván Cores González" <ivan.coresg at udc.es> 
CC: mvapich-discuss at cse.ohio-state.edu 
Enviados: Lunes, 2 de Abril 2012 16:47:35 
Asunto: Re: [mvapich-discuss] Problems with migration in mvapich2-1.8rc1 

Hi, 


Thanks for posting this to the list. 

It looks like you have configured MVAPICH2 with the QLogic PSM-CH3 interface. 
Currently, process migration support in MVAPICH2 is available only with the 
OFA-IB-CH3 interface (--with-device=ch3:mrail --with-rdma=gen2) 
for Mellanox IB adapters, which is also the default interface. 
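For example, a configure invocation along these lines should select the right interface together with the migration support (the install, BLCR, and FTB paths are placeholders for your own installation):

./configure --prefix=<install-dir> --with-device=ch3:mrail --with-rdma=gen2 \
  --enable-ckpt --enable-ckpt-migration --with-hydra-ckpointlib=blcr \
  --with-blcr=<path-to-blcr> --with-ftb=<path-to-ftb>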


Do let us know if you have additional questions on this. 


Thanks, 
-- Raghu 



2012/4/2 Iván Cores González <ivan.coresg at udc.es>




Hello, 
I am testing the new mvapich2-1.8rc1 version on a small cluster with InfiniBand and I am having problems trying the migration features. 

I installed FTB, BLCR, and MVAPICH without problems. The configuration of MVAPICH is: 

./configure --prefix=$HOME/mvapich/mvapich2-1.8rc1/build --with-device=ch3:psm --enable-shared --enable-ckpt --with-blcr=$HOME/blcr-0.8.4/build --enable-ckpt-migration --enable-checkpointing --with-hydra-ckpointlib=blcr --with-ftb=$HOME/ftb/ftb-0.6.2/build --disable-ckpt-aggregation 

Once I have started the FTB daemons (ftb_database_server on the front-end and ftb_agent on the compute nodes; roughly as sketched below, after the host files) and loaded the BLCR modules, I simply run the application with: 

mpirun_rsh -np 2 -hostfile mpd.hosts -sparehosts mpd.hostsMIG ./a.out 

where mpd.hosts is: 
compute-0-10 
compute-0-10 

and mpd.hostsMIG is: 
compute-0-9 
compute-0-9 
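For completeness, the FTB daemons were started by hand roughly as follows; the sbin location and the FTB_BSTRAP_SERVER variable reflect my installation and are only an example:

# on the front-end
$HOME/ftb/ftb-0.6.2/build/sbin/ftb_database_server &

# on each compute node, pointing the agent at the front-end
export FTB_BSTRAP_SERVER=<front-end-hostname>
$HOME/ftb/ftb-0.6.2/build/sbin/ftb_agent &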

However, when I execute 
mv2_trigger compute-0-10 

or the other option to migrate (pkill -SIGUSR2 mpispawn), nothing happens. Only the FTB information is shown, but the job continues running on the same node. 
I cannot execute the "prelink --undo --all" command because I do not have root privileges. Could this be the cause of the problem? 
Am I doing something wrong? 

Thanks, 
Iván Cores. 








