[mvapich-discuss] Problems with migration in mvapich2-1.8rc1
Iván Cores González
ivan.coresg at udc.es
Mon Apr 16 05:21:54 EDT 2012
Hi,
Sorry for the delay in replying to your answer. I've changed the configuration options;
the configuration of MVAPICH2 is now:
./configure --prefix=$HOME/mvapich/mvapich2-1.8rc1/build --with-device=ch3:mrail
--with-rdma=gen2 --enable-shared --enable-ckpt --with-blcr=$HOME/blcr-0.8.4/build
--enable-ckpt-migration --enable-checkpointing --with-hydra-ckpointlib=blcr
--with-ftb=$HOME/ftb/ftb-0.6.2/build --disable-ckpt-aggregation
I tested the BLCR installation and it works fine with the cr_run and cr_restart commands.
But when I run the application with:
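For reference, a minimal standalone BLCR check looks roughly like this (the binary name ./a.out and the checkpoint filename pattern are illustrative; these are the standard BLCR commands):

```shell
# Run the application under BLCR checkpoint control
cr_run ./a.out &
APP_PID=$!

# Take a checkpoint of the running process and terminate it;
# by default this writes a context file named context.<pid>
# in the current directory
cr_checkpoint --term $APP_PID

# Later, restart the process from its checkpoint file
cr_restart context.$APP_PID
```
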
mpirun_rsh -np 2 -hostfile mpd.hosts -sparehosts mpd.hostsMIG MV2_CKPT_FILE=$HOME/ckpt/check ./a.out
where mpd.hosts is:
compute-0-10
compute-0-10
and mpd.hostsMIG is:
compute-0-9
compute-0-9
and I execute:
mv2_trigger compute-0-10
the checkpoint files are created, but something is wrong:
check.0.0 100% 9591KB 9.4MB/s 00:00
check.0.1 100% 9507KB 9.3MB/s 00:01
Connection to compute-0-10 closed.
[compute-0-9.local:mpispawn_1][child_handler] MPI process (rank: 0, pid: 29671) terminated with signal 11 -> abort job
[pluton.des.udc.es:mpirun_rsh][process_mpispawn_connection] mpispawn_1 from node compute-0-9 aborted: MPI process error (1)
I think it could be a problem with the restart, but I cannot find the mistake.
I cannot execute the "prelink --undo --all" command because I don't have root privileges.
Could this be the cause of the problem? Am I doing something wrong?
Thanks,
Iván Cores.
From: "Raghunath" <rajachan at cse.ohio-state.edu>
To: "Iván Cores González" <ivan.coresg at udc.es>
CC: mvapich-discuss at cse.ohio-state.edu
Sent: Monday, April 2, 2012 16:47:35
Subject: Re: [mvapich-discuss] Problems with migration in mvapich2-1.8rc1
Hi,
Thanks for posting this to the list.
It looks like you have configured MVAPICH2 with the QLogic PSM-CH3 interface.
Currently, process migration support in MVAPICH2 is available only with the
OFA-IB-CH3 interface (--with-device=ch3:mrail --with-rdma=gen2)
for Mellanox IB adapters, which is also the default one.
Do let us know if you have additional questions on this.
Thanks,
-- Raghu
2012/4/2 Iván Cores González <ivan.coresg at udc.es>
Hello,
I am testing the new mvapich2-1.8rc1 version on a small cluster with InfiniBand, and I am having problems trying the migration features.
I installed FTB, BLCR, and MVAPICH2 without problems. The configuration of mvapich is:
./configure --prefix=$HOME/mvapich/mvapich2-1.8rc1/build --with-device=ch3:psm --enable-shared --enable-ckpt --with-blcr=$HOME/blcr-0.8.4/build --enable-ckpt-migration --enable-checkpointing --with-hydra-ckpointlib=blcr --with-ftb=$HOME/ftb/ftb-0.6.2/build --disable-ckpt-aggregation
Once I have started the FTB daemons (ftb_database_server on the front-end and ftb_agent on the compute nodes) and loaded the BLCR modules, I run the application with:
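The FTB startup sequence was, roughly, the following (the install path matches my configure prefix above; the front-end hostname and the FTB_BSTRAP_SERVER bootstrap variable are from my setup and may differ on other clusters):

```shell
# On the front-end node: start the FTB database server
$HOME/ftb/ftb-0.6.2/build/sbin/ftb_database_server &

# On each compute node: start the FTB agent, pointing it at
# the front-end through the FTB bootstrap server variable
export FTB_BSTRAP_SERVER=pluton.des.udc.es
$HOME/ftb/ftb-0.6.2/build/sbin/ftb_agent &
```
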
mpirun_rsh -np 2 -hostfile mpd.hosts -sparehosts mpd.hostsMIG ./a.out
where mpd.hosts is:
compute-0-10
compute-0-10
and mpd.hostsMIG is:
compute-0-9
compute-0-9
However, when I execute
mv2_trigger compute-0-10
or the other option to migrate (pkill -SIGUSR2 mpispawn), nothing happens. Only the FTB information is shown, but the job continues running on the same node.
I cannot execute the "prelink --undo --all" command because I don't have root privileges. Could this be the cause of the problem?
Am I doing something wrong?
Thanks,
Iván Cores.
_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss