[mvapich-discuss] BLCR+FTB+MVAPICH2:problems about migration between two nodes

Rajeev.c.p rajeevcp at yahoo.com
Fri Apr 3 14:36:54 EDT 2015


Hi MVAPICH team,

I have a question about the usage of MPI_Waitsome and MPI_Irecv. I post a set of MPI_Irecv requests in the main thread and then wait on them with MPI_Waitsome, also in the main thread. Once MPI_Waitsome signals completion of some receive requests, I fork threads to process each completed receive handle, and inside each child thread I post the next MPI_Irecv for that specific receive slot.

So the main thread does the Waitsome, while the child threads process the completed receives and post the next set of MPI_Irecv calls. This combination of waiting in MPI_Waitsome and forking worker threads happens in a do-while loop. Will this cause any issue, given that I hand the completed receives off to different threads for processing and re-post the next receive inside the child thread, while control returns immediately to the main thread, which waits again in MPI_Waitsome? There could therefore be a case where MPI_Waitsome and MPI_Irecv run in parallel for the same receive slots. Will MPI synchronize the two calls for the same receive slots, or can this cause deadlocks, data corruption, or missing data?
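Roughly, the pattern looks like the sketch below (a minimal illustration, not my actual code; it assumes MPI is initialized with MPI_THREAD_MULTIPLE and uses POSIX threads, and the slot count, buffer size, tag, and helper names are just placeholders):

/* Sketch of the pattern described above, not the real application.
 * Assumes MPI_THREAD_MULTIPLE; NSLOTS, BUFLEN, TAG are placeholders. */
#include <mpi.h>
#include <pthread.h>
#include <stdlib.h>

#define NSLOTS 8
#define BUFLEN 1024
#define TAG    42

static MPI_Request reqs[NSLOTS];
static char bufs[NSLOTS][BUFLEN];

struct slot_arg { int slot; };

/* Child thread: process the completed receive, then re-post MPI_Irecv
 * for the same slot.  This call may run concurrently with the main
 * thread's MPI_Waitsome on the same reqs[] array. */
static void *process_slot(void *p)
{
    int slot = ((struct slot_arg *)p)->slot;
    free(p);

    /* ... consume bufs[slot] here ... */

    MPI_Irecv(bufs[slot], BUFLEN, MPI_CHAR, MPI_ANY_SOURCE, TAG,
              MPI_COMM_WORLD, &reqs[slot]);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, done = 0;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);

    /* Main thread posts the initial set of receives. */
    for (int i = 0; i < NSLOTS; i++)
        MPI_Irecv(bufs[i], BUFLEN, MPI_CHAR, MPI_ANY_SOURCE, TAG,
                  MPI_COMM_WORLD, &reqs[i]);

    do {
        int outcount, indices[NSLOTS];

        /* Main thread waits while child threads may be re-posting
         * MPI_Irecv into the same reqs[] entries. */
        MPI_Waitsome(NSLOTS, reqs, &outcount, indices, MPI_STATUSES_IGNORE);

        for (int i = 0; i < outcount; i++) {
            pthread_t tid;
            struct slot_arg *arg = malloc(sizeof *arg);
            arg->slot = indices[i];
            pthread_create(&tid, NULL, process_slot, arg);
            pthread_detach(tid);
        }
        /* 'done' would be set from an application-specific
         * termination condition (placeholder). */
    } while (!done);

    MPI_Finalize();
    return 0;
}

The race I am asking about is between MPI_Waitsome in main() and the MPI_Irecv re-posted in process_slot() on the same reqs[] entry.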



Thanks and Regards,
Rajeev

On Friday, April 3, 2015 7:47 PM, Jian Lin <lin.2180 at osu.edu> wrote:

 Hi, Qingze,

The problems are likely related to the prelinking feature of your OS.
Prelinking must be disabled to allow a job to be restarted on other
nodes. Please refer to the following note in the BLCR FAQ, modify your
OS settings accordingly, and try again.
<https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink>

Thanks!


On Thu, 2 Apr 2015 10:48:52 +0800
hljgqz <15776869853 at 163.com> wrote:

> To whom it may concern,
>
> I have two problems when using BLCR+FTB+MVAPICH2. I can restart an MPI
> job on the original nodes, but I cannot restart it on a different set
> of nodes.
>
> 1) I cannot restart the MPI job on a different node. I did as the user
> guide said. I have two nodes, named node1 and node3. First, I run
> "mpirun_rsh -np 2 -hostfile hosts ./proc" on node3, where the hosts
> file consists of node1:4. Then I run "cr_checkpoint -p <pid>", and I
> can restart the job using the original hosts file. But if I change the
> hostfile to node3:4, as the user guide says, and restart the job, it
> does not work and reports this error:
>
> [node3:mpispawn_0][child_handler] MPI process (rank: 1, pid: 5174)
> terminated with signal 4 -> abort job
> [node3:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node
> node3 aborted: MPI process error (1)
>
> 2) I tried "mpirun_rsh -np 4 -hostfile ./hosts -sparehosts
> ./spare_hosts ./prog", but it does not work and gives back these
> messages:
>
> [root at node3 node0]# mpirun_rsh -np 4 -hostfile ./hosts -sparehosts ./spare_hosts ./hello
> [root at node3 node0]# [FTB_WARNING][ftb_agent.c: line 46]FTBM_Wait failed 1
>
> My configuration:
>
> [root at node3 node0]# mpiname -a
> MVAPICH2 2.1rc2 Thu Mar 12 20:00:00 EDT 2014 ch3:mrail
> Compilation
> CC: gcc    -DNDEBUG -DNVALGRIND -O2
> CXX: g++  -DNDEBUG -DNVALGRIND -O2
> F77: gfortran -L/home/node0/16/blcr/lib -L/lib  -O2
> FC: gfortran  -O2
> Configuration
> --enable-ckpt-migration --with-blcr=/home/node0/16/blcr --without-hwloc
>
> Hope to hear back!
>
> Sincerely yours,
> Gong qingze
> 



-- 
Jian Lin
http://linjian.org

_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

