[mvapich-discuss] mvapich2 fault tolerance problem.

Raghunath rajachan at cse.ohio-state.edu
Wed Oct 26 08:17:01 EDT 2011


Hi Anatoly,

Thanks for posting this issue on the list.
The desired fault-tolerance feature is not available in MVAPICH2 yet.
It will be available in a future release.

--
Raghu


On Wed, Oct 26, 2011 at 5:28 AM, Anatoly G <anatolyrishon at gmail.com> wrote:

> *I use mvapich2.*
> Compile configuration:
>
> ./configure --with-device=ch3:sock --enable-debuginfo
>  --prefix=/space/local/mvapich2 CFLAGS=-fPIC --enable-shared
> --enable-threads --enable-sharedlibs=gcc --with-pm=mpd:hydra
>
> mvapich2-1.7rc2
>
> I am trying to build a master/slave application, which I execute using
> mpiexec.hydra.
> The master is always the process with rank 0; all the other processes are
> slaves (currently about 10 processes).
> Communication protocol (point to point):
> *Slave:*
> 1) Sends a single integer to the master 1000 times in a loop using
> MPI_Send.
> 2) Waits in MPI_Recv to receive a single integer from the master.
> 3) Executes MPI_Finalize().
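> 
> In code, the slave side looks roughly like this (a minimal sketch; the
> helper name run_slave and the names NUM_ITERS and TAG are assumptions,
> not copied from my actual program):
> 
> #include <mpi.h>
> 
> // Slave: send a counter to rank 0 a fixed number of times, then block
> // until the master sends back the stop value (0 in this protocol).
> void run_slave()
> {
>     const int NUM_ITERS = 1000;  // iteration count from the description
>     const int TAG = 0;           // assumed tag
>     for (int i = 0; i < NUM_ITERS; ++i)
>         MPI_Send(&i, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD);
>     int stop = -1;
>     MPI_Recv(&stop, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>     MPI_Finalize();
> }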
>
> *Master:*
> 1) The master is initialized with MPI_Errhandler_set(MPI_COMM_WORLD,
> MPI_ERRORS_RETURN);
> 2) The master cycles over a buffer of slave ranks and listens to each one
> with an MPI_Recv posted for that slave's rank.
> The loop is performed 1000 * (number of slaves) times.
> 3) After the loop ends, the master sends the integer 0 to each slave using
> MPI_Send.
> 4) Executes MPI_Finalize().
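> 
> The master side, correspondingly (again a sketch; run_master, TAG and the
> assumption that the slaves occupy ranks 1..nslaves mirror the description
> above rather than the real code):
> 
> #include <mpi.h>
> 
> // Master: receive from the slaves in round-robin order, then tell each
> // slave to stop by sending 0.
> void run_master(int nprocs)
> {
>     const int NUM_ITERS = 1000;
>     const int TAG = 0;            // assumed tag
>     int nslaves = nprocs - 1;
>     MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>     for (int i = 0; i < NUM_ITERS * nslaves; ++i) {
>         int src = 1 + (i % nslaves);   // cycle over slave ranks 1..nslaves
>         int value = 0;
>         MPI_Status status;
>         int rc = MPI_Recv(&value, 1, MPI_INT, src, TAG, MPI_COMM_WORLD, &status);
>         if (rc != MPI_SUCCESS) {
>             // With MPI_ERRORS_RETURN, this is where a failed slave should
>             // be detected -- if the library keeps the job alive at all.
>         }
>     }
>     int stop = 0;
>     for (int r = 1; r <= nslaves; ++r)
>         MPI_Send(&stop, 1, MPI_INT, r, TAG, MPI_COMM_WORLD);
>     MPI_Finalize();
> }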
>
> *Purpose of the application*: tolerance to process failures.
> If some of the slaves fail, continue working with the remaining ones.
>
> *Execute command*:
> mpiexec.hydra -disable-auto-cleanup -launcher rsh -launcher-exec
> /usr/bin/rsh -f machines.txt -n 11 mpi_send_sync
>
> *MPI initialization code*:
> MPI::Init(argc, argv);
> MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>
> *machines.txt contains*:
> student1-eth1:1
> student2-eth1:3
> student3-eth1:20
>
> *Execution results*:
> 1) When I run the application as written above, everything works.
> 2) Then I simulate the failure of one of the slaves by calling abort()
> on iteration 10 of the slave loop.
> As a result, the master gets a SIGUSR1 signal and fails.
>
> *Questions:*
> 1) What do I need to do in order to get an error status back from the
> MPI_Recv call in the master code?
> 2) If I use MPI_Irecv + MPI_Waitany in the master code, how can I
> recognize which slave has "died"? (The slaves run on remote hosts.)
>