[mvapich-discuss] mvapich2 fault tolerance problem.

Jonathan Perkins perkinjo at cse.ohio-state.edu
Wed Oct 26 07:35:36 EDT 2011


Thank you for the note; we'll take a look at why things are behaving this way.

On Wed, Oct 26, 2011 at 5:28 AM, Anatoly G <anatolyrishon at gmail.com> wrote:
> I use mvapich2.
> Compile configuration:
>
> ./configure --with-device=ch3:sock
> --enable-debuginfo --prefix=/space/local/mvapich2 CFLAGS=-fPIC --enable-shared
> --enable-threads --enable-sharedlibs=gcc --with-pm=mpd:hydra
>
> mvapich2-1.7rc2
>
> I am trying to build a master/slave application, which I execute using
> mpiexec.hydra.
> The master is always the process with rank 0. All of the other processes
> are slaves (currently about 10 processes).
> Communication protocol (point to point):
> Slave (a rough sketch follows this list):
> 1) Sends a single integer to the master 1000 times in a loop using MPI_Send.
> 2) Waits in MPI_Recv to receive a single integer from the master.
> 3) Executes MPI_Finalize().
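>
> Roughly, the slave side looks like this (simplified sketch; MASTER_RANK and
> TAG_DATA are placeholder names, not my exact code):
>
> #include <mpi.h>
>
> static const int MASTER_RANK = 0;
> static const int TAG_DATA = 0;
>
> void run_slave(void)
> {
>     /* 1) send a single integer to the master 1000 times */
>     for (int i = 0; i < 1000; ++i) {
>         int value = i;
>         MPI_Send(&value, 1, MPI_INT, MASTER_RANK, TAG_DATA, MPI_COMM_WORLD);
>     }
>
>     /* 2) wait for a single integer back from the master */
>     int reply = -1;
>     MPI_Recv(&reply, 1, MPI_INT, MASTER_RANK, TAG_DATA, MPI_COMM_WORLD,
>              MPI_STATUS_IGNORE);
>
>     /* 3) finalize */
>     MPI_Finalize();
> }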
> Master (a rough sketch follows this list):
> 1) The master is initialized with MPI_Errhandler_set(MPI_COMM_WORLD,
> MPI_ERRORS_RETURN);
> 2) The master cycles over the slave ranks and listens to each one with an
> MPI_Recv posted for that slave's rank.
> The loop is performed 1000 * (number of slaves) times.
> 3) After the loop ends, the master sends the integer 0 to each slave using
> MPI_Send.
> 4) Executes MPI_Finalize().
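>
> The master side is roughly the following (simplified sketch, reusing the
> placeholder constants from the slave sketch; the rc check is what I would
> like to work once MPI_ERRORS_RETURN is in effect):
>
> void run_master(int num_slaves)
> {
>     MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>
>     /* 2) cycle over the slave ranks, one blocking receive per iteration */
>     for (int i = 0; i < 1000 * num_slaves; ++i) {
>         int slave = 1 + (i % num_slaves);  /* slaves are ranks 1..num_slaves */
>         int value = -1;
>         int rc = MPI_Recv(&value, 1, MPI_INT, slave, TAG_DATA,
>                           MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>         if (rc != MPI_SUCCESS) {
>             /* this is where I expect to detect a dead slave */
>         }
>     }
>
>     /* 3) tell every slave to stop by sending it 0 */
>     for (int slave = 1; slave <= num_slaves; ++slave) {
>         int stop = 0;
>         MPI_Send(&stop, 1, MPI_INT, slave, TAG_DATA, MPI_COMM_WORLD);
>     }
>
>     /* 4) finalize */
>     MPI_Finalize();
> }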
> Purpose of the application: tolerance to process failures.
> If some of the slaves fail, continue working with the remaining ones.
> Execute command:
> mpiexec.hydra -disable-auto-cleanup -launcher rsh -launcher-exec
> /usr/bin/rsh -f machines.txt -n 11 mpi_send_sync
> MPI initialization code:
> MPI::Init(argc, argv);
> MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>
> machines.txt contains:
> student1-eth1:1
> student2-eth1:3
> student3-eth1:20
> Execution results:
> 1) When I run the application as written above, everything works.
> 2) Then I try to simulate the failure of one of the slaves by calling
> abort() on iteration 10 of the slave loop (see the snippet below).
> As a result, the master gets a SIGUSR1 signal and fails.
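>
> The failure is simulated with something like this inside the slave's send
> loop (sketch; FAILING_RANK is just a placeholder for the rank I choose to
> kill):
>
>     if (rank == FAILING_RANK && i == 10)
>         abort();   /* slave dies mid-run; the master should keep going */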
> Questions:
> 1) What should I do in order to get an error status back from the MPI_Recv
> call in the master code?
> 2) If I use MPI_Irecv + MPI_Waitany in the master code, how can I recognize
> which slave is "dead"? (The slaves run on remote hosts.) A sketch of what I
> have in mind follows.
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


