[mvapich-discuss] Fault Tolerance & Recovery on MVAPICH2-1

Sashi Balasingam sashibala2 at yahoo.com
Sun Nov 27 14:36:59 EST 2011


I am currently using MVAPICH2-1.6rc2 on a multi-node cluster (running SUSE Linux) with an InfiniBand network (Mellanox QDR), and I have a question about fault tolerance:
 
If a node or process dies, I don't want the entire MPI job to fail on all processes. Instead, the remaining processes should continue normally without the dead process, and any communication with that process should return an error rather than hang.
 
Is there a simple way to achieve this?
 
Thanks in advance for any suggestions,
 
Sash