[mvapich-discuss] Fault Tolerance & Recovery on MAPICH2-1

Raghunath rajachan at cse.ohio-state.edu
Mon Nov 28 00:57:54 EST 2011


Hi Sashi,

Thanks for posting your query on the list.
Have you tried MVAPICH2 1.7 with the Nemesis-IB channel and the Hydra launcher?
This should provide the support you are looking for.

You can get a tarball of MVAPICH2-1.7 from the following location:
http://mvapich.cse.ohio-state.edu/download/mvapich2/mvapich2-1.7.tgz

Use the "--with-device=ch3:nemesis:ib" flag at configure time to build the
library with the Nemesis-IB channel. You can find more information on
configuring a build for the Nemesis channel in our user guide:
http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.7.html#x1-150004.6
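
As a rough sketch, the build could look something like the following. The
install prefix and the name of the extracted directory are assumptions, so
adjust them to your setup; only the tarball URL and the
"--with-device=ch3:nemesis:ib" flag come from the instructions above.

```shell
#!/bin/sh
set -e  # stop at the first failing step

# Assumed install location; change this to suit your cluster.
PREFIX="$HOME/mvapich2-1.7-nemesis-ib"

# Fetch and unpack the 1.7 tarball.
wget http://mvapich.cse.ohio-state.edu/download/mvapich2/mvapich2-1.7.tgz
tar xzf mvapich2-1.7.tgz
cd mvapich2-1.7   # assumed name of the extracted directory

# Select the Nemesis-IB channel at configure time, then build and install.
./configure --prefix="$PREFIX" --with-device=ch3:nemesis:ib
make
make install
```

After installing, put "$PREFIX/bin" ahead of any older MVAPICH2 installation
in your PATH so that the new compiler wrappers and launcher are picked up.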

You can also find more information on using the Hydra process manager in
the user guide:
http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.7.html#x1-330005.2.2

Please do let us know if this works for you.

Thanks,
--
Raghu


On Sun, Nov 27, 2011 at 2:36 PM, Sashi Balasingam <sashibala2 at yahoo.com> wrote:

> I am currently using MVAPICH2-1.6rc2 on a multi-node cluster (with SUSE
> Linux) using an InfiniBand network (Mellanox QDR), and have this question on
> Fault Tolerance:
>
> In the event a node/process dies, I don’t want the entire MPI job to fail
> on all processes, but continue normally, except without that dead process,
> and any communication to that process can return an error message, but not
> hang.
>
> Is there a simple way to get the above done?
>
> Thanks in advance for any suggestions,
>
> Sash
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>