[mvapich-discuss] Mvapich fault tolerance features

Raghunath rajachan at cse.ohio-state.edu
Wed Feb 8 16:04:20 EST 2012


Hi Rui,

Thanks for posting this to the list. I apologize for the delay in replying
to your post.

MVAPICH2 does inherit the ability to handle process failures from MPICH2
1.4.1p1. You can use this feature by configuring MVAPICH2 to use the
Nemesis interface and the IB-netmod, and launching your job with the Hydra
process manager. However, there is a known bug (a hang during finalize);
the fix for it will be available in the upcoming release.

In addition to process failures, the upcoming release will also support
handling IB communication failures in the IB-netmod.

Use the "--with-device=ch3:nemesis:ib" flag at configure time to build the
library with the Nemesis-IB channel. You can find more information on
configuring a build for the Nemesis channel in our user guide:
http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.7.html#x1-150004.6

You can also find more information on using the Hydra process manager in
the user guide:
http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.7.html#x1-330005.2.2
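For reference, a minimal build-and-launch sketch along these lines might look
like the following. It assumes an unpacked MVAPICH2 source tree in the current
directory, a hypothetical install prefix ($HOME/mvapich2-nemesis-ib) and a
hypothetical application ./my_app, and it assumes that the installed mpiexec is
Hydra's. The -disable-auto-cleanup switch is the Hydra option that MPICH2 1.4
describes for keeping a job alive after one of its processes dies; please check
that the Hydra in your build accepts it. The application itself must also
replace the default MPI_ERRORS_ARE_FATAL error handler with MPI_ERRORS_RETURN
(for example, with MPI_Comm_set_errhandler) and check return codes. Otherwise
the first failure will still abort the job.

```shell
#!/bin/sh
# Sketch only: PREFIX and ./my_app are hypothetical placeholders.
set -e

PREFIX="$HOME/mvapich2-nemesis-ib"

# Build MVAPICH2 with the Nemesis channel and the IB netmod.
./configure --prefix="$PREFIX" --with-device=ch3:nemesis:ib
make
make install

# Launch with mpiexec (assumed to be Hydra's). -disable-auto-cleanup (assumed
# to be supported by this Hydra version) asks Hydra not to kill the remaining
# processes when one of them exits abnormally.
"$PREFIX/bin/mpiexec" -disable-auto-cleanup -n 4 ./my_app
```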

Please let us know if this helps.

Thanks,
--
Raghu


On Mon, Feb 6, 2012 at 3:02 AM, Rui Wang <wangraying at gmail.com> wrote:

> Hi all,
>
> I saw on the website that the latest version, MVAPICH2 1.8, incorporates
> MPICH2 1.4.1p1. As far as I know, MPICH2 1.4 supports some fault tolerance
> features: if I kill one process of a task, the whole task does not abort
> and none of the communication operations hang. I have done some
> experiments to verify these features. However, when I ran the same
> experiments with MVAPICH2 1.8, these features did not work.
>
> I'm writing to ask whether there are plans for MVAPICH2 1.8 to support these
> features and, if so, how soon. It seems that user-driven fault tolerance is
> becoming an alternative way to address fault tolerance issues in the future.
>
> Thanks,
>
> Rui

