[mvapich-discuss] [PATCH] Initialize libibverbs’s data structures to handle fork
Honggang LI
honli at redhat.com
Tue Dec 15 09:32:03 EST 2020
On Tue, Dec 15, 2020 at 02:02:46PM +0000, Subramoni, Hari wrote:
> Hi, Honggang.
>
> Thanks for identifying this and posting the patch. We appreciate your feedback.
>
> Please note that this was intentional. The purpose of this test was to check if the underlying IB-enabled MPI communication runtime has taken care of fork safety even if the application has not.
>
> We noticed that when using frameworks like TensorFlow 2.0 and higher over Horovod+MPI, the training would hang because TensorFlow was using fork. To simulate use cases like this and ensure MPI libraries do not hang, we created this test so that we can catch such hangs in our internal testing and validation.
Something like this should be documented in the release notes of the OSU benchmarks.
>
> We also introduced a new variable "MV2_SUPPORT_FORK_SAFETY" (which is disabled by default due to performance reasons) to make sure MVAPICH2 takes care of fork safety for applications that require it.
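As a rough illustration of the pattern under discussion, here is a minimal C sketch, not the actual benchmark code or the patch. It assumes the application asks libibverbs for fork safety through the RDMAV_FORK_SAFE environment variable, which the ibv_fork_init(3) man page describes as equivalent to calling ibv_fork_init(), and then forks a short-lived child the way a framework might. The helper names request_verbs_fork_safety and fork_and_wait are hypothetical, and MV2_SUPPORT_FORK_SAFETY itself is not exercised here.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: request fork-safe behaviour from libibverbs by
 * setting RDMAV_FORK_SAFE. This only helps if it runs before the verbs
 * library is initialized (in an MPI program, before MPI_Init).
 * Returns 0 on success, -1 on failure. */
int request_verbs_fork_safety(void)
{
    return setenv("RDMAV_FORK_SAFE", "1", 1);
}

/* Hypothetical helper: fork a child that exits at once with
 * child_status, standing in for a framework that forks a helper
 * process. The parent waits and returns the child's exit status,
 * or -1 if fork/waitpid fails or the child did not exit normally. */
int fork_and_wait(int child_status)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(child_status);      /* child: leave without running atexit handlers */

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In a real MPI program the first helper would be called at the top of main, before MPI_Init, and the second somewhere between MPI_Init and MPI_Finalize.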
Thanks