[mvapich-discuss] [PATCH] Initialize libibverbs’s data structures to handle fork

Honggang LI honli at redhat.com
Tue Dec 15 09:32:03 EST 2020


On Tue, Dec 15, 2020 at 02:02:46PM +0000, Subramoni, Hari wrote:
> Hi, Honggang.
> 
> Thanks for identifying this and posting the patch. We appreciate your feedback.
> 
> Please note that this was intentional. The purpose of this test was to check if the underlying IB-enabled MPI communication runtime has taken care of fork safety even if the application has not.
> 
> We noticed that when using frameworks like TensorFlow 2.0 and higher over Horovod+MPI, training would hang because TensorFlow calls fork. To simulate use cases like this and ensure MPI libraries do not hang, we created this test so that we can catch the problem in our internal testing and validation.

Something like this should be documented in the release notes of the OSU benchmarks.

> 
> We also introduced a new variable "MV2_SUPPORT_FORK_SAFETY" (which is disabled by default due to performance reasons) to make sure MVAPICH2 takes care of fork safety for applications that require it.

Thanks


