[mvapich-discuss] [PATCH] Initialize libibverbs’s data structures to handle fork

Subramoni, Hari subramoni.1 at osu.edu
Tue Dec 15 10:02:51 EST 2020


Hi, Honggang.

Thanks for the feedback.

We will update the README in OMB to indicate this so that folks are aware.

Best,
Hari.

-----Original Message-----
From: Honggang LI <honli at redhat.com> 
Sent: Tuesday, December 15, 2020 9:32 AM
To: Subramoni, Hari <subramoni.1 at osu.edu>
Cc: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: [PATCH] Initialize libibverbs’s data structures to handle fork

On Tue, Dec 15, 2020 at 02:02:46PM +0000, Subramoni, Hari wrote:
> Hi, Honggang.
> 
> Thanks for identifying this and posting the patch. We appreciate your feedback.
> 
> Please note that this was intentional. The purpose of this test was to check if the underlying IB-enabled MPI communication runtime has taken care of fork safety even if the application has not.
> 
> We noticed that when using frameworks like TensorFlow 2.0 and higher over Horovod+MPI, training would hang because TensorFlow was using fork. To simulate use cases like this and ensure MPI libraries do not hang, we created this test so that we can catch the problem in our internal testing and validation.

Something like this should be documented in the release notes of the OSU benchmarks.

> 
> We also introduced a new variable "MV2_SUPPORT_FORK_SAFETY" (disabled by default for performance reasons) to make sure MVAPICH2 takes care of fork safety for applications that require it.

Thanks



