[mvapich-discuss] [PATCH] Initialize libibverbs’s data structures to handle fork

Subramoni, Hari subramoni.1 at osu.edu
Tue Dec 15 09:02:46 EST 2020


Hi, Honggang.

Thanks for identifying this and posting the patch. We appreciate your feedback.

Please note that the missing ibv_fork_init() call is intentional. The purpose of this test is to check whether the underlying IB-enabled MPI communication runtime takes care of fork safety even if the application does not.

We noticed that when using frameworks like TensorFlow 2.0 and higher over Horovod+MPI, the training would hang because TensorFlow was using fork. To simulate use cases like this and ensure that MPI libraries do not hang, we created this test so that we can catch such problems in our internal testing and validation.

We also introduced a new variable, "MV2_SUPPORT_FORK_SAFETY" (disabled by default for performance reasons), to make sure MVAPICH2 takes care of fork safety for applications that require it.
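
For comparison, the application-side approach proposed in the patch below can be sketched in plain C. This is only an illustration under stated assumptions: request_fork_safety() and child_inherits_fork_safe() are hypothetical helper names, not part of OMB or MVAPICH2, and the sketch does not link against libibverbs or MPI. It only shows that RDMAV_FORK_SAFE has to be placed in the environment before the first verbs call (in an MPI program, before MPI_Init()) and that a forked child inherits it.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: ask libibverbs for fork-safe mode by setting
 * RDMAV_FORK_SAFE. libibverbs reads the variable when it initializes,
 * so this must run before MPI_Init() or any other verbs user.
 * Returns 0 on success, -1 on failure (as setenv() does). */
int request_fork_safety(void)
{
    /* overwrite = 0 keeps a value the user has already exported. */
    return setenv("RDMAV_FORK_SAFE", "1", 0);
}

/* Hypothetical helper: fork a child and report whether it sees the
 * variable. Returns 1 if the child sees it, 0 if it does not, and
 * -1 if fork() or waitpid() fails. */
int child_inherits_fork_safe(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: the environment is copied across fork(). */
        _exit(getenv("RDMAV_FORK_SAFE") != NULL ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 1 : 0;
}
```

An application would call request_fork_safety() at the top of main(), before MPI_Init(). Using setenv() with overwrite set to 0, rather than the putenv() call in the patch, is a design choice of this sketch so that a value exported by the user is left untouched.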

Please let us know if you have further comments or feedback.

Best,
Hari.

-----Original Message-----
From: Honggang LI <honli at redhat.com> 
Sent: Tuesday, December 15, 2020 4:49 AM
To: Subramoni, Hari <subramoni.1 at osu.edu>; mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [PATCH] Initialize libibverbs’s data structures to handle fork

From: Honggang Li <honli at redhat.com>

The benchmark 'osu_latency_mp.c' uses 'fork()' without calling ibv_fork_init().

Setting the environment variable RDMAV_FORK_SAFE or IBV_FORK_SAFE has the same effect as calling ibv_fork_init().

Without this patch, the benchmark hangs when the environment variable is not set.

Signed-off-by: Honggang Li <honli at redhat.com>
---
 mpi/pt2pt/osu_latency_mp.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mpi/pt2pt/osu_latency_mp.c b/mpi/pt2pt/osu_latency_mp.c
index 008b1a77de40..ab8b48f95b2e 100644
--- a/mpi/pt2pt/osu_latency_mp.c
+++ b/mpi/pt2pt/osu_latency_mp.c
@@ -26,6 +26,9 @@ int main(int argc, char *argv[])
     options.bench = PT2PT;
     options.subtype = LAT_MP;
 
+    if (putenv("RDMAV_FORK_SAFE=1") != 0)
+	exit(EXIT_FAILURE);
+
     set_header(HEADER);
     set_benchmark_name("osu_latency_mp");
 
--
2.25.4



