[mvapich-discuss] SIGSEGV in cm_completion_handler

Subramoni, Hari subramoni.1 at osu.edu
Thu Oct 3 05:30:29 EDT 2019


Hi, Alex.

Many thanks for the report and the patch. We appreciate it.

We ran into this issue some time back and moved the getenv() call made from set_pkey_index out of the cm_completion_handler thread to a different location that is called from the main thread itself. I believe this should also solve the issue you are seeing, right?
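
For reference, a rough sketch of that approach (the function names, variable names, and the environment variable below are illustrative only, not the actual MVAPICH2 code): the environment is read once on the main thread during initialization and cached, so the path reached from cm_completion_handler never calls getenv() itself.

#include <stdlib.h>
#include <string.h>

static char mv2_cached_pkey_env[64];    /* hypothetical cached value       */
static int  mv2_pkey_env_present = 0;   /* set once before threads start   */

/* Called once from the main thread, e.g. during initialization. */
void mv2_cache_pkey_env(void)
{
    const char *val = getenv("MV2_DEFAULT_PKEY");   /* assumed variable name */
    if (val != NULL) {
        strncpy(mv2_cached_pkey_env, val, sizeof(mv2_cached_pkey_env) - 1);
        mv2_pkey_env_present = 1;
    }
}

/* Called later from the cm_completion_handler path: reads only the
 * cached copy, no getenv(). */
const char *mv2_pkey_env(void)
{
    return mv2_pkey_env_present ? mv2_cached_pkey_env : NULL;
}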

Best,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> On Behalf Of Alexander Melnikov
Sent: Thursday, October 3, 2019 4:21 AM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] SIGSEGV in cm_completion_handler

The mvapich2 library is multithreaded, so MT-unsafe calls such as setenv() should be avoided.
For example, the setenv() calls in hwloc_bind.c sometimes crash the program with SIGSEGV: setenv() can modify and reallocate the process environment while another thread is still walking it with getenv(). The two racing call chains are as follows (a minimal standalone sketch of the race follows the call chains):
- in cm_completion_handler thread: cm_handle_msg->cm_accept->cm_qp_create->cm_qp_conn_create->set_pkey_index->getenv
- in main thread: MPID_Init->MPIDI_CH3I_set_affinity->setenv
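
Here is a minimal standalone sketch of the same class of race (not MVAPICH2 code; the environment variable names are placeholders). Whether and when it crashes is timing-dependent.

/* race.c -- standalone illustration of the setenv()/getenv() race.
 * setenv() may modify and reallocate the process environment while
 * getenv() in another thread is still walking the old array.
 * Build with:  cc -pthread race.c -o race  */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static void *reader(void *arg)
{
    (void) arg;
    /* Stands in for the cm_completion_handler thread:
     * cm_handle_msg -> ... -> set_pkey_index -> getenv() */
    for (;;) {
        char *volatile v = getenv("SOME_MV2_VARIABLE");  /* placeholder name */
        (void) v;   /* volatile keeps the call from being optimized away */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    char name[32];
    int i;

    pthread_create(&tid, NULL, reader, NULL);

    /* Stands in for the main thread:
     * MPID_Init -> MPIDI_CH3I_set_affinity -> setenv().
     * Adding many new variables forces the environment array to be
     * reallocated, so the reader may follow a stale pointer -> SIGSEGV. */
    for (i = 0; i < 1000000; i++) {
        snprintf(name, sizeof(name), "DUMMY_%d", i);
        setenv(name, "1", 0);
    }
    return 0;
}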

The problem was solved with the following patch, which drops the setenv() calls and instead records the intended defaults in local flags (enforce_hybrid, enforce_hybrid_numa) that are checked where the values are consumed, so nothing writes to the environment after additional threads have started:
--- mvapich2-2.3.2-orig/src/mpid/ch3/channels/common/src/affinity/hwloc_bind.c 2019-08-09 04:26:25.000000000 +0500
+++ mvapich2-2.3.2/src/mpid/ch3/channels/common/src/affinity/hwloc_bind.c 2019-08-19 09:58:34.352721117 +0500
@@ -2784,6 +2784,8 @@
     int num_local_procs;
     long N_CPUs_online;
     mv2_arch_type arch_type;
+    int enforce_hybrid = 0;
+    int enforce_hybrid_numa = 0;

     MPIDI_STATE_DECL(MPID_STATE_MPIDI_CH3I_SET_AFFINITY);
     MPIDI_FUNC_ENTER(MPID_STATE_MPIDI_CH3I_SET_AFFINITY);
@@ -2805,13 +2807,15 @@
         arch_type == MV2_ARCH_INTEL_PLATINUM_8160_2S_48 ||
         arch_type == MV2_ARCH_AMD_EPYC_7551_64 /* EPYC */ ||
         arch_type == MV2_ARCH_AMD_EPYC_7742_128 /* rome */) {
-        setenv ("MV2_CPU_BINDING_POLICY", "hybrid", 0);
+        if (getenv("MV2_CPU_BINDING_POLICY") == NULL)
+          enforce_hybrid = 1;

         /* if CPU is EPYC, further force hybrid_binding_policy to NUMA */
         if (arch_type == MV2_ARCH_AMD_EPYC_7551_64 ||
             arch_type == MV2_ARCH_AMD_EPYC_7742_128 /* rome */) {
-            setenv ("MV2_HYBRID_BINDING_POLICY", "numa", 0);
-        }
+            if (getenv("MV2_HYBRID_BINDING_POLICY") == NULL)
+              enforce_hybrid_numa = 1;
+        }
     }

     if (mv2_enable_affinity && (num_local_procs > N_CPUs_online)) {
@@ -2844,7 +2848,8 @@

     if (mv2_enable_affinity && (value = getenv("MV2_CPU_MAPPING")) == NULL) {
         /* Affinity is on and the user has not specified a mapping string */
-        if ((value = getenv("MV2_CPU_BINDING_POLICY")) != NULL) {
+        value = enforce_hybrid ? "hybrid" : getenv("MV2_CPU_BINDING_POLICY");
+        if (value != NULL) {
             /* User has specified a binding policy */
             if (!strcmp(value, "bunch") || !strcmp(value, "BUNCH")) {
                 mv2_binding_policy = POLICY_BUNCH;
@@ -2900,7 +2905,8 @@

                    /* since mv2_threads_per_proc > 0, check if any threads
                     * binding policy have been explicitly specified */
-                   if ((value = getenv("MV2_HYBRID_BINDING_POLICY")) != NULL) {
+                   value = enforce_hybrid_numa ? "numa" : getenv("MV2_HYBRID_BINDING_POLICY");
+                   if (value != NULL) {
                        if (!strcmp(value, "linear") || !strcmp(value, "LINEAR")) {
                            mv2_hybrid_binding_policy = HYBRID_LINEAR;
                        } else if (!strcmp(value, "compact") || !strcmp(value, "COMPACT")) {
