[mvapich-discuss] Bug: deadlock between ibv_destroy_srq and async_thread

Matthew Koop koop at cse.ohio-state.edu
Wed May 28 17:35:13 EDT 2008


David,

Can you try the attached patch and let us know if it solves your problem?
This adds some synchronization between the threads.
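
The idea, in outline: the async event thread now holds a per-HCA mutex
from just after ibv_get_async_event() until just after
ibv_ack_async_event(), and the finalize path takes the same mutex before
cancelling the thread, so a cancel can no longer land between a get and
its ack.  A simplified sketch (one HCA, names shortened from the actual
patch below):

    /* async event thread */
    while (1) {
        ibv_get_async_event(ctx, &event);
        pthread_mutex_lock(&async_mutex);
        /* ... handle the event ... */
        ibv_ack_async_event(&event);
        pthread_mutex_unlock(&async_mutex);
    }

    /* finalize path */
    pthread_mutex_lock(&async_mutex);    /* waits for any in-flight
                                            event to be acked */
    pthread_cancel(async_thread);
    pthread_join(async_thread, NULL);
    ibv_destroy_srq(srq);
    pthread_mutex_unlock(&async_mutex);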

Thanks,

Matt

On Fri, 23 May 2008 David_Kewley at Dell.com wrote:

> I have a user running a 192-way job using MVAPICH2 1.0.1 and OFED 1.2.5.5,
> where MPI_Finalize() does not return.  In the two example jobs I've
> examined, 189 processes exited, but the other three hung.  The ranks that
> hung were different in the two examples, so I don't think the "3" is
> significant.
>
> All processes I've looked at appear to be stuck in the same way.  In
> normal running, each process has four threads.  When the process gets
> stuck, only the original thread remains.  Here is a gdb backtrace from
> one:
>
> #0  0x00000036b2608b3a in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> #1  0x0000002a9595405b in ibv_cmd_destroy_srq (srq=0x82b370) at src/cmd.c:582
> #2  0x0000002a962b5419 in mthca_destroy_srq (srq=0x82b3bc) at src/verbs.c:475
> #3  0x0000002a9564878e in MPIDI_CH3I_CM_Finalize () from /opt/mvapich2/1.0.1/intel/10.1.015/lib/libmpich.so
> #4  0x0000002a955c053b in MPIDI_CH3_Finalize () from /opt/mvapich2/1.0.1/intel/10.1.015/lib/libmpich.so
> #5  0x0000002a95626202 in MPID_Finalize () from /opt/mvapich2/1.0.1/intel/10.1.015/lib/libmpich.so
> #6  0x0000002a955f7fee in PMPI_Finalize () from /opt/mvapich2/1.0.1/intel/10.1.015/lib/libmpich.so
> #7  0x0000002a955f7eae in pmpi_finalize_ () from /opt/mvapich2/1.0.1/intel/10.1.015/lib/libmpich.so
> #8  0x0000000000459ff8 in stoprog_ ()
> #9  0x000000000047afa6 in MAIN__ ()
> #10 0x0000000000405d62 in main ()
>
> After hours of opportunity to study the MVAPICH2 code :), I think I
> tracked it down to lines 1302-1306 in rdma_iba_init.c:
>
>             if (MPIDI_CH3I_RDMA_Process.has_srq) {
>                 pthread_cancel(MPIDI_CH3I_RDMA_Process.async_thread[i]);
>                 pthread_join(MPIDI_CH3I_RDMA_Process.async_thread[i], NULL);
>                 ibv_destroy_srq(MPIDI_CH3I_RDMA_Process.srq_hndl[i]);
>             }
>
> Consider what would happen if async_thread() were processing an
> IBV_EVENT_SRQ_LIMIT_REACHED event when pthread_cancel() was called on
> it.  async_thread() has already called ibv_get_async_event() for this
> event, but it has not yet called ibv_ack_async_event().  The result
> would be the observed deadlock in this part of ibv_cmd_destroy_srq():
>
>         pthread_mutex_lock(&srq->mutex);
>         while (srq->events_completed != resp.events_reported)
>                 pthread_cond_wait(&srq->cond, &srq->mutex);
>         pthread_mutex_unlock(&srq->mutex);
>
> That is, events_completed == events_reported - 1 at this point.
> pthread_cond_signal() would be called, and events_completed made equal
> to events_reported, only by calling ibv_ack_async_event() on this
> event.  But that will never happen: async_thread() is the only code
> that would have done so, and it has already been pthread_cancel()'d and
> pthread_join()'d before ibv_destroy_srq() is called.
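>
> In simplified C, the fatal window looks like this (a sketch; the real
> handler in ibv_channel_manager.c does more per event):
>
>     /* async_thread(), simplified */
>     while (1) {
>         ibv_get_async_event(ctx, &event);
>         /* a pthread_cancel() landing here leaves the event unacked */
>         /* ... handle the event ... */
>         ibv_ack_async_event(&event);    /* increments srq->events_completed,
>                                            signals srq->cond */
>     }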
>
> I think the fix is to add some sort of synchronization between
> async_thread() and the code that calls pthread_cancel() on it.  To the
> MVAPICH developers: do you think you can work up a fix soon and forward
> the patch for testing?
>
> Thanks,
> David
>
>
> David Kewley
> Dell Infrastructure Consulting Services
> Onsite Engineer at the Maui HPC Center
> Cell: 602-460-7617
> David_Kewley at Dell.com
>
> Dell Services: http://www.dell.com/services/
> How am I doing? Email my manager Russell_Kelly at Dell.com with any
> feedback.
>
-------------- next part --------------
Index: src/mpid/osu_ch3/channels/mrail/src/gen2/ibv_channel_manager.c
===================================================================
--- src/mpid/osu_ch3/channels/mrail/src/gen2/ibv_channel_manager.c	(revision 2532)
+++ src/mpid/osu_ch3/channels/mrail/src/gen2/ibv_channel_manager.c	(working copy)
@@ -786,6 +786,14 @@
             fprintf(stderr, "Error getting event!\n"); 
         }
 
+        for(i = 0; i < rdma_num_hcas; i++) {
+            if(MPIDI_CH3I_RDMA_Process.nic_context[i] == context) {
+                hca_num = i;
+            }
+        }
+
+        pthread_mutex_lock(&MPIDI_CH3I_RDMA_Process.async_mutex_lock[hca_num]);
+
         switch (event.event_type) {
             /* Fatal */
             case IBV_EVENT_CQ_ERR:
@@ -837,11 +845,6 @@
 
                 pthread_spin_lock(&MPIDI_CH3I_RDMA_Process.srq_post_spin_lock);
 
-                for(i = 0; i < rdma_num_hcas; i++) {
-                    if(MPIDI_CH3I_RDMA_Process.nic_context[i] == context) {
-                        hca_num = i;
-                    }
-                }
 
                 if(-1 == hca_num) {
                     /* Was not able to find the context,
@@ -914,6 +917,7 @@
         }
 
         ibv_ack_async_event(&event);
+        pthread_mutex_unlock(&MPIDI_CH3I_RDMA_Process.async_mutex_lock[hca_num]);
     }
 }
 
Index: src/mpid/osu_ch3/channels/mrail/src/gen2/rdma_impl.h
===================================================================
--- src/mpid/osu_ch3/channels/mrail/src/gen2/rdma_impl.h	(revision 2532)
+++ src/mpid/osu_ch3/channels/mrail/src/gen2/rdma_impl.h	(working copy)
@@ -111,6 +111,7 @@
     struct ibv_srq              *srq_hndl[MAX_NUM_HCAS];
     pthread_spinlock_t          srq_post_spin_lock;
     pthread_mutex_t             srq_post_mutex_lock[MAX_NUM_HCAS];
+    pthread_mutex_t             async_mutex_lock[MAX_NUM_HCAS];
     pthread_cond_t              srq_post_cond[MAX_NUM_HCAS];
     uint32_t                    srq_zero_post_counter[MAX_NUM_HCAS];
     pthread_t                   async_thread[MAX_NUM_HCAS];
Index: src/mpid/osu_ch3/channels/mrail/src/gen2/rdma_iba_init.c
===================================================================
--- src/mpid/osu_ch3/channels/mrail/src/gen2/rdma_iba_init.c	(revision 2532)
+++ src/mpid/osu_ch3/channels/mrail/src/gen2/rdma_iba_init.c	(working copy)
@@ -126,6 +126,10 @@
 
     rdma_num_rails = rdma_num_hcas * rdma_num_ports * rdma_num_qp_per_port;
 
+    for(i = 0; i < rdma_num_hcas; i++) {
+        pthread_mutex_init(&MPIDI_CH3I_RDMA_Process.async_mutex_lock[i], 0); 
+    }
+
     DEBUG_PRINT("num_qp_per_port %d, num_rails = %d\n", rdma_num_qp_per_port,
 	    rdma_num_rails);
 
@@ -776,9 +780,13 @@
 	if (MPIDI_CH3I_RDMA_Process.has_srq) {
 	    pthread_cond_destroy(&MPIDI_CH3I_RDMA_Process.srq_post_cond[i]);
 	    pthread_mutex_destroy(&MPIDI_CH3I_RDMA_Process.srq_post_mutex_lock[i]);
+
+	    pthread_mutex_lock(&MPIDI_CH3I_RDMA_Process.async_mutex_lock[i]);
 	    pthread_cancel(MPIDI_CH3I_RDMA_Process.async_thread[i]);
 	    pthread_join(MPIDI_CH3I_RDMA_Process.async_thread[i], NULL);
 	    err = ibv_destroy_srq(MPIDI_CH3I_RDMA_Process.srq_hndl[i]);
+	    pthread_mutex_unlock(&MPIDI_CH3I_RDMA_Process.async_mutex_lock[i]);
+	    pthread_mutex_destroy(&MPIDI_CH3I_RDMA_Process.async_mutex_lock[i]);
 	    if (err)
 		MPIU_Error_printf("Failed to destroy SRQ (%d)\n", err);
 	}

