[mvapich-discuss] Bug: deadlock between ibv_destroy_srq and async_thread

Christian Guggenberger christian.guggenberger at rzg.mpg.de
Sat May 24 11:10:58 EDT 2008


On Fri, May 23, 2008 at 09:23:37PM -0500, David_Kewley at dell.com wrote:
> 
> #0  0x00000036b2608b3a in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> #1  0x0000002a9595405b in ibv_cmd_destroy_srq (srq=0x82b370) at src/cmd.c:582
> #2  0x0000002a962b5419 in mthca_destroy_srq (srq=0x82b3bc) at src/verbs.c:475
> #3  0x0000002a9564878e in MPIDI_CH3I_CM_Finalize () from /opt/mvapich2/1.0.1/intel/10.1.015/lib/libmpich.so
> #4  0x0000002a955c053b in MPIDI_CH3_Finalize () from /opt/mvapich2/1.0.1/intel/10.1.015/lib/libmpich.so
> #5  0x0000002a95626202 in MPID_Finalize () from /opt/mvapich2/1.0.1/intel/10.1.015/lib/libmpich.so
> #6  0x0000002a955f7fee in PMPI_Finalize () from /opt/mvapich2/1.0.1/intel/10.1.015/lib/libmpich.so
> #7  0x0000002a955f7eae in pmpi_finalize_ () from /opt/mvapich2/1.0.1/intel/10.1.015/lib/libmpich.so
> #8  0x0000000000459ff8 in stoprog_ ()
> #9  0x000000000047afa6 in MAIN__ ()
> #10 0x0000000000405d62 in main ()
> 
> I think the fix is to add some sort of synchronization between
> async_thread() and the code that calls the pthread_cancel() on it.  To the
> MVAPICH developers: Do you think you can work up a fix soon, and forward
> the patch for testing?
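
For illustration only, here is one shape such synchronization could take.
It is a sketch with plain pthreads, not MVAPICH or libibverbs code: the
counters, sleep_ms() and run_demo() are made-up stand-ins, and whether this
matches what actually happens in the backtrace above is an assumption. The
libibverbs man pages state that destroying a CQ, SRQ or QP waits until all
affiliated async events have been acknowledged, so the idea is to keep
cancellation disabled from the moment an event is taken until it has been
acked, and to join the handler thread before teardown starts.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical stand-ins for the per-object event counters. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static unsigned events_reported;  /* events handed to the handler thread */
static unsigned events_completed; /* events the handler has acknowledged */

static void sleep_ms(long ms)
{
    struct timespec ts = { .tv_sec = ms / 1000,
                           .tv_nsec = (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL); /* nanosleep() is a cancellation point */
}

/* Simulated async event handler thread. */
static void *async_thread(void *arg)
{
    int old, ignored;
    (void)arg;
    for (;;) {
        /* Cancellation is honoured only here, with no event in flight. */
        pthread_testcancel();
        sleep_ms(1); /* stands in for blocking until an event is ready */

        pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &old);

        pthread_mutex_lock(&lock); /* "get" the event */
        events_reported++;
        pthread_mutex_unlock(&lock);

        sleep_ms(1); /* handle the event; a pending cancel stays deferred */

        pthread_mutex_lock(&lock); /* "ack" the event */
        events_completed++;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);

        pthread_setcancelstate(old, &ignored);
    }
    return NULL;
}

/* Simulated finalize path: stop the handler, then tear down.
 * Returns 0 on success, -1 if the thread could not be started. */
int run_demo(void)
{
    pthread_t tid;
    if (pthread_create(&tid, NULL, async_thread, NULL) != 0)
        return -1;
    sleep_ms(20);            /* let some events flow */
    pthread_cancel(tid);
    pthread_join(tid, NULL); /* handler is gone before teardown starts */

    /* Stand-in for the destroy call: wait until every event is acked. */
    pthread_mutex_lock(&lock);
    while (events_completed != events_reported)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
    assert(events_completed == events_reported);
    return 0;
}
```

Calling run_demo() from main() should return 0 promptly. If the
pthread_setcancelstate() calls are removed, the cancel can land in the
second sleep_ms(), leaving one event reported but never acked, and the wait
loop at the end then blocks forever, much like frame #0 above.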

A short-term workaround would be to disable SRQ at runtime via the
appropriate environment settings. We have seen exactly the same backtrace,
but so far the developers have not found the real cause or a fix for it.
Just curious: which distro and architecture are you using?

cheers.
 - Christian


