[mvapich-discuss] Hang in MVAPICH2-2.2 in PSM with MPI_THREAD_MULTIPLE

Panda, Dhabaleswar panda at cse.ohio-state.edu
Fri Aug 11 01:58:57 EDT 2017


Hi, Adam, 

Thanks for your note. As you know, MVAPICH2 2.0 is quite old now (it was released more than three years ago, on 06/20/2014). Today, we have released MVAPICH2 2.3b. Please try the latest MVAPICH2 2.2-GA or MVAPICH2 2.3b release. If the issue persists, we will take a more detailed look at it. 

Thanks, 

DK
________________________________________
From: mvapich-discuss-bounces at cse.ohio-state.edu on behalf of Moody, Adam T. [moody20 at llnl.gov]
Sent: Friday, August 11, 2017 1:07 AM
To: Subramoni, Hari
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] Hang in MVAPICH2-2.2 in PSM with MPI_THREAD_MULTIPLE

Hi guys,
I still don't have a simple reproducer to forward to you, but I think I have tracked down the source of the problem. First, I should say that this is using MV2-2.0, not MV2-2.2 as I first thought. Here is the stack trace we see with MV2-2.0:

pthread_spin_lock
psm_irecv ----> attempts to grab psmlock from call to _psm_enter_
MPID_Irecv
MPIDU_Sched_continue
MPIDU_Sched_progress
psm_progress_wait ----> holds psmlock from call to _psm_progress_enter_
MPIR_Test_impl
PMPI_Test

I can see that in MV2-2.2, psm_progress_wait grabs a new PSM progress lock that didn't exist in MV2-2.0, so the problem is likely already fixed. I've asked the application team to update to MV2-2.2 and try again. I'll follow up if we still hit any snags.
Thanks,
-Adam
________________________________________
From: mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> on behalf of Moody, Adam T. <moody20 at llnl.gov>
Sent: Thursday, August 3, 2017 6:37:46 PM
To: Hari Subramoni
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] Hang in MVAPICH2-2.2 in PSM with MPI_THREAD_MULTIPLE

Thanks, Hari. I can reproduce it here, but I don't have a simple reproducer that I can send you yet. I'll keep working on this tomorrow under a debugger, but if you have ideas before then, I'm happy to try them.
-Adam
________________________________________
From: hari.subramoni at gmail.com <hari.subramoni at gmail.com> on behalf of Hari Subramoni <subramoni.1 at osu.edu>
Sent: Thursday, August 3, 2017 6:35:49 PM
To: Moody, Adam T.
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] Hang in MVAPICH2-2.2 in PSM with MPI_THREAD_MULTIPLE

Hi Adam,

Thanks for the report. We will take a look at it and get back to you soon.

Do you happen to have any reproducer for this issue?

Regards,
Hari.

On Aug 3, 2017 9:32 PM, "Moody, Adam T." <moody20 at llnl.gov> wrote:
Hello MVAPICH team,
We've got a user reporting a hang in MPI_Test when using MVAPICH2-2.2 for PSM. This only happens with MPI_THREAD_MULTIPLE. Although the app requests MPI_THREAD_MULTIPLE, it doesn't actually use threads in this case. The hang does not occur if they call MPI_Init instead.

The stack trace on the main thread looks like the following:

pthread_spin_lock
psm_irecv
MPID_Irecv
MPIDU_Sched_continue
MPIDU_Sched_progress
psm_progress_wait
MPIR_Test_impl
PMPI_Test

I think the main thread has perhaps deadlocked itself on the primary PSM lock, but I'm not entirely clear why.

Can you see where the main thread might have grabbed the psm lock the first time based on this stack trace?

Or perhaps it bailed out of a function without releasing the lock?
Thanks,
-Adam
_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
