[mvapich-discuss] Hang in MVAPICH2-2.2 in PSM with MPI_THREAD_MULTIPLE

Moody, Adam T. moody20 at llnl.gov
Fri Aug 11 01:07:34 EDT 2017


Hi guys,
I still don't have a simple reproducer to forward to you, but I think I've tracked down the source of the problem.  First, I should say that this is using MV2-2.0, not MV2-2.2 as I first thought.  Here is the stack trace we see with MV2-2.0:

pthread_spin_lock
psm_irecv ----> attempts to grab psmlock from call to _psm_enter_
MPID_Irecv
MPIDU_Sched_continue
MPIDU_Sched_progress
psm_progress_wait ----> holds psmlock from call to _psm_progress_enter_
MPIR_Test_impl
PMPI_Test

I can see that in MV2-2.2 psm_progress_wait grabs a new psm progress lock which didn't exist in MV2-2.0, so it's likely that the problem is already fixed.  I've asked the application team to update to MV2-2.2 and try again.  I'll follow up if we still hit any snags.
Thanks,
-Adam
________________________________________
From: mvapich-discuss <mvapich-discuss-bounces at cse.ohio-state.edu> on behalf of Moody, Adam T. <moody20 at llnl.gov>
Sent: Thursday, August 3, 2017 6:37:46 PM
To: Hari Subramoni
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] Hang in MVAPICH2-2.2 in PSM with MPI_THREAD_MULTIPLE

Thanks, Hari.  I can reproduce it here, but I don't have a simple reproducer that I can send you yet.  I'll keep working on this tomorrow under a debugger, but if you have ideas before then, I'm happy to try them.
-Adam
________________________________________
From: hari.subramoni at gmail.com <hari.subramoni at gmail.com> on behalf of Hari Subramoni <subramoni.1 at osu.edu>
Sent: Thursday, August 3, 2017 6:35:49 PM
To: Moody, Adam T.
Cc: mvapich-discuss at cse.ohio-state.edu
Subject: Re: [mvapich-discuss] Hang in MVAPICH2-2.2 in PSM with MPI_THREAD_MULTIPLE

Hi Adam,

Thanks for the report. We will take a look at it and get back to you soon.

Do you happen to have any reproducer for this issue?

Regards,
Hari.

On Aug 3, 2017 9:32 PM, "Moody, Adam T." <moody20 at llnl.gov> wrote:
Hello MVAPICH team,
We've got a user reporting a hang in MPI_Test when using MVAPICH2-2.2 for PSM.  This only happens when using MPI_THREAD_MULTIPLE.  Although the app requests MPI_THREAD_MULTIPLE, they don't actually use threads in this case.  They do not hit the hang if they call MPI_Init instead.

The stack trace on the main thread looks like the following:

pthread_spin_lock
psm_irecv
MPID_Irecv
MPIDU_Sched_continue
MPIDU_Sched_progress
psm_progress_wait
MPIR_Test_impl
PMPI_Test

I think the main thread has perhaps deadlocked itself on the primary PSM lock, but I'm not entirely clear why.

Can you see where the main thread might have grabbed the psm lock the first time based on this stack trace?

Or perhaps it bailed out of a function w/o releasing the lock?
Thanks,
-Adam
_______________________________________________
mvapich-discuss mailing list
mvapich-discuss at cse.ohio-state.edu
http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
