[mvapich-discuss] hangs in MPI_WIN_LOCK and MPI_WIN_UNLOCK

Panda, Dhabaleswar panda at cse.ohio-state.edu
Sat Apr 18 01:31:36 EDT 2015


Hi,

Thanks for your note. You are using a very old version of MVAPICH2. Please update
to the latest MVAPICH2 2.1 GA release. Let us know if you still see the issue with the
latest version and we will be happy to help further.

Thanks,

DK
________________________________
From: mvapich-discuss-bounces at cse.ohio-state.edu on behalf of Rajeev.c.p [rajeevcp at yahoo.com]
Sent: Saturday, April 18, 2015 1:25 AM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] hangs in MPI_WIN_LOCK and MPI_WIN_UNLOCK

Hi MVAPICH Team,
We are running our application across a cluster of 8 Linux boxes using MPI, and we are facing the following problem at random instances of time.
We do an RDMA get from Node[0] to Nodes 4, 5, 6 and 7. Multiple gets can happen at the same time in the Node[0] process, since each get happens on a different thread.
These gets are synchronized with a spin lock, so that only one thread at a time performs the complete MPI get data fetch.

Our code looks something like the following:
getDatabyDMA()
{
    SpinLock::lock
    MPI_Win_lock
    MPI_Get
    MPI_Win_unlock
    SpinLock::unlock
}
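
A minimal self-contained sketch of this pattern is shown below. The window size, buffer names, loop count, lock type, and the pthread mutex standing in for the spin lock are all illustrative assumptions, not the application's actual code; it also assumes the job runs with at least 8 ranks so that targets 4-7 exist.

#include <mpi.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define WIN_BYTES (1 << 20)                 /* size of the exposed window (illustrative) */

static MPI_Win win;
static char *win_buf;                       /* memory every rank exposes through the window */
static pthread_mutex_t dma_lock = PTHREAD_MUTEX_INITIALIZER;   /* stands in for SpinLock */

/* One passive-target epoch per call, serialized across threads. */
static void getDatabyDMA(int target_rank, char *dst, int nbytes)
{
    pthread_mutex_lock(&dma_lock);                          /* SpinLock::lock */
    MPI_Win_lock(MPI_LOCK_SHARED, target_rank, 0, win);     /* lock type is an assumption */
    MPI_Get(dst, nbytes, MPI_BYTE, target_rank, 0, nbytes, MPI_BYTE, win);
    MPI_Win_unlock(target_rank, win);                       /* completes the get at the origin */
    pthread_mutex_unlock(&dma_lock);                        /* SpinLock::unlock */
}

/* Each thread on Node[0] repeatedly fetches from one target rank. */
static void *worker(void *arg)
{
    int target = *(int *)arg;
    char *dst = malloc(WIN_BYTES);
    for (int i = 0; i < 1000; i++)
        getDatabyDMA(target, dst, WIN_BYTES);
    free(dst);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    win_buf = malloc(WIN_BYTES);
    memset(win_buf, rank, WIN_BYTES);
    MPI_Win_create(win_buf, WIN_BYTES, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 0) {                        /* Node[0] fetches from nodes 4, 5, 6, 7 */
        int targets[4] = {4, 5, 6, 7};
        pthread_t tid[4];
        for (int t = 0; t < 4; t++)
            pthread_create(&tid[t], NULL, worker, &targets[t]);
        for (int t = 0; t < 4; t++)
            pthread_join(tid[t], NULL);
    }

    MPI_Barrier(MPI_COMM_WORLD);            /* keep targets alive until Node[0] is done */
    MPI_Win_free(&win);
    free(win_buf);
    MPI_Finalize();
    return 0;
}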

We run this cluster continuously and data fetches also happen continuously, but at random instances of time we get a hang in either MPI_Win_lock or MPI_Win_unlock, with the following stack trace:
#0  0x00007f3653561294 in __lll_lock_wait () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007f365355c619 in _L_lock_1008 () from /lib64/libpthread.so.0
No symbol table info available.
#2  0x00007f365355c42e in pthread_mutex_lock () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x000000000174ff88 in MPIDI_CH3I_Progress ()
No symbol table info available.
#4  0x0000000001794e4b in MPIDI_Win_unlock ()
No symbol table info available.
#5  0x0000000001744515 in PMPI_Win_unlock ()
No symbol table info available.
Once this happens the entire cluster hangs, and we have to restart it to make things work again.
We are using MVAPICH2 1.9 on SUSE Linux with MPI_THREAD_MULTIPLE.
The trace above is from a hang in MPI_Win_unlock; a similar trace shows up when the hang happens in MPI_Win_lock.
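
Since the behavior depends on full multi-threading support, a minimal sketch of how the granted thread level can be verified at runtime is shown below (MPI_Query_thread is standard MPI; the function name and error handling here are illustrative, not the application's actual code):

/* Illustrative check: confirm the thread support level the MPI library
 * actually granted, since MPI_Init_thread may return a level lower than
 * the one requested. */
#include <mpi.h>
#include <stdio.h>

static void check_thread_level(void)
{
    int provided;
    MPI_Query_thread(&provided);            /* level granted at init time */
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not granted (got %d)\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
}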
Any help with this issue will be highly appreciated.

Thanks and Regards

Rajeev

