[mvapich-discuss] hangs in MPI_WIN_LOCK and MPI_WIN_UNLOCK

Rajeev.c.p rajeevcp at yahoo.com
Sat Apr 18 01:39:41 EDT 2015


DK,
I forgot to mention that in my update. We tried this with both 1.9 and 2.1, and we faced the same issue in both cases. Is this a known issue for which a fix is available?
Thx,
-Rajeev 


     On Saturday, April 18, 2015 11:01 AM, "Panda, Dhabaleswar" <panda at cse.ohio-state.edu> wrote:
   

Hi,

Thanks for your note. You are using a very old version of MVAPICH2. Please update
it to the latest MVAPICH2 2.1 GA version. Let us know if you still see the issue with the
latest version and we will be happy to extend help. 

Thanks, 

DK
From: mvapich-discuss-bounces at cse.ohio-state.edu on behalf of Rajeev.c.p [rajeevcp at yahoo.com]
Sent: Saturday, April 18, 2015 1:25 AM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] hangs in MPI_WIN_LOCK and MPI_WIN_UNLOCK

Hi MVAPICH Team,

We are running our application across a cluster of 8 Linux boxes using MPI. We are facing the following problem at random instances of time.

We try to do an RDMA get from Node[0] to the following nodes: Node 4, 5, 6, 7. Multiple gets can happen at the same time in the Node[0] process, since each of the gets happens in a different thread. But these gets are synchronized using a spin lock, so that only one thread at a time performs the complete MPI get data fetch.

Our code looks something like this:

getDatabyDMA() {
    SpinLock::lock
    MPI_Win_lock
    MPI_Get
    MPI_Win_unlock
    SpinLock::unlock
}

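
The serialization described above can be sketched in plain C. This is only an illustration under stated assumptions, not the poster's code: the spin lock, the counters, and the names `get_data_by_dma`, `fetch_epoch_stub`, `run_fetch_demo`, and `total_fetches` are all hypothetical, and the MPI calls are replaced by a stub so the sketch runs without an MPI installation. It shows only that the spin lock admits one thread at a time into the fetch sequence; it does not reproduce or explain the hang.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical spin lock standing in for the poster's SpinLock class. */
static atomic_flag g_spin = ATOMIC_FLAG_INIT;

/* Bookkeeping for the demo only (static storage, so zero-initialized). */
static atomic_int g_inside;      /* threads currently inside the fetch sequence */
static atomic_int g_max_inside;  /* highest value g_inside ever reached */
static atomic_int g_fetches;     /* completed fetch sequences */

static void spin_lock(void) {
    /* Busy-wait until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&g_spin, memory_order_acquire)) { }
}

static void spin_unlock(void) {
    atomic_flag_clear_explicit(&g_spin, memory_order_release);
}

/* Stub for the passive-target epoch. In the pattern from the message this is
 * where MPI_Win_lock, MPI_Get and MPI_Win_unlock would be called. */
static void fetch_epoch_stub(void) {
    int now = atomic_fetch_add(&g_inside, 1) + 1;
    int seen = atomic_load(&g_max_inside);
    /* Record the largest concurrency level observed so far. */
    while (now > seen &&
           !atomic_compare_exchange_weak(&g_max_inside, &seen, now)) { }
    atomic_fetch_add(&g_fetches, 1);
    atomic_fetch_sub(&g_inside, 1);
}

/* Same shape as getDatabyDMA() in the message: the whole sequence is
 * wrapped in the spin lock. */
static void get_data_by_dma(void) {
    spin_lock();
    fetch_epoch_stub();
    spin_unlock();
}

static void *worker(void *arg) {
    int n = *(int *)arg;
    for (int i = 0; i < n; i++) {
        get_data_by_dma();
    }
    return NULL;
}

enum { MAX_THREADS = 16 };

/* Starts `threads` workers, each doing `per_thread` fetches, and returns the
 * highest number of threads ever observed inside the stub at once. */
int run_fetch_demo(int threads, int per_thread) {
    pthread_t tid[MAX_THREADS];
    assert(threads > 0 && threads <= MAX_THREADS);
    atomic_store(&g_inside, 0);
    atomic_store(&g_max_inside, 0);
    atomic_store(&g_fetches, 0);
    for (int i = 0; i < threads; i++) {
        int rc = pthread_create(&tid[i], NULL, worker, &per_thread);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < threads; i++) {
        pthread_join(tid[i], NULL);
    }
    return atomic_load(&g_max_inside);
}

/* Number of fetch sequences completed by the last run_fetch_demo() call. */
int total_fetches(void) {
    return atomic_load(&g_fetches);
}
```

Calling `run_fetch_demo(4, 1000)` from a test program exercises four threads contending for the lock. In a real MPI build, the stub body would be replaced by the lock, get, and unlock calls on the window, with the library initialized for MPI_THREAD_MULTIPLE.
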
We run this cluster continuously, and data fetches also happen continuously. But at random instances of time we get a hang in either MPI_Win_lock or MPI_Win_unlock, with the following stack trace:

#0  0x00007f3653561294 in __lll_lock_wait () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007f365355c619 in _L_lock_1008 () from /lib64/libpthread.so.0
No symbol table info available.
#2  0x00007f365355c42e in pthread_mutex_lock () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x000000000174ff88 in MPIDI_CH3I_Progress ()
No symbol table info available.
#4  0x0000000001794e4b in MPIDI_Win_unlock ()
No symbol table info available.
#5  0x0000000001744515 in PMPI_Win_unlock ()
No symbol table info available.

Once this happens the entire cluster hangs, and we have to restart the cluster to make it work again.

We are using MVAPICH 1.9 on SUSE Linux with MPI_THREAD_MULTIPLE.

The same trace pops up when the hang happens in MPI_Win_unlock.

Any help with this issue will be highly appreciated.

Thanks and Regards
Rajeev



  