[mvapich-discuss] hangs in MPI_WIN_LOCK and MPI_WIN_UNLOCK

Rajeev.c.p rajeevcp at yahoo.com
Sat Apr 18 01:25:33 EDT 2015


Hi MVAPICH Team,

We are running our application across a cluster of 8 Linux boxes using MPI. We are facing the following problem at random instants of time.

We do RDMA gets from Node[0] to Nodes 4, 5, 6 and 7. Multiple gets can be issued at the same time in the Node[0] process, since each get happens in a different thread. These gets are synchronized using a spinlock, however, so that only one thread at a time performs the complete MPI get data fetch.

Our code looks something like this:

getDatabyDMA()
{
    SpinLock::lock
    MPI_Win_lock
    MPI_Get
    MPI_Win_unlock
    SpinLock::unlock
}
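
To make the serialization above concrete, here is a minimal sketch in C of several threads funnelling their fetches through one spinlock. It is not the application's actual code: the MPI epoch is replaced by a hypothetical stand-in, fake_epoch(), so the sketch builds without an MPI installation, and spin_lock(), spin_unlock(), run_workers() and MAX_THREADS are likewise names assumed for illustration. It only shows that at most one thread is inside the lock/get/unlock sequence at any time; it does not reproduce or explain the hang.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define MAX_THREADS 16   /* assumed upper bound, for illustration only */

/* Spinlock standing in for the application's SpinLock class (assumption). */
static atomic_flag fetch_lock = ATOMIC_FLAG_INIT;
static atomic_int in_flight;       /* threads currently inside the epoch */
static atomic_int max_in_flight;   /* highest value in_flight ever reached */

static void spin_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&fetch_lock, memory_order_acquire)) {
        /* busy-wait until the holder clears the flag */
    }
}

static void spin_unlock(void)
{
    atomic_flag_clear_explicit(&fetch_lock, memory_order_release);
}

/* Hypothetical stand-in for MPI_Win_lock / MPI_Get / MPI_Win_unlock on
 * target_rank. The real calls, and the lock type passed to MPI_Win_lock,
 * are not given in the post, so they are not reproduced here. */
static void fake_epoch(int target_rank)
{
    (void)target_rank;
    int now = atomic_fetch_add(&in_flight, 1) + 1;
    int seen = atomic_load(&max_in_flight);
    /* record a new maximum; on failure 'seen' is refreshed and re-checked */
    while (now > seen &&
           !atomic_compare_exchange_weak(&max_in_flight, &seen, now)) {
    }
    atomic_fetch_sub(&in_flight, 1);
}

/* Same shape as the pseudocode: the whole epoch sits inside the spinlock. */
static void getDatabyDMA(int target_rank)
{
    spin_lock();
    fake_epoch(target_rank);
    spin_unlock();
}

static void *worker(void *arg)
{
    int iterations = *(const int *)arg;
    for (int i = 0; i < iterations; i++)
        for (int target = 4; target <= 7; target++)   /* Nodes 4..7 */
            getDatabyDMA(target);
    return NULL;
}

/* Runs nthreads workers and returns the maximum number of threads that
 * were ever inside fake_epoch() at the same time. */
int run_workers(int nthreads, int iterations)
{
    pthread_t tid[MAX_THREADS];
    int started = 0;

    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    atomic_store(&in_flight, 0);
    atomic_store(&max_in_flight, 0);

    for (; started < nthreads; started++)
        if (pthread_create(&tid[started], NULL, worker, &iterations) != 0)
            break;
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    return atomic_load(&max_in_flight);
}
```

Calling run_workers(4, 1000) from a small test program (linked with the pthread library where required) should return 1 if the spinlock serializes the fetches as described.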

We run this cluster continuously, and data fetches also happen continuously. But at random instants of time we get a hang in either MPI_Win_lock or MPI_Win_unlock, with the following stack trace:

#0  0x00007f3653561294 in __lll_lock_wait () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007f365355c619 in _L_lock_1008 () from /lib64/libpthread.so.0
No symbol table info available.
#2  0x00007f365355c42e in pthread_mutex_lock () from /lib64/libpthread.so.0
No symbol table info available.
#3  0x000000000174ff88 in MPIDI_CH3I_Progress ()
No symbol table info available.
#4  0x0000000001794e4b in MPIDI_Win_unlock ()
No symbol table info available.
#5  0x0000000001744515 in PMPI_Win_unlock ()
No symbol table info available.

Once this happens the entire cluster hangs, and we have to restart the cluster to make it work again.

We are using MVAPICH 1.9 on SUSE Linux with MPI_THREAD_MULTIPLE.

The same trace pops up when the hang happens in MPI_Win_unlock.

Any help with this issue will be highly appreciated.

Thanks and Regards
Rajeev
