[mvapich-discuss] hangs in MPI_WIN_LOCK and MPI_WIN_UNLOCK

Mingzhe Li li.2192 at osu.edu
Sat Apr 18 07:38:03 EDT 2015


Hi Rajeev,

Thanks for your notes. Could you please send us a small reproducer for this
issue?

Thanks,
Mingzhe

On Sat, Apr 18, 2015 at 1:25 AM, Rajeev.c.p <rajeevcp at yahoo.com> wrote:

> Hi Mvapich Team
> We are running our application across a cluster of 8 Linux boxes using
> MPI. We are facing the following problem at random points in time.
> We issue RDMA gets (MPI_Get) from Node[0] to Node 4, 5, 6 and 7.
> Multiple gets can happen at the same time in the Node[0] process, since each
> get happens in a different thread.
> But these gets are synchronized with a spin lock, so that only one thread at
> a time performs the complete MPI get data fetch.
>
> Our code looks something like this:
> getDatabyDMA()
> {
>     SpinLock::lock
>     MPI_Win_lock
>     MPI_Get
>     MPI_Win_unlock
>     SpinLock::unlock
> }
>
> We run this cluster continuously, and data fetches also happen continuously.
> But at random points in time we get a hang in either MPI_Win_lock or
> MPI_Win_unlock, with the following stack trace:
> #0  0x00007f3653561294 in __lll_lock_wait () from /lib64/libpthread.so.0
> No symbol table info available.
> #1  0x00007f365355c619 in _L_lock_1008 () from /lib64/libpthread.so.0
> No symbol table info available.
> #2  0x00007f365355c42e in pthread_mutex_lock () from /lib64/libpthread.so.0
> No symbol table info available.
> #3  0x000000000174ff88 in MPIDI_CH3I_Progress ()
> No symbol table info available.
> #4  0x0000000001794e4b in MPIDI_Win_unlock ()
> No symbol table info available.
> #5  0x0000000001744515 in PMPI_Win_unlock ()
> No symbol table info available.
> Once this happens the entire cluster hangs, and we have to restart the
> cluster to make it work again.
> We are using Mvapich 1.9 on SUSE Linux with MPI_THREAD_MULTIPLE.
> The same trace appears when the hang happens in MPI_Win_unlock.
> Any help with this issue will be highly appreciated.
>
> Thanks and Regards
>
> Rajeev
>
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>