[mvapich-discuss] ISend and IRecv not finishing in Multithread-MPI

Hari Subramoni subramoni.1 at osu.edu
Sun Nov 10 10:51:06 EST 2013


Hi Roshan,

Sorry to hear that the issue persists. We will try this out locally and get
back to you soon.

Thx,
Hari.


On Sun, Nov 10, 2013 at 10:42 AM, Roshan Dathathri <roshan at csa.iisc.ernet.in> wrote:

> Hi Hari,
>
> The issue persists with mvapich2-2.0a. The source files haven't changed.
> Here is the output of mpiname -a:
> MVAPICH2 2.0a Fri Aug 23 13:38:52 EDT 2013 ch3:mrail
>
> Compilation
> CC: icc    -DNDEBUG -DNVALGRIND -O2
> CXX: icpc   -DNDEBUG -DNVALGRIND -O2
> F77: gfortran -L/lib -L/lib   -O2
> FC: gfortran   -O2
>
> Configuration
> --prefix=/usr/local
>
>
>
> On Tue, Nov 5, 2013 at 3:27 PM, Uday R Bondhugula <uday at csa.iisc.ernet.in> wrote:
>
>>
>> We'll be able to try it with mvapich2-2.0a and let you know shortly.
>> Thanks.
>>
>>
>> ~ Uday
>>
>> On Monday 04 November 2013 12:22 AM, Roshan Dathathri wrote:
>>
>>> Hi Hari,
>>>
>>> Thanks for responding. It would be difficult to upgrade the software on
>>> the cluster at the moment, since many of us are targeting deadlines in
>>> the near future and we wouldn't want to change the status quo on the
>>> cluster. So we would prefer that it work with the current version.
>>>
>>> Here is the output of mpiname -a:
>>> MVAPICH2 1.8.1 Thu Sep 27 18:55:23 EDT 2012 ch3:mrail
>>>
>>> Compilation
>>> CC: gcc    -DNDEBUG -DNVALGRIND -O2
>>> CXX: c++   -DNDEBUG -DNVALGRIND -O2
>>> F77: gfortran   -O2
>>> FC: gfortran   -O2
>>>
>>> Configuration
>>>
>>> Please find attached the source files.
>>> Note: The source files use Isend() since that is sufficient for
>>> correctness. Irsend() was only used to debug the issue; all Isend()
>>> calls can be replaced with Irsend() calls if required.
>>> To compile:
>>> mpicc -cc=icpc -D__MPI -O3 -fp-model precise -ansi-alias -ipo  -openmp
>>> -openmp-link=static -D__USE_BLOCK_CYCLIC
>>> -D__DYNSCHEDULER_DEDICATED_RECEIVER -DTIME -DPOLYBENCH_USE_SCALAR_LB
>>> -DPOLYBENCH_TIME polybench.c cholesky.dist_dynsched.c
>>> sigma_cholesky.dist_dynsched.c pi_cholesky.dist_dynsched.c polyrt.c -o
>>> dist_dynsched -ltbb -lm
>>> Optional flags (for generating debug logs): -D__DEBUG_FLUSH
>>> -D__DYNSCHEDULER_DEBUG_PRINT -D__DYNSCHEDULER_MORE_DEBUG_PRINT
>>> Example usage:
>>> mpirun_rsh  -np 32 -hostfile hosts MV2_ENABLE_AFFINITY=0
>>> OMP_NUM_THREADS=8 ./dist_dynsched 2> out_dist_dynsched
>>>
>>>
>>>
>>> On Sun, Nov 3, 2013 at 8:00 PM, Hari Subramoni <subramoni.1 at osu.edu> wrote:
>>>
>>>     Hi Roshan,
>>>
>>>     MVAPICH2-1.8.1 is rather old, and we have gone through multiple
>>>     MPICH releases since then as well. Could you please try the latest
>>>     version of MVAPICH2 (MVAPICH2-2.0a) and let us know if the issue
>>>     still occurs? The latest version of the code is available for
>>>     download from the following site -
>>>     http://mvapich.cse.ohio-state.edu/download/mvapich2/.
>>>
>>>     In the meantime, we can also try to reproduce the error on our
>>>     side. Could you please send us your code along with detailed build
>>>     instructions? Could you also let us know how you built your
>>>     version of MVAPICH2? You can execute mpiname -a to obtain the
>>>     MVAPICH2 build information.
>>>
>>>     Thanks,
>>>     Hari.
>>>
>>>     ------------
>>>
>>>     Hi,
>>>
>>>     I am running a multi-threaded MPI program using MVAPICH2 1.8.1
>>>     with MV2_ENABLE_AFFINITY=0. On each MPI node there are multiple
>>>     threads: one of them posts Irecv() to receive data from other
>>>     nodes, while the rest may post Irsend() (ready mode) to send
>>>     data to the other nodes. Each thread periodically checks whether
>>>     its posted communication calls have completed using Test(). The
>>>     application hangs because some of the posted sends and receives
>>>     never complete. Here are the statistics collected from debug
>>>     logs (one per node) generated from an execution of the program:
>>>     Across all nodes, the total number of:
>>>     Irsend() posted: 50339
>>>     Irecv() posted with a matching Irsend(): 50339 (since it is ready
>>>     mode; more Irecv() could have been posted)
>>>     Irsend() completed: 48062
>>>     Irecv() completed: 47296
>>>     For multiple runs on the same number of nodes, this behavior is
>>>     consistent; though the actual numbers vary a lot, the relative
>>>     difference does not vary by much.
>>>     The behavior is similar if Irsend() is replaced with Issend() or
>>>     Isend(). The return values of all MPI calls are checked for
>>>     errors, and none of the calls returned an error for the execution
>>>     in question.
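>>>
>>>     To illustrate the pattern, here is a minimal sketch of the sender
>>>     side (not the actual application code; the buffer, destination,
>>>     and tag are placeholders). It assumes MPI was initialized with
>>>     MPI_THREAD_MULTIPLE, since several threads make MPI calls
>>>     concurrently:
>>>
>>>     #include <mpi.h>
>>>
>>>     /* at startup, on the main thread:
>>>        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided); */
>>>
>>>     /* post a ready-mode send and poll its completion with Test() */
>>>     void send_task(double *buf, int count, int dest, int tag)
>>>     {
>>>         MPI_Request req;
>>>         int done = 0;
>>>
>>>         /* ready mode: the matching Irecv() must already be posted */
>>>         MPI_Irsend(buf, count, MPI_DOUBLE, dest, tag,
>>>                    MPI_COMM_WORLD, &req);
>>>
>>>         while (!done) {
>>>             /* the real program does other work between these checks */
>>>             MPI_Test(&req, &done, MPI_STATUS_IGNORE);
>>>         }
>>>     }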
>>>
>>>     What could be the cause of this unexpected behavior? Are there any
>>>     compiler or runtime flags that would help debug the issue?
>>>
>>>     Machine information:
>>>     32-node InfiniBand cluster of dual-SMP Xeon servers. Each node on the
>>>     cluster consists of two quad-core Intel Xeon E5430 2.66 GHz
>>> processors
>>>     with 12 MB L2 cache and 16 GB RAM. The InfiniBand host adapter is a
>>>     Mellanox MT25204 (InfiniHost III Lx HCA).
>>>     The program was run on 32 nodes with 8 OpenMP threads on each node.
>>>
>>>     Application information:
>>>     A single thread on each node posts multiple anonymous Irecv()
>>>     preemptively. Once it is receives data, it can produce tasks which
>>> need
>>>     to be computed. The rest of the threads consume/compute these tasks,
>>>     and can produce more tasks and post multiple Irsend().
>>>     There is no wait or sleep anywhere in the program; the threads are
>>>     spinning or busy-waiting.
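>>>
>>>     A rough sketch of the receiver thread (again with placeholder
>>>     names; the "anonymous" receives are assumed to use MPI_ANY_SOURCE,
>>>     and all_tasks_done()/produce_tasks() stand in for application
>>>     logic):
>>>
>>>     #include <mpi.h>
>>>
>>>     #define NUM_RECVS 16      /* placeholder sizes */
>>>     #define BUF_SIZE  1024
>>>
>>>     int  all_tasks_done(void);                        /* application-defined */
>>>     void produce_tasks(double *buf, MPI_Status *st);  /* application-defined */
>>>
>>>     void receiver_loop(void)
>>>     {
>>>         static double bufs[NUM_RECVS][BUF_SIZE];
>>>         MPI_Request reqs[NUM_RECVS];
>>>         MPI_Status status;
>>>         int i, flag;
>>>
>>>         /* pre-post the anonymous receives */
>>>         for (i = 0; i < NUM_RECVS; i++)
>>>             MPI_Irecv(bufs[i], BUF_SIZE, MPI_DOUBLE, MPI_ANY_SOURCE,
>>>                       MPI_ANY_TAG, MPI_COMM_WORLD, &reqs[i]);
>>>
>>>         while (!all_tasks_done()) {
>>>             for (i = 0; i < NUM_RECVS; i++) {
>>>                 MPI_Test(&reqs[i], &flag, &status);
>>>                 if (flag) {
>>>                     /* hand the data off to produce new tasks, then
>>>                        re-post the receive */
>>>                     produce_tasks(bufs[i], &status);
>>>                     MPI_Irecv(bufs[i], BUF_SIZE, MPI_DOUBLE, MPI_ANY_SOURCE,
>>>                               MPI_ANY_TAG, MPI_COMM_WORLD, &reqs[i]);
>>>                 }
>>>             }
>>>         }
>>>     }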
>>>
>>>     I can share the debug logs if required. Each log is a text file
>>>     of around 6 MB with detailed information on the execution on that
>>>     node. I can also share the source files if required. All the
>>>     source files put together would be a few thousand lines of code.
>>>
>>>     Please let me know if you need more information.
>>>
>>>     --
>>>     Thanks,
>>>     Roshan
>>>
>>>
>>>
>>> --
>>> Thanks,
>>> Roshan
>>>
>>
>>
>
>
> --
> Thanks,
> Roshan
>