[mvapich-discuss] Application failing when run with more than 64 processors

Devendar Bureddy bureddy at cse.ohio-state.edu
Tue May 7 11:56:47 EDT 2013


I am closing this issue on the mvapich-discuss list for everybody's
information. We had an offline discussion, and this issue turned out to be
in the application code.

-Devendar


On Tue, Apr 30, 2013 at 11:48 PM, Devendar Bureddy <
bureddy at cse.ohio-state.edu> wrote:

> Hi Wesley
>
> This turned out to be an issue within the application code (palabos).
> The buffer passed to MPI_Isend is deleted before MPI_Wait is called on
> its request. The following code snippets explain this.
>
>
> File: palabos/src/parallelism/sendRecvPool.cpp
> ---------------------------------------------------------------------
> void SendPoolCommunicator::startCommunication(int toProc, bool staticMessage)
> {
>       ....
>       std::vector<int> dynamicDataSizes;
>       ....
>       if (!staticMessage) {
>       ...
>              global::mpi().iSend(&dynamicDataSizes[0] ...;
>       }
>       ....
> }
>
> In the above function, "dynamicDataSizes" is a local vector used as the
> buffer in the MPI_Isend. The vector's destructor is called (the buffer is
> deallocated) when execution returns from this function. The
> corresponding wait for the above isend is called later, in the function
> below. This is incorrect per the MPI standard.
>
> void SendPoolCommunicator::finalize(bool staticMessage) {
>      ....
>      if (!staticMessage) {
>               global::mpi().wait(&entry.sizeRequest, &entry.sizeStatus);
>      }
>      ...
> }
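>
> To make the lifetime problem concrete, here is a minimal standalone sketch
> (illustrative only, not palabos code; all names are made up) that breaks the
> same rule: the MPI_Isend buffer is freed before MPI_Wait completes the
> request. Run with at least 2 processes.
>
> #include <mpi.h>
> #include <vector>
>
> // WRONG: the vector is destroyed when this function returns, but the
> // request handed back to the caller still refers to its storage.
> MPI_Request start_send(int dest)
> {
>     std::vector<int> sizes(4, 0);          // local buffer
>     MPI_Request req;
>     MPI_Isend(&sizes[0], (int)sizes.size(), MPI_INT, dest, 0,
>               MPI_COMM_WORLD, &req);
>     return req;                            // sizes is deallocated here
> }
>
> int main(int argc, char** argv)
> {
>     MPI_Init(&argc, &argv);
>     int rank;
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     if (rank == 0) {
>         MPI_Request req = start_send(1);
>         MPI_Wait(&req, MPI_STATUS_IGNORE); // waits on an already-freed buffer
>     } else if (rank == 1) {
>         int buf[4];
>         MPI_Recv(buf, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>     }
>     MPI_Finalize();
>     return 0;
> }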
>
> This works when MV2_ON_DEMAND_THRESHOLD is raised because, in that case, all
> connections are set up during MPI_Init and the small-message MPI_Isend data
> is transferred inline. So the buffer destroyed in the destructor did not show
> any effect.
>
> Whereas in the default, optimal case (> 64 processes), the IB connections
> between the processes are established on demand. In this case, the first
> data transfer initiates the connection and buffers are queued internally
> until the connection is established. In this scenario, the first
> message transferred is the MPI_Isend from palabos (from the above-mentioned
> code snippet), and that internally queued buffer is destroyed as soon as
> execution returns from SendPoolCommunicator::startCommunication().
>
> Just to verify this, I added an MPI_Wait() right after the MPI_Isend and
> commented out the MPI_Wait in SendPoolCommunicator::finalize(), as shown
> below. Things run fine as expected with this change. I'm not sure if this
> is the right way to fix the application.
>
> /nv/hp16/dbureddy3/data/palabos/src/parallelism/sendRecvPool.cpp (modified file)
> ........................
>
> void SendPoolCommunicator::startCommunication(int toProc, bool staticMessage)
> {
>       ....
>       std::vector<int> dynamicDataSizes;
>       ....
>       if (!staticMessage) {
>       ...
>         global::mpi().iSend(&dynamicDataSizes[0], dynamicDataSizes.size(), toProc,
>                             &entry.sizeRequest);
>         global::mpi().wait(&entry.sizeRequest, &entry.sizeStatus);
>
>       }
>       ....
> }
> void SendPoolCommunicator::finalize(bool staticMessage) {
>         ....
>         if (!staticMessage) {
>            // global::mpi().wait(&entry.sizeRequest, &entry.sizeStatus);
>         }
>         ....
> }
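>
> If waiting immediately after the isend defeats the purpose of the
> non-blocking send, an alternative (only a sketch with made-up names, not
> tested against palabos) is to extend the lifetime of the size buffer so it
> stays valid until the wait in finalize(), e.g. by storing the vector next
> to its request:
>
> #include <mpi.h>
> #include <utility>
> #include <vector>
>
> // Keep the buffer and its request together so the buffer outlives the send.
> struct PendingSend {
>     std::vector<int> sizes;      // owned until the request completes
>     MPI_Request      request;
> };
>
> void start_send(PendingSend& pending, std::vector<int> sizes, int dest)
> {
>     pending.sizes = std::move(sizes);        // buffer now lives in 'pending'
>     MPI_Isend(&pending.sizes[0], (int)pending.sizes.size(), MPI_INT,
>               dest, 0, MPI_COMM_WORLD, &pending.request);
> }
>
> void finish_send(PendingSend& pending)
> {
>     MPI_Wait(&pending.request, MPI_STATUS_IGNORE);   // buffer still valid here
>     pending.sizes.clear();                           // safe to release now
> }
>
> That would preserve the overlap of communication and computation that the
> original non-blocking design appears to be after.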
>
>
>
> -Devendar
>
>
>
>
>