[mvapich-discuss] MPI send/receive hanging, how to diagnose?

Ben Benjamin.M.Auer at nasa.gov
Fri Jul 12 10:07:00 EDT 2013


To answer the previous responders' questions:

1.) Configure and run time flags:
mpiname -a shows:

MVAPICH2 1.8.1 Thu Sep 27 18:55:23 EDT 2012 ch3:mrail

Compilation
CC: icc -fpic -m64   -DNDEBUG -DNVALGRIND -O2
CXX: icpc -fpic -m64  -DNDEBUG -DNVALGRIND -O2
F77: ifort -fpic  -O2
FC: ifort -fpic  -O2

Configuration
CC=icc CXX=icpc F77=ifort FC=ifort CFLAGS=-fpic -m64 CXXFLAGS=-fpic -m64 
FFLAGS=-fpic FCFLAGS=-fpic --enable-f77 --enable-fc --enable-cxx 
--enable-romio --enable-threads=default --with-hwloc 
--disable-multi-aliases --enable-xrc=yes --enable-hybrid 
--prefix=/usr/local/other/SLES11.1/mvapich2/1.8.1/intel-13.1.2.183

and I run with mpirun_rsh -hostfile $PBS_NODEFILE -np num_proc 
./Application.x

with no other runtime flags set.


2.) I tried playing with MV2_DEFAULT_MAX_SEND_WQE, but it did not help.

3.) It does not necessarily seem to be the number of sends. I managed to 
make a small reproducer that exhibits the same problem as the real code. 
Essentially, the code in question is our own gather operation, which 
collects a 2D distributed array using nonblocking sends and receives. 
For a given number of processors (I have been testing with 3456, as that 
is the target for the real code), the code works fine and gathers 
successfully below a certain size of the global array. Above a certain 
size of the global array (so the local arrays are now bigger), but on 
the same processor count, the code hangs after a number of receives 
(fewer than 3456) that varies each time I run, and it fails even when 
trying to gather a single array. Likewise, with the larger global array 
the code works if I run on fewer processors. It seems to be some 
combination of message size and count.

Below are the relevant routines from the reproducer, in case they are of 
any use. The first creates a request for the gather: it records which 
process will receive the gathered array and the indices of the global 
array each process owns, and it allocates some buffers. Then there is a 
gather call, where the actual sending is done, and a collective wait, 
where the receives happen. (A rough sketch of how these routines are 
driven is included after the last routine.)
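
The derived types CommRequest and DistGrid are not shown here; 
reconstructed from the fields the routines use, they would look roughly 
like the sketch below (an approximation only, the actual definitions may 
differ):

    ! Sketch of the derived types as implied by the routines that follow;
    ! the real definitions are not included in this post and may differ.
    type DistGrid
       integer          :: nproc, myid                 ! communicator size and rank
       integer          :: im_world, jm_world          ! global array dimensions
       integer          :: im, jm                      ! local tile dimensions
       integer, pointer :: i1(:), in(:), j1(:), jn(:)  ! global index ranges per rank
    end type DistGrid

    type CommRequest
       logical          :: amRoot, active
       integer          :: nDEs, myPE, comm, root, tag
       integer          :: im_world, jm_world, im0, jm0
       integer, pointer :: i1(:), in(:), j1(:), jn(:)  ! global index ranges per rank
       integer, pointer :: im(:), jm(:)                ! local tile sizes per rank
       integer, pointer :: RECV(:), SEND(:)            ! MPI request handles
       real,    pointer :: Var(:)                      ! contiguous receive buffer (root only)
       real,    pointer :: DstArray(:,:)               ! assembled global array (root only)
       real,    pointer :: Local_Array(:,:)            ! contiguous copy of the local tile
    end type CommRequest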

    subroutine CreateRequest(MyGrid, Root, request, tag)
       type(DistGrid), intent(in) :: MyGrid
       integer,        intent(in) :: Root
       type(CommRequest), intent(inout) :: request
       integer,        intent(in) :: tag

       integer :: nDEs ! number of processors

       nDEs = MyGrid%nproc

       allocate (request%i1(0:nDEs-1))
       allocate (request%in(0:nDEs-1))
       allocate (request%j1(0:nDEs-1))
       allocate (request%jn(0:nDEs-1))
       allocate (request%im(0:nDEs-1))
       allocate (request%jm(0:nDEs-1))
       allocate (request%RECV(0:nDEs-1))
       allocate (request%SEND(0:nDEs-1))

       request%amRoot        =  (MyGrid%myid == Root)  ! root will receive global array
       request%active        = .true.
       request%nDEs          =  nDEs
       request%myPE          =  MyGrid%myid
       request%comm          =  MPI_COMM_WORLD
       request%root          =  root
       request%tag           =  tag

       request%i1 = MyGrid%i1
       request%in = MyGrid%in
       request%j1 = MyGrid%j1
       request%jn = MyGrid%jn
       request%im = request%in-request%i1+1 ! size of first dim of local array on each process
       request%jm = request%jn-request%j1+1 ! size of 2nd dim of local array on each process
       request%im_world = MyGrid%im_world
       request%jm_world = MyGrid%jm_world
       request%im0 = MyGrid%im
       request%jm0 = MyGrid%jm

       if(request%amRoot) then
          allocate(request%DstArray(request%IM_WORLD, request%JM_WORLD))
       endif

       if(request%amRoot) then
          allocate (request%Var(0:request%IM_WORLD*request%JM_WORLD-1))
       else
          allocate (request%Var(1))
       endif


    end subroutine CreateRequest

    subroutine ArrayIGather(local_array,request)
       real, intent(in) :: local_array(:,:)
       type(CommRequest), intent(inout) :: request

! Local variables

       integer  :: i1, in, j1, jn, status  ! status is used as the MPI ierror argument

       allocate(request%local_array(size(LOCAL_ARRAY,1),size(LOCAL_ARRAY,2)))

! In senders, copy input to contiguous buffer for safety
!-------------------------------------------------------

       request%local_array = local_array

       if(request%amRoot) then
          i1 = request%i1(request%mype)
          in = request%in(request%mype)
          j1 = request%j1(request%mype)
          jn = request%jn(request%mype)
          request%DstArray(i1:in,j1:jn) = local_array
       else
          call MPI_ISend(request%Local_Array, size(Local_Array), MPI_REAL, &
               request%root, request%tag, request%comm, request%send(0), status)
       end if

   end subroutine ArrayIGather

   subroutine CollectiveWait(request, DstArray)
      type(CommRequest), intent(inout) :: request
      real, pointer :: DstArray(:,:)

      integer :: status

      integer               :: i,j,k,n
      integer               :: count

        ROOT_GATH: if(request%amRoot) then
           k = 0
           PE_GATH: do n=0,request%nDEs-1
              count = request%IM(n)*request%JM(n)
              if(request%mype/=n) then
                 call MPI_Recv(request%var(k), count, MPI_REAL, &
                      n, request%tag, request%comm, MPI_STATUS_IGNORE, status)
                 write(*,*)'on proc receiving from ',request%mype,n
                 do J=request%J1(n),request%JN(n)
                    do I=request%I1(n),request%IN(n)
                       request%DstArray(I,J) = request%var(k)
                       k = k+1
                    end do
                 end do
              else
                 k = k + count
              end if
           end do PE_GATH
           DstArray => request%DstArray
        else
           call MPI_WAIT(request%send(0),MPI_STATUS_IGNORE,status)
        endif ROOT_GATH

        deallocate(request%var )
        deallocate(request%recv)
        deallocate(request%send)
        deallocate(request%i1  )
        deallocate(request%in  )
        deallocate(request%j1  )
        deallocate(request%jn  )
        deallocate(request%im  )
        deallocate(request%jm  )

        nullify(request%var     )
        nullify(request%send    )
        nullify(request%recv    )
        nullify(request%DstArray)

        if(associated(request%Local_Array)) deallocate(request%Local_Array)
        nullify(request%Local_Array)

        request%active = .false.

    end subroutine CollectiveWait
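
For completeness, the reproducer drives these three routines roughly as 
sketched below. The MPI initialization and the decomposition that fills 
MyGrid are omitted, and the driver shown here is only illustrative (the 
Root and tag values are arbitrary), not the actual test code:

    ! Illustrative driver only; not the actual reproducer main program.
    type(DistGrid)     :: MyGrid
    type(CommRequest)  :: request
    real, allocatable  :: local_array(:,:)
    real, pointer      :: DstArray(:,:)
    integer, parameter :: Root = 0, tag = 17

    ! ... MPI_Init, the decomposition that fills MyGrid, and the
    !     allocation/filling of the local MyGrid%im x MyGrid%jm array go here ...

    call CreateRequest(MyGrid, Root, request, tag)  ! record indices, allocate buffers
    call ArrayIGather(local_array, request)         ! non-root ranks post their MPI_Isend
    call CollectiveWait(request, DstArray)          ! root posts the matching MPI_Recvs
    ! On the root, DstArray now points to the assembled IM_WORLD x JM_WORLD array.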


On 07/05/2013 03:59 PM, Devendar Bureddy wrote:
> Hi Ben
>
>  Can you please tell us the configure and run-time flags?
>  Can you try with a run time parameter MV2_DEFAULT_MAX_SEND_WQE=256 
> and see if that changes the behavior?
>  Do you know after how many MPI_Isends (out of the large count of 
> MPI_Isends) the application is waiting on the completions?
>
> -Devendar
>
>
>
> On Fri, Jul 5, 2013 at 3:15 PM, Ben <Benjamin.M.Auer at nasa.gov> wrote:
>
>     Hi,
>     I'm currently having what seems to be an issue with mvapich.
>     I'm part of a team that maintains a global climate model mostly
>     written in Fortran 90/95. At a point in the code, there are a
>     large number of MPI_ISends/MPI_Recvs (anywhere from thousands to
>     hundreds of thousands) when the data that is distributed
>     across all MPI processes has to be collected on
>     a particular processor to be transformed to a different resolution
>     before being written.
>     Above a certain resolution/number of MPI processes the model simply
>     hangs at the receive after the send.
>     The strange thing is that at the same resolution with a lower
>     processor count it works fine.
>     For example at the troublesome resolution the model runs on 864
>     processors but hangs with 1536 processors.
>     However, at a lower resolution the same code runs fine on 1536
>     processors and above.
>     We are currently using the Intel 13 Fortran compiler and had been
>     using MVAPICH2 1.8.1, although MVAPICH2 1.9 also exhibits this
>     behaviour. Does anyone have any suggestions on how to diagnose what
>     is going on, or some parameters that we could play with that might
>     help? This was perhaps a bit hand-wavy, but we are rather stumped
>     at this point about how to proceed. Interestingly, we have gotten the
>     code to run with other MPI stacks at the resolution/processor
>     count where MVAPICH2 hangs. I can provide more details if needed.
>     Thanks
>
>     -- 
>     Ben Auer, PhD   SSAI, Scientific Programmer/Analyst
>     NASA GSFC,  Global Modeling and Assimilation Office
>     Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD  20771
>     Phone: 301-286-9176               Fax: 301-614-6246
>
>     _______________________________________________
>     mvapich-discuss mailing list
>     mvapich-discuss at cse.ohio-state.edu
>     http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>
>
>
> -- 
> Devendar


-- 
Ben Auer, PhD   SSAI, Scientific Programmer/Analyst
NASA GSFC,  Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD  20771
Phone: 301-286-9176               Fax: 301-614-6246
