[mvapich-discuss] mpi send/receive hanging, how to diagnose?
Ben
Benjamin.M.Auer at nasa.gov
Fri Jul 12 10:07:00 EDT 2013
To answer the previous responders' questions:
1.) Configure and run time flags:
an mpiname -a shows
MVAPICH2 1.8.1 Thu Sep 27 18:55:23 EDT 2012 ch3:mrail
Compilation
CC: icc -fpic -m64 -DNDEBUG -DNVALGRIND -O2
CXX: icpc -fpic -m64 -DNDEBUG -DNVALGRIND -O2
F77: ifort -fpic -O2
FC: ifort -fpic -O2
Configuration
CC=icc CXX=icpc F77=ifort FC=ifort CFLAGS=-fpic -m64 CXXFLAGS=-fpic -m64
FFLAGS=-fpic FCFLAGS=-fpic --enable-f77 --enable-fc --enable-cxx
--enable-romio --enable-threads=default --with-hwloc
--disable-multi-aliases --enable-xrc=yes --enable-hybrid
--prefix=/usr/local/other/SLES11.1/mvapich2/1.8.1/intel-13.1.2.183
and I run with:
mpirun_rsh -hostfile $PBS_NODEFILE -np num_proc ./Application.x
with no other runtime flags set.
2.) I tried playing with the MV2_DEFAULT_MAX_SEND_WQE but it did not help.
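For reference, this is how I set the parameter; mpirun_rsh takes environment variables as KEY=VALUE arguments before the executable (num_proc and Application.x are placeholders from the run command above):

```shell
mpirun_rsh -hostfile $PBS_NODEFILE -np num_proc \
    MV2_DEFAULT_MAX_SEND_WQE=256 ./Application.x
```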
3.) It does not seem to be necessarily the number of sends. I managed to
make a small reproducer that exhibits the same problem as the real code.
Essentially the code in question is our own gather operation to gather a
2D distributed array using nonblocking sends and receives. For a given
number of processors (I have been testing with 3456, as that is the
target for the real code), below a certain size of the global array the
code works fine and successfully gathers. Above a certain size of the
global array (so the local arrays are now bigger), but on the same
processor count, the code hangs after a number of receives (fewer than
3456) that varies each time I run, and it fails even when trying to
gather a single array. Likewise, with the larger global array the code
works if I run on fewer processors. It seems to be some combination of
message size and message count.
Below are the relevant routines from the reproducer, if that would be of
any use. The first creates a request for the gather, sets such
information as which process will receive the gathered array and the
indices of the global array each process has, and allocates some
buffers. Then there is a gather call where the actual sending is done,
and a collective wait with the receives.
subroutine CreateRequest(MyGrid, Root, request, tag)
type(DistGrid), intent(in) :: MyGrid
integer, intent(in) :: Root
type(CommRequest), intent(inout) :: request
integer, intent(in) :: tag
integer :: nDEs ! number of processors
nDEs = MyGrid%nproc
allocate (request%i1(0:nDEs-1))
allocate (request%in(0:nDEs-1))
allocate (request%j1(0:nDEs-1))
allocate (request%jn(0:nDEs-1))
allocate (request%im(0:nDEs-1))
allocate (request%jm(0:nDEs-1))
allocate (request%RECV(0:nDEs-1))
allocate (request%SEND(0:nDEs-1))
request%amRoot = (MyGrid%myid == Root) ! root will receive global array
request%active = .true.
request%nDEs = nDEs
request%myPE = MyGrid%myid
request%comm = MPI_COMM_WORLD
request%root = root
request%tag = tag
request%i1 = MyGrid%i1
request%in = MyGrid%in
request%j1 = MyGrid%j1
request%jn = MyGrid%jn
request%im = request%in-request%i1+1 ! size of 1st dim of local array on each process
request%jm = request%jn-request%j1+1 ! size of 2nd dim of local array on each process
request%im_world = MyGrid%im_world
request%jm_world = MyGrid%jm_world
request%im0 = MyGrid%im
request%jm0 = MyGrid%jm
if(request%amRoot) then
allocate(request%DstArray(request%IM_WORLD, request%JM_WORLD))
endif
if(request%amRoot) then
allocate (request%Var(0:request%IM_WORLD*request%JM_WORLD-1))
else
allocate (request%Var(1))
endif
end subroutine CreateRequest
subroutine ArrayIGather(local_array,request)
real, intent(in) :: local_array(:,:)
type(CommRequest), intent(inout) :: request
! Local variables
integer :: i1, in, j1, jn,status
allocate(request%local_array(size(LOCAL_ARRAY,1),size(LOCAL_ARRAY,2)))
! In senders, copy input to contiguous buffer for safety
!-------------------------------------------------------
request%local_array = local_array
if(request%amRoot) then
i1 = request%i1(request%mype)
in = request%in(request%mype)
j1 = request%j1(request%mype)
jn = request%jn(request%mype)
request%DstArray(i1:in,j1:jn) = local_array
else
call MPI_ISend(request%Local_Array, size(Local_Array), MPI_REAL, &
     request%root, request%tag, request%comm, request%send(0), status)
end if
end subroutine ArrayIGather
subroutine CollectiveWait(request, DstArray)
type(CommRequest), intent(inout) :: request
real, pointer :: DstArray(:,:)
integer :: status
integer :: i,j,k,n
integer :: count
ROOT_GATH: if(request%amRoot) then
k = 0
PE_GATH: do n=0,request%nDEs-1
count = request%IM(n)*request%JM(n)
if(request%mype/=n) then
call MPI_Recv(request%var(k), count, MPI_REAL, &
     n, request%tag, request%comm, MPI_STATUS_IGNORE, status)
write(*,*)'on proc receiving from ',request%mype,n
do J=request%J1(n),request%JN(n)
do I=request%I1(n),request%IN(n)
request%DstArray(I,J) = request%var(k)
k = k+1
end do
end do
else
k = k + count
end if
end do PE_GATH
DstArray => request%DstArray
else
call MPI_WAIT(request%send(0),MPI_STATUS_IGNORE,status)
endif ROOT_GATH
deallocate(request%var )
deallocate(request%recv)
deallocate(request%send)
deallocate(request%i1 )
deallocate(request%in )
deallocate(request%j1 )
deallocate(request%jn )
deallocate(request%im )
deallocate(request%jm )
nullify(request%var )
nullify(request%send )
nullify(request%recv )
nullify(request%DstArray)
if(associated(request%Local_Array)) deallocate(request%Local_Array)
nullify(request%Local_Array)
request%active = .false.
end subroutine CollectiveWait
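To double-check the unpack bookkeeping independent of MPI: in CollectiveWait, each rank n's count = im(n)*jm(n) elements land at offset k in var and are copied column-major into the (i1:in, j1:jn) block of DstArray, with k simply accumulating counts in rank order. A minimal plain-Python sketch of that same arithmetic (the 2x2 process grid, block bounds, and test array here are made up for illustration, not taken from the real decomposition):

```python
# Check the CollectiveWait unpack logic with a toy 4x4 global array.
IM_WORLD, JM_WORLD = 4, 4
# (i1, in, j1, jn) per rank, 1-based inclusive as in the Fortran
blocks = [(1, 2, 1, 2), (3, 4, 1, 2), (1, 2, 3, 4), (3, 4, 3, 4)]

# A known global array to reconstruct: g(i, j) = 10*i + j
def g(i, j):
    return 10 * i + j

# "var" as the root would see it: each rank's block packed
# column-major, concatenated in rank order
var = []
for (i1, iN, j1, jN) in blocks:
    for j in range(j1, jN + 1):
        for i in range(i1, iN + 1):
            var.append(g(i, j))

# The unpack loop from CollectiveWait: k accumulates count per rank
dst = [[None] * (JM_WORLD + 1) for _ in range(IM_WORLD + 1)]
k = 0
for (i1, iN, j1, jN) in blocks:
    for j in range(j1, jN + 1):
        for i in range(i1, iN + 1):
            dst[i][j] = var[k]
            k += 1

ok = all(dst[i][j] == g(i, j)
         for i in range(1, IM_WORLD + 1)
         for j in range(1, JM_WORLD + 1))
print(ok)  # True: the offset arithmetic reconstructs the global array
```

So the index bookkeeping itself checks out; the hang appears to be in the transport, not the unpacking.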
On 07/05/2013 03:59 PM, Devendar Bureddy wrote:
> Hi Ben
>
> Can you please tell us the configure and run-time flags?
> Can you try with a run time parameter MV2_DEFAULT_MAX_SEND_WQE=256
> and see if that changes the behavior?
> Do you know after how many MPI_Isends (out of the large count of
> MPI_Isends) the application is waiting on the completions?
>
> -Devendar
>
>
>
> On Fri, Jul 5, 2013 at 3:15 PM, Ben <Benjamin.M.Auer at nasa.gov> wrote:
>
> Hi,
> I'm currently having what seems to be an issue with mvapich.
> I'm part of a team that maintains a global climate model mostly
> written in Fortran 90/95. At a point in the code there are a
> large number of MPI_ISends/MPI_Recvs (anywhere from thousands to
> hundreds of thousands) when the data that is distributed across
> all MPI processes has to be collected on a particular processor
> to be transformed to a different resolution before being written.
> Above a certain resolution/number of MPI processes the model
> simply hangs at the receive after the send.
> The strange thing is that the same resolution works fine at a
> lower processor count.
> For example, at the troublesome resolution the model runs on 864
> processors but hangs with 1536 processors.
> However, at a lower resolution the same code runs fine on 1536
> processors and above.
> We are currently using the Intel 13 Fortran compiler and had been
> using mvapich 1.8.1, although mvapich 1.9 also exhibits this
> behaviour. Does anyone have any suggestions on how to diagnose
> what is going on, or some parameters that we could play with that
> might help? This was perhaps a bit hand-wavy, but we are rather
> stumped at this point on how to proceed. Interestingly, we have
> gotten the code to run with other MPI stacks at the
> resolution/processor count where mvapich hangs. I can provide
> more details if needed.
> Thanks
>
> --
> Ben Auer, PhD SSAI, Scientific Programmer/Analyst
> NASA GSFC, Global Modeling and Assimilation Office
> Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
> Phone: 301-286-9176  Fax: 301-614-6246
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>
>
>
> --
> Devendar
--
Ben Auer, PhD SSAI, Scientific Programmer/Analyst
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-286-9176 Fax: 301-614-6246