[mvapich-discuss] hang at large numbers of processors

Matthew Koop koop at cse.ohio-state.edu
Mon Nov 3 16:50:53 EST 2008


Justin,

I think there are a couple things here:

1.) Simply exporting the variables is not sufficient for the setup at
TACC. You'll need to set it the following way:

ibrun VIADEV_USE_SHMEM_COLL=0 ./executable_name

Since the ENVs weren't being propagated, the setting wasn't taking effect
(which is why you still saw the shmem functions in the backtrace).
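
If you want to verify the propagation directly, a quick standalone check
like the sketch below (just a minimal example program, not part of MVAPICH
itself) prints what each rank actually sees for the variable:

/* check_env.c -- each rank reports the value of VIADEV_USE_SHMEM_COLL it sees */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    const char *val;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* getenv() returns NULL if the variable never reached this process */
    val = getenv("VIADEV_USE_SHMEM_COLL");
    printf("rank %d: VIADEV_USE_SHMEM_COLL=%s\n", rank, val ? val : "(unset)");

    MPI_Finalize();
    return 0;
}

Running it once with 'export' alone and once with the variable on the ibrun
command line should show the difference.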

2.) There was a limitation in the 1.0 versions: when the shared memory
bcast implementation was run on more than 1K nodes, it could hang. Since
the shared memory allreduce uses a bcast internally, it hangs as well. You
can try disabling just the bcast:

ibrun VIADEV_USE_SHMEM_BCAST=0 ./executable_name
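
If you would like to isolate the collective from the application, a small
reproducer along the lines of the sketch below (our assumption being that
the hang can be triggered outside your code as well) can be run at the same
process count with and without VIADEV_USE_SHMEM_BCAST=0:

/* allreduce_test.c -- repeated MPI_Allreduce to probe for hangs at scale */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i;
    double in = 1.0, out = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 1000; i++) {
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0 && i % 100 == 0)
            printf("iteration %d completed\n", i);
    }

    MPI_Finalize();
    return 0;
}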

Let us know if this works or if you have additional questions.

Thanks,
Matt

On Mon, 3 Nov 2008, Justin wrote:

> Hi,
>
> We are using mvapich_devel_1.0 on Ranger.  I am currently seeing a lockup
> at 16,384 processors with the following stack trace:
>
> #0  0x00002b015c4f85ff in poll_rdma_buffer (vbuf_addr=0x7fff52849020,
> out_of_order=0x7fff52849030) at viacheck.c:206
> #1  0x00002b015c4f79ed in MPID_DeviceCheck (blocking=1384419360) at
> viacheck.c:505
> #2  0x00002b015c4db00b in MPID_RecvComplete (request=0x7fff52849020,
> status=0x7fff52849030, error_code=0x2b) at mpid_recv.c:106
> #3  0x00002b015c5032f7 in MPI_Waitall (count=1384419360,
> array_of_requests=0x7fff52849030, array_of_statuses=0x2b) at waitall.c:190
> #4  0x00002b015c4ebd3c in MPI_Sendrecv (sendbuf=0x7fff52849020,
> sendcount=1384419376, sendtype=43, dest=35, sendtag=64,
> recvbuf=0x2aaaad75d000, recvcount=1, recvtype=6, source=3585,
> recvtag=14, comm=130, status=0x7fff528491fc) at sendrecv.c:98
> #5  0x00002b015c4c9d2d in intra_Allreduce (sendbuf=0x7fff52849020,
> recvbuf=0x7fff52849030, count=4, datatype=0x23, op=64,
> comm=0x2aaaad75d000) at intra_fns_new.c:5682
> #6  0x00002b015c4c9516 in intra_shmem_Allreduce (sendbuf=0x7fff52849020,
> recvbuf=0x7fff52849030, count=1, datatype=0x23, op=64,
> comm=0x2aaaad75d000) at intra_fns_new.c:6014
> #7  0x00002b015c494286 in MPI_Allreduce (sendbuf=0x7fff52849020,
> recvbuf=0x7fff52849030, count=43, datatype=35, op=64, comm=-1384787968)
> at allreduce.c:83
> #8  0x00002b015b67f4f8 in _ZN6Uintah12MPIScheduler7executeEii () in
> /work/00975/luitjens/SCIRun/optimized/lib/libPackages_Uintah_CCA_Components_Schedulers.so
>
> I was seeing lockups at smaller powers of two but adding the following
> seemed to stop those:
>
> export VIADEV_USE_SHMEM_COLL=0
> export VIADEV_USE_SHMEM_ALLREDUCE=0
>
> Now I am just seeing it at 16K.  What is odd to me is that if the two
> exports above disable the shared memory optimizations, why does the
> stack trace still show 'intra_shmem_Allreduce' being called?
>
> Here is some other info that might be useful:
>
> login3:/scratch/00975/luitjens/scalingice/ranger.med/ %mpirun_rsh -v
> OSU MVAPICH VERSION 1.0-SingleRail
> Build-ID: custom
>
> MPI Path:
> lrwxrwxrwx  1 tg802225 G-800594 46 May 27 14:29 include ->
> /opt/apps/intel10_1/mvapich-devel/1.0/include/
> lrwxrwxrwx  1 tg802225 G-800594 49 May 27 14:29 lib ->
> /opt/apps/intel10_1/mvapich-devel/1.0/lib/shared/
>
>
> Thanks,
> Justin
>
> Dhabaleswar Panda wrote:
> > Justin,
> >
> > Could you let us know which stack (MVAPICH or MVAPICH2) you are using on
> > Ranger? These two stacks name their parameters differently. Also, at what
> > exact process count do you see this problem? If you can also let us know
> > the version number of the mvapich/mvapich2 stack and/or the path of the MPI
> > library on Ranger, that would be helpful.
> >
> > Thanks,
> >
> > DK
> >
> > On Mon, 3 Nov 2008, Justin wrote:
> >
> >
> >> We are running into hangs on Ranger using mvapich that are not present
> >> on other machines.  These hangs seem to only occur on large problems with
> >> large numbers of processors.  We have run into similar problems on some
> >> LLNL machines in the past and were able to get around them by disabling
> >> the shared memory optimizations.  In those cases the problem had to do
> >> with fixed-size buffers used in the shared memory optimizations.
> >>
> >> We would like to disable shared memory on Ranger but are confused by
> >> all the different parameters dealing with shared memory optimizations.
> >> How do we know which parameters affect the run?  For example, do we use
> >> the parameters that begin with MV_ or VIADEV_?  From past conversations
> >> I have had with support teams, the parameters that have an effect vary
> >> according to the hardware/MPI build.  What is the best way to determine
> >> which parameters are active?
> >>
> >> Also here is a stacktrace from one of our hangs:
> >>
> >> .stack.i132-112.ranger.tacc.utexas.edu.16033
> >> Intel(R) Debugger for applications running on Intel(R) 64, Version
> >> 10.1-35 , Build 20080310
> >> Attaching to program:
> >> /work/00975/luitjens/SCIRun/optimized/Packages/Uintah/StandAlone/sus,
> >> process 16033
> >> Reading symbols from
> >> /work/00975/luitjens/SCIRun/optimized/Packages/Uintah/StandAlone/sus...(no
> >> debugging symbols found)...done.
> >> smpi_net_lookup () at mpid_smpi.c:1381
> >> #0  0x00002ada6b4d8510 in smpi_net_lookup () at mpid_smpi.c:1381
> >> #1  0x00002ada6b4d8414 in MPID_SMP_Check_incoming () at mpid_smpi.c:1360
> >> #2  0x00002ada6b4f293c in MPID_DeviceCheck (blocking=7154160) at
> >> viacheck.c:505
> >> #3  0x00002ada6b4d600b in MPID_RecvComplete (request=0x6d29f0,
> >> status=0x10, error_code=0x4) at mpid_recv.c:106
> >> #4  0x00002ada6b4fe2f7 in MPI_Waitall (count=7154160,
> >> array_of_requests=0x10, array_of_statuses=0x4) at waitall.c:190
> >> #5  0x00002ada6b4e6d3c in MPI_Sendrecv (sendbuf=0x6d29f0, sendcount=16,
> >> sendtype=4, dest=14, sendtag=22045696, recvbuf=0x1506680, recvcount=1,
> >> recvtype=6, source=2278, recvtag=14, comm=130, status=0x7fff4385028c) at
> >> sendrecv.c:98
> >> #6  0x00002ada6b4c4d2d in intra_Allreduce (sendbuf=0x6d29f0,
> >> recvbuf=0x10, count=4, datatype=0xe, op=22045696, comm=0x1506680) at
> >> intra_fns_new.c:5682
> >> #7  0x00002ada6b4c4516 in intra_shmem_Allreduce (sendbuf=0x6d29f0,
> >> recvbuf=0x10, count=1, datatype=0xe, op=22045696, comm=0x1506680) at
> >> intra_fns_new.c:6014
> >> #8  0x00002ada6b48f286 in MPI_Allreduce (sendbuf=0x6d29f0, recvbuf=0x10,
> >> count=4, datatype=14, op=22045696, comm=22046336) at allreduce.c:83
> >> #9  0x00002ada6a67a4f8 in _ZN6Uintah12MPIScheduler7executeEii () in
> >> /work/00975/luitjens/SCIRun/optimized/lib/libPackages_Uintah_CCA_Components_Schedulers.so
> >>
> >> In this case what would be the likely parameter I could play with in
> >> order to potentially stop a hang in MPI_Allreduce?
> >>
> >> Thanks,
> >> Justin
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>


