[mvapich-discuss] hang at large numbers of processors

Matthew Koop koop at cse.ohio-state.edu
Mon Nov 3 22:04:54 EST 2008


Justin,

Thanks for this update. Even though the backtrace shows
'intra_shmem_Allreduce', it is not actually following the shared memory
path; within that function a fallback to the point-to-point algorithm is
called.
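
Roughly, the pattern looks like the sketch below. This is only an
illustration of the fallback behavior, not the actual MVAPICH 1.0 source;
the function names are stand-ins for intra_shmem_Allreduce and
intra_Allreduce.

    /* Illustrative only: shows why a "shmem" wrapper still appears in a
     * backtrace even when the shared memory path is disabled.  The wrapper
     * itself decides to fall back, so both frames end up on the stack. */
    #include <stdio.h>
    #include <stdlib.h>

    static int generic_allreduce(int value)   /* stand-in for intra_Allreduce */
    {
        printf("point-to-point path, value=%d\n", value);
        return 0;
    }

    static int shmem_allreduce(int value)     /* stand-in for intra_shmem_Allreduce */
    {
        const char *flag = getenv("VIADEV_USE_SHMEM_ALLREDUCE");
        if (flag != NULL && atoi(flag) == 0) {
            /* Shared memory disabled: fall back.  A backtrace taken inside
             * generic_allreduce still shows shmem_allreduce above it. */
            return generic_allreduce(value);
        }
        printf("shared memory path, value=%d\n", value);
        return 0;
    }

    int main(void)
    {
        return shmem_allreduce(42);
    }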

A couple things:

- Does it work if all shared memory collectives are turned off?
(VIADEV_USE_SHMEM_COLL=0; see the example command below)

- Have you tried the 1.0.1 version installed on TACC at all?
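
For example, using the same inline form and executable path as in your
earlier run, the launch line would look like:

    ibrun VIADEV_USE_SHMEM_COLL=0 ../sus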

Matt

On Mon, 3 Nov 2008, Justin wrote:

> Here is an update:
>
> I am running on ranger with the following ibrun command:
>
>     ibrun VIADEV_USE_SHMEM_BCAST=0 VIADEV_USE_SHMEM_ALLREDUCE=0 ../sus
>
> where sus is our executable.  With this I'm still occasionally seeing a
> hang at large processor counts, with the following stack trace:
>
> #0  0x00002abc19a38510 in smpi_net_lookup () at mpid_smpi.c:1381
> #1  0x00002abc19a38414 in MPID_SMP_Check_incoming () at mpid_smpi.c:1360
> #2  0x00002abc19a5293c in MPID_DeviceCheck (blocking=7154160) at
> viacheck.c:505
> #3  0x00002abc19a3600b in MPID_RecvComplete (request=0x6d29f0,
> status=0x10, error_code=0xb) at mpid_recv.c:106
> #4  0x00002abc19a5e2f7 in MPI_Waitall (count=7154160,
> array_of_requests=0x10, array_of_statuses=0xb) at waitall.c:190
> #5  0x00002abc19a46d3c in MPI_Sendrecv (sendbuf=0x6d29f0, sendcount=16,
> sendtype=11, dest=11, sendtag=22046016, recvbuf=0x1506810, recvcount=1,
> recvtype=6, source=2912, recvtag=14, comm=130, status=0x7fff952efd2c) at
> sendrecv.c:98
> #6  0x00002abc19a24d2d in intra_Allreduce (sendbuf=0x6d29f0,
> recvbuf=0x10, count=4, datatype=0xb, op=22046016, comm=0x1506810) at
> intra_fns_new.c:5682
> #7  0x00002abc19a24516 in intra_shmem_Allreduce (sendbuf=0x6d29f0,
> recvbuf=0x10, count=1, datatype=0xb, op=22046016, comm=0x1506810) at
> intra_fns_new.c:6014
> #8  0x00002abc199ef286 in MPI_Allreduce (sendbuf=0x6d29f0, recvbuf=0x10,
> count=11, datatype=11, op=22046016, comm=22046736) at allreduce.c:83
> #9  0x00002abc18bda4f8 in _ZN6Uintah12MPIScheduler7executeEii () in
> /work/00975/luitjens/SCIRun/optimized/lib/libPackages_Uintah_CCA_Components_Schedulers.so
> #10 0x0000000007d0db10
>
> The allreduce is still using shared memory.
>
> Do you have any more suggestions?
>
> Thanks,
> Justin
>
> Matthew Koop wrote:
> > Justin,
> >
> > I think there are a couple things here:
> >
> > 1.) Simply exporting the variables is not sufficient for the setup at
> > TACC. You'll need to set them the following way:
> >
> > ibrun VIADEV_USE_SHMEM_COLL=0 ./executable_name
> >
> > Since the environment variables weren't being propagated, the setting
> > wasn't taking effect (which is why you still saw the shmem functions in
> > the backtrace).
> >
> > 2.) There was a limitation in the 1.0 versions where the shared memory
> > bcast implementation would hang when run on more than 1K nodes. Since the
> > shared memory allreduce uses a bcast internally, it hangs as well; you
> > can try just disabling the bcast:
> >
> > ibrun VIADEV_USE_SHMEM_BCAST=0 ./executable_name
> >
> > Let us know if this works or if you have additional questions.
> >
> > Thanks,
> > Matt
> >
> > On Mon, 3 Nov 2008, Justin wrote:
> >
> >
> >> Hi,
> >>
> >> We are using mvapich_devel_1.0 on Ranger.  I am currently seeing a lockup
> >> at 16,384 processors with the following stack trace:
> >>
> >> #0  0x00002b015c4f85ff in poll_rdma_buffer (vbuf_addr=0x7fff52849020,
> >> out_of_order=0x7fff52849030) at viacheck.c:206
> >> #1  0x00002b015c4f79ed in MPID_DeviceCheck (blocking=1384419360) at
> >> viacheck.c:505
> >> #2  0x00002b015c4db00b in MPID_RecvComplete (request=0x7fff52849020,
> >> status=0x7fff52849030, error_code=0x2b) at mpid_recv.c:106
> >> #3  0x00002b015c5032f7 in MPI_Waitall (count=1384419360,
> >> array_of_requests=0x7fff52849030, array_of_statuses=0x2b) at waitall.c:190
> >> #4  0x00002b015c4ebd3c in MPI_Sendrecv (sendbuf=0x7fff52849020,
> >> sendcount=1384419376, sendtype=43, dest=35, sendtag=64,
> >> recvbuf=0x2aaaad75d000, recvcount=1, recvtype=6, source=3585,
> >> recvtag=14, comm=130, status=0x7fff528491fc) at sendrecv.c:98
> >> #5  0x00002b015c4c9d2d in intra_Allreduce (sendbuf=0x7fff52849020,
> >> recvbuf=0x7fff52849030, count=4, datatype=0x23, op=64,
> >> comm=0x2aaaad75d000) at intra_fns_new.c:5682
> >> #6  0x00002b015c4c9516 in intra_shmem_Allreduce (sendbuf=0x7fff52849020,
> >> recvbuf=0x7fff52849030, count=1, datatype=0x23, op=64,
> >> comm=0x2aaaad75d000) at intra_fns_new.c:6014
> >> #7  0x00002b015c494286 in MPI_Allreduce (sendbuf=0x7fff52849020,
> >> recvbuf=0x7fff52849030, count=43, datatype=35, op=64, comm=-1384787968)
> >> at allreduce.c:83
> >> #8  0x00002b015b67f4f8 in _ZN6Uintah12MPIScheduler7executeEii () in
> >> /work/00975/luitjens/SCIRun/optimized/lib/libPackages_Uintah_CCA_Components_Schedulers.so
> >>
> >> I was seeing lockups at smaller powers of two but adding the following
> >> seemed to stop those:
> >>
> >> export VIADEV_USE_SHMEM_COLL=0
> >> export VIADEV_USE_SHMEM_ALLREDUCE=0
> >>
> >> Now I am only seeing it at 16K.  What is odd to me is that if the two
> >> settings above disable the shared memory optimizations, then why does the
> >> stack trace still show 'intra_shmem_Allreduce' being called?
> >>
> >> Here is some other info that might be useful:
> >>
> >> login3:/scratch/00975/luitjens/scalingice/ranger.med/ %mpirun_rsh -v
> >> OSU MVAPICH VERSION 1.0-SingleRail
> >> Build-ID: custom
> >>
> >> MPI Path:
> >> lrwxrwxrwx  1 tg802225 G-800594 46 May 27 14:29 include ->
> >> /opt/apps/intel10_1/mvapich-devel/1.0/include/
> >> lrwxrwxrwx  1 tg802225 G-800594 49 May 27 14:29 lib ->
> >> /opt/apps/intel10_1/mvapich-devel/1.0/lib/shared/
> >>
> >>
> >> Thanks,
> >> Justin
> >>
> >> Dhabaleswar Panda wrote:
> >>
> >>> Justin,
> >>>
> >>> Could you let us know which stack (MVAPICH or MVAPICH2) you are using on
> >>> Ranger? The two stacks name their parameters differently. Also, at what
> >>> exact process count do you see this problem? If you can also let us know
> >>> the version number of the mvapich/mvapich2 stack and/or the path of the
> >>> MPI library on Ranger, it will be helpful.
> >>>
> >>> Thanks,
> >>>
> >>> DK
> >>>
> >>> On Mon, 3 Nov 2008, Justin wrote:
> >>>
> >>>
> >>>
> >>>> We are running into hangs on Ranger using mvapich that are not present
> >>>> on other machines.  These hangs seem to only occur on large problems with
> >>>> large numbers of processors.  We have run into similar problems on some
> >>>> LLNL machines in the past and were able to get around them by disabling
> >>>> the shared memory optimizations.  In those cases the problem had to do
> >>>> with fixed-size buffers used in the shared memory optimizations.
> >>>>
> >>>> We would like to disable shared memory on Ranger but are confused by
> >>>> all the different parameters dealing with shared memory optimizations.
> >>>> How do we know which parameters affect the run?  For example, do we use
> >>>> the parameters that begin with MV_ or with VIADEV_?  From past
> >>>> conversations with support teams, it seems the parameters that have an
> >>>> effect vary according to the hardware/MPI build.  What is the best way
> >>>> to determine which parameters are active?
> >>>>
> >>>> Also, here is a stack trace from one of our hangs:
> >>>>
> >>>> .stack.i132-112.ranger.tacc.utexas.edu.16033
> >>>> Intel(R) Debugger for applications running on Intel(R) 64, Version
> >>>> 10.1-35 , Build 20080310
> >>>> Attaching to program:
> >>>> /work/00975/luitjens/SCIRun/optimized/Packages/Uintah/StandAlone/sus,
> >>>> process 16033
> >>>> Reading symbols from
> >>>> /work/00975/luitjens/SCIRun/optimized/Packages/Uintah/StandAlone/sus...(no
> >>>> debugging symbols found)...done.
> >>>> smpi_net_lookup () at mpid_smpi.c:1381
> >>>> #0  0x00002ada6b4d8510 in smpi_net_lookup () at mpid_smpi.c:1381
> >>>> #1  0x00002ada6b4d8414 in MPID_SMP_Check_incoming () at mpid_smpi.c:1360
> >>>> #2  0x00002ada6b4f293c in MPID_DeviceCheck (blocking=7154160) at
> >>>> viacheck.c:505
> >>>> #3  0x00002ada6b4d600b in MPID_RecvComplete (request=0x6d29f0,
> >>>> status=0x10, error_code=0x4) at mpid_recv.c:106
> >>>> #4  0x00002ada6b4fe2f7 in MPI_Waitall (count=7154160,
> >>>> array_of_requests=0x10, array_of_statuses=0x4) at waitall.c:190
> >>>> #5  0x00002ada6b4e6d3c in MPI_Sendrecv (sendbuf=0x6d29f0, sendcount=16,
> >>>> sendtype=4, dest=14, sendtag=22045696, recvbuf=0x1506680, recvcount=1,
> >>>> recvtype=6, source=2278, recvtag=14, comm=130, status=0x7fff4385028c) at
> >>>> sendrecv.c:98
> >>>> #6  0x00002ada6b4c4d2d in intra_Allreduce (sendbuf=0x6d29f0,
> >>>> recvbuf=0x10, count=4, datatype=0xe, op=22045696, comm=0x1506680) at
> >>>> intra_fns_new.c:5682
> >>>> #7  0x00002ada6b4c4516 in intra_shmem_Allreduce (sendbuf=0x6d29f0,
> >>>> recvbuf=0x10, count=1, datatype=0xe, op=22045696, comm=0x1506680) at
> >>>> intra_fns_new.c:6014
> >>>> #8  0x00002ada6b48f286 in MPI_Allreduce (sendbuf=0x6d29f0, recvbuf=0x10,
> >>>> count=4, datatype=14, op=22045696, comm=22046336) at allreduce.c:83
> >>>> #9  0x00002ada6a67a4f8 in _ZN6Uintah12MPIScheduler7executeEii () in
> >>>> /work/00975/luitjens/SCIRun/optimized/lib/libPackages_Uintah_CCA_Components_Schedulers.so
> >>>>
> >>>> In this case, which parameter would be the most likely one to adjust in
> >>>> order to stop a hang in MPI_Allreduce?
> >>>>
> >>>> Thanks,
> >>>> Justin
>


