[mvapich-discuss] hang at large numbers of processors

Dhabaleswar Panda panda at cse.ohio-state.edu
Tue Nov 4 09:18:34 EST 2008


Karl,

> Just FYI so that everyone is aware: we actually do propagate all user
> environment variables on Ranger, so it is sufficient to simply set the
> VIADEV parameters in your job script, as long as jobs are launched with
> ibrun.

Thanks for the clarification here.
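
Just to make that concrete, a job script along the following lines should
be enough. This is only a rough sketch: the #$ directives, core count,
queue name, executable path and input file are placeholders to adapt to
your own submission script.

   #!/bin/bash
   #$ -pe 16way 16384      # placeholder parallel environment / core count
   #$ -q normal            # placeholder queue
   #$ -l h_rt=01:00:00     # placeholder wall-clock limit

   # Exported VIADEV parameters are propagated to the MPI processes when
   # the job is launched with ibrun, so setting them here is sufficient.
   export VIADEV_USE_SHMEM_COLL=0

   ibrun ./sus <input_file>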

DK

> Karl
>
> On Nov 3, 2008, at 9:04 PM, Matthew Koop wrote:
>
> > Justin,
> >
> > Thanks for this update. Even though the backtrace shows
> > 'intra_shmem_Allreduce', it is not following the shared memory path;
> > within that function a fallback is called.
> >
> > A couple things:
> >
> > - Does it work if all shared memory collectives are turned off?
> > (VIADEV_USE_SHMEM_COLL=0; see the command sketch below)
> >
> > - Have you tried the 1.0.1 installed on TACC at all?
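> >
> > For the first question, the full command would look something like this
> > (reusing the sus executable path from your last mail; adjust as needed):
> >
> >    ibrun VIADEV_USE_SHMEM_COLL=0 ../sus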
> >
> > Matt
> >
> > On Mon, 3 Nov 2008, Justin wrote:
> >
> >> Here is an update:
> >>
> >> I am running on Ranger with the following ibrun command:
> >>
> >>    ibrun VIADEV_USE_SHMEM_BCAST=0 VIADEV_USE_SHMEM_ALLREDUCE=0 ../sus
> >>
> >> where sus is our executable.  With this I'm still occasionally seeing a
> >> hang at large numbers of processors, with the following stack trace:
> >>
> >> #0  0x00002abc19a38510 in smpi_net_lookup () at mpid_smpi.c:1381
> >> #1  0x00002abc19a38414 in MPID_SMP_Check_incoming () at mpid_smpi.c:
> >> 1360
> >> #2  0x00002abc19a5293c in MPID_DeviceCheck (blocking=7154160) at
> >> viacheck.c:505
> >> #3  0x00002abc19a3600b in MPID_RecvComplete (request=0x6d29f0,
> >> status=0x10, error_code=0xb) at mpid_recv.c:106
> >> #4  0x00002abc19a5e2f7 in MPI_Waitall (count=7154160,
> >> array_of_requests=0x10, array_of_statuses=0xb) at waitall.c:190
> >> #5  0x00002abc19a46d3c in MPI_Sendrecv (sendbuf=0x6d29f0,
> >> sendcount=16,
> >> sendtype=11, dest=11, sendtag=22046016, recvbuf=0x1506810,
> >> recvcount=1,
> >> recvtype=6, source=2912, recvtag=14, comm=130,
> >> status=0x7fff952efd2c) at
> >> sendrecv.c:98
> >> #6  0x00002abc19a24d2d in intra_Allreduce (sendbuf=0x6d29f0,
> >> recvbuf=0x10, count=4, datatype=0xb, op=22046016, comm=0x1506810) at
> >> intra_fns_new.c:5682
> >> #7  0x00002abc19a24516 in intra_shmem_Allreduce (sendbuf=0x6d29f0,
> >> recvbuf=0x10, count=1, datatype=0xb, op=22046016, comm=0x1506810) at
> >> intra_fns_new.c:6014
> >> #8  0x00002abc199ef286 in MPI_Allreduce (sendbuf=0x6d29f0,
> >> recvbuf=0x10,
> >> count=11, datatype=11, op=22046016, comm=22046736) at allreduce.c:83
> >> #9  0x00002abc18bda4f8 in _ZN6Uintah12MPIScheduler7executeEii () in
> >> /work/00975/luitjens/SCIRun/optimized/lib/
> >> libPackages_Uintah_CCA_Components_Schedulers.so
> >> #10 0x0000000007d0db10
> >>
> >> Allreduce still appears to be using the shared memory path.
> >>
> >> Do you have any more suggestions?
> >>
> >> Thanks,
> >> Justin
> >>
> >> Matthew Koop wrote:
> >>> Justin,
> >>>
> >>> I think there are a couple things here:
> >>>
> >>> 1.) Simply exporting the variables is not sufficient for the setup at
> >>> TACC. You'll need to set them the following way:
> >>>
> >>> ibrun VIADEV_USE_SHMEM_COLL=0 ./executable_name
> >>>
> >>> Since the ENVs weren't being propagated, the setting wasn't taking
> >>> effect (and that is why you still saw the shmem functions in the
> >>> backtrace).
> >>>
> >>> 2.) There was a limitation in the 1.0 versions where the shared memory
> >>> bcast implementation would hang when run on more than 1K nodes. Since
> >>> the shared memory allreduce uses a bcast internally, it hangs as well.
> >>> You can try just disabling the bcast:
> >>>
> >>> ibrun VIADEV_USE_SHMEM_BCAST=0 ./executable_name
> >>>
> >>> Let us know if this works or if you have additional questions.
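> >>>
> >>> On your earlier question about whether the MV_ or the VIADEV_ parameters
> >>> apply: for this MVAPICH 1.0 install the VIADEV_ names are the relevant
> >>> ones. As a rough check (just a sketch, not an official method), you can
> >>> also grep the MPI shared library for the parameter prefixes to see which
> >>> family it actually reads:
> >>>
> >>>    strings /opt/apps/intel10_1/mvapich-devel/1.0/lib/shared/libmpich.so | grep -E '^(VIADEV_|MV_)' | sort -u
> >>>
> >>> (The libmpich.so filename is an assumption on my part; substitute
> >>> whatever MPI shared library 'ldd ./executable_name' reports.)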
> >>>
> >>> Thanks,
> >>> Matt
> >>>
> >>> On Mon, 3 Nov 2008, Justin wrote:
> >>>
> >>>
> >>>> Hi,
> >>>>
> >>>> We are using mvapich_devel_1.0 on Ranger.  I am seeing my current lockup
> >>>> at 16,384 processors with the following stack trace:
> >>>>
> >>>> #0  0x00002b015c4f85ff in poll_rdma_buffer
> >>>> (vbuf_addr=0x7fff52849020,
> >>>> out_of_order=0x7fff52849030) at viacheck.c:206
> >>>> #1  0x00002b015c4f79ed in MPID_DeviceCheck (blocking=1384419360) at
> >>>> viacheck.c:505
> >>>> #2  0x00002b015c4db00b in MPID_RecvComplete
> >>>> (request=0x7fff52849020,
> >>>> status=0x7fff52849030, error_code=0x2b) at mpid_recv.c:106
> >>>> #3  0x00002b015c5032f7 in MPI_Waitall (count=1384419360,
> >>>> array_of_requests=0x7fff52849030, array_of_statuses=0x2b) at
> >>>> waitall.c:190
> >>>> #4  0x00002b015c4ebd3c in MPI_Sendrecv (sendbuf=0x7fff52849020,
> >>>> sendcount=1384419376, sendtype=43, dest=35, sendtag=64,
> >>>> recvbuf=0x2aaaad75d000, recvcount=1, recvtype=6, source=3585,
> >>>> recvtag=14, comm=130, status=0x7fff528491fc) at sendrecv.c:98
> >>>> #5  0x00002b015c4c9d2d in intra_Allreduce (sendbuf=0x7fff52849020,
> >>>> recvbuf=0x7fff52849030, count=4, datatype=0x23, op=64,
> >>>> comm=0x2aaaad75d000) at intra_fns_new.c:5682
> >>>> #6  0x00002b015c4c9516 in intra_shmem_Allreduce
> >>>> (sendbuf=0x7fff52849020,
> >>>> recvbuf=0x7fff52849030, count=1, datatype=0x23, op=64,
> >>>> comm=0x2aaaad75d000) at intra_fns_new.c:6014
> >>>> #7  0x00002b015c494286 in MPI_Allreduce (sendbuf=0x7fff52849020,
> >>>> recvbuf=0x7fff52849030, count=43, datatype=35, op=64,
> >>>> comm=-1384787968)
> >>>> at allreduce.c:83
> >>>> #8  0x00002b015b67f4f8 in _ZN6Uintah12MPIScheduler7executeEii () in
> >>>> /work/00975/luitjens/SCIRun/optimized/lib/
> >>>> libPackages_Uintah_CCA_Components_Schedulers.so
> >>>>
> >>>> I was seeing lockups at smaller power-of-two process counts, but adding
> >>>> the following seemed to stop those:
> >>>>
> >>>> export VIADEV_USE_SHMEM_COLL=0
> >>>> export VIADEV_USE_SHMEM_ALLREDUCE=0
> >>>>
> >>>> Now I am just seeing it at 16K.  What is odd to me is that if the two
> >>>> exports above disable the shared memory optimizations, then why does the
> >>>> stack trace still show 'intra_shmem_Allreduce' being called?
> >>>>
> >>>> Here is some other info that might be useful:
> >>>>
> >>>> login3:/scratch/00975/luitjens/scalingice/ranger.med/ %mpirun_rsh
> >>>> -v
> >>>> OSU MVAPICH VERSION 1.0-SingleRail
> >>>> Build-ID: custom
> >>>>
> >>>> MPI Path:
> >>>> lrwxrwxrwx  1 tg802225 G-800594 46 May 27 14:29 include ->
> >>>> /opt/apps/intel10_1/mvapich-devel/1.0/include/
> >>>> lrwxrwxrwx  1 tg802225 G-800594 49 May 27 14:29 lib ->
> >>>> /opt/apps/intel10_1/mvapich-devel/1.0/lib/shared/
> >>>>
> >>>>
> >>>> Thanks,
> >>>> Justin
> >>>>
> >>>> Dhabaleswar Panda wrote:
> >>>>
> >>>>> Justin,
> >>>>>
> >>>>> Could you let us know which stack (MVAPICH or MVAPICH2) you are using
> >>>>> on Ranger? The two stacks name their parameters differently. Also, at
> >>>>> what exact process count do you see this problem? If you can also let
> >>>>> us know the version number of the mvapich/mvapich2 stack and/or the
> >>>>> path of the MPI library on Ranger, that will be helpful.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> DK
> >>>>>
> >>>>> On Mon, 3 Nov 2008, Justin wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>>> We are running into hangs on Ranger using mvapich that are not present
> >>>>>> on other machines.  These hangs seem to occur only on large problems
> >>>>>> with large numbers of processors.  We have run into similar problems on
> >>>>>> some LLNL machines in the past and were able to get around them by
> >>>>>> disabling the shared memory optimizations.  In those cases the problem
> >>>>>> had to do with fixed-size buffers used in the shared memory
> >>>>>> optimizations.
> >>>>>>
> >>>>>> We would like to disable shared memory on Ranger but are confused by
> >>>>>> all the different parameters dealing with shared memory optimizations.
> >>>>>> How do we know which parameters affect the run?  For example, do we use
> >>>>>> the parameters that begin with MV_ or VIADEV_?  From past conversations
> >>>>>> with support teams, I understand that the parameters that take effect
> >>>>>> vary according to the hardware/MPI build.  What is the best way to
> >>>>>> determine which parameters are active?
> >>>>>>
> >>>>>> Also, here is a stack trace from one of our hangs:
> >>>>>>
> >>>>>> .stack.i132-112.ranger.tacc.utexas.edu.16033
> >>>>>> Intel(R) Debugger for applications running on Intel(R) 64,
> >>>>>> Version
> >>>>>> 10.1-35 , Build 20080310
> >>>>>> Attaching to program:
> >>>>>> /work/00975/luitjens/SCIRun/optimized/Packages/Uintah/
> >>>>>> StandAlone/sus,
> >>>>>> process 16033
> >>>>>> Reading symbols from
> >>>>>> /work/00975/luitjens/SCIRun/optimized/Packages/Uintah/
> >>>>>> StandAlone/sus...(no
> >>>>>> debugging symbols found)...done.
> >>>>>> smpi_net_lookup () at mpid_smpi.c:1381
> >>>>>> #0  0x00002ada6b4d8510 in smpi_net_lookup () at mpid_smpi.c:1381
> >>>>>> #1  0x00002ada6b4d8414 in MPID_SMP_Check_incoming () at
> >>>>>> mpid_smpi.c:1360
> >>>>>> #2  0x00002ada6b4f293c in MPID_DeviceCheck (blocking=7154160) at
> >>>>>> viacheck.c:505
> >>>>>> #3  0x00002ada6b4d600b in MPID_RecvComplete (request=0x6d29f0,
> >>>>>> status=0x10, error_code=0x4) at mpid_recv.c:106
> >>>>>> #4  0x00002ada6b4fe2f7 in MPI_Waitall (count=7154160,
> >>>>>> array_of_requests=0x10, array_of_statuses=0x4) at waitall.c:190
> >>>>>> #5  0x00002ada6b4e6d3c in MPI_Sendrecv (sendbuf=0x6d29f0,
> >>>>>> sendcount=16,
> >>>>>> sendtype=4, dest=14, sendtag=22045696, recvbuf=0x1506680,
> >>>>>> recvcount=1,
> >>>>>> recvtype=6, source=2278, recvtag=14, comm=130,
> >>>>>> status=0x7fff4385028c) at
> >>>>>> sendrecv.c:98
> >>>>>> #6  0x00002ada6b4c4d2d in intra_Allreduce (sendbuf=0x6d29f0,
> >>>>>> recvbuf=0x10, count=4, datatype=0xe, op=22045696,
> >>>>>> comm=0x1506680) at
> >>>>>> intra_fns_new.c:5682
> >>>>>> #7  0x00002ada6b4c4516 in intra_shmem_Allreduce
> >>>>>> (sendbuf=0x6d29f0,
> >>>>>> recvbuf=0x10, count=1, datatype=0xe, op=22045696,
> >>>>>> comm=0x1506680) at
> >>>>>> intra_fns_new.c:6014
> >>>>>> #8  0x00002ada6b48f286 in MPI_Allreduce (sendbuf=0x6d29f0,
> >>>>>> recvbuf=0x10,
> >>>>>> count=4, datatype=14, op=22045696, comm=22046336) at
> >>>>>> allreduce.c:83
> >>>>>> #9  0x00002ada6a67a4f8 in _ZN6Uintah12MPIScheduler7executeEii
> >>>>>> () in
> >>>>>> /work/00975/luitjens/SCIRun/optimized/lib/
> >>>>>> libPackages_Uintah_CCA_Components_Schedulers.so
> >>>>>>
> >>>>>> In this case, what would be the most likely parameter to play with in
> >>>>>> order to stop a hang in MPI_Allreduce?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Justin
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >
>


