[mvapich-discuss] hang at large numbers of processors

Dhabaleswar Panda panda at cse.ohio-state.edu
Tue Nov 4 16:55:46 EST 2008


Karl,

> Just to make sure I'm clear, I assume we should probably use a value
> of 3936 for DEFAULT_SHMEM_BCAST_LEADERS for Ranger to support runs
> across the entire system (i.e., our max # of compute nodes)?

In MVAPICH 1.1 and MVAPICH2, we have set it to 4096 (reflecting 4K nodes)
to take care of most of the large-scale IB clusters out there today.
Setting it to 3936 should work; however, it is a non-power-of-2 value, and
we have not done in-depth testing to determine whether that leads to any
side effects. To be safe, please use 4096 for the time being. This
parameter is defined as a run-time environment variable in MVAPICH 1.1 and
MVAPICH2. Once these new releases are out (which will happen soon) and
installed on Ranger, you can run some experiments with a value of 3936 and
adjust it from there.
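
For example, once 1.1 is installed on Ranger, something along the
following lines in a job script should let you experiment with the leader
count at run time. The variable name shown below is only illustrative;
please check the 1.1 user guide for the exact name once the release is
out:

# Illustrative sketch only: a run-time counterpart of the
# DEFAULT_SHMEM_BCAST_LEADERS compile-time constant. Confirm the exact
# environment variable name in the MVAPICH 1.1 user guide.
ibrun VIADEV_SHMEM_BCAST_LEADERS=4096 ./executable_name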

Thanks,

DK


> Thanks,
>
> Karl
>
> On Nov 4, 2008, at 2:12 PM, Matthew Koop wrote:
>
> >
> >> It appears my hangs have been resolved by doing two things:
> >>
> >> 1) update from 1.0 to 1.0.1
> >> 2) disable shared memory broadcast (it would still hang at 16K in 1.0.1).
> >
> > Good to hear that it is working for you now.
> >
> >> Is number 2 fixed in 1.1?  If so, when is 1.1's release date?
> >
> > This is fixed in 1.1, which will be released next week. You can also
> > change the value at compile time in the 1.0.1 release by changing
> >
> > #define DEFAULT_SHMEM_BCAST_LEADERS 1024
> >
> > to a higher value (the maximum number of nodes you will use) in
> > src/env/initutil.c. The MPI library would have to be recompiled for
> > this change to take effect, though.
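> >
> > As a rough sketch (the source-tree layout and build script below are
> > just what a typical 1.0.1 tarball looks like; adjust to however the
> > library was configured on Ranger):
> >
> > cd mvapich-1.0.1
> > # raise the compile-time cap on shared-memory bcast leader processes
> > sed -i 's/DEFAULT_SHMEM_BCAST_LEADERS 1024/DEFAULT_SHMEM_BCAST_LEADERS 4096/' \
> >     src/env/initutil.c
> > # rebuild/reinstall with whichever make.mvapich.* script was used originally
> > ./make.mvapich.gen2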
> >
> > Thanks,
> >
> > Matt
> >
> >
> >> I will contact TACC and let them know the solution to my problem so
> >> they can relay it to others who run into something similar.
> >>
> >> Karl W. Schulz wrote:
> >>> Just FYI so that everyone is aware, we actually do propagate all
> >>> user environment variables on Ranger, so it is sufficient to simply
> >>> set VIADEV parameters in your job script as long as jobs are
> >>> launched with ibrun.
> >>>
> >>> Karl
> >>>
> >>> On Nov 3, 2008, at 9:04 PM, Matthew Koop wrote:
> >>>
> >>>> Justin,
> >>>>
> >>>> Thanks for this update. Even though the backtrace shows
> >>>> 'intra_shmem_Allreduce', it is not following the shared memory
> >>>> path; within that function a fallback is called.
> >>>>
> >>>> A couple things:
> >>>>
> >>>> - Does it work if all shared memory collectives are turned off?
> >>>> (VIADEV_USE_SHMEM_COLL=0)
> >>>>
> >>>> - Have you tried the 1.0.1 installed on TACC at all?
> >>>>
> >>>> Matt
> >>>>
> >>>> On Mon, 3 Nov 2008, Justin wrote:
> >>>>
> >>>>> Here is an update:
> >>>>>
> >>>>> I am running on Ranger with the following ibrun command:
> >>>>>
> >>>>>   ibrun VIADEV_USE_SHMEM_BCAST=0 VIADEV_USE_SHMEM_ALLREDUCE=0 ../
> >>>>> sus
> >>>>>
> >>>>> where sus is our executable.  With this I'm still occasionally
> >>>>> seeing a hang at large numbers of processors, with this stack
> >>>>> trace:
> >>>>>
> >>>>> #0  0x00002abc19a38510 in smpi_net_lookup () at mpid_smpi.c:1381
> >>>>> #1  0x00002abc19a38414 in MPID_SMP_Check_incoming () at
> >>>>> mpid_smpi.c:1360
> >>>>> #2  0x00002abc19a5293c in MPID_DeviceCheck (blocking=7154160) at
> >>>>> viacheck.c:505
> >>>>> #3  0x00002abc19a3600b in MPID_RecvComplete (request=0x6d29f0,
> >>>>> status=0x10, error_code=0xb) at mpid_recv.c:106
> >>>>> #4  0x00002abc19a5e2f7 in MPI_Waitall (count=7154160,
> >>>>> array_of_requests=0x10, array_of_statuses=0xb) at waitall.c:190
> >>>>> #5  0x00002abc19a46d3c in MPI_Sendrecv (sendbuf=0x6d29f0,
> >>>>> sendcount=16,
> >>>>> sendtype=11, dest=11, sendtag=22046016, recvbuf=0x1506810,
> >>>>> recvcount=1,
> >>>>> recvtype=6, source=2912, recvtag=14, comm=130,
> >>>>> status=0x7fff952efd2c) at
> >>>>> sendrecv.c:98
> >>>>> #6  0x00002abc19a24d2d in intra_Allreduce (sendbuf=0x6d29f0,
> >>>>> recvbuf=0x10, count=4, datatype=0xb, op=22046016,
> >>>>> comm=0x1506810) at
> >>>>> intra_fns_new.c:5682
> >>>>> #7  0x00002abc19a24516 in intra_shmem_Allreduce (sendbuf=0x6d29f0,
> >>>>> recvbuf=0x10, count=1, datatype=0xb, op=22046016,
> >>>>> comm=0x1506810) at
> >>>>> intra_fns_new.c:6014
> >>>>> #8  0x00002abc199ef286 in MPI_Allreduce (sendbuf=0x6d29f0,
> >>>>> recvbuf=0x10,
> >>>>> count=11, datatype=11, op=22046016, comm=22046736) at
> >>>>> allreduce.c:83
> >>>>> #9  0x00002abc18bda4f8 in _ZN6Uintah12MPIScheduler7executeEii ()
> >>>>> in
> >>>>> /work/00975/luitjens/SCIRun/optimized/lib/
> >>>>> libPackages_Uintah_CCA_Components_Schedulers.so
> >>>>>
> >>>>> #10 0x0000000007d0db10
> >>>>>
> >>>>> Allreduce is still using shared memory.
> >>>>>
> >>>>> Do you have any more suggestions?
> >>>>>
> >>>>> Thanks,
> >>>>> Justin
> >>>>>
> >>>>> Matthew Koop wrote:
> >>>>>> Justin,
> >>>>>>
> >>>>>> I think there are a couple things here:
> >>>>>>
> >>>>>> 1.) Simply exporting the variables is not sufficient for the
> >>>>>> setup at TACC. You'll need to set them the following way:
> >>>>>>
> >>>>>> ibrun VIADEV_USE_SHMEM_COLL=0 ./executable_name
> >>>>>>
> >>>>>> Since the ENVs weren't being propagated, the setting wasn't
> >>>>>> taking effect (and that is why you still saw the shmem functions
> >>>>>> in the backtrace).
> >>>>>>
> >>>>>> 2.) There was a limitation in the 1.0 versions where running the
> >>>>>> shared memory bcast implementation on more than 1K nodes would
> >>>>>> lead to a hang. Since the shared memory allreduce uses a bcast
> >>>>>> internally, it hangs as well. You can try just disabling the
> >>>>>> bcast:
> >>>>>>
> >>>>>> ibrun VIADEV_USE_SHMEM_BCAST=0 ./executable_name
> >>>>>>
> >>>>>> Let us know if this works or if you have additional questions.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Matt
> >>>>>>
> >>>>>> On Mon, 3 Nov 2008, Justin wrote:
> >>>>>>
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> We are using mvapich_devel_1.0 on Ranger.  I am seeing my
> >>>>>>> current lockup at 16,384 processors with the following stack
> >>>>>>> trace:
> >>>>>>>
> >>>>>>> #0  0x00002b015c4f85ff in poll_rdma_buffer
> >>>>>>> (vbuf_addr=0x7fff52849020,
> >>>>>>> out_of_order=0x7fff52849030) at viacheck.c:206
> >>>>>>> #1  0x00002b015c4f79ed in MPID_DeviceCheck
> >>>>>>> (blocking=1384419360) at
> >>>>>>> viacheck.c:505
> >>>>>>> #2  0x00002b015c4db00b in MPID_RecvComplete
> >>>>>>> (request=0x7fff52849020,
> >>>>>>> status=0x7fff52849030, error_code=0x2b) at mpid_recv.c:106
> >>>>>>> #3  0x00002b015c5032f7 in MPI_Waitall (count=1384419360,
> >>>>>>> array_of_requests=0x7fff52849030, array_of_statuses=0x2b) at
> >>>>>>> waitall.c:190
> >>>>>>> #4  0x00002b015c4ebd3c in MPI_Sendrecv (sendbuf=0x7fff52849020,
> >>>>>>> sendcount=1384419376, sendtype=43, dest=35, sendtag=64,
> >>>>>>> recvbuf=0x2aaaad75d000, recvcount=1, recvtype=6, source=3585,
> >>>>>>> recvtag=14, comm=130, status=0x7fff528491fc) at sendrecv.c:98
> >>>>>>> #5  0x00002b015c4c9d2d in intra_Allreduce
> >>>>>>> (sendbuf=0x7fff52849020,
> >>>>>>> recvbuf=0x7fff52849030, count=4, datatype=0x23, op=64,
> >>>>>>> comm=0x2aaaad75d000) at intra_fns_new.c:5682
> >>>>>>> #6  0x00002b015c4c9516 in intra_shmem_Allreduce
> >>>>>>> (sendbuf=0x7fff52849020,
> >>>>>>> recvbuf=0x7fff52849030, count=1, datatype=0x23, op=64,
> >>>>>>> comm=0x2aaaad75d000) at intra_fns_new.c:6014
> >>>>>>> #7  0x00002b015c494286 in MPI_Allreduce (sendbuf=0x7fff52849020,
> >>>>>>> recvbuf=0x7fff52849030, count=43, datatype=35, op=64,
> >>>>>>> comm=-1384787968)
> >>>>>>> at allreduce.c:83
> >>>>>>> #8  0x00002b015b67f4f8 in _ZN6Uintah12MPIScheduler7executeEii
> >>>>>>> () in
> >>>>>>> /work/00975/luitjens/SCIRun/optimized/lib/
> >>>>>>> libPackages_Uintah_CCA_Components_Schedulers.so
> >>>>>>>
> >>>>>>>
> >>>>>>> I was seeing lockups at smaller powers of two, but adding the
> >>>>>>> following seemed to stop those:
> >>>>>>>
> >>>>>>> export VIADEV_USE_SHMEM_COLL=0
> >>>>>>> export VIADEV_USE_SHMEM_ALLREDUCE=0
> >>>>>>>
> >>>>>>> Now I am just seeing it at 16K.  What is odd to me is that if
> >>>>>>> the two settings above disable the shared memory optimizations,
> >>>>>>> then why does the stack trace still show 'intra_shmem_Allreduce'
> >>>>>>> being called?
> >>>>>>>
> >>>>>>> Here is some other info that might be useful:
> >>>>>>>
> >>>>>>> login3:/scratch/00975/luitjens/scalingice/ranger.med/
> >>>>>>> %mpirun_rsh -v
> >>>>>>> OSU MVAPICH VERSION 1.0-SingleRail
> >>>>>>> Build-ID: custom
> >>>>>>>
> >>>>>>> MPI Path:
> >>>>>>> lrwxrwxrwx  1 tg802225 G-800594 46 May 27 14:29 include ->
> >>>>>>> /opt/apps/intel10_1/mvapich-devel/1.0/include/
> >>>>>>> lrwxrwxrwx  1 tg802225 G-800594 49 May 27 14:29 lib ->
> >>>>>>> /opt/apps/intel10_1/mvapich-devel/1.0/lib/shared/
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Justin
> >>>>>>>
> >>>>>>> Dhabaleswar Panda wrote:
> >>>>>>>
> >>>>>>>> Justin,
> >>>>>>>>
> >>>>>>>> Could you let us know which stack (MVAPICH or MVAPICH2) you are
> >>>>>>>> using on Ranger? These two stacks name their parameters
> >>>>>>>> differently. Also, at what exact process count do you see this
> >>>>>>>> problem? If you can also let us know the version number of the
> >>>>>>>> mvapich/mvapich2 stack and/or the path of the MPI library on
> >>>>>>>> Ranger, it will be helpful.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> DK
> >>>>>>>>
> >>>>>>>> On Mon, 3 Nov 2008, Justin wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> We are running into hangs on Ranger using mvapich that are not
> >>>>>>>>> present on other machines.  These hangs seem to occur only on
> >>>>>>>>> large problems with large numbers of processors.  We have run
> >>>>>>>>> into similar problems on some LLNL machines in the past and
> >>>>>>>>> were able to get around them by disabling the shared memory
> >>>>>>>>> optimizations.  In those cases the problem had to do with
> >>>>>>>>> fixed-size buffers used in the shared memory optimizations.
> >>>>>>>>>
> >>>>>>>>> We would like to disable shared memory on Ranger but are
> >>>>>>>>> confused by all the different parameters dealing with shared
> >>>>>>>>> memory optimizations.  How do we know which parameters affect
> >>>>>>>>> the run?  For example, do we use the parameters that begin
> >>>>>>>>> with MV_ or VIADEV_?  From past conversations I have had with
> >>>>>>>>> support teams, the parameters that have an effect vary
> >>>>>>>>> according to the hardware/MPI build.  What is the best way to
> >>>>>>>>> determine which parameters are active?
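> >>>>>>>>>
> >>>>>>>>> The only rough check I can think of is grepping the MPI
> >>>>>>>>> library itself for the parameter strings (this assumes the
> >>>>>>>>> VIADEV_/MV_ names are embedded as plain text; the library file
> >>>>>>>>> name below is just a guess for our install):
> >>>>>>>>>
> >>>>>>>>> strings /opt/apps/intel10_1/mvapich-devel/1.0/lib/shared/libmpich.so \
> >>>>>>>>>     | grep -E '^(VIADEV|MV)_' | sort -u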
> >>>>>>>>>
> >>>>>>>>> Also here is a stacktrace from one of our hangs:
> >>>>>>>>>
> >>>>>>>>> .stack.i132-112.ranger.tacc.utexas.edu.16033
> >>>>>>>>> Intel(R) Debugger for applications running on Intel(R) 64,
> >>>>>>>>> Version
> >>>>>>>>> 10.1-35 , Build 20080310
> >>>>>>>>> Attaching to program:
> >>>>>>>>> /work/00975/luitjens/SCIRun/optimized/Packages/Uintah/
> >>>>>>>>> StandAlone/sus,
> >>>>>>>>>
> >>>>>>>>> process 16033
> >>>>>>>>> Reading symbols from
> >>>>>>>>> /work/00975/luitjens/SCIRun/optimized/Packages/Uintah/
> >>>>>>>>> StandAlone/sus...(no
> >>>>>>>>>
> >>>>>>>>> debugging symbols found)...done.
> >>>>>>>>> smpi_net_lookup () at mpid_smpi.c:1381
> >>>>>>>>> #0  0x00002ada6b4d8510 in smpi_net_lookup () at mpid_smpi.c:
> >>>>>>>>> 1381
> >>>>>>>>> #1  0x00002ada6b4d8414 in MPID_SMP_Check_incoming () at
> >>>>>>>>> mpid_smpi.c:1360
> >>>>>>>>> #2  0x00002ada6b4f293c in MPID_DeviceCheck
> >>>>>>>>> (blocking=7154160) at
> >>>>>>>>> viacheck.c:505
> >>>>>>>>> #3  0x00002ada6b4d600b in MPID_RecvComplete (request=0x6d29f0,
> >>>>>>>>> status=0x10, error_code=0x4) at mpid_recv.c:106
> >>>>>>>>> #4  0x00002ada6b4fe2f7 in MPI_Waitall (count=7154160,
> >>>>>>>>> array_of_requests=0x10, array_of_statuses=0x4) at waitall.c:
> >>>>>>>>> 190
> >>>>>>>>> #5  0x00002ada6b4e6d3c in MPI_Sendrecv (sendbuf=0x6d29f0,
> >>>>>>>>> sendcount=16,
> >>>>>>>>> sendtype=4, dest=14, sendtag=22045696, recvbuf=0x1506680,
> >>>>>>>>> recvcount=1,
> >>>>>>>>> recvtype=6, source=2278, recvtag=14, comm=130,
> >>>>>>>>> status=0x7fff4385028c) at
> >>>>>>>>> sendrecv.c:98
> >>>>>>>>> #6  0x00002ada6b4c4d2d in intra_Allreduce (sendbuf=0x6d29f0,
> >>>>>>>>> recvbuf=0x10, count=4, datatype=0xe, op=22045696,
> >>>>>>>>> comm=0x1506680) at
> >>>>>>>>> intra_fns_new.c:5682
> >>>>>>>>> #7  0x00002ada6b4c4516 in intra_shmem_Allreduce
> >>>>>>>>> (sendbuf=0x6d29f0,
> >>>>>>>>> recvbuf=0x10, count=1, datatype=0xe, op=22045696,
> >>>>>>>>> comm=0x1506680) at
> >>>>>>>>> intra_fns_new.c:6014
> >>>>>>>>> #8  0x00002ada6b48f286 in MPI_Allreduce (sendbuf=0x6d29f0,
> >>>>>>>>> recvbuf=0x10,
> >>>>>>>>> count=4, datatype=14, op=22045696, comm=22046336) at
> >>>>>>>>> allreduce.c:83
> >>>>>>>>> #9  0x00002ada6a67a4f8 in
> >>>>>>>>> _ZN6Uintah12MPIScheduler7executeEii () in
> >>>>>>>>> /work/00975/luitjens/SCIRun/optimized/lib/
> >>>>>>>>> libPackages_Uintah_CCA_Components_Schedulers.so
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> In this case, what would be the likely parameter I could play
> >>>>>>>>> with in order to potentially stop a hang in MPI_Allreduce?
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Justin
> >>>>>>>
> >>>>>
> >>>>
> >>
> >
>


