[mvapich-discuss] hang at large numbers of processors

Karl W. Schulz karl at tacc.utexas.edu
Tue Nov 4 15:42:24 EST 2008


Matt,

Just to make sure I'm clear: I assume we should use a value of 3936 for
DEFAULT_SHMEM_BCAST_LEADERS for Ranger to support runs across the entire
system (i.e., our maximum number of compute nodes)?
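
If so, here is a sketch of the edit I'd make to src/env/initutil.c before
rebuilding (assuming 3936, our node count, is indeed the right value):

  #define DEFAULT_SHMEM_BCAST_LEADERS 3936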

Thanks,

Karl

On Nov 4, 2008, at 2:12 PM, Matthew Koop wrote:

>
>> It appears my hangs have been resolved by doing two things:
>>
>> 1) updating from 1.0 to 1.0.1
>> 2) disabling shared memory broadcast (would hang on 16K in 1.0.1).
>
> Good to hear that it is working for you now.
>
>> Is number 2 fixed in 1.1?  If so, when is 1.1's release date?
>
> This is fixed in 1.1, which will be released next week. You can also
> change the value at compile time in the 1.0.1 release by changing
>
> #define DEFAULT_SHMEM_BCAST_LEADERS 1024
>
> to a higher value (however many nodes are used at maximum) in
> src/env/initutil.c. The MPI library would have to be recompiled for
> this change to take effect, though.
>
> Thanks,
>
> Matt
>
>
>> I will contact TACC and let them know the solution to my problem so
>> they can relay it to others who have a similar problem.
>>
>> Karl W. Schulz wrote:
>>> Just FYI so that everyone is aware: we actually do propagate all user
>>> environment variables on Ranger, so it is sufficient to simply set
>>> VIADEV parameters in your job script, as long as jobs are launched
>>> with ibrun.
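>>>
>>> For example, a job script stanza along these lines should be enough
>>> (just a sketch, with sus standing in for your executable):
>>>
>>>   export VIADEV_USE_SHMEM_BCAST=0
>>>   ibrun ./sus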
>>>
>>> Karl
>>>
>>> On Nov 3, 2008, at 9:04 PM, Matthew Koop wrote:
>>>
>>>> Justin,
>>>>
>>>> Thanks for this update. Even though the backtrace shows
>>>> 'intra_shmem_Allreduce', it is not following the shared memory path;
>>>> within that function a fallback is called.
>>>>
>>>> A couple things:
>>>>
>>>> - Does it work if all shared memory collectives are turned off?
>>>> (VIADEV_USE_SHMEM_COLL=0)
>>>>
>>>> - Have you tried the 1.0.1 installed on TACC at all?
>>>>
>>>> Matt
>>>>
>>>> On Mon, 3 Nov 2008, Justin wrote:
>>>>
>>>>> Here is an update:
>>>>>
>>>>> I am running on ranger with the following ibrun command:
>>>>>
>>>>>   ibrun VIADEV_USE_SHMEM_BCAST=0 VIADEV_USE_SHMEM_ALLREDUCE=0 ../sus
>>>>>
>>>>> where sus is our executable.  With this I'm still occasionally seeing
>>>>> a hang at large numbers of processors at this stack trace:
>>>>>
>>>>> #0  0x00002abc19a38510 in smpi_net_lookup () at mpid_smpi.c:1381
>>>>> #1  0x00002abc19a38414 in MPID_SMP_Check_incoming () at mpid_smpi.c:1360
>>>>> #2  0x00002abc19a5293c in MPID_DeviceCheck (blocking=7154160) at viacheck.c:505
>>>>> #3  0x00002abc19a3600b in MPID_RecvComplete (request=0x6d29f0, status=0x10, error_code=0xb) at mpid_recv.c:106
>>>>> #4  0x00002abc19a5e2f7 in MPI_Waitall (count=7154160, array_of_requests=0x10, array_of_statuses=0xb) at waitall.c:190
>>>>> #5  0x00002abc19a46d3c in MPI_Sendrecv (sendbuf=0x6d29f0, sendcount=16, sendtype=11, dest=11, sendtag=22046016, recvbuf=0x1506810, recvcount=1, recvtype=6, source=2912, recvtag=14, comm=130, status=0x7fff952efd2c) at sendrecv.c:98
>>>>> #6  0x00002abc19a24d2d in intra_Allreduce (sendbuf=0x6d29f0, recvbuf=0x10, count=4, datatype=0xb, op=22046016, comm=0x1506810) at intra_fns_new.c:5682
>>>>> #7  0x00002abc19a24516 in intra_shmem_Allreduce (sendbuf=0x6d29f0, recvbuf=0x10, count=1, datatype=0xb, op=22046016, comm=0x1506810) at intra_fns_new.c:6014
>>>>> #8  0x00002abc199ef286 in MPI_Allreduce (sendbuf=0x6d29f0, recvbuf=0x10, count=11, datatype=11, op=22046016, comm=22046736) at allreduce.c:83
>>>>> #9  0x00002abc18bda4f8 in _ZN6Uintah12MPIScheduler7executeEii () in /work/00975/luitjens/SCIRun/optimized/lib/libPackages_Uintah_CCA_Components_Schedulers.so
>>>>>
>>>>> #10 0x0000000007d0db10
>>>>>
>>>>> Allreduce is still using shared memory.
>>>>>
>>>>> Do you have any more suggestions?
>>>>>
>>>>> Thanks,
>>>>> Justin
>>>>>
>>>>> Matthew Koop wrote:
>>>>>> Justin,
>>>>>>
>>>>>> I think there are a couple things here:
>>>>>>
>>>>>> 1.) Simply exporting the variables is not sufficient for the setup
>>>>>> at TACC. You'll need to set it the following way:
>>>>>>
>>>>>> ibrun VIADEV_USE_SHMEM_COLL=0 ./executable_name
>>>>>>
>>>>>> Since the ENVs weren't being propagated, the setting wasn't taking
>>>>>> effect (and that is why you still saw the shmem functions in the
>>>>>> backtrace).
>>>>>>
>>>>>> 2.) There was a limitation in the 1.0 versions: when the shared
>>>>>> memory bcast implementation was run on more than 1K nodes, there
>>>>>> would be a hang. Since the shared memory allreduce uses a bcast
>>>>>> internally, it also hangs. You can try just disabling the bcast:
>>>>>>
>>>>>> ibrun VIADEV_USE_SHMEM_BCAST=0 ./executable_name
>>>>>>
>>>>>> Let us know if this works or if you have additional questions.
>>>>>>
>>>>>> Thanks,
>>>>>> Matt
>>>>>>
>>>>>> On Mon, 3 Nov 2008, Justin wrote:
>>>>>>
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We are using mvapich_devel_1.0 on Ranger.  I am seeing my  
>>>>>>> current
>>>>>>> lockup
>>>>>>> at 16,384 processors at the following stacktrace:
>>>>>>>
>>>>>>> #0  0x00002b015c4f85ff in poll_rdma_buffer (vbuf_addr=0x7fff52849020, out_of_order=0x7fff52849030) at viacheck.c:206
>>>>>>> #1  0x00002b015c4f79ed in MPID_DeviceCheck (blocking=1384419360) at viacheck.c:505
>>>>>>> #2  0x00002b015c4db00b in MPID_RecvComplete (request=0x7fff52849020, status=0x7fff52849030, error_code=0x2b) at mpid_recv.c:106
>>>>>>> #3  0x00002b015c5032f7 in MPI_Waitall (count=1384419360, array_of_requests=0x7fff52849030, array_of_statuses=0x2b) at waitall.c:190
>>>>>>> #4  0x00002b015c4ebd3c in MPI_Sendrecv (sendbuf=0x7fff52849020, sendcount=1384419376, sendtype=43, dest=35, sendtag=64, recvbuf=0x2aaaad75d000, recvcount=1, recvtype=6, source=3585, recvtag=14, comm=130, status=0x7fff528491fc) at sendrecv.c:98
>>>>>>> #5  0x00002b015c4c9d2d in intra_Allreduce (sendbuf=0x7fff52849020, recvbuf=0x7fff52849030, count=4, datatype=0x23, op=64, comm=0x2aaaad75d000) at intra_fns_new.c:5682
>>>>>>> #6  0x00002b015c4c9516 in intra_shmem_Allreduce (sendbuf=0x7fff52849020, recvbuf=0x7fff52849030, count=1, datatype=0x23, op=64, comm=0x2aaaad75d000) at intra_fns_new.c:6014
>>>>>>> #7  0x00002b015c494286 in MPI_Allreduce (sendbuf=0x7fff52849020, recvbuf=0x7fff52849030, count=43, datatype=35, op=64, comm=-1384787968) at allreduce.c:83
>>>>>>> #8  0x00002b015b67f4f8 in _ZN6Uintah12MPIScheduler7executeEii () in /work/00975/luitjens/SCIRun/optimized/lib/libPackages_Uintah_CCA_Components_Schedulers.so
>>>>>>>
>>>>>>>
>>>>>>> I was seeing lockups at smaller powers of two but adding the
>>>>>>> following
>>>>>>> seemed to stop those:
>>>>>>>
>>>>>>> export VIADEV_USE_SHMEM_COLL=0
>>>>>>> export VIADEV_USE_SHMEM_ALLREDUCE=0
>>>>>>>
>>>>>>> Now I am just seeing it at 16K.  What is odd to me is that if the
>>>>>>> two commands above stop the shared memory optimizations, why does
>>>>>>> the stacktrace still show 'intra_shmem_Allreduce' being called?
>>>>>>>
>>>>>>> Here is some other info that might be useful:
>>>>>>>
>>>>>>> login3:/scratch/00975/luitjens/scalingice/ranger.med/ %mpirun_rsh -v
>>>>>>> OSU MVAPICH VERSION 1.0-SingleRail
>>>>>>> Build-ID: custom
>>>>>>>
>>>>>>> MPI Path:
>>>>>>> lrwxrwxrwx  1 tg802225 G-800594 46 May 27 14:29 include -> /opt/apps/intel10_1/mvapich-devel/1.0/include/
>>>>>>> lrwxrwxrwx  1 tg802225 G-800594 49 May 27 14:29 lib -> /opt/apps/intel10_1/mvapich-devel/1.0/lib/shared/
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Justin
>>>>>>>
>>>>>>> Dhabaleswar Panda wrote:
>>>>>>>
>>>>>>>> Justin,
>>>>>>>>
>>>>>>>> Could you let us know which stack (MVAPICH or MVAPICH2) you are
>>>>>>>> using on Ranger?  These two stacks have their parameters named
>>>>>>>> differently.  Also, at what exact process count do you see this
>>>>>>>> problem?  If you can also let us know the version number of the
>>>>>>>> mvapich/mvapich2 stack and/or the path of the MPI library on
>>>>>>>> Ranger, it will be helpful.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> DK
>>>>>>>>
>>>>>>>> On Mon, 3 Nov 2008, Justin wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> We are running into hangs on Ranger using mvapich that are not
>>>>>>>>> present on other machines.  These hangs seem to only occur on
>>>>>>>>> large problems with large numbers of processors.  We have run
>>>>>>>>> into similar problems on some LLNL machines in the past and were
>>>>>>>>> able to get around them by disabling the shared memory
>>>>>>>>> optimizations.  In those cases the problem had to do with
>>>>>>>>> fixed-size buffers used in the shared memory optimizations.
>>>>>>>>>
>>>>>>>>> We would like to disable shared memory on Ranger but are confused
>>>>>>>>> by all the different parameters dealing with shared memory
>>>>>>>>> optimizations.  How do we know which parameters affect the run?
>>>>>>>>> For example, do we use the parameters that begin with MV_ or
>>>>>>>>> VIADEV_?  From past conversations I have had with support teams,
>>>>>>>>> the parameters that have an effect vary according to the
>>>>>>>>> hardware/MPI build.  What is the best way to determine which
>>>>>>>>> parameters are active?
>>>>>>>>>
>>>>>>>>> Also here is a stacktrace from one of our hangs:
>>>>>>>>>
>>>>>>>>> .stack.i132-112.ranger.tacc.utexas.edu.16033
>>>>>>>>> Intel(R) Debugger for applications running on Intel(R) 64, Version 10.1-35, Build 20080310
>>>>>>>>> Attaching to program: /work/00975/luitjens/SCIRun/optimized/Packages/Uintah/StandAlone/sus, process 16033
>>>>>>>>> Reading symbols from /work/00975/luitjens/SCIRun/optimized/Packages/Uintah/StandAlone/sus...(no debugging symbols found)...done.
>>>>>>>>> smpi_net_lookup () at mpid_smpi.c:1381
>>>>>>>>> #0  0x00002ada6b4d8510 in smpi_net_lookup () at mpid_smpi.c:1381
>>>>>>>>> #1  0x00002ada6b4d8414 in MPID_SMP_Check_incoming () at mpid_smpi.c:1360
>>>>>>>>> #2  0x00002ada6b4f293c in MPID_DeviceCheck (blocking=7154160) at viacheck.c:505
>>>>>>>>> #3  0x00002ada6b4d600b in MPID_RecvComplete (request=0x6d29f0, status=0x10, error_code=0x4) at mpid_recv.c:106
>>>>>>>>> #4  0x00002ada6b4fe2f7 in MPI_Waitall (count=7154160, array_of_requests=0x10, array_of_statuses=0x4) at waitall.c:190
>>>>>>>>> #5  0x00002ada6b4e6d3c in MPI_Sendrecv (sendbuf=0x6d29f0, sendcount=16, sendtype=4, dest=14, sendtag=22045696, recvbuf=0x1506680, recvcount=1, recvtype=6, source=2278, recvtag=14, comm=130, status=0x7fff4385028c) at sendrecv.c:98
>>>>>>>>> #6  0x00002ada6b4c4d2d in intra_Allreduce (sendbuf=0x6d29f0, recvbuf=0x10, count=4, datatype=0xe, op=22045696, comm=0x1506680) at intra_fns_new.c:5682
>>>>>>>>> #7  0x00002ada6b4c4516 in intra_shmem_Allreduce (sendbuf=0x6d29f0, recvbuf=0x10, count=1, datatype=0xe, op=22045696, comm=0x1506680) at intra_fns_new.c:6014
>>>>>>>>> #8  0x00002ada6b48f286 in MPI_Allreduce (sendbuf=0x6d29f0, recvbuf=0x10, count=4, datatype=14, op=22045696, comm=22046336) at allreduce.c:83
>>>>>>>>> #9  0x00002ada6a67a4f8 in _ZN6Uintah12MPIScheduler7executeEii () in /work/00975/luitjens/SCIRun/optimized/lib/libPackages_Uintah_CCA_Components_Schedulers.so
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In this case, what would be the likely parameter I could play
>>>>>>>>> with in order to potentially stop a hang in MPI_Allreduce?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Justin
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>
>


