[mvapich-discuss] hang at large numbers of processors

Karl W. Schulz karl at tacc.utexas.edu
Tue Nov 4 08:00:06 EST 2008


Just FYI so that everyone is aware: we do propagate all user environment
variables on Ranger, so it is sufficient to simply set the VIADEV
parameters in your job script, as long as jobs are launched with ibrun.
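
For example, a minimal job script could just export the parameters above
the ibrun line. This is only a sketch; the batch directives, paths, and
executable name are placeholders:

    #!/bin/bash
    # (site/queue-specific batch directives would go here)
    export VIADEV_USE_SHMEM_BCAST=0       # example VIADEV parameters from this thread
    export VIADEV_USE_SHMEM_ALLREDUCE=0
    ibrun ../sus                          # ibrun forwards the exported environment to all MPI tasks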

Karl

On Nov 3, 2008, at 9:04 PM, Matthew Koop wrote:

> Justin,
>
> Thanks for this update. Even though the backtrace shows
> 'intra_shmem_Allreduce', it is not following the shared memory path;
> a fallback is called within that function.
>
> A couple things:
>
> - Does it work if all shared memory collectives are turned off?
> (VIADEV_USE_SHMEM_COLL=0; see the example after this list)
>
> - Have you tried the 1.0.1 installed on TACC at all?
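>
> For the first suggestion, that would be e.g. (a sketch reusing the ibrun
> line from your last message, where 'sus' is your executable):
>
>     ibrun VIADEV_USE_SHMEM_COLL=0 ../sus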
>
> Matt
>
> On Mon, 3 Nov 2008, Justin wrote:
>
>> Here is an update:
>>
>> I am running on Ranger with the following ibrun command:
>>
>>    ibrun VIADEV_USE_SHMEM_BCAST=0 VIADEV_USE_SHMEM_ALLREDUCE=0 ../sus
>>
>> where sus is our executable.  With this I'm still occasionally seeing a
>> hang at large processor counts, with this stack trace:
>>
>> #0  0x00002abc19a38510 in smpi_net_lookup () at mpid_smpi.c:1381
>> #1  0x00002abc19a38414 in MPID_SMP_Check_incoming () at mpid_smpi.c:1360
>> #2  0x00002abc19a5293c in MPID_DeviceCheck (blocking=7154160) at viacheck.c:505
>> #3  0x00002abc19a3600b in MPID_RecvComplete (request=0x6d29f0, status=0x10, error_code=0xb) at mpid_recv.c:106
>> #4  0x00002abc19a5e2f7 in MPI_Waitall (count=7154160, array_of_requests=0x10, array_of_statuses=0xb) at waitall.c:190
>> #5  0x00002abc19a46d3c in MPI_Sendrecv (sendbuf=0x6d29f0, sendcount=16, sendtype=11, dest=11, sendtag=22046016, recvbuf=0x1506810, recvcount=1, recvtype=6, source=2912, recvtag=14, comm=130, status=0x7fff952efd2c) at sendrecv.c:98
>> #6  0x00002abc19a24d2d in intra_Allreduce (sendbuf=0x6d29f0, recvbuf=0x10, count=4, datatype=0xb, op=22046016, comm=0x1506810) at intra_fns_new.c:5682
>> #7  0x00002abc19a24516 in intra_shmem_Allreduce (sendbuf=0x6d29f0, recvbuf=0x10, count=1, datatype=0xb, op=22046016, comm=0x1506810) at intra_fns_new.c:6014
>> #8  0x00002abc199ef286 in MPI_Allreduce (sendbuf=0x6d29f0, recvbuf=0x10, count=11, datatype=11, op=22046016, comm=22046736) at allreduce.c:83
>> #9  0x00002abc18bda4f8 in _ZN6Uintah12MPIScheduler7executeEii () in /work/00975/luitjens/SCIRun/optimized/lib/libPackages_Uintah_CCA_Components_Schedulers.so
>> #10 0x0000000007d0db10
>>
>> The allreduce is still using shared memory.
>>
>> Do you have any more suggestions?
>>
>> Thanks,
>> Justin
>>
>> Matthew Koop wrote:
>>> Justin,
>>>
>>> I think there are a couple things here:
>>>
>>> 1.) Simply exporting the variables is not sufficient for the setup at
>>> TACC. You'll need to set them the following way:
>>>
>>> ibrun VIADEV_USE_SHMEM_COLL=0 ./executable_name
>>>
>>> Since the ENVs weren't being propagated, the setting wasn't taking effect
>>> (and that is why you still saw the shmem functions in the backtrace).
>>>
>>> 2.) There was a limitation in the 1.0 versions where the shared memory
>>> bcast implementation would hang when run on more than 1K nodes. Since the
>>> shared memory allreduce uses a bcast internally, it is also hanging; you
>>> can try disabling just the bcast:
>>>
>>> ibrun VIADEV_USE_SHMEM_BCAST=0 ./executable_name
>>>
>>> Let us know if this works or if you have additional questions.
>>>
>>> Thanks,
>>> Matt
>>>
>>> On Mon, 3 Nov 2008, Justin wrote:
>>>
>>>
>>>> Hi,
>>>>
>>>> We are using mvapich_devel_1.0 on Ranger.  I am seeing my current lockup
>>>> at 16,384 processors, with the following stack trace:
>>>>
>>>> #0  0x00002b015c4f85ff in poll_rdma_buffer (vbuf_addr=0x7fff52849020, out_of_order=0x7fff52849030) at viacheck.c:206
>>>> #1  0x00002b015c4f79ed in MPID_DeviceCheck (blocking=1384419360) at viacheck.c:505
>>>> #2  0x00002b015c4db00b in MPID_RecvComplete (request=0x7fff52849020, status=0x7fff52849030, error_code=0x2b) at mpid_recv.c:106
>>>> #3  0x00002b015c5032f7 in MPI_Waitall (count=1384419360, array_of_requests=0x7fff52849030, array_of_statuses=0x2b) at waitall.c:190
>>>> #4  0x00002b015c4ebd3c in MPI_Sendrecv (sendbuf=0x7fff52849020, sendcount=1384419376, sendtype=43, dest=35, sendtag=64, recvbuf=0x2aaaad75d000, recvcount=1, recvtype=6, source=3585, recvtag=14, comm=130, status=0x7fff528491fc) at sendrecv.c:98
>>>> #5  0x00002b015c4c9d2d in intra_Allreduce (sendbuf=0x7fff52849020, recvbuf=0x7fff52849030, count=4, datatype=0x23, op=64, comm=0x2aaaad75d000) at intra_fns_new.c:5682
>>>> #6  0x00002b015c4c9516 in intra_shmem_Allreduce (sendbuf=0x7fff52849020, recvbuf=0x7fff52849030, count=1, datatype=0x23, op=64, comm=0x2aaaad75d000) at intra_fns_new.c:6014
>>>> #7  0x00002b015c494286 in MPI_Allreduce (sendbuf=0x7fff52849020, recvbuf=0x7fff52849030, count=43, datatype=35, op=64, comm=-1384787968) at allreduce.c:83
>>>> #8  0x00002b015b67f4f8 in _ZN6Uintah12MPIScheduler7executeEii () in /work/00975/luitjens/SCIRun/optimized/lib/libPackages_Uintah_CCA_Components_Schedulers.so
>>>>
>>>> I was seeing lockups at smaller powers of two but adding the  
>>>> following
>>>> seemed to stop those:
>>>>
>>>> export VIADEV_USE_SHMEM_COLL=0
>>>> export VIADEV_USE_SHMEM_ALLREDUCE=0
>>>>
>>>> Now I am just seeing it at 16K.  What is odd to me is that if the two
>>>> commands above disable the shared memory optimizations, then why does the
>>>> stack trace still show 'intra_shmem_Allreduce' being called?
>>>>
>>>> Here is some other info that might be useful:
>>>>
>>>> login3:/scratch/00975/luitjens/scalingice/ranger.med/ %mpirun_rsh -v
>>>> OSU MVAPICH VERSION 1.0-SingleRail
>>>> Build-ID: custom
>>>>
>>>> MPI Path:
>>>> lrwxrwxrwx  1 tg802225 G-800594 46 May 27 14:29 include -> /opt/apps/intel10_1/mvapich-devel/1.0/include/
>>>> lrwxrwxrwx  1 tg802225 G-800594 49 May 27 14:29 lib -> /opt/apps/intel10_1/mvapich-devel/1.0/lib/shared/
>>>>
>>>>
>>>> Thanks,
>>>> Justin
>>>>
>>>> Dhabaleswar Panda wrote:
>>>>
>>>>> Justin,
>>>>>
>>>>> Could you let us know which stack (MVAPICH or MVAPICH2) you are using on
>>>>> Ranger? These two stacks name their parameters differently. Also, at what
>>>>> exact process count do you see this problem? If you can also let us know
>>>>> the version number of the mvapich/mvapich2 stack and/or the path of the
>>>>> MPI library on Ranger, it will be helpful.
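>>>>>
>>>>> For example, the output of something like the following would tell us
>>>>> what we need (just a sketch; the exact commands available on Ranger may
>>>>> differ):
>>>>>
>>>>>     which mpirun_rsh mpicc    # shows which MPI install the environment picks up
>>>>>     mpirun_rsh -v             # prints the MVAPICH version string
>>>>>     echo $LD_LIBRARY_PATH     # shows which MPI lib directory is in use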
>>>>>
>>>>> Thanks,
>>>>>
>>>>> DK
>>>>>
>>>>> On Mon, 3 Nov 2008, Justin wrote:
>>>>>
>>>>>
>>>>>
>>>>>> We are running into hangs on Ranger using mvapich that are not present
>>>>>> on other machines.  These hangs seem to occur only on large problems with
>>>>>> large numbers of processors.  We have run into similar problems on some
>>>>>> LLNL machines in the past and were able to get around them by disabling
>>>>>> the shared memory optimizations.  In those cases the problem had to do
>>>>>> with fixed-size buffers used in the shared memory optimizations.
>>>>>>
>>>>>> We would like to disable shared memory on Ranger but are confused by
>>>>>> all the different parameters dealing with shared memory optimizations.
>>>>>> How do we know which parameters affect the run?  For example, do we use
>>>>>> the parameters that begin with MV_ or VIADEV_?  From past conversations
>>>>>> I have had with support teams, the parameters that have an effect vary
>>>>>> with the hardware/MPI build.  What is the best way to determine which
>>>>>> parameters are active?
>>>>>>
>>>>>> Also, here is a stack trace from one of our hangs:
>>>>>>
>>>>>> .stack.i132-112.ranger.tacc.utexas.edu.16033
>>>>>> Intel(R) Debugger for applications running on Intel(R) 64, Version 10.1-35, Build 20080310
>>>>>> Attaching to program: /work/00975/luitjens/SCIRun/optimized/Packages/Uintah/StandAlone/sus, process 16033
>>>>>> Reading symbols from /work/00975/luitjens/SCIRun/optimized/Packages/Uintah/StandAlone/sus...(no debugging symbols found)...done.
>>>>>> smpi_net_lookup () at mpid_smpi.c:1381
>>>>>> #0  0x00002ada6b4d8510 in smpi_net_lookup () at mpid_smpi.c:1381
>>>>>> #1  0x00002ada6b4d8414 in MPID_SMP_Check_incoming () at mpid_smpi.c:1360
>>>>>> #2  0x00002ada6b4f293c in MPID_DeviceCheck (blocking=7154160) at viacheck.c:505
>>>>>> #3  0x00002ada6b4d600b in MPID_RecvComplete (request=0x6d29f0, status=0x10, error_code=0x4) at mpid_recv.c:106
>>>>>> #4  0x00002ada6b4fe2f7 in MPI_Waitall (count=7154160, array_of_requests=0x10, array_of_statuses=0x4) at waitall.c:190
>>>>>> #5  0x00002ada6b4e6d3c in MPI_Sendrecv (sendbuf=0x6d29f0, sendcount=16, sendtype=4, dest=14, sendtag=22045696, recvbuf=0x1506680, recvcount=1, recvtype=6, source=2278, recvtag=14, comm=130, status=0x7fff4385028c) at sendrecv.c:98
>>>>>> #6  0x00002ada6b4c4d2d in intra_Allreduce (sendbuf=0x6d29f0, recvbuf=0x10, count=4, datatype=0xe, op=22045696, comm=0x1506680) at intra_fns_new.c:5682
>>>>>> #7  0x00002ada6b4c4516 in intra_shmem_Allreduce (sendbuf=0x6d29f0, recvbuf=0x10, count=1, datatype=0xe, op=22045696, comm=0x1506680) at intra_fns_new.c:6014
>>>>>> #8  0x00002ada6b48f286 in MPI_Allreduce (sendbuf=0x6d29f0, recvbuf=0x10, count=4, datatype=14, op=22045696, comm=22046336) at allreduce.c:83
>>>>>> #9  0x00002ada6a67a4f8 in _ZN6Uintah12MPIScheduler7executeEii () in /work/00975/luitjens/SCIRun/optimized/lib/libPackages_Uintah_CCA_Components_Schedulers.so
>>>>>>
>>>>>> In this case, what would be the likely parameter to play with in order
>>>>>> to stop a hang in MPI_Allreduce?
>>>>>>
>>>>>> Thanks,
>>>>>> Justin
>>>>>> _______________________________________________
>>>>>> mvapich-discuss mailing list
>>>>>> mvapich-discuss at cse.ohio-state.edu
>>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>>
>>>>>>
>>>>>>
>>>> _______________________________________________
>>>> mvapich-discuss mailing list
>>>> mvapich-discuss at cse.ohio-state.edu
>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>
>>>>
>>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss


