[mvapich-discuss] hang at large numbers of processors

Justin luitjens at cs.utah.edu
Tue Nov 4 10:16:28 EST 2008


Thanks, I thought this was the case, but I wasn't positive.

It appears my hangs have been resolved by doing two things: 

1) updating from 1.0 to 1.0.1
2) disabling the shared memory broadcast (it would still hang at 16K processors in 1.0.1).

Is number 2 fixed in 1.1? If so, when is 1.1's release date?

I will contact TACC and let them know the solution to my problem so they 
can relay it
to others who have a similar problem.

Thanks,
Justin

Karl W. Schulz wrote:
> Just FYI so that everyone is aware, we actually do propagate all user 
> environment variables on Ranger, so it is sufficient to simply set 
> VIADEV parameters in your job script as long as jobs are launched with 
> ibrun.
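>
> A quick way to confirm what each rank actually sees is to have every rank
> print the variables from its own environment (a minimal sketch in C,
> assuming only MPI and the C standard library; the file name check_env.c
> is just for illustration):
>
> /* check_env.c: report the VIADEV settings visible to each rank. */
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char **argv)
> {
>     int rank;
>     const char *bcast, *coll;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     /* getenv() returns NULL if the variable did not reach this rank. */
>     bcast = getenv("VIADEV_USE_SHMEM_BCAST");
>     coll  = getenv("VIADEV_USE_SHMEM_COLL");
>     printf("rank %d: VIADEV_USE_SHMEM_BCAST=%s VIADEV_USE_SHMEM_COLL=%s\n",
>            rank, bcast ? bcast : "(unset)", coll ? coll : "(unset)");
>
>     MPI_Finalize();
>     return 0;
> }
>
> Compiling this with the same mpicc used for the application and launching
> it with ibrun at the job size of interest should show whether the settings
> are reaching the remote ranks.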
>
> Karl
>
> On Nov 3, 2008, at 9:04 PM, Matthew Koop wrote:
>
>> Justin,
>>
>> Thanks for this update. Even though the backtrace shows
>> 'intra_shmem_Allreduce', it is not following the shared memory path;
>> within that function a fallback is called.
>>
>> A couple things:
>>
>> - Does it work if all shared memory collectives are turned off?
>> (VIADEV_USE_SHMEM_COLL=0)
>>
>> - Have you tried the 1.0.1 installed on TACC at all?
>>
>> Matt
>>
>> On Mon, 3 Nov 2008, Justin wrote:
>>
>>> Here is an update:
>>>
>>> I am running on Ranger with the following ibrun command:
>>>
>>>    ibrun VIADEV_USE_SHMEM_BCAST=0 VIADEV_USE_SHMEM_ALLREDUCE=0 ../sus
>>>
>>> where sus is our executable.  With this I'm still occasionally seeing a
>>> hang at large processor counts, with this stack trace:
>>>
>>> #0  0x00002abc19a38510 in smpi_net_lookup () at mpid_smpi.c:1381
>>> #1  0x00002abc19a38414 in MPID_SMP_Check_incoming () at 
>>> mpid_smpi.c:1360
>>> #2  0x00002abc19a5293c in MPID_DeviceCheck (blocking=7154160) at
>>> viacheck.c:505
>>> #3  0x00002abc19a3600b in MPID_RecvComplete (request=0x6d29f0,
>>> status=0x10, error_code=0xb) at mpid_recv.c:106
>>> #4  0x00002abc19a5e2f7 in MPI_Waitall (count=7154160,
>>> array_of_requests=0x10, array_of_statuses=0xb) at waitall.c:190
>>> #5  0x00002abc19a46d3c in MPI_Sendrecv (sendbuf=0x6d29f0, sendcount=16,
>>> sendtype=11, dest=11, sendtag=22046016, recvbuf=0x1506810, recvcount=1,
>>> recvtype=6, source=2912, recvtag=14, comm=130, 
>>> status=0x7fff952efd2c) at
>>> sendrecv.c:98
>>> #6  0x00002abc19a24d2d in intra_Allreduce (sendbuf=0x6d29f0,
>>> recvbuf=0x10, count=4, datatype=0xb, op=22046016, comm=0x1506810) at
>>> intra_fns_new.c:5682
>>> #7  0x00002abc19a24516 in intra_shmem_Allreduce (sendbuf=0x6d29f0,
>>> recvbuf=0x10, count=1, datatype=0xb, op=22046016, comm=0x1506810) at
>>> intra_fns_new.c:6014
>>> #8  0x00002abc199ef286 in MPI_Allreduce (sendbuf=0x6d29f0, 
>>> recvbuf=0x10,
>>> count=11, datatype=11, op=22046016, comm=22046736) at allreduce.c:83
>>> #9  0x00002abc18bda4f8 in _ZN6Uintah12MPIScheduler7executeEii () in
>>> /work/00975/luitjens/SCIRun/optimized/lib/libPackages_Uintah_CCA_Components_Schedulers.so 
>>>
>>> #10 0x0000000007d0db10
>>>
>>> The allreduce is still using shared memory.
>>>
>>> Do you have any more suggestions?
>>>
>>> Thanks,
>>> Justin
>>>
>>> Matthew Koop wrote:
>>>> Justin,
>>>>
>>>> I think there are a couple things here:
>>>>
>>>> 1.) Simply exporting the variables is not sufficient for the setup at
>>>> TACC. You'll need to set them the following way:
>>>>
>>>> ibrun VIADEV_USE_SHMEM_COLL=0 ./executable_name
>>>>
>>>> Since the ENVs weren't being propagated, the setting wasn't taking
>>>> effect (and that is why you still saw the shmem functions in the
>>>> backtrace).
>>>>
>>>> 2.) There was a limitation in the 1.0 versions: when the shared memory
>>>> bcast implementation was run on more than 1K nodes, there would be a
>>>> hang. Since the shared memory allreduce uses a bcast internally, it is
>>>> hanging as well. You can try just disabling the bcast:
>>>>
>>>> ibrun VIADEV_USE_SHMEM_BCAST=0 ./executable_name
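>>>>
>>>> To separate this from the application, a small allreduce loop run at the
>>>> failing process count should exercise the same collective path (a minimal
>>>> sketch, assuming only MPI and the C standard library; the file name
>>>> allreduce_loop.c is just for illustration):
>>>>
>>>> /* allreduce_loop.c: repeatedly call MPI_Allreduce so the collective
>>>>  * can be tested with and without the VIADEV_USE_SHMEM_* settings. */
>>>> #include <mpi.h>
>>>> #include <stdio.h>
>>>>
>>>> int main(int argc, char **argv)
>>>> {
>>>>     int rank, i;
>>>>     double in = 1.0, out = 0.0;
>>>>
>>>>     MPI_Init(&argc, &argv);
>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>
>>>>     for (i = 0; i < 10000; i++) {
>>>>         MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM,
>>>>                       MPI_COMM_WORLD);
>>>>         if (rank == 0 && i % 1000 == 0)
>>>>             printf("iteration %d completed\n", i);
>>>>     }
>>>>
>>>>     MPI_Finalize();
>>>>     return 0;
>>>> }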
>>>>
>>>> Let us know if this works or if you have additional questions.
>>>>
>>>> Thanks,
>>>> Matt
>>>>
>>>> On Mon, 3 Nov 2008, Justin wrote:
>>>>
>>>>
>>>>> Hi,
>>>>>
>>>>> We are using mvapich_devel_1.0 on Ranger.  I am seeing my current lockup
>>>>> at 16,384 processors with the following stack trace:
>>>>>
>>>>> #0  0x00002b015c4f85ff in poll_rdma_buffer (vbuf_addr=0x7fff52849020,
>>>>> out_of_order=0x7fff52849030) at viacheck.c:206
>>>>> #1  0x00002b015c4f79ed in MPID_DeviceCheck (blocking=1384419360) at
>>>>> viacheck.c:505
>>>>> #2  0x00002b015c4db00b in MPID_RecvComplete (request=0x7fff52849020,
>>>>> status=0x7fff52849030, error_code=0x2b) at mpid_recv.c:106
>>>>> #3  0x00002b015c5032f7 in MPI_Waitall (count=1384419360,
>>>>> array_of_requests=0x7fff52849030, array_of_statuses=0x2b) at 
>>>>> waitall.c:190
>>>>> #4  0x00002b015c4ebd3c in MPI_Sendrecv (sendbuf=0x7fff52849020,
>>>>> sendcount=1384419376, sendtype=43, dest=35, sendtag=64,
>>>>> recvbuf=0x2aaaad75d000, recvcount=1, recvtype=6, source=3585,
>>>>> recvtag=14, comm=130, status=0x7fff528491fc) at sendrecv.c:98
>>>>> #5  0x00002b015c4c9d2d in intra_Allreduce (sendbuf=0x7fff52849020,
>>>>> recvbuf=0x7fff52849030, count=4, datatype=0x23, op=64,
>>>>> comm=0x2aaaad75d000) at intra_fns_new.c:5682
>>>>> #6  0x00002b015c4c9516 in intra_shmem_Allreduce 
>>>>> (sendbuf=0x7fff52849020,
>>>>> recvbuf=0x7fff52849030, count=1, datatype=0x23, op=64,
>>>>> comm=0x2aaaad75d000) at intra_fns_new.c:6014
>>>>> #7  0x00002b015c494286 in MPI_Allreduce (sendbuf=0x7fff52849020,
>>>>> recvbuf=0x7fff52849030, count=43, datatype=35, op=64, 
>>>>> comm=-1384787968)
>>>>> at allreduce.c:83
>>>>> #8  0x00002b015b67f4f8 in _ZN6Uintah12MPIScheduler7executeEii () in
>>>>> /work/00975/luitjens/SCIRun/optimized/lib/libPackages_Uintah_CCA_Components_Schedulers.so 
>>>>>
>>>>>
>>>>> I was seeing lockups at smaller power-of-two processor counts, but
>>>>> adding the following seemed to stop those:
>>>>>
>>>>> export VIADEV_USE_SHMEM_COLL=0
>>>>> export VIADEV_USE_SHMEM_ALLREDUCE=0
>>>>>
>>>>> Now I am just seeing it at 16K.  What is odd to me is that if the two
>>>>> exports above disable the shared memory optimizations, why does the
>>>>> stack trace still show 'intra_shmem_Allreduce' being called?
>>>>>
>>>>> Here is some other info that might be useful:
>>>>>
>>>>> login3:/scratch/00975/luitjens/scalingice/ranger.med/ %mpirun_rsh -v
>>>>> OSU MVAPICH VERSION 1.0-SingleRail
>>>>> Build-ID: custom
>>>>>
>>>>> MPI Path:
>>>>> lrwxrwxrwx  1 tg802225 G-800594 46 May 27 14:29 include ->
>>>>> /opt/apps/intel10_1/mvapich-devel/1.0/include/
>>>>> lrwxrwxrwx  1 tg802225 G-800594 49 May 27 14:29 lib ->
>>>>> /opt/apps/intel10_1/mvapich-devel/1.0/lib/shared/
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Justin
>>>>>
>>>>> Dhabaleswar Panda wrote:
>>>>>
>>>>>> Justin,
>>>>>>
>>>>>> Could you let us know which stack (MVAPICH or MVAPICH2) you are using
>>>>>> on Ranger? These two stacks name their parameters differently. Also,
>>>>>> at what exact process count do you see this problem? If you can also
>>>>>> let us know the version number of the mvapich/mvapich2 stack and/or
>>>>>> the path of the MPI library on Ranger, it would be helpful.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> DK
>>>>>>
>>>>>> On Mon, 3 Nov 2008, Justin wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> We are running into hangs on Ranger using mvapich that are not present
>>>>>>> on other machines.  These hangs seem to occur only on large problems
>>>>>>> with large numbers of processors.  We have run into similar problems
>>>>>>> on some LLNL machines in the past and were able to get around them by
>>>>>>> disabling the shared memory optimizations.  In those cases the problem
>>>>>>> had to do with fixed-size buffers used in the shared memory
>>>>>>> optimizations.
>>>>>>>
>>>>>>> We would like to disable shared memory on Ranger but are confused by
>>>>>>> all the different parameters dealing with shared memory optimizations.
>>>>>>> How do we know which parameters affect the run?  For example, do we
>>>>>>> use the parameters that begin with MV_ or VIADEV_?  From past
>>>>>>> conversations I have had with support teams, the parameters that have
>>>>>>> an effect vary according to the hardware/MPI build.  What is the best
>>>>>>> way to determine which parameters are active?
>>>>>>>
>>>>>>> Also, here is a stack trace from one of our hangs:
>>>>>>>
>>>>>>> .stack.i132-112.ranger.tacc.utexas.edu.16033
>>>>>>> Intel(R) Debugger for applications running on Intel(R) 64, Version
>>>>>>> 10.1-35 , Build 20080310
>>>>>>> Attaching to program:
>>>>>>> /work/00975/luitjens/SCIRun/optimized/Packages/Uintah/StandAlone/sus, 
>>>>>>>
>>>>>>> process 16033
>>>>>>> Reading symbols from
>>>>>>> /work/00975/luitjens/SCIRun/optimized/Packages/Uintah/StandAlone/sus...(no 
>>>>>>>
>>>>>>> debugging symbols found)...done.
>>>>>>> smpi_net_lookup () at mpid_smpi.c:1381
>>>>>>> #0  0x00002ada6b4d8510 in smpi_net_lookup () at mpid_smpi.c:1381
>>>>>>> #1  0x00002ada6b4d8414 in MPID_SMP_Check_incoming () at 
>>>>>>> mpid_smpi.c:1360
>>>>>>> #2  0x00002ada6b4f293c in MPID_DeviceCheck (blocking=7154160) at
>>>>>>> viacheck.c:505
>>>>>>> #3  0x00002ada6b4d600b in MPID_RecvComplete (request=0x6d29f0,
>>>>>>> status=0x10, error_code=0x4) at mpid_recv.c:106
>>>>>>> #4  0x00002ada6b4fe2f7 in MPI_Waitall (count=7154160,
>>>>>>> array_of_requests=0x10, array_of_statuses=0x4) at waitall.c:190
>>>>>>> #5  0x00002ada6b4e6d3c in MPI_Sendrecv (sendbuf=0x6d29f0, 
>>>>>>> sendcount=16,
>>>>>>> sendtype=4, dest=14, sendtag=22045696, recvbuf=0x1506680, 
>>>>>>> recvcount=1,
>>>>>>> recvtype=6, source=2278, recvtag=14, comm=130, 
>>>>>>> status=0x7fff4385028c) at
>>>>>>> sendrecv.c:98
>>>>>>> #6  0x00002ada6b4c4d2d in intra_Allreduce (sendbuf=0x6d29f0,
>>>>>>> recvbuf=0x10, count=4, datatype=0xe, op=22045696, 
>>>>>>> comm=0x1506680) at
>>>>>>> intra_fns_new.c:5682
>>>>>>> #7  0x00002ada6b4c4516 in intra_shmem_Allreduce (sendbuf=0x6d29f0,
>>>>>>> recvbuf=0x10, count=1, datatype=0xe, op=22045696, 
>>>>>>> comm=0x1506680) at
>>>>>>> intra_fns_new.c:6014
>>>>>>> #8  0x00002ada6b48f286 in MPI_Allreduce (sendbuf=0x6d29f0, 
>>>>>>> recvbuf=0x10,
>>>>>>> count=4, datatype=14, op=22045696, comm=22046336) at allreduce.c:83
>>>>>>> #9  0x00002ada6a67a4f8 in _ZN6Uintah12MPIScheduler7executeEii () in
>>>>>>> /work/00975/luitjens/SCIRun/optimized/lib/libPackages_Uintah_CCA_Components_Schedulers.so 
>>>>>>>
>>>>>>>
>>>>>>> In this case, what would be the most likely parameter to play with
>>>>>>> in order to stop a hang in MPI_Allreduce?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Justin
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss


