[mvapich-discuss] hang at large numbers of processors
Justin
luitjens at cs.utah.edu
Mon Nov 3 14:17:46 EST 2008
We are running into hangs on Ranger using mvapich that are not present
on other machines. These hangs seem to occur only on large problems with
large numbers of processors. We have run into similar problems on some
LLNL machines in the past and were able to work around them by disabling
the shared memory optimizations. In those cases the problem was caused
by fixed-size buffers used in the shared memory optimizations.
We would like to disable shared memory on Ranger but are confused by
all the different parameters dealing with shared memory optimizations.
How do we know which parameters affect the run? For example, do we use
the parameters that begin with MV_ or VIADEV_? From past conversations
I have had with support teams, the parameters that take effect vary
with the hardware and the MPI build. What is the best way to determine
which parameters are active?
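One crude but often effective way to check which parameter names a particular build recognizes is to search the MPI library binary for the embedded variable-name strings. This is only a sketch: the library path below is a placeholder, and the exact prefixes that matter (MV_, VIADEV_, MV2_, ...) depend on the MVAPICH version installed on the machine.

```shell
# Find which MPI shared library the executable actually links against.
ldd ./sus | grep -i mpi

# Then list the parameter-name strings embedded in that library
# (replace the path with the one ldd reported above).
strings /path/to/libmpich.so | grep -E 'MV_|VIADEV_' | sort -u
```

If the output is dominated by VIADEV_ names, those are presumably the parameters this build reads; names that do not appear in the library are unlikely to have any effect.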
Also, here is a stack trace from one of our hangs:
.stack.i132-112.ranger.tacc.utexas.edu.16033
Intel(R) Debugger for applications running on Intel(R) 64, Version
10.1-35 , Build 20080310
Attaching to program:
/work/00975/luitjens/SCIRun/optimized/Packages/Uintah/StandAlone/sus,
process 16033
Reading symbols from
/work/00975/luitjens/SCIRun/optimized/Packages/Uintah/StandAlone/sus...(no
debugging symbols found)...done.
smpi_net_lookup () at mpid_smpi.c:1381
#0 0x00002ada6b4d8510 in smpi_net_lookup () at mpid_smpi.c:1381
#1 0x00002ada6b4d8414 in MPID_SMP_Check_incoming () at mpid_smpi.c:1360
#2 0x00002ada6b4f293c in MPID_DeviceCheck (blocking=7154160) at
viacheck.c:505
#3 0x00002ada6b4d600b in MPID_RecvComplete (request=0x6d29f0,
status=0x10, error_code=0x4) at mpid_recv.c:106
#4 0x00002ada6b4fe2f7 in MPI_Waitall (count=7154160,
array_of_requests=0x10, array_of_statuses=0x4) at waitall.c:190
#5 0x00002ada6b4e6d3c in MPI_Sendrecv (sendbuf=0x6d29f0, sendcount=16,
sendtype=4, dest=14, sendtag=22045696, recvbuf=0x1506680, recvcount=1,
recvtype=6, source=2278, recvtag=14, comm=130, status=0x7fff4385028c) at
sendrecv.c:98
#6 0x00002ada6b4c4d2d in intra_Allreduce (sendbuf=0x6d29f0,
recvbuf=0x10, count=4, datatype=0xe, op=22045696, comm=0x1506680) at
intra_fns_new.c:5682
#7 0x00002ada6b4c4516 in intra_shmem_Allreduce (sendbuf=0x6d29f0,
recvbuf=0x10, count=1, datatype=0xe, op=22045696, comm=0x1506680) at
intra_fns_new.c:6014
#8 0x00002ada6b48f286 in MPI_Allreduce (sendbuf=0x6d29f0, recvbuf=0x10,
count=4, datatype=14, op=22045696, comm=22046336) at allreduce.c:83
#9 0x00002ada6a67a4f8 in _ZN6Uintah12MPIScheduler7executeEii () in
/work/00975/luitjens/SCIRun/optimized/lib/libPackages_Uintah_CCA_Components_Schedulers.so
In this case, which parameter would be the most likely one to adjust in
order to stop a hang in MPI_Allreduce?
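Since the trace shows the hang inside intra_shmem_Allreduce (the shared-memory collective path), a plausible first experiment is to disable the shared-memory collectives at launch. The parameter names below are an assumption based on MVAPICH 1.x-era documentation, consistent with the VIADEV_ prefix mentioned above; they should be verified against the actual build (e.g. by searching the library binary) before relying on them.

```shell
# Hypothetical sketch, assuming an MVAPICH 1.x (VIADEV_-style) build:
export VIADEV_USE_SHMEM_COLL=0        # disable all shared-memory collectives
# or, more narrowly, if the build supports per-collective control:
export VIADEV_USE_SHMEM_ALLREDUCE=0   # disable only the shared-memory Allreduce

# Launch as usual; the job launcher and process count are placeholders.
mpirun -np 1024 ./sus input.ups
```

If the hang disappears with shared-memory collectives disabled, that would point back at the same fixed-size-buffer issue seen on the LLNL machines.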
Thanks,
Justin