[mvapich-discuss] hang at large numbers of processors

Dhabaleswar Panda panda at cse.ohio-state.edu
Mon Nov 3 15:38:20 EST 2008


Justin,

Could you let us know which stack (MVAPICH or MVAPICH2) you are using on
Ranger? The two stacks name their parameters differently. Also, at what
exact process count do you see this problem? If you can also let us know
the version number of the mvapich/mvapich2 stack and/or the path of the
MPI library on Ranger, that would be helpful.
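
In case it is quicker to check from within the application, below is a
rough sketch that prints what the MPI library reports at run time. The
MVAPICH2_VERSION and MPICH_VERSION macros are only assumptions and may
not be defined by your build; if neither is present, the library path is
still the most reliable indicator.

    /* mpi_stack_info.c -- rough sketch for identifying the MPI stack.
     * Build with the stack's own wrapper: mpicc mpi_stack_info.c -o mpi_stack_info
     * The version macros below are assumptions; older builds may not define
     * either of them, so treat a missing macro as "unknown". */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int version, subversion, rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_version(&version, &subversion);   /* standard MPI call */

        if (rank == 0) {
            printf("MPI standard version: %d.%d\n", version, subversion);
    #ifdef MVAPICH2_VERSION
            printf("MVAPICH2 build: %s\n", MVAPICH2_VERSION);    /* assumed macro */
    #elif defined(MPICH_VERSION)
            printf("MPICH-derived build: %s\n", MPICH_VERSION);  /* assumed macro */
    #else
            printf("No version macro defined; check the library path instead.\n");
    #endif
        }

        MPI_Finalize();
        return 0;
    }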

Thanks,

DK

On Mon, 3 Nov 2008, Justin wrote:

> We are running into hangs on Ranger using mvapich that are not present
> on other machines.  These hangs seem to occur only on large problems with
> large numbers of processors.  We have run into similar problems on some
> LLNL machines in the past and were able to get around them by disabling
> the shared memory optimizations.  In those cases the problem had to do
> with fixed-size buffers used in the shared memory optimizations.
>
> We would like to disable shared memory on Ranger but are confused by
> all the different parameters dealing with shared memory optimizations.
> How do we know which parameters affect the run?  For example, do we use
> the parameters that begin with MV_ or VIADEV_?  From past conversations
> with support teams, the parameters that have an effect vary according
> to the hardware/MPI build.  What is the best way to determine which
> parameters are active?
>
> Also, here is a stack trace from one of our hangs:
>
> .stack.i132-112.ranger.tacc.utexas.edu.16033
> Intel(R) Debugger for applications running on Intel(R) 64, Version
> 10.1-35 , Build 20080310
> Attaching to program:
> /work/00975/luitjens/SCIRun/optimized/Packages/Uintah/StandAlone/sus,
> process 16033
> Reading symbols from
> /work/00975/luitjens/SCIRun/optimized/Packages/Uintah/StandAlone/sus...(no
> debugging symbols found)...done.
> smpi_net_lookup () at mpid_smpi.c:1381
> #0  0x00002ada6b4d8510 in smpi_net_lookup () at mpid_smpi.c:1381
> #1  0x00002ada6b4d8414 in MPID_SMP_Check_incoming () at mpid_smpi.c:1360
> #2  0x00002ada6b4f293c in MPID_DeviceCheck (blocking=7154160) at
> viacheck.c:505
> #3  0x00002ada6b4d600b in MPID_RecvComplete (request=0x6d29f0,
> status=0x10, error_code=0x4) at mpid_recv.c:106
> #4  0x00002ada6b4fe2f7 in MPI_Waitall (count=7154160,
> array_of_requests=0x10, array_of_statuses=0x4) at waitall.c:190
> #5  0x00002ada6b4e6d3c in MPI_Sendrecv (sendbuf=0x6d29f0, sendcount=16,
> sendtype=4, dest=14, sendtag=22045696, recvbuf=0x1506680, recvcount=1,
> recvtype=6, source=2278, recvtag=14, comm=130, status=0x7fff4385028c) at
> sendrecv.c:98
> #6  0x00002ada6b4c4d2d in intra_Allreduce (sendbuf=0x6d29f0,
> recvbuf=0x10, count=4, datatype=0xe, op=22045696, comm=0x1506680) at
> intra_fns_new.c:5682
> #7  0x00002ada6b4c4516 in intra_shmem_Allreduce (sendbuf=0x6d29f0,
> recvbuf=0x10, count=1, datatype=0xe, op=22045696, comm=0x1506680) at
> intra_fns_new.c:6014
> #8  0x00002ada6b48f286 in MPI_Allreduce (sendbuf=0x6d29f0, recvbuf=0x10,
> count=4, datatype=14, op=22045696, comm=22046336) at allreduce.c:83
> #9  0x00002ada6a67a4f8 in _ZN6Uintah12MPIScheduler7executeEii () in
> /work/00975/luitjens/SCIRun/optimized/lib/libPackages_Uintah_CCA_Components_Schedulers.so
>
> In this case, what would be the likely parameter I could play with in
> order to potentially stop a hang in MPI_Allreduce?
>
> Thanks,
> Justin
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
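
Regarding the MPI_Allreduce hang itself: the MV2_-prefixed parameters
belong to MVAPICH2 and the VIADEV_-prefixed ones to the MVAPICH 1.x
series, so which set is active depends on the stack in use. Independent
of any parameter, one quick cross-check on the application side is to
temporarily replace the MPI_Allreduce call with an equivalent MPI_Reduce
followed by MPI_Bcast; if the hang goes away, that points strongly at the
shared-memory allreduce path in frame #7. A rough sketch follows (the
helper name is ours and purely illustrative, and reduce/bcast can have
shared-memory paths of their own, so please treat this as a diagnostic,
not a fix):

    /* allreduce_workaround.c -- diagnostic sketch only, not a fix.
     * Replaces MPI_Allreduce with MPI_Reduce + MPI_Bcast so the run stays
     * off the shared-memory allreduce path (intra_shmem_Allreduce, frame #7).
     * The helper name is hypothetical; adapt it to the real call site. */
    #include <mpi.h>

    int allreduce_via_reduce_bcast(void *sendbuf, void *recvbuf, int count,
                                   MPI_Datatype datatype, MPI_Op op,
                                   MPI_Comm comm)
    {
        const int root = 0;
        int rc;

        /* Reduce onto rank 0, then broadcast the result to all ranks.
         * This gives the same result as MPI_Allreduce for the predefined ops. */
        rc = MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm);
        if (rc != MPI_SUCCESS)
            return rc;
        return MPI_Bcast(recvbuf, count, datatype, root, comm);
    }

    /* Usage at the original call site would then look like:
     *   allreduce_via_reduce_bcast(send, recv, count, MPI_DOUBLE, MPI_SUM, comm);
     * in place of:
     *   MPI_Allreduce(send, recv, count, MPI_DOUBLE, MPI_SUM, comm);
     */

Swapping this in at the Allreduce call inside Uintah's MPIScheduler
(frame #9) for one test run should be enough to tell whether the
shared-memory collective is the culprit.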


