[mvapich-discuss] Which VIADEV* parameters might free up a "hang"
on 64 or more cores, when job runs fine up to 32 cores?
Enda O'Brien
enda.obrien at dalco.ch
Mon Oct 13 11:23:25 EDT 2008
Hello,
I saw this address at the top of the mvapich.conf file on the system I'm using, so I thought I'd submit this question:
What parameter(s) in the mvapich.conf file might be adjusted to "free" up a job that is "hanging" on 64 or more cores, but which runs fine on 8, 16 or 32 cores?
When such a thing happens on a Quadrics cluster (as it sometimes does...), I can usually adjust (increase) LIBELAN_TPORT_BIGMSG and LIBELAN_ALLOC_SIZE to free the log-jam. That's just 2 parameters. However, there are ~100 VIADEV* parameters in mvapich.conf, and the ones I've adjusted so far haven't made any difference.
The main MPI function in the application in question is MPI_Alltoall, but it accounts for only ~3 minutes of the 80-minute run time on 32 cores.
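For a rough sense of how an all-to-all exchange grows between the working and hanging job sizes, here is a small Python sketch. It assumes one MPI rank per core and only counts the logical sender/receiver pairs; the helper name `alltoall_pairs` is made up for illustration, and the actual algorithm MVAPICH uses for MPI_Alltoall may schedule the traffic quite differently.

```python
def alltoall_pairs(ranks):
    """Return (peers per rank, total ordered sender/receiver pairs).

    Assumption: one MPI rank per core, and every rank exchanges
    data with every other rank, as in a logical all-to-all.
    """
    peers = ranks - 1
    return peers, ranks * peers


if __name__ == "__main__":
    for ranks in (8, 16, 32, 64):
        peers, pairs = alltoall_pairs(ranks)
        print(f"{ranks:3d} ranks: {peers:3d} peers per rank, {pairs:5d} ordered pairs")
```

Going from 32 to 64 ranks roughly doubles the peers per rank and quadruples the number of pairs, so per-connection resources may be a reasonable place to look first.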
Any tips, advice, recommendations gratefully received!
Best wishes,
Enda
P.S. Here are the settings I've tried:
VIADEV_VBUF_TOTAL_SIZE=49152
VIADEV_VBUF_POOL_SIZE=1024
VIADEV_ON_DEMAND_THRESHOLD=64
VIADEV_NUM_RDMA_BUFFER=64
VIADEV_USE_SHMEM_COLL=0
VIADEV_USE_RDMA_BARRIER=1
VIADEV_SQ_SIZE_MAX=500
VIADEV_DEFAULT_QP_OUS_RD_ATOM=8
VIADEV_CQ_SIZE=100000
VIADEV_DEBUG=3
VIADEV_SRQ_MAX_SIZE=8192
VIADEV_ADAPTIVE_ENABLE_LIMIT=128
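
With ~100 parameters in play, a misspelled name could easily go unnoticed. The Python sketch below is one possible sanity check, not anything provided by MVAPICH: it parses `NAME=value` lines and lists names that lack the `VIADEV_` prefix. The function names and the sample text are hypothetical, and the sample deliberately includes a truncated name to show what the check reports.

```python
SAMPLE = """\
VIADEV_VBUF_POOL_SIZE=1024
ADEV_USE_RDMA_BARRIER=1
VIADEV_CQ_SIZE=100000
"""


def parse_settings(text):
    """Parse NAME=value lines into a dict, skipping blank lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a NAME=value line: {line!r}")
        settings[name.strip()] = value.strip()
    return settings


def suspicious_names(settings, prefix="VIADEV_"):
    """Return parameter names that do not start with the expected prefix."""
    return [name for name in settings if not name.startswith(prefix)]


if __name__ == "__main__":
    for name in suspicious_names(parse_settings(SAMPLE)):
        print(f"check spelling: {name}")
```

A prefix check only catches one class of typo; comparing the names against the list in mvapich.conf itself would be more thorough.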
===========================
Enda O'Brien
DALCO AG Switzerland
Aille, Barna, Co. Galway, Ireland
Tel. +353 91 591307
Mob. +353 87 7517969
===========================