[mvapich-discuss] Which VIADEV* parameters might free up a "hang" on 64 or more cores, when job runs fine up to 32 cores?

Enda O'Brien enda.obrien at dalco.ch
Mon Oct 13 11:23:25 EDT 2008


Hello,

I saw this address at the top of the mvapich.conf file on the system I'm using, so I thought I'd submit this question:

What parameter(s) in the mvapich.conf file might be adjusted to "free" up a job that is "hanging" on 64 or more cores, but which runs fine on 8, 16 or 32 cores?

When such a thing happens on a Quadrics cluster (as it sometimes does...), I can usually adjust (increase) LIBELAN_TPORT_BIGMSG and LIBELAN_ALLOC_SIZE to free the log-jam.  That's just 2 parameters.  However, there are ~100 VIADEV* parameters in mvapich.conf, and the ones I've adjusted so far haven't made any difference.

The main MPI function in the application in question is MPI_Alltoall, but it accounts for only ~3 minutes of the 80-minute run on 32 cores.
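
For a rough sense of how the all-to-all traffic grows between the working and hanging core counts, here is a minimal sketch. It assumes a naive pairwise exchange in which every rank sends one message to every other rank; the algorithm MVAPICH actually selects for MPI_Alltoall may differ, and the function name is hypothetical.

```python
def alltoall_message_count(num_ranks: int) -> int:
    """Point-to-point messages per all-to-all call, ASSUMING a naive
    pairwise exchange (each rank sends one message to every other rank).
    This is an illustration only, not how any particular MPI library
    necessarily implements MPI_Alltoall."""
    if num_ranks < 1:
        raise ValueError("num_ranks must be positive")
    return num_ranks * (num_ranks - 1)


if __name__ == "__main__":
    # Core counts mentioned in this post: fine up to 32, hanging at 64.
    for ranks in (8, 16, 32, 64):
        print(ranks, alltoall_message_count(ranks))
```

Under that assumption the message count roughly quadruples each time the core count doubles, so buffer- or queue-related limits could plausibly be reached first at 64 cores.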

Any tips, advice, recommendations gratefully received!

Best wishes,
Enda

P.S. Here are the settings I've tried:
VIADEV_VBUF_TOTAL_SIZE=49152
VIADEV_VBUF_POOL_SIZE=1024
VIADEV_ON_DEMAND_THRESHOLD=64
VIADEV_NUM_RDMA_BUFFER=64
VIADEV_USE_SHMEM_COLL=0
VIADEV_USE_RDMA_BARRIER=1
VIADEV_SQ_SIZE_MAX=500
VIADEV_DEFAULT_QP_OUS_RD_ATOM=8
VIADEV_CQ_SIZE=100000
VIADEV_DEBUG=3
VIADEV_SRQ_MAX_SIZE=8192
VIADEV_ADAPTIVE_ENABLE_LIMIT=128
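
With this many parameters, a mistyped name is easy to miss. The following is a minimal sketch of a sanity check that flags any setting whose name does not start with VIADEV_. It assumes plain NAME=VALUE lines and `#` comment lines (an assumption about the file layout); the function name and the sample lines are hypothetical, and the check says nothing about whether MVAPICH actually recognises a given name.

```python
def find_suspect_names(lines, prefix="VIADEV_"):
    """Return setting names that do not start with the expected prefix.

    ASSUMES simple NAME=VALUE lines; blank lines, lines starting with '#',
    and lines without '=' are skipped.
    """
    suspects = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name = line.split("=", 1)[0].strip()
        if not name.startswith(prefix):
            suspects.append(name)
    return suspects


if __name__ == "__main__":
    # Hypothetical sample input; the second name lacks the expected prefix.
    sample = [
        "VIADEV_CQ_SIZE=100000",
        "XADEV_CQ_SIZE=100000",
        "# a comment line",
    ]
    print(find_suspect_names(sample))
```

To check a real file, the lines could be read with `open(path).readlines()` and passed to the same function.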

===========================
   Enda O'Brien
       DALCO AG Switzerland
       Aille, Barna, Co. Galway, Ireland
          Tel. +353 91 591307
         Mob. +353 87 7517969
===========================


