[mvapich-discuss] Bug in mvapich-0.9.9-beta collective operations

Mishura, Dmitri dmitri.mishura at intel.com
Wed Apr 11 12:38:48 EDT 2007


Hi all,
Sorry, my previous mail does not appear to have been sent to the list properly.

I would like to post a patch that fixes a collective bandwidth issue with large (>0.5 MB) vector sizes in mvapich-0.9.9-beta.
Without this patch, mvapich shows poor bandwidth at core counts >=64 on several Intel clusters. This appears to be due to an indexing error in the collective threshold table in the file intra_fns_new.c, which causes an unexpected switch to the old method (e.g. “recursive doubling” in intra_AllReduce). With the fix, bandwidth (allreduce in this particular case) improved substantially: about 6x on our InfiniBand clusters, from 25Mb/s to 150Mb/s on 64 cores for sizes larger than 512KB.
 
Patch for src/coll/intra_fns_new.c:
========================================
98c98
< #define COLL_SIZE  4
---
> #define COLL_SIZE  5
103c103
< int coll_table[COLL_COUNT][COLL_SIZE+1] = {{-1, -1, -1, 16384, 16384},
---
> int coll_table[COLL_COUNT][COLL_SIZE] = {{-1, -1, -1, 16384, 16384},

=========================================
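
To illustrate the kind of off-by-one this patch addresses, here is a minimal sketch. It is not the actual lookup code from intra_fns_new.c, which is not shown in this message. The names NUM_COLUMNS, demo_row, demo_max_procs and lookup_threshold, as well as the bucket boundaries and table values, are hypothetical. The sketch assumes a lookup loop bounded by a width macro: when that macro is one smaller than the real row length, the last column is never reached.

```c
#include <assert.h>

#define NUM_COLUMNS 5   /* actual number of entries per row */

/* Illustrative row only; these values are not taken from MVAPICH. */
static const int demo_row[NUM_COLUMNS] = {-1, -1, -1, 16384, 524288};

/* Hypothetical upper process-count bound covered by each column. */
static const int demo_max_procs[NUM_COLUMNS] = {4, 8, 16, 32, 64};

/*
 * Pick the threshold for `nprocs`, scanning only the first `width`
 * columns. If no scanned column covers nprocs, fall back to the last
 * scanned column.
 */
int lookup_threshold(int nprocs, int width)
{
    assert(width >= 1 && width <= NUM_COLUMNS);
    for (int i = 0; i < width; i++) {
        if (nprocs <= demo_max_procs[i])
            return demo_row[i];
    }
    return demo_row[width - 1];
}
```

In this sketch, lookup_threshold(64, 4) stops at the fourth column and returns 16384, while lookup_threshold(64, 5) reaches the fifth column and returns 524288. Smaller process counts give the same result with either width, which would be consistent with a problem that only shows up at higher core counts.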


Dmitry Mishura, Intel Nizhny Novgorod Lab


