[mvapich-discuss] the bug in mvapich-0.9.9-beta2
Shalnov, Sergey
Sergey.Shalnov at intel.com
Wed Apr 11 11:53:03 EDT 2007
Hello,
I started working with mvapich-0.9.9-beta2 and found that the performance of
collective operations such as MPI_Allreduce drops dramatically once the
message size exceeds 512 KB. Here are the numbers from my experiments:
Message size (bytes)   mvapich-0.9.9 (MB/s)
4096                    89.6467
8192                   117.332
16384                  142.787
32768                  158.847
65536                  144.555
131072                 152.266
262144                 166.436
524288                  32.6395
1048576                 30.8811
2097152                 27.8332
4194304                 26.4895
8388608                 26.157
16777216                25.4449
33554432                25.9249
The first column is the message size passed to MPI_Allreduce and the second
column is the measured network bandwidth in MB/s.
We looked into the code and found a bug in
$MVAPICH_BUILD_HOME/src/coll/intra_fns_new.c:103. This line defines the
array coll_table. The second dimension of this array is the macro COLL_SIZE,
which is defined as 5 on line 98. As I understand it, defining COLL_SIZE as 5
is not correct - it should be defined as 4, and the definition of coll_table
should be rewritten as int coll_table[COLL_COUNT][COLL_SIZE+1]...
This matters because line 555 of the same file contains the following code:
if (lgn > COLL_SIZE) lgn = COLL_SIZE;
After this clamp, lgn is used as an array index, so when lgn == COLL_SIZE it
reads one element past the end of the array.
After making this small fix in the file, I got the following results:
Message size (bytes)   mvapich-0.9.9-fixed (MB/s)
4096                    87.7509
8192                   118.255
16384                  139.281
32768                  153.702
65536                  137.334
131072                 141.985
262144                 152.65
524288                 187.648
1048576                153.005
2097152                120.034
4194304                108.731
8388608                 99.7531
16777216                97.5817
33554432                96.4089
So, I think I have found a bug in the mvapich-0.9.9-beta2 code.
Thank you
Sergey