[mvapich-discuss] Proposal to fix MPI_Allreduce bandwidth

Shalnov, Sergey Sergey.Shalnov at intel.com
Wed Apr 18 07:45:33 EDT 2007


Hello,
I downloaded a fresh copy of mvapich-0.9.9 from SVN and ran several
experiments with collective operations such as MPI_Allreduce and
MPI_Allgatherv. I found that the MPI_Allreduce bandwidth drops sharply
for message sizes between 16 KB and 512 KB. I cannot speak for other
architectures, but the problem shows up on my Intel-based InfiniBand
clusters (I tested two clusters; the results below are from one of
them).

The attached Microsoft Excel spreadsheet contains the results and
graphs so you can examine them. There are three columns:
1 - mvapich-0.9.9: the mvapich-0.9.9 release from the tarball.
2 - mvapich-0.9.9-trunk: the version from the main trunk (including
Dmitri Mishura's fix,
http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2007-April/000734.html).
3 - mvapich-0.9.9-fixed1: #2 plus the attached patch.

The attached patch applies to line 73 of
$MVAPICH_BUILD_HOME/src/coll/intra_fns_new.c. That line currently reads

       #define SHMEM_COLL_ALLREDUCE_THRESHOLD (1<<19)

and I propose changing it to

       #define SHMEM_COLL_ALLREDUCE_THRESHOLD (1<<15)

i.e. lowering the SHMEM_COLL_ALLREDUCE_THRESHOLD value from 512 KB
(1<<19) to 32 KB (1<<15).
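
Since the change is a single line, the attached fix2.patch amounts to a
diff roughly like the following sketch (the hunk header is only
illustrative; the two #define lines come from the file as quoted above):

--- src/coll/intra_fns_new.c
+++ src/coll/intra_fns_new.c
@@ -73 +73 @@
-#define SHMEM_COLL_ALLREDUCE_THRESHOLD (1<<19)
+#define SHMEM_COLL_ALLREDUCE_THRESHOLD (1<<15)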

This change improves the bandwidth on my cluster, as shown below:

Message size (bytes)    mvapich-0.9.9    mvapich-0.9.9-fixed1    mvapich-0.9.9-trunk
--------------------    -------------    --------------------    -------------------
                4096          89.6467                 93.0733                93.2853
                8192         117.332                 139.493                137.496
               16384         142.787                 184.153                185.812
               32768         158.847                 286.147                206.245
               65536         144.555                 328.089                192.31
              131072         152.266                 289.667                190.743
              262144         166.436                 279.73                 203.1
              524288          32.6395                253.501                252.428
             1048576          30.8811                231.03                 229.957
             2097152          27.8332                199.249                201.419
             4194304          26.4895                191.914                192.835
             8388608          26.157                 183.449                184.363
            16777216          25.4449                178.985                181.572
            33554432          25.9249                177.411                179.012

The testing method is to send the same total amount of data (167772160
bytes) in every iteration, split into blocks of the given size (the
message size in bytes). That way each iteration measures the network
bandwidth of the MPI collective operation for one particular message
size.
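
For reference, a minimal benchmark along these lines looks roughly like
the sketch below. It is only an illustration of the method, not the
exact test code I ran; the MPI_DOUBLE/MPI_SUM choice and the MB/s
reporting are arbitrary assumptions, while the size sweep and the fixed
167772160-byte volume match the description above.

/* Sketch: for each message size, do enough MPI_Allreduce calls to
 * move a fixed total of 167772160 bytes, then report throughput. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define TOTAL_BYTES 167772160L   /* same total volume for every message size */

int main(int argc, char **argv)
{
    int rank;
    long msg, count, iters, i;
    double *sendbuf, *recvbuf, t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* message sizes from 4 KB up to 32 MB, doubling each time */
    for (msg = 4096; msg <= (32L << 20); msg *= 2) {
        count = msg / sizeof(double);   /* elements per MPI_Allreduce call */
        iters = TOTAL_BYTES / msg;      /* calls needed to move TOTAL_BYTES */
        sendbuf = malloc(msg);
        recvbuf = malloc(msg);
        for (i = 0; i < count; i++)
            sendbuf[i] = 1.0;

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++)
            MPI_Allreduce(sendbuf, recvbuf, (int)count, MPI_DOUBLE,
                          MPI_SUM, MPI_COMM_WORLD);
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("%10ld bytes: %8.2f MB/s\n",
                   msg, (TOTAL_BYTES / (1024.0 * 1024.0)) / (t1 - t0));

        free(sendbuf);
        free(recvbuf);
    }

    MPI_Finalize();
    return 0;
}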

Thank you 
Sergey



-------------- next part --------------
A non-text attachment was scrubbed...
Name: bug_fix_tests_mvapich-0.9.9_main_trunk.xls
Type: application/vnd.ms-excel
Size: 48640 bytes
Desc: bug_fix_tests_mvapich-0.9.9_main_trunk.xls
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20070418/1ba97181/bug_fix_tests_mvapich-0.9.9_main_trunk-0001.xls
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fix2.patch
Type: application/octet-stream
Size: 428 bytes
Desc: fix2.patch
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20070418/1ba97181/fix2-0001.obj

