[mvapich-discuss] Proposal to fix MPI_Allreduce bandwidth

amith rajith mamidala mamidala at cse.ohio-state.edu
Wed Apr 18 11:03:35 EDT 2007


Hi Shalnov,

Thanks for sending the performance data. We are looking into this.

Thanks,
Amith

On Wed, 18 Apr 2007, Shalnov, Sergey wrote:

> Hello,
> I downloaded a fresh version of mvapich-0.9.9 via svn and ran several
> experiments with collective operations such as MPI_Allreduce and
> MPI_Allgatherv. I found that the MPI_Allreduce bandwidth has a kind of
> hole for message sizes from 16 KB to 512 KB. I am not sure about other
> architectures, but it shows up on my Intel-based InfiniBand clusters
> (I tested on two clusters; the results below are from one of them).
>
> The attached Microsoft Excel spreadsheet contains the results and graphs
> to help you examine them. There are three columns:
> 1 - mvapich-0.9.9 is the version of mvapich-0.9.9 from the tarball.
> 2 - mvapich-0.9.9-trunk is the version from the main trunk (with Dmitri
> Mishura's fix included,
> http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2007-April/000734.html).
> 3 - mvapich-0.9.9-fixed1 is #2 plus the attached patch.
>
> The attached patch file applies to
> $MVAPICH_BUILD_HOME/src/coll/intra_fns_new.c, line 73.
> That line currently reads:
>
>        #define SHMEM_COLL_ALLREDUCE_THRESHOLD (1<<19)
>
> I propose changing it to:
>
>        #define SHMEM_COLL_ALLREDUCE_THRESHOLD (1<<15)
>
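> For context, here is a minimal sketch of how I understand the threshold
> to be used (the names and structure are illustrative, not the actual
> intra_fns_new.c code): messages below the threshold presumably take the
> shared-memory allreduce path, while larger messages fall back to the
> default point-to-point algorithm, which would explain why lowering the
> cut-off helps the 32 KB - 512 KB range on my cluster.
>
>        /* Illustrative sketch only; hypothetical names, not MVAPICH code. */
>        #define SHMEM_COLL_ALLREDUCE_THRESHOLD (1<<15)  /* proposed value */
>
>        enum allreduce_algo { ALLREDUCE_SHMEM, ALLREDUCE_P2P };
>
>        /* Pick the algorithm from the per-process message size in bytes. */
>        static enum allreduce_algo choose_allreduce(long nbytes)
>        {
>            return (nbytes < SHMEM_COLL_ALLREDUCE_THRESHOLD)
>                       ? ALLREDUCE_SHMEM   /* shared-memory collective     */
>                       : ALLREDUCE_P2P;    /* default point-to-point path  */
>        }
>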
> This change improves bandwidth on my cluster as shown below:
>
> Message size (bytes)   mvapich-0.9.9   mvapich-0.9.9-fixed1   mvapich-0.9.9-trunk
> 4 096                        89.6467                93.0733               93.2853
> 8 192                       117.332                139.493               137.496
> 16 384                      142.787                184.153               185.812
> 32 768                      158.847                286.147               206.245
> 65 536                      144.555                328.089               192.31
> 131 072                     152.266                289.667               190.743
> 262 144                     166.436                279.73                203.1
> 524 288                      32.6395               253.501               252.428
> 1 048 576                    30.8811               231.03                229.957
> 2 097 152                    27.8332               199.249               201.419
> 4 194 304                    26.4895               191.914               192.835
> 8 388 608                    26.157                183.449               184.363
> 16 777 216                   25.4449               178.985               181.572
> 33 554 432                   25.9249               177.411               179.012
>
> The testing method sends the same total amount of data (167,772,160
> bytes) in each iteration, broken into messages of the given size, so
> each iteration measures the network bandwidth for one particular message
> size of the MPI collective operation.
>
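> For reference, a rough sketch of the kind of measurement loop the test
> uses (illustrative only, not the exact benchmark source; the bandwidth
> formula and buffer handling here are approximations):
>
>        /* Rough reconstruction of the benchmark: for each message size,
>         * send a fixed total of 167772160 bytes through MPI_Allreduce
>         * and report the resulting bandwidth. */
>        #include <mpi.h>
>        #include <stdio.h>
>        #include <stdlib.h>
>
>        #define TOTAL_BYTES 167772160L
>
>        int main(int argc, char **argv)
>        {
>            int rank;
>            MPI_Init(&argc, &argv);
>            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>            for (long msg = 4096; msg <= (1L << 25); msg *= 2) {
>                long iters = TOTAL_BYTES / msg;            /* calls per size   */
>                int  count = (int)(msg / sizeof(double));  /* doubles per call */
>                double *in  = calloc(count, sizeof(double));
>                double *out = calloc(count, sizeof(double));
>
>                MPI_Barrier(MPI_COMM_WORLD);
>                double t0 = MPI_Wtime();
>                for (long i = 0; i < iters; i++)
>                    MPI_Allreduce(in, out, count, MPI_DOUBLE,
>                                  MPI_SUM, MPI_COMM_WORLD);
>                double t1 = MPI_Wtime();
>
>                if (rank == 0)
>                    printf("%10ld bytes: %8.2f MB/s\n",
>                           msg, TOTAL_BYTES / (t1 - t0) / 1.0e6);
>                free(in);
>                free(out);
>            }
>            MPI_Finalize();
>            return 0;
>        }
>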
> Thank you
> Sergey
>
>
>
>


