[mvapich-discuss] MPI_Type_vector() function uses dynamic allocation?

Davide Marchi davide.marchi at student.unife.it
Tue Jan 27 03:20:33 EST 2015


 Hi Khaled,

Thanks for your reply.

I think that a version with support for static pack/unpack buffers could
be very useful.

In fact, in a typical algorithm you have a data matrix over which you iterate a
compute kernel several times; the matrix is divided among a set of GPUs
arranged in a 2D or 3D grid, and each GPU stores a copy of the parts of the
matrix allocated on neighboring GPUs as halo rows and columns.

This mapping requires the halos to be updated at the beginning of every
iteration, so the dynamic allocation of the pack/unpack buffers becomes a
significant overhead.
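
Just to make the communication pattern concrete, here is a minimal sketch (not
taken from my actual code) of how the grid of ranks and the halo neighbors
could be set up with a Cartesian communicator; the 2D layout and the periodic
boundaries are illustrative assumptions:

/****************************************************************************************/
 /* Sketch only: 2D grid of ranks and halo neighbors via a Cartesian
    communicator; dims and periodic boundaries are illustrative assumptions */
 #include <mpi.h>

 int main( int argc, char **argv )
 {
    int nranks, dims[2] = { 0, 0 }, periods[2] = { 1, 1 };
    int north, south, west, east;
    MPI_Comm cart;

    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nranks );

    /* Factor the ranks into a 2D grid and build a Cartesian communicator */
    MPI_Dims_create( nranks, 2, dims );
    MPI_Cart_create( MPI_COMM_WORLD, 2, dims, periods, 0, &cart );

    /* Neighbor ranks along each dimension: these are the partners of the
       halo-row/halo-column exchanges performed at every iteration */
    MPI_Cart_shift( cart, 0, 1, &north, &south );
    MPI_Cart_shift( cart, 1, 1, &west, &east );

    /* ... allocate the local subdomain on the GPU and iterate ... */

    MPI_Finalize();
    return 0;
 }
/****************************************************************************************/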

Below I report some results comparing code that relies on dynamic allocation
of the pack/unpack buffers with code that uses static allocation of the same
buffers:

 - pseudo-code using MPI derived datatypes (the pack/unpack buffers are
allocated dynamically by the library at every call):


/****************************************************************************************/
  MPI_Datatype newtype;

  /* Describe the non-contiguous halo layout once, before the loop */
  MPI_Type_vector( ..., &newtype );
  MPI_Type_commit( &newtype );

  for( ii = 0; ii < N; ii++ ) {
     /* Update the halos: the library packs/unpacks the non-contiguous
        data internally, allocating temporary buffers at every call */
     MPI_Sendrecv(
       src, 1, newtype, dst_rank, 0,
       dst, 1, newtype, src_rank, 0,
       MPI_COMM_WORLD, MPI_STATUS_IGNORE
     );
  }

  MPI_Type_free( &newtype );
/****************************************************************************************/
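
The arguments of MPI_Type_vector() are elided above; just as an illustration
(these are not the parameters of my real code), a halo column of an NY x NX
row-major matrix of doubles could be described as:

/****************************************************************************************/
  /* Illustrative sizes only, not the parameters of my real code */
  const int NY = 1024;              /* rows of the local subdomain      */
  const int NX = 1024;              /* columns of the local subdomain   */

  MPI_Datatype column_t;
  MPI_Type_vector( NY,              /* count: one block per matrix row  */
                   1,               /* blocklength: one element         */
                   NX,              /* stride: row length, in elements  */
                   MPI_DOUBLE, &column_t );
  MPI_Type_commit( &column_t );

  /* src/dst in the MPI_Sendrecv() above would then point at the first
     element of the column to be sent/received */
/****************************************************************************************/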

 Running the derived-datatype version I get the following results:

  Num of elements   Size of buffer   Time/iteration (us)   Bandwidth (GB/s)
             1024             8 KB                70.382              0.233
             2048            16 KB                71.006              0.461
             4096            32 KB                86.585              0.757
             8192            64 KB                95.398              1.374
            16384           128 KB               138.651              1.891
          1048576             1 MB              1728.361              1.213
          2097152             2 MB              2232.564              1.879
          4194304             4 MB              3074.515              2.728
          8388608             8 MB              4944.433              3.393
         16777216            16 MB              9512.532              3.527


This is pseudo-code for the same computation using a custom pack/unpack
implementation with statically allocated buffers:

/****************************************************************************************/
 /* Allocate the pack/unpack staging buffers once, before the loop */
 cudaMalloc( (void **) &sndBuf, ... );
 cudaMalloc( (void **) &rcvBuf, ... );

 for( ii = 0; ii < N; ii++ ) {
     /* Pack the non-contiguous elements into the pre-allocated send buffer */
     my_pack( src, sndBuf );

     /* Update the halos with a contiguous transfer */
     MPI_Sendrecv(
       sndBuf, nElem, type_of_data, dst_rank, 0,
       rcvBuf, nElem, type_of_data, src_rank, 0,
       MPI_COMM_WORLD, MPI_STATUS_IGNORE
     );

     /* Unpack the received halo into the destination matrix */
     my_unpack( rcvBuf, dst );
  }

  cudaFree( sndBuf );
  cudaFree( rcvBuf );
/****************************************************************************************/
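
As a rough idea of what such a pack step can look like (this is a sketch, not
my actual my_pack()), a CUDA kernel copying one halo column of a row-major
NY x NX matrix of doubles into the pre-allocated contiguous buffer could be:

/****************************************************************************************/
 /* Sketch only: pack one halo column of a row-major ny x nx matrix of
    doubles into a contiguous, pre-allocated device buffer */
 __global__ void pack_column( const double *src, double *sndBuf,
                              int ny, int nx, int col )
 {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if( row < ny )
       sndBuf[row] = src[row * nx + col];   /* strided load, contiguous store */
 }

 /* Launched once per iteration in place of the dynamic allocation, e.g.:
      pack_column<<< (NY + 255) / 256, 256 >>>( src, sndBuf, NY, NX, NX - 1 );
    followed by a stream/device synchronization before the MPI_Sendrecv() */
/****************************************************************************************/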

With the static-buffer version I get better results, especially for buffer
sizes of 1 MB and larger:


  Num of elements   Size of buffer   Time/iteration (us)   Bandwidth (GB/s)
             1024             8 KB                60.880              0.291
             2048            16 KB                85.626              0.580
             4096            32 KB                90.001              1.057
             8192            64 KB               105.442              1.795
            16384           128 KB               135.409              2.413
          1048576             1 MB               558.556              3.997
          2097152             2 MB              1031.525              4.176
          4194304             4 MB              1982.740              4.272
          8388608             8 MB              3915.898              4.330
         16777216            16 MB              7774.758              4.360


Thanks
-- 
Davide Marchi