[mvapich-discuss] MPI_Type_vector() function uses dynamic allocation?
Davide Marchi
davide.marchi at student.unife.it
Tue Jan 27 03:20:33 EST 2015
Hi Khaled,
Thanks for your reply.
I think a version with support for static pack/unpack buffers could be
very useful.
In fact, in a typical algorithm you have a data matrix on which you iterate a
compute kernel several times; the matrix is divided across a set of GPUs
arranged on a 2D or 3D grid, and each GPU stores the parts of the matrix
allocated on neighbor GPUs as halo rows and columns.
This mapping requires that the halos be updated at the beginning of each
iteration, so the dynamic allocation of pack/unpack buffers becomes a
significant overhead.
Below I report some results comparing code that uses dynamic allocation of
pack/unpack buffers with code that uses static allocation of the same
buffers:
- pseudocode using MVAPICH2 derived datatypes (dynamic allocation of
buffers):
/****************************************************************************************/
MPI_Datatype newtype;

/* Build a derived datatype describing the non-contiguous halo elements */
MPI_Type_vector( ..., &newtype );
MPI_Type_commit( &newtype );

for( ii = 0; ii < N; ii++ ) {
    /* Update the halos; the library packs/unpacks the datatype internally,
       allocating its staging buffers at every call */
    MPI_Sendrecv(
        src, 1, newtype, dst_rank, 0,
        dst, 1, newtype, src_rank, 0,
        MPI_COMM_WORLD, MPI_STATUS_IGNORE
    );
}

MPI_Type_free( &newtype );
/****************************************************************************************/
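For a column halo of a row-major NX x NY matrix, the derived datatype above
would be built roughly as in the following sketch; NX, NY, the one-element
halo width and the function name are placeholders, and MPI_DOUBLE is only
assumed from the 8-byte elements implied by the buffer sizes below:
/****************************************************************************************/
#include <mpi.h>

/* Hypothetical sketch: build a derived datatype for one halo column of a
   row-major NX x NY matrix of doubles.  NX and NY are placeholder names. */
static MPI_Datatype make_halo_column_type( int NX, int NY )
{
    MPI_Datatype col_type;
    MPI_Type_vector( NY,          /* count: one block per matrix row       */
                     1,           /* blocklength: halo width in elements   */
                     NX,          /* stride between blocks, in elements    */
                     MPI_DOUBLE,  /* assumed base type                     */
                     &col_type );
    MPI_Type_commit( &col_type );
    return col_type;
}
/****************************************************************************************/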
Running this code I get the following results:
Num of elements   Size of buffer   Time/iteration (us)   Bandwidth (GB/s)
---------------   --------------   -------------------   ----------------
           1024             8 KB                70.382              0.233
           2048            16 KB                71.006              0.461
           4096            32 KB                86.585              0.757
           8192            64 KB                95.398              1.374
          16384           128 KB               138.651              1.891
        1048576             1 MB              1728.361              1.213
        2097152             2 MB              2232.564              1.879
        4194304             4 MB              3074.515              2.728
        8388608             8 MB              4944.433              3.393
       16777216            16 MB              9512.532              3.527
- pseudocode of the same code using a custom pack/unpack with statically
allocated buffers:
/****************************************************************************************/
/* Allocate the pack/unpack staging buffers once, outside the loop */
cudaMalloc( (void **) &sndBuf, ... );
cudaMalloc( (void **) &rcvBuf, ... );

for( ii = 0; ii < N; ii++ ) {
    /* Pack non-contiguous elements into the pre-allocated buffer */
    my_pack( src, sndBuf );

    /* Update the halos using contiguous, pre-allocated buffers */
    MPI_Sendrecv(
        sndBuf, nElem, type_of_data, dst_rank, 0,
        rcvBuf, nElem, type_of_data, src_rank, 0,
        MPI_COMM_WORLD, MPI_STATUS_IGNORE
    );

    /* Unpack the received halo into the destination matrix */
    my_unpack( rcvBuf, dst );
}

cudaFree( sndBuf );
cudaFree( rcvBuf );
/****************************************************************************************/
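Here my_pack is essentially a gather of the strided halo elements into the
contiguous sndBuf; a rough sketch of such a kernel for a single column halo
is shown below (the kernel, the extra NX/NY parameters and the launch
configuration are placeholders, not the actual implementation; my_unpack
would be the mirror-image scatter):
/****************************************************************************************/
#include <cuda_runtime.h>

/* Hypothetical gather kernel: copy one strided halo column into a
   contiguous device buffer; NX is the row length in elements. */
__global__ void pack_column_kernel( const double *src, double *sndBuf,
                                    int NX, int NY )
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if( row < NY )
        sndBuf[row] = src[row * NX];   /* one halo element per row */
}

/* Host-side wrapper playing the role of my_pack() in the pseudocode above */
void my_pack( const double *src, double *sndBuf, int NX, int NY )
{
    int threads = 256;
    int blocks  = (NY + threads - 1) / threads;
    pack_column_kernel<<<blocks, threads>>>( src, sndBuf, NX, NY );
    cudaDeviceSynchronize();   /* ensure the packed buffer is ready before MPI_Sendrecv */
}
/****************************************************************************************/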
In the latter case I get better results, especially for MPI buffers larger
than 2 KB.
Num of elements   Size of buffer   Time/iteration (us)   Bandwidth (GB/s)
---------------   --------------   -------------------   ----------------
           1024             8 KB                60.880              0.291
           2048            16 KB                85.626              0.580
           4096            32 KB                90.001              1.057
           8192            64 KB               105.442              1.795
          16384           128 KB               135.409              2.413
        1048576             1 MB               558.556              3.997
        2097152             2 MB              1031.525              4.176
        4194304             4 MB              1982.740              4.272
        8388608             8 MB              3915.898              4.330
       16777216            16 MB              7774.758              4.360
Thanks
--
Davide Marchi