[Mvapich-discuss] Possible buffer overflow for large messages?

John Moore john at flexcompute.com
Wed Sep 28 13:37:56 EDT 2022


Hello,

We have a code that does a large Gatherv operation in which the total size
of the gathered message exceeds 4 GB (it is approximately 8 GB). We have
noticed that the result of the Gatherv operation is incorrect for these
large calls. The counts that we pass into Gatherv are all within the int
limit, and we use custom data types (MPI_Type_contiguous) to allow for this
larger message size.
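
For illustration only, the size arithmetic behind such a setup can be
sketched as below. This is not the application's code: the 1 MiB block size
and the names BLOCK_BYTES and plan_gatherv are hypothetical, and no MPI
calls are made. The sketch only checks that per-rank counts and
displacements, expressed in units of one contiguous-type element, fit in an
int even when the total is about 8 GB.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical block size: bytes covered by one MPI_Type_contiguous element. */
#define BLOCK_BYTES ((uint64_t)1 << 20)

/*
 * Fill counts[] and displs[] in units of BLOCK_BYTES, the way they would be
 * passed to MPI_Gatherv when the receive type is the contiguous datatype.
 * Returns 0 on success, -1 if a per-rank size is not a whole number of
 * blocks or if a count or displacement does not fit in an int.
 */
static int plan_gatherv(const uint64_t *bytes_per_rank, int nranks,
                        int *counts, int *displs)
{
    uint64_t offset_blocks = 0;

    assert(nranks > 0);
    for (int r = 0; r < nranks; ++r) {
        if (bytes_per_rank[r] % BLOCK_BYTES != 0)
            return -1;  /* would need padding or a separate remainder pass */

        uint64_t blocks = bytes_per_rank[r] / BLOCK_BYTES;
        if (blocks > INT_MAX || offset_blocks > INT_MAX)
            return -1;  /* the int argument would overflow */

        counts[r] = (int)blocks;
        displs[r] = (int)offset_blocks;
        offset_blocks += blocks;
    }
    return 0;
}
```

Note that displacements passed to Gatherv are measured in units of the
receive type's extent, so with a contiguous type every rank's offset has to
be a whole number of blocks.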

We have also tried replacing the Gatherv call with Isend/Irecv calls, which
are all within the int representation range in terms of the number of bytes
communicated, with the same incorrect result.
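
As a rough sketch of what "within the int representation range" means for a
point-to-point replacement, the helpers below split a large buffer into
pieces of at most INT_MAX bytes. The names MAX_CHUNK_BYTES, chunk_count,
chunk_offset and chunk_length are hypothetical and the actual Isend/Irecv
code was not posted, so this only shows the offset and length arithmetic
that each call would rely on.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical cap per message: the largest byte count an int can carry. */
#define MAX_CHUNK_BYTES ((uint64_t)INT_MAX)

/* Number of messages needed to move total_bytes. */
static uint64_t chunk_count(uint64_t total_bytes)
{
    return (total_bytes + MAX_CHUNK_BYTES - 1) / MAX_CHUNK_BYTES;
}

/* Byte offset of chunk i, kept in 64 bits so it cannot wrap for multi-GB buffers. */
static uint64_t chunk_offset(uint64_t i)
{
    return i * MAX_CHUNK_BYTES;
}

/* Length of chunk i as an int, usable as a count of MPI_BYTE. */
static int chunk_length(uint64_t total_bytes, uint64_t i)
{
    uint64_t remaining = total_bytes - chunk_offset(i);
    return (int)(remaining < MAX_CHUNK_BYTES ? remaining : MAX_CHUNK_BYTES);
}
```

The point of the design is that only the per-call length is ever narrowed
to int; offsets into the buffer stay 64-bit throughout.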

When we compile with OpenMPI, the result is correct. Also, when we run the
operations on smaller data sets with MVAPICH2 the result is correct.

This job is being run across two nodes with 16 ranks total (8 ranks each).
When we place all the data on a single node, and use the same input data
and number of ranks, we again get the correct result. This leads me to
believe that some remote send/receive buffer is being exceeded.

We are running MVAPICH2-GDR-2.3.6, but these buffers are all CPU buffers,
and we are running this executable with MV2_USE_CUDA=0. Perhaps there are
some environment variables to change here? Any advice would be greatly
appreciated.

Thank you,
John