<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1256">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:Latha;
panose-1:2 0 4 0 0 0 0 0 0 0;}
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Consolas;
panose-1:2 11 6 9 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:10.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}
p.MsoPlainText, li.MsoPlainText, div.MsoPlainText
{mso-style-priority:99;
mso-style-link:"Plain Text Char";
margin:0in;
font-size:10.0pt;
font-family:Consolas;}
span.PlainTextChar
{mso-style-name:"Plain Text Char";
mso-style-priority:99;
mso-style-link:"Plain Text";
font-family:Consolas;}
span.EmailStyle22
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt">Hi, Adam.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Thanks for the report. Sorry to hear that you're facing issues.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">We will take a look at this and get back to you shortly.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Thx,<br>
Hari.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b><span style="font-size:11.0pt">From:</span></b><span style="font-size:11.0pt"> Mvapich-discuss &lt;mvapich-discuss-bounces@lists.osu.edu&gt;
<b>On Behalf Of </b>Goldman, Adam via Mvapich-discuss<br>
<b>Sent:</b> Wednesday, May 18, 2022 5:29 PM<br>
<b>To:</b> mvapich-discuss@lists.osu.edu<br>
<b>Cc:</b> Heinz, Michael &lt;michael.heinz@intel.com&gt;; Wan, Kaike &lt;kaike.wan@intel.com&gt;<br>
<b>Subject:</b> [Mvapich-discuss] 2 Issues with data validation on OSU 5.9<o:p></o:p></span></p>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoPlainText"><span style="font-size:11.0pt">Hello,<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt">I am running OSU 5.9 with data validation and have noticed 2 issues:<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt">1) Running "osu_multi_lat" with a high number of ranks/node results in 'Out of Memory' failures:<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt">Configuration: <o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"> 48 ranks/node * 4 nodes (192 ranks total)<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"> Running over OMPI with OFI (psm3 provider).<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"> Args: "-c"<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"> Mem Size: 64GB/node<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt">ERROR (Dmesg): <o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"> [107540.289787] Out of memory: Killed process 114599 (osu_multi_lat) total-vm:2278092kB, anon-rss:1636984kB, file-rss:0kB, shmem-rss:1644kB, UID:0 pgtables:4236kB oom_score_adj:0<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"> [107540.456582] oom_reaper: reaped process 114599 (osu_multi_lat), now anon-rss:0kB, file-rss:0kB, shmem-rss:1644kB<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt">This was easily repeatable. However, when I started at message size 524288 ("-m 524288:"), the run got a bit further (2 more message sizes).<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt">I think there might be a memory leak with data validation.<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt">Without data validation, the run does not use even half of the total memory.<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt">2) Running pt2pt benchmarks on CUDA with args "H D" or "D H" fails:<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt">Configuration: <o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"> 1 rank/node * 2 nodes (2 ranks total)<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"> Running over CUDA enabled OMPI with OFI (psm3 provider).<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"> Args: "&lt;OSU&gt; -c [DST] [SRC]"<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt">ERROR: (osu_bibw -c D H)<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"> # OSU MPI-CUDA Bi-Directional Bandwidth Test v5.9<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"> # Send Buffer on DEVICE (D) and Receive Buffer on HOST (H)<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"> # Size Bandwidth (MB/s) Validation<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"> [../../util/osu_util_mpi.c:940] CUDA call 'cudaMemcpy((void *)s_buf, (void *)temp_s_buffer, size, cudaMemcpyHostToDevice)' failed with 1: invalid argument<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt">Looks to be repeatable on all pt2pt benchmarks.<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt">A quick look at the code shows that it does not check what the src and dst buffers are before calling memcpy/cudaMemcpy.<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt">Managed buffers (MH and MD) are also not handled correctly and seem to report false errors on validation.
<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt">Regards,<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt">Adam Goldman<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt">Intel Corporation<o:p></o:p></span></p>
<p class="MsoPlainText"><span style="font-size:11.0pt"><a href="mailto:adam.goldman@intel.com">adam.goldman@intel.com</a><o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
</div>
</body>
</html>