<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]--><style><!--
/* Font Definitions */
@font-face
{font-family:Courier;
panose-1:0 0 0 0 0 0 0 0 0 0;}
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:"Avenir Next LT Pro";
panose-1:2 11 5 4 2 2 2 2 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
font-size:12.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-CA" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt">Good day OSU team,<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt">I’ve been debugging an issue that appeared after we installed an extra ConnectX-5 in some of the servers in our system (12 out of 48). As soon as a job includes even a single rank from a server hosting
an extra CX-5, we start seeing problems like the one shown below:<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt">In this case, I’ve disabled the PCIe slot hosting the extra CX-5 on all servers except one, and I’m still getting the error. If I remove the offending server from the “all_cards.cfg” host file,
I can use all the remaining hosts and the maximum number of ranks. The CX-5s were added a month ago. I initially suspected I had made a mistake in the way I built the latest code, but I’ve tried multiple versions released to Rockport and still end up in this state.
Depending on the version being used, I get a different error (which is strange). The problem started when the cables were connected between the CX-5s and the switches. Unless I disable the PCIe slot hosting the extra card, I cannot run the simple test below.
Removing “MV2_HOMOGENEOUS_CLUSTER=1” does not make any difference, and explicitly specifying “MV2_IBA_HCA=mlx5_0” doesn’t help either.<o:p></o:p></span></p>
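<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt">A minimal sketch of one way to produce the reduced host file mentioned above, assuming the host file lists one bare host name or IP address per line (entries written as host:N would need a different match). The function names, BAD_HOST and reduced.cfg are hypothetical placeholders, not values from this report.<o:p></o:p></span></p>
<pre style="font-size:11.0pt;font-family:Courier">
```shell
# Hypothetical helper: copy a host file while leaving out one host.
# Assumes one bare host name or IP address per line.
exclude_host() {
    # $1 = input host file, $2 = host to drop, $3 = output host file
    # -F: treat the host as a fixed string, -x: match the whole line,
    # -v: keep only the lines that do NOT match
    grep -v -x -F -- "$2" "$1" > "$3"
}

# Hypothetical helper: count the non-empty lines (hosts) in a host file.
count_hosts() {
    grep -c . "$1"
}
```
</pre>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt">For example, running exclude_host all_cards.cfg "$BAD_HOST" reduced.cfg and then count_hosts reduced.cfg gives a file to pass to -f and, at one rank per host, the value to use for -np.<o:p></o:p></span></p>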
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt">Note: The interfaces were not configured at this stage, and I have not used the cards at all. They are CX-5 VPI cards, and they are still configured for IB. I couldn’t find anything in the User’s
Guide related to the problem I’m seeing. It is likely a configuration issue at my end, but I don’t know what I’m missing. Unfortunately, I had to disable all the cards in order to run tests, but if required I can reconfigure some of the servers to reproduce the problem
and capture extra information.<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt">I tested with the official mvapich2-2.3.6 and with the 3 drops we received for Rockport.
<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:Courier">/opt/bm/hpc/mvapich2-2.3.7pre-rockportqos-nov15/bin/mpiexec -np $((1*47)) -f all_cards.cfg -env MV2_USE_RDMA_CM=1 -env MV2_HOMOGENEOUS_CLUSTER=1 -env MV2_HYBRID_ENABLE_THRESHOLD=102400
-env MV2_NDREG_ENTRIES_MAX=100000 -env MV2_NDREG_ENTRIES=50000 -env MV2_IBA_HCA=mlx5_0 /opt/bm/hpc/mvapich2-latest/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -f -i 100<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:Courier">ssh: connect to host 172.20.141.148 port 22: No route to host<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:Courier">^C[mpiexec@dell-s13-h1] Sending Ctrl-C to processes as requested<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:Courier">[mpiexec@dell-s13-h1] Press Ctrl-C again to force abort<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:Courier">[mpiexec@dell-s13-h1] HYDU_sock_write (../../../../src/pm/hydra/utils/sock/sock.c:303): write error (Bad file descriptor)<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:Courier">[mpiexec@dell-s13-h1] HYD_pmcd_pmiserv_send_signal (../../../../src/pm/hydra/pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:Courier">[mpiexec@dell-s13-h1] ui_cmd_cb (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:Courier">[mpiexec@dell-s13-h1] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:Courier">[mpiexec@dell-s13-h1] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:Courier">[mpiexec@dell-s13-h1] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:340): process manager error waiting for completion<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:Courier">[user@dell-s13-h1
<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt">Regards, <o:p></o:p></span></p>
<p class="MsoNormal"><b><span style="font-size:8.0pt;font-family:"Avenir Next LT Pro",sans-serif">Nicolas Gagnon<o:p></o:p></span></b></p>
<p class="MsoNormal"><span style="font-size:8.0pt;font-family:"Avenir Next LT Pro",sans-serif">Principal Designer/Architect, Engineering<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:8.0pt;font-family:"Avenir Next LT Pro",sans-serif"><a href="mailto:ngagnon@rockportnetworks.com">ngagnon@rockportnetworks.com</a>
<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:8.0pt;font-family:"Avenir Next LT Pro",sans-serif">Rockport | Simplify the Network<o:p></o:p></span></p>
</div>
</body>
</html>