<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:PMingLiU;
panose-1:2 2 5 0 0 0 0 0 0 0;}
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:"\@PMingLiU";
panose-1:2 1 6 1 0 1 1 1 1 1;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:10.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle18
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;
mso-ligatures:none;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body lang="EN-US" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt">Hello,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">I am writing to follow up on the report I sent regarding the FATAL event of local access violation that occurred during the execution of our MPI job. As this was a critical issue, I wanted to ensure that it
has been addressed properly and that there are no further concerns. Could you please confirm if you have received the report and if any progress has been made towards resolving the issue? I would greatly appreciate it if you could keep me informed about any
updates regarding this matter.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">Thank you,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt">-ZQ<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal" style="margin-bottom:12.0pt"><b><span style="font-size:12.0pt;color:black">From:
</span></b><span style="font-size:12.0pt;color:black">You, Zhi-Qiang <zyou@osc.edu><br>
<b>Date: </b>Wednesday, April 26, 2023 at 12:41 PM<br>
<b>To: </b>mvapich-discuss@lists.osu.edu <mvapich-discuss@lists.osu.edu><br>
<b>Subject: </b>Report of FATAL event local access violation in a large MPI job<o:p></o:p></span></p>
</div>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">Hello,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual"> <o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">I am writing to report an error message that I encountered in a large hybrid-MPI job while using the MVAPICH2 2.3.6. The error message reads as follows:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual"> <o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">[p0555.ten.osc.edu:mpi_rank_483][async_thread] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1340: Got FATAL event local access violation work queue error on QP
0x1621<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual"> <o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">[p0538.ten.osc.edu:mpi_rank_471][handle_cqe] Send desc error in msg to 483, wc_opcode=0<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">[p0538.ten.osc.edu:mpi_rank_471][handle_cqe] Msg from 483: wc.status=10 (remote access error), wc.wr_id=0x107e83e0, wc.opcode=0, vbuf->phead->type=32 = MPIDI_CH3_PKT_RNDV_REQ_TO_SEND<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">[p0538.ten.osc.edu:mpi_rank_471][mv2_print_wc_status_error] IBV_WC_REM_ACCESS_ERR: This event is generated when a protection error occurs on a remote data buffer to be read
by an RDMA read, written by an RDMA Write or accessed by an atomic operation. The error is reported only on RDMA operations or atomic operations. Relevant to: RC or DC QPs.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">[p0538.ten.osc.edu:mpi_rank_471][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:499: [] Got completion with error 10, vendor code=0x88, dest rank=483<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">: No such file or directory (2)<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual"> <o:p></o:p></span></p>
<p class="MsoNormal" style="margin-bottom:12.0pt"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">We executed a job consisting of 1104 tasks, each with 12 threads, across 92 nodes using SLURM's cyclic distribution. The relevant environment variables
were set as follows:<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">OMP_NUM_THREADS=4<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">OMP_STACKSIZE=512M<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">MV2_USE_ALIGNED_ALLOC=1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">MV2_ENABLE_AFFINITY=0<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">MV2_CPU_BINDING_POLICY=hybrid<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">MV2_USE_RDMA_CM=0<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">MV2_HOMOGENEOUS_CLUSTER=1<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">MV2_IBA_HCA=mlx5_0<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual"> <o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">The mvapich2 used was mvapich2/2.3.6 with Intel 2021.3.0:<br>
<br>
$ mpiname -a<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">MVAPICH2 2.3.6 Mon March 29 22:00:00 EST 2021 ch3:mrail<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual"> <o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">Compilation<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">CC: icc -DNDEBUG -DNVALGRIND -O2<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">CXX: icpc -DNDEBUG -DNVALGRIND -O2<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">F77: ifort -O2<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">FC: ifort -O2<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual"> <o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">Configuration<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">--prefix=/opt/mvapich2/intel/2021.3/2.3.6 --enable-shared --with-mpe --enable-romio --enable-mpit-pvars=mv2 --disable-option-checking --with-file-system=ufs+nfs+gpfs --enable-slurm=yes
--with-pmi=pmi2 --with-pm=slurm<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual"> <o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">The system runs RHEL 7.9 and MOFED 5.6.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual"> <o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">The same MPI application runs successfully on 24 nodes with the same environment. Are there any parameters or variables I should tune when running the application at a larger
scale?<br>
<br>
I would appreciate your assistance in resolving this issue. If you need any further information or clarification, please do not hesitate to let me know. Thank you for your attention to this matter.<br>
<br>
Best regards,<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;mso-ligatures:standardcontextual">-ZQ<o:p></o:p></span></p>