[Mvapich-discuss] Report of FATAL event local access violation in a large MPI job

Panda, Dhabaleswar panda at cse.ohio-state.edu
Tue May 9 10:24:34 EDT 2023


Hi ZQ,

MVAPICH2 2.3.6 is an older version. Please update your installation to MVAPICH2 2.3.7 and let us know whether the problem persists; we will be happy to take a look at it.

Thanks,

DK

________________________________________
From: Mvapich-discuss <mvapich-discuss-bounces at lists.osu.edu> on behalf of You, Zhi-Qiang via Mvapich-discuss <mvapich-discuss at lists.osu.edu>
Sent: Tuesday, May 9, 2023 10:06 AM
To: mvapich-discuss at lists.osu.edu
Subject: Re: [Mvapich-discuss] Report of FATAL event local access violation in a large MPI job

Hello,

I am writing to follow up on my earlier report of the FATAL local access violation that occurred during one of our MPI jobs. As this was a critical issue, I want to make sure it is being addressed. Could you please confirm whether you received the report and whether any progress has been made toward resolving it? I would appreciate being kept informed of any updates on this matter.

Thank you,
-ZQ

From: You, Zhi-Qiang <zyou at osc.edu>
Date: Wednesday, April 26, 2023 at 12:41 PM
To: mvapich-discuss at lists.osu.edu <mvapich-discuss at lists.osu.edu>
Subject: Report of FATAL event local access violation in a large MPI job
Hello,

I am writing to report an error that I encountered in a large hybrid MPI job while using MVAPICH2 2.3.6. The error message reads as follows:

[p0555.ten.osc.edu:mpi_rank_483][async_thread] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1340: Got FATAL event local access violation work queue error on QP 0x1621

[p0538.ten.osc.edu:mpi_rank_471][handle_cqe] Send desc error in msg to 483, wc_opcode=0
[p0538.ten.osc.edu:mpi_rank_471][handle_cqe] Msg from 483: wc.status=10 (remote access error), wc.wr_id=0x107e83e0, wc.opcode=0, vbuf->phead->type=32 = MPIDI_CH3_PKT_RNDV_REQ_TO_SEND
[p0538.ten.osc.edu:mpi_rank_471][mv2_print_wc_status_error] IBV_WC_REM_ACCESS_ERR: This event is generated when a protection error occurs on a remote data buffer to be read by an RDMA read, written by an RDMA Write or accessed by an atomic operation. The error is reported only on RDMA operations or atomic operations. Relevant to: RC or DC QPs.
[p0538.ten.osc.edu:mpi_rank_471][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:499: [] Got completion with error 10, vendor code=0x88, dest rank=483
: No such file or directory (2)

We executed a job consisting of 1104 tasks, each with 12 threads, across 92 nodes using SLURM's cyclic distribution. The relevant environment variables were set as follows:
OMP_NUM_THREADS=4
OMP_STACKSIZE=512M
MV2_USE_ALIGNED_ALLOC=1
MV2_ENABLE_AFFINITY=0
MV2_CPU_BINDING_POLICY=hybrid
MV2_USE_RDMA_CM=0
MV2_HOMOGENEOUS_CLUSTER=1
MV2_IBA_HCA=mlx5_0
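
For reference, a rough sketch of how this environment is set in our SLURM batch script is below (the sbatch directives, the srun options, and the application name ./app are simplified placeholders rather than the exact script; the exported variables match the list above, and srun uses PMI2 to match the --with-pmi=pmi2 build):

#!/bin/bash
#SBATCH --nodes=92
#SBATCH --ntasks=1104        # 12 MPI tasks per node
#SBATCH --cpus-per-task=12   # placeholder; cores reserved per task may differ

export OMP_NUM_THREADS=4
export OMP_STACKSIZE=512M
export MV2_USE_ALIGNED_ALLOC=1
export MV2_ENABLE_AFFINITY=0
export MV2_CPU_BINDING_POLICY=hybrid
export MV2_USE_RDMA_CM=0
export MV2_HOMOGENEOUS_CLUSTER=1
export MV2_IBA_HCA=mlx5_0

# cyclic task distribution across the nodes, PMI2 process management
srun --mpi=pmi2 --distribution=cyclic ./app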

The MVAPICH2 installation used was mvapich2/2.3.6, built with Intel 2021.3.0:

$ mpiname -a
MVAPICH2 2.3.6 Mon March 29 22:00:00 EST 2021 ch3:mrail

Compilation
CC: icc    -DNDEBUG -DNVALGRIND -O2
CXX: icpc   -DNDEBUG -DNVALGRIND -O2
F77: ifort   -O2
FC: ifort   -O2

Configuration
--prefix=/opt/mvapich2/intel/2021.3/2.3.6 --enable-shared --with-mpe --enable-romio --enable-mpit-pvars=mv2 --disable-option-checking --with-file-system=ufs+nfs+gpfs --enable-slurm=yes --with-pmi=pmi2 --with-pm=slurm

The system runs RHEL 7.9 and MOFED 5.6.

The same MPI application runs successfully on 24 nodes with the same environment. Are there any parameters or variables I should tune when running the application at a larger scale?

I would appreciate your assistance in resolving this issue. If you need any further information or clarification, please do not hesitate to let me know. Thank you for your attention to this matter.

Best regards,
-ZQ


