[Mvapich-discuss] Report of FATAL event local access violation in a large MPI job

You, Zhi-Qiang zyou at osc.edu
Wed Apr 26 12:41:50 EDT 2023


Hello,

I am writing to report an error that I encountered in a large hybrid MPI job using MVAPICH2 2.3.6. The error message reads as follows:

[p0555.ten.osc.edu:mpi_rank_483][async_thread] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1340: Got FATAL event local access violation work queue error on QP 0x1621

[p0538.ten.osc.edu:mpi_rank_471][handle_cqe] Send desc error in msg to 483, wc_opcode=0
[p0538.ten.osc.edu:mpi_rank_471][handle_cqe] Msg from 483: wc.status=10 (remote access error), wc.wr_id=0x107e83e0, wc.opcode=0, vbuf->phead->type=32 = MPIDI_CH3_PKT_RNDV_REQ_TO_SEND
[p0538.ten.osc.edu:mpi_rank_471][mv2_print_wc_status_error] IBV_WC_REM_ACCESS_ERR: This event is generated when a protection error occurs on a remote data buffer to be read by an RDMA read, written by an RDMA Write or accessed by an atomic operation. The error is reported only on RDMA operations or atomic operations. Relevant to: RC or DC QPs.
[p0538.ten.osc.edu:mpi_rank_471][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:499: [] Got completion with error 10, vendor code=0x88, dest rank=483
: No such file or directory (2)

We executed a job consisting of 1104 MPI tasks, each with 12 threads, across 92 nodes using SLURM's cyclic distribution (a simplified sketch of the launch script follows the variable list below). The relevant environment variables were set as follows:

OMP_NUM_THREADS=4
OMP_STACKSIZE=512M
MV2_USE_ALIGNED_ALLOC=1
MV2_ENABLE_AFFINITY=0
MV2_CPU_BINDING_POLICY=hybrid
MV2_USE_RDMA_CM=0
MV2_HOMOGENEOUS_CLUSTER=1
MV2_IBA_HCA=mlx5_0
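
For reference, the launch looks roughly like the sketch below. The script is simplified: the executable name ./app is a placeholder, and the remaining OMP_*/MV2_* variables from the list above are exported the same way.

#!/bin/bash
#SBATCH --nodes=92
#SBATCH --ntasks=1104

# Export the settings listed above, for example:
export OMP_NUM_THREADS=4
export MV2_ENABLE_AFFINITY=0
# ... remaining OMP_*/MV2_* variables from the list above

# Cyclic task distribution across the allocated nodes
srun --distribution=cyclic ./app   # ./app stands in for the actual executable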

The MVAPICH2 installation used was mvapich2/2.3.6, built with Intel 2021.3.0:

$ mpiname -a
MVAPICH2 2.3.6 Mon March 29 22:00:00 EST 2021 ch3:mrail

Compilation
CC: icc    -DNDEBUG -DNVALGRIND -O2
CXX: icpc   -DNDEBUG -DNVALGRIND -O2
F77: ifort   -O2
FC: ifort   -O2

Configuration
--prefix=/opt/mvapich2/intel/2021.3/2.3.6 --enable-shared --with-mpe --enable-romio --enable-mpit-pvars=mv2 --disable-option-checking --with-file-system=ufs+nfs+gpfs --enable-slurm=yes --with-pmi=pmi2 --with-pm=slurm

The system runs RHEL 7.9 and MOFED 5.6.

The same MPI application runs successfully on 24 nodes with the same environment. Are there any parameters or variables I should tune when running the application at a larger scale?

I would appreciate your assistance in resolving this issue. If you need any further information or clarification, please do not hesitate to let me know. Thank you for your attention to this matter.

Best regards,
-ZQ