[Mvapich-discuss] [MVAPICH2-2.3.7] Deadlock Issue with MV2_USE_BLOCKING in MVAPICH2-2.3.7

Panda, Dhabaleswar panda at cse.ohio-state.edu
Tue Sep 9 05:32:08 EDT 2025


Hi Purum,

Thanks for reporting this issue with the testing methodology and the patch. We will test it out.

Please note that MVAPICH2 version 2.3.7 is getting old. The latest release is the 4.x series. Please move to the latest version.

Thanks,

DK

________________________________________
From: Mvapich-discuss <mvapich-discuss-bounces at lists.osu.edu> on behalf of 서푸름 via Mvapich-discuss <mvapich-discuss at lists.osu.edu>
Sent: Tuesday, September 9, 2025 2:42 AM
To: Announcement about MVAPICH (MPI over InfiniBand, RoCE, Omni-Path, Slingshot, iWARP and EFA) Libraries developed at NBCL/OSU
Cc: 진현욱(Hyun-Wook Jin); 이종빈
Subject: [Mvapich-discuss] [MVAPICH2-2.3.7] Deadlock Issue with MV2_USE_BLOCKING in MVAPICH2-2.3.7

Dear MVAPICH Team,
Hello, I would like to report a deadlock issue related to MV2_USE_BLOCKING in MVAPICH2 version 2.3.7.
To help reproduce the issue, I have detailed the environment, test method, and the suspected cause and solution below.

[Environment]
Homogeneous 2-node setup
Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
CPU : AMD Ryzen Threadripper 2950X (16-Core Processor)
OS : Kernel 5.15.104, Ubuntu 20.04
MPI : MVAPICH2-2.3.7 (latest release)

[Test Method]
osu-micro-benchmarks-7.5, MPI_Igather() non-blocking benchmark
32 processes (16 processes on each node)
Increasing the iteration count easily reproduces the deadlock

[Reason & Solution]
Suspected Issue: Re-arming of the completion channel is not handled correctly

[Source Code]
Relevant Source File : mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c
Function : static inline int perform_blocking_progress_for_ib(int hca_num, int num_cqs)
Suggested Fix: ibv_req_notify_cq() should be called after acknowledging the completion events
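
To illustrate why the re-arm step matters, here is a minimal toy model in C of a completion queue with one-shot notification. It is only a sketch: it does not use libibverbs, it is not the MVAPICH2 code, and it is not the proposed patch. All names (toy_cq, toy_complete, toy_req_notify, toy_get_event, toy_ack_events, toy_poll, toy_progress, second_completion_seen) are hypothetical stand-ins, and the one-shot behavior is an assumption of the model. The progress pass follows the order suggested above: get the event, acknowledge it, re-arm, then drain.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a completion queue with one-shot notification.
 * Hypothetical model only; not the libibverbs data structures. */
struct toy_cq {
    int  pending;  /* completions waiting to be polled */
    bool armed;    /* set by toy_req_notify(), cleared when an event fires */
    int  events;   /* events queued on the toy completion channel */
    int  unacked;  /* events received but not yet acknowledged */
};

/* "Hardware" side: add a completion; raise an event only if armed. */
static void toy_complete(struct toy_cq *cq)
{
    cq->pending++;
    if (cq->armed) {
        cq->armed = false;  /* one-shot: the notification disarms itself */
        cq->events++;
    }
}

/* Request a notification for the next completion. */
static void toy_req_notify(struct toy_cq *cq)
{
    cq->armed = true;
}

/* Non-blocking stand-in for the blocking wait: returns false at the
 * point where a real blocking wait would go to sleep. */
static bool toy_get_event(struct toy_cq *cq)
{
    if (cq->events == 0)
        return false;
    cq->events--;
    cq->unacked++;
    return true;
}

/* Acknowledge n previously received events. */
static void toy_ack_events(struct toy_cq *cq, int n)
{
    assert(n <= cq->unacked);
    cq->unacked -= n;
}

/* Remove one completion if available; returns 1 if one was taken, else 0. */
static int toy_poll(struct toy_cq *cq)
{
    if (cq->pending == 0)
        return 0;
    cq->pending--;
    return 1;
}

/* One progress pass: get event, acknowledge, optionally re-arm, drain.
 * Returns the number of completions handled, or -1 if the wait would block. */
static int toy_progress(struct toy_cq *cq, bool rearm)
{
    if (!toy_get_event(cq))
        return -1;
    toy_ack_events(cq, 1);
    if (rearm)
        toy_req_notify(cq);
    int handled = 0;
    while (toy_poll(cq))
        handled++;
    return handled;
}

/* Returns true if a completion that arrives after the first progress
 * pass is still picked up by a second pass. */
static bool second_completion_seen(bool rearm)
{
    struct toy_cq cq = {0};
    toy_req_notify(&cq);        /* initial arm */
    toy_complete(&cq);          /* first completion raises an event */
    toy_progress(&cq, rearm);   /* handle it */
    toy_complete(&cq);          /* second completion arrives later */
    return toy_progress(&cq, rearm) == 1;
}
```

In this model, second_completion_seen(true) returns true, while second_completion_seen(false) returns false: without a re-arm, the second completion raises no event, which is the point where a process in blocking mode would sleep indefinitely.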

You can view a proposed patch here:
https://www.diffchecker.com/P4kKplpZ/

Thank you for your support.
Best regards,
purum.

