From purum5548 at konkuk.ac.kr Tue Sep 9 02:42:02 2025
From: purum5548 at konkuk.ac.kr (purum)
Date: Tue, 9 Sep 2025 06:42:02 +0000
Subject: [Mvapich-discuss] [MVAPICH2-2.3.7] Deadlock Issue with MV2_USE_BLOCKING in MVAPICH2-2.3.7
Message-ID:

Dear MVAPICH Team,

Hello, I would like to report a deadlock issue related to MV2_USE_BLOCKING in MVAPICH2 version 2.3.7. To help you reproduce the issue, I have detailed the environment, test method, and the suspected cause and solution below.

[Environment]
Homogeneous 2-node setup
Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
CPU: AMD Ryzen Threadripper 2950X (16-Core Processor)
OS: Kernel 5.15.104, Ubuntu 20.04
MPI: MVAPICH2-2.3.7 (latest release)

[Test Method]
osu-micro-benchmarks-7.5, MPI_Igather() non-blocking benchmark
32 processes (16 processes on each node)
Increasing the iteration count easily reproduces the deadlock

[Reason & Solution]
Suspected Issue: re-arming of the completion channel is not handled correctly.

[Source Code]
Relevant Source File: mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c
Function: static inline int perform_blocking_progress_for_ib(int hca_num, int num_cqs)

Suggested Fix: ibv_req_notify_cq() should be called after acknowledging the completion events. You can view a proposed patch here:
https://www.diffchecker.com/P4kKplpZ/

Thank you for your support.

Best regards,
purum.
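For reference, the ordering purum suggests matches the usual blocking-progress pattern documented for ibv_get_cq_event(3): consume the event, acknowledge it, re-arm the CQ, and only then drain it. The sketch below illustrates that generic sequence only; wait_and_rearm, channel, and the error handling are placeholders, not the actual ibv_channel_manager.c code or the proposed patch.

    #include <infiniband/verbs.h>

    /* Block until a completion event arrives, acknowledge it, and re-arm
     * the CQ before the caller drains it, so a completion that lands
     * after the final poll still generates a new event. */
    static int wait_and_rearm(struct ibv_comp_channel *channel)
    {
        struct ibv_cq *ev_cq = NULL;
        void *ev_ctx = NULL;

        if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx))    /* blocks */
            return -1;

        /* Acknowledge the event we just consumed ... */
        ibv_ack_cq_events(ev_cq, 1);

        /* ... and only then request notification for the next one. */
        if (ibv_req_notify_cq(ev_cq, 0))
            return -1;

        /* The caller now polls ev_cq with ibv_poll_cq() until it is empty. */
        return 0;
    }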
From panda at cse.ohio-state.edu Tue Sep 9 05:32:08 2025
From: panda at cse.ohio-state.edu (Panda, Dhabaleswar)
Date: Tue, 9 Sep 2025 09:32:08 +0000
Subject: [Mvapich-discuss] [MVAPICH2-2.3.7] Deadlock Issue with MV2_USE_BLOCKING in MVAPICH2-2.3.7
In-Reply-To:
References:
Message-ID:

Hi Purum,

Thanks for reporting this issue with the testing methodology and the patch. We will test it out.

Please note that MVAPICH2 2.3.7 version is getting old. The latest is the 4.x series. Please start using the latest versions.

Thanks,

DK

From chuck at ece.cmu.edu Tue Sep 9 10:24:17 2025
From: chuck at ece.cmu.edu (Chuck Cranor)
Date: Tue, 9 Sep 2025 10:24:17 -0400
Subject: [Mvapich-discuss] [MVAPICH2-2.3.7] Deadlock Issue with MV2_USE_BLOCKING in MVAPICH2-2.3.7
In-Reply-To:
References:
Message-ID:

On Tue, Sep 09, 2025 at 09:32:08AM +0000, Panda, Dhabaleswar via Mvapich-discuss wrote:
> Please note that MVAPICH2 2.3.7 version is getting old. The latest is
> the 4.x series. Please start using the latest versions.

the 4.x series does not support acceleration with our legacy hardware
(we've got a ~500 node cluster with Intel/Qlogic TrueScale gear that
uses the "ib_qib" linux kernel driver and the "psm" library in userland).

2.3.7 still works with this setup, modulo some issues. e.g. bad usage
of snprintf() in src/mpid/ch3/channels/common/src/affinity/hwloc_bind.c
can cause a "*** buffer overflow detected ***" crash at startup. it does this:

    char mapping[_POSIX2_LINE_MAX];
    // ...
    j += snprintf (mapping+j, _POSIX2_LINE_MAX, ":");

if j > 0 the second arg to snprintf shouldn't be _POSIX2_LINE_MAX.
i fixed this by adding a wrapper function over snprintf() that
catches the overflow. it seems like this error is triggered with
newer tool chains (old code works fine on ubuntu22, crashes on ubuntu24).
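Chuck's actual wrapper is in his tree (linked in his follow-up below); generically, the idea of passing snprintf only the space that is left, rather than _POSIX2_LINE_MAX every time, might look roughly like the sketch below. safe_snprintf and its signature are illustrative, not the symbol his patch uses.

    #include <stdarg.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Append formatted text at offset 'used' into 'buf' of total size
     * 'bufsz', passing snprintf only the space that is actually left so
     * _FORTIFY_SOURCE checks are not tripped once used > 0. Returns the
     * number of characters appended (clamped on truncation). */
    static int safe_snprintf(char *buf, size_t bufsz, size_t used,
                             const char *fmt, ...)
    {
        va_list ap;
        int n;

        if (used >= bufsz)
            return 0;                      /* no space left, drop output */

        va_start(ap, fmt);
        n = vsnprintf(buf + used, bufsz - used, fmt, ap);
        va_end(ap);

        if (n < 0)
            return 0;
        if ((size_t) n > bufsz - used - 1)
            n = (int) (bufsz - used - 1);  /* output was truncated */
        return n;
    }

    /* usage in the pattern quoted above:
     *     j += safe_snprintf(mapping, sizeof(mapping), j, ":");
     */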
also, if you configure mvapich2 "--with-pmi=pmix --with-pm=slurm"
you may end up crashing due to PMIx_Init() failing. this is due
to being linked to multiple incompatible versions of hwloc at the
same time (i.e. mvapich2 builds its own internal hwloc -- default
is "--with-hwloc=v1" -- and then it links to libpmix.so which is linked
to the system installed hwloc [a v2 hwloc]). i found that PMIx_Init()'s
call to hwloc_topology_init() was going to the v1 version compiled with
mvapich and then its call to hwloc_topology_set_io_types_filter() was
going to the v2 version in the system installed shared libhwloc.so lib.
i fixed this by adding a new "--with-hwloc=v2ext" config option to the
mvapich2 build to tell it to use the system libhwloc.so and not build
any hwloc stuff from contrib.

chuck

From panda at cse.ohio-state.edu Wed Sep 10 05:58:39 2025
From: panda at cse.ohio-state.edu (Panda, Dhabaleswar)
Date: Wed, 10 Sep 2025 09:58:39 +0000
Subject: [Mvapich-discuss] [MVAPICH2-2.3.7] Deadlock Issue with MV2_USE_BLOCKING in MVAPICH2-2.3.7
In-Reply-To:
References:
Message-ID:

Hi Chuck,

Thanks for sharing this issue with us. We will look at this issue and get back to you soon.

Thanks,

DK

From chuck at ece.cmu.edu Wed Sep 10 09:04:19 2025
From: chuck at ece.cmu.edu (Chuck Cranor)
Date: Wed, 10 Sep 2025 09:04:19 -0400
Subject: [Mvapich-discuss] [MVAPICH2-2.3.7] Deadlock Issue with MV2_USE_BLOCKING in MVAPICH2-2.3.7
In-Reply-To:
References:
Message-ID:

ok, thanks. if you want to look at the way i addressed these
issues, i put my local mvapich2 diffs on github:

https://github.com/pdlfs/mvapich2

chuck

On Wed, Sep 10, 2025 at 09:58:39AM +0000, Panda, Dhabaleswar wrote:
> Hi Chuck,
>
> Thanks for sharing this issue with us. We will look at this issue and get back to you soon.
>
> Thanks,
>
> DK

From mgs.rus.52 at gmail.com Wed Sep 10 14:15:19 2025
From: mgs.rus.52 at gmail.com (Alex)
Date: Wed, 10 Sep 2025 14:15:19 -0400
Subject: [Mvapich-discuss] [MVAPICH2-2.3.7] Deadlock Issue with MV2_USE_BLOCKING in MVAPICH2-2.3.7
In-Reply-To:
References:
Message-ID:

Can you use mvapich with ofi and set ofi up to use psm2?

From chuck at ece.cmu.edu Wed Sep 10 15:00:37 2025
From: chuck at ece.cmu.edu (Chuck Cranor)
Date: Wed, 10 Sep 2025 15:00:37 -0400
Subject: [Mvapich-discuss] [MVAPICH2-2.3.7] Deadlock Issue with MV2_USE_BLOCKING in MVAPICH2-2.3.7
In-Reply-To:
References:
Message-ID:

On Wed, Sep 10, 2025 at 02:15:19PM -0400, Alex wrote:
> Can you use mvapich with ofi and set ofi up to use psm2?

I don't think so. The psm2 library only supports OmniPath devices
(linux kernel driver "hfi1") and does not work with the older devices
we have (e.g. devices that use the linux "qib" driver).
there was, at some point, a plain "psm" provider in libfabric, but I
was advised not to use it (due to unresolved concurrency issues) and
it has since been removed from libfabric.

chuck

From panda at cse.ohio-state.edu Fri Sep 19 11:52:52 2025
From: panda at cse.ohio-state.edu (Panda, Dhabaleswar)
Date: Fri, 19 Sep 2025 15:52:52 +0000
Subject: [Mvapich-discuss] Announcing the release of MVAPICH 4.1 GA
In-Reply-To:
References:
Message-ID:

The MVAPICH team is pleased to announce the release of MVAPICH 4.1 GA.

* Overall Features and Enhancements (since 4.0)
    - Based on MPICH 4.3.0
    - Improved rndv protocol performance in point-to-point operations
    - Improved MPIT PVAR support
    - Updated embedded OMB to v7.5.1

* Bug Fixes (since 4.0)
    - Fixed exponential slowdown in shmem startup time
        - Thanks to Alex for the report
    - Fixed bug where the OFI shm provider would be selected in unsupported cases

For downloading the MVAPICH 4.1 GA library and the associated user guide, please visit the following URL:

http://mvapich.cse.ohio-state.edu

All questions, feedback, bug reports, hints for performance tuning, patches, and enhancements are welcome. Please post them to the mvapich-discuss mailing list (mvapich-discuss at lists.osu.edu).

Thanks,

The MVAPICH Team

PS: We are also happy to announce that the number of organizations using MVAPICH libraries (and registered at the MVAPICH site) has crossed 3,450 worldwide (in 92 countries). The number of downloads from the MVAPICH site has crossed 1,950,000 (1.95 million). The MVAPICH team would like to thank all its users and organizations!!

From nmorey at suse.com Wed Sep 24 12:03:05 2025
From: nmorey at suse.com (Nicolas Morey)
Date: Wed, 24 Sep 2025 18:03:05 +0200
Subject: [Mvapich-discuss] [PATCH 4.0|4.1] romio: test: fix bad snprintf arguments
Message-ID:

Even though there cannot be a buffer overflow, as the string is properly
sized, noncontig_coll2 fails when built with -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3:

FAIL: noncontig_coll2
=====================

Thread 1 "noncontig_coll2" received signal SIGABRT, Aborted.
0x00007ffff709c5fc in __pthread_kill_implementation () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff709c5fc in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007ffff7042106 in raise () from /lib64/libc.so.6
#2  0x00007ffff702938b in abort () from /lib64/libc.so.6
#3  0x00007ffff702a3ab in __libc_message_impl.cold () from /lib64/libc.so.6
#4  0x00007ffff712b4fb in __fortify_fail () from /lib64/libc.so.6
#5  0x00007ffff712adc6 in __chk_fail () from /lib64/libc.so.6
#6  0x00007ffff712c8f5 in __snprintf_chk () from /lib64/libc.so.6
#7  0x000000000040275e in snprintf (__s=0x4aafee "", __n=<optimized out>, __fmt=0x404077 "%s,") at /usr/include/bits/stdio2.h:68
#8  default_str (mynod=<optimized out>, len=61, array=0x59fca0, dest=0x4aafd0 "hostname,") at src/mpi/romio/test/noncontig_coll2.c:189
#9  main (argc=<optimized out>, argv=<optimized out>) at src/mpi/romio/test/noncontig_coll2.c:330

This is due to the len parameter of snprintf not being updated as we
advance in the string.
Fix this issue by introducing a remaining len variable (rlen) that
contains the exact number of bytes left.
Signed-off-by: Nicolas Morey
---
 src/mpi/romio/test/noncontig_coll2.c | 32 ++++++++++++++++++----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/src/mpi/romio/test/noncontig_coll2.c b/src/mpi/romio/test/noncontig_coll2.c
index 2b37d4749fc9..beade70c2388 100644
--- a/src/mpi/romio/test/noncontig_coll2.c
+++ b/src/mpi/romio/test/noncontig_coll2.c
@@ -181,12 +181,14 @@ int cb_gather_name_array(MPI_Comm comm, ADIO_cb_name_array * arrayp)
 void default_str(int mynod, int len, ADIO_cb_name_array array, char *dest)
 {
     char *ptr;
-    int i, p;
+    int i, p, rlen;

     if (!mynod) {
         ptr = dest;
+        rlen = len;
         for (i = 0; i < array->namect; i++) {
-            p = snprintf(ptr, len, "%s,", array->names[i]);
+            p = snprintf(ptr, rlen, "%s,", array->names[i]);
             ptr += p;
+            rlen = rlen - p;
         }
         /* chop off that last comma */
         dest[strlen(dest) - 1] = '\0';
@@ -197,12 +199,14 @@ void default_str(int mynod, int len, ADIO_cb_name_array array, char *dest)
 void reverse_str(int mynod, int len, ADIO_cb_name_array array, char *dest)
 {
     char *ptr;
-    int i, p;
+    int i, p, rlen;

     if (!mynod) {
         ptr = dest;
-        for (i = (array->namect - 1); i >= 0; i--) {
-            p = snprintf(ptr, len, "%s,", array->names[i]);
+        rlen = len;
+        for (i = (array->namect - 1); i >= 0; i--) {
+            p = snprintf(ptr, rlen, "%s,", array->names[i]);
             ptr += p;
+            rlen = rlen - p;
         }
         dest[strlen(dest) - 1] = '\0';
     }
@@ -212,18 +216,21 @@ void reverse_str(int mynod, int len, ADIO_cb_name_array array, char *dest)
 void reverse_alternating_str(int mynod, int len, ADIO_cb_name_array array, char *dest)
 {
     char *ptr;
-    int i, p;
+    int i, p, rlen;

     if (!mynod) {
         ptr = dest;
+        rlen = len;
         /* evens */
         for (i = (array->namect - 1); i >= 0; i -= 2) {
-            p = snprintf(ptr, len, "%s,", array->names[i]);
+            p = snprintf(ptr, rlen, "%s,", array->names[i]);
             ptr += p;
+            rlen = rlen - p;
         }
         /* odds */
         for (i = (array->namect - 2); i > 0; i -= 2) {
-            p = snprintf(ptr, len, "%s,", array->names[i]);
+            p = snprintf(ptr, rlen, "%s,", array->names[i]);
             ptr += p;
+            rlen = rlen - p;
         }
         dest[strlen(dest) - 1] = '\0';
     }
@@ -233,16 +240,19 @@ void reverse_alternating_str(int mynod, int len, ADIO_cb_name_array array, char
 void simple_shuffle_str(int mynod, int len, ADIO_cb_name_array array, char *dest)
 {
     char *ptr;
-    int i, p;
+    int i, p, rlen;

     if (!mynod) {
         ptr = dest;
+        rlen = len;
         for (i = (array->namect / 2); i < array->namect; i++) {
-            p = snprintf(ptr, len, "%s,", array->names[i]);
+            p = snprintf(ptr, rlen, "%s,", array->names[i]);
             ptr += p;
+            rlen = rlen - p;
         }
         for (i = 0; i < (array->namect / 2); i++) {
-            p = snprintf(ptr, len, "%s,", array->names[i]);
+            p = snprintf(ptr, rlen, "%s,", array->names[i]);
             ptr += p;
+            rlen = rlen - p;
         }
         dest[strlen(dest) - 1] = '\0';
     }
--
2.50.1.1.g5ceaece06de7

From nmorey at suse.com Thu Sep 25 03:44:00 2025
From: nmorey at suse.com (Nicolas Morey)
Date: Thu, 25 Sep 2025 09:44:00 +0200
Subject: [Mvapich-discuss] Self test issues with libfabric with mvapich 4.[01]
Message-ID:

Hi,

I'm the RDMA maintainer for OpenSUSE and I'm currently working on packaging MVAPICH4.
I've tried both 4.0 and 4.1, and for both I have some issues when built
with OFI/libfabric (no issue with UCX, apart from some failures on s390x,
but I haven't checked that out yet).

The staging project is here:
https://build.opensuse.org/package/show/home:NMorey:branches:science:HPC/mvapich4

We are building with libfabric 2.3.0. It seems to be working alright when
building against older SLE, but breaks against SLE16 or Tumbleweed. So my
best guess is that GCC (7 in older SLE, 13 for SLE16, 15 for Factory) has
some impact on that.

Symptoms when building the testsuite are (passing tests stripped out):

[  936s] make check-TESTS
[  936s] make[4]: Entering directory '/home/abuild/rpmbuild/BUILD/mvapich4-ofi-4.1-build/mvapich-4.1/src/mpi/romio/test'
[  936s] make[5]: Entering directory '/home/abuild/rpmbuild/BUILD/mvapich4-ofi-4.1-build/mvapich-4.1/src/mpi/romio/test'
[  937s] FAIL: coll_test
[  941s] FAIL: noncontig_coll
[  941s] FAIL: split_coll
[  944s] FAIL: noncontig_coll2
[  944s] FAIL: aggregation1
[  948s] FAIL: fcoll_test
[  950s] FAIL: pfcoll_test
[  950s] ============================================================================
[  950s] Testsuite summary for ROMIO 4.3.0
[  950s] ============================================================================
[  950s] # TOTAL: 34
[  950s] # PASS:  27
[  950s] # SKIP:  0
[  950s] # XFAIL: 0
[  950s] # FAIL:  7
[  950s] # XPASS: 0
[  950s] # ERROR: 0

With test-suite.log:

FAIL: noncontig_coll
====================

Abort(17) on node 0: Fatal error in internal_Waitall: See the MPI_ERROR field in MPI_Status for the error code
FAIL noncontig_coll (exit status: 17)

Trying to run manually:

abuild@portia:~/rpmbuild/BUILD/mvapich4-ofi-4.1-build/mvapich-4.1/src/mpi/romio/test> mpirun -np 2 ./noncontig_coll -fname foo
Abort(17) on node 0: Fatal error in internal_Waitall: See the MPI_ERROR field in MPI_Status for the error code

(gdb) bt
#0  0x00007ffff6a4548e in exit () from /lib64/libc.so.6
#1  0x00007ffff7278dca in MPL_exit (exit_code=<optimized out>) at src/mpl/src/msg/mpl_msg.c:89
#2  MPID_Abort (comm=0x0, mpi_errno=<optimized out>, exit_code=<optimized out>, error_msg=<optimized out>) at src/mpid/ch4/src/ch4_globals.c:124
#3  0x00007ffff71e7c36 in MPIR_Handle_fatal_error (comm_ptr=comm_ptr@entry=0x0, fcname=fcname@entry=0x7ffff7904740 <__func__.0.lto_priv.312> "internal_Waitall", errcode=errcode@entry=17) at src/mpi/errhan/errutil.c:604
#4  0x00007ffff71e8017 in MPIR_Err_return_comm (comm_ptr=0x0, fcname=0x7ffff7904740 <__func__.0.lto_priv.312> "internal_Waitall", errcode=<optimized out>) at src/mpi/errhan/errutil.c:300
#5  0x00007ffff6ffc47e in internal_Waitall (count=3, array_of_requests=0x55555b226480, array_of_statuses=0x1) at src/binding/c/request/waitall.c:129
#6  0x00007ffff787358b in ADIOI_Calc_others_req (fd=fd@entry=0x5555555a0860, count_my_req_procs=1, count_my_req_per_proc=count_my_req_per_proc@entry=0x55555b226380, my_req=my_req@entry=0x55555b2263c0, nprocs=<optimized out>, myrank=0, count_others_req_procs_ptr=0x7fffffffdb00, count_others_req_per_proc_ptr=0x7fffffffdb08, others_req_ptr=0x7fffffffdb28) at src/mpi/romio/adio/common/ad_aggregate.c:515
#7  0x00007ffff7887d03 in ADIOI_GEN_WriteStridedColl (fd=<optimized out>, buf=0x55555b21e410, count=1, datatype=-1946157049, file_ptr_type=<optimized out>, offset=<optimized out>, status=0x7fffffffdcb0, error_code=0x7fffffffdbd4) at src/mpi/romio/adio/common/ad_write_coll.c:169
#8  0x00007ffff78665cc in MPIOI_File_write_all (fh=0x5555555a0860, offset=offset@entry=0, file_ptr_type=file_ptr_type@entry=101,
    buf=buf@entry=0x55555b21e410, count=count@entry=1, datatype=-1946157049, myname=0x7ffff7a7a3a0 "MPI_FILE_WRITE_ALL", status=0x7fffffffdcb0) at src/mpi/romio/mpi-io/write_all.c:172
#9  0x00007ffff7866723 in PMPI_File_write_all (fh=<optimized out>, buf=buf@entry=0x55555b21e410, count=count@entry=1, datatype=<optimized out>, status=status@entry=0x7fffffffdcb0) at src/mpi/romio/mpi-io/write_all.c:69
#10 0x0000555555555534 in main (argc=<optimized out>, argv=<optimized out>) at /home/abuild/rpmbuild/BUILD/mvapich4-ofi-4.1-build/mvapich-4.1/src/mpi/romio/test/noncontig_coll.c:110

Looking a little bit further, when MPIR_Waitall checks the statuses it does
see an rc = 4237839 popping up. I tried a while back to track down who had
set this value but didn't achieve much.

Let me know if I can run some more tests / analysis for you, as I don't
really know where to look from here.

Nicolas
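One generic way to get more detail out of a status code like that is to feed the MPI_ERROR fields of the failing statuses through MPI_Error_class()/MPI_Error_string(), with MPI_ERRORS_RETURN set on the communicator so the wait returns instead of aborting. The sketch below only illustrates that idea; report_waitall_errors is a made-up helper, not part of the ROMIO tests or of MVAPICH.

    #include <stdio.h>
    #include <mpi.h>

    /* After a failed waitall (with MPI_ERRORS_RETURN on the communicator),
     * print the error class and message for every request whose status
     * carries an error, instead of only seeing the raw integer code. */
    static void report_waitall_errors(int count, const MPI_Status *statuses)
    {
        char msg[MPI_MAX_ERROR_STRING];
        int i, len, eclass;

        for (i = 0; i < count; i++) {
            int rc = statuses[i].MPI_ERROR;
            if (rc == MPI_SUCCESS)
                continue;
            MPI_Error_class(rc, &eclass);
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "request %d: rc=%d class=%d: %s\n", i, rc, eclass, msg);
        }
    }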
From panda at cse.ohio-state.edu Mon Sep 29 09:30:48 2025
From: panda at cse.ohio-state.edu (Panda, Dhabaleswar)
Date: Mon, 29 Sep 2025 13:30:48 +0000
Subject: [Mvapich-discuss] Spack and Docker versions of MVAPICH 4.1GA release are available
Message-ID:

This is a short announcement to indicate that MVAPICH 4.1GA is now available through Spack. A Docker version of the MVAPICH 4.1GA release is also available.

Please visit the following URL for additional information:

http://mvapich.cse.ohio-state.edu

All questions, feedback, bug reports, hints for performance tuning, patches, and enhancements are welcome. Please post them to the mvapich-discuss mailing list (mvapich-discuss at lists.osu.edu).

Thanks,

The MVAPICH Team

PS: We are also happy to announce that the number of organizations using the MVAPICH libraries (and registered at the MVAPICH site) has crossed 3,450 worldwide (in 92 countries). The number of downloads from the MVAPICH site has crossed 1,955,000 (1.955 million). The MVAPICH team would like to thank all its users and organizations!!

From aruhela at tacc.utexas.edu Mon Sep 29 09:41:16 2025
From: aruhela at tacc.utexas.edu (Amit Ruhela)
Date: Mon, 29 Sep 2025 13:41:16 +0000
Subject: [Mvapich-discuss] osu-micro-benchmarks-7.5-1 error with NVIDIA
Message-ID:

Hi Dr. Panda,

I'm encountering the following errors while compiling OMB 7.5.1 with OpenMPI 5.0.8 and CUDA 13.0 on the Vista machine. I don't think this issue is machine-specific, but it might be related to the latest CUDA version. Do you have any suggestions on how to resolve these errors?

CC=mpicc CXX=mpicxx ./configure --prefix=$PWD/build --enable-cuda --enable-ncclomb --with-nccl=$TACC_NCCL_DIR --with-cuda=$TACC_CUDA_DIR --enable-mpi4=0

In file included from ../../../util/osu_util_mpi.h:12,
                 from ../../../util/osu_util_mpi.c:11:
../../../util/osu_util_mpi.c: In function 'prefetch_data':
../../../util/osu_util_mpi.c:3386:50: error: incompatible type for argument 3 of 'cudaMemPrefetchAsync'
 3386 |     CUDA_CHECK(cudaMemPrefetchAsync(buf, length, devid, um_stream));
      |                                                  ^~~~~
      |                                                  |
      |                                                  int
../../../util/osu_util.h:136:22: note: in definition of macro 'CUDA_CHECK'
  136 |         int errno = (stmt);                                           \
      |                      ^~~~
In file included from /home1/apps/nvidia/Linux_aarch64/25.9/cuda/13.0/include/cuda_runtime.h:95,
                 from ../../../util/osu_util.h:44:
/home1/apps/nvidia/Linux_aarch64/25.9/cuda/13.0/include/cuda_runtime_api.h:7152:117: note: expected 'struct cudaMemLocation' but argument is of type 'int'
 7152 | extern __host__ cudaError_t CUDARTAPI cudaMemPrefetchAsync(const void *devPtr, size_t count, struct cudaMemLocation location, unsigned int flags, cudaStream_t stream __dv(0));
      |                                                                                ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~
../../../util/osu_util_mpi.c:3386:57: warning: passing argument 4 of 'cudaMemPrefetchAsync' makes integer from pointer without a cast [-Wint-conversion]
 3386 |     CUDA_CHECK(cudaMemPrefetchAsync(buf, length, devid, um_stream));
      |                                                         ^~~~~~~~~
      |                                                         |
      |                                                         cudaStream_t {aka struct CUstream_st *}
../../../util/osu_util.h:136:22: note: in definition of macro 'CUDA_CHECK'
  136 |         int errno = (stmt);                                           \
      |                      ^~~~
/home1/apps/nvidia/Linux_aarch64/25.9/cuda/13.0/include/cuda_runtime_api.h:7152:140: note: expected 'unsigned int' but argument is of type 'cudaStream_t' {aka 'struct CUstream_st *'}
 7152 | extern __host__ cudaError_t CUDARTAPI cudaMemPrefetchAsync(const void *devPtr, size_t count, struct cudaMemLocation location, unsigned int flags, cudaStream_t stream __dv(0));
      |                                                                                                                     ~~~~~~~~~~~~~^~~~~
../../../util/osu_util_mpi.c:3386:16: error: too few arguments to function 'cudaMemPrefetchAsync'
 3386 |     CUDA_CHECK(cudaMemPrefetchAsync(buf, length, devid, um_stream));
      |                ^~~~~~~~~~~~~~~~~~~~
../../../util/osu_util.h:136:22: note: in definition of macro 'CUDA_CHECK'
  136 |         int errno = (stmt);                                           \
      |                      ^~~~
/home1/apps/nvidia/Linux_aarch64/25.9/cuda/13.0/include/cuda_runtime_api.h:7152:39: note: declared here
 7152 | extern __host__ cudaError_t CUDARTAPI cudaMemPrefetchAsync(const void *devPtr, size_t count, struct cudaMemLocation location, unsigned int flags, cudaStream_t stream __dv(0));
      |                                       ^~~~~~~~~~~~~~~~~~~~
make[4]: *** [Makefile:621: ../../../util/osu_util_mpi.o] Error 1

Thanks,

Amit Ruhela, Ph.D.
HPC Software Tools Group
Texas Advanced Computing Center
The University of Texas at Austin
Email: aruhela at tacc.utexas.edu
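CUDA 13.0 changed cudaMemPrefetchAsync() to take a cudaMemLocation plus a flags argument instead of a bare device id, which is what the notes from cuda_runtime_api.h above show. A version-guarded call could look roughly like the sketch below; prefetch_to_device is an illustrative name, and the CUDART_VERSION cutoff and the zero flags value are assumptions based on that header, not an official OMB fix.

    #include <cuda_runtime.h>

    /* Prefetch 'length' bytes of managed memory to device 'devid' on
     * 'um_stream', handling the CUDA 13 signature change where the
     * target is described by a cudaMemLocation instead of an int. */
    static cudaError_t prefetch_to_device(void *buf, size_t length, int devid,
                                          cudaStream_t um_stream)
    {
    #if CUDART_VERSION >= 13000
        struct cudaMemLocation loc;
        loc.type = cudaMemLocationTypeDevice;   /* prefetch target is a GPU */
        loc.id = devid;
        return cudaMemPrefetchAsync(buf, length, loc, 0 /* flags */, um_stream);
    #else
        return cudaMemPrefetchAsync(buf, length, devid, um_stream);
    #endif
    }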
From zyou at osc.edu Mon Sep 29 10:44:21 2025
From: zyou at osc.edu (You, Zhi-Qiang)
Date: Mon, 29 Sep 2025 14:44:21 +0000
Subject: [Mvapich-discuss] PMI2 + Slurm Support in MVAPICH Spack?
Message-ID:

Hi,

I am trying to enable the PMI2 variant in the MVAPICH Spack installation. However, this variant seems to conflict with Slurm support, as indicated in the package configuration:
https://github.com/spack/spack-packages/blob/develop/repos/spack_repo/builtin/packages/mvapich/package.py#L122

I have checked the MVAPICH documentation, and PMI2 appears to be supported:
https://mvapich-docs.readthedocs.io/en/mvapich-plus/cvar.html

Could someone confirm whether enabling this variant alongside Slurm is not feasible, or if there is a workaround?

Thank you for your help,

ZQ