[Mvapich-discuss] Self test issues with libfabric with mvapich 4.[01]

Nicolas Morey nmorey at suse.com
Thu Sep 25 03:44:00 EDT 2025



Hi,

I'm the RDMA maintainer for OpenSUSE and I'm currently working on packaging MVAPICH4.
I've tried both 4.0 and 4.1, and with both I have some issues when building with OFI/libfabric (no issues with UCX apart from some failures on s390x, but I haven't looked into those yet).

The staging project is here: https://build.opensuse.org/package/show/home:NMorey:branches:science:HPC/mvapich4
We are building with libfabric 2.3.0. 
It seems to work fine when building against older SLE, but breaks against SLE16 or Tumbleweed, so my best guess is that the GCC version (7 on older SLE, 13 on SLE16, 15 on Factory) plays a role here.

Symptoms when building the test suite are as follows (passing tests stripped out):
[  936s] make  check-TESTS
[  936s] make[4]: Entering directory '/home/abuild/rpmbuild/BUILD/mvapich4-ofi-4.1-build/mvapich-4.1/src/mpi/romio/test'
[  936s] make[5]: Entering directory '/home/abuild/rpmbuild/BUILD/mvapich4-ofi-4.1-build/mvapich-4.1/src/mpi/romio/test'
[  937s] FAIL: coll_test
[  941s] FAIL: noncontig_coll
[  941s] FAIL: split_coll
[  944s] FAIL: noncontig_coll2
[  944s] FAIL: aggregation1
[  948s] FAIL: fcoll_test
[  950s] FAIL: pfcoll_test
[  950s] ============================================================================
[  950s] Testsuite summary for ROMIO 4.3.0
[  950s] ============================================================================
[  950s] # TOTAL: 34
[  950s] # PASS:  27
[  950s] # SKIP:  0
[  950s] # XFAIL: 0
[  950s] # FAIL:  7
[  950s] # XPASS: 0
[  950s] # ERROR: 0

With test-suite.log:
FAIL: noncontig_coll
====================

Abort(17) on node 0: Fatal error in internal_Waitall: See the MPI_ERROR field in MPI_Status for the error code
FAIL noncontig_coll (exit status: 17)


Trying to run manually:
abuild@portia:~/rpmbuild/BUILD/mvapich4-ofi-4.1-build/mvapich-4.1/src/mpi/romio/test> mpirun -np 2 ./noncontig_coll -fname foo
Abort(17) on node 0: Fatal error in internal_Waitall: See the MPI_ERROR field in MPI_Status for the error code

(gdb) bt
#0  0x00007ffff6a4548e in exit () from /lib64/libc.so.6
#1  0x00007ffff7278dca in MPL_exit (exit_code=<optimized out>) at src/mpl/src/msg/mpl_msg.c:89
#2  MPID_Abort (comm=0x0, mpi_errno=<optimized out>, exit_code=<optimized out>, error_msg=<optimized out>) at src/mpid/ch4/src/ch4_globals.c:124
#3  0x00007ffff71e7c36 in MPIR_Handle_fatal_error (comm_ptr=comm_ptr@entry=0x0, fcname=fcname@entry=0x7ffff7904740 <__func__.0.lto_priv.312> "internal_Waitall", errcode=errcode@entry=17)
    at src/mpi/errhan/errutil.c:604
#4  0x00007ffff71e8017 in MPIR_Err_return_comm (comm_ptr=0x0, fcname=0x7ffff7904740 <__func__.0.lto_priv.312> "internal_Waitall", errcode=<optimized out>) at src/mpi/errhan/errutil.c:300
#5  0x00007ffff6ffc47e in internal_Waitall (count=3, array_of_requests=0x55555b226480, array_of_statuses=0x1) at src/binding/c/request/waitall.c:129
#6  0x00007ffff787358b in ADIOI_Calc_others_req (fd=fd@entry=0x5555555a0860, count_my_req_procs=1, count_my_req_per_proc=count_my_req_per_proc@entry=0x55555b226380, my_req=my_req@entry=0x55555b2263c0, 
    nprocs=<optimized out>, myrank=0, count_others_req_procs_ptr=0x7fffffffdb00, count_others_req_per_proc_ptr=0x7fffffffdb08, others_req_ptr=0x7fffffffdb28) at src/mpi/romio/adio/common/ad_aggregate.c:515
#7  0x00007ffff7887d03 in ADIOI_GEN_WriteStridedColl (fd=<optimized out>, buf=0x55555b21e410, count=1, datatype=-1946157049, file_ptr_type=<optimized out>, offset=<optimized out>, status=0x7fffffffdcb0, 
    error_code=0x7fffffffdbd4) at src/mpi/romio/adio/common/ad_write_coll.c:169
#8  0x00007ffff78665cc in MPIOI_File_write_all (fh=0x5555555a0860, offset=offset@entry=0, file_ptr_type=file_ptr_type@entry=101, buf=buf@entry=0x55555b21e410, count=count@entry=1, datatype=-1946157049, 
    myname=0x7ffff7a7a3a0 <myname.0.lto_priv> "MPI_FILE_WRITE_ALL", status=0x7fffffffdcb0) at src/mpi/romio/mpi-io/write_all.c:172
#9  0x00007ffff7866723 in PMPI_File_write_all (fh=<optimized out>, buf=buf@entry=0x55555b21e410, count=count@entry=1, datatype=<optimized out>, status=status@entry=0x7fffffffdcb0)
    at src/mpi/romio/mpi-io/write_all.c:69
#10 0x0000555555555534 in main (argc=<optimized out>, argv=<optimized out>) at /home/abuild/rpmbuild/BUILD/mvapich4-ofi-4.1-build/mvapich-4.1/src/mpi/romio/test/noncontig_coll.c:110
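
As the backtrace suggests, the failure is in the Waitall that ADIOI_Calc_others_req issues over its (presumably nonblocking send/receive) requests. Below is a rough standalone sketch of mine that mimics an Isend/Irecv + Waitall round between two ranks and prints the MPI_ERROR field of each status. The file name, buffer size, tag and exchange pattern are my own choices and not what ROMIO actually does; I was thinking of using something like this to tell apart a generic point-to-point problem over OFI from something specific to the ROMIO exchange:

/* p2p_waitall.c - rough sketch, not the ROMIO code: Isend/Irecv pair
 * between neighbouring ranks, then Waitall with per-request statuses.
 * Build: mpicc p2p_waitall.c -o p2p_waitall ; run: mpirun -np 2 ./p2p_waitall */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* make errors non-fatal so Waitall returns and fills the statuses */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    const int n = 4096;                      /* arbitrary message size */
    int *sendbuf = malloc(n * sizeof(int));
    int *recvbuf = malloc(n * sizeof(int));
    for (int i = 0; i < n; i++)
        sendbuf[i] = rank * n + i;

    int peer = (rank + 1) % nprocs;
    MPI_Request reqs[2];
    MPI_Status stats[2];

    MPI_Irecv(recvbuf, n, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    int rc = MPI_Waitall(2, reqs, stats);
    if (rc != MPI_SUCCESS) {
        /* on MPI_ERR_IN_STATUS, each status carries its own error code */
        for (int i = 0; i < 2; i++) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(stats[i].MPI_ERROR, msg, &len);
            printf("rank %d: request %d: MPI_ERROR=%d (%s)\n",
                   rank, i, stats[i].MPI_ERROR, msg);
        }
    } else if (rank == 0) {
        printf("Waitall succeeded\n");
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

If this already reports a non-success MPI_ERROR over OFI, the problem is probably below ROMIO; if it passes, the issue is more likely in how ROMIO drives the exchange.
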


Looking a little further, when MPIR_Waitall checks the statuses it sees an rc of 4237839 show up. A while back I tried to track down where that value gets set, but didn't get very far.
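
In case it helps, here is a small standalone sketch (again my own, the file name is made up) that feeds that rc value through the standard MPI_Error_class / MPI_Error_string calls; at the very least it should reveal which error class is behind it (the exact message text may not round-trip for a code generated inside another process, but the class should still be meaningful):

/* decode_rc.c - hypothetical helper, not part of the test suite.
 * Decodes an MPI error code into its class and message string.
 * Build: mpicc decode_rc.c -o decode_rc */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rc = 4237839;               /* value seen in the Waitall statuses */
    int eclass, len;
    char msg[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);

    MPI_Error_class(rc, &eclass);
    MPI_Error_string(rc, msg, &len);
    printf("rc=%d class=%d message=\"%s\"\n", rc, eclass, msg);

    MPI_Finalize();
    return 0;
}
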

Let me know if I can run more tests or analysis for you, as I don't really know where to look from here.

Nicolas


