[Mvapich-discuss] Failed to unpack MVAPICH-Plus RPM

You, Zhi-Qiang zyou at osc.edu
Mon Jan 13 18:15:21 EST 2025


Hi Reyhan,

Thank you for the prompt reply. I have tested the new RPM but the issue remains.

Regarding the account reactivation, please contact OSC help: oschelp at osc.edu

Best,
ZQ

From: Motlagh, Reyhan <motlagh.2 at osu.edu>
Date: Tuesday, January 14, 2025 at 6:01 AM
To: You, Zhi-Qiang <zyou at osc.edu>, Announcement about MVAPICH (MPI over InfiniBand, RoCE, Omni-Path, Slingshot, iWARP and EFA) Libraries developed at NBCL/OSU <mvapich-discuss at lists.osu.edu>, Panda, Dhabaleswar <panda at cse.ohio-state.edu>
Subject: Re: Failed to unpack MVAPICH-Plus RPM
Hi ZQ,

By default we build with the version of Slurm included with the OS package manager (Slurm 22 for RHEL 9). It looks like Cardinal uses Slurm 24, so this may be causing the incompatibility. Can you try the RPM below and see whether it resolves the issue? We're also looking into this on our end.

https://mvapich.cse.ohio-state.edu/download/mvapich/plus/4.0/cuda/UCX/mofed24.10/mvapich-plus-4.0-cuda12.4.rhel9.ofed24.10.ucx.gcc13.2.0.slurm24-4.0-1.x86_64.rpm
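
If it helps, here is a quick way to fetch and sanity-check the new package before unpacking; the working directory below is just a placeholder:

    # Sketch only: fetch the RPM, verify its payload digests, then unpack it locally.
    wget https://mvapich.cse.ohio-state.edu/download/mvapich/plus/4.0/cuda/UCX/mofed24.10/mvapich-plus-4.0-cuda12.4.rhel9.ofed24.10.ucx.gcc13.2.0.slurm24-4.0-1.x86_64.rpm
    rpm -K --nosignature mvapich-plus-4.0-cuda12.4.rhel9.ofed24.10.ucx.gcc13.2.0.slurm24-4.0-1.x86_64.rpm   # digest check only
    mkdir -p mvp40-test && cd mvp40-test                 # placeholder directory
    rpm2cpio ../mvapich-plus-4.0-cuda12.4.rhel9.ofed24.10.ucx.gcc13.2.0.slurm24-4.0-1.x86_64.rpm | cpio -idm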

It looks like my OSC account has been disabled. To help with this troubleshooting, who can I reach out to for reactivation (I assume this is all on Cardinal)? My username is rmotlagh.

Regarding your questions:

  1.  Yes, we are hoping to have MVAPICH 4.0 released within the month.
  2.  We have unified redundant environment variables (for example, the separate HIP and CUDA variables) and made the naming conventions for our CVARs more consistent. So yes, replace that with MVP_ENABLE_GPU.
  3.  Some of these are handled in the netmod layer now. You can select IB devices with "UCX_NET_DEVICES=mlx5_0:1" and "UCX_SOCKADDR_TLS_PRIORITY=rdmacm" (rdmacm may require a new RPM built with the --with-rdmacm UCX configure flag; I will update the website RPMs to allow for this if it passes our testing). MVP_HOMOGENEOUS_CLUSTER no longer has a meaningful equivalent; performance is good regardless of that flag.
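
For reference, a launch along these lines should exercise those settings; treat it as a sketch, since the srun options, node counts, and the ping-pong binary name are placeholders rather than anything specific to your setup:

    # Sketch only: binary name, node counts, and srun flags are placeholders.
    export MVP_ENABLE_GPU=1                    # replaces the old MV2_USE_CUDA-style variables
    export UCX_NET_DEVICES=mlx5_0:1            # select the IB device and port
    export UCX_SOCKADDR_TLS_PRIORITY=rdmacm    # may require a UCX built with --with-rdmacm
    srun -N 2 --ntasks-per-node=1 ./ping_pong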

Best,
Reyhan

From: Mvapich-discuss <mvapich-discuss-bounces at lists.osu.edu> on behalf of You, Zhi-Qiang via Mvapich-discuss <mvapich-discuss at lists.osu.edu>
Date: Saturday, January 11, 2025 at 9:32 PM
To: Panda, Dhabaleswar <panda at cse.ohio-state.edu>, Announcement about MVAPICH (MPI over InfiniBand, RoCE, Omni-Path, Slingshot, iWARP and EFA) Libraries developed at NBCL/OSU <mvapich-discuss at lists.osu.edu>
Subject: Re: [Mvapich-discuss] Failed to unpack MVAPICH-Plus RPM
Hi DK,

Thank you for the prompt fix. The RPM is now functioning correctly. However, I encountered the following error while running a simple ping-pong MPI test over two nodes:

slurmstepd: error: pmijobid missing in fullinit command

I suspected this might be due to PMI incompatibility. I referred to this documentation <https://mvapich-docs.readthedocs.io/en/latest/cvar.html#mvapich-environment-variables> and learned about setting MVP_PMI_VERSION to 2 to align with our SLURM configuration. However, the issue persists. I also checked the output of mpichversion -a and confirmed that the --with-pmi=pmi2 option is enabled, leading me to conclude that this is not a PMI compatibility issue.
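
For context, the failing run looked roughly like this (a simplified reconstruction; the binary name and node counts are placeholders):

    # Simplified reconstruction of the failing two-node run.
    export MVP_PMI_VERSION=2
    srun --mpi=pmi2 -N 2 --ntasks-per-node=1 ./ping_pong
    # -> slurmstepd: error: pmijobid missing in fullinit command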

Additionally, I have a few related questions:

  1.  Will there be an MVAPICH 4.0 release, or will it be replaced by the MVAPICH-Plus CPU-only version?
  2.  The documentation linked above lists many environment variables that I haven’t encountered before when using MVAPICH2-GDR. Are these new variables specific to MVAPICH 4.0? Are variables like MV2_USE_CUDA/MVP_USE_CUDA still available, or should they be replaced with MVP_ENABLE_GPU?
  3.  Could you help confirm if the following variables are still supported in MVAPICH?

     *   MVP_USE_RDMA_CM
     *   MVP_HOMOGENEOUS_CLUSTER
     *   MVP_IBA_HCA

Thank you for your time and assistance!

Best regards,
ZQ


From: Panda, Dhabaleswar <panda at cse.ohio-state.edu>
Date: Saturday, January 11, 2025 at 3:14 AM
To: You, Zhi-Qiang <zyou at osc.edu>, Announcement about MVAPICH (MPI over InfiniBand, RoCE, Omni-Path, Slingshot, iWARP and EFA) Libraries developed at NBCL/OSU <mvapich-discuss at lists.osu.edu>
Subject: RE: Failed to unpack MVAPICH-Plus RPM
Hi ZQ,

As we have communicated with you separately, a new RPM has been uploaded. Please try this version and let us know whether you see any additional issues.

DK

From: Mvapich-discuss <mvapich-discuss-bounces at lists.osu.edu> On Behalf Of You, Zhi-Qiang via Mvapich-discuss
Sent: Thursday, January 2, 2025 1:54 PM
To: mvapich-discuss at lists.osu.edu
Subject: [Mvapich-discuss] Failed to unpack MVAPICH-Plus RPM

Hello,

I downloaded the MVAPICH-Plus 4.0 RPM from the following link:
https://mvapich.cse.ohio-state.edu/download/mvapich/plus/4.0/cuda/UCX/mofed5.0/mvapich-plus-4.0-cuda12.4.rhel9.ofed24.10.ucx.gcc13.2.0.slurm-4.0-1.x86_64.rpm

When I tried to unpack it using cpio, the process failed with the error:

cpio: premature end of file

I have no issues unpacking other RPMs, so it seems this file might be corrupted. Could you please check and confirm?
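
For reference, the unpack sequence was essentially the following (reconstructed here, so the exact invocation may differ slightly):

    # Reconstructed unpack attempt; it fails partway through the payload.
    rpm2cpio mvapich-plus-4.0-cuda12.4.rhel9.ofed24.10.ucx.gcc13.2.0.slurm-4.0-1.x86_64.rpm | cpio -idmv
    # -> cpio: premature end of file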

Thank you,
ZQ


