[Mvapich-discuss] Failed to unpack MVAPICH-Plus RPM

Motlagh, Reyhan motlagh.2 at osu.edu
Tue Jan 14 10:52:42 EST 2025


Hi,

Amit: We are working on documenting some commonly used deprecated environment variables and their replacements in our readthedocs, but a comprehensive list would run to thousands of variables. Between our readthedocs<https://mvapich-docs.readthedocs.io/en/latest/cvar.html>, “ucx_info -c”, and “fi_info -e”, you should be able to see all available variables, depending on the netmod used.
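For the UCX netmod, the dump from “ucx_info -c” can be searched programmatically. The following is a minimal sketch that assumes the tool prints one NAME=value pair per line. It does not handle the output of “fi_info -e”, which uses a different, commented format. The helper names are hypothetical and not part of MVAPICH or UCX.

```python
import subprocess


def parse_config_dump(text: str, prefix: str = "UCX_") -> dict:
    """Parse NAME=value lines into a dict, keeping only names with `prefix`.

    Blank lines, '#' comment lines, and lines without '=' are skipped.
    Assumes the NAME=value layout printed by `ucx_info -c`.
    """
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        name = name.strip()
        if name.startswith(prefix):
            config[name] = value.strip()
    return config


def dump_ucx_config() -> dict:
    """Run `ucx_info -c` (must be on PATH) and parse its output."""
    result = subprocess.run(
        ["ucx_info", "-c"], capture_output=True, text=True, check=True
    )
    return parse_config_dump(result.stdout)
```

On a node with UCX installed, `{k: v for k, v in dump_ucx_config().items() if "NET" in k}` narrows the dump to the network-device settings.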

Karen and ZQ: Thanks! I’m able to log in now, and we’re looking into the issue.

Reyhan

From: Amit Ruhela <aruhela at tacc.utexas.edu>
Date: Tuesday, January 14, 2025 at 10:01 AM
To: Motlagh, Reyhan <motlagh.2 at osu.edu>, Announcement about MVAPICH (MPI over InfiniBand, RoCE, Omni-Path, Slingshot, iWARP and EFA) Libraries developed at NBCL/OSU <mvapich-discuss at lists.osu.edu>, Panda, Dhabaleswar <panda at cse.ohio-state.edu>, You, Zhi-Qiang <zyou at osc.edu>
Subject: Re: Failed to unpack MVAPICH-Plus RPM
Hi Reyhan,

Is it possible to get a matrix listing environment variables that have been deprecated and what environment variables replace/recommended for the older ones?

Thanks
Amit Ruhela

________________________________
From: Mvapich-discuss <mvapich-discuss-bounces+aruhela=tacc.utexas.edu at lists.osu.edu> on behalf of You, Zhi-Qiang via Mvapich-discuss <mvapich-discuss at lists.osu.edu>
Sent: Monday, January 13, 2025 5:15 PM
To: Motlagh, Reyhan <motlagh.2 at osu.edu>; Announcement about MVAPICH (MPI over InfiniBand, RoCE, Omni-Path, Slingshot, iWARP and EFA) Libraries developed at NBCL/OSU <mvapich-discuss at lists.osu.edu>; Panda, Dhabaleswar <panda at cse.ohio-state.edu>
Subject: Re: [Mvapich-discuss] Failed to unpack MVAPICH-Plus RPM


Hi Reyhan,



Thank you for the prompt reply. I have tested the new RPM, but the issue remains.

Regarding the account reactivation, please contact OSC help: oschelp at osc.edu<mailto:oschelp at osc.edu>

Best,

ZQ



From: Motlagh, Reyhan <motlagh.2 at osu.edu>
Date: Tuesday, January 14, 2025 at 6:01 AM
To: You, Zhi-Qiang <zyou at osc.edu>, Announcement about MVAPICH (MPI over InfiniBand, RoCE, Omni-Path, Slingshot, iWARP and EFA) Libraries developed at NBCL/OSU <mvapich-discuss at lists.osu.edu>, Panda, Dhabaleswar <panda at cse.ohio-state.edu>
Subject: Re: Failed to unpack MVAPICH-Plus RPM

Hi ZQ,



By default, we build against the version of Slurm included with the OS package manager (Slurm 22 for RHEL 9). It looks like Cardinal uses Slurm 24, which may be causing some incompatibilities. Can you try the RPM below to see whether that resolves the issue? We’re also looking into this on our end.



https://mvapich.cse.ohio-state.edu/download/mvapich/plus/4.0/cuda/UCX/mofed24.10/mvapich-plus-4.0-cuda12.4.rhel9.ofed24.10.ucx.gcc13.2.0.slurm24-4.0-1.x86_64.rpm
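To check whether a downloaded RPM matches the cluster’s Slurm, the Slurm tag in the file name can be compared against the installed major version. The following is a minimal sketch. It assumes that the naming pattern of the two RPMs in this thread (“.slurm24-” versus a bare “.slurm-”) holds in general, that a bare tag means the OS default (Slurm 22 on RHEL 9, per the message above), and that the version text looks like “slurm 24.05.4”, as printed by `sinfo --version`. All function names are hypothetical.

```python
import re
from typing import Optional


def rpm_slurm_major(filename: str) -> Optional[int]:
    """Return the Slurm major version encoded in the RPM file name.

    '...slurm24-4.0-1.x86_64.rpm' -> 24; a bare '...slurm-4.0-1...' -> None,
    which we treat as "built against the OS package manager's Slurm".
    """
    m = re.search(r"\.slurm(\d*)-", filename)
    if m is None:
        raise ValueError(f"no slurm tag found in {filename!r}")
    return int(m.group(1)) if m.group(1) else None


def slurm_major_from_version(output: str) -> int:
    """Extract the major version from text such as 'slurm 24.05.4'."""
    m = re.search(r"slurm\s+(\d+)\.", output)
    if m is None:
        raise ValueError(f"unrecognized Slurm version text: {output!r}")
    return int(m.group(1))


def rpm_matches_cluster(filename: str, version_output: str,
                        os_default_major: int = 22) -> bool:
    """True if the RPM's Slurm build matches the cluster's Slurm major version."""
    built = rpm_slurm_major(filename)
    if built is None:
        built = os_default_major  # assumption: bare tag = OS default Slurm
    return built == slurm_major_from_version(version_output)
```

With Slurm 24 on Cardinal, the “slurm24” RPM above would match, while the original bare “slurm” RPM would not.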



It looks like my OSC account has been disabled. Who can I contact to have it reactivated so I can help with troubleshooting (I assume this is all on Cardinal)? My username is rmotlagh.

Regarding your questions:

  1.  Yes, we are hoping to have MVAPICH 4.0 released within the month.
  2.  We have unified redundant environment variables (such as the separate ones for HIP and CUDA) and made the naming conventions for our CVARs more consistent. So yes, replace those with MVP_ENABLE_GPU.
  3.  Some of these are now handled in the netmod layer. You can select IB devices with “UCX_NET_DEVICES=mlx5_0:1” and RDMA CM with “UCX_SOCKADDR_TLS_PRIORITY=rdmacm” (rdmacm may require a new RPM built with the --with-rdmacm UCX configure flag; I will update the RPMs on the website to allow for this if it passes our testing). MVP_HOMOGENEOUS_CLUSTER has no equivalent because it is no longer needed: performance is good regardless of this flag.
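The answers above can be collected into a small translation table for existing job scripts, which also partly addresses the request for a matrix. The following is a minimal sketch that covers only the variables discussed in this thread and is not an official MVAPICH mapping. Three behaviours are assumptions: MVP_ENABLE_GPU takes the same 1/0 value that MV2_USE_CUDA/MVP_USE_CUDA did; MVP_IBA_HCA holds a bare device name such as mlx5_0; and port 1 is appended for UCX_NET_DEVICES. `migrate_env` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def migrate_env(old_env: Dict[str, str],
                ib_port: int = 1) -> Tuple[Dict[str, str], List[str]]:
    """Translate the deprecated variables discussed in this thread.

    Returns the new environment plus human-readable notes. Unknown
    variables are passed through unchanged.
    """
    new_env: Dict[str, str] = {}
    notes: List[str] = []
    for name, value in old_env.items():
        if name in ("MV2_USE_CUDA", "MVP_USE_CUDA"):
            # Assumption: same boolean value carries over.
            new_env["MVP_ENABLE_GPU"] = value
        elif name == "MVP_IBA_HCA":
            # UCX wants device:port; the port number is an assumption.
            new_env["UCX_NET_DEVICES"] = f"{value}:{ib_port}"
        elif name == "MVP_USE_RDMA_CM":
            if value == "1":
                new_env["UCX_SOCKADDR_TLS_PRIORITY"] = "rdmacm"
                notes.append("rdmacm may need UCX configured with --with-rdmacm")
        elif name == "MVP_HOMOGENEOUS_CLUSTER":
            notes.append("MVP_HOMOGENEOUS_CLUSTER dropped: no longer needed")
        else:
            new_env[name] = value
    return new_env, notes
```

Unrecognised variables are passed through rather than dropped, so the output can be diffed against the input and reviewed by hand.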



Best,

Reyhan



From: Mvapich-discuss <mvapich-discuss-bounces at lists.osu.edu> on behalf of You, Zhi-Qiang via Mvapich-discuss <mvapich-discuss at lists.osu.edu>
Date: Saturday, January 11, 2025 at 9:32 PM
To: Panda, Dhabaleswar <panda at cse.ohio-state.edu>, Announcement about MVAPICH (MPI over InfiniBand, RoCE, Omni-Path, Slingshot, iWARP and EFA) Libraries developed at NBCL/OSU <mvapich-discuss at lists.osu.edu>
Subject: Re: [Mvapich-discuss] Failed to unpack MVAPICH-Plus RPM

Hi DK,



Thank you for the prompt fix. The RPM is now functioning correctly. However, I encountered the following error while running a simple ping-pong MPI test over two nodes:

slurmstepd: error: pmijobid missing in fullinit command

I suspected this might be due to a PMI incompatibility. Following this documentation<https://mvapich-docs.readthedocs.io/en/latest/cvar.html#mvapich-environment-variables>, I set MVP_PMI_VERSION to 2 to match our Slurm configuration, but the issue persists. I also checked the output of mpichversion -a and confirmed that the --with-pmi=pmi2 option is enabled, which leads me to conclude that this is not a PMI compatibility issue.
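The two checks described above (the configure option reported by mpichversion -a and the MVP_PMI_VERSION setting) can be scripted so that they are repeated consistently on each node. The following is a minimal sketch. It assumes only that the configure options appear somewhere in the version text as a `--with-pmi=<value>` token, possibly quoted; the exact layout of the mpichversion output is not assumed. The function names are hypothetical.

```python
import re
from typing import Dict, List, Optional


def configured_pmi(version_output: str) -> Optional[str]:
    """Return the value of --with-pmi found in the text, or None."""
    m = re.search(r"--with-pmi=([\w.-]+)", version_output)
    return m.group(1) if m else None


def check_pmi(version_output: str, env: Dict[str, str]) -> List[str]:
    """List mismatches between the library build and the PMI2 settings."""
    problems = []
    pmi = configured_pmi(version_output)
    if pmi is None:
        problems.append("no --with-pmi option found in configure options")
    elif pmi != "pmi2":
        problems.append(f"library configured with --with-pmi={pmi}, expected pmi2")
    if env.get("MVP_PMI_VERSION") != "2":
        problems.append("MVP_PMI_VERSION is not set to 2")
    return problems
```

An empty list from `check_pmi(output, dict(os.environ))` matches the conclusion above: the PMI settings line up, so the error likely lies elsewhere.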



Additionally, I have a few related questions:

  1.  Will there be an MVAPICH 4.0 release, or will it be replaced by the MVAPICH-Plus CPU-only version?
  2.  The documentation linked above lists many environment variables that I haven’t encountered before when using MVAPICH2-GDR. Are these new variables specific to MVAPICH 4.0? Are variables like MV2_USE_CUDA/MVP_USE_CUDA still available, or should they be replaced with MVP_ENABLE_GPU?
  3.  Could you help confirm whether the following variables are still supported in MVAPICH?

     *   MVP_USE_RDMA_CM

     *   MVP_HOMOGENEOUS_CLUSTER

     *   MVP_IBA_HCA



Thank you for your time and assistance!



Best regards,

ZQ





From: Panda, Dhabaleswar <panda at cse.ohio-state.edu>
Date: Saturday, January 11, 2025 at 3:14 AM
To: You, Zhi-Qiang <zyou at osc.edu>, Announcement about MVAPICH (MPI over InfiniBand, RoCE, Omni-Path, Slingshot, iWARP and EFA) Libraries developed at NBCL/OSU <mvapich-discuss at lists.osu.edu>
Subject: RE: Failed to unpack MVAPICH-Plus RPM

Hi ZQ,



As we have communicated with you separately, a new RPM has been uploaded. Please try this version and let us know whether you see any additional issues.



DK



From: Mvapich-discuss <mvapich-discuss-bounces at lists.osu.edu> On Behalf Of You, Zhi-Qiang via Mvapich-discuss
Sent: Thursday, January 2, 2025 1:54 PM
To: mvapich-discuss at lists.osu.edu
Subject: [Mvapich-discuss] Failed to unpack MVAPICH-Plus RPM



Hello,



I downloaded the MVAPICH-Plus 4.0 RPM from the following link:

https://mvapich.cse.ohio-state.edu/download/mvapich/plus/4.0/cuda/UCX/mofed5.0/mvapich-plus-4.0-cuda12.4.rhel9.ofed24.10.ucx.gcc13.2.0.slurm-4.0-1.x86_64.rpm

However, when I tried to unpack it using cpio, the process failed with the error:

cpio: premature end of file



I have no issues unpacking other RPMs, so it seems this file might be corrupted. Could you please check and confirm?
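A truncated or corrupted download, which is a common cause of “premature end of file”, can be checked before running cpio. The following is a minimal sketch that looks for the 4-byte RPM lead magic (0xED 0xAB 0xEE 0xDB) and computes a SHA-256 digest. Note that the magic check only catches files that are not RPMs at all; truncation is detected only by comparing the digest against a known-good value. The thread does not mention a published checksum, so `expected_sha256` is a hypothetical input. The standard `rpm -K <file>` command performs a similar digest check natively.

```python
import hashlib
from pathlib import Path
from typing import Optional, Union

RPM_LEAD_MAGIC = b"\xed\xab\xee\xdb"  # first 4 bytes of every RPM (the "lead")


def inspect_rpm(path: Union[str, Path],
                expected_sha256: Optional[str] = None) -> dict:
    """Report size, RPM magic presence, and SHA-256 of a file."""
    digest = hashlib.sha256()
    size = 0
    with Path(path).open("rb") as f:
        head = f.read(4)
        digest.update(head)
        size += len(head)
        # Stream the remainder in 1 MiB chunks to avoid loading it all at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
            size += len(chunk)
    result = {
        "size": size,
        "has_rpm_magic": head == RPM_LEAD_MAGIC,
        "sha256": digest.hexdigest(),
    }
    if expected_sha256 is not None:
        result["checksum_ok"] = result["sha256"] == expected_sha256.lower()
    return result
```

A file that has the RPM magic but a mismatched digest (or a size much smaller than other builds) points to a truncated transfer rather than a cpio problem.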



Thank you,

ZQ

