[Mvapich-discuss] Failed to unpack MVAPICH-Plus RPM
Amit Ruhela
aruhela at tacc.utexas.edu
Tue Jan 14 13:48:15 EST 2025
Thanks Reyhan,
This sounds good to me.
Best Regards
Amit R
________________________________
From: Motlagh, Reyhan <motlagh.2 at osu.edu>
Sent: Tuesday, January 14, 2025 9:52 AM
To: Amit Ruhela <aruhela at tacc.utexas.edu>; Announcement about MVAPICH (MPI over InfiniBand, RoCE, Omni-Path, Slingshot, iWARP and EFA) Libraries developed at NBCL/OSU <mvapich-discuss at lists.osu.edu>; Panda, Dhabaleswar <panda at cse.ohio-state.edu>; You, Zhi-Qiang <zyou at osc.edu>
Subject: Re: Failed to unpack MVAPICH-Plus RPM
Hi,
Amit - We are working on documenting some commonly used deprecated envs and their replacements in our readthedocs, but a comprehensive list would run to thousands of variables. Between our readthedocs <https://mvapich-docs.readthedocs.io/en/latest/cvar.html>, “ucx_info -c”, and “fi_info -e”, you should be able to see everything available, depending on the netmod used.
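For anyone following along, a quick way to dump what is actually available on a given system (the grep pattern below is only illustrative):

    ucx_info -c                      # full UCX configuration, one UCX_* variable per line
    ucx_info -c | grep NET_DEVICES   # e.g. narrow down to device selection
    fi_info -e                       # libfabric environment variables with descriptions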
Karen and ZQ – Thanks! I’m able to log in now; we’re looking into the issue.
Reyhan
From: Amit Ruhela <aruhela at tacc.utexas.edu>
Date: Tuesday, January 14, 2025 at 10:01 AM
To: Motlagh, Reyhan <motlagh.2 at osu.edu>, Announcement about MVAPICH (MPI over InfiniBand, RoCE, Omni-Path, Slingshot, iWARP and EFA) Libraries developed at NBCL/OSU <mvapich-discuss at lists.osu.edu>, Panda, Dhabaleswar <panda at cse.ohio-state.edu>, You, Zhi-Qiang <zyou at osc.edu>
Subject: Re: Failed to unpack MVAPICH-Plus RPM
Hi Reyhan,
Is it possible to get a matrix listing the environment variables that have been deprecated and the variables that replace them or are recommended in their place?
Thanks
Amit Ruhela
________________________________
From: Mvapich-discuss <mvapich-discuss-bounces+aruhela=tacc.utexas.edu at lists.osu.edu> on behalf of You, Zhi-Qiang via Mvapich-discuss <mvapich-discuss at lists.osu.edu>
Sent: Monday, January 13, 2025 5:15 PM
To: Motlagh, Reyhan <motlagh.2 at osu.edu>; Announcement about MVAPICH (MPI over InfiniBand, RoCE, Omni-Path, Slingshot, iWARP and EFA) Libraries developed at NBCL/OSU <mvapich-discuss at lists.osu.edu>; Panda, Dhabaleswar <panda at cse.ohio-state.edu>
Subject: Re: [Mvapich-discuss] Failed to unpack MVAPICH-Plus RPM
Hi Reyhan,
Thank you for the prompt reply. I have tested the new RPM but the issue remains.
Regarding the account reactivation, please contact OSC help: oschelp at osc.edu
Best,
ZQ
From: Motlagh, Reyhan <motlagh.2 at osu.edu>
Date: Tuesday, January 14, 2025 at 6:01 AM
To: You, Zhi-Qiang <zyou at osc.edu>, Announcement about MVAPICH (MPI over InfiniBand, RoCE, Omni-Path, Slingshot, iWARP and EFA) Libraries developed at NBCL/OSU <mvapich-discuss at lists.osu.edu>, Panda, Dhabaleswar <panda at cse.ohio-state.edu>
Subject: Re: Failed to unpack MVAPICH-Plus RPM
Hi ZQ,
By default we build against the version of Slurm included with the OS package manager (Slurm 22 for RHEL 9). It looks like Cardinal uses Slurm 24, so this may be causing some incompatibilities. Can you try the RPM below and see whether it resolves the issue? We’re also looking into this on our end.
https://mvapich.cse.ohio-state.edu/download/mvapich/plus/4.0/cuda/UCX/mofed24.10/mvapich-plus-4.0-cuda12.4.rhel9.ofed24.10.ucx.gcc13.2.0.slurm24-4.0-1.x86_64.rpm
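If it helps with matching builds to systems, the two versions can be compared with standard tools (generic commands, nothing MVAPICH-specific):

    srun --version     # Slurm version on the cluster
    rpm -qpi mvapich-plus-4.0-cuda12.4.rhel9.ofed24.10.ucx.gcc13.2.0.slurm24-4.0-1.x86_64.rpm   # metadata of the downloaded RPM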
It looks like my OSC account has been disabled. To help with this troubleshooting, who can I reach out to for reactivation (I assume this is all on Cardinal)? My username is rmotlagh.
Regarding your questions:
1. Yes, we are hoping to have MVAPICH 4.0 released within the month.
2. We have unified redundant envs (such as the separate envs for HIP and CUDA) and made the naming conventions for our CVARs more consistent. So yes, replace that with MVP_ENABLE_GPU (see the sketch after this list).
3. Some of these are now handled in the netmod layer. You can select IB devices with “UCX_NET_DEVICES=mlx5_0:1” and “UCX_SOCKADDR_TLS_PRIORITY=rdmacm” (rdmacm may require a new RPM built with the --with-rdmacm UCX configure flag; I will update the website RPMs to allow for this if it passes our testing). MVP_HOMOGENEOUS_CLUSTER no longer has an equivalent; performance is good regardless of that flag.
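Putting answers 2 and 3 together, a rough sketch of a job script (the launcher flags and the osu_latency binary are placeholders; adjust for your site):

    export MVP_ENABLE_GPU=1                  # replaces MV2_USE_CUDA / MVP_USE_CUDA
    export UCX_NET_DEVICES=mlx5_0:1          # UCX-level equivalent of MVP_IBA_HCA
    export UCX_SOCKADDR_TLS_PRIORITY=rdmacm  # UCX-level equivalent of MVP_USE_RDMA_CM
    srun -N 2 -n 2 ./osu_latency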
Best,
Reyhan
From: Mvapich-discuss <mvapich-discuss-bounces at lists.osu.edu> on behalf of You, Zhi-Qiang via Mvapich-discuss <mvapich-discuss at lists.osu.edu>
Date: Saturday, January 11, 2025 at 9:32 PM
To: Panda, Dhabaleswar <panda at cse.ohio-state.edu>, Announcement about MVAPICH (MPI over InfiniBand, RoCE, Omni-Path, Slingshot, iWARP and EFA) Libraries developed at NBCL/OSU <mvapich-discuss at lists.osu.edu>
Subject: Re: [Mvapich-discuss] Failed to unpack MVAPICH-Plus RPM
Hi DK,
Thank you for the prompt fix. The RPM is now functioning correctly. However, I encountered the following error while running a simple ping-pong MPI test over two nodes:
slurmstepd: error: pmijobid missing in fullinit command
I suspected this might be due to a PMI incompatibility. I referred to this documentation <https://mvapich-docs.readthedocs.io/en/latest/cvar.html#mvapich-environment-variables> and learned about setting MVP_PMI_VERSION to 2 to align with our SLURM configuration. However, the issue persists. I also checked the output of mpichversion -a and confirmed that the --with-pmi=pmi2 option is enabled, leading me to conclude that this is not a PMI compatibility issue.
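For completeness, the checks looked roughly like this (the srun flags follow our Slurm setup and the pingpong binary is a placeholder):

    mpichversion -a | grep -i pmi    # confirms --with-pmi=pmi2 in the configure options
    export MVP_PMI_VERSION=2
    srun --mpi=pmi2 -N 2 -n 2 ./pingpong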
Additionally, I have a few related questions:
1. Will there be an MVAPICH 4.0 release, or will it be replaced by the MVAPICH-Plus CPU-only version?
2. The documentation linked above lists many environment variables that I haven’t encountered before when using MVAPICH2-GDR. Are these new variables specific to MVAPICH 4.0? Are variables like MV2_USE_CUDA/MVP_USE_CUDA still available, or should they be replaced with MVP_ENABLE_GPU?
3. Could you help confirm if the following variables are still supported in MVAPICH?
* MVP_USE_RDMA_CM
* MVP_HOMOGENEOUS_CLUSTER
* MVP_IBA_HCA
Thank you for your time and assistance!
Best regards,
ZQ
From: Panda, Dhabaleswar <panda at cse.ohio-state.edu>
Date: Saturday, January 11, 2025 at 3:14 AM
To: You, Zhi-Qiang <zyou at osc.edu>, Announcement about MVAPICH (MPI over InfiniBand, RoCE, Omni-Path, Slingshot, iWARP and EFA) Libraries developed at NBCL/OSU <mvapich-discuss at lists.osu.edu>
Subject: RE: Failed to unpack MVAPICH-Plus RPM
Hi ZQ,
As we have communicated with you separately, a new RPM has been uploaded. Please try this version and let us know whether you see any additional issues.
DK
From: Mvapich-discuss <mvapich-discuss-bounces at lists.osu.edu> On Behalf Of You, Zhi-Qiang via Mvapich-discuss
Sent: Thursday, January 2, 2025 1:54 PM
To: mvapich-discuss at lists.osu.edu
Subject: [Mvapich-discuss] Failed to unpack MVAPICH-Plus RPM
Hello,
I downloaded the MVAPICH-Plus 4.0 RPM from the following link:
https://mvapich.cse.ohio-state.edu/download/mvapich/plus/4.0/cuda/UCX/mofed5.0/mvapich-plus-4.0-cuda12.4.rhel9.ofed24.10.ucx.gcc13.2.0.slurm-4.0-1.x86_64.rpm
However, I encountered an issue when trying to unpack it using cpio. The process failed with the error:
cpio: premature end of file
I have no issues unpacking other RPMs, so it seems this file might be corrupted. Could you please check and confirm?
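For reference, an extraction along these lines reproduces the error (the standard rpm2cpio pipeline):

    rpm2cpio mvapich-plus-4.0-cuda12.4.rhel9.ofed24.10.ucx.gcc13.2.0.slurm-4.0-1.x86_64.rpm | cpio -idmv
    cpio: premature end of file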
Thank you,
ZQ