[Mvapich-discuss] Azure HBv4 mpi failure
Sandhu, Prabhjot(Nicky)@DWR
Prabhjot.Sandhu at water.ca.gov
Fri Jul 12 12:01:41 EDT 2024
Hi DK
Interestingly, it fails at a different point in the application code. The same code runs fine on the HBv2 setup, and a build against HPC-X also runs fine on HBv4.
The application is SCHISM; its code can be obtained here: https://github.com/schism-dev/schism
Some example error files are attached. The before_ and after_ prefixes refer to runs before and after the patch.
Perhaps I can run a test suite for mvapich2?
Nicky
________________________________
From: Panda, Dhabaleswar <panda at cse.ohio-state.edu>
Sent: Friday, July 12, 2024 6:21:47 AM
To: Paniraja Guptha, Akshay <panirajaguptha.1 at osu.edu>; Announcement about MVAPICH (MPI over InfiniBand, RoCE, Omni-Path, Slingshot, iWARP and EFA) Libraries developed at NBCL/OSU <mvapich-discuss at lists.osu.edu>; Sandhu, Prabhjot(Nicky)@DWR <Prabhjot.Sandhu at water.ca.gov>
Subject: Re: Azure HBv4 mpi failure
Sorry to know that the issue still persists with the patch. Can you please provide some more details on the failure you are seeing?
Thanks,
DK
________________________________________
From: Mvapich-discuss <mvapich-discuss-bounces+panda.2=osu.edu at lists.osu.edu> on behalf of Sandhu, Prabhjot(Nicky)@DWR via Mvapich-discuss <mvapich-discuss at lists.osu.edu>
Sent: Thursday, July 11, 2024 8:41 PM
To: Paniraja Guptha, Akshay; Announcement about MVAPICH (MPI over InfiniBand, RoCE, Omni-Path, Slingshot, iWARP and EFA) Libraries developed at NBCL/OSU
Subject: Re: [Mvapich-discuss] Azure HBv4 mpi failure
Hi Akshay,
I patched it and the warning message goes away but the application still fails with internal checks. In other words, the issue remains.
From: Paniraja Guptha, Akshay <panirajaguptha.1 at osu.edu>
Sent: Thursday, July 11, 2024 12:36 PM
To: Paniraja Guptha, Akshay <panirajaguptha.1 at osu.edu>; Announcement about MVAPICH (MPI over InfiniBand, RoCE, Omni-Path, Slingshot, iWARP and EFA) Libraries developed at NBCL/OSU <mvapich-discuss at lists.osu.edu>; Sandhu, Prabhjot(Nicky)@DWR <Prabhjot.Sandhu at water.ca.gov>
Subject: RE: Azure HBv4 mpi failure
Hi Nicky,
Can you please try the attached patch?
-Akshay
From: Mvapich-discuss <mvapich-discuss-bounces at lists.osu.edu> On Behalf Of Paniraja Guptha, Akshay via Mvapich-discuss
Sent: Monday, July 1, 2024 11:38 AM
To: Sandhu, Prabhjot(Nicky)@DWR <Prabhjot.Sandhu at water.ca.gov>; Announcement about MVAPICH (MPI over InfiniBand, RoCE, Omni-Path, Slingshot, iWARP and EFA) Libraries developed at NBCL/OSU <mvapich-discuss at lists.osu.edu>
Subject: Re: [Mvapich-discuss] Azure HBv4 mpi failure
Hi Nicky,
Thanks for bringing this to our attention. We will take a look at the issue and get back to you.
-Akshay Paniraja Guptha
From: Mvapich-discuss <mvapich-discuss-bounces+panirajaguptha.1=osu.edu at lists.osu.edu> On Behalf Of Sandhu, Prabhjot(Nicky)@DWR via Mvapich-discuss
Sent: Monday, July 1, 2024 11:09 AM
To: mvapich-discuss at lists.osu.edu
Subject: [Mvapich-discuss] Azure HBv4 mpi failure
I compiled my code against the latest AlmaLinux 8.7 and mvapich2-2.3.7-1 on Azure. The code performs very well when using HBv2-series <https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/high-performance-compute/hb-family#hbv2-series> or HBv3-series <https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/high-performance-compute/hb-family#hbv3-series>; however, it fails when using HBv4-series <https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/high-performance-compute/hb-family#hbv4-series>, with the following warning at the start of the mpirun, after which the application code also fails.
[get_link_speed] Invalid link speed 128
Has anyone seen this message? Are there any env vars or config vars to be set?
Nicky
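A note on the warning above: HBv4-series VMs use 400 Gb/s NDR InfiniBand, and in the libibverbs port attributes (see the ibv_query_port man page) the active_speed code 128 denotes 100 Gb/s per lane, i.e. NDR. One plausible reading, which this thread does not confirm, is that the library's speed table predates NDR and so rejects that code. The Python sketch below only illustrates that kind of lookup; the table names, the function name, and the default of 4 lanes are hypothetical and are not taken from the MVAPICH2 source.

```python
# Per-lane rates in Gb/s for libibverbs active_speed codes, as listed in the
# ibv_query_port man page. Generation names are the usual InfiniBand ones.
ACTIVE_SPEED = {
    1: ("SDR", 2.5),
    2: ("DDR", 5.0),
    4: ("QDR", 10.0),
    8: ("FDR10", 10.0),
    16: ("FDR", 14.0),
    32: ("EDR", 25.0),
    64: ("HDR", 50.0),
    128: ("NDR", 100.0),
}


def describe_link_speed(code, lanes=4):
    """Return (generation, aggregate Gb/s), or None for an unknown code.

    lanes=4 is an assumption (a 4x link); it is not read from hardware here.
    """
    entry = ACTIVE_SPEED.get(code)
    if entry is None:
        return None
    name, per_lane = entry
    return name, per_lane * lanes


# Hypothetical table that stops at HDR: code 128 would fall through as invalid.
PRE_NDR = {code: v for code, v in ACTIVE_SPEED.items() if code < 128}

if __name__ == "__main__":
    print(describe_link_speed(64))   # HBv2/HBv3-class link (HDR)
    print(describe_link_speed(128))  # HBv4-class link (NDR)
    print(128 in PRE_NDR)            # whether the older table knows code 128
```

If this reading is right, it would explain why the same build works on HBv2 and HBv3 (HDR, code 64) but warns on HBv4. It does not by itself explain the later application failure.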
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: stderr_after_1.txt
URL: <http://lists.osu.edu/pipermail/mvapich-discuss/attachments/20240712/ce405f9e/attachment-0006.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: stderr_before_1.txt
URL: <http://lists.osu.edu/pipermail/mvapich-discuss/attachments/20240712/ce405f9e/attachment-0007.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: stderr_before_2.txt
URL: <http://lists.osu.edu/pipermail/mvapich-discuss/attachments/20240712/ce405f9e/attachment-0008.txt>