[mvapich-discuss] MVAPICH-GDR 2.3.3: Bug using Multiple Nodes

Subramoni, Hari subramoni.1 at osu.edu
Mon Jan 27 10:50:44 EST 2020


Dear Andreas,

This is good to know. You can set this in the default env for now. Nothing more should be needed. We’ve taken care of this internally in the code base now. So, with the next release, you should not need this flag.
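
For illustration only (this sketch is not part of the original reply): besides exporting MV2_USE_RDMA_CM=0 in the default environment, the flag could in principle also be pinned inside the application until the fixed release is out, assuming MVAPICH2 reads its MV2_* variables during MPI_Init in each rank.

    /* Hedged sketch only: hard-wire the workaround in the program itself,
     * assuming MVAPICH2 picks up MV2_* variables when MPI_Init runs.
     * Exporting MV2_USE_RDMA_CM=0 in the launch environment (e.g. via a
     * module file) is the simpler route described in this thread. */
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        /* overwrite = 0 keeps any value already set by the user or a module */
        setenv("MV2_USE_RDMA_CM", "0", 0);

        MPI_Init(&argc, &argv);
        /* ... actual application or test code ... */
        MPI_Finalize();
        return 0;
    }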

Best,
Hari.

From: Herten, Andreas <a.herten at fz-juelich.de>
Sent: Monday, January 27, 2020 4:45 AM
To: Subramoni, Hari <subramoni.1 at osu.edu>
Cc: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>; Markus Schmitt <mschmitt at pks.mpg.de>; Alvarez, Damian <d.alvarez at fz-juelich.de>
Subject: Re: MVAPICH-GDR 2.3.3: Bug using Multiple Nodes

Dear Hari,

With this environment variable set, the issue does not occur!

I’ve updated the error description in the GitHub Gist with the output:
                https://gist.github.com/AndiH/cf1c0ec5110170526ad345c0ce82f74b#env-variable-mv2_use_rdma_cm0

We would set this in our Lmod module now as a work-around. Are there any further tests I can run to help narrow down the problem?

Best,

-Andreas


On 24.01.2020, at 19:02, Subramoni, Hari <subramoni.1 at osu.edu> wrote:

Hi Andreas,

Can you please set "MV2_USE_RDMA_CM=0" and try?

Thx,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> On Behalf Of Herten, Andreas
Sent: Thursday, January 23, 2020 8:59 AM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Cc: Markus Schmitt <mschmitt at pks.mpg.de>
Subject: [mvapich-discuss] MVAPICH-GDR 2.3.3: Bug using Multiple Nodes

Dear all,

As Hari already mentioned, the MPI_Allreduce() bug reported before was fixed with the latest build of the RPM. Thanks again for the swift response!

Unfortunately, as we moved forward with our test case, we encountered another, quite serious bug: we cannot launch an MPI program on more than one node; `srun --nodes 1 ./test` works, but `srun --nodes 2 ./test` does not.
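
For context only, a minimal multi-node test of the kind meant here might look roughly like the following sketch (the actual reproducer is the one in the Gist linked below; none of this code is taken from it):

    /* Hypothetical minimal multi-node smoke test; a collective over
     * MPI_COMM_WORLD forces inter-node communication once the job spans
     * more than one node. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, namelen, sum = 0;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &namelen);

        MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        printf("rank %d of %d on %s, allreduce sum = %d\n",
               rank, size, name, sum);

        MPI_Finalize();
        return 0;
    }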

As before, please find a description of the problem in this Gist, including a reproducer:
                https://gist.github.com/AndiH/cf1c0ec5110170526ad345c0ce82f74b#mvapich2-gdr-multi-node-mpi-bug

Please make sure to have a look at the note at the end of the readme relating to our OFED stack update next week.

Best,

-Andreas
—
NVIDIA Application Lab
Jülich Supercomputing Centre
Forschungszentrum Jülich, Germany
+49 2461 61 1825
