[Mvapich-discuss] Assistance Needed: Inter-Node GPU Data Exchange Error

Panda, Dhabaleswar panda at cse.ohio-state.edu
Wed May 1 22:46:34 EDT 2024


Not sure which version of MVAPICH2 you are using. There is no release from the MVAPICH team named MVAPICH2 3.2.1 :-(

For GPU clusters, please use the latest MVAPICH-Plus 3.0 GA version.
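To check what is actually installed on your cluster, the mpiname utility shipped with MVAPICH2 reports the exact version and the configure line it was built with:

mpiname -a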

DK


________________________________________
From: john  <runweicheng at gmail.com>
Sent: Wednesday, May 1, 2024 9:48 PM
To: Panda, Dhabaleswar; mug-conf at lists.osu.edu
Cc: mvapich-discuss at lists.osu.edu; mvapich at lists.osu.edu
Subject: Assistance Needed: Inter-Node GPU Data Exchange Error


Dear Prof. Panda,

I hope this email finds you well. I am reaching out regarding an issue I've
encountered while working on our Linux cluster.

Currently, our setup involves a Linux cluster where each node is equipped
with two GPU accelerators, and the nodes are interconnected via Ethernet.
We're utilizing MVAPICH2 version 3.2.1, built with the configure options
"--with-rdma=gen2" and "--enable-cuda".

Exchanging data between GPU memories within the same node works smoothly,
but errors arise when we attempt to exchange data across two nodes.
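For reference, the library was configured along these lines (the CUDA install
path shown here is illustrative; ours may differ):

./configure --with-rdma=gen2 --enable-cuda --with-cuda=/usr/local/cuda
make -j && make install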

1. Test Case 1: GPU-to-GPU data exchange between the two GPUs on Node 125
succeeded, as shown below:

mpiexec -n 2 -hosts hw4-125,hw4-125 -env MV2_USE_CUDA 1 ./testp2p
# Size        Bandwidth (MB/s)
1024                     16.77
2048                    218.80
4096                    451.86
8192                    910.31
16384                  1118.03
32768                  2081.15
65536                  3386.86
131072                 5490.97
262144                 7741.95
524288                 9630.69
1048576                9826.39
2097152               10349.62
4194304               10699.22
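
For context, testp2p is a simple ping-pong bandwidth test that passes CUDA
device pointers directly to MPI. A minimal sketch of that kind of test
follows (an illustrative stand-in, not our exact code; the file name, message
sizes, iteration count, and timing loop are all assumptions):

/* p2p_bw.c - sketch of a CUDA-aware MPI ping-pong bandwidth test.
   Build (CUDA path is an assumption):
     mpicc p2p_bw.c -o p2p_bw -I/usr/local/cuda/include \
           -L/usr/local/cuda/lib64 -lcudart
   Run with MV2_USE_CUDA=1 so MVAPICH2 accepts device pointers. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define MAX_SIZE (4 * 1024 * 1024)   /* largest message: 4 MB, as in the table */
#define ITERS    100                 /* round trips per message size */

int main(int argc, char **argv)
{
    int rank, nprocs, ndev = 1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);      /* one GPU per rank on a 2-GPU node */

    char *buf;                       /* device buffer handed straight to MPI */
    cudaMalloc((void **)&buf, MAX_SIZE);
    cudaMemset(buf, rank, MAX_SIZE);

    if (rank == 0) printf("# Size        Bandwidth (MB/s)\n");

    for (size_t size = 1024; size <= MAX_SIZE; size *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(buf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, (int)size, MPI_CHAR, 1, 1, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, (int)size, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)               /* 2*size bytes move per round trip */
            printf("%-10zu %15.2f\n", size,
                   2.0 * size * ITERS / (t1 - t0) / 1e6);
    }

    cudaFree(buf);
    MPI_Finalize();
    return 0;
}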

2. Test Case 2: Attempting to exchange data between GPU memories across
Nodes 124 and 125 crashed with a segmentation fault:

mpiexec -n 2 -hosts hw4-125,hw4-124 -env MV2_USE_CUDA 1 ./testp2p
# Size        Bandwidth (MB/s)
1024                     54.89
2048                    155.74
4096                    271.98
8192                    461.67
[hw4-125:mpi_rankaaaa][error_sighandler] Caught error: Segmentation fault
(signal 11)
[hw4-124:mpi_rankaaaa][error_sighandler] Caught error: Segmentation fault
(signal 11)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 211079 RUNNING AT hw4-124
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0 at hw4-125] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911):
assert (!closed) failed
[proxy:0:0 at hw4-125] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:
76): callback returned error status
[proxy:0:0 at hw4-125] main (pm/pmiserv/pmip.c:202): demux engine error waiting
for event
[mpiexec at hw4-125] HYDT_bscu_wait_for_completion
(tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
badly; aborting
[mpiexec at hw4-125] HYDT_bsci_wait_for_completion
(tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
completion
[mpiexec at hw4-125] HYD_pmci_wait_for_completion
(pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for
completion
[mpiexec at hw4-125] main (ui/mpich/mpiexec.c:340): process manager error
waiting for completion
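
If it would help, I can rerun with MVAPICH2's backtrace reporting enabled
(assuming the build carries debug symbols), e.g.:

mpiexec -n 2 -hosts hw4-125,hw4-124 -env MV2_USE_CUDA 1 -env MV2_DEBUG_SHOW_BACKTRACE 1 ./testp2p

so the faulting rank prints a backtrace alongside the error above.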


I would greatly appreciate your assistance in troubleshooting this issue.
Any insights or guidance you could provide would be immensely helpful.

Thank you for your attention to this matter. Looking forward to your prompt
response.

Best regards,

John




-----Original Message-----
From: Mvapich <mvapich-bounces+runweicheng=gmail.com at lists.osu.edu> On
Behalf Of Panda, Dhabaleswar K. via Mvapich
Sent: March 11, 2024 5:42 PM
To: mug-conf at lists.osu.edu
Cc: mvapich-discuss at lists.osu.edu; mvapich at lists.osu.edu
Subject: [Mvapich] Save the Dates for MUG '24 Conference

We are happy to announce that the 12th annual MVAPICH User Group (MUG)
conference will take place in Columbus, OH, USA during August 19-21, 2024.
It will be an in-person event with an option for remote attendance.

Please save the dates and stay tuned for future announcements!!

More details on the conference are available from
http://mug.mvapich.cse.ohio-state.edu/

Thanks,

The MUG '24 Organizers

PS: Interested in getting announcements related to the MUG events? Please
subscribe to the MUG Conference Mailing list (available from the MUG
conference page).
_______________________________________________
Mvapich mailing list
Mvapich at lists.osu.edu
https://lists.osu.edu/mailman/listinfo/mvapich



