[Mvapich-discuss] Assistance Needed: Inter-Node GPU Data Exchange Error
Panda, Dhabaleswar
panda at cse.ohio-state.edu
Wed May 1 22:46:34 EDT 2024
Not sure which version of MVAPICH2 you are using. There is no release from the MVAPICH team named MVAPICH2 3.2.1 :-(
For GPU clusters, please use the latest MVAPICH-Plus 3.0GA version.
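Note also that the MV2_* runtime parameters from MVAPICH2 carry an MVP_ prefix in the 3.x series. Assuming the rename applies to the CUDA toggle as well (please verify the exact parameter name against the MVAPICH-Plus userguide), your run command would become something like:
mpiexec -n 2 -hosts hw4-125,hw4-124 -env MVP_USE_CUDA 1 ./testp2p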
DK
________________________________________
From: john <runweicheng at gmail.com>
Sent: Wednesday, May 1, 2024 9:48 PM
To: Panda, Dhabaleswar; mug-conf at lists.osu.edu
Cc: mvapich-discuss at lists.osu.edu; mvapich at lists.osu.edu
Subject: Assistance Needed: Inter-Node GPU Data Exchange Error
Dear Prof. Panda,
I hope this email finds you well. I am reaching out regarding an issue I've
encountered while working on our Linux cluster.
Our setup is a Linux cluster in which each node is equipped with two GPU
accelerators, and the nodes are interconnected via Ethernet. We are using
MVAPICH2 version 3.2.1 built with the configure options "--with-rdma=gen2"
and "--enable-cuda".
Exchanging data between GPU memories within the same node works smoothly,
but errors arise when attempting to exchange data between GPUs on two
different nodes.
1. Test Case 1: Testing GPU-to-GPU data exchange between the two GPUs on
Node 125 yielded successful results, as shown below:
mpiexec -n 2 -hosts hw4-125,hw4-125 -env MV2_USE_CUDA 1 ./testp2p
# Size Bandwidth (MB/s)
1024 16.77
2048 218.80
4096 451.86
8192 910.31
16384 1118.03
32768 2081.15
65536 3386.86
131072 5490.97
262144 7741.95
524288 9630.69
1048576 9826.39
2097152 10349.62
4194304 10699.22
2. Test Case 2: Attempting to exchange data between GPU memories across
Nodes 124 and 125 crashes with a segmentation fault:
mpiexec -n 2 -hosts hw4-125,hw4-124 -env MV2_USE_CUDA 1 ./testp2p
# Size Bandwidth (MB/s)
1024 54.89
2048 155.74
4096 271.98
8192 461.67
[hw4-125:mpi_rankaaaa][error_sighandler] Caught error: Segmentation fault
(signal 11)
[hw4-124:mpi_rankaaaa][error_sighandler] Caught error: Segmentation fault
(signal 11)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 211079 RUNNING AT hw4-124
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0 at hw4-125] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911): assert (!closed) failed
[proxy:0:0 at hw4-125] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0 at hw4-125] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
[mpiexec at hw4-125] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec at hw4-125] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at hw4-125] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec at hw4-125] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion
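For context, testp2p is a simple ping-pong bandwidth test over GPU device buffers, passing cudaMalloc'd pointers directly to MPI. A minimal sketch of the pattern it exercises is below (an illustrative reconstruction, not the actual source; device selection, message sizes, and iteration counts are simplified):

/* Sketch of a CUDA-aware MPI ping-pong bandwidth test (illustrative,
 * not the real testp2p source).
 * Build: mpicc p2p_sketch.c -o p2p_sketch -lcudart
 * Run:   mpiexec -n 2 -hosts hw4-125,hw4-124 -env MV2_USE_CUDA 1 ./p2p_sketch */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Spread ranks across the GPUs visible on each node. */
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(ndev > 0 ? rank % ndev : 0);

    const size_t max_size = 4 * 1024 * 1024;  /* 4 MB, as in the runs above */
    const int iters = 100;
    char *buf;
    cudaMalloc((void **)&buf, max_size);

    for (size_t size = 1024; size <= max_size; size *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                /* Device pointers passed straight to MPI: this requires a
                 * CUDA-aware build (--enable-cuda) plus MV2_USE_CUDA=1. */
                MPI_Send(buf, (int)size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, (int)size, MPI_CHAR, 1, 1, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, (int)size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, (int)size, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0) {
            /* 2 * size bytes move per ping-pong iteration. */
            double mbytes = 2.0 * (double)size * iters / 1e6;
            printf("%10zu %12.2f\n", size, mbytes / (t1 - t0));
        }
    }

    cudaFree(buf);
    MPI_Finalize();
    return 0;
}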
I would greatly appreciate your assistance in troubleshooting this issue;
any insights or guidance you could provide would be very helpful.
Thank you for your attention to this matter. I look forward to your
response.
Best regards
John
-----Original Message-----
From: Mvapich <mvapich-bounces+runweicheng=gmail.com at lists.osu.edu> On
Behalf Of Panda, Dhabaleswar K. via Mvapich
Sent: March 11, 2024 5:42 PM
To: mug-conf at lists.osu.edu
Cc: mvapich-discuss at lists.osu.edu; mvapich at lists.osu.edu
Subject: [Mvapich] Save the Dates for MUG '24 Conference
We are happy to announce that the 12th annual MVAPICH User Group (MUG)
conference will take place in Columbus, OH, USA during August 19-21, 2024.
It will be an in-person event with an option for remote attendance.
Please save the dates and stay tuned for future announcements!!
More details on the conference are available from
http://mug.mvapich.cse.ohio-state.edu/
Thanks,
The MUG '24 Organizers
PS: Interested in getting announcements related to the MUG events? Please
subscribe to the MUG Conference Mailing list (available from the MUG
conference page).
_______________________________________________
Mvapich mailing list
Mvapich at lists.osu.edu
https://lists.osu.edu/mailman/listinfo/mvapich