[mvapich-discuss] Multi-rail communication between GPU and non-GPU nodes

Khuvis, Samuel skhuvis at osc.edu
Wed Oct 10 11:50:10 EDT 2018


Sure, here is all of the information:


  1.  Version of MVAPICH: 2.3
  2.  CUDA 9.2.88 and Intel compiler 17.0.7
  3.  The GPU nodes have 2 HCAs and 2 GPUs. The non-GPU nodes have 1 HCA and 0 GPUs.
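
In case it helps, this is roughly how the per-node device counts can be confirmed (just a sketch, assuming the standard ibverbs utilities and NVIDIA driver are installed; output omitted):

$ ibv_devices                               # HCAs visible on the node
$ nvidia-smi -L                             # GPUs on the node (GPU nodes only)
$ lspci | grep -i -e mellanox -e nvidia     # PCIe view of HCAs and GPUs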

--
Samuel Khuvis
Scientific Applications Engineer
Ohio Supercomputer Center (OSC) <https://osc.edu/>
A member of the Ohio Technology Consortium <https://oh-tech.org/>
1224 Kinnear Road, Columbus, Ohio 43212
Office: (614) 292-5178 • Fax: (614) 292-7168


From: "Subramoni, Hari" <subramoni.1 at osu.edu>
Date: Wednesday, October 10, 2018 at 11:05 AM
To: "Khuvis, Samuel" <skhuvis at osc.edu>, "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at mailman.cse.ohio-state.edu>
Cc: "Subramoni, Hari" <subramoni.1 at osu.edu>
Subject: RE: [mvapich-discuss] Multi-rail communication between GPU and non-GPU nodes

Hi, Sam.

Sorry to hear that you’re facing issues.

My guess is that this is caused by the non-uniform number of HCAs on the GPU and non-GPU nodes.

Since we don’t have access to Pitzer yet, could you please send us the following information?


  1.  The version of MVAPICH you’re using, along with its build configuration
  2.  CUDA and compiler versions
  3.  How many HCAs and GPUs the nodes have
  4.  Output of lspci -tv
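
If it is easier to script, commands along these lines usually capture most of the above (purely a sketch; it assumes MVAPICH2's mpiname utility and the usual CUDA/OFED tools are on the PATH):

$ mpiname -a                                   # MVAPICH2 version and configure options
$ nvcc --version                               # CUDA toolkit version
$ mpicc -v                                     # underlying compiler version
$ ibv_devinfo | grep -e hca_id -e link_layer   # HCAs per node
$ nvidia-smi -L                                # GPUs per node
$ lspci -tv > lspci.log                        # PCIe topology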

Regards,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu On Behalf Of Khuvis, Samuel
Sent: Wednesday, October 10, 2018 10:14 AM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] Multi-rail communication between GPU and non-GPU nodes

Hi,

We are running into issues with multi-rail communication between GPU and non-GPU nodes on Pitzer.

Testing with the OSU bandwidth benchmark, we get a segfault unless MV2_IBA_HCA=mlx5_0 is set. However, restricting traffic to a single HCA roughly halves the achievable bandwidth (about 6.9 GB/s versus 12.4 GB/s in the runs below). What settings would allow full bandwidth between GPU and non-GPU nodes?
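
For reference, the multi-rail parameters that appear relevant from the MVAPICH2 user guide are MV2_IBA_HCA (colon-separated HCA list) and MV2_RAIL_SHARING_POLICY. The invocation below is only illustrative, and the name of the second HCA on the GPU nodes is a guess on our part:

$ MV2_IBA_HCA=mlx5_0:mlx5_1 MV2_RAIL_SHARING_POLICY=ROUND_ROBIN mpiexec -ppn 1 -np 2 -f config $prog -m 4194304:4194304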

On 2 GPU nodes:
$ MV2_IBA_HCA= mpiexec -ppn 1 -np 2 -f config $prog -m 4194304:4194304
# OSU MPI Bandwidth Test v5.4.3
# Size      Bandwidth (MB/s)
4194304             12359.86

$ MV2_IBA_HCA=mlx5_0 mpiexec -ppn 1 -np 2 -f config $prog -m 4194304:4194304
# OSU MPI Bandwidth Test v5.4.3
# Size      Bandwidth (MB/s)
4194304              6898.83

On 1 GPU and 1 non-GPU node:
$ MV2_IBA_HCA= mpiexec -ppn 1 -np 2 -f config $prog -m 4194304:4194304
# OSU MPI Bandwidth Test v5.4.3
# Size      Bandwidth (MB/s)
[p0237.ten.osc.edu:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 302424 RUNNING AT p0237
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0 at p0026.ten.osc.edu] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911): assert (!closed) failed
[proxy:0:0 at p0026.ten.osc.edu] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0 at p0026.ten.osc.edu] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

$ MV2_IBA_HCA=mlx5_0 mpiexec -ppn 1 -np 2 -f config $prog -m 4194304:4194304
# OSU MPI Bandwidth Test v5.4.3
# Size      Bandwidth (MB/s)
4194304              7102.31
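
For completeness, the config file passed to -f is just a plain hostfile with one hostname per line; for the mixed run above it would contain something like:

p0026
p0237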

Thanks,
Samuel Khuvis
Scientific Applications Engineer
Ohio Supercomputer Center (OSC) <https://osc.edu/>
A member of the Ohio Technology Consortium <https://oh-tech.org/>
1224 Kinnear Road, Columbus, Ohio 43212
Office: (614) 292-5178 • Fax: (614) 292-7168

[Attachment: lspci.log, application/octet-stream, 16698 bytes, <http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20181010/97fb0afd/attachment-0001.obj>]

