[mvapich-discuss] Multi-rail communication between GPU and non-GPU nodes

Subramoni, Hari subramoni.1 at osu.edu
Wed Oct 10 11:05:48 EDT 2018


Hi, Sam.

Sorry to hear that you’re facing issues.

My guess is that this is due to non-uniformity in the number of HCAs on the GPU and non-GPU nodes.
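
A quick way to confirm that is to compare the HCA list on one node of each type; assuming the standard InfiniBand verbs utilities are installed, something like:

$ ibv_devinfo -l    # lists the HCAs visible to libibverbs on this node
$ ibstat            # per-HCA port state and link rate

If the GPU nodes report two HCAs and the non-GPU nodes only one, that would explain what you are seeing.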

Since we don’t have access to Pitzer yet, could you please send us the following information?


  1.  Version of MVAPICH you're using, along with the build configuration
  2.  CUDA and compiler versions
  3.  Number of HCAs and GPUs on each node type
  4.  Output of lspci -tv
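
To collect most of the above in one pass, something like this on one node of each type should do it (assuming the MVAPICH and CUDA tools are in your PATH; adjust as needed):

$ mpiname -a        # MVAPICH version and configure/build options
$ nvcc --version    # CUDA toolkit version
$ mpicc -v          # MPI wrapper and underlying compiler versions
$ nvidia-smi -L     # GPUs on the node
$ lspci -tv         # PCIe topology, including HCAs and GPUs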

Regards,
Hari.

From: mvapich-discuss-bounces at cse.ohio-state.edu On Behalf Of Khuvis, Samuel
Sent: Wednesday, October 10, 2018 10:14 AM
To: mvapich-discuss at cse.ohio-state.edu
Subject: [mvapich-discuss] Multi-rail communication between GPU and non-GPU nodes

Hi,

We are running into issues with multi-rail between GPU and non-GPU nodes on Pitzer.

Testing with the OSU bandwidth benchmark, we get a segmentation fault unless MV2_IBA_HCA=mlx5_0 is set. However, restricting runs to that single HCA delivers only about 50% of the bandwidth we see between two GPU nodes. What settings would allow full bandwidth between GPU and non-GPU nodes?
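
For reference, an explicit multi-rail selection would presumably look something like the line below, where mlx5_1 is a placeholder for the second adapter's actual name and MV2_NUM_HCAS=2 is our reading of the userguide rather than something we have verified on Pitzer:

$ MV2_IBA_HCA=mlx5_0:mlx5_1 MV2_NUM_HCAS=2 mpiexec -ppn 1 -np 2 -f config $prog -m 4194304:4194304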

On 2 GPU nodes:
$ MV2_IBA_HCA= mpiexec -ppn 1 -np 2 -f config $prog -m 4194304:4194304
# OSU MPI Bandwidth Test v5.4.3
# Size      Bandwidth (MB/s)
4194304             12359.86

$ MV2_IBA_HCA=mlx5_0 mpiexec -ppn 1 -np 2 -f config $prog -m 4194304:4194304
# OSU MPI Bandwidth Test v5.4.3
# Size      Bandwidth (MB/s)
4194304              6898.83

On 1 GPU and 1 non-GPU node:
$ MV2_IBA_HCA= mpiexec -ppn 1 -np 2 -f config $prog -m 4194304:4194304
# OSU MPI Bandwidth Test v5.4.3
# Size      Bandwidth (MB/s)
[p0237.ten.osc.edu:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 302424 RUNNING AT p0237
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0 at p0026.ten.osc.edu] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911): assert (!closed) failed
[proxy:0:0 at p0026.ten.osc.edu] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0 at p0026.ten.osc.edu] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

$ MV2_IBA_HCA=mlx5_0 mpiexec -ppn 1 -np 2 -f config $prog -m 4194304:4194304
# OSU MPI Bandwidth Test v5.4.3
# Size      Bandwidth (MB/s)
4194304              7102.31

Thanks,
Samuel Khuvis
Scientific Applications Engineer
Ohio Supercomputer Center (OSC)<https://osc.edu/>
A member of the Ohio Technology Consortium<https://oh-tech.org/>
1224 Kinnear Road, Columbus, Ohio 43212
Office: (614) 292-5178 • Fax: (614) 292-7168
