[mvapich-discuss] Multi-rail communication between GPU and non-GPU nodes

Khuvis, Samuel skhuvis at osc.edu
Wed Oct 10 10:14:20 EDT 2018


Hi,

We are running into issues with multi-rail communication between GPU and non-GPU nodes on Pitzer.

Testing with the OSU bandwidth benchmark, a run between a GPU node and a non-GPU node segfaults unless MV2_IBA_HCA=mlx5_0 is set. However, restricting the run to that single HCA yields only about half of the bandwidth we see between two GPU nodes with multi-rail. What settings would allow full bandwidth between GPU and non-GPU nodes?

On 2 GPU nodes:
$ MV2_IBA_HCA= mpiexec -ppn 1 -np 2 -f config $prog -m 4194304:4194304
# OSU MPI Bandwidth Test v5.4.3
# Size      Bandwidth (MB/s)
4194304             12359.86

$ MV2_IBA_HCA=mlx5_0 mpiexec -ppn 1 -np 2 -f config $prog -m 4194304:4194304
# OSU MPI Bandwidth Test v5.4.3
# Size      Bandwidth (MB/s)
4194304              6898.83

On 1 GPU and 1 non-GPU node:
$ MV2_IBA_HCA= mpiexec -ppn 1 -np 2 -f config $prog -m 4194304:4194304
# OSU MPI Bandwidth Test v5.4.3
# Size      Bandwidth (MB/s)
[p0237.ten.osc.edu:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 302424 RUNNING AT p0237
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@p0026.ten.osc.edu] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911): assert (!closed) failed
[proxy:0:0@p0026.ten.osc.edu] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@p0026.ten.osc.edu] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

$ MV2_IBA_HCA=mlx5_0 mpiexec -ppn 1 -np 2 -f config $prog -m 4194304:4194304
# OSU MPI Bandwidth Test v5.4.3
# Size      Bandwidth (MB/s)
4194304              7102.31
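
For reference, the single-rail penalty can be computed directly from the osu_bw output above. This is a minimal sketch, not part of the benchmark suite: the helper names bw_at_size and rail_ratio are hypothetical, and it assumes the two-column "size bandwidth" layout shown in the runs above.

```shell
# Hypothetical helper: read osu_bw output on stdin and print the bandwidth
# (MB/s) reported for the message size given as $1. Header lines start with
# '#', so their first field never matches a numeric size.
bw_at_size() {
    awk -v size="$1" '$1 == size { print $2; exit }'
}

# Hypothetical helper: print the single-rail bandwidth ($1) as a percentage
# of the multi-rail bandwidth ($2), rounded to one decimal place.
rail_ratio() {
    awk -v single="$1" -v multi="$2" 'BEGIN { printf "%.1f\n", 100 * single / multi }'
}

# Figures from the two runs on 2 GPU nodes.
rail_ratio 6898.83 12359.86   # prints 55.8
```

With saved output, the same comparison would look like `rail_ratio "$(bw_at_size 4194304 < single.txt)" "$(bw_at_size 4194304 < multi.txt)"`, where single.txt and multi.txt are assumed file names for the two captured runs.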

Thanks,
Samuel Khuvis
Scientific Applications Engineer
Ohio Supercomputer Center (OSC) <https://osc.edu/>
A member of the Ohio Technology Consortium <https://oh-tech.org/>
1224 Kinnear Road, Columbus, Ohio 43212
Office: (614) 292-5178 • Fax: (614) 292-7168
