[mvapich-discuss] recommended settings for heterogeneous multi-rail support?

Hari Subramoni subramoni.1 at osu.edu
Thu Mar 16 17:48:11 EDT 2017


Hi Rick,

Unfortunately, at this point, we don't have a method to achieve this in
MVAPICH2.

Let me try to see if I can modify the code to achieve this.

Regards,
Hari.

On Mar 16, 2017 5:32 PM, "Rick Warner" <rick at microway.com> wrote:

> Hi All,
>
> I'm working with a cluster that has one ConnectX-3 HCA in each of 9 out of 10
> compute nodes, while the 10th node has 2 HCAs installed (plus GPUs).
>
> What is the recommended way to make use of both HCAs on that one node? If I
> run an MPI job without specifying anything about the HCAs, it fails like
> this:
>
> [microway@athena-int ~]$ mpirun -np 10 --machinefile /etc/nodes ./cpi-mvapich
> Process 7 of 10 on athena-7
> Process 1 of 10 on athena-1
> Process 0 of 10 on athena-int
> Process 4 of 10 on athena-4
> Process 8 of 10 on athena-8
> Process 6 of 10 on athena-6
> Process 5 of 10 on athena-5
> Process 3 of 10 on athena-3
> Process 2 of 10 on athena-2
> Process 9 of 10 on athena-gpu-1
> [athena-gpu-1:mpi_rank_9][error_sighandler] Caught error: Segmentation fault (signal 11)
>
> ===================================================================================
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> =   PID 7621 RUNNING AT athena-gpu-1
> =   EXIT CODE: 139
> =   CLEANING UP REMAINING PROCESSES
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> [proxy:0:0@athena-int] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
> [proxy:0:0@athena-int] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:0@athena-int] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
> [proxy:0:5@athena-5] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
> [proxy:0:5@athena-5] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:5@athena-5] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
> [proxy:0:1@athena-1] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
> [proxy:0:1@athena-1] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:1@athena-1] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
> [proxy:0:2@athena-2] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
> [proxy:0:2@athena-2] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:2@athena-2] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
> [proxy:0:3@athena-3] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
> [proxy:0:3@athena-3] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:3@athena-3] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
> [proxy:0:4@athena-4] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
> [proxy:0:4@athena-4] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:4@athena-4] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
> [proxy:0:6@athena-6] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
> [proxy:0:6@athena-6] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:6@athena-6] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
> [proxy:0:7@athena-7] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
> [proxy:0:7@athena-7] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:7@athena-7] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
> [proxy:0:8@athena-8] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
> [proxy:0:8@athena-8] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:8@athena-8] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
> [mpiexec@athena-int] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
> [mpiexec@athena-int] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> [mpiexec@athena-int] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
> [mpiexec@athena-int] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
>
>
> athena-int is the master node. athena-1 through athena-8 are regular compute
> nodes, and athena-gpu-1 is the system with 2 IB cards since it has GPUs (1 IB
> card per CPU for direct IB->GPU transfer support; we plan to add more GPU
> systems later to use GPUDirect).
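>
> To double-check which HCA device names each node actually exposes, something
> like this can be run from the master (a rough sketch: it assumes ibv_devinfo
> from libibverbs is installed on every node and that passwordless ssh works):
>
> # List the InfiniBand device names reported by each node.
> for h in athena-int athena-{1..8} athena-gpu-1; do
>     ssh "$h" 'hostname; ibv_devinfo -l'
> done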
>
> If I force MVAPICH2 to use only the first HCA, it works fine:
>
> [microway@athena-int ~]$ mpirun -genv MV2_IBA_HCA mlx4_0 -np 10 --machinefile /etc/nodes ./cpi-mvapich
> Process 6 of 10 on athena-6
> Process 7 of 10 on athena-7
> Process 1 of 10 on athena-1
> Process 2 of 10 on athena-2
> Process 4 of 10 on athena-4
> Process 3 of 10 on athena-3
> Process 5 of 10 on athena-5
> Process 8 of 10 on athena-8
> Process 0 of 10 on athena-int
> Process 9 of 10 on athena-gpu-1
> pi is approximately 3.1415926544231256, Error is 0.0000000008333325
> wall clock time = 0.022811
>
>
> I've played around with various MV2_* multi-rail settings (e.g., something
> roughly like the sketch below) but have not had any luck. What is the
> recommended way to configure and use a setup like this?
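>
> For reference, a sketch of the kind of multi-rail settings I mean, based on
> the parameters documented in the MVAPICH2 user guide (MV2_NUM_HCAS and a
> colon-separated MV2_IBA_HCA list). mlx4_1 as the name of the second HCA on
> athena-gpu-1 is an assumption, and these parameters appear to expect the same
> HCA layout on every node:
>
> # Rough sketch only, not a working configuration for this heterogeneous case;
> # mlx4_0 and mlx4_1 are assumed device names.
> mpirun -genv MV2_NUM_HCAS 2 -genv MV2_IBA_HCA mlx4_0:mlx4_1 \
>     -np 10 --machinefile /etc/nodes ./cpi-mvapich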
>
> Thanks,
> Rick