[mvapich-discuss] Job failures when running Verbs build of MVAPICH2 v2.2 on 2+ nodes using PSM hardware

Hari Subramoni subramoni.1 at osu.edu
Mon Oct 17 18:02:42 EDT 2016


Hi John,

This failure is caused by a feature we added to the OFA-IB channel that
OFA-PSM does not support.

For performance reasons, we recommend not using an OFA-IB build for a
system with QLogic HCAs. Is there any reason why you want to use this
combination in particular?

Regards,
Hari.
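
A minimal sketch for triaging the quoted log below: it groups the ranks on each
host by the MVAPICH2 function that printed the message, so the harmless
rdma_find_network_type warnings can be told apart from the fatal
cm_qp_move_to_rtr errors. The regular expression and the helper name
ranks_by_function are illustrative assumptions based only on the line format
seen in this thread; they are not part of MVAPICH2.

```python
import re
from collections import defaultdict

# Matches diagnostic prefixes such as "[c1:mpi_rank_0][cm_qp_move_to_rtr]".
# The pattern is inferred from the log in this thread (an assumption).
PREFIX = re.compile(
    r"\[(?P<host>[^:\]\s]+):mpi_rank_(?P<rank>\d+)\]\[(?P<func>\w+)\]"
)

def ranks_by_function(log_text):
    """Return {function: {host: sorted ranks}} for every tagged log line."""
    seen = defaultdict(lambda: defaultdict(set))
    for m in PREFIX.finditer(log_text):
        seen[m["func"]][m["host"]].add(int(m["rank"]))
    return {f: {h: sorted(r) for h, r in hosts.items()}
            for f, hosts in seen.items()}

# A few lines copied from the quoted log, including the "> " quoting.
sample = """\
>    [c2:mpi_rank_8][rdma_find_network_type] QLogic IB card detected in
> system
>    [c1:mpi_rank_0][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172:
> Failed to modify QP to RTR
>    [c2:mpi_rank_8][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172:
> Failed to modify QP to RTR
"""

summary = ranks_by_function(sample)
print(summary["cm_qp_move_to_rtr"])
```

Run against the full log, this should show whether every rank on every node
reports the cm_qp_move_to_rtr failure, or only a subset of them.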

On Mon, Oct 17, 2016 at 5:17 PM, Westlund, John A <john.a.westlund at intel.com> wrote:

> If I install a “Verbs” build of MVAPICH2 on a PSM system, any MPI job that
> needs to communicate between nodes fails:
>
>
>
>    [prun] Launch cmd = mpiexec.hydra -bootstrap slurm ./bin/xhpcg.gnu.mvapich2 32 32 32 10
>    [c2:mpi_rank_8][rdma_find_network_type] QLogic IB card detected in system
>    [c2:mpi_rank_8][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c1:mpi_rank_1][rdma_find_network_type] QLogic IB card detected in system
>    [c1:mpi_rank_1][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c2:mpi_rank_10][rdma_find_network_type] QLogic IB card detected in system
>    [c2:mpi_rank_10][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c1:mpi_rank_0][rdma_find_network_type] QLogic IB card detected in system
>    [c1:mpi_rank_0][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c2:mpi_rank_11][rdma_find_network_type] QLogic IB card detected in system
>    [c2:mpi_rank_11][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c1:mpi_rank_3][rdma_find_network_type] QLogic IB card detected in system
>    [c1:mpi_rank_3][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c2:mpi_rank_12][rdma_find_network_type] QLogic IB card detected in system
>    [c2:mpi_rank_12][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c4:mpi_rank_30][rdma_find_network_type] QLogic IB card detected in system
>    [c4:mpi_rank_30][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c4:mpi_rank_29][rdma_find_network_type] QLogic IB card detected in system
>    [c4:mpi_rank_29][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c1:mpi_rank_2][rdma_find_network_type] QLogic IB card detected in system
>    [c1:mpi_rank_2][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c2:mpi_rank_9][rdma_find_network_type] QLogic IB card detected in system
>    [c2:mpi_rank_9][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c4:mpi_rank_25][rdma_find_network_type] QLogic IB card detected in system
>    [c4:mpi_rank_25][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c2:mpi_rank_13][rdma_find_network_type] QLogic IB card detected in system
>    [c2:mpi_rank_13][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c1:mpi_rank_4][rdma_find_network_type] QLogic IB card detected in system
>    [c1:mpi_rank_4][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c3:mpi_rank_21][rdma_find_network_type] QLogic IB card detected in system
>    [c3:mpi_rank_21][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c4:mpi_rank_24][rdma_find_network_type] QLogic IB card detected in system
>    [c4:mpi_rank_24][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c2:mpi_rank_14][rdma_find_network_type] QLogic IB card detected in system
>    [c2:mpi_rank_14][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c1:mpi_rank_7][rdma_find_network_type] QLogic IB card detected in system
>    [c1:mpi_rank_7][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c3:mpi_rank_20][rdma_find_network_type] QLogic IB card detected in system
>    [c3:mpi_rank_20][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c4:mpi_rank_26][rdma_find_network_type] QLogic IB card detected in system
>    [c4:mpi_rank_26][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c1:mpi_rank_5][rdma_find_network_type] QLogic IB card detected in system
>    [c1:mpi_rank_5][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c2:mpi_rank_15][rdma_find_network_type] QLogic IB card detected in system
>    [c2:mpi_rank_15][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c3:mpi_rank_19][rdma_find_network_type] QLogic IB card detected in system
>    [c3:mpi_rank_19][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c4:mpi_rank_28][rdma_find_network_type] QLogic IB card detected in system
>    [c4:mpi_rank_28][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c1:mpi_rank_6][rdma_find_network_type] QLogic IB card detected in system
>    [c1:mpi_rank_6][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c3:mpi_rank_23][rdma_find_network_type] QLogic IB card detected in system
>    [c3:mpi_rank_23][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c4:mpi_rank_31][rdma_find_network_type] QLogic IB card detected in system
>    [c4:mpi_rank_31][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c3:mpi_rank_17][rdma_find_network_type] QLogic IB card detected in system
>    [c3:mpi_rank_17][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c4:mpi_rank_27][rdma_find_network_type] QLogic IB card detected in system
>    [c4:mpi_rank_27][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c3:mpi_rank_18][rdma_find_network_type] QLogic IB card detected in system
>    [c3:mpi_rank_18][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c3:mpi_rank_22][rdma_find_network_type] QLogic IB card detected in system
>    [c3:mpi_rank_22][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>    [c3:mpi_rank_16][rdma_find_network_type] QLogic IB card detected in system
>    [c3:mpi_rank_16][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
>
>    [c1:mpi_rank_0][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c1:mpi_rank_1][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c1:mpi_rank_2][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c1:mpi_rank_3][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c1:mpi_rank_4][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c1:mpi_rank_5][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c1:mpi_rank_6][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c1:mpi_rank_7][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c4:mpi_rank_24][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c4:mpi_rank_26][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c4:mpi_rank_27][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c4:mpi_rank_28][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c4:mpi_rank_29][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c4:mpi_rank_30][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c4:mpi_rank_31][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c4:mpi_rank_25][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c3:mpi_rank_16][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c3:mpi_rank_17][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c3:mpi_rank_18][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c3:mpi_rank_19][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c3:mpi_rank_20][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c3:mpi_rank_21][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c3:mpi_rank_22][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c3:mpi_rank_23][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c2:mpi_rank_8][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c2:mpi_rank_9][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c2:mpi_rank_10][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c2:mpi_rank_11][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c2:mpi_rank_12][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c2:mpi_rank_13][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c2:mpi_rank_14][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>    [c2:mpi_rank_15][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
>    : Invalid argument (22)
>
>
>
>    ===================================================================================
>    =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>    =   PID 74651 RUNNING AT c4
>    =   EXIT CODE: 255
>    =   CLEANING UP REMAINING PROCESSES
>    =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>    ===================================================================================
>    [proxy:0:0@c1] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
>    [proxy:0:0@c1] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
>    [proxy:0:0@c1] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
>    [proxy:0:2@c3] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
>    [proxy:0:2@c3] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
>    [proxy:0:2@c3] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
>    [proxy:0:1@c2] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
>    [proxy:0:1@c2] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
>    [proxy:0:1@c2] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
>    srun: error: c3: task 2: Exited with exit code 7
>    srun: error: c2: task 1: Exited with exit code 7
>    srun: error: c1: task 0: Exited with exit code 7
>    [mpiexec@c1] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
>    [mpiexec@c1] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
>    [mpiexec@c1] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
>    [mpiexec@c1] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
>
>
>
> I’m used to the QLogic warnings -- but previously the job would still run.
>
>
>
> Thoughts?
>
>
>
> John