[mvapich-discuss] Job failures when running Verbs build of MVAPICH2 v2.2 on 2+ nodes using PSM hardware

Westlund, John A john.a.westlund at intel.com
Mon Oct 17 17:17:38 EDT 2016


If I install a "Verbs" build of MVAPICH2 on a PSM system, any MPI job that needs to communicate between nodes fails:

   [prun] Launch cmd = mpiexec.hydra -bootstrap slurm ./bin/xhpcg.gnu.mvapich2 32 32 32 10
   [c2:mpi_rank_8][rdma_find_network_type] QLogic IB card detected in system
   [c2:mpi_rank_8][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c1:mpi_rank_1][rdma_find_network_type] QLogic IB card detected in system
   [c1:mpi_rank_1][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c2:mpi_rank_10][rdma_find_network_type] QLogic IB card detected in system
   [c2:mpi_rank_10][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c1:mpi_rank_0][rdma_find_network_type] QLogic IB card detected in system
   [c1:mpi_rank_0][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c2:mpi_rank_11][rdma_find_network_type] QLogic IB card detected in system
   [c2:mpi_rank_11][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c1:mpi_rank_3][rdma_find_network_type] QLogic IB card detected in system
   [c1:mpi_rank_3][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c2:mpi_rank_12][rdma_find_network_type] QLogic IB card detected in system
   [c2:mpi_rank_12][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c4:mpi_rank_30][rdma_find_network_type] QLogic IB card detected in system
   [c4:mpi_rank_30][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c4:mpi_rank_29][rdma_find_network_type] QLogic IB card detected in system
   [c4:mpi_rank_29][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c1:mpi_rank_2][rdma_find_network_type] QLogic IB card detected in system
   [c1:mpi_rank_2][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c2:mpi_rank_9][rdma_find_network_type] QLogic IB card detected in system
   [c2:mpi_rank_9][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c4:mpi_rank_25][rdma_find_network_type] QLogic IB card detected in system
   [c4:mpi_rank_25][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c2:mpi_rank_13][rdma_find_network_type] QLogic IB card detected in system
   [c2:mpi_rank_13][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c1:mpi_rank_4][rdma_find_network_type] QLogic IB card detected in system
   [c1:mpi_rank_4][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c3:mpi_rank_21][rdma_find_network_type] QLogic IB card detected in system
   [c3:mpi_rank_21][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c4:mpi_rank_24][rdma_find_network_type] QLogic IB card detected in system
   [c4:mpi_rank_24][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c2:mpi_rank_14][rdma_find_network_type] QLogic IB card detected in system
   [c2:mpi_rank_14][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c1:mpi_rank_7][rdma_find_network_type] QLogic IB card detected in system
   [c1:mpi_rank_7][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c3:mpi_rank_20][rdma_find_network_type] QLogic IB card detected in system
   [c3:mpi_rank_20][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c4:mpi_rank_26][rdma_find_network_type] QLogic IB card detected in system
   [c4:mpi_rank_26][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c1:mpi_rank_5][rdma_find_network_type] QLogic IB card detected in system
   [c1:mpi_rank_5][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c2:mpi_rank_15][rdma_find_network_type] QLogic IB card detected in system
   [c2:mpi_rank_15][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c3:mpi_rank_19][rdma_find_network_type] QLogic IB card detected in system
   [c3:mpi_rank_19][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c4:mpi_rank_28][rdma_find_network_type] QLogic IB card detected in system
   [c4:mpi_rank_28][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c1:mpi_rank_6][rdma_find_network_type] QLogic IB card detected in system
   [c1:mpi_rank_6][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c3:mpi_rank_23][rdma_find_network_type] QLogic IB card detected in system
   [c3:mpi_rank_23][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c4:mpi_rank_31][rdma_find_network_type] QLogic IB card detected in system
   [c4:mpi_rank_31][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c3:mpi_rank_17][rdma_find_network_type] QLogic IB card detected in system
   [c3:mpi_rank_17][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c4:mpi_rank_27][rdma_find_network_type] QLogic IB card detected in system
   [c4:mpi_rank_27][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c3:mpi_rank_18][rdma_find_network_type] QLogic IB card detected in system
   [c3:mpi_rank_18][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c3:mpi_rank_22][rdma_find_network_type] QLogic IB card detected in system
   [c3:mpi_rank_22][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c3:mpi_rank_16][rdma_find_network_type] QLogic IB card detected in system
   [c3:mpi_rank_16][rdma_find_network_type] Please re-configure the library with the '--with-device=ch3:psm' configure option for best performance
   [c1:mpi_rank_0][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c1:mpi_rank_1][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c1:mpi_rank_2][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c1:mpi_rank_3][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c1:mpi_rank_4][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c1:mpi_rank_5][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c1:mpi_rank_6][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c1:mpi_rank_7][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c4:mpi_rank_24][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c4:mpi_rank_26][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c4:mpi_rank_27][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c4:mpi_rank_28][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c4:mpi_rank_29][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c4:mpi_rank_30][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c4:mpi_rank_31][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c4:mpi_rank_25][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c3:mpi_rank_16][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c3:mpi_rank_17][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c3:mpi_rank_18][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c3:mpi_rank_19][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c3:mpi_rank_20][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c3:mpi_rank_21][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c3:mpi_rank_22][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c3:mpi_rank_23][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c2:mpi_rank_8][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c2:mpi_rank_9][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c2:mpi_rank_10][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c2:mpi_rank_11][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c2:mpi_rank_12][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c2:mpi_rank_13][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c2:mpi_rank_14][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c2:mpi_rank_15][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)

   ===================================================================================
   =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
   =   PID 74651 RUNNING AT c4
   =   EXIT CODE: 255
   =   CLEANING UP REMAINING PROCESSES
   =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
   ===================================================================================
   [proxy:0:0@c1] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
   [proxy:0:0@c1] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
   [proxy:0:0@c1] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
   [proxy:0:2@c3] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
   [proxy:0:2@c3] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
   [proxy:0:2@c3] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
   [proxy:0:1@c2] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:909): assert (!closed) failed
   [proxy:0:1@c2] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
   [proxy:0:1@c2] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
   srun: error: c3: task 2: Exited with exit code 7
   srun: error: c2: task 1: Exited with exit code 7
   srun: error: c1: task 0: Exited with exit code 7
   [mpiexec@c1] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
   [mpiexec@c1] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
   [mpiexec@c1] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
   [mpiexec@c1] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
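
To check that the failure is uniform rather than confined to one node, the output can be tallied per node. The following is only a minimal Python sketch that assumes the log line format shown above; the function name failed_ranks_by_node and the shortened sample text are illustrative and not part of MVAPICH2.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   [c1:mpi_rank_0][cm_qp_move_to_rtr] ...: Failed to modify QP to RTR
FAILURE = re.compile(
    r"\[(?P<node>[^:\]]+):mpi_rank_(?P<rank>\d+)\]"
    r"\[cm_qp_move_to_rtr\].*Failed to modify QP to RTR"
)


def failed_ranks_by_node(log_text):
    """Return {node: sorted ranks} for ranks that failed the move to RTR."""
    failures = defaultdict(set)
    for line in log_text.splitlines():
        match = FAILURE.search(line)
        if match:
            failures[match.group("node")].add(int(match.group("rank")))
    return {node: sorted(ranks) for node, ranks in sorted(failures.items())}


# Shortened, illustrative excerpt of the job output (not the full log).
sample = """\
   [c1:mpi_rank_0][rdma_find_network_type] QLogic IB card detected in system
   [c1:mpi_rank_0][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
   [c2:mpi_rank_8][cm_qp_move_to_rtr] src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c:2172: Failed to modify QP to RTR
   : Invalid argument (22)
"""

for node, ranks in failed_ranks_by_node(sample).items():
    print(f"{node}: {len(ranks)} failed rank(s): {ranks}")
```

In the full output above, all 32 ranks on c1 through c4 report the same "Failed to modify QP to RTR" error, so the problem is not specific to one node.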

I'm used to the QLogic warnings, but previously the job would still run.

Thoughts?

John
