[mvapich-discuss] mvapich2-2.3b on heterogeneous cluster

Bob Soliday soliday at anl.gov
Fri Oct 6 15:10:16 EDT 2017


It still isn't working. I have been looking at the function 
rdma_open_hca. When MV2_NUM_HCAS is not set, 
rdma_multirail_usage_policy == MV2_MRAIL_BINDING is true. It gets 
the ib_dev from dev_list[mrail_user_defined_p2r_mapping], but 
mrail_user_defined_p2r_mapping is always 0 when rdma_local_id is 0. So 
when I print the device name with ibv_get_device_name(ib_dev), I 
always see mlx4_0 and never mlx5_0. This then leads to the "No active 
HCAs found on the system" error, because the same device is checked 
twice and the other one never is.
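
To make the failure mode concrete, here is a compressed sketch of what 
the selection effectively does when MV2_NUM_HCAS is unset 
(show_selected_hca and p2r_mapping are hypothetical stand-ins for the 
rdma_open_hca logic and mrail_user_defined_p2r_mapping; the verbs calls 
are standard libibverbs):

    #include <stdio.h>
    #include <infiniband/verbs.h>

    /* p2r_mapping stays 0 when rdma_local_id == 0, so this always
     * selects dev_list[0] (mlx4_0 on our new nodes) and mlx5_0 at
     * dev_list[1] is never opened. */
    static void show_selected_hca(int p2r_mapping)
    {
        int num_devices = 0;
        struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
        if (dev_list && p2r_mapping < num_devices) {
            struct ibv_device *ib_dev = dev_list[p2r_mapping];
            printf("selected HCA: %s\n", ibv_get_device_name(ib_dev));
        }
        if (dev_list)
            ibv_free_device_list(dev_list);
    }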

On 10/06/2017 09:04 AM, Subramoni, Hari wrote:
> Hello,
>
> Sorry about the delay in getting back to you.
>
> Can you please apply this patch and try again? With this, you will not have to set MV2_NUM_HCAS=2.
>
> On a different note, can you please let us know why you are setting a very high value for the on-demand threshold? This will affect the job startup time for large jobs. If you're not facing any issues, I would recommend removing it.
>
> Please let us know if you face any other issues.
>
> diff --git a/src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c b/src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c
> index 3f8d129..f590077 100644
> --- a/src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c
> +++ b/src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c
> @@ -343,11 +343,16 @@ int rdma_find_active_port(struct ibv_context *context,
>
>       for (j = 1; j <= RDMA_DEFAULT_MAX_PORTS; ++j) {
>           if ((!ibv_query_port(context, j, &port_attr)) && port_attr.state == IBV_PORT_ACTIVE) {
> -            if (likely(port_attr.lid || use_iboeth)) {
> -                DEBUG_PRINT("Active port number = %d, state = %s, lid = %d\r\n",
> -                            j,
> -                            (port_attr.state ==
> -                             IBV_PORT_ACTIVE) ? "Active" : "Not Active",
> +            /* port_attr.lid && !use_iboeth -> This is an IB device as it has
> +             * LID and user has not specified to use RoCE mode.
> +             * !port_attr.lid && use_iboeth -> This is a RoCE device as it does
> +             * not have a LID and user has specified to use RoCE mode.
> +             */
> +            if (likely((port_attr.lid && !use_iboeth) ||
> +                       (!port_attr.lid && use_iboeth))) {
> +                PRINT_DEBUG(DEBUG_INIT_verbose>0,
> +                            "Active port number = %d, state = %s, lid = %d\r\n",
> +                            j, (port_attr.state == IBV_PORT_ACTIVE) ?
> + "Active" : "Not Active",
>                               port_attr.lid);
>                   return j;
>               } else {
>
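> In other words, the new check accepts an active port exactly when LID presence matches the selected fabric type, i.e. an exclusive-or. As a standalone sketch (port_is_usable is a hypothetical name, not from the patch):
>
>     static inline int port_is_usable(const struct ibv_port_attr *attr,
>                                      int use_iboeth)
>     {
>         /* IB port: has a LID and RoCE mode was not requested.
>          * RoCE port: has no LID and RoCE mode was requested. */
>         return (attr->lid && !use_iboeth) || (!attr->lid && use_iboeth);
>     }
>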
> Thx,
> Hari.
>
> -----Original Message-----
> From: mvapich-discuss-bounces at cse.ohio-state.edu On Behalf Of Bob Soliday
> Sent: Thursday, October 5, 2017 3:10 PM
> To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
> Subject: [mvapich-discuss] mvapich2-2.3b on heterogeneous cluster
>
> We recently added 4 nodes to our cluster. The older nodes all have 1 IB
> device:
>       device                 node GUID
>       ------              ----------------
>       mlx4_0              0002c90300043e94
>
> The new nodes have 2 IB devices:
>       device                 node GUID
>       ------              ----------------
>       mlx4_0              248a070300fc15d0
>       mlx5_0              a4bf01030018c34c
>
> The mlx4_0 device on the new nodes is listed with an Ethernet link layer by ibv_devinfo. The mlx5_0 device, with the InfiniBand link layer, is the one we are using. Setting MV2_NUM_HCAS=2 seemed to solve the problem of finding the active device, and it doesn't appear to cause problems on the older nodes either.
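>
> For reference, the link layer that ibv_devinfo reports can also be queried directly with standard libibverbs calls (a minimal sketch; it assumes port 1, whereas real code would loop over all ports):
>
>     #include <stdio.h>
>     #include <infiniband/verbs.h>
>
>     int main(void)
>     {
>         int n = 0;
>         struct ibv_device **list = ibv_get_device_list(&n);
>         for (int i = 0; i < n; i++) {
>             struct ibv_context *ctx = ibv_open_device(list[i]);
>             struct ibv_port_attr attr;
>             if (ctx && !ibv_query_port(ctx, 1, &attr))
>                 printf("%s: link_layer=%s\n",
>                        ibv_get_device_name(list[i]),
>                        attr.link_layer == IBV_LINK_LAYER_ETHERNET ?
>                        "Ethernet" : "InfiniBand");
>             if (ctx)
>                 ibv_close_device(ctx);
>         }
>         if (list)
>             ibv_free_device_list(list);
>         return 0;
>     }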
>
> If I add enough nodes to the job, eventually it will crash with:
>
> [weed5.cluster:mpi_rank_5][async_thread] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1112: Got FATAL event 3
> [weed7.cluster:mpi_rank_7][handle_cqe] Send desc error in msg to 5, wc_opcode=0
> [weed7.cluster:mpi_rank_7][handle_cqe] Msg from 5: wc.status=10, wc.wr_id=0x2ffa040, wc.opcode=0, vbuf->phead->type=24 = MPIDI_CH3_PKT_ADDRESS
> [weed7.cluster:mpi_rank_7][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:548: [] Got completion with error 10, vendor code=0x88, dest rank=5 : Numerical argument out of domain (33)
> [weed7.cluster:mpispawn_6][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
> [weed7.cluster:mpispawn_6][mtpmi_processops] Error while reading PMI socket. MPI process died?
> [weed11.cluster:mpi_rank_11][handle_cqe] Send desc error in msg to 9, wc_opcode=0
> [weed11.cluster:mpi_rank_11][handle_cqe] Msg from 9: wc.status=10, wc.wr_id=0x2f80040, wc.opcode=0, vbuf->phead->type=24 = MPIDI_CH3_PKT_ADDRESS
> [weed9.cluster:mpi_rank_9][async_thread] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1112: Got FATAL event 3
> [weed11.cluster:mpi_rank_11][handle_cqe] src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:548: [] Got completion with error 10, vendor code=0x88, dest rank=9 : Numerical argument out of domain (33)
> [weed7.cluster:mpispawn_6][child_handler] MPI process (rank: 7, pid: 4508) exited with status 252
> [weed9.cluster:mpispawn_8][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
> [weed9.cluster:mpispawn_8][mtpmi_processops] Error while reading PMI socket. MPI process died?
> [weed11.cluster:mpispawn_10][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
> [weed11.cluster:mpispawn_10][mtpmi_processops] Error while reading PMI socket. MPI process died?
> [weed5.cluster:mpispawn_4][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
> [weed5.cluster:mpispawn_4][mtpmi_processops] Error while reading PMI socket. MPI process died?
> [weed9.cluster:mpispawn_8][child_handler] MPI process (rank: 9, pid: 3645) exited with status 255
> [weed11.cluster:mpispawn_10][child_handler] MPI process (rank: 11, pid: 16975) exited with status 252
> [weed5.cluster:mpispawn_4][child_handler] MPI process (rank: 5, pid: 4656) exited with status 255
> [soliday at weed124 Pelegant_ringTracking1]$ [weed124.cluster:mpispawn_0][read_size] Unexpected End-Of-File on file descriptor 11. MPI process died?
> [weed124.cluster:mpispawn_0][read_size] Unexpected End-Of-File on file descriptor 11. MPI process died?
> [weed124.cluster:mpispawn_0][handle_mt_peer] Error while reading PMI socket. MPI process died?
> [weed2.cluster:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 7. MPI process died?
> [weed2.cluster:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 7. MPI process died?
> [weed2.cluster:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI process died?
> [weed6.cluster:mpispawn_5][read_size] Unexpected End-Of-File on file descriptor 5. MPI process died?
> [weed6.cluster:mpispawn_5][read_size] Unexpected End-Of-File on file descriptor 5. MPI process died?
> [weed6.cluster:mpispawn_5][handle_mt_peer] Error while reading PMI socket. MPI process died?
> [weed3.cluster:mpispawn_2][read_size] Unexpected End-Of-File on file descriptor 5. MPI process died?
> [weed3.cluster:mpispawn_2][read_size] Unexpected End-Of-File on file descriptor 5. MPI process died?
> [weed3.cluster:mpispawn_2][handle_mt_peer] Error while reading PMI socket. MPI process died?
> [weed8.cluster:mpispawn_7][read_size] Unexpected End-Of-File on file descriptor 5. MPI process died?
> [weed8.cluster:mpispawn_7][read_size] Unexpected End-Of-File on file descriptor 5. MPI process died?
> [weed8.cluster:mpispawn_7][handle_mt_peer] Error while reading PMI socket. MPI process died?
> [weed4.cluster:mpispawn_3][read_size] Unexpected End-Of-File on file descriptor 5. MPI process died?
> [weed4.cluster:mpispawn_3][read_size] Unexpected End-Of-File on file descriptor 5. MPI process died?
> [weed4.cluster:mpispawn_3][handle_mt_peer] Error while reading PMI socket. MPI process died?
>
> The launch command:
> /lustre/3rdPartySoftware/mvapich2-2.3b/bin/mpirun_rsh -rsh \
>     -hostfile /lustre/soliday/ElegantTests/Pelegant_ringTracking1/machines \
>     -np 12  MV2_ENABLE_AFFINITY=0 MV2_ON_DEMAND_THRESHOLD=5000 \
>     MV2_SHOW_HCA_BINDING=2 MV2_NUM_HCAS=2 \
>     /home/soliday/oag/apps/src/elegant/O.linux-x86_64/Pelegant manyParticles_p.ele
>
> machine file (weed124 is the only new node in the list):
> weed124
> weed124
> weed2
> weed3
> weed4
> weed5
> weed6
> weed7
> weed8
> weed9
> weed10
> weed11
>
> mpichversion:
> MVAPICH2 Version:         2.3b
> MVAPICH2 Release date:    Thu Aug 10 22:00:00 EST 2017
> MVAPICH2 Device:          ch3:mrail
> MVAPICH2 configure: --prefix=/lustre/3rdPartySoftware/mvapich2-2.3b --with-device=ch3:mrail --with-rdma=gen2 --disable-shared --enable-romio --with-file-system=lustre+nfs
> MVAPICH2 CC:      gcc    -DNDEBUG -DNVALGRIND -O2
> MVAPICH2 CXX:     g++   -DNDEBUG -DNVALGRIND -O2
> MVAPICH2 F77:     gfortran -L/lib -L/lib   -O2
> MVAPICH2 FC:      gfortran   -O2
>
> Hopefully someone knows what I am doing wrong.
> --Bob Soliday
>