[mvapich-discuss] mvapich2-2.3b on heterogeneous cluster

Bob Soliday soliday at anl.gov
Fri Oct 6 15:48:32 EDT 2017


That solves the problem of needing MV2_NUM_HCAS on jobs that only use my 
new nodes. When I have a mix of old and new nodes, it runs for a bit but 
still crashes soon after it starts. The mixed job will work if I list 
each new node only once in the machine file. If I list some of the new 
nodes twice in the machine file, then I get this crash:


tracking step 1
97 matrices recomputed for periodic Twiss parameter computation
statistics:    ET:     00:00:01 CP:    0.46 BIO:0 DIO:0 PF:0 MEM:26237
matched Twiss parameters for beam generation:
betax =  2.001130e+00 m  alphax =  9.511881e-17  etax = 5.415971e-03 m  
etax' = -4.625206e-16
betay =  9.802921e+00 m  alphay = -1.418245e-16  etay = 0.000000e+00 m  
etay' =  0.000000e+00
generating bunch 1
dumping bunch
[weed5.cluster:mpi_rank_5][async_thread] 
src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1112: Got 
FATAL event 3

[weed8.cluster:mpi_rank_8][handle_cqe] Send desc error in msg to 5, 
wc_opcode=0
[weed8.cluster:mpi_rank_8][handle_cqe] Msg from 5: wc.status=10, 
wc.wr_id=0x25b8040, wc.opcode=0, vbuf->phead->type=0 = 
MPIDI_CH3_PKT_EAGER_SEND
[weed8.cluster:mpi_rank_8][handle_cqe] 
src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:548: [] Got 
completion with error 10, vendor code=0x88, dest rank=5
: Numerical argument out of domain (33)
[weed7.cluster:mpi_rank_7][handle_cqe] Send desc error in msg to 5, 
wc_opcode=0
[weed7.cluster:mpi_rank_7][handle_cqe] Msg from 5: wc.status=10, 
wc.wr_id=0x27c62b0, wc.opcode=0, vbuf->phead->type=24 = 
MPIDI_CH3_PKT_ADDRESS
[weed7.cluster:mpi_rank_7][handle_cqe] 
src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:548: [] Got 
completion with error 10, vendor code=0x88, dest rank=5
: Numerical argument out of domain (33)
[weed5.cluster:mpispawn_4][readline] Unexpected End-Of-File on file 
descriptor 6. MPI process died?
[weed5.cluster:mpispawn_4][mtpmi_processops] Error while reading PMI 
socket. MPI process died?
etc ......

--Bob


On 10/06/2017 02:14 PM, Subramoni, Hari wrote:
> Hi,
>
> I just figured this out myself about an hour back. There are a couple of workarounds you can try here:
>
> 1. Set MV2_SM_SCHEDULING=ROUND_ROBIN (an example launch line follows the patch below), or
>
> 2. Please apply this patch and retry:
>
> diff --git a/src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c b/src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_p
> index f590077..2ed2639 100644
> --- a/src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c
> +++ b/src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c
> @@ -585,6 +585,7 @@ int rdma_open_hca(struct mv2_MPIDI_CH3I_RDMA_Process_t *proc)
>       int i = 0, j = 0;
>       int num_devices = 0;
>       int num_usable_hcas = 0;
> +    int first_attempt_to_bind_failed = 0;
>       int mpi_errno = MPI_SUCCESS;
>       struct ibv_device *ib_dev = NULL;
>       struct ibv_device **dev_list = NULL;
> @@ -622,6 +623,7 @@ int rdma_open_hca(struct mv2_MPIDI_CH3I_RDMA_Process_t *proc)
>       num_usable_hcas = num_devices;
>   #endif /*RDMA_CM*/
>
> +retry_hca_open:
>       for (i = 0; i < num_devices; i++) {
>   #ifdef RDMA_CM
>           if (rdma_skip_network_card(network_type, dev_list[i])) {
> @@ -630,7 +632,8 @@ int rdma_open_hca(struct mv2_MPIDI_CH3I_RDMA_Process_t *proc)
>           }
>   #endif /*RDMA_CM*/
>
> -        if (rdma_multirail_usage_policy == MV2_MRAIL_BINDING) {
> +        if ((rdma_multirail_usage_policy == MV2_MRAIL_BINDING) &&
> +            (first_attempt_to_bind_failed)) {
>               /* Bind a process to a HCA */
>               if (mrail_use_default_mapping) {
>                   mrail_user_defined_p2r_mapping =
> @@ -700,6 +703,10 @@ int rdma_open_hca(struct mv2_MPIDI_CH3I_RDMA_Process_t *proc)
>       }
>
>       if (unlikely(rdma_num_hcas == 0)) {
> +        if (!first_attempt_to_bind_failed) {
> +            first_attempt_to_bind_failed = 1;
> +            goto retry_hca_open;
> +        }
>           MPIR_ERR_SETFATALANDJUMP2(mpi_errno, MPI_ERR_OTHER,
>                                     "**fail", "%s %d",
>                                     "No active HCAs found on the system!!!",
>
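> For workaround 1, MV2_SM_SCHEDULING is passed on the mpirun_rsh command line the same way the other MV2_* variables are passed in the launch command quoted later in this thread (only a sketch; the paths, host file, and process count are copied from that command):
>
>     /lustre/3rdPartySoftware/mvapich2-2.3b/bin/mpirun_rsh -rsh \
>         -hostfile /lustre/soliday/ElegantTests/Pelegant_ringTracking1/machines \
>         -np 12 MV2_ENABLE_AFFINITY=0 MV2_ON_DEMAND_THRESHOLD=5000 \
>         MV2_SM_SCHEDULING=ROUND_ROBIN \
>         /home/soliday/oag/apps/src/elegant/O.linux-x86_64/Pelegant manyParticles_p.ele
>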
> Thx,
> Hari.
>
> -----Original Message-----
> From: Bob Soliday [mailto:soliday at anl.gov]
> Sent: Friday, October 6, 2017 3:10 PM
> To: Subramoni, Hari <subramoni.1 at osu.edu>; mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
> Subject: Re: [mvapich-discuss] mvapich2-2.3b on heterogeneous cluster
>
> It still isn't working. I have been looking at the rdma_open_hca procedure. When MV2_NUM_HCAS is not set, rdma_multirail_usage_policy == MV2_MRAIL_BINDING is true. It gets ib_dev from dev_list[mrail_user_defined_p2r_mapping], but mrail_user_defined_p2r_mapping is always 0 when rdma_local_id is 0. So when I print the device name with ibv_get_device_name(ib_dev), I always see mlx4_0 and never mlx5_0. This then leads to the "No active HCAs found on the system" error, because the loop checks the same device twice and never checks the other one.
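>
> To double-check what dev_list actually holds on each node, a small standalone libibverbs program along these lines can be run there (a rough sketch, independent of MVAPICH2; it enumerates every device and reports which ports are ACTIVE and whether the link layer is InfiniBand or Ethernet; build with e.g. "gcc check_hca.c -libverbs", the file name being arbitrary):
>
>     #include <stdio.h>
>     #include <infiniband/verbs.h>
>
>     int main(void)
>     {
>         int num_devices = 0;
>         struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
>         if (!dev_list) {
>             perror("ibv_get_device_list");
>             return 1;
>         }
>         for (int i = 0; i < num_devices; i++) {
>             struct ibv_context *ctx = ibv_open_device(dev_list[i]);
>             printf("device %d: %s\n", i, ibv_get_device_name(dev_list[i]));
>             if (!ctx)
>                 continue;
>             /* Port numbers start at 1; two ports is enough for these HCAs. */
>             for (int port = 1; port <= 2; port++) {
>                 struct ibv_port_attr attr;
>                 if (ibv_query_port(ctx, port, &attr))
>                     continue;
>                 printf("  port %d: state=%s lid=%d link_layer=%s\n", port,
>                        attr.state == IBV_PORT_ACTIVE ? "ACTIVE" : "not active",
>                        attr.lid,
>                        attr.link_layer == IBV_LINK_LAYER_ETHERNET ?
>                        "Ethernet" : "InfiniBand");
>             }
>             ibv_close_device(ctx);
>         }
>         ibv_free_device_list(dev_list);
>         return 0;
>     }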
>
> On 10/06/2017 09:04 AM, Subramoni, Hari wrote:
>> Hello,
>>
>> Sorry about the delay in getting back to you.
>>
>> Can you please apply this patch and try again? With this, you will not have to set MV2_NUM_HCAS=2.
>>
>> On a different note, can you please let us know why you are setting such a high value for the on-demand threshold? This will affect the job startup time for large jobs. If you're not facing any issues, I would recommend removing it.
>>
>> Please let us know if you face any other issues.
>>
>> diff --git a/src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c
>> b/src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c
>> index 3f8d129..f590077 100644
>> --- a/src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c
>> +++ b/src/mpid/ch3/channels/mrail/src/gen2/rdma_iba_priv.c
>> @@ -343,11 +343,16 @@ int rdma_find_active_port(struct ibv_context
>> *context,
>>
>>        for (j = 1; j <= RDMA_DEFAULT_MAX_PORTS; ++j) {
>>            if ((!ibv_query_port(context, j, &port_attr)) && port_attr.state == IBV_PORT_ACTIVE) {
>> -            if (likely(port_attr.lid || use_iboeth)) {
>> -                DEBUG_PRINT("Active port number = %d, state = %s, lid = %d\r\n",
>> -                            j,
>> -                            (port_attr.state ==
>> -                             IBV_PORT_ACTIVE) ? "Active" : "Not Active",
>> +            /* port_attr.lid && !use_iboeth -> This is an IB device as it has
>> +             * LID and user has not specified to use RoCE mode.
>> +             * !port_attr.lid && use_iboeth -> This is a RoCE device as it does
>> +             * not have a LID and user has specified to use RoCE mode.
>> +             */
>> +            if (likely((port_attr.lid && !use_iboeth) ||
>> +                       (!port_attr.lid && use_iboeth))) {
>> +                PRINT_DEBUG(DEBUG_INIT_verbose>0,
>> +                            "Active port number = %d, state = %s, lid = %d\r\n",
>> +                            j, (port_attr.state == IBV_PORT_ACTIVE) ?
>> + "Active" : "Not Active",
>>                                port_attr.lid);
>>                    return j;
>>                } else {
>>
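>> Restated outside the diff, the patched test keeps an ACTIVE port only when its LID presence matches the selected transport; a minimal sketch of the same check (the helper name here is illustrative, not MVAPICH2 code):
>>
>>     #include <infiniband/verbs.h>
>>
>>     /* Illustrative helper: an IB port is expected to carry a LID, while a
>>      * RoCE port has no LID and is only usable when RoCE mode (use_iboeth)
>>      * was requested. */
>>     static int port_is_usable(const struct ibv_port_attr *attr, int use_iboeth)
>>     {
>>         if (attr->state != IBV_PORT_ACTIVE)
>>             return 0;
>>         return (attr->lid && !use_iboeth) || (!attr->lid && use_iboeth);
>>     }
>>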
>> Thx,
>> Hari.
>>
>> -----Original Message-----
>> From: mvapich-discuss-bounces at cse.ohio-state.edu On Behalf Of Bob
>> Soliday
>> Sent: Thursday, October 5, 2017 3:10 PM
>> To: mvapich-discuss at cse.ohio-state.edu
>> <mvapich-discuss at mailman.cse.ohio-state.edu>
>> Subject: [mvapich-discuss] mvapich2-2.3b on heterogeneous cluster
>>
>> We recently added 4 nodes to our cluster. The older nodes all have 1 IB device:
>>        device              node GUID
>>        ------              ----------------
>>        mlx4_0              0002c90300043e94
>>
>> The new nodes have 2 IB devices:
>>        device              node GUID
>>        ------              ----------------
>>        mlx4_0              248a070300fc15d0
>>        mlx5_0              a4bf01030018c34c
>>
>> The mlx4_0 device on the new nodes is reported with an Ethernet link layer by ibv_devinfo. The mlx5_0 device, with an InfiniBand link layer, is the one we are using. Setting MV2_NUM_HCAS=2 seemed to solve the problem of finding the active device, and it also doesn't seem to cause problems when using the older nodes.
>>
>> If I add enough nodes to the job, eventually it will crash with:
>>
>> [weed5.cluster:mpi_rank_5][async_thread]
>> src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1112: Got
>> FATAL event 3 [weed7.cluster:mpi_rank_7][handle_cqe] Send desc error
>> in msg to 5,
>> wc_opcode=0
>> [weed7.cluster:mpi_rank_7][handle_cqe] Msg from 5: wc.status=10,
>> wc.wr_id=0x2ffa040, wc.opcode=0, vbuf->phead->type=24 =
>> MPIDI_CH3_PKT_ADDRESS [weed7.cluster:mpi_rank_7][handle_cqe]
>> src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:548: [] Got
>> completion with error 10, vendor code=0x88, dest rank=5
>> : Numerical argument out of domain (33)
>> [weed7.cluster:mpispawn_6][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
>> [weed7.cluster:mpispawn_6][mtpmi_processops] Error while reading PMI socket. MPI process died?
>> [weed11.cluster:mpi_rank_11][handle_cqe] Send desc error in msg to 9,
>> wc_opcode=0
>> [weed11.cluster:mpi_rank_11][handle_cqe] Msg from 9: wc.status=10,
>> wc.wr_id=0x2f80040, wc.opcode=0, vbuf->phead->type=24 =
>> MPIDI_CH3_PKT_ADDRESS [weed9.cluster:mpi_rank_9][async_thread]
>> src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:1112: Got
>> FATAL event 3 [weed11.cluster:mpi_rank_11][handle_cqe]
>> src/mpid/ch3/channels/mrail/src/gen2/ibv_channel_manager.c:548: [] Got
>> completion with error 10, vendor code=0x88, dest rank=9
>> : Numerical argument out of domain (33)
>> [weed7.cluster:mpispawn_6][child_handler] MPI process (rank: 7, pid:
>> 4508) exited with status 252
>> [weed9.cluster:mpispawn_8][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
>> [weed9.cluster:mpispawn_8][mtpmi_processops] Error while reading PMI socket. MPI process died?
>> [weed11.cluster:mpispawn_10][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
>> [weed11.cluster:mpispawn_10][mtpmi_processops] Error while reading PMI socket. MPI process died?
>> [weed5.cluster:mpispawn_4][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
>> [weed5.cluster:mpispawn_4][mtpmi_processops] Error while reading PMI socket. MPI process died?
>> [weed9.cluster:mpispawn_8][child_handler] MPI process (rank: 9, pid:
>> 3645) exited with status 255
>> [weed11.cluster:mpispawn_10][child_handler] MPI process (rank: 11, pid:
>> 16975) exited with status 252
>> [weed5.cluster:mpispawn_4][child_handler] MPI process (rank: 5, pid:
>> 4656) exited with status 255
>> [soliday at weed124 Pelegant_ringTracking1]$ [weed124.cluster:mpispawn_0][read_size] Unexpected End-Of-File on file descriptor 11. MPI process died?
>> [weed124.cluster:mpispawn_0][read_size] Unexpected End-Of-File on file descriptor 11. MPI process died?
>> [weed124.cluster:mpispawn_0][handle_mt_peer] Error while reading PMI socket. MPI process died?
>> [weed2.cluster:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 7. MPI process died?
>> [weed2.cluster:mpispawn_1][read_size] Unexpected End-Of-File on file descriptor 7. MPI process died?
>> [weed2.cluster:mpispawn_1][handle_mt_peer] Error while reading PMI socket. MPI process died?
>> [weed6.cluster:mpispawn_5][read_size] Unexpected End-Of-File on file descriptor 5. MPI process died?
>> [weed6.cluster:mpispawn_5][read_size] Unexpected End-Of-File on file descriptor 5. MPI process died?
>> [weed6.cluster:mpispawn_5][handle_mt_peer] Error while reading PMI socket. MPI process died?
>> [weed3.cluster:mpispawn_2][read_size] Unexpected End-Of-File on file descriptor 5. MPI process died?
>> [weed3.cluster:mpispawn_2][read_size] Unexpected End-Of-File on file descriptor 5. MPI process died?
>> [weed3.cluster:mpispawn_2][handle_mt_peer] Error while reading PMI socket. MPI process died?
>> [weed8.cluster:mpispawn_7][read_size] Unexpected End-Of-File on file descriptor 5. MPI process died?
>> [weed8.cluster:mpispawn_7][read_size] Unexpected End-Of-File on file descriptor 5. MPI process died?
>> [weed8.cluster:mpispawn_7][handle_mt_peer] Error while reading PMI socket. MPI process died?
>> [weed4.cluster:mpispawn_3][read_size] Unexpected End-Of-File on file descriptor 5. MPI process died?
>> [weed4.cluster:mpispawn_3][read_size] Unexpected End-Of-File on file descriptor 5. MPI process died?
>> [weed4.cluster:mpispawn_3][handle_mt_peer] Error while reading PMI socket. MPI process died?
>>
>> The launch command:
>> /lustre/3rdPartySoftware/mvapich2-2.3b/bin/mpirun_rsh -rsh \
>>      -hostfile /lustre/soliday/ElegantTests/Pelegant_ringTracking1/machines \
>>      -np 12  MV2_ENABLE_AFFINITY=0 MV2_ON_DEMAND_THRESHOLD=5000 \
>>      MV2_SHOW_HCA_BINDING=2 MV2_NUM_HCAS=2 \
>>      /home/soliday/oag/apps/src/elegant/O.linux-x86_64/Pelegant
>> manyParticles_p.ele
>>
>> machine file (weed124 is the only new node in the list):
>> weed124
>> weed124
>> weed2
>> weed3
>> weed4
>> weed5
>> weed6
>> weed7
>> weed8
>> weed9
>> weed10
>> weed11
>>
>> mpichversion:
>> MVAPICH2 Version:         2.3b
>> MVAPICH2 Release date:    Thu Aug 10 22:00:00 EST 2017
>> MVAPICH2 Device:          ch3:mrail
>> MVAPICH2 configure: --prefix=/lustre/3rdPartySoftware/mvapich2-2.3b
>> --with-device=ch3:mrail --with-rdma=gen2 --disable-shared --enable-romio --with-file-system=lustre+nfs
>> MVAPICH2 CC:      gcc    -DNDEBUG -DNVALGRIND -O2
>> MVAPICH2 CXX:     g++   -DNDEBUG -DNVALGRIND -O2
>> MVAPICH2 F77:     gfortran -L/lib -L/lib   -O2
>> MVAPICH2 FC:      gfortran   -O2
>>
>> Hopefully someone knows what I am doing wrong.
>> --Bob Soliday
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss


