[mvapich-discuss] QP failed: Cannot allocate memory

Riley, Douglas (AS) Douglas.Riley at ngc.com
Thu Mar 22 13:07:09 EDT 2012


MVAPICH Team:

I'm currently using:
MVAPICH 1.2-SingleRail
Build-ID: 3635

My cluster has 6 nodes, each with 48 AMD Opteron cores and 192 GB of RAM.  I'm running RHEL 5.5 with Linux kernel 2.6.35.

My applications often use MVAPICH to significantly oversubscribe the available cores (288 in total).  Up to about -n 1200, everything works fine under mpirun_rsh; at about -n 1250, however, I receive this fatal error at startup:

QP failed: Cannot allocate memory
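
For reference, the jobs are launched along these lines (the hostfile name and binary are illustrative):

    # ~1200 processes across the 6 nodes: starts and runs fine
    mpirun_rsh -np 1200 -hostfile ./hosts ./my_app

    # ~1250 processes: fails at startup with "QP failed: Cannot allocate memory"
    mpirun_rsh -np 1250 -hostfile ./hosts ./my_app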

As described in the User Manual, I've increased the memlock limit to the maximum memory on each node (see the sketch after the adapter output below); the problem persists nonetheless.  If I set the environment variable VIADEV_USE_XRC=1, the startup error no longer appears, but the application then hangs indefinitely (for both small and large MPI jobs).  XRC may well solve the issue, but either my MVAPICH version was not built to support it, or perhaps my hardware doesn't support it.  The following is the output from the IB adapter:

hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.7.000
        node_guid:                      0002:c903:000b:9b1c
        sys_image_guid:                 0002:c903:000b:9b1f
        vendor_id:                      0x02c9
        vendor_part_id:                 26428
        hw_ver:                         0xB0
        board_id:                       MT_0D30110008
        phys_port_cnt:                  1
        max_mr_size:                    0xffffffffffffffff
        page_size_cap:                  0xfffffe00
        max_qp:                         261056
        max_qp_wr:                      16351
        device_cap_flags:               0x007c9c76
        max_sge:                        32
        max_sge_rd:                     0
        max_cq:                         65408
        max_cqe:                        4194303
        max_mr:                         524272
        max_pd:                         32764
        max_qp_rd_atom:                 16
        max_ee_rd_atom:                 0
        max_res_rd_atom:                4176896
        max_qp_init_rd_atom:            128
        max_ee_init_rd_atom:            0
        atomic_cap:                     ATOMIC_HCA (1)
        max_ee:                         0
        max_rdd:                        0
        max_mw:                         0
        max_raw_ipv6_qp:                0
        max_raw_ethy_qp:                1
        max_mcast_grp:                  8192
        max_mcast_qp_attach:            56
        max_total_mcast_qp_attach:      458752
        max_ah:                         0
        max_fmr:                        0
        max_srq:                        65472
        max_srq_wr:                     16383
        max_srq_sge:                    31
        max_pkeys:                      128
        local_ca_ack_delay:             15
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               1
                        port_lmc:               0x00
                        link_layer:             IB
                        max_msg_sz:             0x40000000
                        port_cap_flags:         0x0251086a
                        max_vl_num:             8 (4)
                        bad_pkey_cntr:          0x0
                        qkey_viol_cntr:         0x0
                        sm_sl:                  0
                        pkey_tbl_len:           128
                        gid_tbl_len:            128
                        subnet_timeout:         18
                        init_type_reply:        0
                        active_width:           4X (2)
                        active_speed:           5.0 Gbps (2)
                        phys_state:             LINK_UP (5)
                        GID[  0]:               fe80:0000:0000:0000:0002:c903:000b:9b1d
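
For completeness, the memlock change I made on each node follows the User Manual; a sketch of it is below (the exact limits.conf entries are assumed here; 201326592 KB = 192 GB):

    # /etc/security/limits.conf
    *    soft    memlock    201326592
    *    hard    memlock    201326592

and the XRC attempt passes the variable through mpirun_rsh (binary name again illustrative):

    mpirun_rsh -np 1250 -hostfile ./hosts VIADEV_USE_XRC=1 ./my_app

Incidentally, a back-of-envelope count makes me suspect the QP count rather than memlock: at -n 1250 each node hosts about 208 processes, and if each process opens a QP to every remote peer, that is roughly 208 x 1042 = 217,000 QPs per HCA (or nearly 260,000 if intra-node peers get QPs too), uncomfortably close to the max_qp of 261,056 reported above.  My arithmetic may well be off, though.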


Any recommendations for enabling a larger number of MPI processes on this hardware would be most appreciated.

Many Thanks,
Doug

------------------------
Douglas J Riley, PhD



