[mvapich-discuss] MVAPICH2 Hugepage Issue with RHEL 6.2

Devendar Bureddy bureddy at cse.ohio-state.edu
Wed Feb 20 14:02:16 EST 2013


Hi Tom,

It seems this warning message was added in recent kernels for security
reasons. You can fix it by writing the user's group ID to
/proc/sys/vm/hugetlb_shm_group (as root):

$ echo $(id -g <username>) > /proc/sys/vm/hugetlb_shm_group
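
To confirm the setting took effect, one option is to compare a user's
group IDs with the value stored in that file. The kernel allows
hugepage-backed SysV shared memory for processes that hold CAP_IPC_LOCK
or belong to that one group, so a match on any of the user's groups
should be enough. This is only a minimal sketch: the function name
gid_allowed and the HUGETLB_GROUP_FILE override are made up for
illustration and are not part of MVAPICH2 or any standard tool.

```shell
#!/bin/sh
# Sketch only: gid_allowed and HUGETLB_GROUP_FILE are hypothetical names.
# HUGETLB_GROUP_FILE lets the check run against a scratch file without root.
HUGETLB_GROUP_FILE="${HUGETLB_GROUP_FILE:-/proc/sys/vm/hugetlb_shm_group}"

# gid_allowed GID...
# Returns 0 if any of the given numeric group IDs equals the single GID
# stored in the hugetlb_shm_group file, 1 if none match, 2 if the file
# cannot be read.
gid_allowed() {
    allowed=$(cat "$HUGETLB_GROUP_FILE") || return 2
    for gid in "$@"; do
        if [ "$gid" = "$allowed" ]; then
            return 0
        fi
    done
    return 1
}
```

A typical call would pass every group of the user running the MPI job,
for example: gid_allowed $(id -G <username>) && echo "hugetlb shm permitted"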

You can preserve this setting across reboots by adding the following line
to /etc/sysctl.conf:
     vm.hugetlb_shm_group=<group_id>
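
If the file is managed by a script, the line can be added idempotently
so that repeated runs do not pile up duplicate entries. The helper
below is a sketch under that assumption; persist_sysctl is a made-up
name, and the file to edit is passed in explicitly so the function can
be tried on a copy first.

```shell
#!/bin/sh
# Sketch only: persist_sysctl is a hypothetical helper, not a system tool.
# persist_sysctl FILE KEY VALUE
# Removes any existing assignment of KEY from FILE, then appends KEY=VALUE.
persist_sysctl() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    touch "$file" || return 1
    # Keep every line whose name (text before '=', blanks stripped)
    # differs from KEY; comment lines are kept unchanged.
    awk -F= -v k="$key" '{
        name = $1
        gsub(/[ \t]/, "", name)
        if (name != k) print
    }' "$file" > "$tmp" || return 1
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    # Write back through the original file so its ownership and mode persist.
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

For example, as root: persist_sysctl /etc/sysctl.conf vm.hugetlb_shm_group <group_id>
followed by "sysctl -p" to load the file without rebooting.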

In MVAPICH2, hugepages are used only for inter-node communication
buffers. Hence, you see this message only in the multi-node case.
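
As a cross-check of the quoted configuration, the size of the hugepage
pool can be computed from /proc/meminfo and compared with the memlock
limit reported by "ulimit -l". With the values quoted below, 512 pages of
2048 kB each give 1048576 kB, the same figure as the memlock settings.
The function name hugepage_pool_kb is made up for this sketch, and the
optional file argument exists only so it can be run against saved
output.

```shell
#!/bin/sh
# Sketch only: hugepage_pool_kb is a hypothetical helper.
# hugepage_pool_kb [MEMINFO_FILE]
# Prints HugePages_Total * Hugepagesize in kB, read from a file in
# /proc/meminfo format (default: /proc/meminfo itself).
hugepage_pool_kb() {
    awk '
        /^HugePages_Total:/ { pages = $2 }
        /^Hugepagesize:/    { size_kb = $2 }
        END                 { print pages * size_kb }
    ' "${1:-/proc/meminfo}"
}
```

Usage: compare "$(hugepage_pool_kb)" with "$(ulimit -l)" on a compute node;
both are expressed in kB.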

-Devendar

On Wed, Feb 20, 2013 at 12:04 PM, Tom Crockett <twcroc at wm.edu> wrote:
> Hi,
>
> I'm trying to get MVAPICH2 1.9a2 running under Red Hat Enterprise Linux 6.2,
> and keep getting kernel abort messages which are triggered with the first
> invocation of a multi-node MPI program after a host is booted. The complaint
> is:
>
>    "Using mlock ulimits for SHM_HUGETLB deprecated"
>
> A complete copy of the kernel trace is attached.
>
> When the memlock settings in sysctl.conf are set to "unlimited", this abort
> sometimes crashes the node.  When the memlock limits are set to match the
> size of the hugepage allocation, then these do not appear to be fatal and
> the initial and subsequent MVAPICH2 programs run satisfactorily.  However,
> the abort messages are worrisome and somewhat annoying, so it would be nice
> to understand why this is occurring and what should be done about it.
>
> MPI programs which run entirely within a single node (up to 8 cores in our
> present configuration) do not trigger this problem.
>
> Here are some specifics about our setup:
>
> Software:
>    MVAPICH2 1.9a2
>    Mellanox OFED 1.5.3-3.1.0
>    RHEL 6.2 (kernel 2.6.32-220.el6.x86_64)
>    PGI 11.10
>
> Configuration:
>
>    AnonHugePages:      2048 kB
>    HugePages_Total:     512
>    HugePages_Free:      512
>    HugePages_Rsvd:        0
>    HugePages_Surp:        0
>    Hugepagesize:       2048 kB
>
>    * soft memlock 1048576
>    * hard memlock 1048576
>
> Hardware:
>    Dell C6100 w/ 2 x Xeon X5672, 64 GB mem.
>    Mellanox ConnectX-2 VPI
>
> We have a total of 64 nodes in this cluster.
>
> Thanks for any insights that you can provide on this issue,
>
> Tom Crockett
>
> College of William and Mary               email:  twcroc at wm.edu
> IT/High Performance Computing Group       phone:  (757) 221-2762
> Jones Hall, Rm. 304A                      fax:    (757) 221-1321
> P.O. Box 8795
> Williamsburg, VA  23187-8795
>
>
> ---------- Forwarded message ----------
> From: <user at monsoon.sciclone.wm.edu>
> To: <root at monsoon.sciclone.wm.edu>
> Cc:
> Date: Wed, 20 Feb 2013 10:18:46 -0500
> Subject: [abrt] full crash report
> Duplicate check
> =====
>
>
> Common information
> =====
> architecture
> -----
> x86_64
>
> kernel
> -----
> 2.6.32-220.el6.x86_64
>
> package
> -----
> kernel
>
>
>
> Additional information
> =====
> kernel_tainted_long
> -----
> Taint on warning.
>
> kernel_tainted
> -----
> 512
>
> time
> -----
> 1361373497
>
> backtrace
> -----
> WARNING: at fs/hugetlbfs/inode.c:951 hugetlb_file_setup+0x227/0x250() (Not tainted)
> Hardware name: C6100
> Using mlock ulimits for SHM_HUGETLB deprecated
> Modules linked in: autofs4 nfs lockd fscache nfs_acl auth_rpcgss 8021q garp
> stp llc sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf rdma_ucm(U)
> ib_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6
> ib_uverbs(U) ib_umad(U) iw_nes(U) libcrc32c iw_cxgb3(U) cxgb3(U) mlx4_ib(U)
> ib_mthca(U) ib_mad(U) ib_core(U) vhost_net macvtap macvlan tun kvm uinput sg
> mlx4_en(U) mlx4_core(U) igb dcdbas microcode i2c_i801 i2c_core iTCO_wdt
> iTCO_vendor_support ioatdma dca i7core_edac edac_core shpchp ext4 mbcache
> jbd2 sd_mod crc_t10dif megaraid_sas ahci dm_mirror dm_region_hash dm_log
> dm_mod [last unloaded: scsi_wait_scan]
> Pid: 25513, comm: rand8 Not tainted 2.6.32-220.el6.x86_64 #1
> Call Trace:
> [<ffffffff81069b77>] ? warn_slowpath_common+0x87/0xc0
> [<ffffffff81069c66>] ? warn_slowpath_fmt+0x46/0x50
> [<ffffffff8113ddf4>] ? user_shm_lock+0x54/0xc0
> [<ffffffff811f14a7>] ? hugetlb_file_setup+0x227/0x250
> [<ffffffff81275680>] ? sprintf+0x40/0x50
> [<ffffffff811ff942>] ? newseg+0x152/0x290
> [<ffffffff811faba1>] ? ipcget+0x61/0x200
> [<ffffffff811ff7d9>] ? sys_shmget+0x59/0x60
> [<ffffffff811ff7f0>] ? newseg+0x0/0x290
> [<ffffffff811ff7e0>] ? shm_security+0x0/0x10
> [<ffffffff811fef40>] ? shm_more_checks+0x0/0x20
> [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
>
>
> hostname
> -----
> wh03.sciclone.wm.edu
>
> component
> -----
> kernel
>
> cmdline
> -----
> ro root=/dev/mapper/VolGroup00-LogVol00 rd_NO_LUKS LANG=en_US.UTF-8 rd_NO_MD
> quiet SYSFONT=latarcyrheb-sun16 rhgb rd_LVM_LV=VolGroup00/LogVol01
> rd_LVM_LV=VolGroup00/LogVol00  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM
>
> analyzer
> -----
> Kerneloops
>
> kernel_tainted_short
> -----
> ---------W
>
> reason
> -----
> WARNING: at fs/hugetlbfs/inode.c:951 hugetlb_file_setup+0x227/0x250() (Not tainted)
>
> os_release
> -----
> Red Hat Enterprise Linux Workstation release 6.2 (Santiago)
>
>
> .
>
>



-- 
Devendar