[mvapich-discuss] out of registration memory when running graph500

Sayan Ghosh sayandeep52 at gmail.com
Tue Aug 4 11:58:05 EDT 2015


Thanks Hari, I shall try them and report back. The median message size
can be approximated, so I will also try reducing the vbuf size. I was
only using the debug build to see whether more helpful error messages would
show up; otherwise I am using mvapich2/mvapich 2.2.1.
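
For reference, the settings suggested in the quoted reply below could be
collected as in the following rough sketch. The helper names (mv2_env,
export_lines) and the VBUF size of 4096 are hypothetical placeholders, not
recommended values, and it assumes that setting
MV2_USE_LAZY_MEM_UNREGISTER to 0 is what disables the registration cache.

```python
import os
import shlex


def mv2_env(vbuf_total_size, hca="mlx5_0", base=None):
    """Return a copy of the environment with the MVAPICH2 settings from this thread.

    vbuf_total_size is a placeholder; pick it from the median message size.
    """
    env = dict(os.environ if base is None else base)
    # Assumption: 0 turns the registration cache (lazy unregister) off.
    env["MV2_USE_LAZY_MEM_UNREGISTER"] = "0"
    env["MV2_VBUF_TOTAL_SIZE"] = str(vbuf_total_size)
    # HCA name as set on Cooley in the original report.
    env["MV2_IBA_HCA"] = hca
    return env


def export_lines(env):
    """Render the MV2_* entries as shell export lines for a job script."""
    keys = sorted(k for k in env if k.startswith("MV2_"))
    return [f"export {k}={shlex.quote(env[k])}" for k in keys]


if __name__ == "__main__":
    # 4096 is only an example value for the VBUF size.
    for line in export_lines(mv2_env(4096, base={})):
        print(line)
```

The resulting dictionary can also be passed as env= to subprocess.run when
launching the benchmark from a script.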

//Sayan

On Tue, Aug 4, 2015 at 7:34 AM, Hari Subramoni <subramoni.1 at osu.edu> wrote:

> Hello Sayan,
>
> Apologies about the delay in getting back to you on this.
>
> Can you please try disabling the registration cache mechanism
> (MV2_USE_LAZY_MEM_UNREGISTER=0) and retry?
>
>
> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1-userguide.html#x1-24000011.82
>
> Do you have an idea about what the median message size for Graph500 is? If
> it is small, can you try reducing the size of VBUF
> (MV2_VBUF_TOTAL_SIZE=<value>) and see if it helps?
>
>
> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1-userguide.html#x1-26300011.105
>
> As a side note, I see that you're using a debug version of the MVAPICH2
> build. This is not good for performance. If you are doing these runs to
> measure performance, I would suggest that you use a build where debugging
> is turned off.
>
> Regards,
> Hari.
>
> On Sun, Aug 2, 2015 at 8:12 PM, Sayan Ghosh <sayandeep52 at gmail.com> wrote:
>
>> Hi,
>>
>> I ran into some IB registration issues while trying to run the "toy"
>> graph500 benchmark (one-sided, as well as 2-sided)[
>> http://www.graph500.org/specifications#sec-3_4] on ALCF Cooley (
>> https://www.alcf.anl.gov/user-guides/cooley). I am also setting
>> MV2_IBA_HCA to "mlx5_0" as suggested here:
>> https://www.alcf.anl.gov/user-guides/changes-tukey-cooley.
>>
>> Excerpt of error that I am getting:
>>
>> [9] 9600.0 MB was used for memory usage tracing!
>> [6] 9600.0 MB was used for memory usage tracing!
>> [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 459] Cannot register vbuf
>> region
>> [cc016:mpi_rank_13][MRAILI_Get_Vbuf]
>> src/mpid/ch3/channels/mrail/src/gen2/ibv_send.c:989: vbuf pool allocation
>> failed: Cannot allocate memory (12)
>>
>> The MVAPICH2 2.1 user guide (section 9.1, page 74) says to increase the
>> OFED kernel module parameter (log_num_mtt) to twice the amount of physical
>> memory, but I see Cooley's /etc/modprobe.d/mlx4_core.conf to be:
>>
>> options mlx4_core log_num_mtt=24 log_mtts_per_seg=4
>>
>> which means max registered memory is 2^24 * 2^4 * 4096 = 1 TB
>>
>> Please advise.
>>
>> MVAPICH version:
>>
>> MVAPICH2 Version:       2.1
>> MVAPICH2 Release date:  Fri Apr 03 20:00:00 EDT 2015
>> MVAPICH2 Device:        ch3:mrail
>> MVAPICH2 configure:     --enable-shared --enable-debuginfo --enable-g=all
>> --prefix=/soft/libraries/mpi/mvapich2-2.1/gccdbg
>> MVAPICH2 CC:    gcc    -DNDEBUG -DNVALGRIND -g -O2
>> MVAPICH2 CXX:   g++   -DNDEBUG -DNVALGRIND -g -O2
>> MVAPICH2 F77:   gfortran -L/lib -L/lib   -g -O2
>> MVAPICH2 FC:    gfortran   -g -O2
>>
>> Thank you,
>> Sayan
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>>
>
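
The registration limit quoted in the original report above can be re-derived
with a short sketch. It uses the same formula as the report,
2^log_num_mtt * 2^log_mtts_per_seg * page size; the function name is
hypothetical and the 4096-byte page size is an assumption taken from that
calculation.

```python
def max_registrable_bytes(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """Upper bound on registrable memory implied by the mlx4_core options."""
    # Number of MTT segments times entries per segment times bytes per page.
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size


if __name__ == "__main__":
    # Values from Cooley's /etc/modprobe.d/mlx4_core.conf as quoted above.
    limit = max_registrable_bytes(log_num_mtt=24, log_mtts_per_seg=4)
    print(f"{limit} bytes = {limit / 2**40:g} TiB")
```

With log_num_mtt=24 and log_mtts_per_seg=4 this gives 2^40 bytes, matching
the 1 TB figure in the report, which suggests the module limit itself is
not what the failing vbuf registration is hitting.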


-- 
Regards,
Sayan