[mvapich-discuss] out of registration memory when running graph500

Sayan Ghosh sayandeep52 at gmail.com
Thu Oct 29 14:51:50 EDT 2015


Sorry to bump an old thread, but I am facing a different set of issues
while trying to run graph500 (all the MPI BFS versions) with the toy
dataset (http://www.graph500.org/specifications#sec-1) on a different
cluster. Following your advice, I have updated these two variables:

export MV2_USE_LAZY_MEM_UNREGISTER=0
export MV2_VBUF_TOTAL_SIZE=4096
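
For reference, the runs look roughly like this (the binary name, process
count, and problem scale below are illustrative placeholders, not the
exact command):

```shell
# Settings suggested earlier in the thread:
export MV2_USE_LAZY_MEM_UNREGISTER=0   # disable the registration cache
export MV2_VBUF_TOTAL_SIZE=4096        # shrink the vbuf size
# Launch under SLURM (commented out; names/values are placeholders):
#   srun -n 64 ./graph500_mpi_simple 26
```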

The error I get for bfs_simple is as follows:

graph_generation:               10.893693 s
construction_time:              3.634747 s
Running BFS 0
In: PMI_Abort(1, Fatal error in PMPI_Test:
Other MPI error, error stack:
PMPI_Test(168)...............: MPI_Test(request=0x9ed208,
flag=0x7fffffffcb84, status=0x1) failed
MPIDI_CH3I_Progress_test(559):
handle_read(1134)............:
handle_read_individual(1419).:
)
In: PMI_Abort(1, Fatal error in PMPI_Test:
Other MPI error, error stack:
PMPI_Test(168)...............: MPI_Test(request=0x7fffffffcb8c,
flag=0x7fffffffcb98, status=0x7fffffffcb60) failed
MPIDI_CH3I_Progress_test(559):
handle_read(1134)............:
handle_read_individual(1419).:
)

For bfs_one_sided, I still get the following error even when I decrease
MV2_VBUF_TOTAL_SIZE.

[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 459] Cannot register vbuf
region
[node141.local:mpi_rank_77][MRAILI_Get_Vbuf]
src/mpid/ch3/channels/mrail/src/gen2/ibv_send.c:989: vbuf pool allocation
failed: Resource temporarily unavailable (11)
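
If it helps with diagnosis, I can also check the registration-related
limits on the compute nodes, along these lines (a sketch; the mlx4 sysfs
paths are assumptions and may differ for mlx5 hardware):

```shell
# Locked-memory limit that InfiniBand memory registration is subject to;
# on IB clusters this is typically "unlimited":
ulimit -l
# MTT module parameters actually loaded, if the mlx4 driver is in use
# (these sysfs paths do not exist for mlx5):
cat /sys/module/mlx4_core/parameters/log_num_mtt 2>/dev/null || true
cat /sys/module/mlx4_core/parameters/log_mtts_per_seg 2>/dev/null || true
```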

If you want, I could configure mvapich2 with --enable-g
--enable-error-messages=all and report back.

MVAPICH2 Version:     2.1
MVAPICH2 Release date: Fri Apr 03 20:00:00 EDT 2015
MVAPICH2 Device:       ch3:mrail
MVAPICH2 configure:   CC=icc FC=ifort CXX=icpc
--prefix=/share/apps/mvapich2/2.1/intel/15.0.1 --with-pmi --with-slurm
--enable-shared --enable-static --enable-f77 --enable-fc --enable-cxx
--enable-romio
MVAPICH2 CC:   icc    -DNDEBUG -DNVALGRIND -O2
MVAPICH2 CXX: icpc   -DNDEBUG -DNVALGRIND -O2
MVAPICH2 F77: ifort -L/lib -L/lib   -O2
MVAPICH2 FC:   ifort   -O2


Thank you,
Sayan

On Tue, Aug 4, 2015 at 8:58 AM, Sayan Ghosh <sayandeep52 at gmail.com> wrote:

> Thanks Hari, I shall try them and report back. The median message size
> can be approximated, so I will also try reducing the vbuf size. I was
> just using the debug build to see whether helpful error messages would
> show up; otherwise I am using mvapich2/mvapich 2.2.1.
>
> //Sayan
>
> On Tue, Aug 4, 2015 at 7:34 AM, Hari Subramoni <subramoni.1 at osu.edu>
> wrote:
>
>> Hello Sayan,
>>
>> Apologies about the delay in getting back to you on this.
>>
>> Can you please try disabling the registration cache mechanism
>> (MV2_USE_LAZY_MEM_UNREGISTER=0) and retry?
>>
>>
>> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1-userguide.html#x1-24000011.82
>>
>> Do you have an idea about what the median message size for Graph500 is?
>> If it is small, can you try reducing the size of VBUF
>> (MV2_VBUF_TOTAL_SIZE=<value>) and see if it helps?
>>
>>
>> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1-userguide.html#x1-26300011.105
>>
>> As a side note, I see that you're using a debug build of MVAPICH2.
>> This is not good for performance. If you are doing these runs to
>> measure performance, I would suggest using a build with debugging
>> turned off.
>>
>> Regards,
>> Hari.
>>
>> On Sun, Aug 2, 2015 at 8:12 PM, Sayan Ghosh <sayandeep52 at gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I ran into some InfiniBand registration issues while trying to run
>>> the "toy" graph500 benchmark (both the one-sided and two-sided
>>> versions) [http://www.graph500.org/specifications#sec-3_4] on ALCF
>>> Cooley (
>>> https://www.alcf.anl.gov/user-guides/cooley). I am also setting
>>> MV2_IBA_HCA to "mlx5_0" as suggested here:
>>> https://www.alcf.anl.gov/user-guides/changes-tukey-cooley.
>>>
>>> Excerpt of error that I am getting:
>>>
>>> [9] 9600.0 MB was used for memory usage tracing!
>>> [6] 9600.0 MB was used for memory usage tracing!
>>> [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 459] Cannot register vbuf
>>> region
>>> [cc016:mpi_rank_13][MRAILI_Get_Vbuf]
>>> src/mpid/ch3/channels/mrail/src/gen2/ibv_send.c:989: vbuf pool allocation
>>> failed: Cannot allocate memory (12)
>>>
>>> The MVAPICH2 2.1 user guide (section 9.1, page 74) says to increase
>>> the OFED kernel module parameter log_num_mtt so that the maximum
>>> registered memory is twice the physical memory, but Cooley's
>>> /etc/modprobe.d/mlx4_core.conf reads:
>>>
>>> options mlx4_core log_num_mtt=24 log_mtts_per_seg=4
>>>
>>> which means the maximum registered memory is 2^24 * 2^4 * 4096 bytes
>>> = 2^40 bytes = 1 TiB.
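
Spelled out in shell arithmetic (the standard 4 KiB page size is
assumed):

```shell
# 2^log_num_mtt MTT entries, 2^log_mtts_per_seg MTTs per segment,
# 4096-byte pages:
echo $(( (1 << 24) * (1 << 4) * 4096 ))   # prints 1099511627776 (= 1 TiB)
```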
>>>
>>> Please advise.
>>>
>>> MVAPICH version:
>>>
>>> MVAPICH2 Version:       2.1
>>> MVAPICH2 Release date:  Fri Apr 03 20:00:00 EDT 2015
>>> MVAPICH2 Device:        ch3:mrail
>>> MVAPICH2 configure:     --enable-shared --enable-debuginfo
>>> --enable-g=all --prefix=/soft/libraries/mpi/mvapich2-2.1/gccdbg
>>> MVAPICH2 CC:    gcc    -DNDEBUG -DNVALGRIND -g -O2
>>> MVAPICH2 CXX:   g++   -DNDEBUG -DNVALGRIND -g -O2
>>> MVAPICH2 F77:   gfortran -L/lib -L/lib   -g -O2
>>> MVAPICH2 FC:    gfortran   -g -O2
>>>
>>> Thank you,
>>> Sayan
>>>
>>>
>>>
>>
>
>
> --
> Regards,
> Sayan
>



-- 
Regards,
Sayan