[mvapich-discuss] [../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 397] Cannot register vbuf region

Jeff Hammond jeff.science at gmail.com
Sun Dec 15 21:52:20 EST 2013


So there's nothing I can do in userspace?  I've asked the sysadmins to
change the IB settings, but since the machine I'm using shares its IB
network with the GPFS servers for Mira
[https://www.alcf.anl.gov/mira], they might balk at it.

Jeff

On Sun, Dec 15, 2013 at 3:42 PM, Deva <devendar.bureddy at gmail.com> wrote:
> Jeff,
>
> This could be related to the OFED memory registration limits
> (log_num_mtt, log_mtts_per_seg).  A similar issue was discussed here:
> http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2013-February/004261.html.
> Can you check whether that solution works for you?
>
> A few details on these OFED parameters from the user guide:
> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-2.0b.html#x1-1130009.1.1
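>
> For example, on mlx4 hardware these are mlx4_core module parameters,
> and the registerable limit works out to
> (2^log_num_mtt) x (2^log_mtts_per_seg) x PAGE_SIZE.  An illustrative
> setting (treat it as a sketch; the right values depend on total RAM
> and page size):
>
>     # /etc/modprobe.d/mlx4_core.conf
>     options mlx4_core log_num_mtt=24 log_mtts_per_seg=1
>
> With 4 KB pages that allows 2^24 * 2^1 * 4096 bytes = 128 GB of
> registered memory.  The module has to be reloaded for the change to
> take effect, which is why it needs the sysadmins.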
>
>
> -Devendar
>
> On Sun, Dec 15, 2013 at 9:39 AM, Jeff Hammond <jeff.science at gmail.com>
> wrote:
>>
>> I am running NWChem using ARMCI over MPI-3 RMA
>> [http://git.mpich.org/armci-mpi.git/shortlog/refs/heads/mpi3rma].  Two
>> attempts to run a relatively large job failed as follows:
>>
>> [../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 397] Cannot register vbuf region
>> [vs9:mpi_rank_350][get_vbuf] ../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:798: VBUF reagion allocation failed. Pool size 640
>> : Cannot allocate memory (12)
>>
>> [../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 397] Cannot register vbuf region
>> [vs28:mpi_rank_8][get_vbuf] ../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:798: VBUF reagion allocation failed. Pool size 4736
>> : Cannot allocate memory (12)
>>
>> NWChem is attempting to allocate a relatively large amount of memory
>> using MPI_Win_allocate, so it doesn't surprise me that this happens.
>> However, it is not entirely clear whether generic memory allocation
>> failed, i.e. malloc (or equivalent) returned NULL, or whether
>> something related to IB has been exhausted, e.g. ibv_reg_mr failed.
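>>
>> One way to tell the two apart, assuming libibverbs is installed, would
>> be a standalone probe that malloc's and registers progressively larger
>> buffers until ibv_reg_mr fails (a rough sketch; compile with -libverbs):
>>
>>     #include <stdio.h>
>>     #include <stdlib.h>
>>     #include <infiniband/verbs.h>
>>
>>     int main(void)
>>     {
>>         int n;
>>         struct ibv_device **devs = ibv_get_device_list(&n);
>>         if (!devs || n == 0) { fprintf(stderr, "no IB devices\n"); return 1; }
>>         struct ibv_context *ctx = ibv_open_device(devs[0]);
>>         struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
>>         if (!pd) { fprintf(stderr, "cannot open device / alloc PD\n"); return 1; }
>>
>>         for (size_t sz = 1UL << 20; ; sz <<= 1) {  /* start at 1 MiB, double */
>>             void *buf = malloc(sz);
>>             if (!buf) { printf("malloc failed at %zu bytes\n", sz); break; }
>>             struct ibv_mr *mr = ibv_reg_mr(pd, buf, sz, IBV_ACCESS_LOCAL_WRITE);
>>             if (!mr) { printf("ibv_reg_mr failed at %zu bytes\n", sz); free(buf); break; }
>>             ibv_dereg_mr(mr);
>>             free(buf);
>>         }
>>         ibv_dealloc_pd(pd);
>>         ibv_close_device(ctx);
>>         ibv_free_device_list(devs);
>>         return 0;
>>     }
>>
>> If the probe dies well below physical memory, that points at the
>> registration limit rather than plain memory exhaustion.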
>>
>> If this is not just a simple out-of-memory error, can you suggest
>> environment variables or source changes (in ARMCI-MPI, not MVAPICH2)
>> that might alleviate these problems?  I don't know whether the
>> installed Linux kernel has large-page support, and I can't readily
>> request a new OS image, but I can switch machines if that is likely
>> to help.
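>>
>> For what it's worth, the vbuf-related knobs I can find in the user
>> guide are MV2_VBUF_TOTAL_SIZE, MV2_VBUF_POOL_SIZE, and
>> MV2_VBUF_SECONDARY_POOL_SIZE; e.g. (illustrative values and command
>> line; I don't know whether shrinking the pools helps or merely delays
>> the failure):
>>
>>     MV2_VBUF_POOL_SIZE=256 MV2_VBUF_SECONDARY_POOL_SIZE=128 \
>>         mpiexec -n 512 ./nwchem input.nw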
>>
>> These are the MVAPICH installation details:
>>
>> $ /home/jhammond/TUKEY/MPI/mv2-trunk-gcc/bin/mpichversion
>> MVAPICH2 Version:       2.0b
>> MVAPICH2 Release date:  unreleased development copy
>> MVAPICH2 Device:        ch3:mrail
>> MVAPICH2 configure:     CC=gcc CXX=g++ --enable-fc FC=gfortran
>> --enable-f77 F77=gfortran --with-pm=hydra --enable-mcast
>> --enable-static --prefix=/home/jhammond/TUKEY/MPI/mv2-trunk-gcc
>> MVAPICH2 CC:    gcc    -DNDEBUG -DNVALGRIND -O2
>> MVAPICH2 CXX:   g++   -DNDEBUG -DNVALGRIND -O2
>> MVAPICH2 F77:   gfortran   -O2
>> MVAPICH2 FC:    gfortran   -O2
>>
>> I looked at the code and it seems that there might be a way to fix
>> this, but obviously I'll have to wait for you all to do it.
>>
>>     /*
>>      * It will often be possible for higher layers to recover
>>      * when no vbuf is available, but waiting for more descriptors
>>      * to complete. For now, just abort.
>>      */
>>     if (NULL == free_vbuf_head)
>>     {
>>         if(allocate_vbuf_region(rdma_vbuf_secondary_pool_size) != 0) {
>>             ibv_va_error_abort(GEN_EXIT_ERR,
>>                 "VBUF reagion allocation failed. Pool size %d\n",
>>                 vbuf_n_allocated);
>>         }
>>     }
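>>
>> For what it's worth, a non-aborting fallback along the lines the
>> comment describes might look roughly like this (a sketch only;
>> poll_for_completions() is a hypothetical placeholder, since I don't
>> know which mrail-internal progress function is safe to call here):
>>
>>     while (NULL == free_vbuf_head)
>>     {
>>         if (allocate_vbuf_region(rdma_vbuf_secondary_pool_size) == 0)
>>             break;
>>         /* hypothetical: drain completions so in-flight vbufs are
>>            returned to the free list before retrying */
>>         poll_for_completions();
>>     }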
>>
>> Thanks!
>>
>> Jeff
>>
>> --
>> Jeff Hammond
>> jeff.science at gmail.com
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
> --
> -Devendar



-- 
Jeff Hammond
jeff.science at gmail.com


