[mvapich-discuss] [../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 397] Cannot register vbuf region

Deva devendar.bureddy at gmail.com
Sun Dec 15 16:42:15 EST 2013


Jeff,

This could be related to the OFED memory registration limits (log_num_mtt,
log_mtts_per_seg).  A similar issue was discussed here:
http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2013-February/004261.html
Can you check whether that solution works for you?

A few details on these OFED parameters, from the user guide:
http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-2.0b.html#x1-1130009.1.1
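
As a quick sanity check, the limit those parameters impose works out to
roughly (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE, and you want
it comfortably above the amount of memory you register per node.  A
minimal C sketch of the arithmetic (the parameter values here are
hypothetical examples -- read the real ones from
/sys/module/mlx4_core/parameters/ on your compute nodes):

    #include <stdio.h>

    /* Hypothetical example values; read the real ones from
     * /sys/module/mlx4_core/parameters/{log_num_mtt,log_mtts_per_seg}. */
    #define LOG_NUM_MTT       20
    #define LOG_MTTS_PER_SEG   3
    #define PAGE_SIZE_BYTES 4096UL   /* assuming 4 KiB pages */

    int main(void)
    {
        /* registrable limit ~= 2^log_num_mtt * 2^log_mtts_per_seg * page size */
        unsigned long max_reg = (1UL << LOG_NUM_MTT)
                              * (1UL << LOG_MTTS_PER_SEG)
                              * PAGE_SIZE_BYTES;
        printf("max registrable memory: %lu GiB\n", max_reg >> 30);
        return 0;
    }

With these example values the limit comes to 32 GiB, which a few large
MPI_Win_allocate calls across many ranks on a node can exhaust.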


-Devendar

On Sun, Dec 15, 2013 at 9:39 AM, Jeff Hammond <jeff.science at gmail.com> wrote:

> I am running NWChem using ARMCI over MPI-3 RMA
> [http://git.mpich.org/armci-mpi.git/shortlog/refs/heads/mpi3rma].  Two
> attempts to run a relatively large job failed as follows:
>
> [../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 397] Cannot register vbuf
> region
> [vs9:mpi_rank_350][get_vbuf]
> ../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:798: VBUF reagion
> allocation failed. Pool size 640
> : Cannot allocate memory (12)
>
> [../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 397] Cannot register vbuf
> region
> [vs28:mpi_rank_8][get_vbuf]
> ../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:798: VBUF reagion
> allocation failed. Pool size 4736
> : Cannot allocate memory (12)
>
> NWChem is attempting to allocate a relatively large amount of memory
> using MPI_Win_allocate, so it doesn't surprise me that this happens.
> However, it is not entirely clear if the problem is that generic
> memory allocation has failed, i.e. malloc (or equivalent) returned
> NULL, or if something related to IB has been exhausted, e.g. ibv_reg_mr
> has failed.
>
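> To give a flavor of the pattern, what ARMCI-MPI does is essentially the
> following (a minimal sketch, not the actual ARMCI-MPI code, and the
> 4 GiB size is made up):
>
>     #include <mpi.h>
>
>     int main(int argc, char **argv)
>     {
>         MPI_Init(&argc, &argv);
>
>         /* stand-in for one large NWChem global-array allocation */
>         MPI_Aint size = 4UL * 1024 * 1024 * 1024;   /* made-up size */
>         double *base;
>         MPI_Win win;
>
>         /* Over IB the implementation typically registers the whole
>          * region with the HCA, which is where MTT limits can bite. */
>         MPI_Win_allocate(size, sizeof(double), MPI_INFO_NULL,
>                          MPI_COMM_WORLD, &base, &win);
>
>         MPI_Win_free(&win);
>         MPI_Finalize();
>         return 0;
>     }
>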
> If this is not just a simple out-of-memory error, can you suggest
> environment variables or source changes (in ARMCI-MPI, not MVAPICH2)
> that might alleviate these problems?  I don't know whether the installed
> Linux has large-page support, and I can't readily request a new OS
> image, but I can switch machines if this is likely to have a positive
> impact.
>
> These are the MVAPICH installation details:
>
> $ /home/jhammond/TUKEY/MPI/mv2-trunk-gcc/bin/mpichversion
> MVAPICH2 Version:       2.0b
> MVAPICH2 Release date:  unreleased development copy
> MVAPICH2 Device:        ch3:mrail
> MVAPICH2 configure:     CC=gcc CXX=g++ --enable-fc FC=gfortran
> --enable-f77 F77=gfortran --with-pm=hydra --enable-mcast
> --enable-static --prefix=/home/jhammond/TUKEY/MPI/mv2-trunk-gcc
> MVAPICH2 CC:    gcc    -DNDEBUG -DNVALGRIND -O2
> MVAPICH2 CXX:   g++   -DNDEBUG -DNVALGRIND -O2
> MVAPICH2 F77:   gfortran   -O2
> MVAPICH2 FC:    gfortran   -O2
>
> I looked at the code and it seems that there might be a way to fix
> this, but obviously I'll have to wait for you all to do it.
>
>     /*
>      * It will often be possible for higher layers to recover
>      * when no vbuf is available, but waiting for more descriptors
>      * to complete. For now, just abort.
>      */
>     if (NULL == free_vbuf_head)
>     {
>         if(allocate_vbuf_region(rdma_vbuf_secondary_pool_size) != 0) {
>             ibv_va_error_abort(GEN_EXIT_ERR,
>                 "VBUF reagion allocation failed. Pool size %d\n",
> vbuf_n_allocated);
>         }
>     }
>
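> Something along these lines is roughly what I have in mind -- just a
> sketch, where drain_send_completions() and MAX_VBUF_RETRIES are
> hypothetical stand-ins for whatever the channel actually provides:
>
>     /* Instead of aborting outright, poll for completions so that
>      * in-flight sends return their vbufs to the free list, and retry
>      * the secondary-pool allocation a bounded number of times. */
>     int tries = 0;
>     while (NULL == free_vbuf_head
>            && allocate_vbuf_region(rdma_vbuf_secondary_pool_size) != 0)
>     {
>         if (++tries > MAX_VBUF_RETRIES) {   /* hypothetical bound */
>             ibv_va_error_abort(GEN_EXIT_ERR,
>                 "VBUF region allocation failed. Pool size %d\n",
>                 vbuf_n_allocated);
>         }
>         drain_send_completions();   /* hypothetical helper */
>     }
>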
> Thanks!
>
> Jeff
>
> --
> Jeff Hammond
> jeff.science at gmail.com
>


