[mvapich-discuss] [../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 397] Cannot register vbuf region
Deva
devendar.bureddy at gmail.com
Sun Dec 15 16:42:15 EST 2013
Jeff,
This could be related to the OFED memory registration limits (log_num_mtt,
log_mtts_per_seg). A similar issue was discussed here:
http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2013-February/004261.html
Can you try that solution and let us know whether it resolves the problem?
A few details on these OFED parameters from the user guide:
http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-2.0b.html#x1-1130009.1.1
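In short, the amount of memory the HCA can register is bounded by
(2^log_num_mtt) * (2^log_mtts_per_seg) * page_size, and the general
recommendation is to make this at least twice the physical memory of the
node. As a rough sketch (assuming a ConnectX HCA driven by mlx4_core; the
module name and exact values differ for other adapters), you can check and
raise the limits like this:

$ cat /sys/module/mlx4_core/parameters/log_num_mtt
$ cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
$ ulimit -l   # also check the locked-memory limit is not the bottleneck
# 2^24 MTT entries * 2^1 MTTs/segment * 4 KB pages = 128 GB registerable;
# add the line below to /etc/modprobe.d/mlx4_core.conf, then reload the driver
options mlx4_core log_num_mtt=24 log_mtts_per_seg=1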
-Devendar
On Sun, Dec 15, 2013 at 9:39 AM, Jeff Hammond <jeff.science at gmail.com> wrote:
> I am running NWChem using ARMCI over MPI-3 RMA
> [http://git.mpich.org/armci-mpi.git/shortlog/refs/heads/mpi3rma]. Two
> attempts to run a relatively large job failed as follows:
>
> [../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 397] Cannot register vbuf
> region
> [vs9:mpi_rank_350][get_vbuf]
> ../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:798: VBUF reagion
> allocation failed. Pool size 640
> : Cannot allocate memory (12)
>
> [../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 397] Cannot register vbuf
> region
> [vs28:mpi_rank_8][get_vbuf]
> ../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:798: VBUF reagion
> allocation failed. Pool size 4736
> : Cannot allocate memory (12)
>
> NWChem is attempting to allocate a relatively large amount of memory
> using MPI_Win_allocate, so it doesn't surprise me that this happens.
> However, it is not entirely clear whether generic memory allocation has
> failed, i.e. malloc (or equivalent) returned NULL, or whether an IB
> resource has been exhausted, e.g. ibv_reg_mr failed.
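The message at vbuf.c line 397 says "Cannot register vbuf region", which
points at the registration step rather than at malloc. If you want to
confirm this on a compute node, a small standalone probe can separate the
two failure modes. This is only a sketch, assuming libibverbs, the first
HCA in the system, and a size passed on the command line; compile with
"gcc -std=c99 probe.c -libverbs":

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <infiniband/verbs.h>

int main(int argc, char **argv)
{
    size_t size = (argc > 1) ? (size_t)strtoull(argv[1], NULL, 0)
                             : ((size_t)1 << 30);  /* default: 1 GiB */
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (devs == NULL || n == 0) {
        fprintf(stderr, "no IB devices found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
    if (pd == NULL) {
        fprintf(stderr, "could not open device / allocate PD\n");
        return 1;
    }

    void *buf = malloc(size);          /* step 1: generic allocation */
    if (buf == NULL) {
        fprintf(stderr, "malloc(%zu) failed: %s\n", size, strerror(errno));
        return 2;
    }
    memset(buf, 0, size);              /* touch the pages before pinning */

    /* step 2: pin and register the buffer with the HCA */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, size, IBV_ACCESS_LOCAL_WRITE);
    if (mr == NULL) {
        fprintf(stderr, "ibv_reg_mr(%zu) failed: %s\n", size, strerror(errno));
        return 3;
    }

    printf("malloc and ibv_reg_mr of %zu bytes both succeeded\n", size);
    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(buf);
    return 0;
}

Running this with progressively larger sizes should show ibv_reg_mr failing
with ENOMEM once the registration limit is hit, while malloc keeps
succeeding.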
>
> If this is not just a simple out-of-memory error, can you suggest
> environment variables or source changes (in ARMCI-MPI, not MVAPICH2)
> that might alleviate these problems? I don't know whether the installed
> Linux kernel has large-page support, and I can't readily request a new OS
> image, but I can switch machines if that is likely to have a positive
> impact.
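On the environment-variable side: the vbuf pools are tunable at run time
via MV2_VBUF_POOL_SIZE and MV2_VBUF_SECONDARY_POOL_SIZE (see the runtime
parameters section of the user guide). As a sketch with the Hydra launcher
(the process count, values, and input file are placeholders):

$ mpiexec -np 1024 -env MV2_VBUF_POOL_SIZE 512 \
      -env MV2_VBUF_SECONDARY_POOL_SIZE 128 ./nwchem input.nw

Shrinking the pools leaves more registered memory for MPI_Win_allocate, but
if the MTT limit is the root cause, raising log_num_mtt is the real fix.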
>
> These are the MVAPICH installation details:
>
> $ /home/jhammond/TUKEY/MPI/mv2-trunk-gcc/bin/mpichversion
> MVAPICH2 Version: 2.0b
> MVAPICH2 Release date: unreleased development copy
> MVAPICH2 Device: ch3:mrail
> MVAPICH2 configure: CC=gcc CXX=g++ --enable-fc FC=gfortran
> --enable-f77 F77=gfortran --with-pm=hydra --enable-mcast
> --enable-static --prefix=/home/jhammond/TUKEY/MPI/mv2-trunk-gcc
> MVAPICH2 CC: gcc -DNDEBUG -DNVALGRIND -O2
> MVAPICH2 CXX: g++ -DNDEBUG -DNVALGRIND -O2
> MVAPICH2 F77: gfortran -O2
> MVAPICH2 FC: gfortran -O2
>
> I looked at the code and it seems that there might be a way to fix
> this, but obviously I'll have to wait for you all to do it.
>
> /*
>  * It will often be possible for higher layers to recover
>  * when no vbuf is available, but waiting for more descriptors
>  * to complete. For now, just abort.
>  */
> if (NULL == free_vbuf_head)
> {
>     if (allocate_vbuf_region(rdma_vbuf_secondary_pool_size) != 0) {
>         ibv_va_error_abort(GEN_EXIT_ERR,
>             "VBUF reagion allocation failed. Pool size %d\n",
>             vbuf_n_allocated);
>     }
> }
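Right: each pass through this fallback calls allocate_vbuf_region, which
pins another slab of memory with the HCA. Once the MTT table is exhausted,
that registration fails with ENOMEM even though plenty of unpinned RAM may
remain, which would match the "Cannot allocate memory (12)" in your log.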
>
> Thanks!
>
> Jeff
>
> --
> Jeff Hammond
> jeff.science at gmail.com
>