[mvapich-discuss] [../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 397] Cannot register vbuf region

Jeff Hammond jeff.science at gmail.com
Sun Dec 15 12:39:10 EST 2013


I am running NWChem using ARMCI over MPI-3 RMA
[http://git.mpich.org/armci-mpi.git/shortlog/refs/heads/mpi3rma].  Two
attempts to run a relatively large job failed as follows:

[../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 397] Cannot register vbuf region
[vs9:mpi_rank_350][get_vbuf]
../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:798: VBUF reagion
allocation failed. Pool size 640
: Cannot allocate memory (12)

[../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 397] Cannot register vbuf region
[vs28:mpi_rank_8][get_vbuf]
../src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:798: VBUF reagion
allocation failed. Pool size 4736
: Cannot allocate memory (12)

NWChem is attempting to allocate a relatively large amount of memory
using MPI_Win_allocate, so it doesn't surprise me that this happens.
However, it is not entirely clear if the problem is that generic
memory allocation has failed, i.e. malloc (or equivalent) returned
NULL, or if something related to IB has been exhausted, e.g. ibv_reg_mr
has failed.
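
One way to narrow this down from inside the job is to look at the
locked-memory limit, since registered (pinned) memory counts against
RLIMIT_MEMLOCK and a low limit is a common reason for memory
registration to fail with ENOMEM (errno 12, the value in the log
above). The following is only a sketch, assuming Linux/POSIX; the
function names memlock_is_unlimited and print_memlock_limit are made
up for illustration and are not part of MVAPICH2 or ARMCI-MPI. It does
not prove which call failed, it only rules one cause in or out.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

/* Hypothetical helper: returns 1 if the soft locked-memory limit is
 * unlimited, 0 if it is finite, and -1 if getrlimit() itself fails. */
int memlock_is_unlimited(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0) {
        return -1;
    }
    return (lim.rlim_cur == RLIM_INFINITY) ? 1 : 0;
}

/* Hypothetical helper: prints the soft locked-memory limit of the
 * calling process, e.g. once per rank at startup. */
void print_memlock_limit(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0) {
        fprintf(stderr, "getrlimit failed: %s\n", strerror(errno));
        return;
    }
    if (lim.rlim_cur == RLIM_INFINITY) {
        printf("RLIMIT_MEMLOCK soft limit: unlimited\n");
    } else {
        /* A finite limit caps how much memory can be pinned. */
        assert(lim.rlim_cur <= lim.rlim_max);
        printf("RLIMIT_MEMLOCK soft limit: %llu bytes\n",
               (unsigned long long)lim.rlim_cur);
    }
}
```

Calling print_memlock_limit() on the compute nodes (not just the login
node) would show whether the limit is finite there. If it is already
unlimited, the failure is more likely plain memory exhaustion or some
other registration resource.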

If this is not just a simple out-of-memory error, can you suggest
environment variables or source changes (in ARMCI-MPI, not MVAPICH2)
that might alleviate these problems?  I don't know whether the installed
Linux has large page support, and I can't readily request a new OS
image, but I can switch machines if this is likely to have a positive
impact.

These are the MVAPICH installation details:

$ /home/jhammond/TUKEY/MPI/mv2-trunk-gcc/bin/mpichversion
MVAPICH2 Version:     	2.0b
MVAPICH2 Release date:	unreleased development copy
MVAPICH2 Device:      	ch3:mrail
MVAPICH2 configure:   	CC=gcc CXX=g++ --enable-fc FC=gfortran
--enable-f77 F77=gfortran --with-pm=hydra --enable-mcast
--enable-static --prefix=/home/jhammond/TUKEY/MPI/mv2-trunk-gcc
MVAPICH2 CC:  	gcc    -DNDEBUG -DNVALGRIND -O2
MVAPICH2 CXX: 	g++   -DNDEBUG -DNVALGRIND -O2
MVAPICH2 F77: 	gfortran   -O2
MVAPICH2 FC:  	gfortran   -O2

I looked at the code in get_vbuf, and its comment suggests that
recovering instead of aborting might be possible, but obviously I'll
have to wait for you all to do it.

    /*
     * It will often be possible for higher layers to recover
     * when no vbuf is available, but waiting for more descriptors
     * to complete. For now, just abort.
     */
    if (NULL == free_vbuf_head)
    {
        if(allocate_vbuf_region(rdma_vbuf_secondary_pool_size) != 0) {
            ibv_va_error_abort(GEN_EXIT_ERR,
                "VBUF reagion allocation failed. Pool size %d\n",
                vbuf_n_allocated);
        }
    }
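
To make the idea in that comment concrete, here is a toy sketch of
"wait for descriptors to complete instead of aborting". It is not
MVAPICH2 code: toy_pool, toy_get_buf and toy_reap_completions are
invented names, the pool is a fixed array rather than registered
memory, and every in-flight buffer is treated as already complete,
which a real completion queue would not guarantee.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_POOL_SIZE 4

typedef struct toy_buf {
    struct toy_buf *next;               /* link in the free list */
} toy_buf;

typedef struct {
    toy_buf  slots[TOY_POOL_SIZE];
    toy_buf *free_head;                 /* singly linked free list */
    toy_buf *in_flight[TOY_POOL_SIZE];  /* buffers currently handed out */
    int      n_in_flight;
} toy_pool;

/* Put every slot on the free list. */
void toy_pool_init(toy_pool *p)
{
    p->free_head = NULL;
    p->n_in_flight = 0;
    for (int i = 0; i < TOY_POOL_SIZE; i++) {
        p->slots[i].next = p->free_head;
        p->free_head = &p->slots[i];
    }
}

/* Stand-in for polling a completion queue: returns one in-flight
 * buffer to the free list. Returns the number of buffers reclaimed. */
int toy_reap_completions(toy_pool *p)
{
    if (p->n_in_flight == 0) {
        return 0;
    }
    toy_buf *b = p->in_flight[--p->n_in_flight];
    b->next = p->free_head;
    p->free_head = b;
    return 1;
}

/* Get a buffer; when the free list is empty, try to reclaim completed
 * buffers before giving up. Returns NULL only if nothing is reclaimable. */
toy_buf *toy_get_buf(toy_pool *p)
{
    while (p->free_head == NULL) {
        if (toy_reap_completions(p) == 0) {
            return NULL;                /* caller decides how to fail */
        }
    }
    toy_buf *b = p->free_head;
    p->free_head = b->next;
    assert(p->n_in_flight < TOY_POOL_SIZE);
    p->in_flight[p->n_in_flight++] = b;
    return b;
}
```

The point of the sketch is only the control flow in toy_get_buf: an
empty free list triggers an attempt to reclaim buffers, and failure is
reported to the caller rather than ending the job.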

Thanks!

Jeff

-- 
Jeff Hammond
jeff.science at gmail.com


