[mvapich-discuss] vbuf pool allocation failure

Devendar Bureddy bureddy at cse.ohio-state.edu
Mon Feb 11 14:46:52 EST 2013


Hi Brody

It seems it is hitting a limit on the amount of memory that can be
registered with the HCA.  Can you provide the following details?

- Is lockable memory set to unlimited on compute nodes?
$ ulimit -l
unlimited
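
If it is not unlimited, a common way to raise it (assuming your nodes
use PAM limits; the drop-in file name below is just an example) is to
set memlock on every compute node and log in again:

$ cat /etc/security/limits.d/rdma.conf
*  soft  memlock  unlimited
*  hard  memlock  unlimited

Note that batch jobs usually inherit this limit from the resource
manager daemon, so it may also need to be restarted after the change.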

- How much RAM do these nodes have? Can you check the OFED parameter
log_mtts_per_seg? With most standard OFED installations, the default
value of this parameter is '3'.  If your system has more than 16GB,
you need to set this parameter to '4' or more (see the example after
the check below).

$ more /sys/module/mlx4_core/parameters/log_mtts_per_seg
3
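
If this is the limiting factor: the maximum amount of memory the HCA
can register is roughly PAGE_SIZE * 2^log_num_mtt * 2^log_mtts_per_seg,
so '3' can be too small for nodes with more than 16GB of RAM.  A
typical way to raise it on mlx4 hardware (adjust the file name and
reload step for your setup) is:

$ cat /etc/modprobe.d/mlx4_core.conf
options mlx4_core log_mtts_per_seg=4

followed by reloading the mlx4 modules (or rebooting the nodes) so the
new value takes effect.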

- What is the size of the cudaHostRegister() buffer that you mentioned?

- What version of MVAPICH2 are you using, and which configuration options?
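
A quick way to get both at once, assuming the MVAPICH2 mpiname utility
is in your PATH:

$ mpiname -a

This prints the MVAPICH2 version together with the configure options
it was built with.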

-Devendar

On Mon, Feb 11, 2013 at 2:01 PM, Brody Huval <brodyh at stanford.edu> wrote:
> Hi,
>
> Our job is running on 64 GPUs (64 MPI nodes) in a small cluster with
> ConnectX3 IB adapters.  We've been running into abort() calls that
> bring down the system after perhaps 10 or 15 minutes of running with
> the following error:
>
> [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 540] Cannot register vbuf region
> [8] Abort: vbuf pool allocation failed at line 607 in file
> src/mpid/ch3/channels/mrail/src/gen2/vbuf.c
>
> Unfortunately, MV2_DEBUG_SHOW_BACKTRACE hasn't shown us anything useful
> here, so we're still hunting for the call that's triggering this.  We
> have tried a suggested solution from the archives, setting
> MV2_USE_LAZY_MEM_UNREGISTER to 0, but this leads to an immediate
> crash.  Our code does not make significant use of pinned memory,
> though in the one place that we do use it, it is done with
> cudaHostRegister(), and this buffer is not touched by MPI.
>
> This problem cropped up just recently as we've moved to larger problem
> sizes (and thus larger message sizes).  Previous runs with smaller
> models have worked just fine.
>
> Do you have advice on how to find the problem, or a possible solution? Thank you in advance for any help.
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss



-- 
Devendar

