[mvapich-discuss] vbuf pool allocation failure
Adam Coates
acoates at cs.stanford.edu
Mon Feb 11 15:20:35 EST 2013
Hi Devendar,
Thanks for the reply; we might have it fixed over here, but just to
sanity check (and to populate your discussion list for posterity):
Some outputs from us:
$ mpiname -a
MVAPICH2 1.9a2 Thu Nov 8 11:43:52 EST 2012 ch3:mrail
..<snip>..
Configuration
--with-cuda=/usr/local/cuda-5.0 --enable-cuda
$ ulimit -l
unlimited
[We're not using PBS, etc.; so that should be good.]
Following other notes on the list + your suggestion, we set
log_num_mtt=24 for the mlx4_core module and reloaded the driver, which
appears to have fixed our issue.
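For anyone finding this thread later, here's a sketch of how the change can be made persistent. The file path and restart command below are the typical RHEL-style OFED locations, not something from our exact setup; adjust for your distro:

```shell
# Sketch: make log_num_mtt=24 persist across reboots (typical RHEL-style
# paths; the modprobe.d filename and restart script may differ on your distro)
echo "options mlx4_core log_num_mtt=24" > /etc/modprobe.d/mlx4_core.conf
/etc/init.d/openibd restart   # reload the driver so the new value takes effect
more /sys/module/mlx4_core/parameters/log_num_mtt   # verify the new value
```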
Is there a reason to prefer altering log_mtts_per_seg? The current value is:
$ more /sys/module/mlx4_core/parameters/log_mtts_per_seg
0
which I assume means it's defaulting to 3. Our nodes have 128GB of
memory, and 4 GPUs (16GB GPU memory total), so my guess is that the
factor of 16 increase gained by altering log_num_mtt does not
completely fix the issue.
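For reference, the back-of-the-envelope estimate we're working from, assuming the commonly cited mlx4 formula (max_reg_mem = 2^log_num_mtt * 2^log_mtts_per_seg * page_size) and 4 KiB pages:

```shell
# Estimate of registerable memory in GiB, assuming the commonly cited
# mlx4 formula: max_reg_mem = 2^log_num_mtt * 2^log_mtts_per_seg * page_size
page_size=4096
# presumed defaults (log_num_mtt=20, log_mtts_per_seg=3):
echo $(( (1 << 20) * (1 << 3) * page_size / (1 << 30) ))   # 32 (GiB)
# with log_num_mtt=24 and log_mtts_per_seg left at 3:
echo $(( (1 << 24) * (1 << 3) * page_size / (1 << 30) ))   # 512 (GiB)
```

If those assumed defaults are right, 32 GiB would not cover our 128 GB of RAM plus 16 GB of GPU memory, while 512 GiB does; that would explain why bumping log_num_mtt alone was enough here.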
Thanks a lot for your help.
Best,
Adam
On Mon, Feb 11, 2013 at 2:46 PM, Devendar Bureddy
<bureddy at cse.ohio-state.edu> wrote:
> Hi Brody
>
> It seems it is hitting a limit on the amount of memory that can be
> registered with the HCA. Can you provide the following details?
>
> - Is lockable memory set to unlimited on compute nodes?
> $ ulimit -l
> unlimited
>
> - How much RAM do these nodes have? Can you check the OFED parameter
> log_mtts_per_seg? With most standard OFED installations, the
> default value of this parameter is '3'. If your system has more than
> 16GB, you need to set this parameter to '4' or more.
>
> $ more /sys/module/mlx4_core/parameters/log_mtts_per_seg
> 3
>
> - What is the size of the cudaHostRegister() buffer you mentioned?
>
> - What version of MVAPICH2 are you using, and with what configuration options?
>
> -Devendar
>
> On Mon, Feb 11, 2013 at 2:01 PM, Brody Huval <brodyh at stanford.edu> wrote:
>> Hi,
>>
>> Our job is running on 64 GPUs (64 MPI nodes) in a small cluster with
>> ConnectX3 IB adapters. We've been running into abort() calls that
>> bring down the system after perhaps 10 or 15 minutes of running with
>> the following error:
>>
>> [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 540] Cannot register vbuf region
>> [8] Abort: vbuf pool allocation failed at line 607 in file
>> src/mpid/ch3/channels/mrail/src/gen2/vbuf.c
>>
>> Unfortunately, MV2_DEBUG_SHOW_BACKTRACE hasn't shown us anything useful
>> here, so we're still hunting for the call that's triggering this. We
>> have tried a suggested solution from the archives, setting
>> MV2_USE_LAZY_MEM_UNREGISTER to 0, but this leads to an immediate
>> crash. Our code does not make significant use of pinned memory,
>> though in the one place that we do use it, it is done with
>> cudaHostRegister(), and this buffer is not touched by MPI.
>>
>> This problem cropped up just recently as we've moved to larger problem
>> sizes (and thus larger message sizes). Previous runs with smaller
>> models have worked just fine.
>>
>> Do you have advice on how to find the problem, or a possible solution? Thank you in advance for any help.
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>
>
> --
> Devendar