[mvapich-discuss] vbuf pool allocation failure
Adam Coates
acoates at cs.stanford.edu
Mon Feb 11 15:20:35 EST 2013
Hi Devendar,
Thanks for the reply; we might have it fixed over here, but just to
sanity check (and to populate your discussion list for posterity):
Some outputs from us:
$ mpiname -a
MVAPICH2 1.9a2 Thu Nov 8 11:43:52 EST 2012 ch3:mrail
..<snip>..
Configuration
--with-cuda=/usr/local/cuda-5.0 --enable-cuda
$ ulimit -l
unlimited
[We're not using PBS, etc.; so that should be good.]
Following other notes on the list + your suggestion, we set
log_num_mtt=24 for the mlx4_core module and reloaded the driver, which
appears to have fixed our issue.
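For anyone finding this thread later, here's a sketch of how the change can be made persistent. The file path and restart command below are the typical RHEL-style OFED locations, not something from our exact setup; adjust for your distro:

```shell
# Sketch: make log_num_mtt=24 persist across reboots (typical RHEL-style
# paths; the modprobe.d filename and restart script may differ on your distro)
echo "options mlx4_core log_num_mtt=24" > /etc/modprobe.d/mlx4_core.conf
/etc/init.d/openibd restart   # reload the driver so the new value takes effect
more /sys/module/mlx4_core/parameters/log_num_mtt   # verify the new value
```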
Is there a reason to prefer altering log_mtts_per_seg? The current value is:
$ more /sys/module/mlx4_core/parameters/log_mtts_per_seg
0
which I assume means it's defaulting to 3. Our nodes have 128GB of
memory, and 4 GPUs (16GB GPU memory total), so my guess is that the
factor of 16 increase gained by altering log_num_mtt does not
completely fix the issue.
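For reference, the back-of-the-envelope estimate we're working from, assuming the commonly cited mlx4 formula (max_reg_mem = 2^log_num_mtt * 2^log_mtts_per_seg * page_size) and 4 KiB pages:

```shell
# Estimate of registerable memory in GiB, assuming the commonly cited
# mlx4 formula: max_reg_mem = 2^log_num_mtt * 2^log_mtts_per_seg * page_size
page_size=4096
# presumed defaults (log_num_mtt=20, log_mtts_per_seg=3):
echo $(( (1 << 20) * (1 << 3) * page_size / (1 << 30) ))   # 32 (GiB)
# with log_num_mtt=24 and log_mtts_per_seg left at 3:
echo $(( (1 << 24) * (1 << 3) * page_size / (1 << 30) ))   # 512 (GiB)
```

If those assumed defaults are right, 32 GiB would not cover our 128 GB of RAM plus 16 GB of GPU memory, while 512 GiB does; that would explain why bumping log_num_mtt alone was enough here.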
Thanks a lot for your help.
Best,
Adam
On Mon, Feb 11, 2013 at 2:46 PM, Devendar Bureddy
<bureddy at cse.ohio-state.edu> wrote:
> Hi Brody
>
> It seems it is hitting a limit on the amount of memory that can be
> registered with the HCA. Can you provide the following details?
>
> - Is lockable memory set to unlimited on compute nodes?
> $ ulimit -l
> unlimited
>
> - How much RAM do these nodes have? Can you check the OFED parameter
> log_mtts_per_seg? With most standard OFED installations, the
> default value of this parameter is '3'. If your system has more than
> 16GB, you need to set this parameter to '4' or more.
>
> $ more /sys/module/mlx4_core/parameters/log_mtts_per_seg
> 3
>
> - What is the size of the cudaHostRegister() buffer you mentioned?
>
> - What version of MVAPICH2 are you using, and with what configuration options?
>
> -Devendar
>
> On Mon, Feb 11, 2013 at 2:01 PM, Brody Huval <brodyh at stanford.edu> wrote:
>> Hi,
>>
>> Our job is running on 64 GPUs (64 MPI nodes) in a small cluster with
>> ConnectX3 IB adapters. We've been running into abort() calls that
>> bring down the system after perhaps 10 or 15 minutes of running with
>> the following error:
>>
>> [src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 540] Cannot register vbuf region
>> [8] Abort: vbuf pool allocation failed at line 607 in file
>> src/mpid/ch3/channels/mrail/src/gen2/vbuf.c
>>
>> Unfortunately, MV2_DEBUG_SHOW_BACKTRACE hasn't shown us anything useful
>> here, so we're still hunting for the call that's triggering this. We
>> have tried a suggested solution from the archives, setting
>> MV2_USE_LAZY_MEM_UNREGISTER to 0, but this leads to an immediate
>> crash. Our code does not make significant use of pinned memory,
>> though in the one place that we do use it, it is done with
>> cudaHostRegister(), and this buffer is not touched by MPI.
>>
>> This problem cropped up just recently as we've moved to larger problem
>> sizes (and thus larger message sizes). Previous runs with smaller
>> models have worked just fine.
>>
>> Do you have advice on how to find the problem, or a possible solution? Thank you in advance for any help.
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>
>
> --
> Devendar