[mvapich-discuss] vbuf pool allocation failure

Mon Feb 11 14:01:19 EST 2013

Hi,

Our job is running on 64 GPUs (64 MPI nodes) in a small cluster with
ConnectX3 IB adapters.  We've been running into abort() calls that
bring down the system after perhaps 10 or 15 minutes of running with
the following error:

[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 540] Cannot register vbuf region
[8] Abort: vbuf pool allocation failed at line 607 in file
src/mpid/ch3/channels/mrail/src/gen2/vbuf.c

Unfortunately, MV2_DEBUG_SHOW_BACKTRACE hasn't shown us anything useful
here, so we're still hunting for the call that's triggering this.  We
have tried a suggested solution from the archives, setting
MV2_USE_LAZY_MEM_UNREGISTER to 0, but this leads to an immediate
crash.  Our code does not make significant use of pinned memory,
though in the one place that we do use it, it is done with
cudaHostRegister(), and this buffer is not touched by MPI.
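
One more thing we can check on our side: registering a buffer with the
HCA pins it, so the locked-memory limit seen by the MPI processes may
matter here.  This is only a guess at a possible cause; nothing in the
error message confirms it.  A minimal sketch for printing the limit on
a node (the helper names format_limit and memlock_limits are ours, not
part of MVAPICH2):

```python
import resource


def format_limit(value):
    """Render an rlimit value (in bytes) as a readable string."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value / (1024 * 1024):.1f} MiB"


def memlock_limits():
    """Return the (soft, hard) RLIMIT_MEMLOCK values as readable strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print(f"RLIMIT_MEMLOCK soft={soft} hard={hard}")
```

This would need to be run through the same launcher as the job, since
the limit inside a launched process can differ from the one in a login
shell.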

This problem cropped up just recently as we've moved to larger problem
sizes (and thus larger message sizes).  Previous runs with smaller
models have worked just fine.

Do you have advice on how to find the problem, or a possible solution?
Thank you in advance for any help.