[mvapich-discuss] Simple problem with MVAPICH2-X 2.3-1? "VBUF CUDA region allocation failed"

Riebs, Andy andy.riebs at hpe.com
Tue Jun 16 17:50:10 EDT 2020


Summary: Attempts to run MPI jobs with 2 or more nodes return "VBUF CUDA region allocation failed" on a cluster with no GPUs.

Long form:

I tried to install simple MPI support with the following commands:

$ cd ./mvapich2
$  rpm2cpio ~/tmp/mvapich2-x/mvapich2-x-mofed4.5-gnu4.8.5-2.3-1.el7/mvapich2-x-basic-mofed4.5-gnu4.8.5-slurm-2.3-1.el7.x86_64.rpm | cpio -id
$  mv ./opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm/*  ./2.3-1
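
(For completeness, the environment points at the extracted tree along these lines; I'm reconstructing the exact export commands here, so treat them as illustrative rather than verbatim:)

$ export PATH=$HOME/mvapich2/2.3-1/bin:$PATH
$ export LD_LIBRARY_PATH=$HOME/mvapich2/2.3-1/lib64:$LD_LIBRARY_PATH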

Compiling a simple MPI "hello world" works fine, and it runs fine on a single node:

$ cat mpi_hello.c
#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

int
main(int argc, char *argv[])
{
        int             rank, size, len;
        char            name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        MPI_Get_processor_name(name, &len);
        printf ("Hello world! I'm %d of %d on %s\n", rank, size, name);

        MPI_Finalize();
        exit(0);
}
$ mpicc -o a.out mpi_hello.c
$ srun -N1 ./a.out
Hello world! I'm 0 of 1 on node01
$

But it fails when I try to run on 2 or more nodes:

$ srun -N2 ./a.out
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 555] Cannot register vbuf region
[node02:mpi_rank_1][allocate_vbufs] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:788: VBUF CUDA region allocation failed.
: Invalid argument (22)
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 555] Cannot register vbuf region
[node01:mpi_rank_0][allocate_vbufs] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:788: VBUF CUDA region allocation failed.
: Invalid argument (22)
srun: error: node02: task 1: Exited with exit code 255
srun: Terminating job step 1799.0
$ which mpicc
/home/riebs/mvapich2/2.3-1/bin/mpicc
$ ls /home/riebs/mvapich2/2.3-1/
bin  etc  include  lib64  share
$ echo $LD_LIBRARY_PATH
/opt/mellanox/sharp/lib:/home/riebs/mvapich2/2.3-1/lib64:/opt/slurm/18.08.5-2/lib64:/opt/slurm/18.08.5-2/lib
$

The environment:
- CentOS 7.4
- MOFED 4.2
- Arch x86_64
- mvapich2:
$ mpichversion
MVAPICH2 Version:       2.3
MVAPICH2 Release date:  Mon June 8 22:00:00 EST 2020
MVAPICH2 Device:        ch3:mrail
MVAPICH2 configure:     --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm --exec-prefix=/opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm --bindir=/opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm/bin --sbindir=/opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm/sbin --sysconfdir=/opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm/etc --datadir=/opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm/share --includedir=/opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm/include --libdir=/opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm/lib64 --libexecdir=/opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm/share/man --infodir=/opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm/share/info CC=gcc CXX=g++ F77=gfortran FC=gfortran --disable-gl --enable-fortran=yes --enable-cxx=yes --enable-romio --with-ch3-rank-bits=32 --enable-ucr --disable-rpath --disable-static --enable-shared --disable-rdma-cm --without-hydra-ckpointlib --with-pm=slurm --with-pmi=pmi1 --enable-mpit-tool --enable-hybrid CPPFLAGS= CFLAGS=-pipe CXXFLAGS= FFLAGS= FCFLAGS= LDFLAGS=-Wl,-rpath,XORIGIN/placeholder
MVAPICH2 CC:    gcc -pipe     -DNDEBUG -DNVALGRIND -O2
MVAPICH2 CXX:   g++   -DNDEBUG -DNVALGRIND -O2
MVAPICH2 F77:   gfortran   -O2
MVAPICH2 FC:    gfortran   -O2
$

There are no GPUs installed on the cluster, and no sign of CUDA in the environment. I tried setting MV2_USE_CUDA=0, but that didn't help.
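
(For what it's worth, I passed the variable on the srun command line, roughly like this; the exact invocation is from memory, so take the syntax as approximate:)

$ MV2_USE_CUDA=0 srun -N2 ./a.out
$ # or, exporting it explicitly through Slurm:
$ srun --export=ALL,MV2_USE_CUDA=0 -N2 ./a.out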

It seems that I must be missing something pretty obvious here, but I'm not seeing it.

Any suggestions?

Andy

--
Andy Riebs
andy.riebs at hpe.com
Hewlett Packard Enterprise
High Performance Computing Software Engineering
