[mvapich-discuss] Simple problem with MVAPICH2-X 2.3-1? "VBUF CUDA region allocation failed"
Riebs, Andy
andy.riebs at hpe.com
Tue Jun 16 17:50:10 EDT 2020
Summary: Attempts to run MPI jobs with 2 or more nodes return "VBUF CUDA region allocation failed" on a cluster with no GPUs.
Long form:
I tried to install basic MPI support with the following commands:
$ cd ./mvapich2
$ rpm2cpio ~/tmp/mvapich2-x/mvapich2-x-mofed4.5-gnu4.8.5-2.3-1.el7/mvapich2-x-basic-mofed4.5-gnu4.8.5-slurm-2.3-1.el7.x86_64.rpm | cpio -id
$ mv ./opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm/* ./2.3-1
A simple MPI "hello world" compiles cleanly and runs correctly on a single node:
$ cat mpi_hello.c
#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

int
main(int argc, char *argv[])
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("Hello world! I'm %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    exit(0);
}
$ mpicc -o a.out mpi_hello.c
$ srun -N1 ./a.out
Hello world! I'm 0 of 1 on node01
$
But it fails when I try to run on 2 or more nodes:
$ srun -N2 ./a.out
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 555] Cannot register vbuf region
[node02:mpi_rank_1][allocate_vbufs] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:788: VBUF CUDA region allocation failed.
: Invalid argument (22)
[src/mpid/ch3/channels/mrail/src/gen2/vbuf.c 555] Cannot register vbuf region
[node01:mpi_rank_0][allocate_vbufs] src/mpid/ch3/channels/mrail/src/gen2/vbuf.c:788: VBUF CUDA region allocation failed.
: Invalid argument (22)
srun: error: node02: task 1: Exited with exit code 255
srun: Terminating job step 1799.0
$ which mpicc
/home/riebs/mvapich2/2.3-1/bin/mpicc
$ ls /home/riebs/mvapich2/2.3-1/
bin etc include lib64 share
$ echo $LD_LIBRARY_PATH
/opt/mellanox/sharp/lib:/home/riebs/mvapich2/2.3-1/lib64:/opt/slurm/18.08.5-2/lib64:/opt/slurm/18.08.5-2/lib
$
The environment:
- CentOS 7.4
- MOFED 4.2
- Arch x86_64
- mvapich2:
$ mpichversion
MVAPICH2 Version: 2.3
MVAPICH2 Release date: Mon June 8 22:00:00 EST 2020
MVAPICH2 Device: ch3:mrail
MVAPICH2 configure: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm --exec-prefix=/opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm --bindir=/opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm/bin --sbindir=/opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm/sbin --sysconfdir=/opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm/etc --datadir=/opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm/share --includedir=/opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm/include --libdir=/opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm/lib64 --libexecdir=/opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm/share/man --infodir=/opt/mvapich2-x/gnu4.8.5/mofed4.5/basic/slurm/share/info CC=gcc CXX=g++ F77=gfortran FC=gfortran --disable-gl --enable-fortran=yes --enable-cxx=yes --enable-romio --with-ch3-rank-bits=32 --enable-ucr --disable-rpath --disable-static --enable-shared --disable-rdma-cm --without-hydra-ckpointlib --with-pm=slurm --with-pmi=pmi1 --enable-mpit-tool --enable-hybrid CPPFLAGS= CFLAGS=-pipe CXXFLAGS= FFLAGS= FCFLAGS= LDFLAGS=-Wl,-rpath,XORIGIN/placeholder
MVAPICH2 CC: gcc -pipe -DNDEBUG -DNVALGRIND -O2
MVAPICH2 CXX: g++ -DNDEBUG -DNVALGRIND -O2
MVAPICH2 F77: gfortran -O2
MVAPICH2 FC: gfortran -O2
$
There are no GPUs installed on the cluster, and no sign of CUDA in the environment. I tried specifying "MV2_USE_CUDA=0", but that didn't help.
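In case it matters, here is how I passed the variable (I'm assuming Slurm propagates the login-shell environment to the remote ranks, but I haven't verified that, so I also tried forcing it through srun's own export list):

$ MV2_USE_CUDA=0 srun -N2 ./a.out                 # rely on Slurm exporting the shell environment
$ srun --export=ALL,MV2_USE_CUDA=0 -N2 ./a.out    # explicitly add it to srun's export list

Both forms produced the same "VBUF CUDA region allocation failed" errors.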
It seems that I must be missing something pretty obvious here, but I'm not seeing it.
Any suggestions?
Andy
--
Andy Riebs
andy.riebs at hpe.com
Hewlett Packard Enterprise
High Performance Computing Software Engineering