[Mvapich-discuss] MVAPICH2 2.3.7-1 (and 2.3.6) "mv2_mad_dlopen_init" re: "Error opening libibmad.so: libibmad.so", GCC 10.4, CentOS 7.x
Ryan Novosielski
novosirj at rutgers.edu
Mon Jul 11 11:36:05 EDT 2022
Hi there,
I'm getting error messages when running an MPI job under SLURM (18.08)
with MVAPICH2. I assume this affects releases after 2.3.5, when the
following change was made:
NEW Remove dependency on underlying libibverbs, libibmad, libibumad, and
librdmacm libraries using dlopen
Here's what I'm seeing:
[novosirj at amarel-test2 mpihello]$ srun --mpi=pmi2 -n 4 ./mpi_hello_world.gcc-10.4.mvapich2-2.3.7-1
srun: job 20824691 queued and waiting for resources
srun: job 20824691 has been allocated resources
Error opening libibmad.so: libibmad.so: cannot open shared object file: No such file or directory.
mv2_mad_dlopen_init returned -1
Error opening libibmad.so: libibmad.so: cannot open shared object file: No such file or directory.
mv2_mad_dlopen_init returned -1
Error opening libibmad.so: libibmad.so: cannot open shared object file: No such file or directory.
mv2_mad_dlopen_init returned -1
Error opening libibmad.so: libibmad.so: cannot open shared object file: No such file or directory.
mv2_mad_dlopen_init returned -1
Hello world from processor slepner021.amarel.rutgers.edu, rank 1 out of 4 processors
Hello world from processor slepner021.amarel.rutgers.edu, rank 2 out of 4 processors
Hello world from processor slepner021.amarel.rutgers.edu, rank 3 out of 4 processors
Hello world from processor slepner009.amarel.rutgers.edu, rank 0 out of 4 processors
I don't see this on 2.3. MPI seems to be working, but I assume it's not
using InfiniBand?
The libraries do exist:
[novosirj at amarel-test2 mpihello]$ rpm -ql infiniband-diags | grep mad
/usr/lib64/libibmad.so.5
/usr/lib64/libibmad.so.5.5.0
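One thing worth checking here, as a sketch rather than a confirmed diagnosis: the error names the unversioned file libibmad.so, while the listing above shows only libibmad.so.5 and libibmad.so.5.5.0. dlopen() looks for the name it is given, so the versioned files alone would not satisfy a request for libibmad.so. The unversioned symlink typically ships in a -devel package; that part is an assumption about this system. The helper names below are made up for illustration.

```shell
#!/bin/sh
# list_so_names DIR BASE: print every file in DIR whose name starts with BASE.so
list_so_names() {
    for f in "$1"/"$2".so*; do
        # An unmatched glob stays literal and fails this test, so it is skipped.
        [ -e "$f" ] && basename "$f"
    done
    return 0
}

# has_exact_so DIR BASE: succeed only if DIR/BASE.so itself exists
has_exact_so() {
    [ -e "$1/$2.so" ]
}

if has_exact_so /usr/lib64 libibmad; then
    echo "libibmad.so present"
else
    echo "libibmad.so missing; found instead:"
    list_so_names /usr/lib64 libibmad
fi
```

If only the versioned names are printed, that would be consistent with the "cannot open shared object file" message even though the library package is installed.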
And while I assume it's normal to not see libibmad/libibumad in ldd -v
output anymore post 2.3.5 (and I don't), here's what I see on 2.3, just
to give you an idea of how it used to work:
[novosirj at amarel-test2 mpihello]$ ldd -v mpi_hello_world.gcc-10.4.mvapich2-2.3.7-1 | head -50
linux-vdso.so.1 => (0x00007fff07b0c000)
libmpi.so.12 => /opt/sw/packages/gcc-4_8/mvapich2/2.3/lib/libmpi.so.12 (0x00007f36e87d9000)
libc.so.6 => /lib64/libc.so.6 (0x00007f36e840b000)
libgfortran.so.3 => /lib64/libgfortran.so.3 (0x00007f36e80e9000)
libm.so.6 => /lib64/libm.so.6 (0x00007f36e7de7000)
libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f36e7bdb000)
libxml2.so.2 => /lib64/libxml2.so.2 (0x00007f36e7871000)
libibmad.so.5 => /lib64/libibmad.so.5 (0x00007f36e7656000)
librdmacm.so.1 => /lib64/librdmacm.so.1 (0x00007f36e743f000)
libibumad.so.3 => /lib64/libibumad.so.3 (0x00007f36e7236000)
libibverbs.so.1 => /lib64/libibverbs.so.1 (0x00007f36e701d000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f36e6e19000)
librt.so.1 => /lib64/librt.so.1 (0x00007f36e6c11000)
libpmi2.so.0 => /lib64/libpmi2.so.0 (0x00007f36e69f9000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f36e67dd000)
libgcc_s.so.1 => /opt/sw/packages/gcc/10.4/lib64/libgcc_s.so.1 (0x00007f36e65c5000)
libquadmath.so.0 => /opt/sw/packages/gcc/10.4/lib64/libquadmath.so.0 (0x00007f36e637e000)
/lib64/ld-linux-x86-64.so.2 (0x00007f36e8f40000)
libz.so.1 => /lib64/libz.so.1 (0x00007f36e6168000)
liblzma.so.5 => /lib64/liblzma.so.5 (0x00007f36e5f42000)
libosmcomp.so.4 => /lib64/libosmcomp.so.4 (0x00007f36e5d33000)
libnl-route-3.so.200 => /lib64/libnl-route-3.so.200 (0x00007f36e5ac6000)
libnl-3.so.200 => /lib64/libnl-3.so.200 (0x00007f36e58a5000)
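For comparing builds, a small filter over ldd output makes the difference easier to see. This is only a sketch: ib_deps is a made-up helper name, and the binary path is the one used in this message. Given the changelog entry quoted above, a binary linked against a build that loads these libraries through dlopen would be expected to print nothing, while a 2.3 build prints the four lines seen in the listing.

```shell
#!/bin/sh
# ib_deps: read ldd output on stdin, print only the InfiniBand/RDMA library lines.
# "|| true" keeps the exit status 0 when nothing matches.
ib_deps() {
    grep -E 'lib(ibmad|ibumad|ibverbs|rdmacm)\.so' || true
}

bin=./mpi_hello_world.gcc-10.4.mvapich2-2.3.7-1
if [ -x "$bin" ]; then
    ldd "$bin" | ib_deps
fi
```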
What can/should I do about this?
Sometimes I see (not clear what conditions trigger it, but I have at
least one set of output running one of the OSU benchmarks):
Please retry with MV2_LIBIBMAD_PATH=<path/to/libibmad.so>
Setting that variable, as the error message suggests, doesn't seem like
a great idea; I'd expect this to be dealt with at compile time.
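For completeness, the runtime workaround from that message would look roughly like the sketch below. The variable name MV2_LIBIBMAD_PATH comes from the error text; whether it accepts a versioned file such as libibmad.so.5 is an assumption that has not been verified here, and first_existing is a made-up helper name.

```shell
#!/bin/sh
# first_existing PATH...: print the first argument that names an existing file
first_existing() {
    for p in "$@"; do
        if [ -e "$p" ]; then
            printf '%s\n' "$p"
            return 0
        fi
    done
    return 1
}

# Try the unversioned name from the error message first, then the
# versioned file shown in the rpm -ql listing.
if lib=$(first_existing /usr/lib64/libibmad.so /usr/lib64/libibmad.so.5); then
    export MV2_LIBIBMAD_PATH="$lib"
    echo "MV2_LIBIBMAD_PATH=$MV2_LIBIBMAD_PATH"
    # Then rerun the job exactly as before, e.g.:
    # srun --mpi=pmi2 -n 4 ./mpi_hello_world.gcc-10.4.mvapich2-2.3.7-1
else
    echo "no libibmad found in /usr/lib64" >&2
fi
```

This would have to be set in every job environment, which is why a build-time or packaging fix seems preferable.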
This is my build script; relatively uncomplicated:
[novosirj at amarel-test2 build]$ more ~/src/build-mvapich2-2.3.7-1-gcc-10.4.sh
#!/bin/sh
module purge
module load gcc/10.4
module list
export FFLAGS="-fallow-argument-mismatch"
../mvapich2-2.3.7-1/configure --with-pmi=pmi2 --with-pm=slurm \
--prefix=/opt/sw/packages/gcc-10/mvapich2/2.3.7-1 && \
make -j32 && make check && make install
And the configure process doesn't seem to point out anything amiss:
checking for the InfiniBand includes path... default
checking for the InfiniBand library path... default
checking for library containing shm_open... -lrt
checking infiniband/verbs.h usability... yes
checking infiniband/verbs.h presence... yes
checking for infiniband/verbs.h... yes
configure: checking checking for InfiniBand umad installation...
checking infiniband/umad.h usability... yes
checking infiniband/umad.h presence... yes
checking for infiniband/umad.h... yes
configure: InfiniBand libumad found
checking whether to enable hybrid communication channel... yes
configure: checking for RDMA CM support...
checking rdma/rdma_cma.h usability... yes
checking rdma/rdma_cma.h presence... yes
checking for rdma/rdma_cma.h... yes
configure: RDMA CM support enabled
configure: checking for hardware multicast support...
checking infiniband/mad.h usability... yes
checking infiniband/mad.h presence... yes
checking for infiniband/mad.h... yes
Thanks!
--
#BlackLivesMatter
____
|| \\UTGERS, |----------------------*O*------------------------
||_// the State | Ryan Novosielski - novosirj at rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
|| \\ of NJ | Office of Advanced Res. Comp. - MSB C630, Newark