[Mvapich-discuss] MVAPICH2 2.3.7-1 (and 2.3.6) "mv2_mad_dlopen_init" re: "Error opening libibmad.so: libibmad.so", GCC 10.4, CentOS 7.x

Ryan Novosielski novosirj at rutgers.edu
Mon Jul 11 11:36:05 EDT 2022



Hi there,

I'm getting error messages when running an MPI job under SLURM (18.08)
with MVAPICH2. I assume this affects releases after 2.3.5, when the
following change was made:

NEW Remove dependency on underlying libibverbs, libibmad, libibumad, and
librdmacm libraries using dlopen

Here's what I'm seeing:

[novosirj at amarel-test2 mpihello]$ srun --mpi=pmi2 -n 4 ./mpi_hello_world.gcc-10.4.mvapich2-2.3.7-1
srun: job 20824691 queued and waiting for resources
srun: job 20824691 has been allocated resources
Error opening libibmad.so: libibmad.so: cannot open shared object file: No such file or directory.
mv2_mad_dlopen_init returned -1
Error opening libibmad.so: libibmad.so: cannot open shared object file: No such file or directory.
mv2_mad_dlopen_init returned -1
Error opening libibmad.so: libibmad.so: cannot open shared object file: No such file or directory.
mv2_mad_dlopen_init returned -1
Error opening libibmad.so: libibmad.so: cannot open shared object file: No such file or directory.
mv2_mad_dlopen_init returned -1
Hello world from processor slepner021.amarel.rutgers.edu, rank 1 out of 4 processors
Hello world from processor slepner021.amarel.rutgers.edu, rank 2 out of 4 processors
Hello world from processor slepner021.amarel.rutgers.edu, rank 3 out of 4 processors
Hello world from processor slepner009.amarel.rutgers.edu, rank 0 out of 4 processors

I don't see this with 2.3. MPI seems to be working, but I assume it's not
using InfiniBand?

The libraries do exist:

[novosirj at amarel-test2 mpihello]$ rpm -ql infiniband-diags | grep mad
/usr/lib64/libibmad.so.5
/usr/lib64/libibmad.so.5.5.0
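
One thing worth checking here: the error names the unversioned libibmad.so, while the listing above only shows versioned files. The following is a minimal sketch of such a check, assuming the library is looked up under exactly the name printed in the error. The function name has_unversioned_so is hypothetical (not part of MVAPICH2), and /usr/lib64 is simply the directory from the rpm output.

```shell
#!/bin/sh
# Hypothetical helper (not part of MVAPICH2): report whether DIR contains
# the unversioned file lib<NAME>.so; if not, list any versioned files.
has_unversioned_so() {
    dir=$1
    name=$2
    if [ -e "$dir/lib$name.so" ]; then
        echo "found: $dir/lib$name.so"
        return 0
    fi
    echo "missing: $dir/lib$name.so"
    # Show versioned candidates such as lib<NAME>.so.5, if any exist.
    for f in "$dir/lib$name.so."*; do
        [ -e "$f" ] && echo "versioned only: $f"
    done
    return 1
}

# Example call; the directory is taken from the rpm -ql output above.
has_unversioned_so /usr/lib64 ibmad || true
```

If only versioned files are reported, that would be consistent with the "cannot open shared object file" message, though it does not by itself say how MVAPICH2 should be pointed at them.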

And while I assume it's normal to not see libibmad/libibumad in ldd -v
output anymore post 2.3.5 (and I don't), here's what I see on 2.3, just
to give you an idea of how it used to work:

[novosirj at amarel-test2 mpihello]$ ldd -v mpi_hello_world.gcc-10.4.mvapich2-2.3.7-1 | head -50
          linux-vdso.so.1 =>  (0x00007fff07b0c000)
          libmpi.so.12 => /opt/sw/packages/gcc-4_8/mvapich2/2.3/lib/libmpi.so.12 (0x00007f36e87d9000)
          libc.so.6 => /lib64/libc.so.6 (0x00007f36e840b000)
          libgfortran.so.3 => /lib64/libgfortran.so.3 (0x00007f36e80e9000)
          libm.so.6 => /lib64/libm.so.6 (0x00007f36e7de7000)
          libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f36e7bdb000)
          libxml2.so.2 => /lib64/libxml2.so.2 (0x00007f36e7871000)
          libibmad.so.5 => /lib64/libibmad.so.5 (0x00007f36e7656000)
          librdmacm.so.1 => /lib64/librdmacm.so.1 (0x00007f36e743f000)
          libibumad.so.3 => /lib64/libibumad.so.3 (0x00007f36e7236000)
          libibverbs.so.1 => /lib64/libibverbs.so.1 (0x00007f36e701d000)
          libdl.so.2 => /lib64/libdl.so.2 (0x00007f36e6e19000)
          librt.so.1 => /lib64/librt.so.1 (0x00007f36e6c11000)
          libpmi2.so.0 => /lib64/libpmi2.so.0 (0x00007f36e69f9000)
          libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f36e67dd000)
          libgcc_s.so.1 => /opt/sw/packages/gcc/10.4/lib64/libgcc_s.so.1 (0x00007f36e65c5000)
          libquadmath.so.0 => /opt/sw/packages/gcc/10.4/lib64/libquadmath.so.0 (0x00007f36e637e000)
          /lib64/ld-linux-x86-64.so.2 (0x00007f36e8f40000)
          libz.so.1 => /lib64/libz.so.1 (0x00007f36e6168000)
          liblzma.so.5 => /lib64/liblzma.so.5 (0x00007f36e5f42000)
          libosmcomp.so.4 => /lib64/libosmcomp.so.4 (0x00007f36e5d33000)
          libnl-route-3.so.200 => /lib64/libnl-route-3.so.200 (0x00007f36e5ac6000)
          libnl-3.so.200 => /lib64/libnl-3.so.200 (0x00007f36e58a5000)

What can/should I do about this?

Sometimes I also see the following (it's not clear what conditions
trigger it, but I have at least one set of output from running one of
the OSU benchmarks):

Please retry with MV2_LIBIBMAD_PATH=<path/to/libibmad.so>

Setting that variable at run time, as the error message suggests, doesn't
seem like a great idea; it seems like this should be dealt with at
compile time.
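
For reference, the run-time workaround named in that message could look roughly like the sketch below. MV2_LIBIBMAD_PATH comes from the message itself and the candidate paths from the rpm output earlier; the helper first_existing is hypothetical, and whether MVAPICH2 accepts a versioned file name in this variable is an assumption, not something verified here.

```shell
#!/bin/sh
# Hypothetical helper: print the first existing file among the arguments.
first_existing() {
    for candidate in "$@"; do
        if [ -e "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Candidate paths come from the rpm -ql output; accepting a versioned
# file name in MV2_LIBIBMAD_PATH is an assumption.
if lib=$(first_existing /usr/lib64/libibmad.so /usr/lib64/libibmad.so.5); then
    export MV2_LIBIBMAD_PATH="$lib"
    echo "MV2_LIBIBMAD_PATH=$MV2_LIBIBMAD_PATH"
    # Then launch as before, e.g.:
    #   srun --mpi=pmi2 -n 4 ./mpi_hello_world.gcc-10.4.mvapich2-2.3.7-1
else
    echo "no libibmad candidate found; MV2_LIBIBMAD_PATH left unset"
fi
```

This only shows the mechanics of setting the variable for one job; it does not address the compile-time question.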

This is my build script; relatively uncomplicated:

[novosirj at amarel-test2 build]$ more ~/src/build-mvapich2-2.3.7-1-gcc-10.4.sh
#!/bin/sh

module purge
module load gcc/10.4
module list

export FFLAGS="-fallow-argument-mismatch"

../mvapich2-2.3.7-1/configure --with-pmi=pmi2 --with-pm=slurm \
         --prefix=/opt/sw/packages/gcc-10/mvapich2/2.3.7-1 && \
         make -j32 && make check && make install

And the configure process doesn't seem to point out anything amiss:

checking for the InfiniBand includes path... default
checking for the InfiniBand library path... default
checking for library containing shm_open... -lrt
checking infiniband/verbs.h usability... yes
checking infiniband/verbs.h presence... yes
checking for infiniband/verbs.h... yes
configure: checking checking for InfiniBand umad installation...
checking infiniband/umad.h usability... yes
checking infiniband/umad.h presence... yes
checking for infiniband/umad.h... yes
configure: InfiniBand libumad found
checking whether to enable hybrid communication channel... yes
configure: checking for RDMA CM support...
checking rdma/rdma_cma.h usability... yes
checking rdma/rdma_cma.h presence... yes
checking for rdma/rdma_cma.h... yes
configure: RDMA CM support enabled
configure: checking for hardware multicast support...
checking infiniband/mad.h usability... yes
checking infiniband/mad.h presence... yes
checking for infiniband/mad.h... yes

Thanks!

-- 
#BlackLivesMatter
____
  || \\UTGERS,     |----------------------*O*------------------------
  ||_// the State  |    Ryan Novosielski - novosirj at rutgers.edu
  || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
  ||  \\    of NJ  | Office of Advanced Res. Comp. - MSB C630, Newark


