[Mvapich-discuss] MVAPICH2 2.3.7-1 (and 2.3.6) "mv2_mad_dlopen_init" re: "Error opening libibmad.so: libibmad.so", GCC 10.4, CentOS 7.x

Shineman, Nat shineman.5 at osu.edu
Mon Jul 11 16:32:57 EDT 2022


Hi Ryan,

This is interesting. You are correct that we no longer link directly to the IB libraries; this allows MVAPICH2 to run on SMP-only machines without the IB libraries installed. Instead, we try to open them dynamically at runtime, as needed, from within the library. The environment variable named in the error message is only a fallback and should be necessary only if the library is not available on the standard LD_LIBRARY_PATH. However, it looks like yours should be found there, in /usr/lib64; is there any chance that the library is on a different path on the compute nodes than on the head node? I will try to reproduce this and see if I can figure out why it fails to open your libibmad.so.
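
One way to compare the head node with a compute node is to list which libibmad names actually exist in the library directory on each. The sketch below is only an illustration: find_ibmad is a hypothetical helper name, and /usr/lib64 is assumed as the default because that is where the rpm -ql output in the original post places the library.

```shell
#!/bin/sh
# Sketch: print every libibmad name found in a directory.
# find_ibmad is a hypothetical helper, not part of MVAPICH2.
# The default directory (/usr/lib64) is an assumption taken from the thread.
find_ibmad() {
    dir="${1:-/usr/lib64}"
    found=1    # exit status: 1 until at least one file is seen
    # Check the unversioned name first, then any versioned names.
    for f in "$dir"/libibmad.so "$dir"/libibmad.so.*; do
        if [ -e "$f" ]; then
            echo "$f"
            found=0
        fi
    done
    return "$found"
}
```

If the function is saved in a script that ends with a call such as find_ibmad "$@", running that script directly and again through srun -n 1 gives the two listings to compare. If only a versioned name such as libibmad.so.5 appears, setting MV2_LIBIBMAD_PATH to that file is the fallback the error message points to; whether a versioned file name is accepted there is an assumption that has not been verified here.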

To go back to the older linking process, please try adding --disable-ibv-dlopen to your configure line. Can you try that and let us know if it works for you?
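
For illustration, the configure line from the build script quoted later in this thread would look roughly like this with the flag added. This is a sketch under assumptions: configure_cmd is a hypothetical helper that only prints the command, and SRC_DIR and PREFIX simply repeat the paths from the original post.

```shell
#!/bin/sh
# Sketch: the configure invocation from the thread plus --disable-ibv-dlopen.
# SRC_DIR and PREFIX repeat the paths quoted in the original post; adjust as needed.
SRC_DIR="../mvapich2-2.3.7-1"
PREFIX="/opt/sw/packages/gcc-10/mvapich2/2.3.7-1"

# configure_cmd is a hypothetical helper: it prints the command, it does not run it.
configure_cmd() {
    printf '%s --with-pmi=pmi2 --with-pm=slurm --disable-ibv-dlopen --prefix=%s\n' \
        "$SRC_DIR/configure" "$PREFIX"
}
```

Calling configure_cmd shows the line to run from the build directory, followed by the same make -j32 && make check && make install steps as before. With the older linking restored, libibmad and the other IB libraries would be expected to show up in ldd output again, as in the 2.3 listing quoted in the original post.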

Thanks,
Nat


________________________________
From: Mvapich-discuss <mvapich-discuss-bounces+shineman.5=osu.edu at lists.osu.edu> on behalf of Ryan Novosielski via Mvapich-discuss <mvapich-discuss at lists.osu.edu>
Sent: Monday, July 11, 2022 11:36
To: mvapich-discuss at lists.osu.edu <mvapich-discuss at lists.osu.edu>
Subject: [Mvapich-discuss] MVAPICH2 2.3.7-1 (and 2.3.6) "mv2_mad_dlopen_init" re: "Error opening libibmad.so: libibmad.so", GCC 10.4, CentOS 7.x

Hi there,

I'm getting error messages when running an MPI job with SLURM (18.08)
using MVAPICH2, I assume, post 2.3.5, when the following change was made:

NEW Remove dependency on underlying libibverbs, libibmad, libibumad, and
librdmacm libraries using dlopen

Here's what I'm seeing:

[novosirj at amarel-test2 mpihello]$ srun --mpi=pmi2 -n 4 ./mpi_hello_world.gcc-10.4.mvapich2-2.3.7-1
srun: job 20824691 queued and waiting for resources
srun: job 20824691 has been allocated resources
Error opening libibmad.so: libibmad.so: cannot open shared object file: No such file or directory.
mv2_mad_dlopen_init returned -1
Error opening libibmad.so: libibmad.so: cannot open shared object file: No such file or directory.
mv2_mad_dlopen_init returned -1
Error opening libibmad.so: libibmad.so: cannot open shared object file: No such file or directory.
mv2_mad_dlopen_init returned -1
Error opening libibmad.so: libibmad.so: cannot open shared object file: No such file or directory.
mv2_mad_dlopen_init returned -1
Hello world from processor slepner021.amarel.rutgers.edu, rank 1 out of 4 processors
Hello world from processor slepner021.amarel.rutgers.edu, rank 2 out of 4 processors
Hello world from processor slepner021.amarel.rutgers.edu, rank 3 out of 4 processors
Hello world from processor slepner009.amarel.rutgers.edu, rank 0 out of 4 processors

I don't see this on 2.3. MPI seems to be working, but I assume it's not
using Infiniband?

The libraries do exist:

[novosirj at amarel-test2 mpihello]$ rpm -ql infiniband-diags | grep mad
/usr/lib64/libibmad.so.5
/usr/lib64/libibmad.so.5.5.0

And while I assume it's normal to not see libibmad/libibumad in ldd -v
output anymore post 2.3.5 (and I don't), here's what I see on 2.3, just
to give you an idea of how it used to work:

[novosirj at amarel-test2 mpihello]$ ldd -v mpi_hello_world.gcc-10.4.mvapich2-2.3.7-1 | head -50

          linux-vdso.so.1 =>  (0x00007fff07b0c000)
          libmpi.so.12 => /opt/sw/packages/gcc-4_8/mvapich2/2.3/lib/libmpi.so.12 (0x00007f36e87d9000)
          libc.so.6 => /lib64/libc.so.6 (0x00007f36e840b000)
          libgfortran.so.3 => /lib64/libgfortran.so.3 (0x00007f36e80e9000)
          libm.so.6 => /lib64/libm.so.6 (0x00007f36e7de7000)
          libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f36e7bdb000)
          libxml2.so.2 => /lib64/libxml2.so.2 (0x00007f36e7871000)
          libibmad.so.5 => /lib64/libibmad.so.5 (0x00007f36e7656000)
          librdmacm.so.1 => /lib64/librdmacm.so.1 (0x00007f36e743f000)
          libibumad.so.3 => /lib64/libibumad.so.3 (0x00007f36e7236000)
          libibverbs.so.1 => /lib64/libibverbs.so.1 (0x00007f36e701d000)
          libdl.so.2 => /lib64/libdl.so.2 (0x00007f36e6e19000)
          librt.so.1 => /lib64/librt.so.1 (0x00007f36e6c11000)
          libpmi2.so.0 => /lib64/libpmi2.so.0 (0x00007f36e69f9000)
          libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f36e67dd000)
          libgcc_s.so.1 => /opt/sw/packages/gcc/10.4/lib64/libgcc_s.so.1 (0x00007f36e65c5000)
          libquadmath.so.0 => /opt/sw/packages/gcc/10.4/lib64/libquadmath.so.0 (0x00007f36e637e000)
          /lib64/ld-linux-x86-64.so.2 (0x00007f36e8f40000)
          libz.so.1 => /lib64/libz.so.1 (0x00007f36e6168000)
          liblzma.so.5 => /lib64/liblzma.so.5 (0x00007f36e5f42000)
          libosmcomp.so.4 => /lib64/libosmcomp.so.4 (0x00007f36e5d33000)
          libnl-route-3.so.200 => /lib64/libnl-route-3.so.200 (0x00007f36e5ac6000)
          libnl-3.so.200 => /lib64/libnl-3.so.200 (0x00007f36e58a5000)

What can/should I do about this?

Sometimes I also see the following (it's not clear what conditions trigger
it, but I have at least one set of output from running one of the OSU
benchmarks):

Please retry with MV2_LIBIBMAD_PATH=<path/to/libibmad.so>

It seems like what's suggested in the error message is not a great
idea/this should be dealt with at compile time.

This is my build script; relatively uncomplicated:

[novosirj at amarel-test2 build]$ more ~/src/build-mvapich2-2.3.7-1-gcc-10.4.sh
#!/bin/sh

module purge
module load gcc/10.4
module list

export FFLAGS="-fallow-argument-mismatch"

../mvapich2-2.3.7-1/configure --with-pmi=pmi2 --with-pm=slurm \
         --prefix=/opt/sw/packages/gcc-10/mvapich2/2.3.7-1 && \
         make -j32 && make check && make install

And the configure process doesn't seem to point out anything amiss:

checking for the InfiniBand includes path... default
checking for the InfiniBand library path... default
checking for library containing shm_open... -lrt
checking infiniband/verbs.h usability... yes
checking infiniband/verbs.h presence... yes
checking for infiniband/verbs.h... yes
configure: checking checking for InfiniBand umad installation...
checking infiniband/umad.h usability... yes
checking infiniband/umad.h presence... yes
checking for infiniband/umad.h... yes
configure: InfiniBand libumad found
checking whether to enable hybrid communication channel... yes
configure: checking for RDMA CM support...
checking rdma/rdma_cma.h usability... yes
checking rdma/rdma_cma.h presence... yes
checking for rdma/rdma_cma.h... yes
configure: RDMA CM support enabled
configure: checking for hardware multicast support...
checking infiniband/mad.h usability... yes
checking infiniband/mad.h presence... yes
checking for infiniband/mad.h... yes

Thanks!

--
#BlackLivesMatter
____
  || \\UTGERS,     |----------------------*O*------------------------
  ||_// the State  |    Ryan Novosielski - novosirj at rutgers.edu
  || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
  ||  \\    of NJ  | Office of Advanced Res. Comp. - MSB C630, Newark