[Mvapich-discuss] MVAPICH2 2.3.7-1 (and 2.3.6) "mv2_mad_dlopen_init" re: "Error opening libibmad.so: libibmad.so", GCC 10.4, CentOS 7.x
Ryan Novosielski
novosirj at rutgers.edu
Mon Jul 11 17:55:56 EDT 2022
Aha!
[root at amarel1 ~]# find /usr/lib64 -xdev -name "libibmad*"
/usr/lib64/libibmad.so.5.5.0
/usr/lib64/libibmad.so.5
/usr/lib64/libibmad.so
[root at amarel1 ~]# ssh slepner021 find /usr/lib64 -xdev -name "libibmad*"
/usr/lib64/libibmad.so.5.5.0
/usr/lib64/libibmad.so.5
[root at amarel1 ~]# rpm -qa | grep infiniband-diags
infiniband-diags-devel-2.1.0-1.el7.x86_64
infiniband-diags-2.1.0-1.el7.x86_64
[root at amarel1 ~]# ssh slepner021 rpm -qa | grep infiniband-diags
infiniband-diags-2.1.0-1.el7.x86_64
This is a pretty common thing we do -- have a more complete set of
development libraries, etc., on our login nodes than on the compute
nodes. So the unversioned libibmad.so symlink exists on the login node
(it comes from infiniband-diags-devel) but not on slepner021, which is
why opening "libibmad.so" fails there. I'm kind of surprised to see that
this sort of packaging is normal -- that the -devel package contains the
unversioned *.so link to the library in the non-devel package. I never
noticed, and I've never hit this problem before with other software.
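
A quick way to check a node for this, as a rough sketch only:
check_dlopen_name is a made-up helper name (not part of MVAPICH2 or
infiniband-diags), and it assumes the library is looked up by its
unversioned name in the directory you pass in.

```shell
#!/bin/sh
# Hypothetical helper: report whether the unversioned name (libfoo.so)
# exists in a directory, or only versioned files (libfoo.so.N).
#   $1 = library directory, e.g. /usr/lib64
#   $2 = library base name, e.g. libibmad
check_dlopen_name() {
    dir=$1
    base=$2
    if [ -e "$dir/$base.so" ]; then
        echo "ok: $dir/$base.so"
        return 0
    fi
    # No unversioned name; list whatever versioned files are present.
    for f in "$dir/$base".so.*; do
        if [ -e "$f" ]; then
            echo "versioned only: $f"
        fi
    done
    return 1
}
```

Running "check_dlopen_name /usr/lib64 libibmad" on a login node and on
a compute node should show the difference seen in the find output above.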
On 7/11/22 16:32, Shineman, Nat wrote:
> Hi Ryan,
>
> This is interesting. You are correct that we no longer link directly to
> the ib libraries; this is to allow MVAPICH2 to run on SMP-only machines
> without needing to install the ib libraries. Instead, we try to
> dynamically open them as needed from within the library at runtime. The
> environment variable provided in the error message is there only as a
> fallback and should only be necessary if the library is not available on
> the standard |LD_LIBRARY_PATH|. However, it looks like yours should be
> available on there from |/usr/lib64|; is there any chance that the
> library is on a different path on the compute nodes than it is on the
> head node? I will try to reproduce this and see if I can figure out why
> it would be failing to open your libibmad.so.
>
> To go back to using the older linking process, please try adding
> |--disable-ibv-dlopen| to your configure line. Can you try that and let
> us know if it works for you?
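
For reference, a sketch of what the configure invocation would look like
with that flag added to the arguments from the build script quoted
further down. Only --disable-ibv-dlopen is new; configure_cmd is just an
illustrative wrapper name, and this prints the command for review rather
than running it.

```shell
#!/bin/sh
# Paths taken from the build script quoted later in this message.
SRC_DIR=../mvapich2-2.3.7-1
PREFIX=/opt/sw/packages/gcc-10/mvapich2/2.3.7-1

# Illustrative wrapper: emit the configure command line with the
# suggested --disable-ibv-dlopen flag added to the original arguments.
configure_cmd() {
    printf '%s\n' "$SRC_DIR/configure --with-pmi=pmi2 --with-pm=slurm --disable-ibv-dlopen --prefix=$PREFIX"
}

# Print the command so it can be checked before use.
configure_cmd
```

The rest of the build script (module loads, FFLAGS, make steps) would
stay as it is.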
>
> Thanks,
> Nat
>
>
> ------------------------------------------------------------------------
> *From:* Mvapich-discuss
> <mvapich-discuss-bounces+shineman.5=osu.edu at lists.osu.edu> on behalf of
> Ryan Novosielski via Mvapich-discuss <mvapich-discuss at lists.osu.edu>
> *Sent:* Monday, July 11, 2022 11:36
> *To:* mvapich-discuss at lists.osu.edu <mvapich-discuss at lists.osu.edu>
> *Subject:* [Mvapich-discuss] MVAPICH2 2.3.7-1 (and 2.3.6)
> "mv2_mad_dlopen_init" re: "Error opening libibmad.so: libibmad.so", GCC
> 10.4, CentOS 7.x
>
>
> Hi there,
>
> I'm getting error messages when running an MPI job with SLURM (18.08)
> using MVAPICH2 releases after 2.3.5, which I assume is when the
> following change was made:
>
> NEW Remove dependency on underlying libibverbs, libibmad, libibumad, and
> librdmacm libraries using dlopen
>
> Here's what I'm seeing:
>
> [novosirj at amarel-test2 mpihello]$ srun --mpi=pmi2 -n 4
> ./mpi_hello_world.gcc-10.4.mvapich2-2.3.7-1
> srun: job 20824691 queued and waiting for resources
> srun: job 20824691 has been allocated resources
> Error opening libibmad.so: libibmad.so: cannot open shared object file:
> No such file or directory.
> mv2_mad_dlopen_init returned -1
> Error opening libibmad.so: libibmad.so: cannot open shared object file:
> No such file or directory.
> mv2_mad_dlopen_init returned -1
> Error opening libibmad.so: libibmad.so: cannot open shared object file:
> No such file or directory.
> mv2_mad_dlopen_init returned -1
> Error opening libibmad.so: libibmad.so: cannot open shared object file:
> No such file or directory.
> mv2_mad_dlopen_init returned -1
> Hello world from processor slepner021.amarel.rutgers.edu, rank 1 out of
> 4 processors
> Hello world from processor slepner021.amarel.rutgers.edu, rank 2 out of
> 4 processors
> Hello world from processor slepner021.amarel.rutgers.edu, rank 3 out of
> 4 processors
> Hello world from processor slepner009.amarel.rutgers.edu, rank 0 out of
> 4 processors
>
> I don't see this on 2.3. MPI seems to be working, but I assume it's not
> using Infiniband?
>
> The libraries do exist:
>
> [novosirj at amarel-test2 mpihello]$ rpm -ql infiniband-diags | grep mad
> /usr/lib64/libibmad.so.5
> /usr/lib64/libibmad.so.5.5.0
>
> And while I assume it's normal to not see libibmad/libibumad in ldd -v
> output anymore post 2.3.5 (and I don't), here's what I see on 2.3, just
> to give you an idea of how it used to work:
>
> [novosirj at amarel-test2 mpihello]$ ldd -v
> mpi_hello_world.gcc-10.4.mvapich2-2.3.7-1 | head -50
>
> linux-vdso.so.1 => (0x00007fff07b0c000)
>
> libmpi.so.12 =>
> /opt/sw/packages/gcc-4_8/mvapich2/2.3/lib/libmpi.so.12 (0x00007f36e87d9000)
> libc.so.6 => /lib64/libc.so.6 (0x00007f36e840b000)
> libgfortran.so.3 => /lib64/libgfortran.so.3 (0x00007f36e80e9000)
> libm.so.6 => /lib64/libm.so.6 (0x00007f36e7de7000)
> libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f36e7bdb000)
> libxml2.so.2 => /lib64/libxml2.so.2 (0x00007f36e7871000)
> libibmad.so.5 => /lib64/libibmad.so.5 (0x00007f36e7656000)
> librdmacm.so.1 => /lib64/librdmacm.so.1 (0x00007f36e743f000)
> libibumad.so.3 => /lib64/libibumad.so.3 (0x00007f36e7236000)
> libibverbs.so.1 => /lib64/libibverbs.so.1 (0x00007f36e701d000)
> libdl.so.2 => /lib64/libdl.so.2 (0x00007f36e6e19000)
> librt.so.1 => /lib64/librt.so.1 (0x00007f36e6c11000)
> libpmi2.so.0 => /lib64/libpmi2.so.0 (0x00007f36e69f9000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f36e67dd000)
> libgcc_s.so.1 => /opt/sw/packages/gcc/10.4/lib64/libgcc_s.so.1
> (0x00007f36e65c5000)
> libquadmath.so.0 =>
> /opt/sw/packages/gcc/10.4/lib64/libquadmath.so.0 (0x00007f36e637e000)
> /lib64/ld-linux-x86-64.so.2 (0x00007f36e8f40000)
> libz.so.1 => /lib64/libz.so.1 (0x00007f36e6168000)
> liblzma.so.5 => /lib64/liblzma.so.5 (0x00007f36e5f42000)
> libosmcomp.so.4 => /lib64/libosmcomp.so.4 (0x00007f36e5d33000)
> libnl-route-3.so.200 => /lib64/libnl-route-3.so.200
> (0x00007f36e5ac6000)
> libnl-3.so.200 => /lib64/libnl-3.so.200 (0x00007f36e58a5000)
>
> What can/should I do about this?
> Sometimes I see (not clear what conditions trigger it, but I have at
> least one set of output running one of the OSU benchmarks):
>
> Please retry with MV2_LIBIBMAD_PATH=<path/to/libibmad.so>
>
> It seems like what's suggested in the error message is not a great
> idea/this should be dealt with at compile time.
>
> This is my build script; relatively uncomplicated:
>
> [novosirj at amarel-test2 build]$ more ~/src/build-mvapich2-2.3.7-1-gcc-10.4.sh
> #!/bin/sh
>
> module purge
> module load gcc/10.4
> module list
>
> export FFLAGS="-fallow-argument-mismatch"
>
> ../mvapich2-2.3.7-1/configure --with-pmi=pmi2 --with-pm=slurm
> --prefix=/opt/sw/packages/gcc-10/mvapich2/2.3.7-1 && \
> make -j32 && make check && make install
>
> And the configure process doesn't seem to point out anything amiss:
>
> checking for the InfiniBand includes path... default
> checking for the InfiniBand library path... default
> checking for library containing shm_open... -lrt
> checking infiniband/verbs.h usability... yes
> checking infiniband/verbs.h presence... yes
> checking for infiniband/verbs.h... yes
> configure: checking checking for InfiniBand umad installation...
> checking infiniband/umad.h usability... yes
> checking infiniband/umad.h presence... yes
> checking for infiniband/umad.h... yes
> configure: InfiniBand libumad found
> checking whether to enable hybrid communication channel... yes
> configure: checking for RDMA CM support...
> checking rdma/rdma_cma.h usability... yes
> checking rdma/rdma_cma.h presence... yes
> checking for rdma/rdma_cma.h... yes
> configure: RDMA CM support enabled
> configure: checking for hardware multicast support...
> checking infiniband/mad.h usability... yes
> checking infiniband/mad.h presence... yes
> checking for infiniband/mad.h... yes
>
> Thanks!
>
> --
> #BlackLivesMatter
> ____
> || \\UTGERS, |----------------------*O*------------------------
> ||_// the State | Ryan Novosielski - novosirj at rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
> || \\ of NJ | Office of Advanced Res. Comp. - MSB C630, Newark
--
#BlackLivesMatter
____
|| \\UTGERS, |----------------------*O*------------------------
||_// the State | Ryan Novosielski - novosirj at rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
|| \\ of NJ | Office of Advanced Res. Comp. - MSB C630, Newark
`'