[Mvapich-discuss] MVAPICH2 2.3.7-1 (and 2.3.6) "mv2_mad_dlopen_init" re: "Error opening libibmad.so: libibmad.so", GCC 10.4, CentOS 7.x

Ryan Novosielski novosirj at rutgers.edu
Mon Jul 11 17:55:56 EDT 2022



Aha!

[root at amarel1 ~]# find /usr/lib64 -xdev -name "libibmad*"
/usr/lib64/libibmad.so.5.5.0
/usr/lib64/libibmad.so.5
/usr/lib64/libibmad.so

[root at amarel1 ~]# ssh slepner021 find /usr/lib64 -xdev -name "libibmad*"
/usr/lib64/libibmad.so.5.5.0
/usr/lib64/libibmad.so.5


[root at amarel1 ~]# rpm -qa | grep infiniband-diags

infiniband-diags-devel-2.1.0-1.el7.x86_64

infiniband-diags-2.1.0-1.el7.x86_64


[root at amarel1 ~]# ssh slepner021 rpm -qa | grep infiniband-diags

infiniband-diags-2.1.0-1.el7.x86_64


This is a pretty common thing we do -- have a more complete set of
development libraries, etc., on our login nodes than on the compute
nodes. I'm kind of surprised to see that this sort of packaging is
normal -- that the -devel package contains the unversioned *.so link
to the library in the non-devel package. I never noticed. I've never
hit this problem before with other software, though.
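
The check above can be sketched as a small shell function, in case it is
useful for comparing nodes. check_unversioned_so is a hypothetical helper
name of my own, not something that ships with MVAPICH2 or
infiniband-diags; it only reports whether the unversioned lib<name>.so
link exists in a directory and lists the versioned files that do.

```shell
#!/bin/sh
# Hypothetical helper (not part of MVAPICH2): report whether a library
# directory holds the unversioned lib<name>.so link that a
# dlopen("lib<name>.so") call looks for, and list versioned candidates.
check_unversioned_so() {
    libdir=$1
    name=$2
    if [ -e "$libdir/lib$name.so" ]; then
        echo "present: $libdir/lib$name.so"
        return 0
    fi
    echo "missing: $libdir/lib$name.so"
    # Versioned files such as libibmad.so.5 may still be there.
    for f in "$libdir/lib$name.so."*; do
        [ -e "$f" ] && echo "candidate: $f"
    done
    return 1
}
```

Running "check_unversioned_so /usr/lib64 ibmad" on the login node and on
a compute node should show the same difference as the find output above.
If the unversioned link is missing, installing infiniband-diags-devel on
the compute nodes would supply it; pointing MV2_LIBIBMAD_PATH at
/usr/lib64/libibmad.so.5 might also work, but that is an untested
assumption that the variable accepts a versioned file name.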

On 7/11/22 16:32, Shineman, Nat wrote:
> Hi Ryan,
> 
> This is interesting. You are correct that we no longer link directly to
> the ib libraries; this is to allow MVAPICH2 to run on SMP-only machines
> without needing to install the ib libraries. Instead, we try to 
> dynamically open them as needed from within the library at runtime. The 
> environment variable provided in the error message is there only as a 
> fallback and should only be necessary if the library is not available on 
> the standard |LD_LIBRARY_PATH|. However, it looks like yours should be
> available on there from |/usr/lib64|; is there any chance that the
> library is on a different path on the compute nodes than it is on the 
> head node? I will try to reproduce this and see if I can figure out why 
> it would be failing to open your libibmad.so.
> 
> To go back to using the older linking process, please try adding
> |--disable-ibv-dlopen| to your configure line. Can you try that and let
> us know if it works for you?
> 
> Thanks,
> Nat
> 
> 
> ------------------------------------------------------------------------
> *From:* Mvapich-discuss 
> <mvapich-discuss-bounces+shineman.5=osu.edu at lists.osu.edu> on behalf of 
> Ryan Novosielski via Mvapich-discuss <mvapich-discuss at lists.osu.edu>
> *Sent:* Monday, July 11, 2022 11:36
> *To:* mvapich-discuss at lists.osu.edu <mvapich-discuss at lists.osu.edu>
> *Subject:* [Mvapich-discuss] MVAPICH2 2.3.7-1 (and 2.3.6) 
> "mv2_mad_dlopen_init" re: "Error opening libibmad.so: libibmad.so", GCC 
> 10.4, CentOS 7.x
> 
> 
> Hi there,
> 
> I'm getting error messages when running an MPI job with SLURM (18.08)
> using MVAPICH2, I assume, post 2.3.5, when the following change was made:
> 
> NEW Remove dependency on underlying libibverbs, libibmad, libibumad, and
> librdmacm libraries using dlopen
> 
> Here's what I'm seeing:
> 
> [novosirj at amarel-test2 mpihello]$ srun --mpi=pmi2 -n 4
> ./mpi_hello_world.gcc-10.4.mvapich2-2.3.7-1
> srun: job 20824691 queued and waiting for resources
> srun: job 20824691 has been allocated resources
> Error opening libibmad.so: libibmad.so: cannot open shared object file:
> No such file or directory.
> mv2_mad_dlopen_init returned -1
> Error opening libibmad.so: libibmad.so: cannot open shared object file:
> No such file or directory.
> mv2_mad_dlopen_init returned -1
> Error opening libibmad.so: libibmad.so: cannot open shared object file:
> No such file or directory.
> mv2_mad_dlopen_init returned -1
> Error opening libibmad.so: libibmad.so: cannot open shared object file:
> No such file or directory.
> mv2_mad_dlopen_init returned -1
> Hello world from processor slepner021.amarel.rutgers.edu, rank 1 out of
> 4 processors
> Hello world from processor slepner021.amarel.rutgers.edu, rank 2 out of
> 4 processors
> Hello world from processor slepner021.amarel.rutgers.edu, rank 3 out of
> 4 processors
> Hello world from processor slepner009.amarel.rutgers.edu, rank 0 out of
> 4 processors
> 
> I don't see this on 2.3. MPI seems to be working, but I assume it's not
> using Infiniband?
> 
> The libraries do exist:
> 
> [novosirj at amarel-test2 mpihello]$ rpm -ql infiniband-diags | grep mad
> /usr/lib64/libibmad.so.5
> /usr/lib64/libibmad.so.5.5.0
> 
> And while I assume it's normal to not see libibmad/libibumad in ldd -v
> output anymore post 2.3.5 (and I don't), here's what I see on 2.3, just
> to give you an idea of how it used to work:
> 
> [novosirj at amarel-test2 mpihello]$ ldd -v
> mpi_hello_world.gcc-10.4.mvapich2-2.3.7-1 | head -50
> 
>            linux-vdso.so.1 =>  (0x00007fff07b0c000)
> 
>            libmpi.so.12 =>
> /opt/sw/packages/gcc-4_8/mvapich2/2.3/lib/libmpi.so.12 (0x00007f36e87d9000)
>            libc.so.6 => /lib64/libc.so.6 (0x00007f36e840b000)
>            libgfortran.so.3 => /lib64/libgfortran.so.3 (0x00007f36e80e9000)
>            libm.so.6 => /lib64/libm.so.6 (0x00007f36e7de7000)
>            libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f36e7bdb000)
>            libxml2.so.2 => /lib64/libxml2.so.2 (0x00007f36e7871000)
>            libibmad.so.5 => /lib64/libibmad.so.5 (0x00007f36e7656000)
>            librdmacm.so.1 => /lib64/librdmacm.so.1 (0x00007f36e743f000)
>            libibumad.so.3 => /lib64/libibumad.so.3 (0x00007f36e7236000)
>            libibverbs.so.1 => /lib64/libibverbs.so.1 (0x00007f36e701d000)
>            libdl.so.2 => /lib64/libdl.so.2 (0x00007f36e6e19000)
>            librt.so.1 => /lib64/librt.so.1 (0x00007f36e6c11000)
>            libpmi2.so.0 => /lib64/libpmi2.so.0 (0x00007f36e69f9000)
>            libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f36e67dd000)
>            libgcc_s.so.1 => /opt/sw/packages/gcc/10.4/lib64/libgcc_s.so.1
> (0x00007f36e65c5000)
>            libquadmath.so.0 =>
> /opt/sw/packages/gcc/10.4/lib64/libquadmath.so.0 (0x00007f36e637e000)
>            /lib64/ld-linux-x86-64.so.2 (0x00007f36e8f40000)
>            libz.so.1 => /lib64/libz.so.1 (0x00007f36e6168000)
>            liblzma.so.5 => /lib64/liblzma.so.5 (0x00007f36e5f42000)
>            libosmcomp.so.4 => /lib64/libosmcomp.so.4 (0x00007f36e5d33000)
>            libnl-route-3.so.200 => /lib64/libnl-route-3.so.200
> (0x00007f36e5ac6000)
>            libnl-3.so.200 => /lib64/libnl-3.so.200 (0x00007f36e58a5000)
> 
> What can/should I do about this?
> Sometimes I see (not clear what conditions trigger it, but I have at
> least one set of output running one of the OSU benchmarks):
> 
> Please retry with MV2_LIBIBMAD_PATH=<path/to/libibmad.so>
> 
> It seems like what's suggested in the error message is not a great
> idea/this should be dealt with at compile time.
> 
> This is my build script; relatively uncomplicated:
> 
> [novosirj at amarel-test2 build]$ more
> ~/src/build-mvapich2-2.3.7-1-gcc-10.4.sh
> #!/bin/sh
> 
> module purge
> module load gcc/10.4
> module list
> 
> export FFLAGS="-fallow-argument-mismatch"
> 
> ../mvapich2-2.3.7-1/configure --with-pmi=pmi2 --with-pm=slurm
> --prefix=/opt/sw/packages/gcc-10/mvapich2/2.3.7-1 && \
>           make -j32 && make check && make install
> 
> And the configure process doesn't seem to point out anything amiss:
> 
> checking for the InfiniBand includes path... default
> checking for the InfiniBand library path... default
> checking for library containing shm_open... -lrt
> checking infiniband/verbs.h usability... yes
> checking infiniband/verbs.h presence... yes
> checking for infiniband/verbs.h... yes
> configure: checking checking for InfiniBand umad installation...
> checking infiniband/umad.h usability... yes
> checking infiniband/umad.h presence... yes
> checking for infiniband/umad.h... yes
> configure: InfiniBand libumad found
> checking whether to enable hybrid communication channel... yes
> configure: checking for RDMA CM support...
> checking rdma/rdma_cma.h usability... yes
> checking rdma/rdma_cma.h presence... yes
> checking for rdma/rdma_cma.h... yes
> configure: RDMA CM support enabled
> configure: checking for hardware multicast support...
> checking infiniband/mad.h usability... yes
> checking infiniband/mad.h presence... yes
> checking for infiniband/mad.h... yes
> 
> Thanks!
> 
> -- 
> #BlackLivesMatter
> ____
>    || \\UTGERS,     |----------------------*O*------------------------
>    ||_// the State  |    Ryan Novosielski - novosirj at rutgers.edu
>    || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
>    ||  \\    of NJ  | Office of Advanced Res. Comp. - MSB C630, Newark
> _______________________________________________
> Mvapich-discuss mailing list
> Mvapich-discuss at lists.osu.edu
> https://lists.osu.edu/mailman/listinfo/mvapich-discuss 
> <https://lists.osu.edu/mailman/listinfo/mvapich-discuss>

-- 
#BlackLivesMatter
____
  || \\UTGERS,     |----------------------*O*------------------------
  ||_// the State  |    Ryan Novosielski - novosirj at rutgers.edu
  || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
  ||  \\    of NJ  | Office of Advanced Res. Comp. - MSB C630, Newark
       `'


