[mvapich-discuss] Re: [openfabrics-ewg] Announcing the release of MVAPICH2 0.9.8 with Checkpoint/Restart, iWARP, RDMA CM-based connection manageme

Shaun Rowland rowland at cse.ohio-state.edu
Wed Nov 15 17:25:23 EST 2006


Sundeep Narravula wrote:
> Hi David,
>   Can you please perform mpdallexit and mpdcleanup before retrying?
> Thanks,
>   --Sundeep.
> 
> On Tue, 14 Nov 2006, david elsen wrote:
> 
>> Hi Sundeep,
>>
>> I see the following error messages:
>>
>> [root at ammasso1 ~]# mpdboot
>> /usr/local/mvapich2/bin/mpdroot: error while loading shared libraries: librdmacm.so: cannot open shared object file: No such file or directory
>> mpdboot_ammasso1.qlogic.org (handle_mpd_output 359): failed to ping mpd on ammasso1.qlogic.org; recvd output={}
>>
>> [root at ammasso1 ~]# /usr/local/mvapich2/bin/mpdroot: error while loading shared libraries: librdmacm.so: cannot open shared object file: No such file or directory

Hi David. I am perhaps jumping into the middle of this conversation a
bit oddly :-) This appears to be a shared library installation problem.
When shared libraries cannot be found, the cause is usually one of the
following (I can't think of any others right now, so I would treat this
as pretty much the full list):

1) The libraries are not installed.
2) The search path is not specified (LD_LIBRARY_PATH or ld.so.conf).
3) The shared library installation is not correct. This is the hardest
to figure out and the rarest problem.

Since shared libraries are handled by the system's runtime loader and
configuration, this is mostly outside the control of the MVAPICH2
package itself. There are some things you can do when building, such as
hard coding paths to libraries into the resulting binaries.
However, this seems to be a case of #3 in my opinion.
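
Possibilities 1 and 2 can be ruled out by hand before digging into 3.
What follows is only a rough sketch, not anything shipped with MVAPICH2:
find_lib_in_path is a made-up helper name, and it only walks a
colon-separated directory list such as LD_LIBRARY_PATH. It does not
consult ld.so.conf or the loader cache.

```shell
# Hypothetical helper: print each path in a colon-separated directory
# list that contains a file with exactly the given name.
# Usage: find_lib_in_path NAME DIRLIST
find_lib_in_path() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $2; do
        # Skip empty list entries; report directories that hold NAME.
        if [ -n "$dir" ] && [ -e "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
    IFS=$old_ifs
}
```

Something like find_lib_in_path librdmacm.so "$LD_LIBRARY_PATH" prints
nothing when the file is missing from every listed directory, which
points at 1 or 2. If it does print a path and the loader still
complains, 3 becomes the more likely suspect.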

>> Please see the following for the environment variables settings. I highlighted the
>> LD_LIBRARY_PATH there.

<snip>

>> LD_LIBRARY_PATH=/usr/local/lib

Here it is clear the search path is being set. Even when using mpiexec,
this environment will be carried over to the executing processes so they
will see it too, though obviously you are not getting that far.

>> Please see the following for the files in /usr/local/lib directory:
>>
>> [root at ammasso1 lib]# pwd
>> /usr/local/lib
>> [root at ammasso1 lib]#
>> [root at ammasso1 lib]# ls -la
>> total 1800
>> drwxr-xr-x  3 root root   4096 Nov  9 12:38 .
>> drwxr-xr-x 13 root root   4096 Nov 13 19:51 ..
>> drwxr-xr-x  2 root root   4096 Nov  8 18:38 infiniband
>> -rwxr-xr-x  1 root root    773 Nov  8 18:38 libibat.la
>> -rwxr-xr-x  1 root root  28419 Nov  8 18:38 libibat.so
>> -rw-r--r--  1 root root  38662 Nov  8 18:35 libibcommon.a
>> -rwxr-xr-x  1 root root    820 Nov  8 18:35 libibcommon.la
>> lrwxrwxrwx  1 root root     20 Nov  8 18:35 libibcommon.so -> libibcommon.so.1.0.0
>> lrwxrwxrwx  1 root root     20 Nov  8 18:35 libibcommon.so.1 -> libibcommon.so.1.0.0
>> -rwxr-xr-x  1 root root  29138 Nov  8 18:35 libibcommon.so.1.0.0
>> -rw-r--r--  1 root root 172056 Nov  8 18:36 libibmad.a
>> -rwxr-xr-x  1 root root    857 Nov  8 18:36 libibmad.la
>> lrwxrwxrwx  1 root root     17 Nov  8 18:36 libibmad.so -> libibmad.so.1.0.0
>> lrwxrwxrwx  1 root root     17 Nov  8 18:36 libibmad.so.1 -> libibmad.so.1.0.0
>> -rwxr-xr-x  1 root root 125987 Nov  8 18:36 libibmad.so.1.0.0
>> -rw-r--r--  1 root root  45358 Nov  8 18:35 libibumad.a
>> -rwxr-xr-x  1 root root    836 Nov  8 18:35 libibumad.la
>> lrwxrwxrwx  1 root root     18 Nov  8 18:35 libibumad.so -> libibumad.so.1.0.0
>> lrwxrwxrwx  1 root root     18 Nov  8 18:35 libibumad.so.1 -> libibumad.so.1.0.0
>> -rwxr-xr-x  1 root root  44419 Nov  8 18:35 libibumad.so.1.0.0
>> -rw-r--r--  1 root root 180672 Nov  9 12:38 libibverbs.a
>> -rwxr-xr-x  1 root root    828 Nov  9 12:38 libibverbs.la
>> lrwxrwxrwx  1 root root     19 Nov  9 12:38 libibverbs.so -> libibverbs.so.2.0.0
>> lrwxrwxrwx  1 root root     19 Nov  9 12:38 libibverbs.so.2 -> libibverbs.so.2.0.0
>> -rwxr-xr-x  1 root root 123655 Nov  9 12:38 libibverbs.so.2.0.0
>> lrwxrwxrwx  1 root root     18 Nov  8 18:37 libopensm-1.2.0-rc6.so -> libopensm.so.1.0.0
>> -rw-r--r--  1 root root 130594 Nov  8 18:37 libopensm.a
>> -rwxr-xr-x  1 root root    806 Nov  8 18:37 libopensm.la
>> lrwxrwxrwx  1 root root     18 Nov  8 18:37 libopensm.so -> libopensm.so.1.0.0
>> lrwxrwxrwx  1 root root     18 Nov  8 18:37 libopensm.so.1 -> libopensm.so.1.0.0
>> -rwxr-xr-x  1 root root 121937 Nov  8 18:37 libopensm.so.1.0.0
>> lrwxrwxrwx  1 root root     19 Nov  8 18:37 libosmcomp-1.2.0-rc6.so -> libosmcomp.so.1.0.1
>> -rw-r--r--  1 root root 242594 Nov  8 18:37 libosmcomp.a
>> -rwxr-xr-x  1 root root    823 Nov  8 18:37 libosmcomp.la
>> lrwxrwxrwx  1 root root     19 Nov  8 18:37 libosmcomp.so -> libosmcomp.so.1.0.1
>> lrwxrwxrwx  1 root root     19 Nov  8 18:37 libosmcomp.so.1 -> libosmcomp.so.1.0.1
>> -rwxr-xr-x  1 root root 194469 Nov  8 18:37 libosmcomp.so.1.0.1
>> lrwxrwxrwx  1 root root     21 Nov  8 18:37 libosmvendor-1.2.0-rc6.so -> libosmvendor.so.1.0.0
>> -rw-r--r--  1 root root  86786 Nov  8 18:37 libosmvendor.a
>> -rwxr-xr-x  1 root root    885 Nov  8 18:37 libosmvendor.la
>> lrwxrwxrwx  1 root root     21 Nov  8 18:04 libosmvendor_openib.so -> libosmvendor.so.1.0.0
>> lrwxrwxrwx  1 root root     21 Nov  8 18:37 libosmvendor.so -> libosmvendor.so.1.0.0
>> lrwxrwxrwx  1 root root     21 Nov  8 18:37 libosmvendor.so.1 -> libosmvendor.so.1.0.0
>> -rwxr-xr-x  1 root root  83296 Nov  8 18:37 libosmvendor.so.1.0.0
>> -rwxr-xr-x  1 root root    837 Nov  8 19:04 librdmacm.la
>> -rwxr-xr-x  1 root root  54472 Nov  8 19:04 librdmacm.so

This is why I believe it is a shared library installation problem.
Notice how the other shared libraries are symlinked. Let's take libibverbs:

libibverbs.so -> libibverbs.so.2.0.0
libibverbs.so.2 -> libibverbs.so.2.0.0
libibverbs.so.2.0.0

The actual shared library file is libibverbs.so.2.0.0. There is a
series of symlinks with progressively less specific version numbers,
but they all point to libibverbs.so.2.0.0. This is what is normally
there. You will see the same thing in /usr/lib if you look at the
shared libraries.
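
To see this layout for a single library without reading a full
"ls -la", a small filter helps. This is a sketch under assumptions:
show_links is a hypothetical name, and it relies on the readlink
utility being installed.

```shell
# Hypothetical helper: for every entry in DIR whose name starts with
# PREFIX, print "name -> target" for a symlink or just "name" otherwise.
# Usage: show_links DIR PREFIX
show_links() {
    for f in "$1"/"$2"*; do
        # An unmatched glob stays literal; skip it.
        [ -e "$f" ] || [ -L "$f" ] || continue
        if [ -L "$f" ]; then
            printf '%s -> %s\n' "${f##*/}" "$(readlink "$f")"
        else
            printf '%s\n' "${f##*/}"
        fi
    done
}
```

Running show_links /usr/local/lib libibverbs.so should reproduce the
three lines above, while a library that was installed without its
symlinks shows up as plain file names only.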

On our systems with OFED 1.1, the following is shown for librdmacm:

librdmacm.so -> librdmacm.so.0.9.0
librdmacm.so.0.9.0

which is different than on your system. I have seen problems when the
symlinks are strange on systems (this is very rare in my experience).

What does "ldd /usr/local/mvapich2/bin/mpdroot" show? In addition, what
does "objdump -x /usr/local/mvapich2/bin/mpdroot" show? For example, for
/bin/ls on one of our systems, the following details are shown:

[rowland at e4-oib ~]$ ldd /bin/ls
         librt.so.1 => /lib64/tls/librt.so.1 (0x0000003bf6200000)
         libacl.so.1 => /lib64/libacl.so.1 (0x0000003bf2000000)
         libselinux.so.1 => /lib64/libselinux.so.1 (0x0000003bf2c00000)
         libc.so.6 => /lib64/tls/libc.so.6 (0x0000003bf1900000)
         libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000003bf2400000)
         /lib64/ld-linux-x86-64.so.2 (0x0000003bf1700000)
         libattr.so.1 => /lib64/libattr.so.1 (0x0000003bf3c00000)

In this case, I know that /bin/ls specifically will cause the runtime
linker to look for exactly libacl.so.1 in the search path. The objdump
output shows the following in the "Dynamic" section:

Dynamic Section:
   NEEDED      librt.so.1
   NEEDED      libacl.so.1
   NEEDED      libselinux.so.1
   NEEDED      libc.so.6

Again, this shows it needs libacl.so.1, so the runtime linker should be
looking for that library in the search path. The library is in /lib64:

[rowland at e4-oib ~]$ ls -l /lib64/libacl*
lrwxrwxrwx  1 root root    11 Mar 28  2006 /lib64/libacl.so -> libacl.so.1
lrwxrwxrwx  1 root root    15 Mar 28  2006 /lib64/libacl.so.1 -> libacl.so.1.1.0
-rwxr-xr-x  1 root root 28688 Sep 16  2004 /lib64/libacl.so.1.1.0

Anything that libacl.so.1 is symlinked to should satisfy the runtime
linker. This allows one to upgrade the libacl.so.1.1.0 file to something
else, keeping the symlink, and it will still work (if the upgrade is
compatible). This is a shared library feature.
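
The upgrade mechanism can be tried out safely in a scratch directory
with empty placeholder files. This is only an illustration: repoint_link
and the libdemo names are made up, and nothing here touches a real
library directory unless you point it at one.

```shell
# Hypothetical helper: point the symlink DIR/LINK at TARGET (a file name
# in the same directory), replacing an existing symlink of that name.
# Usage: repoint_link DIR LINK TARGET
repoint_link() {
    # Never clobber a real file that happens to carry the link name.
    if [ -e "$1/$2" ] && [ ! -L "$1/$2" ]; then
        echo "refusing to replace regular file $1/$2" >&2
        return 1
    fi
    rm -f "$1/$2"
    ln -s "$3" "$1/$2"
}
```

For example, after

    d=$(mktemp -d)
    touch "$d/libdemo.so.1.1.0" "$d/libdemo.so.1.2.0"
    repoint_link "$d" libdemo.so.1 libdemo.so.1.1.0
    repoint_link "$d" libdemo.so.1 libdemo.so.1.2.0

the name libdemo.so.1 stays the same while the file behind it changes,
which is all a binary that asks for libdemo.so.1 would notice.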

These specifications in the binary are the result of the build, which is
system dependent and also depends on LD_LIBRARY_PATH and the -L
specifications when building (and you can hard code the library
path with -rpath and on some systems -R). Doing "ldd" and "objdump -x"
on a resulting binary from a build should show exactly what will be
searched for. In your case, it should show something that is not in
/usr/local/lib here...
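
Both outputs can be long, so here is a rough sketch of two filters that
keep only the interesting lines. The function names are made up, and
missing_libs assumes the usual glibc ldd wording "=> not found".

```shell
# Hypothetical filters; both read command output on standard input.

# Print the library names from the NEEDED lines of "objdump -p" or
# "objdump -x" output.
needed_libs() {
    awk '$1 == "NEEDED" { print $2 }'
}

# Print the names that "ldd" reported as "not found".
missing_libs() {
    awk '/=> not found/ { print $1 }'
}
```

Typical use would be

    objdump -p /usr/local/mvapich2/bin/mpdroot | needed_libs
    ldd /usr/local/mvapich2/bin/mpdroot | missing_libs

The first lists the exact names the binary asks for; the second lists
which of them the loader could not resolve with the current search path.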

It seems odd that your librdmacm file is not symlinked as ours is. Can
you try the commands I suggested above and see what they say? That'd be
helpful. In the end, it looks like a possible shared library
installation issue, since the other two factors I mentioned seem to be
in the clear so far. Without these details it is hard to say. If it is
looking for something that is not symlinked properly in /usr/local/lib
for you, then it seems the library would need to be reinstalled fully,
or it might be possible to symlink it correctly by hand (though that
should have been done during the library install, so it would seem to
be an initial installation issue).
-- 
Shaun Rowland	rowland at cse.ohio-state.edu
http://www.cse.ohio-state.edu/~rowland/

