[mvapich-discuss] Re: [openfabrics-ewg] Announcing the release
of MVAPICH2 0.9.8 with Checkpoint/Restart, iWARP,
RDMA CM-based connection manageme
Shaun Rowland
rowland at cse.ohio-state.edu
Wed Nov 15 17:25:23 EST 2006
Sundeep Narravula wrote:
> Hi David,
> Can you please perform mpdallexit and mpdcleanup before retrying?
> Thanks,
> --Sundeep.
>
> On Tue, 14 Nov 2006, david elsen wrote:
>
>> Hi Sundeep,
>>
>> I see the following error messages:
>>
>> [root at ammasso1 ~]# mpdboot
>> /usr/local/mvapich2/bin/mpdroot: error while loading shared libraries: librdmacm.so: cannot open shared object file: No such file or directory
>> mpdboot_ammasso1.qlogic.org (handle_mpd_output 359): failed to ping mpd on ammasso1.qlogic.org; recvd output={}
>>
>> [root at ammasso1 ~]# /usr/local/mvapich2/bin/mpdroot: error while loading shared libraries: librdmacm.so: cannot open shared object file: No such file or directory
Hi David. I am jumping into the middle of this conversation, which is
perhaps a bit odd :-) This appears to be a shared library installation
problem. When shared libraries cannot be found, the cause is usually one
of the following (I can't think of any others right now, so I would
consider this pretty much every possibility):
1) The libraries are not installed.
2) The search path is not specified (LD_LIBRARY_PATH or ld.so.conf).
3) The shared library installation is not correct. This is the hardest
to figure out and the rarest problem.
Since shared libraries are handled by the system's runtime loader and
its configuration, this is mostly outside the control of the MVAPICH2
package itself. There are some things you can do when building, such as
hard coding library paths into the resulting binaries. However, this
seems to be a case of #3 in my opinion.
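As a rough way to check the first two possibilities at once, a small
helper along these lines can walk a colon-separated search path and
report every directory that holds a given library file. This is only a
sketch: find_lib is a name made up for illustration, and it only looks
at the directories passed to it, not at ld.so.conf or the loader cache.

```shell
# find_lib NAME PATHLIST
# Print the full path of every file called NAME found in the
# colon-separated PATHLIST (same format as LD_LIBRARY_PATH).
# Exit status is 0 if at least one match was found, 1 otherwise.
# Hypothetical helper: it does not consult ld.so.conf or the ld.so cache.
# The body runs in a subshell so the IFS and globbing changes do not leak.
find_lib() (
    IFS=:
    set -f                      # do not expand wildcards in path entries
    found=1
    for dir in $2; do
        # -e follows symlinks, so a dangling symlink counts as missing.
        if [ -n "$dir" ] && [ -e "$dir/$1" ]; then
            printf '%s\n' "$dir/$1"
            found=0
        fi
    done
    [ "$found" -eq 0 ]
)

# Example:
#   find_lib librdmacm.so "$LD_LIBRARY_PATH"
```

If this prints a path and the loader still cannot open the library, the
file exists and the search path names its directory, which leaves the
third possibility.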
>> Please see the following for the environment variables settings. I highlighted the
>> LD_LIBRARY_PATH there.
<snip>
>> LD_LIBRARY_PATH=/usr/local/lib
Here it is clear the search path is being set. Even when using mpiexec,
this environment will be carried over to the executing processes so they
also will see it - though you are not getting that far obviously.
>> Please see the following for the files in /usr/local/lib directory:
>>
>> [root at ammasso1 lib]# pwd
>> /usr/local/lib
>> [root at ammasso1 lib]#
>> [root at ammasso1 lib]# ls -la
>> total 1800
>> drwxr-xr-x 3 root root 4096 Nov 9 12:38 .
>> drwxr-xr-x 13 root root 4096 Nov 13 19:51 ..
>> drwxr-xr-x 2 root root 4096 Nov 8 18:38 infiniband
>> -rwxr-xr-x 1 root root 773 Nov 8 18:38 libibat.la
>> -rwxr-xr-x 1 root root 28419 Nov 8 18:38 libibat.so
>> -rw-r--r-- 1 root root 38662 Nov 8 18:35 libibcommon.a
>> -rwxr-xr-x 1 root root 820 Nov 8 18:35 libibcommon.la
>> lrwxrwxrwx 1 root root 20 Nov 8 18:35 libibcommon.so -> libibcommon.so.1.0.0
>> lrwxrwxrwx 1 root root 20 Nov 8 18:35 libibcommon.so.1 -> libibcommon.so.1.0.0
>> -rwxr-xr-x 1 root root 29138 Nov 8 18:35 libibcommon.so.1.0.0
>> -rw-r--r-- 1 root root 172056 Nov 8 18:36 libibmad.a
>> -rwxr-xr-x 1 root root 857 Nov 8 18:36 libibmad.la
>> lrwxrwxrwx 1 root root 17 Nov 8 18:36 libibmad.so -> libibmad.so.1.0.0
>> lrwxrwxrwx 1 root root 17 Nov 8 18:36 libibmad.so.1 -> libibmad.so.1.0.0
>> -rwxr-xr-x 1 root root 125987 Nov 8 18:36 libibmad.so.1.0.0
>> -rw-r--r-- 1 root root 45358 Nov 8 18:35 libibumad.a
>> -rwxr-xr-x 1 root root 836 Nov 8 18:35 libibumad.la
>> lrwxrwxrwx 1 root root 18 Nov 8 18:35 libibumad.so -> libibumad.so.1.0.0
>> lrwxrwxrwx 1 root root 18 Nov 8 18:35 libibumad.so.1 -> libibumad.so.1.0.0
>> -rwxr-xr-x 1 root root 44419 Nov 8 18:35 libibumad.so.1.0.0
>> -rw-r--r-- 1 root root 180672 Nov 9 12:38 libibverbs.a
>> -rwxr-xr-x 1 root root 828 Nov 9 12:38 libibverbs.la
>> lrwxrwxrwx 1 root root 19 Nov 9 12:38 libibverbs.so -> libibverbs.so.2.0.0
>> lrwxrwxrwx 1 root root 19 Nov 9 12:38 libibverbs.so.2 -> libibverbs.so.2.0.0
>> -rwxr-xr-x 1 root root 123655 Nov 9 12:38 libibverbs.so.2.0.0
>> lrwxrwxrwx 1 root root 18 Nov 8 18:37 libopensm-1.2.0-rc6.so -> libopensm.so.1.0.0
>> -rw-r--r-- 1 root root 130594 Nov 8 18:37 libopensm.a
>> -rwxr-xr-x 1 root root 806 Nov 8 18:37 libopensm.la
>> lrwxrwxrwx 1 root root 18 Nov 8 18:37 libopensm.so -> libopensm.so.1.0.0
>> lrwxrwxrwx 1 root root 18 Nov 8 18:37 libopensm.so.1 -> libopensm.so.1.0.0
>> -rwxr-xr-x 1 root root 121937 Nov 8 18:37 libopensm.so.1.0.0
>> lrwxrwxrwx 1 root root 19 Nov 8 18:37 libosmcomp-1.2.0-rc6.so -> libosmcomp.so.1.0.1
>> -rw-r--r-- 1 root root 242594 Nov 8 18:37 libosmcomp.a
>> -rwxr-xr-x 1 root root 823 Nov 8 18:37 libosmcomp.la
>> lrwxrwxrwx 1 root root 19 Nov 8 18:37 libosmcomp.so -> libosmcomp.so.1.0.1
>> lrwxrwxrwx 1 root root 19 Nov 8 18:37 libosmcomp.so.1 -> libosmcomp.so.1.0.1
>> -rwxr-xr-x 1 root root 194469 Nov 8 18:37 libosmcomp.so.1.0.1
>> lrwxrwxrwx 1 root root 21 Nov 8 18:37 libosmvendor-1.2.0-rc6.so -> libosmvendor.so.1.0.0
>> -rw-r--r-- 1 root root 86786 Nov 8 18:37 libosmvendor.a
>> -rwxr-xr-x 1 root root 885 Nov 8 18:37 libosmvendor.la
>> lrwxrwxrwx 1 root root 21 Nov 8 18:04 libosmvendor_openib.so -> libosmvendor.so.1.0.0
>> lrwxrwxrwx 1 root root 21 Nov 8 18:37 libosmvendor.so -> libosmvendor.so.1.0.0
>> lrwxrwxrwx 1 root root 21 Nov 8 18:37 libosmvendor.so.1 -> libosmvendor.so.1.0.0
>> -rwxr-xr-x 1 root root 83296 Nov 8 18:37 libosmvendor.so.1.0.0
>> -rwxr-xr-x 1 root root 837 Nov 8 19:04 librdmacm.la
>> -rwxr-xr-x 1 root root 54472 Nov 8 19:04 librdmacm.so
This is why I believe it is a shared library installation problem.
Notice how the other shared libraries are symlinked. Let's take libibverbs:
libibverbs.so -> libibverbs.so.2.0.0
libibverbs.so.2 -> libibverbs.so.2.0.0
libibverbs.so.2.0.0
The actual shared library file is libibverbs.so.2.0.0. There is a
series of symlinks with progressively more general version numbers in
their names, but they all link to libibverbs.so.2.0.0. This is what is
normally there. You will see the same thing in /usr/lib if you look at
the shared libraries.
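To make that layout concrete, here is a small sketch that rebuilds the
same kind of chain in a scratch directory and resolves each name. The
library name libdemo and the helper names make_chain and resolve_lib
are made up for illustration, and "readlink -f" is assumed to be
available (it is on GNU/Linux systems).

```shell
# make_chain DIR
# Build the usual three-name layout for a made-up library "libdemo"
# inside DIR: one real file plus two symlinks pointing at it.
make_chain() {
    : > "$1/libdemo.so.2.0.0"                   # the real file (empty placeholder)
    ln -s libdemo.so.2.0.0 "$1/libdemo.so.2"    # major-version name
    ln -s libdemo.so.2.0.0 "$1/libdemo.so"      # unversioned name
}

# resolve_lib PATH
# Print the base name of the file that PATH finally points at.
resolve_lib() {
    basename "$(readlink -f "$1")"
}

# Example:
#   dir=$(mktemp -d); make_chain "$dir"
#   resolve_lib "$dir/libdemo.so"
#   resolve_lib "$dir/libdemo.so.2"
```

Both names should resolve to the same fully versioned file, which is the
pattern visible for libibverbs in the listing above.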
On our systems with OFED 1.1, the following is shown for librdmacm:
librdmacm.so -> librdmacm.so.0.9.0
librdmacm.so.0.9.0
which is different from what is on your system. I have seen problems when
the symlinks are strange on systems (this is very rare in my experience).
What does "ldd /usr/local/mvapich2/bin/mpdroot" show? In addition, what
does "objdump -x /usr/local/mvapich2/bin/mpdroot" show? For example, for
/bin/ls on one of our systems, the following details are shown:
[rowland at e4-oib ~]$ ldd /bin/ls
librt.so.1 => /lib64/tls/librt.so.1 (0x0000003bf6200000)
libacl.so.1 => /lib64/libacl.so.1 (0x0000003bf2000000)
libselinux.so.1 => /lib64/libselinux.so.1 (0x0000003bf2c00000)
libc.so.6 => /lib64/tls/libc.so.6 (0x0000003bf1900000)
libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000003bf2400000)
/lib64/ld-linux-x86-64.so.2 (0x0000003bf1700000)
libattr.so.1 => /lib64/libattr.so.1 (0x0000003bf3c00000)
In this case, I know that /bin/ls specifically will cause the runtime
linker to look for libacl.so.1 in the search path exactly. The objdump
output shows the following in the "Dynamic" section:
Dynamic Section:
NEEDED librt.so.1
NEEDED libacl.so.1
NEEDED libselinux.so.1
NEEDED libc.so.6
Again, this shows it needs libacl.so.1, so the runtime linker should be
looking for that library in the search path. The library is in /lib64:
[rowland at e4-oib ~]$ ls -l /lib64/libacl*
lrwxrwxrwx 1 root root 11 Mar 28 2006 /lib64/libacl.so -> libacl.so.1
lrwxrwxrwx 1 root root 15 Mar 28 2006 /lib64/libacl.so.1 -> libacl.so.1.1.0
-rwxr-xr-x 1 root root 28688 Sep 16 2004 /lib64/libacl.so.1.1.0
Anything that libacl.so.1 is symlinked to should satisfy the runtime
linker. This allows one to upgrade the libacl.so.1.1.0 file to something
else, keeping the symlink, and it still works (if the upgrade is
compatible). This is a shared library feature.
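The upgrade step described here can be sketched as repointing the
symlink at a newer file while the name the binary asks for stays the
same. upgrade_lib is a hypothetical helper for illustration only; on a
real system the package manager or ldconfig normally maintains these
links.

```shell
# upgrade_lib DIR LINKNAME NEWFILE
# Repoint the symlink DIR/LINKNAME at NEWFILE, which must already exist
# in DIR. Returns 1 and changes nothing if NEWFILE is missing, so the
# link is never left dangling.
# Hypothetical helper; it does not check that NEWFILE is compatible.
upgrade_lib() {
    [ -e "$1/$3" ] || return 1
    ln -sfn "$3" "$1/$2"
}

# Example, in a scratch directory with made-up file names:
#   upgrade_lib "$dir" libdemo.so.1 libdemo.so.1.2.0
```

Binaries that ask for the link name keep working without a rebuild,
provided the new file really is compatible.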
These specifications in the binary are the result of the build, which is
system dependent and also depends on LD_LIBRARY_PATH and the -L
specifications when building (and you can hard code the library
path with -rpath and on some systems -R). Doing "ldd" and "objdump -x"
on a resulting binary from a build should show exactly what will be
searched for. In your case, it should show something that is not in
/usr/local/lib here...
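Since the full "objdump -x" output is long, a small filter like the one
below can reduce it to just the library names the binary asks for. This
is a sketch: needed_libs is a made-up helper name, and it assumes the
NEEDED lines look like the ones in the /bin/ls example above.

```shell
# needed_libs
# Read "objdump -x" (or "objdump -p") output on stdin and print only the
# library names from the NEEDED lines, one per line.
# Hypothetical helper; it relies on the "NEEDED <name>" line format.
needed_libs() {
    awk '$1 == "NEEDED" { print $2 }'
}

# Example (binary path taken from this thread):
#   objdump -x /usr/local/mvapich2/bin/mpdroot | needed_libs
```

Each name printed is what the runtime loader will search for, so every
one of them should exist under that exact name somewhere in the search
path.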
It seems odd that your librdmacm file is not symlinked as ours is. Can
you try the commands I suggested above and see what they say? That would
be helpful. In the end, it looks like a possible shared library
installation issue, since the other two factors I mentioned seem to be in
the clear so far. Without these details it is hard to say. If the binary
is looking for something that is not symlinked properly in /usr/local/lib
on your system, then the library would probably need to be reinstalled
fully, or it might be possible to symlink it correctly by hand (though
that should have been done during the library install, so it would still
be an initial installation issue).
--
Shaun Rowland rowland at cse.ohio-state.edu
http://www.cse.ohio-state.edu/~rowland/