[mvapich-discuss] Problems installing mvapich2/2.3 with Slurm

Raghu Reddy raghu.reddy at noaa.gov
Wed Mar 13 10:40:30 EDT 2019


Hi Peter,

Thank you very much for this information!  It is very helpful, and we are
following up on it.

This is what we currently have:

tfe03.% rpm -qa | grep psm
infinipath-psm-3.3-26_g604758e_open.2.el7.x86_64
psmisc-22.20-15.el7.x86_64
libpsm2-devel-10.3.58-1.el7.x86_64
libpsm2-10.3.58-1.el7.x86_64
tfe03.%

We will try to install the "psm" libraries that you have mentioned and try
again.

What is strange is that we have an existing installation of an older version
of mvapich2, and that works fine.  The ldd output and a test run from that
installation are below:

Wed Mar 13 14:18:56 UTC 2019
ldd /scratch4/SYSADMIN/nesccmgmt/Raghu.Reddy/Slurm/bugs/packed-job-scalability-bug-6285/hello_mpi_c-intel-mvp2
        linux-vdso.so.1 =>  (0x00007fff74eb4000)
        libmpi.so.12 => /apps/mvapich2/2.1rc1-intel/lib/libmpi.so.12 (0x00002ad592d9e000)
        libm.so.6 => /lib64/libm.so.6 (0x00002ad59337a000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002ad59367c000)
        libc.so.6 => /lib64/libc.so.6 (0x00002ad593892000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002ad593c5f000)
        libnuma.so.1 => /lib64/libnuma.so.1 (0x00002ad593e63000)
        libxml2.so.2 => /lib64/libxml2.so.2 (0x00002ad59406f000)
        libibverbs.so.1 => /lib64/libibverbs.so.1 (0x00002ad5943d9000)
        libpsm_infinipath.so.1 => /lib64/libpsm_infinipath.so.1 (0x00002ad5945f0000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ad594846000)
        libifport.so.5 => /apps/intel/compilers_and_libraries_2018.1.163/linux/compiler/lib/intel64/libifport.so.5 (0x00002ad594a62000)
        libifcore.so.5 => /apps/intel/compilers_and_libraries_2018.1.163/linux/compiler/lib/intel64/libifcore.so.5 (0x00002ad594c91000)
        libimf.so => /apps/intel/compilers_and_libraries_2018.1.163/linux/compiler/lib/intel64/libimf.so (0x00002ad594fee000)
        libsvml.so => /apps/intel/compilers_and_libraries_2018.1.163/linux/compiler/lib/intel64/libsvml.so (0x00002ad59557c000)
        libintlc.so.5 => /apps/intel/compilers_and_libraries_2018.1.163/linux/compiler/lib/intel64/libintlc.so.5 (0x00002ad596c2f000)
        /lib64/ld-linux-x86-64.so.2 (0x00002ad592b7a000)
        libz.so.1 => /lib64/libz.so.1 (0x00002ad596e9c000)
        liblzma.so.5 => /lib64/liblzma.so.5 (0x00002ad5970b2000)
        libnl-route-3.so.200 => /lib64/libnl-route-3.so.200 (0x00002ad5972d8000)
        libnl-3.so.200 => /lib64/libnl-3.so.200 (0x00002ad597545000)
        libinfinipath.so.4 => /lib64/libinfinipath.so.4 (0x00002ad597766000)
        librt.so.1 => /lib64/librt.so.1 (0x00002ad597975000)
        libuuid.so.1 => /lib64/libuuid.so.1 (0x00002ad597b7d000)
set TT0=`perl -e 'print time()'`
perl -e print time()
mpirun -np 4 /scratch4/SYSADMIN/nesccmgmt/Raghu.Reddy/Slurm/bugs/packed-job-scalability-bug-6285/hello_mpi_c-intel-mvp2
Hello from rank 02 out of 4; procname = t0088, cpuid = 0
Hello from rank 03 out of 4; procname = t0088, cpuid = 1
Hello from rank 00 out of 4; procname = t0087, cpuid = 1
Hello from rank 01 out of 4; procname = t0087, cpuid = 0
set TT1=`perl -e 'print time()'`

But we need to install more recent versions and we are having problems with
that.
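
One quick way to compare the old and new installations is to check which PSM
library each binary actually resolves.  The following is only a sketch:
psm_flavor is a hypothetical helper name, and it assumes the PSM2 library
appears in ldd output as libpsm2.so, while the older PSM library appears as
libpsm_infinipath.so (as in the ldd output above).

```shell
# Hypothetical helper: classify ldd output read from stdin.
# Prints "psm", "psm2", "both", or "none".
psm_flavor() {
    awk '
        /libpsm2\.so/           { psm2 = 1 }   # PSM2 (assumed library name)
        /libpsm_infinipath\.so/ { psm  = 1 }   # older PSM, as seen above
        END {
            if (psm && psm2) print "both"
            else if (psm2)   print "psm2"
            else if (psm)    print "psm"
            else             print "none"
        }'
}
```

It could be used as `ldd ./hello_mpi_c-intel-mvp2 | psm_flavor`, run once
against a binary built with each installation.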

Thank you again for taking the time to look into this!

Raghu





-----Original Message-----
From: Peter Kjellström [mailto:cap at nsc.liu.se] 
Sent: Wednesday, March 13, 2019 9:22 AM
To: Raghu Reddy <raghu.reddy at noaa.gov>
Cc: 'Subramoni, Hari' <subramoni.1 at osu.edu>; 'Carlson, Timothy S'
<Timothy.Carlson at pnnl.gov>; 'mvapich-discuss at cse.ohio-state.edu'
<mvapich-discuss at mailman.cse.ohio-state.edu>; 'Brian Osmond'
<brian.osmond at noaa.gov>; 'Kyle Stern' <kstern at redlineperf.com>
Subject: Re: [mvapich-discuss] Problems installing mvapich2/2.3 with Slurm

On Tue, 12 Mar 2019 16:52:14 -0400
Raghu Reddy <raghu.reddy at noaa.gov> wrote:
...
> mpicc -o hello_c /tds_scratch3/SYSADMIN/nesccmgmt/Raghu.Reddy/Testsuite3/hello/hello_mpi_c.c
> 
> mpiexec -np 24 ./hello_c
> 
> s0014.110678hfi_wait_for_device: The /dev/hfi1_0 device failed to 
> appear after 15.0 seconds: Connection timed out

The above message shows that it is looking for an OPA (Omni-Path) device.
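
A rough way to see which kind of adapter a node exposes is to look at the
device nodes.  This is a minimal sketch, assuming OPA shows up as /dev/hfi1_0
(as in the message above) and truescale as /dev/ipath.  The /dev/ipath name
and the helper name fabric_hint are assumptions, so adjust for your system.

```shell
# Hypothetical helper: guess the fabric type from device nodes.
# Optional argument: directory to look in (defaults to /dev).
fabric_hint() {
    devdir=${1:-/dev}
    if [ -e "$devdir/hfi1_0" ]; then
        echo "opa"          # device named in the hfi_wait_for_device message
    elif [ -e "$devdir/ipath" ]; then
        echo "truescale"    # assumed device node name for PSM (not PSM2)
    else
        echo "unknown"
    fi
}
```

Running `fabric_hint` on a compute node (not only the login node) would show
whether the library is waiting for a device the node does not have.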

On a system with truescale (PSM) and the following relevant psm packages
installed:

 $ rpm -qa | grep psm
 infinipath-psm-devel-3.0.1-115.1015_open.2_nsc1.el6.x86_64
 psmisc-22.6-24.el6.x86_64
 infinipath-psm-3.0.1-115.1015_open.2_nsc1.el6.x86_64

 NOTE: not psm2


I did:

 module load buildenv-intel/2018u1
 wget http://.../mvapich2-2.3.1.tar.gz
 tar xf mvapich2-2.3.1.tar.gz
 cd mvapich2-2.3.1/
 ./configure --prefix=/home/cap/mpiinst/mvapich2-2.3.1_psm \
     --with-device=ch3:psm CC=icc CXX=icpc FC=ifort
 make -j 8
 make install
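
To confirm afterwards which device an installation was configured with, one
option is to pull the --with-device value out of the recorded configure line.
This is a sketch only: with_device is a hypothetical helper, and it assumes
the configure line is available as text, for example from `mpiname -a` or
from the top of config.log.

```shell
# Hypothetical helper: print the value of the first --with-device=...
# option found in the text on stdin (prints nothing if there is none).
with_device() {
    grep -o -e '--with-device=[A-Za-z0-9:_]*' | head -n 1 | cut -d= -f2
}
```

For example, `mpiname -a | with_device` should print ch3:psm for a build
configured as above; a different value would mean another device was picked.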

It works fine with both mpiexec and mpirun in a Slurm job, using my choice of
hello world:

 $ export PATH=/home/cap/mpiinst/mvapich2-2.3.1_psm/bin:$PATH
 $ mpicc -o mdrbench_mvp.x mdrbench.c

 # in slurm -N2 -n32 job shell
 $ unset PSM_RANKS_PER_CONTEXT
 $ mpirun ./mdrbench_mvp.x
 CPU timing results: iter/us (rank0/mean): 161/161
 Setting load to: 0%
 1D dim geometry is: 32
 ...

/Peter K



