[mvapich-discuss] mvapich 1.8 on SUSE SLES 11 SP2 problem
Carmelo Ponti (CSCS)
cponti at cscs.ch
Tue May 29 15:11:28 EDT 2012
Dear Jonathan,
Thank you for your prompt answer.
LiMIC2 is loaded correctly; here are some details:
# /etc/init.d/limic status
LiMIC2 is loaded
# lsmod | grep limic
limic 13077 0
# ls -l /dev/limic
crw-r--r-- 1 root root 248, 0 May 29 17:49 /dev/limic
# modinfo /lib/modules/3.0.13-0.27-default/extra/limic.ko
filename: /lib/modules/3.0.13-0.27-default/extra/limic.ko
license: Dual BSD/GPL
version: 0.5.5
description: LiMIC2: Linux Kernel Module for High-Performance MPI
Intra-Node Communication
author: Hyun-Wook Jin <jinh at konkuk.ac.kr>
srcversion: 929C7A1B1D503D007A13B45
depends:
vermagic: 3.0.13-0.27-default SMP mod_unload modversions
And if I check the application:
# ldd hello_world_mpi_mvapich
linux-vdso.so.1 => (0x00007fff4a9ff000)
libmpich.so.3 => /apps/julier/mvapich2/1.8/gcc-4.3.4/lib/libmpich.so.3
(0x00007f0d80f8d000)
libopa.so.1 => /apps/julier/mvapich2/1.8/gcc-4.3.4/lib/libopa.so.1
(0x00007f0d80d8a000)
libmpl.so.1 => /apps/julier/mvapich2/1.8/gcc-4.3.4/lib/libmpl.so.1
(0x00007f0d80b85000)
libpmi.so.0 => /apps/julier/slurm/default/lib/libpmi.so.0
(0x00007f0d80949000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f0d8072b000)
liblimic2.so.0 => /usr/lib64/liblimic2.so.0 (0x00007f0d80529000)
librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00007f0d80320000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00007f0d80110000)
libibumad.so.3 => /usr/lib64/libibumad.so.3 (0x00007f0d7ff09000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f0d7fd05000)
librt.so.1 => /lib64/librt.so.1 (0x00007f0d7fafb000)
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00007f0d7f8f2000)
libc.so.6 => /lib64/libc.so.6 (0x00007f0d7f57e000)
/lib64/ld-linux-x86-64.so.2 (0x00007f0d813ab000)
libslurm.so.23 => /apps/julier/slurm/2.3.4-pam/lib/libslurm.so.23
(0x00007f0d7b8d7000)
As you suggested, I recompiled mvapich2 without LiMIC2, but the behaviour
is exactly the same:
- 2 nodes, 1 task => OK
- 1 node, 2 tasks => OK
- 2 nodes, 2 tasks => NOK
The result of "sbatch -n 2 --wrap='ulimit -l'" is unlimited.
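Since the limit already reports unlimited, locked memory does not look like
the culprit here. For the record, when "ulimit -l" is low, the fix in the
user guide normally amounts to the following limits.conf entries (a generic
sketch with typical values, not taken from this system):

```shell
# /etc/security/limits.conf (sketch; verify against the MVAPICH2
# user guide section Jonathan linked)
* soft memlock unlimited
* hard memlock unlimited
```

slurmd also has to be restarted after such a change, since job steps inherit
the limit from the daemon rather than from a login shell.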
Unfortunately, with Sandy Bridge I'm forced to use SLES 11 SP2, which uses
kernel 3.0. Could this be the cause of the problem?
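If it helps to rule LiMIC2 in or out without another rebuild, there should
also be a runtime switch for it; MV2_SMP_USE_LIMIC2 is the environment
variable I would try (name assumed from the user guide, so please
double-check it for 1.8):

```shell
# Sketch: rerun the failing 2-node/2-task case with LiMIC2 disabled at runtime
MV2_SMP_USE_LIMIC2=0 srun --nodes=2 --ntasks-per-node=2 ./hello_world_mpi_mvapich
```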
Regards
Carmelo
On Tue, 2012-05-29 at 12:07 -0400, Jonathan Perkins wrote:
> Thanks for sending the note. I have a couple of initial things we can
> look at.
>
> Have you tried a build without LiMIC2 to verify that it works in that
> mode? If it works without LiMIC2 but not with it, please make sure
> that LiMIC2 is installed and that the module is loaded before running a
> job.
>
> chkconfig limic on
> service limic start
>
> If it still does not work without LiMIC2 perhaps it is a locked memory
> issue. If you run
>
> sbatch -n 2 --wrap='ulimit -l'
>
> does it return unlimited or some other high number? If not, then please
> take a look at the following link.
>
> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.8.html#x1-1250009.4.3
>
> Let us know if any of this resolves your problem. Otherwise we can
> debug further.
>
> On Tue, May 29, 2012 at 03:24:35PM +0200, Carmelo Ponti (CSCS) wrote:
> > Dear mvapich-discuss mailing list
> >
> > I'm trying mvapich 1.8 on SUSE SLES 11 SP2 (kernel 3.0.13-0.27) using
> > SLURM 2.3.4 but I got a strange error which I cannot solve.
> >
> > Following the details of my installation/configuration:
> >
> > - CPU: E5-2670 0 @ 2.60GHz dual socket
> > - RAM: 64G
> > - IB: Mellanox Technologies MT27500 Family [ConnectX-3] FDR
> > - OS: SUSE Linux Enterprise Server 11 (x86_64) SP2
> > - KERNEL: 3.0.13-0.27-default #1 SMP
> > - GCC: gcc version 4.3.4
> > - GLIBC: glibc-2.11.3-17.31.1
> >
> > - OFED: OFED-1.5.4.1 installed with the following configuration file:
> >
> > # cat ofed.conf
> > kernel-ib=y
> > core=y
> > mthca=y
> > mlx4=y
> > mlx4_en=y
> > cxgb3=y
> > nes=y
> > ipoib=y
> > kernel-ib-devel=y
> > libibverbs=y
> > libibverbs-devel=y
> > libibverbs-utils=y
> > libmthca=y
> > libmlx4=y
> > libmlx4-devel=y
> > libcxgb3=y
> > libcxgb3-devel=y
> > libnes=y
> > libibumad=y
> > libibumad-devel=y
> > libibmad=y
> > libibmad-devel=y
> > librdmacm=y
> > librdmacm-utils=y
> > librdmacm-devel=y
> > perftest=y
> > mstflint=y
> > ibutils=y
> > infiniband-diags=y
> > ofed-docs=y
> > ofed-scripts=y
> > build32=0
> > prefix=/usr
> >
> > - LIMIC
> >
> > I compiled and installed LiMIC2 as follows:
> >
> > # cp -rp mvapich2-1.8-r5423/limic2-0.5.5 /usr/src/packages/SOURCES
> > # cd /usr/src/packages/SOURCES
> > # tar cfz limic2-0.5.5.tar.gz limic2-0.5.5
> > # cp mvapich2-1.8-r5423/limic2-0.5.5/limic.spec /usr/src/packages/SPECS
> > # cd /usr/src/packages/SPECS
> > # rpmbuild -ba limic.spec
> >
> > # rpm -ivh /usr/src/packages/RPMS/x86_64/limic2-0.5.5-1.x86_64.rpm
> >
> > I patched limic.spec as follows to make it work on SLES 11 SP2:
> >
> > # diff limic.spec mvapich2-1.8-r5423/limic2-0.5.5/limic.spec
> > 95a96,104
> > > %{_libdir}/*.so.*
> > >
> > > %files module
> > > %defattr(644,root,root,755)
> > > /lib/modules/%(uname -r)/extra/limic.ko
> > >
> > > %files common
> > > %defattr(-,root,root,-)
> > > %doc
> > 97a107,110
> > >
> > > %files devel
> > > %defattr(-,root,root,-)
> > > %doc
> > 100,102d112
> > < %{_libdir}/*.so.*
> > < %{_libdir}/liblimic2.a
> > < %attr(0755, root, root) /lib/modules/%(uname -r)/extra/limic.ko
> >
> > - MVAPICH2
> > # export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/apps/julier/slurm/default/lib/
> >
> > # cd mvapich2-1.8-r5423
> > # ./configure --prefix=/apps/julier/mvapich2/1.8/gcc-4.3.4
> > --enable-threads=default --enable-shared --enable-sharedlibs=gcc
> > --enable-fc --with-mpe --enable-rsh --enable-rdma-cm --enable-fast
> > --enable-smpcoll --with-hwloc --enable-xrc --with-device=ch3:mrail
> > --with-rdma=gen2 --enable-g=dbg --enable-debuginfo --with-limic2 CC=gcc
> > CXX=g++ FC=gfortran F77=gfortran --with-pmi=slurm --with-pm=no
> > --with-slurm=/apps/julier/slurm/default/
> > CPPFLAGS=-I/apps/julier/slurm/default/include
> > LDFLAGS=-L/apps/julier/slurm/default/lib
> >
> > # make
> > # make install
> >
> > Note: the compilation of mvapich2 completed without problems.
> >
> > # mpiname -a
> > MVAPICH2 1.8 Mon Apr 30 14:56:40 EDT 2012 ch3:mrail
> >
> > Compilation
> > CC: gcc -g -DNDEBUG -DNVALGRIND -O2
> > CXX: g++ -g -DNDEBUG -DNVALGRIND -O2
> > F77: gfortran -g -O2
> > FC: gfortran -g -O2
> >
> > Configuration
> > --prefix=/apps/julier/mvapich2/1.8/gcc-4.3.4 --enable-threads=default
> > --enable-shared --enable-sharedlibs=gcc --enable-fc --with-mpe
> > --enable-rsh --enable-rdma-cm --enable-fast --enable-smpcoll
> > --with-hwloc --enable-xrc --with-device=ch3:mrail --with-rdma=gen2
> > --enable-g=dbg --enable-debuginfo --with-limic2 CC=gcc CXX=g++
> > FC=gfortran F77=gfortran --with-pmi=slurm --with-pm=no
> > --with-slurm=/apps/julier/slurm/default/
> > CPPFLAGS=-I/apps/julier/slurm/default/include
> > LDFLAGS=-L/apps/julier/slurm/default/lib
> >
> > - TEST
> >
> > I tested the installation compiling the following small program:
> >
> > #include <stdio.h>
> > #include <mpi.h>
> >
> > int main(int argc, char *argv[]) {
> > int numprocs, rank, namelen;
> > char processor_name[MPI_MAX_PROCESSOR_NAME];
> >
> > MPI_Init(&argc, &argv);
> > MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
> > MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> > MPI_Get_processor_name(processor_name, &namelen);
> >
> > printf("Process %d on %s out of %d\n", rank, processor_name,
> > numprocs);
> >
> > MPI_Finalize();
> > return 0;
> > }
> >
> > # mpicc hello_world_mpi_mvapich.c -o hello_world_mpi_mvapich
> >
> > and then I submitted it as follows:
> >
> > # srun --job-name=hello_world_mpi_mvapich --time=00:03:00 --nodes=2
> > --ntasks-per-node=2 --mem-per-cpu=1024 ./hello_world_mpi_mvapich
> >
> > RESULTS:
> >
> > - If I submit one job on 2 nodes with --ntasks-per-node=1, it works:
> >
> > Process 1 on julier16 out of 2
> > Process 0 on julier15 out of 2
> >
> > - If I submit one job on one node with --ntasks-per-node=2, it works:
> >
> > Process 1 on julier15 out of 2
> > Process 0 on julier15 out of 2
> >
> > - If I submit one job on 2 nodes with --ntasks-per-node=2, it fails with
> > the following errors:
> >
> > In: PMI_Abort(1, Fatal error in MPI_Init:
> > Other MPI error
> > )
> > In: PMI_Abort(1, Fatal error in MPI_Init:
> > Other MPI error
> > )
> > slurmd[julier15]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> > SIGNAL 9 ***
> > slurmd[julier15]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> > SIGNAL 9 ***
> > slurmd[julier16]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> > SIGNAL 9 ***
> > slurmd[julier16]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> > SIGNAL 9 ***
> > slurmd[julier15]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> > SIGNAL 9 ***
> > slurmd[julier16]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> > SIGNAL 9 ***
> >
> > I also tried compiling it without slurm and using mpiexec.hydra, but I
> > got the same errors. I tried mvapich 1.7 as well, with the same results.
> >
> > Please notice that mvapich 1.7 on another cluster with OFED 1.5.2 and
> > SUSE SLES 11 SP1 works without problems.
> >
> > Thank you in advance for your help
> > Carmelo Ponti
> >
> > --
> > ----------------------------------------------------------------------
> > Carmelo Ponti System Engineer
> > CSCS Swiss Center for Scientific Computing
> > Via Trevano 131 Email: cponti at cscs.ch
> > CH-6900 Lugano http://www.cscs.ch
> > Phone: +41 91 610 82 15/Fax: +41 91 610 82 82
> > ----------------------------------------------------------------------
> >
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >
>
--
----------------------------------------------------------------------
Carmelo Ponti System Engineer
CSCS Swiss Center for Scientific Computing
Via Trevano 131 Email: cponti at cscs.ch
CH-6900 Lugano http://www.cscs.ch
Phone: +41 91 610 82 15/Fax: +41 91 610 82 82
----------------------------------------------------------------------