[mvapich-discuss] mvapich 1.8 on SUSE SLES 11 SP2 problem

Carmelo Ponti (CSCS) cponti at cscs.ch
Tue May 29 15:11:28 EDT 2012


Dear Jonathan

Thank you for your prompt answer.

LiMIC2 is loaded correctly; here are some details:

# /etc/init.d/limic status
LiMIC2 is loaded

# lsmod | grep limic
limic                  13077  0 

# ls -l /dev/limic 
crw-r--r-- 1 root root 248, 0 May 29 17:49 /dev/limic

# modinfo /lib/modules/3.0.13-0.27-default/extra/limic.ko 
filename:       /lib/modules/3.0.13-0.27-default/extra/limic.ko
license:        Dual BSD/GPL
version:        0.5.5
description:    LiMIC2: Linux Kernel Module for High-Performance MPI
Intra-Node Communication
author:         Hyun-Wook Jin <jinh at konkuk.ac.kr>
srcversion:     929C7A1B1D503D007A13B45
depends:        
vermagic:       3.0.13-0.27-default SMP mod_unload modversions
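
(As an extra sanity check that an unprivileged process can reach the device
node, something like the following sketch could be used; it only tests
open() permission on /dev/limic, not the LiMIC2 API itself.)

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

/* Sketch: confirm that /dev/limic can be opened by an ordinary process. */
int main(void)
{
    int fd = open("/dev/limic", O_RDONLY);
    if (fd < 0) {
        perror("open /dev/limic");
        return 1;
    }
    printf("/dev/limic opened successfully\n");
    close(fd);
    return 0;
}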

and if I check the application:

# ldd hello_world_mpi_mvapich
	linux-vdso.so.1 =>  (0x00007fff4a9ff000)
	libmpich.so.3 => /apps/julier/mvapich2/1.8/gcc-4.3.4/lib/libmpich.so.3
(0x00007f0d80f8d000)
	libopa.so.1 => /apps/julier/mvapich2/1.8/gcc-4.3.4/lib/libopa.so.1
(0x00007f0d80d8a000)
	libmpl.so.1 => /apps/julier/mvapich2/1.8/gcc-4.3.4/lib/libmpl.so.1
(0x00007f0d80b85000)
	libpmi.so.0 => /apps/julier/slurm/default/lib/libpmi.so.0
(0x00007f0d80949000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f0d8072b000)
	liblimic2.so.0 => /usr/lib64/liblimic2.so.0 (0x00007f0d80529000)
	librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00007f0d80320000)
	libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00007f0d80110000)
	libibumad.so.3 => /usr/lib64/libibumad.so.3 (0x00007f0d7ff09000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007f0d7fd05000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f0d7fafb000)
	libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00007f0d7f8f2000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f0d7f57e000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f0d813ab000)
	libslurm.so.23 => /apps/julier/slurm/2.3.4-pam/lib/libslurm.so.23
(0x00007f0d7b8d7000)

As you suggested, I recompiled mvapich2 without LiMIC2, but the behaviour
is exactly the same:

- 2 nodes, 1 task => OK
- 1 node, 2 tasks => OK
- 2 nodes, 2 tasks => NOK

The result of "sbatch -n 2 --wrap='ulimit -l'" is unlimited.
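
(For reference, the same limit can also be read from inside a process with
getrlimit(); a minimal sketch, for illustration only:)

#include <stdio.h>
#include <sys/resource.h>

/* Sketch: print the locked-memory limit (what "ulimit -l" reports)
 * as seen by the calling process. */
int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("RLIMIT_MEMLOCK: unlimited\n");
    else
        printf("RLIMIT_MEMLOCK: %llu bytes\n",
               (unsigned long long) rl.rlim_cur);
    return 0;
}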

Unfortunately, with Sandy Bridge I am forced to use SLES 11 SP2, which uses
kernel 3.0. Could this be the cause of the problem?

Regards
Carmelo

On Tue, 2012-05-29 at 12:07 -0400, Jonathan Perkins wrote:
> Thanks for sending the note.  I have a couple initial things we can look
> at.
> 
> Have you tried a build without LiMIC2 to verify that it is working in
> this mode?  If it is working without LiMIC2 but not with it, please make
> sure that LiMIC2 is installed and the module is loaded before running a
> job.
>     
>     chkconfig limic on
>     service limic start
> 
> If it still does not work without LiMIC2 perhaps it is a locked memory
> issue.  If you run
> 
>     sbatch -n 2 --wrap='ulimit -l' 
> 
> does it return unlimited or some other high number?  If not then please
> take a look at the following link.
> 
> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.8.html#x1-1250009.4.3
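> 
> The usual fix described there (a sketch only; the exact wording is in the
> guide) is to raise the memlock limit on the compute nodes, e.g. in
> /etc/security/limits.conf:
> 
>     * soft memlock unlimited
>     * hard memlock unlimited
> 
> and then to restart slurmd (or reboot the nodes) so that the daemons, and
> the job steps they launch, inherit the new limit.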
> 
> Let us know if any of this resolves your problem.  Otherwise we can
> debug further.
> 
> On Tue, May 29, 2012 at 03:24:35PM +0200, Carmelo Ponti (CSCS) wrote:
> > Dear mvapich-discuss mailing list
> > 
> > I'm trying mvapich 1.8 on SUSE SLES 11 SP2 (kernel 3.0.13-0.27) using
> > SLURM 2.3.4 but I got a strange error which I cannot solve.
> > 
> > Here are the details of my installation/configuration:
> > 
> > - CPU: E5-2670 0 @ 2.60GHz dual socket
> > - RAM: 64G
> > - IB: Mellanox Technologies MT27500 Family [ConnectX-3] FDR
> > - OS: SUSE Linux Enterprise Server 11 (x86_64) SP2
> > - KERNEL: 3.0.13-0.27-default #1 SMP
> > - GCC: gcc version 4.3.4
> > - GLIBC: glibc-2.11.3-17.31.1
> > 
> > - OFED: OFED-1.5.4.1 installed with the following configuration file:
> > 
> > # cat ofed.conf
> > kernel-ib=y
> > core=y
> > mthca=y
> > mlx4=y
> > mlx4_en=y
> > cxgb3=y
> > nes=y
> > ipoib=y
> > kernel-ib-devel=y
> > libibverbs=y
> > libibverbs-devel=y
> > libibverbs-utils=y
> > libmthca=y
> > libmlx4=y
> > libmlx4-devel=y
> > libcxgb3=y
> > libcxgb3-devel=y
> > libnes=y
> > libibumad=y
> > libibumad-devel=y
> > libibmad=y
> > libibmad-devel=y
> > librdmacm=y
> > librdmacm-utils=y
> > librdmacm-devel=y
> > perftest=y
> > mstflint=y
> > ibutils=y
> > infiniband-diags=y
> > ofed-docs=y
> > ofed-scripts=y
> > build32=0
> > prefix=/usr
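> > 
> > (With libibverbs-utils installed, the verbs stack itself can be
> > sanity-checked on each node with, for example:
> > 
> > # ibv_devinfo
> > 
> > which should list the ConnectX-3 HCA with its ports in PORT_ACTIVE state.)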
> > 
> > - LIMIC
> > 
> > I compiled and installed LiMIC2 as follows:
> > 
> > # cp -rp mvapich2-1.8-r5423/limic2-0.5.5 /usr/src/packages/SOURCES
> > # cd /usr/src/packages/SOURCES
> > # tar cfz limic2-0.5.5.tar.gz limic2-0.5.5
> > # cp mvapich2-1.8-r5423/limic2-0.5.5/limic.spec /usr/src/packages/SPECS
> > # cd /usr/src/packages/SPECS
> > # rpmbuild -ba limic.spec
> > 
> > # rpm -ivh /usr/src/packages/RPMS/x86_64/limic2-0.5.5-1.x86_64.rpm
> > 
> > I patched limic.spec as follows to make it work on SLES 11 SP2:
> > 
> > # diff limic.spec mvapich2-1.8-r5423/limic2-0.5.5/limic.spec 
> > 95a96,104
> > > %{_libdir}/*.so.*
> > > 
> > > %files module
> > > %defattr(644,root,root,755)
> > > /lib/modules/%(uname -r)/extra/limic.ko
> > > 
> > > %files common
> > > %defattr(-,root,root,-)
> > > %doc
> > 97a107,110
> > > 
> > > %files devel
> > > %defattr(-,root,root,-)
> > > %doc
> > 100,102d112
> > < %{_libdir}/*.so.*
> > < %{_libdir}/liblimic2.a
> > < %attr(0755, root, root) /lib/modules/%(uname -r)/extra/limic.ko
> > 
> > - MVAPICH2
> > # export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/apps/julier/slurm/default/lib/
> > 
> > # cd mvapich2-1.8-r5423
> > # ./configure --prefix=/apps/julier/mvapich2/1.8/gcc-4.3.4
> > --enable-threads=default --enable-shared --enable-sharedlibs=gcc
> > --enable-fc --with-mpe --enable-rsh --enable-rdma-cm --enable-fast
> > --enable-smpcoll --with-hwloc --enable-xrc --with-device=ch3:mrail
> > --with-rdma=gen2 --enable-g=dbg --enable-debuginfo --with-limic2 CC=gcc
> > CXX=g++ FC=gfortran F77=gfortran --with-pmi=slurm --with-pm=no
> > --with-slurm=/apps/julier/slurm/default/
> > CPPFLAGS=-I/apps/julier/slurm/default/include
> > LDFLAGS=-L/apps/julier/slurm/default/lib
> > 
> > # make
> > # make install
> > 
> > Notice: the compilation of mvapich2 completed without problems.
> > 
> > # mpiname -a
> > MVAPICH2 1.8 Mon Apr 30 14:56:40 EDT 2012 ch3:mrail
> > 
> > Compilation
> > CC: gcc    -g -DNDEBUG -DNVALGRIND -O2
> > CXX: g++   -g -DNDEBUG -DNVALGRIND -O2
> > F77: gfortran   -g -O2 
> > FC: gfortran   -g -O2
> > 
> > Configuration
> > --prefix=/apps/julier/mvapich2/1.8/gcc-4.3.4 --enable-threads=default
> > --enable-shared --enable-sharedlibs=gcc --enable-fc --with-mpe
> > --enable-rsh --enable-rdma-cm --enable-fast --enable-smpcoll
> > --with-hwloc --enable-xrc --with-device=ch3:mrail --with-rdma=gen2
> > --enable-g=dbg --enable-debuginfo --with-limic2 CC=gcc CXX=g++
> > FC=gfortran F77=gfortran --with-pmi=slurm --with-pm=no
> > --with-slurm=/apps/julier/slurm/default/
> > CPPFLAGS=-I/apps/julier/slurm/default/include
> > LDFLAGS=-L/apps/julier/slurm/default/lib
> > 
> > - TEST
> > 
> > I tested the installation by compiling the following small program:
> > 
> > #include <stdio.h>
> > #include <mpi.h>
> > 
> > int main(int argc, char *argv[]) {
> >   int numprocs, rank, namelen;
> >   char processor_name[MPI_MAX_PROCESSOR_NAME];
> > 
> >   MPI_Init(&argc, &argv);
> >   MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
> >   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >   MPI_Get_processor_name(processor_name, &namelen);
> > 
> >   printf("Process %d on %s out of %d\n", rank, processor_name,
> > numprocs);
> > 
> >   MPI_Finalize();
> > }
> > 
> > # mpicc hello_world_mpi_mvapich.c -o hello_world_mpi_mvapich
> > 
> > and then I submitted it as follows:
> > 
> > # srun --job-name=hello_world_mpi_mvapich --time=00:03:00 --nodes=2
> > --ntasks-per-node=2 --mem-per-cpu=1024 ./hello_world_mpi_mvapich
> > 
> > RESULTS:
> > 
> > - If I submit one job on 2 nodes with --ntasks-per-node=1, it works:
> > 
> > Process 1 on julier16 out of 2
> > Process 0 on julier15 out of 2
> > 
> > - If I submit one job on one node with --ntasks-per-node=2, it works:
> > 
> > Process 1 on julier15 out of 2
> > Process 0 on julier15 out of 2
> > 
> > - If I submit one job on 2 nodes with --ntasks-per-node=2, it fails with
> > the following errors:
> > 
> > In: PMI_Abort(1, Fatal error in MPI_Init:
> > Other MPI error
> > )
> > In: PMI_Abort(1, Fatal error in MPI_Init:
> > Other MPI error
> > )
> > slurmd[julier15]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> > SIGNAL 9 ***
> > slurmd[julier15]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> > SIGNAL 9 ***
> > slurmd[julier16]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> > SIGNAL 9 ***
> > slurmd[julier16]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> > SIGNAL 9 ***
> > slurmd[julier15]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> > SIGNAL 9 ***
> > slurmd[julier16]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> > SIGNAL 9 ***
> > 
> > I also tried building it without SLURM and launching with mpiexec.hydra,
> > but I got the same errors. I tried mvapich 1.7 as well, with the same results.
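> > 
> > For reference, the mpiexec.hydra launch was along these lines (the
> > hostfile name is just illustrative):
> > 
> > # mpiexec.hydra -f hosts -n 4 ./hello_world_mpi_mvapich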
> > 
> > Please notice that mvapich 1.7 on another cluster with OFED 1.5.2 and
> > SUSE SLES 11 SP1 works without problems.
> > 
> > Thank you in advance for your help
> > Carmelo Ponti
> > 
> > -- 
> > ----------------------------------------------------------------------
> > Carmelo Ponti           System Engineer                             
> > CSCS                    Swiss Center for Scientific Computing 
> > Via Trevano 131         Email: cponti at cscs.ch                  
> > CH-6900 Lugano          http://www.cscs.ch              
> >                         Phone: +41 91 610 82 15/Fax: +41 91 610 82 82
> > ----------------------------------------------------------------------
> > 
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > 
> 

-- 
----------------------------------------------------------------------
Carmelo Ponti           System Engineer                             
CSCS                    Swiss Center for Scientific Computing 
Via Trevano 131         Email: cponti at cscs.ch                  
CH-6900 Lugano          http://www.cscs.ch              
                        Phone: +41 91 610 82 15/Fax: +41 91 610 82 82
----------------------------------------------------------------------


