[mvapich-discuss] mvapich 1.8 on SUSE SLES 11 SP2 problem

Jonathan Perkins perkinjo at cse.ohio-state.edu
Tue May 29 12:07:26 EDT 2012


Thanks for sending the note.  I have a couple of initial things we can
look at.

Have you tried a build without LiMIC2 to verify that it works in that
mode?  If it works without LiMIC2 but not with it, please make sure that
LiMIC2 is installed and that the module is loaded before running a job:
    
    chkconfig limic on
    service limic start
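
You can double check that the module is actually loaded with something
like the following (the /dev/limic device path is an assumption based on
the default LiMIC2 install; lsmod alone is enough to see the module):

    lsmod | grep limic
    ls -l /dev/limic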

If it still does not work without LiMIC2, perhaps it is a locked memory
issue.  If you run

    sbatch -n 2 --wrap='ulimit -l' 

does it return unlimited or some other high number?  If not, please
take a look at the following link.

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.8.html#x1-1250009.4.3
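
In short, the usual fix (a sketch, assuming the standard pam_limits
setup; adapt as needed for your distribution) is to raise the locked
memory limit on every compute node, for example in
/etc/security/limits.conf:

    * soft memlock unlimited
    * hard memlock unlimited

and then restart slurmd so that job steps inherit the new limit.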

Let us know if any of this resolves your problem.  Otherwise we can
debug further.

On Tue, May 29, 2012 at 03:24:35PM +0200, Carmelo Ponti (CSCS) wrote:
> Dear mvapich-discuss mailing list
> 
> I'm trying mvapich 1.8 on SUSE SLES 11 SP2 (kernel 3.0.13-0.27) using
> SLURM 2.3.4, but I get a strange error that I cannot solve.
> 
> Here are the details of my installation/configuration:
> 
> - CPU: E5-2670 0 @ 2.60GHz dual socket
> - RAM: 64G
> - IB: Mellanox Technologies MT27500 Family [ConnectX-3] FDR
> - OS: SUSE Linux Enterprise Server 11 (x86_64) SP2
> - KERNEL: 3.0.13-0.27-default #1 SMP
> - GCC: gcc version 4.3.4
> - GLIBC: glibc-2.11.3-17.31.1
> 
> - OFED: OFED-1.5.4.1 installed with the following configuration file:
> 
> # cat ofed.conf
> kernel-ib=y
> core=y
> mthca=y
> mlx4=y
> mlx4_en=y
> cxgb3=y
> nes=y
> ipoib=y
> kernel-ib-devel=y
> libibverbs=y
> libibverbs-devel=y
> libibverbs-utils=y
> libmthca=y
> libmlx4=y
> libmlx4-devel=y
> libcxgb3=y
> libcxgb3-devel=y
> libnes=y
> libibumad=y
> libibumad-devel=y
> libibmad=y
> libibmad-devel=y
> librdmacm=y
> librdmacm-utils=y
> librdmacm-devel=y
> perftest=y
> mstflint=y
> ibutils=y
> infiniband-diags=y
> ofed-docs=y
> ofed-scripts=y
> build32=0
> prefix=/usr
> 
> - LIMIC
> 
> I compiled and installed limic as follows:
> 
> # cp -rp mvapich2-1.8-r5423/limic2-0.5.5 /usr/src/packages/SOURCES
> # cd /usr/src/packages/SOURCES
> # tar cfz limic2-0.5.5.tar.gz limic2-0.5.5
> # cp mvapich2-1.8-r5423/limic2-0.5.5/limic.spec /usr/src/packages/SPECS
> # cd /usr/src/packages/SPECS
> # rpmbuild -ba limic.spec
> 
> # rpm -ivh /usr/src/packages/RPMS/x86_64/limic2-0.5.5-1.x86_64.rpm
> 
> I patched limic.spec as follows to make it work on SLES 11 SP2:
> 
> # diff limic.spec mvapich2-1.8-r5423/limic2-0.5.5/limic.spec 
> 95a96,104
> > %{_libdir}/*.so.*
> > 
> > %files module
> > %defattr(644,root,root,755)
> > /lib/modules/%(uname -r)/extra/limic.ko
> > 
> > %files common
> > %defattr(-,root,root,-)
> > %doc
> 97a107,110
> > 
> > %files devel
> > %defattr(-,root,root,-)
> > %doc
> 100,102d112
> < %{_libdir}/*.so.*
> < %{_libdir}/liblimic2.a
> < %attr(0755, root, root) /lib/modules/%(uname -r)/extra/limic.ko
> 
> - MVAPICH2
> # export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/apps/julier/slurm/default/lib/
> 
> # cd mvapich2-1.8-r5423
> # ./configure --prefix=/apps/julier/mvapich2/1.8/gcc-4.3.4 \
>     --enable-threads=default --enable-shared --enable-sharedlibs=gcc \
>     --enable-fc --with-mpe --enable-rsh --enable-rdma-cm --enable-fast \
>     --enable-smpcoll --with-hwloc --enable-xrc --with-device=ch3:mrail \
>     --with-rdma=gen2 --enable-g=dbg --enable-debuginfo --with-limic2 \
>     CC=gcc CXX=g++ FC=gfortran F77=gfortran --with-pmi=slurm --with-pm=no \
>     --with-slurm=/apps/julier/slurm/default/ \
>     CPPFLAGS=-I/apps/julier/slurm/default/include \
>     LDFLAGS=-L/apps/julier/slurm/default/lib
> 
> # make
> # make install
> 
> Note: the compilation of mvapich2 completed without problems.
> 
> # mpiname -a
> MVAPICH2 1.8 Mon Apr 30 14:56:40 EDT 2012 ch3:mrail
> 
> Compilation
> CC: gcc    -g -DNDEBUG -DNVALGRIND -O2
> CXX: g++   -g -DNDEBUG -DNVALGRIND -O2
> F77: gfortran   -g -O2 
> FC: gfortran   -g -O2
> 
> Configuration
> --prefix=/apps/julier/mvapich2/1.8/gcc-4.3.4 --enable-threads=default
> --enable-shared --enable-sharedlibs=gcc --enable-fc --with-mpe
> --enable-rsh --enable-rdma-cm --enable-fast --enable-smpcoll
> --with-hwloc --enable-xrc --with-device=ch3:mrail --with-rdma=gen2
> --enable-g=dbg --enable-debuginfo --with-limic2 CC=gcc CXX=g++
> FC=gfortran F77=gfortran --with-pmi=slurm --with-pm=no
> --with-slurm=/apps/julier/slurm/default/
> CPPFLAGS=-I/apps/julier/slurm/default/include
> LDFLAGS=-L/apps/julier/slurm/default/lib
> 
> - TEST
> 
> I tested the installation by compiling the following small program:
> 
> #include <stdio.h>
> #include <mpi.h>
> 
> int main(int argc, char *argv[]) {
>   int numprocs, rank, namelen;
>   char processor_name[MPI_MAX_PROCESSOR_NAME];
> 
>   MPI_Init(&argc, &argv);
>   MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>   MPI_Get_processor_name(processor_name, &namelen);
> 
>   printf("Process %d on %s out of %d\n", rank, processor_name,
> numprocs);
> 
>   MPI_Finalize();
>   return 0;
> }
> 
> # mpicc hello_world_mpi_mvapich.c -o hello_world_mpi_mvapich
> 
> and then I submitted it as follows:
> 
> # srun --job-name=hello_world_mpi_mvapich --time=00:03:00 --nodes=2 \
>     --ntasks-per-node=2 --mem-per-cpu=1024 ./hello_world_mpi_mvapich
> 
> RESULTS:
> 
> - If I submit one job on 2 nodes with --ntasks-per-node=1, it works:
> 
> Process 1 on julier16 out of 2
> Process 0 on julier15 out of 2
> 
> - If I submit one job on one node with --ntasks-per-node=2, it works:
> 
> Process 1 on julier15 out of 2
> Process 0 on julier15 out of 2
> 
> - If I submit one job on 2 nodes with --ntasks-per-node=2, it fails with
> the following errors:
> 
> In: PMI_Abort(1, Fatal error in MPI_Init:
> Other MPI error
> )
> In: PMI_Abort(1, Fatal error in MPI_Init:
> Other MPI error
> )
> slurmd[julier15]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> SIGNAL 9 ***
> slurmd[julier15]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> SIGNAL 9 ***
> slurmd[julier16]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> SIGNAL 9 ***
> slurmd[julier16]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> SIGNAL 9 ***
> slurmd[julier15]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> SIGNAL 9 ***
> slurmd[julier16]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> SIGNAL 9 ***
> 
> I also tried compiling it without SLURM and running with mpiexec.hydra,
> but I got the same errors. I also tried mvapich 1.7, with the same results.
> 
> Please note that mvapich 1.7 works without problems on another cluster
> with OFED 1.5.2 and SUSE SLES 11 SP1.
> 
> Thank you in advance for your help
> Carmelo Ponti
> 
> -- 
> ----------------------------------------------------------------------
> Carmelo Ponti           System Engineer                             
> CSCS                    Swiss Center for Scientific Computing 
> Via Trevano 131         Email: cponti at cscs.ch                  
> CH-6900 Lugano          http://www.cscs.ch              
>                         Phone: +41 91 610 82 15/Fax: +41 91 610 82 82
> ----------------------------------------------------------------------
> 
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> 

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

