[mvapich-discuss] mvapich 1.8 on SUSE SLES 11 SP2 problem

Carmelo Ponti (CSCS) cponti at cscs.ch
Tue May 29 09:24:35 EDT 2012


Dear mvapich-discuss mailing list

I'm trying MVAPICH2 1.8 on SUSE SLES 11 SP2 (kernel 3.0.13-0.27) with
SLURM 2.3.4, but I get a strange error that I cannot solve.

Here are the details of my installation/configuration:

- CPU: E5-2670 0 @ 2.60GHz dual socket
- RAM: 64G
- IB: Mellanox Technologies MT27500 Family [ConnectX-3] FDR
- OS: SUSE Linux Enterprise Server 11 (x86_64) SP2
- KERNEL: 3.0.13-0.27-default #1 SMP
- GCC: gcc version 4.3.4
- GLIBC: glibc-2.11.3-17.31.1

- OFED: OFED-1.5.4.1, installed with the following configuration file
  (a basic link check follows it):

# cat ofed.conf
kernel-ib=y
core=y
mthca=y
mlx4=y
mlx4_en=y
cxgb3=y
nes=y
ipoib=y
kernel-ib-devel=y
libibverbs=y
libibverbs-devel=y
libibverbs-utils=y
libmthca=y
libmlx4=y
libmlx4-devel=y
libcxgb3=y
libcxgb3-devel=y
libnes=y
libibumad=y
libibumad-devel=y
libibmad=y
libibmad-devel=y
librdmacm=y
librdmacm-utils=y
librdmacm-devel=y
perftest=y
mstflint=y
ibutils=y
infiniband-diags=y
ofed-docs=y
ofed-scripts=y
build32=0
prefix=/usr
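
For reference, a basic check that the IB stack is up after the OFED
install looks roughly like this (ibstat comes from infiniband-diags and
ibv_devinfo from libibverbs-utils, both selected above; output omitted):

# ibstat
# ibv_devinfo | grep state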

- LIMIC

I compiled and installed LiMIC2 as follows:

# cp -rp mvapich2-1.8-r5423/limic2-0.5.5 /usr/src/packages/SOURCES
# cd /usr/src/packages/SOURCES
# tar cfz limic2-0.5.5.tar.gz limic2-0.5.5
# cp mvapich2-1.8-r5423/limic2-0.5.5/limic.spec /usr/src/packages/SPECS
# cd /usr/src/packages/SPECS
# rpmbuild -ba limic.spec

# rpm -ivh /usr/src/packages/RPMS/x86_64/limic2-0.5.5-1.x86_64.rpm

I patched limic.spec as follows to make it work on SLES 11 SP2 (a quick
module check follows the diff):

# diff limic.spec mvapich2-1.8-r5423/limic2-0.5.5/limic.spec 
95a96,104
> %{_libdir}/*.so.*
> 
> %files module
> %defattr(644,root,root,755)
> /lib/modules/%(uname -r)/extra/limic.ko
> 
> %files common
> %defattr(-,root,root,-)
> %doc
97a107,110
> 
> %files devel
> %defattr(-,root,root,-)
> %doc
100,102d112
< %{_libdir}/*.so.*
< %{_libdir}/liblimic2.a
< %attr(0755, root, root) /lib/modules/%(uname -r)/extra/limic.ko
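
After installing the RPM, the kernel module can be checked roughly as
follows (the depmod step may already be handled by the RPM's
post-install scriptlet and is listed only for completeness):

# depmod -a
# modprobe limic
# lsmod | grep limic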

- MVAPICH2

# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/apps/julier/slurm/default/lib/

# cd mvapich2-1.8-r5423
# ./configure --prefix=/apps/julier/mvapich2/1.8/gcc-4.3.4 \
    --enable-threads=default --enable-shared --enable-sharedlibs=gcc \
    --enable-fc --with-mpe --enable-rsh --enable-rdma-cm --enable-fast \
    --enable-smpcoll --with-hwloc --enable-xrc --with-device=ch3:mrail \
    --with-rdma=gen2 --enable-g=dbg --enable-debuginfo --with-limic2 \
    CC=gcc CXX=g++ FC=gfortran F77=gfortran --with-pmi=slurm --with-pm=no \
    --with-slurm=/apps/julier/slurm/default/ \
    CPPFLAGS=-I/apps/julier/slurm/default/include \
    LDFLAGS=-L/apps/julier/slurm/default/lib

# make
# make install
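
Before the checks below, the new installation is put on the path in the
usual way, roughly:

# export PATH=/apps/julier/mvapich2/1.8/gcc-4.3.4/bin:$PATH
# export LD_LIBRARY_PATH=/apps/julier/mvapich2/1.8/gcc-4.3.4/lib:$LD_LIBRARY_PATH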

Note: the compilation of MVAPICH2 completed without problems.

# mpiname -a
MVAPICH2 1.8 Mon Apr 30 14:56:40 EDT 2012 ch3:mrail

Compilation
CC: gcc    -g -DNDEBUG -DNVALGRIND -O2
CXX: g++   -g -DNDEBUG -DNVALGRIND -O2
F77: gfortran   -g -O2 
FC: gfortran   -g -O2

Configuration
--prefix=/apps/julier/mvapich2/1.8/gcc-4.3.4 --enable-threads=default
--enable-shared --enable-sharedlibs=gcc --enable-fc --with-mpe
--enable-rsh --enable-rdma-cm --enable-fast --enable-smpcoll
--with-hwloc --enable-xrc --with-device=ch3:mrail --with-rdma=gen2
--enable-g=dbg --enable-debuginfo --with-limic2 CC=gcc CXX=g++
FC=gfortran F77=gfortran --with-pmi=slurm --with-pm=no
--with-slurm=/apps/julier/slurm/default/
CPPFLAGS=-I/apps/julier/slurm/default/include
LDFLAGS=-L/apps/julier/slurm/default/lib

- TEST

I tested the installation by compiling the following small program:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
  int numprocs, rank, namelen;
  char processor_name[MPI_MAX_PROCESSOR_NAME];

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(processor_name, &namelen);

  printf("Process %d on %s out of %d\n", rank, processor_name,
numprocs);

  MPI_Finalize();
}

# mpicc hello_world_mpi_mvapich.c -o hello_world_mpi_mvapich

and then I submitted it as follows:

# srun --job-name=hello_world_mpi_mvapich --time=00:03:00 --nodes=2 \
    --ntasks-per-node=2 --mem-per-cpu=1024 ./hello_world_mpi_mvapich

RESULTS:

- If I submit one job on 2 nodes with --ntasks-per-node=1, it works:

Process 1 on julier16 out of 2
Process 0 on julier15 out of 2

- If I submit one job on one node with --ntasks-per-node=2, it works:

Process 1 on julier15 out of 2
Process 0 on julier15 out of 2

- If I submit one job on 2 nodes with --ntasks-per-node=2, it fails with
the following errors:

In: PMI_Abort(1, Fatal error in MPI_Init:
Other MPI error
)
In: PMI_Abort(1, Fatal error in MPI_Init:
Other MPI error
)
slurmd[julier15]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH SIGNAL 9 ***
slurmd[julier15]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH SIGNAL 9 ***
slurmd[julier16]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH SIGNAL 9 ***
slurmd[julier16]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH SIGNAL 9 ***
slurmd[julier15]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH SIGNAL 9 ***
slurmd[julier16]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH SIGNAL 9 ***

I also tried building MVAPICH2 without SLURM support and launching with
mpiexec.hydra, but I got the same errors (see the sketch below). I tried
MVAPICH2 1.7 as well, with the same results.
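
For that test the launch looked roughly like this (hostfile contents
shown only as an illustration):

# cat hosts
julier15
julier16
# mpiexec.hydra -f hosts -n 4 ./hello_world_mpi_mvapich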

Please note that MVAPICH2 1.7 works without problems on another cluster
with OFED 1.5.2 and SUSE SLES 11 SP1.

Thank you in advance for your help
Carmelo Ponti

-- 
----------------------------------------------------------------------
Carmelo Ponti           System Engineer                             
CSCS                    Swiss Center for Scientific Computing 
Via Trevano 131         Email: cponti at cscs.ch                  
CH-6900 Lugano          http://www.cscs.ch              
                        Phone: +41 91 610 82 15/Fax: +41 91 610 82 82
----------------------------------------------------------------------


