[mvapich-discuss] mvapich 1.8 on SUSE SLES 11 SP2 problem
Carmelo Ponti (CSCS)
cponti at cscs.ch
Tue May 29 09:24:35 EDT 2012
Dear mvapich-discuss mailing list,
I'm trying mvapich 1.8 on SUSE SLES 11 SP2 (kernel 3.0.13-0.27) with
SLURM 2.3.4, but I get a strange error that I cannot solve.
Here are the details of my installation/configuration:
- CPU: E5-2670 0 @ 2.60GHz dual socket
- RAM: 64G
- IB: Mellanox Technologies MT27500 Family [ConnectX-3] FDR
- OS: SUSE Linux Enterprise Server 11 (x86_64) SP2
- KERNEL: 3.0.13-0.27-default #1 SMP
- GCC: gcc version 4.3.4
- GLIBC: glibc-2.11.3-17.31.1
- OFED: OFED-1.5.4.1 installed with the following configuration file:
# cat ofed.conf
kernel-ib=y
core=y
mthca=y
mlx4=y
mlx4_en=y
cxgb3=y
nes=y
ipoib=y
kernel-ib-devel=y
libibverbs=y
libibverbs-devel=y
libibverbs-utils=y
libmthca=y
libmlx4=y
libmlx4-devel=y
libcxgb3=y
libcxgb3-devel=y
libnes=y
libibumad=y
libibumad-devel=y
libibmad=y
libibmad-devel=y
librdmacm=y
librdmacm-utils=y
librdmacm-devel=y
perftest=y
mstflint=y
ibutils=y
infiniband-diags=y
ofed-docs=y
ofed-scripts=y
build32=0
prefix=/usr
- LIMIC
I compiled and installed limic as follows:
# cp -rp mvapich2-1.8-r5423/limic2-0.5.5 /usr/src/packages/SOURCES
# cd /usr/src/packages/SOURCES
# tar cfz limic2-0.5.5.tar.gz limic2-0.5.5
# cp mvapich2-1.8-r5423/limic2-0.5.5/limic.spec /usr/src/packages/SPECS
# cd /usr/src/packages/SPECS
# rpmbuild -ba limic.spec
# rpm -ivh /usr/src/packages/RPMS/x86_64/limic2-0.5.5-1.x86_64.rpm
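For reference, the steps above collected into a single script; note that
limic.spec installs the module under /lib/modules/$(uname -r)/extra, so
the RPM has to be rebuilt after every kernel update. Paths assume the
same source layout as above:

```shell
#!/bin/sh
# Rebuild and install the LiMIC2 RPM from the MVAPICH2 source tree.
# The kernel module is built against the running kernel (uname -r),
# so re-run this after any kernel update.
set -e

SRC=mvapich2-1.8-r5423/limic2-0.5.5   # LiMIC2 sources shipped with MVAPICH2
TOP=/usr/src/packages                 # SLES rpmbuild top directory

cp -rp "$SRC" "$TOP/SOURCES"
(cd "$TOP/SOURCES" && tar cfz limic2-0.5.5.tar.gz limic2-0.5.5)
cp "$SRC/limic.spec" "$TOP/SPECS"
(cd "$TOP/SPECS" && rpmbuild -ba limic.spec)
rpm -Uvh "$TOP/RPMS/x86_64/limic2-0.5.5-1.x86_64.rpm"
```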
I patched limic.spec as follows to make it work on SLES 11 SP2:
# diff limic.spec mvapich2-1.8-r5423/limic2-0.5.5/limic.spec
95a96,104
> %{_libdir}/*.so.*
>
> %files module
> %defattr(644,root,root,755)
> /lib/modules/%(uname -r)/extra/limic.ko
>
> %files common
> %defattr(-,root,root,-)
> %doc
97a107,110
>
> %files devel
> %defattr(-,root,root,-)
> %doc
100,102d112
< %{_libdir}/*.so.*
< %{_libdir}/liblimic2.a
< %attr(0755, root, root) /lib/modules/%(uname -r)/extra/limic.ko
- MVAPICH2
# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/apps/julier/slurm/default/lib/
# cd mvapich2-1.8-r5423
# ./configure --prefix=/apps/julier/mvapich2/1.8/gcc-4.3.4 \
    --enable-threads=default --enable-shared --enable-sharedlibs=gcc \
    --enable-fc --with-mpe --enable-rsh --enable-rdma-cm --enable-fast \
    --enable-smpcoll --with-hwloc --enable-xrc --with-device=ch3:mrail \
    --with-rdma=gen2 --enable-g=dbg --enable-debuginfo --with-limic2 \
    CC=gcc CXX=g++ FC=gfortran F77=gfortran --with-pmi=slurm --with-pm=no \
    --with-slurm=/apps/julier/slurm/default/ \
    CPPFLAGS=-I/apps/julier/slurm/default/include \
    LDFLAGS=-L/apps/julier/slurm/default/lib
# make
# make install
Note: the compilation of mvapich2 completed without problems.
# mpiname -a
MVAPICH2 1.8 Mon Apr 30 14:56:40 EDT 2012 ch3:mrail
Compilation
CC: gcc -g -DNDEBUG -DNVALGRIND -O2
CXX: g++ -g -DNDEBUG -DNVALGRIND -O2
F77: gfortran -g -O2
FC: gfortran -g -O2
Configuration
--prefix=/apps/julier/mvapich2/1.8/gcc-4.3.4 --enable-threads=default
--enable-shared --enable-sharedlibs=gcc --enable-fc --with-mpe
--enable-rsh --enable-rdma-cm --enable-fast --enable-smpcoll
--with-hwloc --enable-xrc --with-device=ch3:mrail --with-rdma=gen2
--enable-g=dbg --enable-debuginfo --with-limic2 CC=gcc CXX=g++
FC=gfortran F77=gfortran --with-pmi=slurm --with-pm=no
--with-slurm=/apps/julier/slurm/default/
CPPFLAGS=-I/apps/julier/slurm/default/include
LDFLAGS=-L/apps/julier/slurm/default/lib
- TEST
I tested the installation by compiling the following small program:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);
    printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);
    MPI_Finalize();
    return 0;
}
# mpicc hello_world_mpi_mvapich.c -o hello_world_mpi_mvapich
and then I submitted it as follows:
# srun --job-name=hello_world_mpi_mvapich --time=00:03:00 --nodes=2 \
    --ntasks-per-node=2 --mem-per-cpu=1024 ./hello_world_mpi_mvapich
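Since the failure (below) only appears once inter-node traffic is
involved, I also verified the verbs layer outside of MPI. The ofed.conf
above installs infiniband-diags and libibverbs-utils, so these checks
are available (host names are the ones from the job output):

```shell
# Check that the ConnectX-3 HCA is up and links at the expected rate:
# port state should be "Active" and physical state "LinkUp".
ibstat
ibv_devinfo | grep -E 'state|active_mtu'

# Raw RC ping-pong between the two nodes, bypassing MPI entirely
# (ibv_rc_pingpong is part of libibverbs-utils; start the server first).
ssh julier15 ibv_rc_pingpong &         # server side
sleep 2
ssh julier16 ibv_rc_pingpong julier15  # client side, connects to server
```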
RESULTS:
- If I submit one job on 2 nodes with --ntasks-per-node=1, it works:
Process 1 on julier16 out of 2
Process 0 on julier15 out of 2
- If I submit one job on one node with --ntasks-per-node=2, it works:
Process 1 on julier15 out of 2
Process 0 on julier15 out of 2
- If I submit one job on 2 nodes with --ntasks-per-node=2, it fails with
the following errors:
In: PMI_Abort(1, Fatal error in MPI_Init:
Other MPI error
)
In: PMI_Abort(1, Fatal error in MPI_Init:
Other MPI error
)
slurmd[julier15]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH SIGNAL 9 ***
slurmd[julier15]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH SIGNAL 9 ***
slurmd[julier16]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH SIGNAL 9 ***
slurmd[julier16]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH SIGNAL 9 ***
slurmd[julier15]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH SIGNAL 9 ***
slurmd[julier16]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH SIGNAL 9 ***
I also tried compiling it without SLURM and using mpiexec.hydra, but I
got the same errors. I tried mvapich 1.7 as well, with the same results.
Please note that mvapich 1.7 works without problems on another cluster
with OFED 1.5.2 and SUSE SLES 11 SP1.
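Since the job only dies when intra-node and inter-node communication are
combined, a next step could be rerunning the failing 2x2 case with the
intra-node channels disabled one at a time (MV2_SMP_USE_LIMIC2 and
MV2_USE_SHARED_MEM are MVAPICH2 runtime parameters; I have not yet tried
this):

```shell
# Failing 2x2 case with LiMIC2 disabled, to rule out the kernel module:
MV2_SMP_USE_LIMIC2=0 srun --nodes=2 --ntasks-per-node=2 \
    ./hello_world_mpi_mvapich

# Same case with the whole shared-memory channel disabled, forcing all
# communication (including intra-node) over the network:
MV2_USE_SHARED_MEM=0 srun --nodes=2 --ntasks-per-node=2 \
    ./hello_world_mpi_mvapich
```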
Thank you in advance for your help.
Carmelo Ponti
--
----------------------------------------------------------------------
Carmelo Ponti System Engineer
CSCS Swiss Center for Scientific Computing
Via Trevano 131 Email: cponti at cscs.ch
CH-6900 Lugano http://www.cscs.ch
Phone: +41 91 610 82 15/Fax: +41 91 610 82 82
----------------------------------------------------------------------