[mvapich-discuss] mvapich 1.8 on SUSE SLES 11 SP2 problem

Jonathan Perkins perkinjo at cse.ohio-state.edu
Wed Jun 6 10:15:25 EDT 2012


List:
After rebuilding MVAPICH2 with --enable-g=dbg and --disable-fast, we were
able to see that the problem was caused by MVAPICH2 not finding an active
port on the default HCA.  Using MV2_IBA_HCA=mlx4_1 (this HCA had an active
port) solved the problem for Carmelo.
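
For the archives, something along these lines can be used to spot an
inactive default port and pin MVAPICH2 to the right device (a sketch;
mlx4_1 is specific to Carmelo's nodes, the srun options are taken from
his original report, and ibstat/ibv_devinfo ship with the OFED packages
he already installed):

    # ibstat | grep -e "CA '" -e "State:"      # list HCAs and their port states
    # ibv_devinfo | grep -e hca_id -e state    # usable ports show PORT_ACTIVE
    # MV2_IBA_HCA=mlx4_1 srun --nodes=2 --ntasks-per-node=2 ./hello_world_mpi_mvapich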

Carmelo:
Thanks for the report and please let us know if you encounter any more
trouble using MVAPICH2.

On Tue, May 29, 2012 at 04:58:51PM -0400, Jonathan Perkins wrote:
> Sorry to hear that neither of my suggestions was able to help.  I'll
> talk this over with some of the other developers and see what other
> things we can try.
> 
> On Tue, May 29, 2012 at 09:11:28PM +0200, Carmelo Ponti (CSCS) wrote:
> > Dear Jonathan
> > 
> > Thank you for your prompt answer.
> > 
> > LiMIC2 is loaded correctly, following some details:
> > 
> > # /etc/init.d/limic status
> > LiMIC2 is loaded
> > 
> > # lsmod | grep limic
> > limic                  13077  0 
> > 
> > # ls -l /dev/limic 
> > crw-r--r-- 1 root root 248, 0 May 29 17:49 /dev/limic
> > 
> > # modinfo /lib/modules/3.0.13-0.27-default/extra/limic.ko 
> > filename:       /lib/modules/3.0.13-0.27-default/extra/limic.ko
> > license:        Dual BSD/GPL
> > version:        0.5.5
> > description:    LiMIC2: Linux Kernel Module for High-Performance MPI
> > Intra-Node Communication
> > author:         Hyun-Wook Jin <jinh at konkuk.ac.kr>
> > srcversion:     929C7A1B1D503D007A13B45
> > depends:        
> > vermagic:       3.0.13-0.27-default SMP mod_unload modversions
> > 
> > and if I check the application:
> > 
> > # ldd hello_world_mpi_mvapich
> > 	linux-vdso.so.1 =>  (0x00007fff4a9ff000)
> > 	libmpich.so.3 => /apps/julier/mvapich2/1.8/gcc-4.3.4/lib/libmpich.so.3
> > (0x00007f0d80f8d000)
> > 	libopa.so.1 => /apps/julier/mvapich2/1.8/gcc-4.3.4/lib/libopa.so.1
> > (0x00007f0d80d8a000)
> > 	libmpl.so.1 => /apps/julier/mvapich2/1.8/gcc-4.3.4/lib/libmpl.so.1
> > (0x00007f0d80b85000)
> > 	libpmi.so.0 => /apps/julier/slurm/default/lib/libpmi.so.0
> > (0x00007f0d80949000)
> > 	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f0d8072b000)
> > 	liblimic2.so.0 => /usr/lib64/liblimic2.so.0 (0x00007f0d80529000)
> > 	librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00007f0d80320000)
> > 	libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00007f0d80110000)
> > 	libibumad.so.3 => /usr/lib64/libibumad.so.3 (0x00007f0d7ff09000)
> > 	libdl.so.2 => /lib64/libdl.so.2 (0x00007f0d7fd05000)
> > 	librt.so.1 => /lib64/librt.so.1 (0x00007f0d7fafb000)
> > 	libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00007f0d7f8f2000)
> > 	libc.so.6 => /lib64/libc.so.6 (0x00007f0d7f57e000)
> > 	/lib64/ld-linux-x86-64.so.2 (0x00007f0d813ab000)
> > 	libslurm.so.23 => /apps/julier/slurm/2.3.4-pam/lib/libslurm.so.23
> > (0x00007f0d7b8d7000)
> > 
> > As you suggested, I recompiled MVAPICH2 without LiMIC2, but the
> > behaviour is exactly the same:
> > 
> > - 2 nodes, 1 task => OK
> > - 1 node, 2 tasks => OK
> > - 2 nodes, 2 tasks => NOK
> > 
> > The result of "sbatch -n 2 --wrap='ulimit -l'" is unlimited.
> > 
> > Unfortunately, with Sandy Bridge I'm forced to use SLES 11 SP2, which
> > uses kernel 3.0. Could this be the cause of the problem?
> > 
> > Regards
> > Carmelo
> > 
> > On Tue, 2012-05-29 at 12:07 -0400, Jonathan Perkins wrote:
> > > Thanks for sending the note.  I have a couple of initial things we can
> > > look at.
> > > 
> > > Have you tried a build without LiMIC2 to verify that it is working in
> > > this mode?  If it is working without LiMIC2 but not with it, please
> > > make sure that LiMIC2 is installed and that the module is loaded before
> > > running a job.
> > >     
> > >     chkconfig limic on
> > >     service limic start
> > > 
> > > If it still does not work without LiMIC2, perhaps it is a locked
> > > memory issue.  If you run
> > > 
> > >     sbatch -n 2 --wrap='ulimit -l' 
> > > 
> > > does it return unlimited or some other high number?  If not, please
> > > take a look at the following link.
> > > 
> > > http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.8.html#x1-1250009.4.3
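> > > 
> > > If it comes back with a small value instead, the usual fix is to raise
> > > the locked memory limit on the compute nodes, along the lines of the
> > > sketch below (adjust to your site policy; with SLURM it may also be
> > > necessary to make sure slurmd itself starts with a high memlock limit,
> > > since job steps can inherit their limits from the daemon):
> > > 
> > >     # /etc/security/limits.conf
> > >     * soft memlock unlimited
> > >     * hard memlock unlimited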
> > > 
> > > Let us know if any of this resolves your problem.  Otherwise we can
> > > debug further.
> > > 
> > > On Tue, May 29, 2012 at 03:24:35PM +0200, Carmelo Ponti (CSCS) wrote:
> > > > Dear mvapich-discuss mailing list
> > > > 
> > > > I'm trying MVAPICH2 1.8 on SUSE SLES 11 SP2 (kernel 3.0.13-0.27) using
> > > > SLURM 2.3.4, but I get a strange error which I cannot solve.
> > > > 
> > > > Here are the details of my installation/configuration:
> > > > 
> > > > - CPU: E5-2670 0 @ 2.60GHz dual socket
> > > > - RAM: 64G
> > > > - IB: Mellanox Technologies MT27500 Family [ConnectX-3] FDR
> > > > - OS: SUSE Linux Enterprise Server 11 (x86_64) SP2
> > > > - KERNEL: 3.0.13-0.27-default #1 SMP
> > > > - GCC: gcc version 4.3.4
> > > > - GLIBC: glibc-2.11.3-17.31.1
> > > > 
> > > > - OFED: OFED-1.5.4.1 installed with the following configuration file:
> > > > 
> > > > # cat ofed.conf
> > > > kernel-ib=y
> > > > core=y
> > > > mthca=y
> > > > mlx4=y
> > > > mlx4_en=y
> > > > cxgb3=y
> > > > nes=y
> > > > ipoib=y
> > > > kernel-ib-devel=y
> > > > libibverbs=y
> > > > libibverbs-devel=y
> > > > libibverbs-utils=y
> > > > libmthca=y
> > > > libmlx4=y
> > > > libmlx4-devel=y
> > > > libcxgb3=y
> > > > libcxgb3-devel=y
> > > > libnes=y
> > > > libibumad=y
> > > > libibumad-devel=y
> > > > libibmad=y
> > > > libibmad-devel=y
> > > > librdmacm=y
> > > > librdmacm-utils=y
> > > > librdmacm-devel=y
> > > > perftest=y
> > > > mstflint=y
> > > > ibutils=y
> > > > infiniband-diags=y
> > > > ofed-docs=y
> > > > ofed-scripts=y
> > > > build32=0
> > > > prefix=/usr
> > > > 
> > > > - LIMIC
> > > > 
> > > > I compiled and installed LiMIC2 as follows:
> > > > 
> > > > # cp -rp mvapich2-1.8-r5423/limic2-0.5.5 /usr/src/packages/SOURCES
> > > > # cd /usr/src/packages/SOURCES
> > > > # tar cfz limic2-0.5.5.tar.gz limic2-0.5.5
> > > > # cp mvapich2-1.8-r5423/limic2-0.5.5/limic.spec /usr/src/packages/SPECS
> > > > # cd /usr/src/packages/SPECS
> > > > # rpmbuild -ba limic.spec
> > > > 
> > > > # rpm -ivh /usr/src/packages/RPMS/x86_64/limic2-0.5.5-1.x86_64.rpm
> > > > 
> > > > I patched limic.spec as follows to make it work on SLES 11 SP2:
> > > > 
> > > > # diff limic.spec mvapich2-1.8-r5423/limic2-0.5.5/limic.spec 
> > > > 95a96,104
> > > > > %{_libdir}/*.so.*
> > > > > 
> > > > > %files module
> > > > > %defattr(644,root,root,755)
> > > > > /lib/modules/%(uname -r)/extra/limic.ko
> > > > > 
> > > > > %files common
> > > > > %defattr(-,root,root,-)
> > > > > %doc
> > > > 97a107,110
> > > > > 
> > > > > %files devel
> > > > > %defattr(-,root,root,-)
> > > > > %doc
> > > > 100,102d112
> > > > < %{_libdir}/*.so.*
> > > > < %{_libdir}/liblimic2.a
> > > > < %attr(0755, root, root) /lib/modules/%(uname -r)/extra/limic.ko
> > > > 
> > > > - MVAPICH2
> > > > # export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/apps/julier/slurm/default/lib/
> > > > 
> > > > # cd mvapich2-1.8-r5423
> > > > # ./configure --prefix=/apps/julier/mvapich2/1.8/gcc-4.3.4
> > > > --enable-threads=default --enable-shared --enable-sharedlibs=gcc
> > > > --enable-fc --with-mpe --enable-rsh --enable-rdma-cm --enable-fast
> > > > --enable-smpcoll --with-hwloc --enable-xrc --with-device=ch3:mrail
> > > > --with-rdma=gen2 --enable-g=dbg --enable-debuginfo --with-limic2 CC=gcc
> > > > CXX=g++ FC=gfortran F77=gfortran --with-pmi=slurm --with-pm=no
> > > > --with-slurm=/apps/julier/slurm/default/
> > > > CPPFLAGS=-I/apps/julier/slurm/default/include
> > > > LDFLAGS=-L/apps/julier/slurm/default/lib
> > > > 
> > > > # make
> > > > # make install
> > > > 
> > > > Note: the compilation of MVAPICH2 completed without problems.
> > > > 
> > > > # mpiname -a
> > > > MVAPICH2 1.8 Mon Apr 30 14:56:40 EDT 2012 ch3:mrail
> > > > 
> > > > Compilation
> > > > CC: gcc    -g -DNDEBUG -DNVALGRIND -O2
> > > > CXX: g++   -g -DNDEBUG -DNVALGRIND -O2
> > > > F77: gfortran   -g -O2 
> > > > FC: gfortran   -g -O2
> > > > 
> > > > Configuration
> > > > --prefix=/apps/julier/mvapich2/1.8/gcc-4.3.4 --enable-threads=default
> > > > --enable-shared --enable-sharedlibs=gcc --enable-fc --with-mpe
> > > > --enable-rsh --enable-rdma-cm --enable-fast --enable-smpcoll
> > > > --with-hwloc --enable-xrc --with-device=ch3:mrail --with-rdma=gen2
> > > > --enable-g=dbg --enable-debuginfo --with-limic2 CC=gcc CXX=g++
> > > > FC=gfortran F77=gfortran --with-pmi=slurm --with-pm=no
> > > > --with-slurm=/apps/julier/slurm/default/
> > > > CPPFLAGS=-I/apps/julier/slurm/default/include
> > > > LDFLAGS=-L/apps/julier/slurm/default/lib
> > > > 
> > > > - TEST
> > > > 
> > > > I tested the installation by compiling the following small program:
> > > > 
> > > > #include <stdio.h>
> > > > #include <mpi.h>
> > > > 
> > > > int main(int argc, char *argv[]) {
> > > >   int numprocs, rank, namelen;
> > > >   char processor_name[MPI_MAX_PROCESSOR_NAME];
> > > > 
> > > >   MPI_Init(&argc, &argv);
> > > >   MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
> > > >   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> > > >   MPI_Get_processor_name(processor_name, &namelen);
> > > > 
> > > >   printf("Process %d on %s out of %d\n", rank, processor_name,
> > > >          numprocs);
> > > > 
> > > >   MPI_Finalize();
> > > >   return 0;
> > > > }
> > > > 
> > > > # mpicc hello_world_mpi_mvapich.c -o hello_world_mpi_mvapich
> > > > 
> > > > and then I submitted it as follows:
> > > > 
> > > > # srun --job-name=hello_world_mpi_mvapich --time=00:03:00 --nodes=2
> > > > --ntasks-per-node=2 --mem-per-cpu=1024 ./hello_world_mpi_mvapich
> > > > 
> > > > RESULTS:
> > > > 
> > > > - If I submit one job on 2 nodes with --ntasks-per-node=1, it works:
> > > > 
> > > > Process 1 on julier16 out of 2
> > > > Process 0 on julier15 out of 2
> > > > 
> > > > - If I submit one job on one node with --ntasks-per-node=2, it works:
> > > > 
> > > > Process 1 on julier15 out of 2
> > > > Process 0 on julier15 out of 2
> > > > 
> > > > - If I submit one job on 2 nodes with --ntasks-per-node=2, it fails with
> > > > the following errors:
> > > > 
> > > > In: PMI_Abort(1, Fatal error in MPI_Init:
> > > > Other MPI error
> > > > )
> > > > In: PMI_Abort(1, Fatal error in MPI_Init:
> > > > Other MPI error
> > > > )
> > > > slurmd[julier15]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> > > > SIGNAL 9 ***
> > > > slurmd[julier15]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> > > > SIGNAL 9 ***
> > > > slurmd[julier16]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> > > > SIGNAL 9 ***
> > > > slurmd[julier16]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> > > > SIGNAL 9 ***
> > > > slurmd[julier15]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> > > > SIGNAL 9 ***
> > > > slurmd[julier16]: *** STEP 23721.0 KILLED AT 2012-05-29T15:03:16 WITH
> > > > SIGNAL 9 ***
> > > > 
> > > > I also tried to compile it without SLURM and use mpiexec.hydra, but I
> > > > got the same errors. I tried MVAPICH2 1.7 as well, with the same results.
> > > > 
> > > > Please note that MVAPICH2 1.7 on another cluster with OFED 1.5.2 and
> > > > SUSE SLES 11 SP1 works without problems.
> > > > 
> > > > Thank you in advance for your help
> > > > Carmelo Ponti
> > > > 
> > > > -- 
> > > > ----------------------------------------------------------------------
> > > > Carmelo Ponti           System Engineer                             
> > > > CSCS                    Swiss Center for Scientific Computing 
> > > > Via Trevano 131         Email: cponti at cscs.ch                  
> > > > CH-6900 Lugano          http://www.cscs.ch              
> > > >                         Phone: +41 91 610 82 15/Fax: +41 91 610 82 82
> > > > ----------------------------------------------------------------------
> > > > 
> > > > _______________________________________________
> > > > mvapich-discuss mailing list
> > > > mvapich-discuss at cse.ohio-state.edu
> > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > > > 
> > > 
> > 
> > -- 
> > ----------------------------------------------------------------------
> > Carmelo Ponti           System Engineer                             
> > CSCS                    Swiss Center for Scientific Computing 
> > Via Trevano 131         Email: cponti at cscs.ch                  
> > CH-6900 Lugano          http://www.cscs.ch              
> >                         Phone: +41 91 610 82 15/Fax: +41 91 610 82 82
> > ----------------------------------------------------------------------
> > 
> 
> -- 
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
> 

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

