[mvapich-discuss] mvapich2 1.8.1 and gcc 4.7.2 problem

Jonathan Perkins perkinjo at cse.ohio-state.edu
Fri Feb 15 10:05:17 EST 2013


Hi list,
The problem described in this thread was due to a bug exposed when using
aggressive optimizations with the latest GCC compilers.  Our latest
nightly 1.8 tarball contains the fix, as does the 1.9a2 release.

The nightly tarball can be retrieved from
https://mvapich.cse.ohio-state.edu/nightly/mvapich2/branches/1.8/
and the 1.9a2 release from
http://mvapich.cse.ohio-state.edu/download/mvapich2/mvapich2-1.9a2.tgz
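
For example, one way to fetch and unpack the 1.9a2 release (the name of
the unpacked directory is assumed here):

    wget http://mvapich.cse.ohio-state.edu/download/mvapich2/mvapich2-1.9a2.tgz
    tar xzf mvapich2-1.9a2.tgz
    cd mvapich2-1.9a2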

Also, Carmelo found that setting MV2_DEBUG_SHOW_BACKTRACE to 1 only
worked when it was set in the slurm /etc/TaskProlog.sh script, as opposed
to on the srun command line or in the sbatch submission script.  This
might be particular to his setup, but I wanted to mention it since it may
affect others.
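
A minimal sketch of such a prolog script (relying on slurm's convention
that lines of the form "export NAME=VALUE" printed by the task prolog
are added to each task's environment; the exact contents below are
illustrative):

    #!/bin/sh
    # /etc/TaskProlog.sh
    # slurmd reads this script's stdout; "export NAME=VALUE" lines are
    # added to the environment of every task in the job step.
    echo "export MV2_DEBUG_SHOW_BACKTRACE=1"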

On Wed, Feb 13, 2013 at 04:41:30PM +0100, Carmelo Ponti (CSCS) wrote:
> Dear Devendar,
> 
> thank you for your prompt answer.
> 
> I tried your suggestion but I didn't see any additional information:
> 
> # MV2_DEBUG_SHOW_BACKTRACE=1
> # srun --job-name="hello_world_mpi_mvapich" --time=00:30:00 --nodes=4 \
>   --ntasks-per-node=24 --mem-per-cpu=1024 --partition=parallel \
>   ./hello_world_mpi_mvapich
> 
> In: PMI_Abort(1, Fatal error in MPI_Init:
> Other MPI error
> )
> In: PMI_Abort(1, Fatal error in MPI_Init:
> Other MPI error
> )
> In: PMI_Abort(1, Fatal error in MPI_Init:
> Other MPI error
> )
> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> In: PMI_Abort(1, Fatal error in MPI_Init:
> Other MPI error
> )
> slurmd[julier07]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
> In: PMI_Abort(1, Fatal error in MPI_Init:
> Other MPI error
> )
> slurmd[julier07]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
> In: PMI_Abort(1, Fatal error in MPI_Init:
> Other MPI error
> )
> slurmd[julier07]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
> slurmd[julier09]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
> slurmd[julier06]: *** STEP 302564.0 KILLED AT 2013-02-13T11:43:32 WITH SIGNAL 9 ***
> slurmd[julier07]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
> slurmd[julier08]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
> slurmd[julier07]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
> slurmd[julier06]: *** STEP 302564.0 KILLED AT 2013-02-13T11:43:32 WITH SIGNAL 9 ***
> slurmd[julier06]: *** STEP 302564.0 KILLED AT 2013-02-13T11:43:32 WITH SIGNAL 9 ***
> slurmd[julier06]: *** STEP 302564.0 KILLED AT 2013-02-13T11:43:32 WITH SIGNAL 9 ***
> slurmd[julier09]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
> slurmd[julier06]: *** STEP 302564.0 KILLED AT 2013-02-13T11:43:32 WITH SIGNAL 9 ***
> slurmd[julier09]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
> slurmd[julier09]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
> slurmd[julier07]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
> slurmd[julier09]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
> slurmd[julier08]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
> slurmd[julier08]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
> slurmd[julier08]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
> slurmd[julier06]: *** STEP 302564.0 KILLED AT 2013-02-13T11:43:32 WITH SIGNAL 9 ***
> slurmd[julier08]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
> slurmd[julier09]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
> slurmd[julier08]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
> srun: error: julier06: tasks 0,2-23: Killed
> srun: error: julier06: task 1: Exited with exit code 1
> srun: error: julier07: tasks 24-25,27-47: Killed
> srun: error: julier07: task 26: Exited with exit code 1
> srun: error: julier08: tasks 48,51-71: Killed
> srun: error: julier08: tasks 49-50: Exited with exit code 1
> srun: error: julier09: tasks 72-95: Killed
> 
> In any case, I found other information that could help in understanding
> the problem.
> 
> I compiled mvapich2 1.8.1 with gcc 4.7.2 on 3 different clusters, always
> using the following configuration:
> 
> ./configure --prefix=/apps/eiger/mvapich2/1.8.1/gcc-4.7.2 \
>   --enable-threads=default --enable-shared --enable-sharedlibs=gcc \
>   --enable-fc --with-mpe --enable-rsh --enable-rdma-cm --enable-smpcoll \
>   --with-hwloc --enable-xrc --with-device=ch3:mrail --with-rdma=gen2 \
>   --enable-g=dbg --enable-debuginfo CC=gcc CXX=g++ FC=gfortran \
>   F77=gfortran --with-pmi=slurm --with-pm=no \
>   --with-slurm=/apps/eiger/slurm/default/ \
>   CPPFLAGS=-I/apps/eiger/slurm/default/include \
>   LDFLAGS=-L/apps/eiger/slurm/default/lib --enable-g=dbg
> 
> Note that it is enough to add --disable-fast and the problem disappears
> on all 3 clusters.
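> 
> Concretely, the configure line that works on all 3 clusters is the same
> one as above with --disable-fast added:
> 
> ./configure --prefix=/apps/eiger/mvapich2/1.8.1/gcc-4.7.2 --disable-fast \
>   --enable-threads=default --enable-shared --enable-sharedlibs=gcc \
>   --enable-fc --with-mpe --enable-rsh --enable-rdma-cm --enable-smpcoll \
>   --with-hwloc --enable-xrc --with-device=ch3:mrail --with-rdma=gen2 \
>   --enable-g=dbg --enable-debuginfo CC=gcc CXX=g++ FC=gfortran \
>   F77=gfortran --with-pmi=slurm --with-pm=no \
>   --with-slurm=/apps/eiger/slurm/default/ \
>   CPPFLAGS=-I/apps/eiger/slurm/default/include \
>   LDFLAGS=-L/apps/eiger/slurm/default/lib --enable-g=dbg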
> 
> The configuration of the 3 clusters is as follows:
> 
> Cluster 1
> ---------
> 
> 42 compute nodes, each with 2 sockets of 8-core E5-2670 CPUs with
> hyper-threading (32 cores per node), interconnected with IB FDR
> 
> Cluster 2
> ---------
> 
> 12 compute nodes, each with 2 sockets of 6-core E5649 CPUs with
> hyper-threading (24 cores per node), interconnected with IB QDR
> 
> Cluster 3
> ---------
> 
> 21 compute nodes, each with 2 sockets of 6-core AMD 2427 CPUs (12 cores
> per node), interconnected with IB QDR
> 
> I ran various tests on the 3 clusters and noticed that the problem
> appears only when I use more than 64 processes. The following
> combinations of nodes and ntasks-per-node work fine (a failing example
> is shown after them):
> 
> srun --nodes=2 --ntasks-per-node=32 ./hello_world_mpi_mvapich 
> srun --nodes=4 --ntasks-per-node=16 ./hello_world_mpi_mvapich
> srun --nodes=8 --ntasks-per-node=8 ./hello_world_mpi_mvapich
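> 
> By contrast, a combination exceeding 64 processes, such as the one from
> the run above, fails:
> 
> srun --nodes=4 --ntasks-per-node=24 ./hello_world_mpi_mvapich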
> 
> I am ruling out a limitation of slurm, because the problem does not
> appear with the other compilers (gcc 4.3.4, intel 13.0.1 and pgi 13.1).
> Perhaps there is an error in my build of gcc 4.7.2, but in that case I
> don't understand why the problem disappears when I compile mvapich2 with
> --disable-fast.
> 
> Please let me know if you need more information.
> 
> regards
> Carmelo
> 
> On Tue, 2013-02-12 at 13:11 -0500, Devendar Bureddy wrote: 
> > Hi Carmelo Ponti
> > 
> > I tried mvapich2-1.8.1 (same configure flags) with gcc-4.7.2 and slurm
> > version 2.5.3. It seems to be working fine for us.
> > 
> > mvapich2-1.8.1]$ srun -N 8 examples/cpi
> > Process 6 of 8 is on node147.cluster
> > Process 7 of 8 is on node148.cluster
> > Process 0 of 8 is on node141.cluster
> > Process 5 of 8 is on node146.cluster
> > Process 2 of 8 is on node143.cluster
> > Process 3 of 8 is on node144.cluster
> > Process 1 of 8 is on node142.cluster
> > Process 4 of 8 is on node145.cluster
> > 
> > Can you try with MV2_DEBUG_SHOW_BACKTRACE=1 to see if that displays
> > any useful info in your environment?
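> > 
> > For example (one way to pass it through srun; illustrative):
> > 
> > MV2_DEBUG_SHOW_BACKTRACE=1 srun -N 8 examples/cpi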
> > 
> > -Devendar
> > On Tue, Feb 12, 2013 at 8:55 AM, Carmelo Ponti (CSCS) <cponti at cscs.ch> wrote:
> > > Hello
> > >
> > > I compiled mvapich2 1.8.1 with gcc 4.7.2 and slurm 2.3.4 as follows:
> > >
> > > ./configure --prefix=/apps/pilatus/mvapich2/1.8.1/gcc-4.7.2 \
> > >   --enable-threads=default --enable-shared --enable-sharedlibs=gcc \
> > >   --enable-fc --with-mpe --enable-rsh --enable-rdma-cm --enable-fast \
> > >   --enable-smpcoll --with-hwloc --enable-xrc --with-device=ch3:mrail \
> > >   --with-rdma=gen2 --enable-g=dbg --enable-debuginfo --with-limic2 \
> > >   CC=gcc CXX=g++ FC=gfortran F77=gfortran --with-pmi=slurm --with-pm=no \
> > >   --with-slurm=/apps/pilatus/slurm/default/ \
> > >   CPPFLAGS=-I/apps/pilatus/slurm/default/include \
> > >   LDFLAGS=-L/apps/pilatus/slurm/default/lib
> > >
> > > but when I run a simple hello world MPI program I get:
> > >
> > > In: PMI_Abort(1, Fatal error in MPI_Init:
> > > Other MPI error
> > > )
> > > In: PMI_Abort(1, Fatal error in MPI_Init:
> > > Other MPI error
> > > )
> > > In: PMI_Abort(1, Fatal error in MPI_Init:
> > > Other MPI error
> > > )
> > > In: PMI_Abort(1, Fatal error in MPI_Init:
> > > Other MPI error
> > > )
> > > slurmd[pilatus19]: *** STEP 40910.0 KILLED AT 12:01:02 WITH SIGNAL 9 ***
> > > slurmd[pilatus21]: *** STEP 40910.0 KILLED AT 12:01:02 WITH SIGNAL 9 ***
> > > slurmd[pilatus20]: *** STEP 40910.0 KILLED AT 12:01:02 WITH SIGNAL 9 ***
> > > ...
> > >
> > > The problem appears only if I use more than 2 nodes.
> > >
> > > I compiled the same version of mvapich2 with intel 13.0.1 and pgi 13.1
> > > and everything is working fine.
> > >
> > > I recompiled mvapich2 1.8.1/gcc 4.7.2 with --disable-fast and
> > > --enable-g=dbg, and then the problem disappears.
> > >
> > > I also recompiled it with just --enable-g=dbg, but I didn't get any
> > > more information than this:
> > >
> > > In: PMI_Abort(1, Fatal error in MPI_Init:
> > > Other MPI error
> > > )
> > > In: PMI_Abort(1, Fatal error in MPI_Init:
> > > Other MPI error
> > > )
> > > slurmd[pilatus21]: *** STEP 40936.0 KILLED AT 14:49:01 WITH SIGNAL 9 ***
> > >
> > > Please let me know if you need more information.
> > >
> > > Thank you in advance for your help
> > > Carmelo Ponti
> > >
> > > --
> > > ----------------------------------------------------------------------
> > > Carmelo Ponti           System Engineer
> > > CSCS                    Swiss Center for Scientific Computing
> > > Via Trevano 131         Email: cponti at cscs.ch
> > > CH-6900 Lugano          http://www.cscs.ch
> > >                         Phone: +41 91 610 82 15/Fax: +41 91 610 82 82
> > > ----------------------------------------------------------------------
> > >
> > 
> > 
> > 
> 
> -- 
> ----------------------------------------------------------------------
> Carmelo Ponti           System Engineer                             
> CSCS                    Swiss Center for Scientific Computing 
> Via Trevano 131         Email: cponti at cscs.ch                  
> CH-6900 Lugano          http://www.cscs.ch              
>                         Phone: +41 91 610 82 15/Fax: +41 91 610 82 82
> ----------------------------------------------------------------------
> 
> 

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo


