[mvapich-discuss] mvapich2 1.8.1 and gcc 4.7.2 problem

Carmelo Ponti (CSCS) cponti at cscs.ch
Wed Feb 13 10:41:30 EST 2013


Dear Devendar,

thank you for your prompt answer.

I tried your suggestion but I didn't see any additional information:

# MV2_DEBUG_SHOW_BACKTRACE=1
# srun --job-name="hello_world_mpi_mvapich" --time=00:30:00 --nodes=4 \
  --ntasks-per-node=24 --mem-per-cpu=1024 --partition=parallel \
  ./hello_world_mpi_mvapich

In: PMI_Abort(1, Fatal error in MPI_Init:
Other MPI error
)
In: PMI_Abort(1, Fatal error in MPI_Init:
Other MPI error
)
In: PMI_Abort(1, Fatal error in MPI_Init:
Other MPI error
)
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
In: PMI_Abort(1, Fatal error in MPI_Init:
Other MPI error
)
slurmd[julier07]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
In: PMI_Abort(1, Fatal error in MPI_Init:
Other MPI error
)
slurmd[julier07]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
In: PMI_Abort(1, Fatal error in MPI_Init:
Other MPI error
)
slurmd[julier07]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
slurmd[julier09]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
slurmd[julier06]: *** STEP 302564.0 KILLED AT 2013-02-13T11:43:32 WITH SIGNAL 9 ***
slurmd[julier07]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
slurmd[julier08]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
slurmd[julier07]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
slurmd[julier06]: *** STEP 302564.0 KILLED AT 2013-02-13T11:43:32 WITH SIGNAL 9 ***
slurmd[julier06]: *** STEP 302564.0 KILLED AT 2013-02-13T11:43:32 WITH SIGNAL 9 ***
slurmd[julier06]: *** STEP 302564.0 KILLED AT 2013-02-13T11:43:32 WITH SIGNAL 9 ***
slurmd[julier09]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
slurmd[julier06]: *** STEP 302564.0 KILLED AT 2013-02-13T11:43:32 WITH SIGNAL 9 ***
slurmd[julier09]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
slurmd[julier09]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
slurmd[julier07]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
slurmd[julier09]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
slurmd[julier08]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
slurmd[julier08]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
slurmd[julier08]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
slurmd[julier06]: *** STEP 302564.0 KILLED AT 2013-02-13T11:43:32 WITH SIGNAL 9 ***
slurmd[julier08]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
slurmd[julier09]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
slurmd[julier08]: *** STEP 302564.0 KILLED AT 11:43:32 WITH SIGNAL 9 ***
srun: error: julier06: tasks 0,2-23: Killed
srun: error: julier06: task 1: Exited with exit code 1
srun: error: julier07: tasks 24-25,27-47: Killed
srun: error: julier07: task 26: Exited with exit code 1
srun: error: julier08: tasks 48,51-71: Killed
srun: error: julier08: tasks 49-50: Exited with exit code 1
srun: error: julier09: tasks 72-95: Killed

In any case, I found some additional information which may help in
understanding the problem.

I compiled mvapich2 1.8.1 with gcc 4.7.2 on 3 different clusters, always
using the following configuration:

./configure --prefix=/apps/eiger/mvapich2/1.8.1/gcc-4.7.2 \
  --enable-threads=default --enable-shared --enable-sharedlibs=gcc \
  --enable-fc --with-mpe --enable-rsh --enable-rdma-cm --enable-smpcoll \
  --with-hwloc --enable-xrc --with-device=ch3:mrail --with-rdma=gen2 \
  --enable-g=dbg --enable-debuginfo CC=gcc CXX=g++ FC=gfortran \
  F77=gfortran --with-pmi=slurm --with-pm=no \
  --with-slurm=/apps/eiger/slurm/default/ \
  CPPFLAGS=-I/apps/eiger/slurm/default/include \
  LDFLAGS=-L/apps/eiger/slurm/default/lib --enable-g=dbg

Note that it is enough to add --disable-fast for the problem to disappear
on all 3 clusters.

The configurations of the 3 clusters are as follows:

Cluster 1
---------

42 compute nodes with 2 sockets, 8 E5-2670 cores each with hyper-threading
(32 cores per node), interconnected with IB FDR

Cluster 2
---------

12 compute nodes with 2 sockets, 6 E5649 cores each with hyper-threading
(24 cores per node), interconnected with IB QDR

Cluster 3
---------

21 compute nodes with 2 sockets, 6 AMD 2427 cores each (12 cores per node),
interconnected with IB QDR

I ran different tests on the 3 clusters and noticed that the problem
appears only when I use more than 64 processes. The following
combinations of number of nodes and ntasks-per-node work fine:

srun --nodes=2 --ntasks-per-node=32 ./hello_world_mpi_mvapich 
srun --nodes=4 --ntasks-per-node=16 ./hello_world_mpi_mvapich
srun --nodes=8 --ntasks-per-node=8 ./hello_world_mpi_mvapich
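
For reference, the test program here is just a minimal MPI hello world. A
sketch of what such a program might look like follows (the actual source of
hello_world_mpi_mvapich is not part of this thread, so the file name and
output format below are only an assumption):

/* Hypothetical minimal MPI hello world, similar in spirit to the
 * hello_world_mpi_mvapich test program used above (actual source not
 * shown in this thread). Build with e.g.: mpicc -o hello_world_mpi hello.c */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);              /* the failing runs abort inside MPI_Init */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}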

I'm excluding a limitation of slurm, because the problem doesn't appear
with the other compilers (gcc 4.3.4, intel 13.0.1 and pgi 13.1). Perhaps
there is an error in my build of gcc 4.7.2, but in that case I don't
understand why the problem disappears when I compile mvapich2 with
--disable-fast.

Please let me know if you need more information.

regards
Carmelo

On Tue, 2013-02-12 at 13:11 -0500, Devendar Bureddy wrote: 
> Hi Carmelo Ponti
> 
> I tried mvapich2-1.8.1 (same configure flags) with gcc-4.7.2 and slurm
> version 2.5.3. It seems to be working fine for us.
> 
> mvapich2-1.8.1]$ srun -N 8 examples/cpi
> Process 6 of 8 is on node147.cluster
> Process 7 of 8 is on node148.cluster
> Process 0 of 8 is on node141.cluster
> Process 5 of 8 is on node146.cluster
> Process 2 of 8 is on node143.cluster
> Process 3 of 8 is on node144.cluster
> Process 1 of 8 is on node142.cluster
> Process 4 of 8 is on node145.cluster
> 
> Can you try with MV2_DEBUG_SHOW_BACKTRACE=1 to see if that displays
> any useful info in your environment?
> 
> -Devendar
> On Tue, Feb 12, 2013 at 8:55 AM, Carmelo Ponti (CSCS) <cponti at cscs.ch> wrote:
> > Hello
> >
> > I compiled mvapich2 1.8.1 with gcc 4.7.2 and slurm 2.3.4 as follows:
> >
> > ./configure --prefix=/apps/pilatus/mvapich2/1.8.1/gcc-4.7.2
> > --enable-threads=default --enable-shared --enable-sharedlibs=gcc
> > --enable-fc --with-mpe --enable-rsh --enable-rdma-cm --enable-fast
> > --enable-smpcoll --with-hwloc --enable-xrc --with-device=ch3:mrail
> > --with-rdma=gen2 --enable-g=dbg --enable-debuginfo --with-limic2 CC=gcc
> > CXX=g++ FC=gfortran F77=gfortran --with-pmi=slurm --with-pm=no
> > --with-slurm=/apps/pilatus/slurm/default/
> > CPPFLAGS=-I/apps/pilatus/slurm/default/include
> > LDFLAGS=-L/apps/pilatus/slurm/default/lib
> >
> > but if I try a simple hello world MPI program I get:
> >
> > In: PMI_Abort(1, Fatal error in MPI_Init:
> > Other MPI error
> > )
> > In: PMI_Abort(1, Fatal error in MPI_Init:
> > Other MPI error
> > )
> > In: PMI_Abort(1, Fatal error in MPI_Init:
> > Other MPI error
> > )
> > In: PMI_Abort(1, Fatal error in MPI_Init:
> > Other MPI error
> > )
> > slurmd[pilatus19]: *** STEP 40910.0 KILLED AT 12:01:02 WITH SIGNAL 9 ***
> > slurmd[pilatus21]: *** STEP 40910.0 KILLED AT 12:01:02 WITH SIGNAL 9 ***
> > slurmd[pilatus20]: *** STEP 40910.0 KILLED AT 12:01:02 WITH SIGNAL 9 ***
> > ...
> >
> > The problem appears only if I use more than 2 nodes.
> >
> > I compiled the same version of mvapich2 with intel 13.0.1 and pgi 13.1
> > and everything is working fine.
> >
> > I recompiled mvapich2 1.8.1/gcc 4.7.2 with --disable-fast and
> > --enable-g=dbg and then the problem disappeared.
> >
> > I recompiled it with --enable-g=dbg but I didn't get more information
> > than this:
> >
> > In: PMI_Abort(1, Fatal error in MPI_Init:
> > Other MPI error
> > )
> > In: PMI_Abort(1, Fatal error in MPI_Init:
> > Other MPI error
> > )
> > slurmd[pilatus21]: *** STEP 40936.0 KILLED AT 14:49:01 WITH SIGNAL 9 ***
> >
> > Please let me know if you need more information.
> >
> > Thank you in advance for your help
> > Carmelo Ponti
> >
> > --
> > ----------------------------------------------------------------------
> > Carmelo Ponti           System Engineer
> > CSCS                    Swiss Center for Scientific Computing
> > Via Trevano 131         Email: cponti at cscs.ch
> > CH-6900 Lugano          http://www.cscs.ch
> >                         Phone: +41 91 610 82 15/Fax: +41 91 610 82 82
> > ----------------------------------------------------------------------
> >
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

-- 
----------------------------------------------------------------------
Carmelo Ponti           System Engineer                             
CSCS                    Swiss Center for Scientific Computing 
Via Trevano 131         Email: cponti at cscs.ch                  
CH-6900 Lugano          http://www.cscs.ch              
                        Phone: +41 91 610 82 15/Fax: +41 91 610 82 82
----------------------------------------------------------------------


