[mvapich-discuss] Mpiexec fails to terminate when program ends

Jonathan Perkins perkinjo at cse.ohio-state.edu
Wed Nov 5 17:39:56 EST 2014


Thanks for the additional output.  It's not yet clear what is
happening, but maybe you can try running your job with mpirun_rsh to
see whether the application also hangs on exit there.

Here is a link to the mpirun_rsh section of the user guide, in case it helps.

http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.1a-userguide.html#x1-240005.2.1
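
For example, from inside your Torque job script, something along these
lines should work for the single-process run you show below
($PBS_NODEFILE is the machinefile Torque provides; adjust -np and the
binary path to match your job):

    mpirun_rsh -ssh -np 1 -hostfile $PBS_NODEFILE /home/amwarren/aps/distributed_memory/aps

If the job exits cleanly under mpirun_rsh, that would point to the
mpiexec launcher rather than the application or the MPI library itself.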

On Tue, Nov 04, 2014 at 04:19:49PM -0700, Alex M Warren wrote:
> I recompiled with MVAPICH2 2.0b (what was available on the cluster):
> > mpiname -a
> MVAPICH2 2.0b Fri Nov  8 11:17:40 EST 2013 ch3:mrail
> 
> Compilation
> CC: gcc -fpic   -DNDEBUG -DNVALGRIND -O2
> CXX: g++ -fpic  -DNDEBUG -DNVALGRIND -O2
> F77: gfortran -L/lib -L/lib -fpic  -O2
> FC: gfortran -fpic  -O2
> 
> Configuration
> CC=gcc CFLAGS=-fpic CXX=g++ CXXFLAGS=-fpic F77=gfortran FFLAGS=-fpic
> FC=gfortran FCFLAGS=-fpic
> --prefix=/usr/local/mpi/mvapich2-2.0b-gcc-4.7.1
> 
> This is the end of my program:
> 
>   if (grid.Rank() == 0) std::cout << "Finalize" << std::endl;
>   std::string message = std::string("rank_") + std::to_string(mpi::Rank(mpi::COMM_WORLD)) + "_a";
>   std::cout << message;
>   Finalize();
>   message = message + "b";
>   std::cout << message;
>   mpi::Finalize();
>   message = message + "c";
>   std::cout << message;
>   return 0;
> }
> 
> The results are (running one process):
> 
> ----------------------------------------
> Begin Torque Prologue (Tue Nov  4 16:01:58 2014)
> Job ID:           1680954.cc-mgmt1.campuscluster.illinois.edu
> Username:         amwarren
> Group:            ***
> Job Name:         mpi_test1
> Limits:
> ncpus=1,neednodes=1:ppn=6:m24G:taub,nodes=1:ppn=6:m24G:taub,walltime=00:13:00
> Job Queue:        secondary
> Account:          ***
> Nodes:            taub205
> End Torque Prologue
> ----------------------------------------
> Currently Loaded Modulefiles:
>   1) torque/4.2.9              5) gcc/4.7.1
>   2) moab/7.2.9                6) mvapich2/2.0b-gcc-4.7.1
>   3) env/taub                  7) mvapich2/mpiexec
>   4) blas                      8) lapack
> mpiexec: resolve_exe: using absolute path
> "/home/amwarren/aps/distributed_memory/aps".
> node  0: name taub205, cpu avail 6
> mpiexec: process_start_event: evt 2 task 0 on taub205.
> mpiexec: All 1 task (spawn 0) started.
> mpiexec: wait_tasks: waiting for taub205.
> mpiexec: accept_pmi_conn: cmd=initack pmiid=0.
> mpiexec: accept_pmi_conn: rank 0 (spawn 0) checks in.
> mpiexec: accept_pmi_conn: cmd=init pmi_version=1 pmi_subversion=1.
> --------------------------------------------------------------------------------
> 
> [...]
> 
> TIME: Total[ 166.405
> Finalize
> rank_0_arank_0_abrank_0_abcmpiexec: killall: caught signal 15 (Terminated).
> mpiexec: kill_tasks: killing all tasks.
> mpiexec: wait_tasks: waiting for taub205.
> mpiexec: killall: caught signal 15 (Terminated).
> =>> PBS: job killed: walltime 801 exceeded limit 780
> ----------------------------------------
> Begin Torque Epilogue (Tue Nov  4 16:15:19 2014)
> Job ID:           1680954.cc-mgmt1.campuscluster.illinois.edu
> Username:         amwarren
> Group:            ***
> Job Name:         mpi_test1
> Session:          11270
> Limits:
> ncpus=1,neednodes=1:ppn=6:m24G:taub,nodes=1:ppn=6:m24G:taub,walltime=00:13:00
> Resources:        cput=00:02:12,mem=429524kb,vmem=773600kb,walltime=00:13:21
> Job Queue:        secondary
> Account:          ***
> Nodes:            taub205
> End Torque Epilogue
> ----------------------------------------
> 
> On Tue, Nov 4, 2014 at 8:56 AM, Jonathan Perkins
> <perkinjo at cse.ohio-state.edu> wrote:
> > On Mon, Nov 03, 2014 at 02:10:22PM -0700, Alex M Warren wrote:
> >> I am running an MPI program on a cluster. When the program ends, the
> >> job does not, so I have to wait for it to time out.
> >>
> >> I am not sure how to debug this. I checked whether the program reaches
> >> the MPI finalize call, and it does. I am using the Elemental library.
> >>
> >
> > Can you print out specific messages after each Finalize call?  Maybe one
> > of these calls is hanging or exiting abnormally.  If your job isn't too
> > large, you may want to try printing from each rank to see which rank is
> > hanging (if one is).
> >
> >> Final lines of the program
> >>
> >>
> >> if (grid.Rank() == 0) std::cout << "Finalize" << std::endl;
> >>
> >> Finalize();
> >
> > try adding a print here
> >
> >> mpi::Finalize();
> >
> > and another here
> >
> >> return 0;
> >>
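> >
> > Something like the following sketch is what I have in mind (Elemental's
> > mpi::Rank/mpi::COMM_WORLD wrappers assumed; the message text is
> > arbitrary, the rank is cached before finalizing, and std::flush keeps
> > buffered output from being mistaken for a hang):
> >
> >   const int rank = mpi::Rank(mpi::COMM_WORLD);  // cache before finalizing
> >   std::cout << "rank " << rank << ": before Finalize" << std::flush;
> >   Finalize();
> >   std::cout << "rank " << rank << ": after Finalize" << std::flush;
> >   mpi::Finalize();
> >   std::cout << "rank " << rank << ": after mpi::Finalize" << std::flush;
> >   return 0;
> >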
> >> (I tried letting Elemental do the finalize and that didn't work either.)
> >> The output will be:
> >>
> >> Finalize
> >> mpiexec: killall: caught signal 15 (Terminated).
> >> mpiexec: kill_tasks: killing all tasks.
> >> mpiexec: wait_tasks: waiting for taub263.
> >> mpiexec: killall: caught signal 15 (Terminated).
> >> ----------------------------------------
> >> Begin Torque Epilogue (Sun Aug 17 01:53:55 2014)
> >> Job ID:           ***
> >> Username:         ***
> >> Group:            ***
> >> Job Name:         num_core_compare_nside-32_mpi_nodes-1_cores-2_1e0e4c0516
> >> Session:          16786
> >> Limits:
> >> ncpus=1,neednodes=2:ppn=6:m24G:taub,nodes=2:ppn=6:m24G:taub,walltime=00:13:00
> >> Resources:        cput=00:08:17,mem=297884kb,vmem=672648kb,walltime=00:13:13
> >> Job Queue:        secondary
> >> Account:          ***
> >> Nodes:            taub263 taub290
> >> End Torque Epilogue
> >> ----------------------------------------
> >>
> >> Running with these modules on https://campuscluster.illinois.edu/hardware/#taub
> >>
> >> > module list
> >> Currently Loaded Modulefiles:
> >>   1) torque/4.2.9       4) blas               7) lapack            10) gcc/4.7.1
> >>   2) moab/7.2.9         5) mvapich2/1.6-gcc   8) git/1.7           11) cmake/2.8
> >>   3) env/taub           6) mvapich2/mpiexec   9) vim/7.3           12) valgrind/3.9.0
> >
> > I see you have a module for mvapich2/1.6-gcc.  Can you let us know which
> > version of MVAPICH2 you are using?  If it is 1.6, we strongly encourage
> > you to upgrade in case your issue has already been resolved in a newer
> > version of MVAPICH2.
> >
> > To find out which version you're using, you should be able to run
> > `mpiname -a'.  If you are using an older version, please try MVAPICH2 v2.0.1.
> >
> > --
> > Jonathan Perkins

-- 
Jonathan Perkins

