[mvapich-discuss] Mpiexec fails to terminate when program ends

Jonathan Perkins perkinjo at cse.ohio-state.edu
Tue Nov 4 10:56:10 EST 2014


On Mon, Nov 03, 2014 at 02:10:22PM -0700, Alex M Warren wrote:
> I am running an MPI program on a cluster. When the program ends, the
> job does not, so I have to wait for it to time out.
> 
> I am not sure how to debug this. I checked that the program reaches
> the MPI finalize call, and it does. I am using the Elemental library.
> 

Can you print out specific messages after each Finalize call? Maybe one
of these calls is hanging or exiting abnormally.  If your job isn't too
large you may want to try printing from each rank to see which rank is
hanging (if one is).

> Final lines of the program
> 
> 
> if (grid.Rank() == 0) std::cout << "Finalize" << std::endl;
> 
> Finalize();

try adding a print here

> mpi::Finalize();

and another here

> return 0;
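
Putting those prints together, the tail of the program could look roughly
like the sketch below (just an illustration; it keeps your existing
Finalize()/mpi::Finalize() calls and caches the rank up front, since I'm
not sure grid.Rank() is safe to call once teardown has started):

    const int rank = grid.Rank();   // cache the rank before any teardown

    if (rank == 0) std::cout << "Finalize" << std::endl;

    Finalize();
    std::cout << "rank " << rank << ": past Finalize" << std::endl;

    mpi::Finalize();
    std::cout << "rank " << rank << ": past mpi::Finalize" << std::endl;

    return 0;

std::endl flushes the stream, so each message should appear even if a
later call hangs.  If every rank prints both "past" lines and mpiexec
still doesn't exit, the hang is in the launcher/teardown rather than
inside the Finalize calls themselves.
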
> 
> (I tried letting Elemental do the finalize and that didn't work either.)
> The output will be
> 
> Finalize
> mpiexec: killall: caught signal 15 (Terminated).
> mpiexec: kill_tasks: killing all tasks.
> mpiexec: wait_tasks: waiting for taub263.
> mpiexec: killall: caught signal 15 (Terminated).
> ----------------------------------------
> Begin Torque Epilogue (Sun Aug 17 01:53:55 2014)
> Job ID:           ***
> Username:         ***
> Group:            ***
> Job Name:         num_core_compare_nside-32_mpi_nodes-1_cores-2_1e0e4c0516
> Session:          16786
> Limits:
> ncpus=1,neednodes=2:ppn=6:m24G:taub,nodes=2:ppn=6:m24G:taub,walltime=00:13:00
> Resources:        cput=00:08:17,mem=297884kb,vmem=672648kb,walltime=00:13:13
> Job Queue:        secondary
> Account:          ***
> Nodes:            taub263 taub290
> End Torque Epilogue
> ----------------------------------------
> 
> Running these modules on https://campuscluster.illinois.edu/hardware/#taub
> 
> > module list
> Currently Loaded Modulefiles:
>   1) torque/4.2.9       4) blas               7) lapack            10) gcc/4.7.1
>   2) moab/7.2.9         5) mvapich2/1.6-gcc   8) git/1.7           11) cmake/2.8
>   3) env/taub           6) mvapich2/mpiexec   9) vim/7.3           12) valgrind/3.9.0

I see you have a module for mvapich2/1.6-gcc.  Can you let us know which
version of MVAPICH2 you are using?  If it is 1.6, we strongly encourage
you to upgrade in case your issue has already been resolved in a newer
version of MVAPICH2.

To find out which version you're using, you should be able to run
`mpiname -a'.  If you are using an older version, please try MVAPICH2 v2.0.1.
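
If mpiname isn't available for some reason, a small compile-time check is
another option.  This is just a sketch and assumes your MVAPICH2 install
defines the MVAPICH2_VERSION macro in mpi.h (recent releases do; if yours
does not, check mpi.h by hand):

    /* version_check.cpp -- print the MVAPICH2 version recorded in mpi.h.
       Build with: mpicxx version_check.cpp -o version_check */
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
    #ifdef MVAPICH2_VERSION
        std::printf("MVAPICH2_VERSION: %s\n", MVAPICH2_VERSION);
    #else
        std::printf("MVAPICH2_VERSION is not defined; check mpi.h directly\n");
    #endif
        MPI_Finalize();
        return 0;
    }
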

-- 
Jonathan Perkins

