[mvapich-discuss] Mpiexec fails to terminate when program ends

Alex M Warren amwarren at email.arizona.edu
Tue Nov 4 18:19:49 EST 2014


I recompiled with MVAPICH2 2.0b (the version available on the cluster).
> mpiname -a
MVAPICH2 2.0b Fri Nov  8 11:17:40 EST 2013 ch3:mrail

Compilation
CC: gcc -fpic   -DNDEBUG -DNVALGRIND -O2
CXX: g++ -fpic  -DNDEBUG -DNVALGRIND -O2
F77: gfortran -L/lib -L/lib -fpic  -O2
FC: gfortran -fpic  -O2

Configuration
CC=gcc CFLAGS=-fpic CXX=g++ CXXFLAGS=-fpic F77=gfortran FFLAGS=-fpic
FC=gfortran FCFLAGS=-fpic
--prefix=/usr/local/mpi/mvapich2-2.0b-gcc-4.7.1

This is the end of my program:

  if (grid.Rank() == 0) std::cout << "Finalize" << std::endl;
  std::string message =
      std::string("rank_") + std::to_string(mpi::Rank(mpi::COMM_WORLD)) + "_a";
  std::cout << message;
  Finalize();
  message = message + "b";
  std::cout << message;
  mpi::Finalize();
  message = message + "c";
  std::cout << message;
  return 0;
}

The results are (running one process):

----------------------------------------
Begin Torque Prologue (Tue Nov  4 16:01:58 2014)
Job ID:           1680954.cc-mgmt1.campuscluster.illinois.edu
Username:         amwarren
Group:            ***
Job Name:         mpi_test1
Limits:
ncpus=1,neednodes=1:ppn=6:m24G:taub,nodes=1:ppn=6:m24G:taub,walltime=00:13:00
Job Queue:        secondary
Account:          ***
Nodes:            taub205
End Torque Prologue
----------------------------------------
Currently Loaded Modulefiles:
  1) torque/4.2.9              5) gcc/4.7.1
  2) moab/7.2.9                6) mvapich2/2.0b-gcc-4.7.1
  3) env/taub                  7) mvapich2/mpiexec
  4) blas                      8) lapack
mpiexec: resolve_exe: using absolute path
"/home/amwarren/aps/distributed_memory/aps".
node  0: name taub205, cpu avail 6
mpiexec: process_start_event: evt 2 task 0 on taub205.
mpiexec: All 1 task (spawn 0) started.
mpiexec: wait_tasks: waiting for taub205.
mpiexec: accept_pmi_conn: cmd=initack pmiid=0.
mpiexec: accept_pmi_conn: rank 0 (spawn 0) checks in.
mpiexec: accept_pmi_conn: cmd=init pmi_version=1 pmi_subversion=1.
--------------------------------------------------------------------------------

[...]

TIME: Total[ 166.405
Finalize
rank_0_arank_0_abrank_0_abcmpiexec: killall: caught signal 15 (Terminated).
mpiexec: kill_tasks: killing all tasks.
mpiexec: wait_tasks: waiting for taub205.
mpiexec: killall: caught signal 15 (Terminated).
=>> PBS: job killed: walltime 801 exceeded limit 780
----------------------------------------
Begin Torque Epilogue (Tue Nov  4 16:15:19 2014)
Job ID:           1680954.cc-mgmt1.campuscluster.illinois.edu
Username:         amwarren
Group:            ***
Job Name:         mpi_test1
Session:          11270
Limits:
ncpus=1,neednodes=1:ppn=6:m24G:taub,nodes=1:ppn=6:m24G:taub,walltime=00:13:00
Resources:        cput=00:02:12,mem=429524kb,vmem=773600kb,walltime=00:13:21
Job Queue:        secondary
Account:          ***
Nodes:            taub205
End Torque Epilogue
----------------------------------------
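
For comparison, a stripped-down MPI-only test along the lines below might
help isolate whether Elemental is involved at all. This is only a sketch
(not the actual program); it prints a rank-tagged marker before and after
MPI_Finalize and uses std::endl so each message is flushed immediately:

    #include <mpi.h>
    #include <iostream>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Flush a marker before and after MPI_Finalize so stdout
        // buffering cannot hide where a hang occurs.
        std::cout << "rank_" << rank << "_before_finalize" << std::endl;
        MPI_Finalize();
        std::cout << "rank_" << rank << "_after_finalize" << std::endl;

        return 0;
    }

If a minimal case like this also leaves mpiexec waiting after the program
returns, the problem is presumably below Elemental (the MPI library or the
mpiexec/PMI side) rather than in the application code.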

On Tue, Nov 4, 2014 at 8:56 AM, Jonathan Perkins
<perkinjo at cse.ohio-state.edu> wrote:
> On Mon, Nov 03, 2014 at 02:10:22PM -0700, Alex M Warren wrote:
>> I am running an MPI program on a cluster. When the program ends, the
>> job does not, so I have to wait for it to time out.
>>
>> I am not sure how to debug this. I checked whether the program reaches
>> the MPI Finalize call, and it does. I am using the Elemental library.
>>
>
> Can you print out specific messages after each Finalize call? Maybe one
> of these calls is hanging or exiting abnormally.  If your job isn't too
> large, you may want to try printing from each rank to see which rank is
> hanging (if one is).
>
>> Final lines of the program
>>
>>
>> if (grid.Rank() == 0) std::cout << "Finalize" << std::endl;
>>
>> Finalize();
>
> try adding a print here
>
>> mpi::Finalize();
>
> and another here
>
>> return 0;
>>
>> (I tried letting Elemental handle finalization, and that didn't work either.)
>> The output is:
>>
>> Finalize
>> mpiexec: killall: caught signal 15 (Terminated).
>> mpiexec: kill_tasks: killing all tasks.
>> mpiexec: wait_tasks: waiting for taub263.
>> mpiexec: killall: caught signal 15 (Terminated).
>> ----------------------------------------
>> Begin Torque Epilogue (Sun Aug 17 01:53:55 2014)
>> Job ID:           ***
>> Username:         ***
>> Group:            ***
>> Job Name:         num_core_compare_nside-32_mpi_nodes-1_cores-2_1e0e4c0516
>> Session:          16786
>> Limits:
>> ncpus=1,neednodes=2:ppn=6:m24G:taub,nodes=2:ppn=6:m24G:taub,walltime=00:13:00
>> Resources:        cput=00:08:17,mem=297884kb,vmem=672648kb,walltime=00:13:13
>> Job Queue:        secondary
>> Account:          ***
>> Nodes:            taub263 taub290
>> End Torque Epilogue
>> ----------------------------------------
>>
>> Running these modules on https://campuscluster.illinois.edu/hardware/#taub
>>
>> > module list
>> Currently Loaded Modulefiles:
>>   1) torque/4.2.9       4) blas               7) lapack            10) gcc/4.7.1
>>   2) moab/7.2.9         5) mvapich2/1.6-gcc   8) git/1.7           11) cmake/2.8
>>   3) env/taub           6) mvapich2/mpiexec   9) vim/7.3           12) valgrind/3.9.0
>
> I see you have a module for mvapich2/1.6-gcc.  Can you let us know which
> version of MVAPICH2 you are using?  If it is 1.6, we strongly encourage you
> to upgrade, in case your issue has already been resolved in a newer version
> of MVAPICH2.
>
> To find out which version you're using, you should be able to run
> `mpiname -a'.  If you are using an older version, please try MVAPICH2 v2.0.1.
>
> --
> Jonathan Perkins

