[mvapich-discuss] Mpiexec fails to terminate when program ends
Alex M Warren
amwarren at email.arizona.edu
Tue Nov 4 18:19:49 EST 2014
I recompiled with mvapich2 2.0b (what was available on the cluster)
> mpiname -a
MVAPICH2 2.0b Fri Nov 8 11:17:40 EST 2013 ch3:mrail
Compilation
CC: gcc -fpic -DNDEBUG -DNVALGRIND -O2
CXX: g++ -fpic -DNDEBUG -DNVALGRIND -O2
F77: gfortran -L/lib -L/lib -fpic -O2
FC: gfortran -fpic -O2
Configuration
CC=gcc CFLAGS=-fpic CXX=g++ CXXFLAGS=-fpic F77=gfortran FFLAGS=-fpic
FC=gfortran FCFLAGS=-fpic
--prefix=/usr/local/mpi/mvapich2-2.0b-gcc-4.7.1
This is the end of my program:
if (grid.Rank() == 0) std::cout << "Finalize" << std::endl;
std::string message = std::string("rank_") +
std::to_string(mpi::Rank(mpi::COMM_WORLD)) + "_a";
std::cout << message;
Finalize();
message = message + "b";
std::cout << message;
mpi::Finalize();
message = message + "c";
std::cout << message;
return 0;
}
The results are (running one process):
----------------------------------------
Begin Torque Prologue (Tue Nov 4 16:01:58 2014)
Job ID: 1680954.cc-mgmt1.campuscluster.illinois.edu
Username: amwarren
Group: ***
Job Name: mpi_test1
Limits:
ncpus=1,neednodes=1:ppn=6:m24G:taub,nodes=1:ppn=6:m24G:taub,walltime=00:13:00
Job Queue: secondary
Account: ***
Nodes: taub205
End Torque Prologue
----------------------------------------
Currently Loaded Modulefiles:
1) torque/4.2.9 5) gcc/4.7.1
2) moab/7.2.9 6) mvapich2/2.0b-gcc-4.7.1
3) env/taub 7) mvapich2/mpiexec
4) blas 8) lapack
mpiexec: resolve_exe: using absolute path
"/home/amwarren/aps/distributed_memory/aps".
node 0: name taub205, cpu avail 6
mpiexec: process_start_event: evt 2 task 0 on taub205.
mpiexec: All 1 task (spawn 0) started.
mpiexec: wait_tasks: waiting for taub205.
mpiexec: accept_pmi_conn: cmd=initack pmiid=0.
mpiexec: accept_pmi_conn: rank 0 (spawn 0) checks in.
mpiexec: accept_pmi_conn: cmd=init pmi_version=1 pmi_subversion=1.
--------------------------------------------------------------------------------
[...]
TIME: Total[ 166.405
Finalize
rank_0_arank_0_abrank_0_abcmpiexec: killall: caught signal 15 (Terminated).
mpiexec: kill_tasks: killing all tasks.
mpiexec: wait_tasks: waiting for taub205.
mpiexec: killall: caught signal 15 (Terminated).
=>> PBS: job killed: walltime 801 exceeded limit 780
----------------------------------------
Begin Torque Epilogue (Tue Nov 4 16:15:19 2014)
Job ID: 1680954.cc-mgmt1.campuscluster.illinois.edu
Username: amwarren
Group: ***
Job Name: mpi_test1
Session: 11270
Limits:
ncpus=1,neednodes=1:ppn=6:m24G:taub,nodes=1:ppn=6:m24G:taub,walltime=00:13:00
Resources: cput=00:02:12,mem=429524kb,vmem=773600kb,walltime=00:13:21
Job Queue: secondary
Account: ***
Nodes: taub205
End Torque Epilogue
----------------------------------------
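(While the hang itself is being debugged, one way to avoid losing the remaining walltime on a job whose ranks have already finished is to wrap the launcher in coreutils `timeout`. This is a generic workaround, not something from the original job script; the `sleep 300` stands in for the real launcher command, and the 2-second limit is only for demonstration:

```shell
#!/bin/sh
# Pattern: bound the launcher's runtime so a post-Finalize hang cannot
# consume the whole walltime allocation. In the real job script this
# would be something like: timeout 12m mpiexec ./aps
timeout 2s sleep 300
status=$?
# timeout exits with status 124 when it had to kill the command.
if [ "$status" -eq 124 ]; then
    echo "launcher still alive past the deadline; killed by timeout"
fi
```

Pick a limit comfortably above the program's expected runtime but below the job's walltime, so the epilogue still runs and the logs are flushed.)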
On Tue, Nov 4, 2014 at 8:56 AM, Jonathan Perkins
<perkinjo at cse.ohio-state.edu> wrote:
> On Mon, Nov 03, 2014 at 02:10:22PM -0700, Alex M Warren wrote:
>> I am running an MPI program on a cluster. When the program ends, the
>> job does not, so I have to wait for it to time out.
>>
>> I am not sure how to debug this. I checked that the program got to the
>> finalize statement in MPI, and it does. I am using lib Elemental.
>>
>
> Can you print out specific messages after each Finalize call? Maybe one
> of these calls is hanging or exiting abnormally. If your job isn't too
> large you may want to try printing from each rank to see which rank is
> hanging (if one is).
>
>> Final lines of the program
>>
>>
>> if (grid.Rank() == 0) std::cout << "Finalize" << std::endl;
>>
>> Finalize();
>
> try adding a print here
>
>> mpi::Finalize();
>
> and another here
>
>> return 0;
>>
>> (I tried letting elemental do finalize and that didn't work either)
>> The output will be
>>
>> Finalize
>> mpiexec: killall: caught signal 15 (Terminated).
>> mpiexec: kill_tasks: killing all tasks.
>> mpiexec: wait_tasks: waiting for taub263.
>> mpiexec: killall: caught signal 15 (Terminated).
>> ----------------------------------------
>> Begin Torque Epilogue (Sun Aug 17 01:53:55 2014)
>> Job ID: ***
>> Username: ***
>> Group: ***
>> Job Name: num_core_compare_nside-32_mpi_nodes-1_cores-2_1e0e4c0516
>> Session: 16786
>> Limits:
>> ncpus=1,neednodes=2:ppn=6:m24G:taub,nodes=2:ppn=6:m24G:taub,walltime=00:13:00
>> Resources: cput=00:08:17,mem=297884kb,vmem=672648kb,walltime=00:13:13
>> Job Queue: secondary
>> Account: ***
>> Nodes: taub263 taub290
>> End Torque Epilogue
>> ----------------------------------------
>>
>> Running these modules on https://campuscluster.illinois.edu/hardware/#taub
>>
>> > module list
>> Currently Loaded Modulefiles:
>> 1) torque/4.2.9 4) blas 7) lapack 10) gcc/4.7.1
>> 2) moab/7.2.9 5) mvapich2/1.6-gcc 8) git/1.7 11) cmake/2.8
>> 3) env/taub 6) mvapich2/mpiexec 9) vim/7.3 12) valgrind/3.9.0
>
> I see you have a module for mvapich2/1.6-gcc. Can you let us know which
> version of MVAPICH2 you are using?
> upgrade in case your issue has been resolved in a newer version of
> mvapich2.
>
> To find out which version you're using you should be able to run
> `mpiname -a'. If using an older version please try MVAPICH2 v2.0.1
>
> --
> Jonathan Perkins