[mvapich-discuss] MPI_FINALIZE() and forced ending of the job.
Troy Telford
ttelford at lnxi.com
Thu Sep 7 12:53:13 EDT 2006
I'm getting a report of a problem with MVAPICH 0.9.5-mlx1.0.3 (I've verified
that it still exists in MVAPICH 0.9.8):
(The following is mostly quoted from the person that reported it to me)
MPI 1.1 Spec re MPI_FINALIZE:
****
"MPI_FINALIZE()
int MPI_Finalize(void)
MPI_FINALIZE(IERROR)
INTEGER IERROR
This routines cleans up all MPI state. Once this routine is called, no
MPI routine (even MPI_INIT) may be called. The user must ensure that all
pending communications involving a process completes before the process
calls MPI_FINALIZE."
****
There is no mention here of forcefully ending the MPI job. In
addition,
http://www-unix.mcs.anl.gov/mpi/mpi-standard/mpi-report-2.0/node32.htm
(Clarification of MPI_FINALIZE) has :
****
"Although it is not required that all processes return from
MPI_FINALIZE, it is required that at least process 0 in MPI_COMM_WORLD
return, so that users can know that the MPI portion of the computation
is over. In addition, in a POSIX environment, they may desire to supply
an exit code for each process that returns from MPI_FINALIZE.
Example: The following illustrates the use of requiring that at least one
process return and that it be known that process 0 is one of the
processes that return. One wants code like the following to work no
matter how many processes return.
..
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
..
MPI_Finalize();
if (myrank == 0) {
    resultfile = fopen("outfile","w");
    dump_results(resultfile);
    fclose(resultfile);
}
exit(0);
****
Again, there is no mention of forcefully terminating the job. mpirun_rsh,
however, on receiving a SIGCHLD, sets a 10-second alarm and then kills
all remaining processes. The user feels this isn't standards-conforming,
and it is making some of his debugging/tracing efforts impossible.
Here's sample code that illustrates the problem:
#include <unistd.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc,char *argv[])
{
    int rc,myID ;

    rc = MPI_Init(&argc,&argv) ;
    printf("rc = %d\n",rc) ;
    MPI_Comm_rank(MPI_COMM_WORLD,&myID) ;
    printf("myID = %d,hello\n",myID) ;
    rc = MPI_Finalize() ;
    printf("finalize done, myid = %d rc2 = %d\n",myID,rc) ;
    /* Rank 0 keeps running after MPI_Finalize() has returned. */
    if ( myID == 0 )
    {
        printf("myID %d start sleeping\n",myID) ;
        sleep(50) ;   /* longer than mpirun_rsh's 10-second alarm */
        printf("end sleeping myID %d\n",myID) ;
    }
    return 0 ;
}
"end sleeping" will not be printed as this process is killed after 10 seconds.
Running `time mpirun_rsh ......` will also show this.
This issue seems specific to MVAPICH; the problem doesn't happen with MPICH,
MVAPICH2, or Open MPI.
Any ideas on how to satisfy the user?
--
Troy Telford