[mvapich-discuss] Random large-scale MPI jobs hanging
You, Zhi-Qiang
zyou at osc.edu
Wed Oct 23 12:21:23 EDT 2019
No, I ran osu_allreduce with 2.3.1 and 2.3.2 using mpirun_rsh on 15 nodes:
mpirun_rsh -np $PBS_NP -hostfile $PBS_NODEFILE $MPICH_HOME/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -m 4194304
For each MPI version, 3 out of 20 jobs hang with this error message:
[o0630.ten.osc.edu:mpirun_rsh][signal_processor] Caught signal 15, killing job
=>> PBS: job killed: walltime 334 exceeded limit 300
Also, I can reproduce the issue with a simple hello-world program:
== hello.c ==
#include <mpi.h>
#include <stdio.h>
int main(int argc, char** argv) {
// Initialize the MPI environment
MPI_Init(NULL, NULL);
// Get the number of processes
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
// Get the rank of the process
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
// Print off a hello world message
printf("Hello world from rank %d out of %d processors\n",
world_rank, world_size);
// Finalize the MPI environment.
MPI_Finalize();
return 0;
}
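For completeness, the compile-and-launch sequence I use for the reproducer above looks roughly like this (the output path is arbitrary; -np and the hostfile come from PBS, as in the osu_allreduce runs):

```shell
# Build with the MVAPICH2 compiler wrapper, then launch via mpirun_rsh
# across the nodes PBS allocated to the job.
mpicc -o hello hello.c
mpirun_rsh -np $PBS_NP -hostfile $PBS_NODEFILE ./hello
```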
--
Zhi-Qiang You
Scientific Applications Engineer
Ohio Supercomputer Center (OSC)<https://osc.edu/>
A member of the Ohio Technology Consortium<https://oh-tech.org/>
1224 Kinnear Road, Columbus, Ohio 43212
Office: (614) 292-8492<tel:+16142928492> • Fax: (614) 292-7168<tel:+16142927168>
zyou at osc.edu<mailto:zyou at osc.edu>
From: "Ruhela, Amit" <ruhela.2 at osu.edu>
Date: Wednesday, October 23, 2019 at 10:25 AM
To: "You, Zhi-Qiang" <zyou at osc.edu>, "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: Random large-scale MPI jobs hanging
Thanks, Zhi-Qiang, for reporting the issue.
Can you try mpirun_rsh instead of mpiexec and see if the issues go away?
Regards,
Amit Ruhela
________________________________
From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> on behalf of You, Zhi-Qiang <zyou at osc.edu>
Sent: Tuesday, October 22, 2019 4:54 PM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] Random large-scale MPI jobs hanging
Hi,
I am running large-scale MPI tests on the OSC Owens system. Jobs randomly hang or fail with this error message:
[mpiexec at o0317.ten.osc.edu] HYDT_bscd_pbs_wait_for_completion (tools/bootstrap/external/pbs_wait.c:67): tm_poll(obit_event) failed with TM error 17002
[mpiexec at o0317.ten.osc.edu] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at o0317.ten.osc.edu] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec at o0317.ten.osc.edu] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion
== Version of MVAPICH2 ==
2.3.1 and 2.3.2
$ module reset
$ module load mvapich2/2.3.1   (or: module load mvapich2/2.3.2)
== Testing Jobs ==
I use ‘osu_bcast’ and ‘osu_allreduce’. For bcast, I run the following command twice with walltime=00:02:00:
mpiexec -np $PBS_NP $MPICH_HOME/libexec/osu-micro-benchmarks/mpi/collective/osu_bcast -m 4194304
For allreduce, I run the following command three times with walltime=00:05:00:
mpiexec -np $PBS_NP $MPICH_HOME/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -m 4194304
I submit these jobs with node counts ranging from 5 to 145, in increments of 5.
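A hypothetical sketch of that submission sweep; the script name (allreduce.pbs) and the ppn value are assumptions for illustration, not taken from my actual setup:

```shell
# Submit one job per node count: 5, 10, 15, ..., 145.
# allreduce.pbs is a placeholder batch script that runs the
# mpiexec osu_allreduce command shown above.
for nodes in $(seq 5 5 145); do
    qsub -l nodes=${nodes}:ppn=28,walltime=00:05:00 allreduce.pbs
done
```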
== Result ==
With a small number of nodes, 3 out of 10 jobs hang or hit the error above. As the node count grows to 50 or 70, the failure rate increases.
-ZQ