[mvapich-discuss] Random large-scale MPI jobs hanging
You, Zhi-Qiang
zyou at osc.edu
Tue Oct 22 16:54:38 EDT 2019
Hi,
I am running large-scale MPI tests on the OSC Owens system. Jobs randomly hang or fail with the following error message:
[mpiexec at o0317.ten.osc.edu] HYDT_bscd_pbs_wait_for_completion (tools/bootstrap/external/pbs_wait.c:67): tm_poll(obit_event) failed with TM error 17002
[mpiexec at o0317.ten.osc.edu] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at o0317.ten.osc.edu] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec at o0317.ten.osc.edu] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion
== Version of MVAPICH2 ==
2.3.1 and 2.3.2
$ module reset
$ module load mvapich2/2.3.1    # or: module load mvapich2/2.3.2
== Testing Jobs ==
I use ‘osu_bcast’ and ‘osu_allreduce’. For bcast, I run the following command twice with walltime=00:02:00:
mpiexec -np $PBS_NP $MPICH_HOME/libexec/osu-micro-benchmarks/mpi/collective/osu_bcast -m 4194304
For allreduce, I run the following command three times with walltime=00:05:00:
mpiexec -np $PBS_NP $MPICH_HOME/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -m 4194304
I submit these jobs with node counts ranging from 5 to 145, in increments of 5.
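
For anyone who wants to reproduce the sweep, here is a minimal dry-run sketch of the submission loop. It only prints the qsub commands rather than submitting them. The job script name (osu_bcast.pbs) and the cores-per-node value (PPN=28) are assumptions for illustration and are not taken from the setup above; adjust both for your system.

```shell
#!/bin/bash
# Dry-run sketch: print one qsub command per node count instead of submitting.
# ASSUMPTIONS: PPN and JOB_SCRIPT are hypothetical defaults, not from the report.
PPN="${PPN:-28}"
JOB_SCRIPT="${JOB_SCRIPT:-osu_bcast.pbs}"

# Emit the node counts 5, 10, ..., 145.
node_counts() {
    seq 5 5 145
}

# Print the submission command for each node count.
print_submissions() {
    local n
    for n in $(node_counts); do
        echo "qsub -l nodes=${n}:ppn=${PPN} -l walltime=00:02:00 ${JOB_SCRIPT}"
    done
}

print_submissions
```

Piping the output to sh would actually submit the jobs. For the allreduce runs, the walltime would be 00:05:00 instead.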
== Result ==
With a small number of nodes, 3 out of 10 jobs hit the error or hang. With 50 or 70 nodes, more jobs fail.
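
A rough way to tally the failures that produce the error message is to count the job output files containing it. This is only a sketch: the assumption that the output files sit in one directory and match *.o* is mine and may not fit your scheduler configuration. Jobs that hang and are killed at the walltime limit will not contain this string and have to be counted separately.

```shell
#!/bin/bash
# Sketch: count job output files that contain the TM error from the report.
# ASSUMPTION: output files live in one directory and match the pattern *.o*.
count_tm_errors() {
    local dir="$1"
    # grep -l prints each matching file name once; wc -l counts those names.
    grep -l "TM error 17002" "$dir"/*.o* 2>/dev/null | wc -l | tr -d ' '
}

# Usage: pass the log directory as the first argument (defaults to ".").
count_tm_errors "${1:-.}"
```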
-ZQ
--
Zhi-Qiang You
Scientific Applications Engineer
Ohio Supercomputer Center (OSC)<https://osc.edu/>
A member of the Ohio Technology Consortium<https://oh-tech.org/>
1224 Kinnear Road, Columbus, Ohio 43212
Office: (614) 292-8492<tel:+16142928492> • Fax: (614) 292-7168<tel:+16142927168>
zyou at osc.edu<mailto:zyou at osc.edu>