[mvapich-discuss] Random large-scale MPI jobs hanging
Ruhela, Amit
ruhela.2 at osu.edu
Wed Oct 23 10:25:44 EDT 2019
Thanks, Zhi-Qiang for reporting the issue.
Can you try mpirun_rsh instead of mpiexec and see if the issues go away?
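For reference, a minimal sketch of the equivalent mpirun_rsh launch for the osu_bcast test from the report. It assumes the job runs under PBS, so that PBS_NP and PBS_NODEFILE are set by the batch system; the fallback values (2, hosts.txt, /path/to/mvapich2) are hypothetical placeholders that only let the snippet be dry-run outside a job.

```shell
# Sketch only: same benchmark as in the report, launched with mpirun_rsh.
# PBS_NP and PBS_NODEFILE come from the batch system inside a job;
# the fallbacks are hypothetical placeholders for a dry run elsewhere.
np="${PBS_NP:-2}"
hostfile="${PBS_NODEFILE:-hosts.txt}"
bench="${MPICH_HOME:-/path/to/mvapich2}/libexec/osu-micro-benchmarks/mpi/collective/osu_bcast"

# Build the command line and print it for inspection.
launch_cmd="mpirun_rsh -np $np -hostfile $hostfile $bench -m 4194304"
echo "$launch_cmd"
```

Inside a job script, the printed line can be run directly in place of the mpiexec command.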
Regards,
Amit Ruhela
________________________________
From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> on behalf of You, Zhi-Qiang <zyou at osc.edu>
Sent: Tuesday, October 22, 2019 4:54 PM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] Random large-scale MPI jobs hanging
Hi,
I am running large-scale MPI tests on the OSC Owens system. Jobs randomly hang or fail with the following error messages:
[mpiexec at o0317.ten.osc.edu] HYDT_bscd_pbs_wait_for_completion (tools/bootstrap/external/pbs_wait.c:67): tm_poll(obit_event) failed with TM error 17002
[mpiexec at o0317.ten.osc.edu] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at o0317.ten.osc.edu] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec at o0317.ten.osc.edu] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion
== Version of MVAPICH2 ==
2.3.1 and 2.3.2
$ module reset
$ module load mvapich2/2.3.1   # or: module load mvapich2/2.3.2
== Testing Jobs ==
I use ‘osu_bcast’ and ‘osu_allreduce’. For bcast, I run the following command twice with walltime=00:02:00:
mpiexec -np $PBS_NP $MPICH_HOME/libexec/osu-micro-benchmarks/mpi/collective/osu_bcast -m 4194304
For allreduce, I run the following command three times with walltime=00:05:00:
mpiexec -np $PBS_NP $MPICH_HOME/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -m 4194304
I submit these jobs with node counts ranging from 5 to 145 (in increments of 5).
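A sketch of how such a sweep might be submitted, reading the description above as node counts 5, 10, ..., 145. The ppn=28 value and the job script name bcast_job.sh are assumptions for illustration and do not come from the report. The loop only prints the qsub commands (a dry run).

```shell
# Sketch of the node-count sweep: 5, 10, ..., 145 (assumed step of 5).
node_counts() {
    seq 5 5 145
}

submit_all() {
    for n in $(node_counts); do
        # ppn=28 and bcast_job.sh are hypothetical; echo makes this a dry run.
        # Remove "echo" to actually submit the jobs.
        echo qsub -l "nodes=${n}:ppn=28" -l walltime=00:02:00 bcast_job.sh
    done
}

submit_all
```

For the allreduce runs, the walltime would be 00:05:00 instead.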
== Result ==
With a small number of nodes, 3 out of 10 jobs hit the error or hang. When the node count increases to 50 or 70, more jobs fail.
-ZQ
--
Zhi-Qiang You
Scientific Applications Engineer
Ohio Supercomputer Center (OSC)<https://osc.edu/>
A member of the Ohio Technology Consortium<https://oh-tech.org/>
1224 Kinnear Road, Columbus, Ohio 43212
Office: (614) 292-8492<tel:+16142928492> • Fax: (614) 292-7168<tel:+16142927168>
zyou at osc.edu<mailto:zyou at osc.edu>