[mvapich-discuss] Random large-scale MPI jobs hanging

Ruhela, Amit ruhela.2 at osu.edu
Wed Oct 23 10:25:44 EDT 2019


Thanks, Zhi-Qiang, for reporting the issue.

Can you try mpirun_rsh instead of mpiexec and see if the issues go away?
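
For reference, a minimal sketch of what that substitution might look like inside the PBS job script. It assumes that $PBS_NP and $PBS_NODEFILE are set by the batch system, that $MPICH_HOME is set by the mvapich2 module as in the original commands, and that passwordless ssh between the allocated nodes is available, which mpirun_rsh normally needs. The helper name mpirun_rsh_cmd is hypothetical; it only prints the command line so it can be inspected before use.

```shell
#!/bin/bash
# Sketch only: the benchmark from the report, launched via mpirun_rsh
# instead of mpiexec. PBS_NP, PBS_NODEFILE and MPICH_HOME are assumed to be
# provided by the batch system and the mvapich2 module.

# Print the mpirun_rsh command line for one benchmark (e.g. osu_bcast).
mpirun_rsh_cmd() {
    printf 'mpirun_rsh -np %s -hostfile %s %s/libexec/osu-micro-benchmarks/mpi/collective/%s -m 4194304\n' \
        "${PBS_NP}" "${PBS_NODEFILE}" "${MPICH_HOME}" "$1"
}
```

Inside a job, `mpirun_rsh_cmd osu_bcast` prints the line to run; piping it to `sh` would execute it.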

Regards,
Amit Ruhela


________________________________
From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> on behalf of You, Zhi-Qiang <zyou at osc.edu>
Sent: Tuesday, October 22, 2019 4:54 PM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] Random large-scale MPI jobs hanging


Hi,



I am running large-scale MPI tests on the OSC Owens system. Jobs randomly hang or fail with the following error messages:



[mpiexec at o0317.ten.osc.edu] HYDT_bscd_pbs_wait_for_completion (tools/bootstrap/external/pbs_wait.c:67): tm_poll(obit_event) failed with TM error 17002

[mpiexec at o0317.ten.osc.edu] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion

[mpiexec at o0317.ten.osc.edu] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion

[mpiexec at o0317.ten.osc.edu] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion



== Version of MVAPICH2 ==

2.3.1 and 2.3.2

$ module reset

$ module load mvapich2/2.3.1   # or: module load mvapich2/2.3.2



== Testing Jobs ==

I use ‘osu_bcast’ and ‘osu_allreduce’. For bcast, I run the following command twice with walltime=00:02:00:



mpiexec -np $PBS_NP $MPICH_HOME/libexec/osu-micro-benchmarks/mpi/collective/osu_bcast -m 4194304



For allreduce, I run the following command three times with walltime=00:05:00:



mpiexec -np $PBS_NP $MPICH_HOME/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -m 4194304



I submit these jobs with node counts ranging from 5 to 145 in increments of 5.
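
A sweep like the one described could be generated along these lines. This is only a sketch: the job script name (osu_bcast.pbs) and the cores-per-node value (PPN) are hypothetical placeholders that are not taken from the report, and the Torque-style `-l nodes=N:ppn=P,walltime=...` resource syntax is an assumption about the scheduler setup. The functions only print the qsub commands so they can be reviewed first.

```shell
#!/bin/bash
# Sketch only: print one qsub command per node count (5, 10, ..., 145).
# JOB_SCRIPT and PPN are hypothetical placeholders, not from the report.
JOB_SCRIPT="${JOB_SCRIPT:-osu_bcast.pbs}"
PPN="${PPN:-28}"

# Node counts from 5 to 145 in steps of 5.
node_counts() {
    seq 5 5 145
}

# Print the submission command for each node count (bcast walltime shown).
submit_cmds() {
    for n in $(node_counts); do
        echo "qsub -l nodes=${n}:ppn=${PPN},walltime=00:02:00 ${JOB_SCRIPT}"
    done
}
```

Running `submit_cmds` lists the commands; `submit_cmds | sh` would actually submit them.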



== Result ==

With a small number of nodes, 3 out of 10 jobs hit the error or hang. When the node count increases to 50 or 70, more jobs fail.
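
One way to tally the failures that print the error (as opposed to the silent hangs) is to search the job output files for the message text. The sketch below assumes all job output files sit in a single directory, which is a hypothetical layout; the function name count_tm_errors is likewise made up for illustration. Jobs that hang and are killed at the walltime limit will not match and need a separate check.

```shell
#!/bin/bash
# Sketch only: count job output files containing the tm_poll error text.
# Hung jobs killed at the walltime limit are NOT detected by this check.
count_tm_errors() {
    # $1 = directory holding the job output files (hypothetical layout)
    grep -l -F 'tm_poll(obit_event) failed' "$1"/* 2>/dev/null | wc -l | tr -d ' '
}
```

For example, `count_tm_errors ./job_logs` prints the number of output files in ./job_logs that contain the error.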



-ZQ



--

Zhi-Qiang You
Scientific Applications Engineer
Ohio Supercomputer Center (OSC)<https://osc.edu/>
A member of the Ohio Technology Consortium<https://oh-tech.org/>
1224 Kinnear Road, Columbus, Ohio 43212
Office: (614) 292-8492 • Fax: (614) 292-7168
zyou at osc.edu

