[Mvapich-discuss] HYDT_bscd_pbs_wait_for_completion error
Phanish, Deepa
dnagendra3 at gatech.edu
Wed Sep 28 09:28:17 EDT 2022
Hi All,
A user on our HPC cluster is reporting the issue below. Has anyone come across this, or have any ideas on how to solve it?
Note: The workflow executes without issues using OpenMP, although more slowly.
Workflow: The user launches mpirun jobs in series within a loop (one PBS job per time step). Each job reads from an input file, computes, and writes to an output file, which is read by the next job.
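A minimal sketch of that loop, with hypothetical file names and step count; in the real PBS script, each "compute" step below is "mpirun -np 4 $WKDIR/circ3d.x", which reads the input file and writes the output file:

```shell
#!/bin/sh
# Sketch of the per-time-step workflow (names are hypothetical).
# A stand-in compute step replaces the real mpirun launch so the
# chaining of one step's output into the next step's input is visible.
workdir=$(mktemp -d)
echo 0 > "$workdir/input.dat"              # seed the first time step
for step in 1 2 3; do
    # stand-in compute step: read input, write output
    n=$(cat "$workdir/input.dat")
    echo $((n + 1)) > "$workdir/output.dat"
    # the output of this step becomes the input of the next
    mv "$workdir/output.dat" "$workdir/input.dat"
done
final=$(cat "$workdir/input.dat")
echo "final state after three steps: $final"
rm -r "$workdir"
```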
Issue: Sporadically, he gets the error message "HYDT_bscd_pbs_wait_for_completion (tools/bootstrap/external/pbs_wait.c:67): tm_poll(obit_event) failed with TM error 17002", and the program fails to write the output file on a random time step.
Debugging efforts so far:
1) We tried adding sleep intervals between job launches; that didn't help.
2) We added barriers before MPI_Finalize; that didn't help.
3) We checked the stack size; that looks fine.
4) We ran simulations with file reads only and no writes; the issue persists.
The debug messages below show where it fails.
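For reference, items 1) and 3) amount to shell-level checks like the following (the sleep interval is illustrative and, as noted, did not help):

```shell
#!/bin/sh
# Item 3: inspect the stack limit visible to the job
# (prints "unlimited" or a size in KiB).
stack=$(ulimit -s)
echo "stack size limit: $stack"
# Item 1: pause between successive mpirun launches in the loop;
# the interval here is illustrative.
sleep 1
```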
Any help is much appreciated!
For reference, the output was obtained with the following run options in the pbs script:
mpirun -np 4 -genv I_MPI_HYDRA_DEBUG=1 -verbose $WKDIR/circ3d.x
Here is the tail end of the output from time step 50, the last one that completed successfully, before the run failed at the end of the next time step:
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 4): barrier_in
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 5): barrier_in
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 6): barrier_in
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 9): barrier_in
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] forwarding command (cmd=barrier_in) upstream
[mpiexec at atl1-1-02-010-15-l.pace.gatech.edu] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at atl1-1-02-010-15-l.pace.gatech.edu] PMI response to fd 7 pid 9: cmd=barrier_out
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=barrier_out
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=barrier_out
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=barrier_out
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=barrier_out
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 4): finalize
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=finalize_ack
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 5): finalize
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=finalize_ack
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 6): finalize
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=finalize_ack
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 9): finalize
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=finalize_ack
[mpiexec at atl1-1-02-010-15-l.pace.gatech.edu]
PBS_DEBUG: Done with polling obit events
Job ended at Tue Aug 2 11:02:51 EDT 2022
And here is the same block of output from the next time step, where it failed:
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 4): barrier_in
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 5): barrier_in
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 6): barrier_in
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 9): barrier_in
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] forwarding command (cmd=barrier_in) upstream
[mpiexec at atl1-1-02-010-15-l.pace.gatech.edu] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at atl1-1-02-010-15-l.pace.gatech.edu] PMI response to fd 7 pid 9: cmd=barrier_out
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=barrier_out
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=barrier_out
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=barrier_out
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=barrier_out
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 9): finalize
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=finalize_ack
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 5): finalize
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=finalize_ack
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 6): finalize
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=finalize_ack
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 4): finalize
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=finalize_ack
[mpiexec at atl1-1-02-010-15-l.pace.gatech.edu] HYDT_bscd_pbs_wait_for_completion (tools/bootstrap/external/pbs_wait.c:67): tm_poll(obit_event) failed with TM error 17002
[mpiexec at atl1-1-02-010-15-l.pace.gatech.edu] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at atl1-1-02-010-15-l.pace.gatech.edu] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec at atl1-1-02-010-15-l.pace.gatech.edu] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion
Regards,
Deepa Phanish, PhD
Research Scientist II, Research Computing Facilitator
Partnership for an Advanced Computing Environment (PACE)
Georgia Institute of Technology