[Mvapich-discuss] HYDT_bscd_pbs_wait_for_completion error

Phanish, Deepa dnagendra3 at gatech.edu
Wed Sep 28 09:28:17 EDT 2022


Hi All,

A user on our HPC cluster is reporting the issue below. Has anyone come across this, or have any ideas on how to solve it?

Note: The same workflow runs without issues using OpenMP, although more slowly.

Workflow: The user launches mpirun jobs in series within a loop (one PBS job per time step). Each job reads an input file, computes, and writes an output file, which is read by the next job.
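For context, here is a minimal sketch of the structure I believe the driver has. It is my own reconstruction, not the user's actual script; the resource requests, step count, and input/output file names are placeholders.

#!/bin/bash
#PBS -l nodes=1:ppn=4          # placeholder resource request
#PBS -l walltime=04:00:00

cd $PBS_O_WORKDIR

# One mpirun launch per time step, run in series: step N reads the output
# written by step N-1. In the user's description each time step is its own
# PBS job, so the loop may instead live in an outer submit script; the
# serial read/compute/write chain is the same either way.
NSTEPS=100                     # placeholder step count
for step in $(seq 1 $NSTEPS); do
    mpirun -np 4 $WKDIR/circ3d.x \
        input_step${step}.dat output_step${step}.dat   # hypothetical file names
done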

Issue: Sporadically, at a random time step, the run fails with the error message "HYDT_bscd_pbs_wait_for_completion (tools/bootstrap/external/pbs_wait.c:67): tm_poll(obit_event) failed with TM error 17002" and the program fails to write its output file.

Debugging efforts so far (a rough sketch of attempts 1 and 3 follows this list):

1) We tried adding sleep intervals between job launches; that didn't help.
2) We added barriers before MPI_Finalize; that didn't help.
3) We checked the stack size; it looks fine.
4) We ran simulations with reads only and no file writes; the issue still occurs.

The debug messages below show where it fails.
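For concreteness, attempts 1 and 3 looked roughly like the following at the shell level. The sleep interval is a placeholder rather than the exact value used, and attempt 2 (MPI_Barrier before MPI_Finalize) lives in the application code itself, so it is not shown here.

# (1) Pause between consecutive mpirun launches -- did not help.
mpirun -np 4 $WKDIR/circ3d.x
sleep 60          # placeholder interval, not the exact value we used

# (3) Confirm the stack size limit inside the PBS job -- looked fine.
ulimit -s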

Any help is much appreciated!


For reference, the output was obtained with the following run options in the PBS script:

mpirun -np 4 -genv I_MPI_HYDRA_DEBUG=1 -verbose $WKDIR/circ3d.x

Here's the tail end of the output from time step 50, the last one that completed successfully before the failure at the end of the next time step:

[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 4): barrier_in

[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 5): barrier_in

[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 6): barrier_in

[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 9): barrier_in

[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] forwarding command (cmd=barrier_in) upstream
[mpiexec at atl1-1-02-010-15-l.pace.gatech.edu] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at atl1-1-02-010-15-l.pace.gatech.edu] PMI response to fd 7 pid 9: cmd=barrier_out
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=barrier_out
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=barrier_out
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=barrier_out
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=barrier_out
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 4): finalize

[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=finalize_ack
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 5): finalize

[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=finalize_ack
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 6): finalize

[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=finalize_ack
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 9): finalize

[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=finalize_ack
[mpiexec at atl1-1-02-010-15-l.pace.gatech.edu]
PBS_DEBUG: Done with polling obit events
 Job ended at Tue Aug  2 11:02:51 EDT 2022

And here is the same block of output from the next time step, where it failed:

[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 4): barrier_in

[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 5): barrier_in

[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 6): barrier_in

[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 9): barrier_in

[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] forwarding command (cmd=barrier_in) upstream
[mpiexec at atl1-1-02-010-15-l.pace.gatech.edu] [pgid: 0] got PMI command: cmd=barrier_in
[mpiexec at atl1-1-02-010-15-l.pace.gatech.edu] PMI response to fd 7 pid 9: cmd=barrier_out
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=barrier_out
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=barrier_out
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=barrier_out
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=barrier_out
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 9): finalize

[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=finalize_ack
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 5): finalize

[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=finalize_ack
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 6): finalize

[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=finalize_ack
[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] got pmi command (from 4): finalize

[proxy:0:0 at atl1-1-02-010-15-l.pace.gatech.edu] PMI response: cmd=finalize_ack
[mpiexec at atl1-1-02-010-15-l.pace.gatech.edu] HYDT_bscd_pbs_wait_for_completion (tools/bootstrap/external/pbs_wait.c:67): tm_poll(obit_event) failed with TM error 17002
[mpiexec at atl1-1-02-010-15-l.pace.gatech.edu] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at atl1-1-02-010-15-l.pace.gatech.edu] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec at atl1-1-02-010-15-l.pace.gatech.edu] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion



Regards,


Deepa Phanish, PhD

Research Scientist II, Research Computing Facilitator

Partnership for an Advanced Computing Environment (PACE)

Georgia Institute of Technology
