[mvapich-discuss] Random large-scale MPI jobs hanging

You, Zhi-Qiang zyou at osc.edu
Wed Oct 23 12:21:23 EDT 2019


No, I ran osu_allreduce with 2.3.1 and 2.3.2 using mpirun_rsh on 15 nodes:

mpirun_rsh -np $PBS_NP -hostfile $PBS_NODEFILE $MPICH_HOME/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -m 4194304
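For context, each run was wrapped in a PBS batch script along these lines (a sketch, not the exact script used; the ppn value is an assumption based on Owens hardware, and the walltime matches the 300 s limit in the PBS kill message):

```shell
#!/bin/bash
#PBS -l nodes=15:ppn=28        # ppn=28 is an assumption for Owens dual-14-core nodes
#PBS -l walltime=00:05:00      # matches the 300 s limit in the PBS kill message
#PBS -N osu_allreduce

module reset
module load mvapich2/2.3.2     # or mvapich2/2.3.1

cd $PBS_O_WORKDIR

mpirun_rsh -np $PBS_NP -hostfile $PBS_NODEFILE \
    $MPICH_HOME/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -m 4194304
```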

For each MPI version, 3 out of 20 jobs hang, with this error message:

[o0630.ten.osc.edu:mpirun_rsh][signal_processor] Caught signal 15, killing job
=>> PBS: job killed: walltime 334 exceeded limit 300


Also, I can reproduce the issue with a simple hello-world program:

== hello.c ==
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Print off a hello world message
    printf("Hello world from rank %d out of %d processors\n",
                    world_rank, world_size);

    // Finalize the MPI environment.
    MPI_Finalize();

    return 0;
}
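For completeness, the reproducer can be built and launched the same way as the benchmarks (a sketch; it assumes the mpicc wrapper from the loaded mvapich2 module is on the PATH):

```shell
# Build with the MPI compiler wrapper from the loaded mvapich2 module
mpicc -o hello hello.c

# Launch across the allocation, same as the osu_allreduce runs
mpirun_rsh -np $PBS_NP -hostfile $PBS_NODEFILE ./hello
```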

--
Zhi-Qiang You
Scientific Applications Engineer
Ohio Supercomputer Center (OSC)<https://osc.edu/>
A member of the Ohio Technology Consortium<https://oh-tech.org/>
1224 Kinnear Road, Columbus, Ohio 43212
Office: (614) 292-8492 • Fax: (614) 292-7168
zyou at osc.edu


From: "Ruhela, Amit" <ruhela.2 at osu.edu>
Date: Wednesday, October 23, 2019 at 10:25 AM
To: "You, Zhi-Qiang" <zyou at osc.edu>, "mvapich-discuss at cse.ohio-state.edu" <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: Re: Random large-scale MPI jobs hanging

Thanks, Zhi-Qiang, for reporting the issue.

Can you try mpirun_rsh instead of mpiexec and see if the issues go away?

Regards,
Amit Ruhela


________________________________
From: mvapich-discuss-bounces at cse.ohio-state.edu <mvapich-discuss-bounces at mailman.cse.ohio-state.edu> on behalf of You, Zhi-Qiang <zyou at osc.edu>
Sent: Tuesday, October 22, 2019 4:54 PM
To: mvapich-discuss at cse.ohio-state.edu <mvapich-discuss at mailman.cse.ohio-state.edu>
Subject: [mvapich-discuss] Random large-scale MPI jobs hanging


Hi,



I am running large-scale MPI tests on the OSC Owens system. Jobs randomly hang, or fail with this error message:

[mpiexec at o0317.ten.osc.edu] HYDT_bscd_pbs_wait_for_completion (tools/bootstrap/external/pbs_wait.c:67): tm_poll(obit_event) failed with TM error 17002
[mpiexec at o0317.ten.osc.edu] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec at o0317.ten.osc.edu] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec at o0317.ten.osc.edu] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion

== Version of MVAPICH2 ==

2.3.1 and 2.3.2

$ module reset
$ module load mvapich2/2.3.1   (or: module load mvapich2/2.3.2)

== Testing Jobs ==

I use ‘osu_bcast’ and ‘osu_allreduce’. For bcast, I run the command twice with walltime=00:02:00:

mpiexec -np $PBS_NP $MPICH_HOME/libexec/osu-micro-benchmarks/mpi/collective/osu_bcast -m 4194304

For allreduce, I run the command three times with walltime=00:05:00:

mpiexec -np $PBS_NP $MPICH_HOME/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -m 4194304

I submit these jobs at node counts from 5 to 145, in increments of 5.
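The sweep over node counts can be scripted roughly as below (a dry-run sketch: it only prints the qsub commands, run_allreduce.pbs is a hypothetical batch script wrapping the mpiexec line above, and ppn=28 is an assumption for Owens; drop the echo to actually submit):

```shell
# Print one qsub command per node count: 5, 10, ..., 145 (29 jobs total).
for nodes in $(seq 5 5 145); do
    echo qsub -l nodes=${nodes}:ppn=28 -l walltime=00:05:00 run_allreduce.pbs
done
```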



== Result ==

With a small number of nodes, 3 out of 10 jobs hang or hit the error above. At 50 to 70 nodes, more jobs fail.



-ZQ



--

Zhi-Qiang You
Scientific Applications Engineer
Ohio Supercomputer Center (OSC)<https://osc.edu/>
A member of the Ohio Technology Consortium<https://oh-tech.org/>
1224 Kinnear Road, Columbus, Ohio 43212
Office: (614) 292-8492 • Fax: (614) 292-7168
zyou at osc.edu

