[mvapich-discuss] Question on how to debug job start failures

Dhabaleswar Panda panda at cse.ohio-state.edu
Wed Jul 8 22:22:01 EDT 2009


Are you able to run simple MPI programs (say, MPI Hello World) or some IMB
tests using ~512 cores or more? This will help you find out whether
there are any issues when launching jobs and isolate any nodes that might
be having problems.
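
As a rough sketch of the "isolate any nodes" step: if each rank of the
Hello World test prints one line naming its host, the output can be
compared against the host list to see which nodes never reported. The
line format "Hello from rank <r> on <host>", the function name
find_silent_hosts, and the default of 8 processes per node are
assumptions made for illustration, not part of MVAPICH2 or mpirun_rsh.

```python
from collections import Counter


def find_silent_hosts(expected_hosts, output_lines, procs_per_host=8):
    """Return {host: ranks_seen} for hosts that reported too few ranks.

    Assumes each rank printed a line of the (hypothetical) form
    "Hello from rank <r> on <host>"; other lines are ignored.
    """
    seen = Counter()
    for line in output_lines:
        parts = line.split()
        if len(parts) == 6 and parts[0] == "Hello" and parts[4] == "on":
            seen[parts[5]] += 1
    # A host with 0 ranks never started anything; a partial count
    # suggests the launch stalled partway through that node.
    return {h: seen[h] for h in expected_hosts if seen[h] < procs_per_host}


if __name__ == "__main__":
    hosts = ["n001", "n002", "n003"]
    lines = [
        "Hello from rank 0 on n001",
        "Hello from rank 1 on n001",
        "Hello from rank 2 on n002",
    ]
    for host, count in find_silent_hosts(hosts, lines, procs_per_host=2).items():
        print(f"{host}: {count} of 2 ranks reported")
```

Running the same check at several job sizes and comparing the hosts it
flags can show whether the same nodes are missing each time or whether
the failures move around.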

Thanks,

DK

On Wed, 8 Jul 2009, Craig Tierney wrote:

> I am running mvapich2 1.2, built with OFED support (v1.3.1).
> For large jobs, I am having problems where they do not start.
> I am using the mpirun_rsh launcher.  When I try to start jobs
> with ~512 cores or larger, I can see the problem.  The problem
> doesn't happen all the time.
>
> I can't rule out quirky hardware.  The IB tree seems to be
> clean (as reported by ibdiagnet).  My last hang, I looked to
> see if xhpl had started on all the nodes (8 cases for each
> node for dual-socket quad-core systems).  I found that 7 of
> the 245 nodes (1960 core job) had no xhpl processes on them.
> So either the launching mechanism hung, or something was up with one of
> those nodes.
>
> My question is, how should I start debugging this to understand
> what process is hanging?
>
> Thanks,
> Craig
>
>
> --
> Craig Tierney (craig.tierney at noaa.gov)
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>


