[mvapich-discuss] Intermittent hanging after application exits at scale with MVAPICH2

Sundeep Narravula narravul at cse.ohio-state.edu
Tue Jul 24 11:05:03 EDT 2007


> Would the patch that you provided for the start-up scaling issue we
> reported back in May (15th or so) be in 0.9.8p3, or do I need to apply the
> patch myself?

Greg,

You will need to reapply the patch to the 0.9.8p3 release.
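
Assuming you still have the patch file we sent in May (the name
startup-fix.patch below is just a placeholder for whatever it is called
on your end), reapplying it should look roughly like this:

  # unpack the new release, apply the May patch, and rebuild as before
  $ tar xzf mvapich2-0.9.8p3.tar.gz
  $ cd mvapich2-0.9.8p3
  $ patch -p1 < ../startup-fix.patch   # adjust -p to match the patch's paths
  $ ./make.mvapich2.ofa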

  --Sundeep.


>
> -Greg
>
> Sundeep Narravula wrote:
>
> >Hi Greg,
> >  We have just released a new bug-fix version of mvapich2 - 0.9.8p3.
> >This might help your case. You can obtain it from the download section of
> >our web-page (http://mvapich.cse.ohio-state.edu).
> >
> >Also, since the mpd daemons are Python programs, could you try a later
> >version of python (maybe v2.5)?
> >
> >Regards,
> >  --Sundeep.
> >
> >On Thu, 19 Jul 2007, Gregory Bauer wrote:
> >
> >>We are using mvapich2-0.9.8p2 (with the patch applied that addresses the
> >>start-up scalability issue) built via the make.mvapich2.ofa script
> >>(--with-device=osu_ch3:mrail --with-rdma=gen2), with ofed-1.2 and
> >>python-2.3.4.
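> >>
> >>(The build script effectively runs configure along these lines; the
> >>prefix is a placeholder:)
> >>
> >>  ./configure --prefix=<install-dir> \
> >>      --with-device=osu_ch3:mrail --with-rdma=gen2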
> >>
> >>I recently ran a series of 1024-task jobs (128 nodes, 8 cores per node)
> >>via PBS. Out of 8 jobs, two were left in a state where the application
> >>had exited but the mpds for each task still remained (the launch process
> >>was still in mpiexec).
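> >>
> >>The job script boils down to something like the following (the
> >>application name and mpd details are placeholders for our actual setup):
> >>
> >>  #PBS -l nodes=128:ppn=8
> >>  cd $PBS_O_WORKDIR
> >>  # mpd.hosts holds one line per node (derived from $PBS_NODEFILE)
> >>  mpdboot -n 128 -f mpd.hosts
> >>  # 1024 ranks = 128 nodes x 8 cores
> >>  mpiexec -n 1024 ./app
> >>  mpdallexit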
> >>
> >>I have attached the output from ps and the backtrace from gdb.
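> >>
> >>(For reference, the backtrace was gathered more or less like this; the
> >>pid is whatever ps reports for the stuck mpiexec:)
> >>
> >>  $ ps auxww | egrep 'mpd|mpiexec'
> >>  $ gdb -p <mpiexec-pid>
> >>  (gdb) thread apply all bt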
> >>
> >>The application's own output indicates that it exited correctly; it is
> >>just that mpiexec never returns, and PBS eventually kills the job once
> >>it exceeds its wallclock limit.
> >>
> >>Any ideas?
> >>
> >>-Greg
>
> --
> -Greg Bauer
>
> ph: (217) 333-2754
> email: gbauer at ncsa.uiuc.edu
>
> Performance Engineering and Computational Methods Group
> National Center for Supercomputing Applications
> University of Illinois at Urbana-Champaign


