[mvapich-discuss] Intermittent hanging after application exits at scale with MVAPICH2

Sundeep Narravula narravul at cse.ohio-state.edu
Mon Jul 23 11:11:47 EDT 2007


Hi Greg,
  We have just released a new bug-fix version of MVAPICH2, 0.9.8p3.
This might help in your case. You can obtain it from the download section
of our web page (http://mvapich.cse.ohio-state.edu).

Also, could you try a later version of Python (maybe v2.5)?
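
Until the new release can be tested, one possible stopgap is to bound how
long the launcher may linger, so that a hung mpiexec does not consume the
full PBS wallclock. The following is only a minimal sketch in modern
Python, not part of MVAPICH2 or mpd: the helper name run_with_watchdog,
the time limit, and the example command line are hypothetical. It only
terminates the local launcher process and does not clean up mpd daemons
left on other nodes.

```python
import subprocess


def run_with_watchdog(cmd, timeout_s):
    """Run cmd and kill it if it has not exited after timeout_s seconds.

    Returns a (returncode, timed_out) tuple. This is a hypothetical helper
    for illustration; it knows nothing about MPI or mpd internals.
    """
    proc = subprocess.Popen(cmd)
    try:
        # Normal case: the launcher returns on its own within the limit.
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        # Launcher is still alive past the limit: kill it and reap it.
        proc.kill()
        return proc.wait(), True


# Example call (assumed command line; adjust to the actual job):
#   rc, hung = run_with_watchdog(["mpiexec", "-n", "1024", "./app"], 3600)
```

The limit would need to be chosen comfortably above the application's
expected run time, since the watchdog cannot tell a hung launcher from a
slow job.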

Regards,
  --Sundeep.

On Thu, 19 Jul 2007, Gregory Bauer wrote:

> We are using mvapich2-0.9.8p2 (with the patch applied that addresses a
> start-up scalability issue) built via the make.mvapich2.ofa
> (--with-device=osu_ch3:mrail --with-rdma=gen2) script and with ofed-1.2
> and python-2.3.4.
>
> I recently ran a series of 1024-task jobs (128 nodes, 8 cores per node)
> via PBS. Out of 8 jobs, two were left in a state where the
> application had exited but the mpds for each task still remained (the
> launch process was still in mpiexec).
>
> I have attached output from ps and from gdb for the backtrace.
>
> The application output is such that it thinks it exited correctly. It is
> just that mpiexec doesn't return and PBS eventually kills the job after
> it exceeds the job wallclock time.
>
> Any ideas?
>
> -Greg