[mvapich-discuss] Intermittent hanging after application exits at scale with MVAPICH2
Gregory Bauer
gbauer at ncsa.uiuc.edu
Tue Jul 24 10:29:11 EDT 2007
Sundeep-
Thanks for the suggestion.
I'll see what I can do about the version of Python. (I know we have 2.3.6).
Would the patch that you provided for the start-up scaling issue we
reported back in May (around the 15th) be included in 0.9.8p3, or do I
need to apply the patch myself?
-Greg
Sundeep Narravula wrote:
>Hi Greg,
> We have just released a new bug-fix version of mvapich2 - 0.9.8p3.
>This might help your case. You can obtain it from the download section
>of our web page (http://mvapich.cse.ohio-state.edu).
>
>Also, could you try a later version of Python (maybe v2.5)?
>
>Regards,
> --Sundeep.
>
>On Thu, 19 Jul 2007, Gregory Bauer wrote:
>
>>We are using mvapich2-0.9.8p2 (with the patch applied that addresses a
>>start-up scalability issue), built via the make.mvapich2.ofa script
>>(--with-device=osu_ch3:mrail --with-rdma=gen2) with ofed-1.2 and
>>python-2.3.4.
>>
>>I recently ran a series of 1024-task jobs (128 nodes, 8 cores per
>>node) via PBS. Out of 8 jobs, two were left in a state where the
>>application had exited but the mpds for each task still remained (the
>>launch process was still in mpiexec).
>>
>>I have attached output from ps and from gdb for the backtrace.
>>
>>The application output indicates that it exited correctly; it is just
>>that mpiexec doesn't return, and PBS eventually kills the job once it
>>exceeds its wallclock limit.
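>>
>>For reference, a minimal test case along these lines should exit
>>cleanly. (This is an illustrative sketch; the file name, message
>>text, and run command below are assumptions, not taken from our
>>actual application.)
>>
>>    /* hangtest.c: illustrative minimal reproducer. If mpiexec
>>       still fails to return after this finalizes, the hang is
>>       likely in the job teardown, not in the application. */
>>    #include <stdio.h>
>>    #include <mpi.h>
>>
>>    int main(int argc, char **argv)
>>    {
>>        int rank, size;
>>
>>        MPI_Init(&argc, &argv);
>>        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>        MPI_Comm_size(MPI_COMM_WORLD, &size);
>>
>>        /* Synchronize so every task reaches finalization together. */
>>        MPI_Barrier(MPI_COMM_WORLD);
>>
>>        if (rank == 0)
>>            printf("all %d tasks reached the barrier; finalizing\n",
>>                   size);
>>
>>        MPI_Finalize();
>>        return 0;
>>    }
>>
>>If mpiexec -n 1024 ./hangtest still fails to return after the barrier
>>message prints, that would point at the mpd teardown rather than at
>>the application itself.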
>>
>>Any ideas?
>>
>>-Greg
--
-Greg Bauer
ph: (217) 333-2754
email: gbauer at ncsa.uiuc.edu
Performance Engineering and Computational Methods Group
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign