[mvapich-discuss] Intermittent hanging after application exits at scale with MVAPICH2

Gregory Bauer gbauer at ncsa.uiuc.edu
Tue Jul 24 10:29:11 EDT 2007


Sundeep-
Thanks for the suggestion.

I'll see what I can do about the Python version (I know we have 2.3.6).

Is the patch that "you" provided for the start-up scaling issue we 
reported back in May (around the 15th) included in 0.9.8p3, or do I need 
to apply it myself?

-Greg

Sundeep Narravula wrote:

>Hi Greg,
>  We have just released a new bug-fix version of mvapich2 - 0.9.8p3.
>This might help your case. You can obtain this from the download section
>our web-page (http://mvapich.cse.ohio-state.edu)
>
>Also, could you try a later version of python (maybe v2.5)?
>
>Regards,
>  --Sundeep.
>
>On Thu, 19 Jul 2007, Gregory Bauer wrote:
>
>>We are using mvapich2-0.9.8p2 (with the patch applied that addresses a
>>start-up scalability issue), built with the make.mvapich2.ofa script
>>(--with-device=osu_ch3:mrail --with-rdma=gen2), together with ofed-1.2
>>and python-2.3.4.
>>
>>I recently ran a series of 1024-task jobs (128 nodes, 8 cores per node)
>>via PBS. Of the 8 jobs, two were left in a state where the application
>>had exited but the mpds for each task were still running (the launching
>>process was still in mpiexec).
>>
>>I have attached the output from ps and a backtrace from gdb.
>>
>>According to its output, the application exited correctly. However,
>>mpiexec never returns, and PBS eventually kills the job once it exceeds
>>the job's wallclock limit.
>>
>>Any ideas?
>>
>>-Greg

-- 
-Greg Bauer

ph: (217) 333-2754
email: gbauer at ncsa.uiuc.edu

Performance Engineering and Computational Methods Group
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
