[mvapich-discuss] Possible mvapich bug (or possibly not).

Laurence Marks L-marks at northwestern.edu
Fri Sep 19 18:42:41 EDT 2008


The computing platform is dual quad-core Intel nodes with 8 GB per node:
Intel(R) Xeon(R) CPU E5410  @ 2.33GHz. Linux version 2.6.18-8.1.15.el5
(mockbuild at builder6.centos.org) (gcc version 4.1.1 20070105 (Red Hat
4.1.1-52)) #1 SMP Mon Oct 22 08:32:28 EDT 2007

I have been running this on between 80 and 96 cores (10-12 nodes),
both with only a single core used on the master node (the first in the
list) and with all 8, with the same result. It is not memory limited:
the job uses only ~6 GB/node and, from what I can see, is not going
into swap -- I have run into that problem before, and this is not it.
(The total memory needed is around 50 GB.)
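
As a rough cross-check (my own back-of-the-envelope arithmetic, not a
measurement):

  2 matrices x 38381^2 elements x 8 bytes  ~= 23.6 GB
  spread over 12 nodes                     ~=  2.0 GB/node

so the two distributed matrices alone fit comfortably in 8 GB/node; the
rest of the ~50 GB is workspace and the code's other arrays.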

I have not pinned down the exact size at which this occurs; I only
know that it is between 36927x36927 and 38381x38381. I did run one
larger case, and it failed in the same way.
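
For reference, the call sequence boils down to something like the
standalone C sketch below. The process grid setup, the block size nb,
and the trivial test matrices are my own guesses for illustration, not
the actual application code; note that PDSYGST expects B to already
hold its Cholesky factor from PDPOTRF.

/* Sketch of the failing call sequence -- hypothetical grid setup,
 * block size, and test matrices; not the actual application.
 * Build (exact link line depends on your BLACS/ScaLAPACK install):
 *   mpicc -std=c99 repro.c -o repro -lscalapack -lblacs -llapack -lblas -lm
 * Run with e.g. 80-96 MPI ranks so the per-rank pieces stay small. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

extern void Cblacs_pinfo(int *mypnum, int *nprocs);
extern void Cblacs_get(int ictxt, int what, int *val);
extern void Cblacs_gridinit(int *ictxt, char *order, int nprow, int npcol);
extern void Cblacs_gridinfo(int ictxt, int *nprow, int *npcol,
                            int *myrow, int *mycol);
extern void Cblacs_gridexit(int ictxt);
extern void Cblacs_exit(int status);
extern int  numroc_(int *n, int *nb, int *iproc, int *isrcproc, int *nprocs);
extern void descinit_(int *desc, int *m, int *n, int *mb, int *nb,
                      int *irsrc, int *icsrc, int *ictxt, int *lld, int *info);
extern void pdpotrf_(char *uplo, int *n, double *a, int *ia, int *ja,
                     int *desca, int *info);
extern void pdsygst_(int *ibtype, char *uplo, int *n,
                     double *a, int *ia, int *ja, int *desca,
                     double *b, int *ib, int *jb, int *descb,
                     double *scale, int *info);

int main(void) {
    int n = 38381;            /* order in the failing range; 36927 works */
    int nb = 64;              /* block size -- a guess, ours may differ  */
    int izero = 0, ione = 1, ibtype = 1, info;
    int iam, nprocs, ictxt, nprow, npcol, myrow, mycol;
    char uplo = 'U';

    Cblacs_pinfo(&iam, &nprocs);
    nprow = (int)sqrt((double)nprocs);   /* squarest grid that divides */
    while (nprocs % nprow) --nprow;
    npcol = nprocs / nprow;
    Cblacs_get(-1, 0, &ictxt);
    Cblacs_gridinit(&ictxt, "Row", nprow, npcol);
    Cblacs_gridinfo(ictxt, &nprow, &npcol, &myrow, &mycol);

    /* local dimensions and descriptors for the block-cyclic layout */
    int mloc = numroc_(&n, &nb, &myrow, &izero, &nprow);
    int nloc = numroc_(&n, &nb, &mycol, &izero, &npcol);
    int lld  = mloc > 1 ? mloc : 1;
    int desca[9], descb[9];
    descinit_(desca, &n, &n, &nb, &nb, &izero, &izero, &ictxt, &lld, &info);
    descinit_(descb, &n, &n, &nb, &nb, &izero, &izero, &ictxt, &lld, &info);

    double *a = calloc((size_t)mloc * nloc, sizeof *a);
    double *b = calloc((size_t)mloc * nloc, sizeof *b);

    /* A = n*I (trivially symmetric), B = I (trivially SPD), filled via
     * the usual 0-based local-to-global block-cyclic index map */
    for (int jl = 0; jl < nloc; ++jl) {
        int jg = ((jl / nb) * npcol + mycol) * nb + jl % nb;
        for (int il = 0; il < mloc; ++il) {
            int ig = ((il / nb) * nprow + myrow) * nb + il % nb;
            if (ig == jg) {
                a[(size_t)jl * mloc + il] = (double)n;
                b[(size_t)jl * mloc + il] = 1.0;
            }
        }
    }

    /* PDSYGST requires B to contain its Cholesky factor from PDPOTRF */
    pdpotrf_(&uplo, &n, b, &ione, &ione, descb, &info);

    double scale;
    pdsygst_(&ibtype, &uplo, &n, a, &ione, &ione, desca,
             b, &ione, &ione, descb, &scale, &info);  /* hangs here for us */
    if (iam == 0) printf("pdsygst: info = %d, scale = %g\n", info, scale);

    free(a); free(b);
    Cblacs_gridexit(ictxt);
    Cblacs_exit(0);
    return 0;
}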

On Fri, Sep 19, 2008 at 4:54 PM, Dhabaleswar Panda
<panda at cse.ohio-state.edu> wrote:
> Thanks for your report. What is your computing platform, and how much
> memory do your computing nodes have (per processor/core)? Does your
> application (using the 38381x38381 PDSYGST configuration) require more
> memory than is available on these platforms? As you may know, if you
> run an application that requires more memory than is available, a lot
> of swapping will occur and your computation will not be able to make
> progress. I am not sure whether this is the situation you are
> encountering. Does this problem happen for this exact matrix size? Are
> you able to run your application with any larger matrix size?
>
> If it is a multi-core-based cluster, can you run your application using
> more nodes and fewer cores/node while keeping the total number of cores
> constant (such as 8 nodes with 4 cores/node vs. 4 nodes with
> 8 cores/node)? If the application runs with the first configuration but
> not with the second, that will show that you are constrained by the
> amount of memory available per core/node.
>
>
> Thanks,
>
> DK
>
> On Fri, 19 Sep 2008, Laurence Marks wrote:
>
>> I have a highly reproducible but so far untraceable problem. It could
>> be due to mvapich, but possibly not.
>>
>> In a code which calls the ScaLAPACK subroutine PDSYGST (on two
>> distributed matrices), it works fine if the matrices are 36927x36927;
>> if they are 38381x38381 it runs forever, i.e. until I kill it.
>>
>> This behavior occurs with Intel MKL versions 10.0.3.020, 10.0.4.023,
>> and 10.1.0.009, and with ifort/icc versions 10.1.015 and 10.1.018. It
>> occurs with both an April 2008 svn checkout of mvapich and one from a
>> few days ago, and with both OFED-1.2.5.5 and OFED-1.3.
>>
>> I would welcome any suggestions.
>>



-- 
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/IUCR_CED

