[mvapich-discuss] MVAPICH2 with HPL

Sayantan Sur surs at cse.ohio-state.edu
Thu May 20 09:10:41 EDT 2010


Hi Pradeep,

I tried out your HPL.dat input on our cluster with MVAPICH2-1.4.1 and the
ATLAS BLAS library. There was no problem and the run completed successfully.
Our nodes have Intel Xeon 5345 processors with 6 GB of memory per node.
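
For reference, an N=10000 problem only needs on the order of 0.75 GB for the
matrix in total, so it fits comfortably even within our 6 GB nodes. A quick
back-of-the-envelope check (assuming 8 bytes per double; the bc invocation
below is just one way to do the arithmetic):

$ echo "10000 * 10000 * 8 / (1024 * 1024)" | bc
762

That is roughly 762 MiB in total across all 24 processes, which is well under
1% of the 144 GB your three 48 GB nodes provide, so whatever is growing in
'top' is not the HPL arrays themselves.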

Let us know if you continue to face issues with the latest version.

Thanks.


On Wed, May 19, 2010 at 11:59 PM, Dhabaleswar Panda
<panda at cse.ohio-state.edu> wrote:
> You are running a two-year-old version of MVAPICH2 here. Can you try the
> latest stable version, 1.4.1 (or check out the 1.4 branch of the codebase
> to pick up the bug-fixes made after the 1.4.1 release), and let us know
> whether you still see similar issues.
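>
> As a minimal sketch, a rebuild against 1.4.1 could look like the following
> (the tarball name and install prefix here are illustrative; adjust the
> compiler settings to match your environment):
>
>   $ tar xzf mvapich2-1.4.1.tar.gz
>   $ cd mvapich2-1.4.1
>   $ ./configure --prefix=/software/usr/mpi/intel/mvapich2-1.4.1 \
>         --enable-sharedlibs=gcc CC=icc CXX=icpc F77=ifort F90=ifort
>   $ make && make install
>
> After that, point MPdir in the HPL makefile at the new installation prefix
> and relink xhpl.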
>
> Thanks,
>
> DK
>
> On Wed, 19 May 2010, pradeep sivakumar wrote:
>
>> Hello,
>>
>> I have been running HPL compiled with MVAPICH2-1.2p1 and the Intel MKL libraries, testing it on Intel Nehalem nodes with 8 cores and 48 GB of RAM per node. MVAPICH2 was configured as follows:
>>
>> $ ./configure --prefix=/software/usr/mpi/intel/mvapich2-1.2p1 --with-rdma=gen2 \
>>       --with-ib-include=/usr/include --with-ib-libpath=/usr/lib64 --enable-sharedlibs=gcc \
>>       CC="icc -i-dynamic" CXX="icpc -i-dynamic" F77="ifort -i-dynamic" F90="ifort -i-dynamic"
>>
>> The MPI part of the HPL makefile was modified to include:
>>
>> MPdir        = /software/usr/mpi/intel/mvapich2-1.2p1
>> MPinc        = -I$(MPdir)/include
>> MPlib        = $(MPdir)/lib/libmpich.a
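>>
>> Since MVAPICH2 was built with --enable-sharedlibs, linking HPL against the
>> shared MPI library should also work; a hypothetical alternative to the
>> static line above would be:
>>
>> MPlib        = -L$(MPdir)/lib -lmpich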
>>
>>
>> The runs have ranged from a problem size using about 1% of the available memory per node up to 85%. All of the test cases run out of memory far too soon. For example, a test case with N=10000 on 3 nodes (24 cores), which needs only about 1% of the available memory, runs out of memory within minutes. When I log in to the compute nodes and watch usage through 'top', the memory consumption climbs gradually until it exceeds the limit and crashes the node. The cluster does not have any swap space, so after the node crashes an examination of the .o file shows the message:
>>
>> rank 22 in job 1  qnode0371_42752   caused collective abort of all ranks
>>   exit status of rank 22: killed by signal 9
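>>
>> The "killed by signal 9" is consistent with the kernel OOM killer stepping
>> in; watching the kernel log on an affected node while the memory climbs
>> would confirm whether that is what happens, e.g.:
>>
>> $ dmesg | grep -i oom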
>>
>> I compared all of the failed MVAPICH2 runs against the same cases with HPL compiled against OpenMPI, and all of the OpenMPI runs completed successfully with no abnormal memory usage. Here is the HPL input file I have been using:
>>
>> > HPLinpack benchmark input file
>> > Innovative Computing Laboratory, University of Tennessee
>> > HPL.out      output file name (if any)
>> > 7            device out (6=stdout,7=stderr,file)
>> > 1      # of problems sizes (N)
>> > 10000  Ns
>> > 1            # of NBs
>> > 80     NBs
>> > 0            PMAP process mapping (0=Row-,1=Column-major)
>> > 1            # of process grids (P x Q)
>> > 4       Ps
>> > 6       Qs
>> > 8.0         threshold
>> > 1            # of panel fact
>> > 0 2 1        PFACTs (0=left, 1=Crout, 2=Right)
>> > 1            # of recursive stopping criterium
>> > 4 2          NBMINs (>= 1)
>> > 1            # of panels in recursion
>> > 2            NDIVs
>> > 1            # of recursive panel fact.
>> > 1 2 0        RFACTs (0=left, 1=Crout, 2=Right)
>> > 1            # of broadcast
>> > 0 3 1 2 4    BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
>> > 1            # of lookahead depth
>> > 0            DEPTHs (>=0)
>> > 2            SWAP (0=bin-exch,1=long,2=mix)
>> > 256          swapping threshold
>> > 0            L1 in (0=transposed,1=no-transposed) form
>> > 0            U  in (0=transposed,1=no-transposed) form
>> > 0            Equilibration (0=no,1=yes)
>> > 8            memory alignment in double (> 0)
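>>
>> (For what it is worth, the process grid P x Q = 4 x 6 = 24 matches the 24
>> MPI processes on the three 8-core nodes.)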
>>
>> I don't know what might be going wrong, but if anyone has any advice or suggestions, please let me know. I appreciate any help. Thanks.
>>
>> Pradeep
>>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>



-- 
Sayantan Sur

Research Scientist
Department of Computer Science
The Ohio State University.


