[mvapich-discuss] MVAPICH2 with HPL

Dhabaleswar Panda panda at cse.ohio-state.edu
Wed May 19 23:59:49 EDT 2010


You are running a two-year-old version of MVAPICH2 here. Can you try the
latest stable version 1.4.1 (check out the 1.4 branch version of the
codebase to get the recent bug-fixes made after the 1.4.1 release) and
let us know whether you still see similar issues?

Thanks,

DK

On Wed, 19 May 2010, pradeep sivakumar wrote:

> Hello,
>
> I have been running HPL compiled with MVAPICH2-1.2p1 and the Intel MKL libraries, testing it on Intel Nehalem nodes with 8 cores/node and 48GB RAM/node. MVAPICH2 was configured as follows:
>
> $ ./configure --prefix=/software/usr/mpi/intel/mvapich2-1.2p1 --with-rdma=gen2 --with-ib-include=/usr/include --with-ib-libpath=/usr/lib64 --enable-sharedlibs=gcc CC="icc -i-dynamic" CXX="icpc -i-dynamic" F77="ifort -i-dynamic" F90="ifort -i-dynamic"
>
> and the MPI part of the HPL makefile was modified to include:
>
> MPdir        = /software/usr/mpi/intel/mvapich2-1.2p1
> MPinc        = -I$(MPdir)/include
> MPlib        = $(MPdir)/lib/libmpich.a
>
>
> The runs have ranged from a problem size of 1% of the memory available per node up to 85%. All of the test cases run out of memory far too soon. For example, a test case with N=10000 on 3 nodes (24 cores), which should use only about 1% of available memory, runs out of memory within minutes. When I log in to the compute nodes and watch resource usage through 'top', memory usage climbs gradually until it exceeds the limit and crashes the node. The cluster has no swap space, so after a node crashes, an examination of the .o file shows the message:
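As a back-of-the-envelope check of the "about 1%" claim above, HPL's dominant allocation is the N-by-N double-precision matrix, roughly 8*N^2 bytes spread across all nodes. The helper below is my own rough sketch (the function name and the 8-bytes-per-double assumption are mine, not from the original message):

```python
# Rough HPL sizing sketch: estimate what fraction of aggregate cluster
# memory an N x N double-precision matrix would occupy.
def hpl_memory_fraction(n, nodes, gb_per_node):
    total_bytes = nodes * gb_per_node * 1024**3   # aggregate RAM in bytes
    matrix_bytes = 8 * n * n                      # 8 bytes per double
    return matrix_bytes / total_bytes

# The failing case: N=10000 on 3 nodes with 48 GB each.
frac = hpl_memory_fraction(10000, 3, 48)
print(f"{frac:.2%}")  # well under 1% of aggregate memory
```

With these numbers the matrix is about 0.8 GB total, so a gradual climb past 48 GB on a node points at a leak or buffer growth in the MPI layer rather than at the problem size itself.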
>
> rank 22 in job 1  qnode0371_42752   caused collective abort of all ranks
>   exit status of rank 22: killed by signal 9
>
> I compared all of the failed MVAPICH2 runs against the same HPL binary compiled with OpenMPI; all of the OpenMPI runs completed successfully with no abnormal memory usage. Here is the HPL input file I have been using:
>
> > HPLinpack benchmark input file
> > Innovative Computing Laboratory, University of Tennessee
> > HPL.out      output file name (if any)
> > 7            device out (6=stdout,7=stderr,file)
> > 1      # of problems sizes (N)
> > 10000  Ns
> > 1            # of NBs
> > 80     NBs
> > 0            PMAP process mapping (0=Row-,1=Column-major)
> > 1            # of process grids (P x Q)
> > 4       Ps
> > 6       Qs
> > 8.0         threshold
> > 1            # of panel fact
> > 0 2 1        PFACTs (0=left, 1=Crout, 2=Right)
> > 1            # of recursive stopping criterium
> > 4 2          NBMINs (>= 1)
> > 1            # of panels in recursion
> > 2            NDIVs
> > 1            # of recursive panel fact.
> > 1 2 0        RFACTs (0=left, 1=Crout, 2=Right)
> > 1            # of broadcast
> > 0 3 1 2 4    BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
> > 1            # of lookahead depth
> > 0            DEPTHs (>=0)
> > 2            SWAP (0=bin-exch,1=long,2=mix)
> > 256          swapping threshold
> > 0            L1 in (0=transposed,1=no-transposed) form
> > 0            U  in (0=transposed,1=no-transposed) form
> > 0            Equilibration (0=no,1=yes)
> > 8            memory alignment in double (> 0)
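One quick sanity check on the input file above: HPL requires the process grid P x Q to equal the number of MPI ranks the job is launched with. A minimal sketch (the helper name is mine):

```python
# Verify that the HPL process grid covers exactly the launched MPI ranks.
def grid_matches_ranks(p, q, nodes, cores_per_node):
    return p * q == nodes * cores_per_node

# The reported run: a 4 x 6 grid on 3 nodes of 8 cores = 24 ranks.
print(grid_matches_ranks(4, 6, 3, 8))  # True
```

Here the grid is consistent (4 x 6 = 24 ranks), so a mismatched P x Q can be ruled out as the cause.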
>
> I don't know what might be going wrong, but if anyone has any advice or suggestions then please let me know. I appreciate any help. Thanks.
>
> Pradeep
>
>
>
>
>
>



More information about the mvapich-discuss mailing list