[mvapich-discuss] MVAPICH2 with HPL

pradeep sivakumar pradeep-sivakumar at northwestern.edu
Wed May 19 17:53:55 EDT 2010


Hello,

I have been running HPL compiled with MVAPICH2-1.2p1 and the Intel MKL libraries, testing it on Intel Nehalem nodes with 8 cores and 48 GB of RAM per node. MVAPICH2 was configured as follows:

$ ./configure --prefix=/software/usr/mpi/intel/mvapich2-1.2p1 --with-rdma=gen2 --with-ib-include=/usr/include --with-ib-libpath=/usr/lib64 --enable-sharedlibs=gcc CC=icc -i-dynamic CXX=icpc -i-dynamic F77=ifort -i-dynamic F90=ifort -i-dynamic

and the MPI part of the HPL makefile was modified to include:

MPdir        = /software/usr/mpi/intel/mvapich2-1.2p1
MPinc        = -I$(MPdir)/include
MPlib        = $(MPdir)/lib/libmpich.a


The runs have covered problem sizes ranging from 1% to 85% of the memory available per node. Every test case fails by running out of memory too soon. For example, a test case with N=10000 on 3 nodes (24 cores), which should need only about 1% of the available memory, runs out of memory within minutes. When I log in to the compute nodes and watch the job with 'top', memory usage climbs gradually until it exceeds the limit and crashes the node. The cluster does not have any swap space. After the node crashes, the .o file shows the message:

rank 22 in job 1  qnode0371_42752   caused collective abort of all ranks
  exit status of rank 22: killed by signal 9
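
As a rough sanity check on the "about 1%" figure for the N=10000 case, the matrix footprint can be estimated as below. This is only a sketch: it assumes the N x N double-precision matrix dominates HPL's memory use and that 1 GB = 10**9 bytes, and the function names are made up for illustration.

```python
# Rough estimate of the HPL matrix footprint for the N=10000 case.
# Assumptions (not from the original post): 1 GB = 10**9 bytes, and the
# N x N double-precision matrix dominates HPL's memory use.

BYTES_PER_DOUBLE = 8


def matrix_bytes(n):
    """Bytes needed to store an n x n matrix of doubles."""
    return n * n * BYTES_PER_DOUBLE


def memory_fraction(n, nodes, gb_per_node):
    """Fraction of the total memory across all nodes taken by the matrix."""
    total_bytes = nodes * gb_per_node * 10**9
    return matrix_bytes(n) / total_bytes


if __name__ == "__main__":
    n, nodes, gb_per_node = 10000, 3, 48
    print(f"matrix: {matrix_bytes(n) / 10**9:.2f} GB")
    print(f"fraction of memory: {memory_fraction(n, nodes, gb_per_node):.2%}")
```

Under these assumptions the whole matrix is under 1 GB across the three nodes, so memory growth to the 48 GB per-node limit cannot be explained by the problem size itself.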

I repeated all of the failed MVAPICH2 runs with HPL compiled against OpenMPI, and all of those runs completed successfully with no abnormal memory usage. Here is the HPL input file I have been using:

> HPLinpack benchmark input file
> Innovative Computing Laboratory, University of Tennessee
> HPL.out      output file name (if any)
> 7            device out (6=stdout,7=stderr,file)
> 1      # of problems sizes (N)
> 10000  Ns
> 1            # of NBs
> 80     NBs
> 0            PMAP process mapping (0=Row-,1=Column-major)
> 1            # of process grids (P x Q)
> 4       Ps
> 6       Qs
> 8.0         threshold
> 1            # of panel fact
> 0 2 1        PFACTs (0=left, 1=Crout, 2=Right)
> 1            # of recursive stopping criterium
> 4 2          NBMINs (>= 1)
> 1            # of panels in recursion
> 2            NDIVs
> 1            # of recursive panel fact.
> 1 2 0        RFACTs (0=left, 1=Crout, 2=Right)
> 1            # of broadcast
> 0 3 1 2 4    BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
> 1            # of lookahead depth
> 0            DEPTHs (>=0)
> 2            SWAP (0=bin-exch,1=long,2=mix)
> 256          swapping threshold
> 0            L1 in (0=transposed,1=no-transposed) form
> 0            U  in (0=transposed,1=no-transposed) form
> 0            Equilibration (0=no,1=yes)
> 8            memory alignment in double (> 0)
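
For reference, the process grid in this input file can be checked against the job size with a few lines. This is a minimal sketch: the node and core counts are taken from the description above, the even split of the matrix across ranks is a simplifying assumption, and the helper names are hypothetical.

```python
# Grid check for the HPL input file (P=4, Q=6, N=10000).
# The even split of the matrix across ranks is a simplifying assumption;
# HPL's block-cyclic layout gives only approximately equal shares.


def grid_ranks(p, q):
    """Number of MPI ranks required by a P x Q process grid."""
    return p * q


def per_rank_bytes(n, p, q, bytes_per_double=8):
    """Approximate matrix bytes held by each rank, assuming an even split."""
    return n * n * bytes_per_double / (p * q)


if __name__ == "__main__":
    n, p, q = 10000, 4, 6
    nodes, cores_per_node = 3, 8
    # The grid should use exactly one rank per core.
    assert grid_ranks(p, q) == nodes * cores_per_node
    mb = per_rank_bytes(n, p, q) / 10**6
    print(f"{grid_ranks(p, q)} ranks, ~{mb:.1f} MB of matrix per rank")
```

The 4 x 6 grid matches the 24 cores in use, and each rank's share of the matrix is on the order of tens of megabytes.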

I don't know what might be going wrong, but if anyone has any advice or suggestions then please let me know. I appreciate any help. Thanks.

Pradeep




