[mvapich-discuss] HPL with mvapich2-1.0.1 issue.

yogeshwar sonawane yogyas at gmail.com
Sun Jul 13 07:22:37 EDT 2008


Hi all,

I am using mvapich2-1.0.1 with uDAPL as the configured device, with
default settings such as shared-memory support. I am running HPL
compiled against these MPI binaries. HPL version 1.0a, downloaded from
www.netlib.org, is used, along with ATLAS-3.8.1, which HPL requires.
I am using uDAPL from OFED-1.2 on an IB card. When I run HPL with
16 processes on a single node (quad-core, quad-socket, with 64 GB of
RAM), the machine gets stuck/hangs after the first HPL reading. This
is not a kernel-panic condition.

I made some observations. The HPL problem sizes correspond to 65%,
70%, and 75% of total memory, with multiple NB values. When HPL is
started, everything is smooth, and around 14 GB out of 64 GB is free.
After the first reading for some combination is displayed, memory
usage increases to full, and then swap space is also consumed nearly
to full. This happens very quickly, and the machine then becomes
unresponsive to commands, although it can still be pinged from other
nodes. After around 2 hours, HPL exited with the error "caused
collective abort of all ranks  exit status of rank 14: killed by
signal 9", and there were kernel messages "out of memory, killing
xhpl...".
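As a sanity check on those sizes: the usual rule of thumb for HPL is N ≈ sqrt(fraction × memory_bytes / 8), since the N×N matrix stores double-precision (8-byte) elements. A minimal sketch, assuming the 64 GB node described above (the exact Ns in my HPL.dat may differ slightly):

```python
import math

def hpl_problem_size(mem_bytes, fraction):
    """Largest N whose N x N double-precision matrix fits in the
    given fraction of memory (8 bytes per element)."""
    return int(math.sqrt(fraction * mem_bytes / 8))

mem = 64 * 1024**3  # 64 GB node
for frac in (0.65, 0.70, 0.75):
    print(f"{frac:.0%} of memory -> N ~ {hpl_problem_size(mem, frac)}")
```

At 75% of 64 GB this gives N around 80,000, so a matrix alone near 48 GB; with only ~14 GB free at launch, any extra per-combination memory growth quickly exhausts RAM and swap.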

Multiple runs with different N have shown similar behaviour. One
point to note: the problem starts only after the first reading. I
tried providing an HPL.dat that produces only a single reading, and
that run was successful. I did multiple such runs of HPL, each
producing only a single reading/combination, and all were successful.
The problem seems to appear only when an HPL.dat with multiple
combinations/readings is used.
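For reference, the single-reading workaround corresponds to an HPL.dat along the lines of the sketch below (standard netlib HPL input format; the specific N, NB, and P×Q values are illustrative, not my exact ones). Listing several entries on the Ns or NBs lines is what produces multiple readings per run:

```
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout, 7=stderr, file)
1            # of problem sizes (N)   <- 1 => single reading
75000        Ns
1            # of NBs                 <- 1 => single NB
128          NBs
0            PMAP process mapping (0=Row-, 1=Column-major)
1            # of process grids (P x Q)
4            Ps
4            Qs
```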

I did the same HPL run with MVAPICH2-1.0.1 compiled for TCP/IP, again
on a single node. This run was successful: all readings were
displayed, there was no swap usage, and HPL closed normally.

Can anybody help me solve this issue?
Any links or references are welcome.

I am not sure whether this list is the correct place for an
HPL-related query, so kindly guide me on that as well.

Thanks,
Yogeshwar
