[mvapich-discuss] performance problems with gath/scat
Dan Kokron
daniel.kokron at nasa.gov
Thu Jul 29 14:29:35 EDT 2010
Max Suarez asked me to respond to your questions and provide any support
necessary to enable us to effectively use MVAPICH2 with our
applications.
We first noticed issues with performance when scaling the GEOS5 GCM to
720 processes. We had been using Intel MPI (3.2.x) before switching to
MVAPICH2 (1.4.1). Walltimes (hh:mm:ss) for a test case are as follows
for 256p, 512p, and 720p using the indicated MPI library. All codes were compiled
with the Intel-11.0.083 suite of compilers. I have attached a text file
with hardware and software stack information for the platform used in
these tests (discover.HW_SWstack).
GCM application run wall time (hh:mm:ss)

Procs   mv2-1.4.1   iMPI-3.2.2.006   mv2-1.5-2010-07-22
 256    00:23:45    00:15:53         00:22:57
 512    00:26:45    00:11:06         00:13:58
 720    00:43:12    00:11:28         00:16:15
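For scale, the 720p row works out to roughly a 3.8x slowdown for mv2-1.4.1 relative to Intel MPI, and about 1.4x for the 1.5 snapshot. A quick sketch of that arithmetic, using the hh:mm:ss values from the table:

```python
# Convert the hh:mm:ss walltimes above to seconds and compare the 720p runs.
def to_seconds(hms):
    h, m, s = (int(x) for x in hms.split(":"))
    return 3600 * h + 60 * m + s

mv2_141 = to_seconds("00:43:12")   # mv2-1.4.1: 2592 s
impi    = to_seconds("00:11:28")   # Intel MPI:  688 s
mv2_15  = to_seconds("00:16:15")   # mv2-1.5:    975 s

print(f"mv2-1.4.1 vs Intel MPI: {mv2_141 / impi:.2f}x")  # 3.77x
print(f"mv2-1.5   vs Intel MPI: {mv2_15 / impi:.2f}x")   # 1.42x
```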
The test with the mv2-1.5 nightly snapshot was run at your suggestion.
Next I instrumented the application with TAU
(http://www.cs.uoregon.edu/research/tau/home.php) to get subroutine
level timings.
Results from the 256p, 512p, and 720p runs show that the performance
difference between Intel MPI and MVAPICH2-1.5 can be accounted for by
collective operations, specifically MPI_Scatterv, MPI_Gatherv, and
MPI_Allgatherv.
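For context, these vector-variant collectives move a different amount of data to or from each rank, described by per-rank counts and displacements arrays at the root. A minimal sketch (plain Python, no MPI; the sizes are illustrative, not the GEOS5 decomposition) of the sendcounts/displs a root would pass to MPI_Scatterv when the data doesn't divide evenly:

```python
# Sketch: the sendcounts/displs arrays a root rank would hand to
# MPI_Scatterv when n elements don't divide evenly among nprocs ranks.
# The sizes here are illustrative, not the GEOS5 decomposition.
def scatterv_layout(n, nprocs):
    base, extra = divmod(n, nprocs)
    # The first `extra` ranks each receive one extra element.
    sendcounts = [base + (1 if r < extra else 0) for r in range(nprocs)]
    # Each rank's chunk starts where the previous rank's chunk ends.
    displs = [sum(sendcounts[:r]) for r in range(nprocs)]
    return sendcounts, displs

counts, displs = scatterv_layout(10, 4)
print(counts)  # [3, 3, 2, 2]
print(displs)  # [0, 3, 6, 8]
```

Because the per-rank counts are uneven, the cost of these calls depends on the decomposition, which is why they show up so prominently in the TAU profiles.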
Any suggestions for further tuning of mv2-1.5 for our particular needs
would be appreciated.
Dan
On Fri, 2010-07-23 at 15:57 -0500, Dhabaleswar Panda wrote:
> Hi Max,
>
> Thanks for your note.
>
> > We are having serious performance problems
> > with collectives when using several hundred cores
> > on the Discover system at NASA Goddard.
>
> Could you please let us know some more details on the performance problems
> you are observing - which collectives, what data sizes, what system sizes,
> etc.?
>
> > I noticed some fixes were made to collectives in 1.5.
> > Would these help with scat/gath?
>
> In 1.5, in addition to some fixes in collectives, several thresholds were
> changed for point-to-point operations (based on platform and adapter
> characteristics) to obtain better performance. These changes will also
> have a positive impact on the performance of collectives.
>
> Thus, I would suggest you upgrade to 1.5 first. If the performance
> issues for collectives still remain, we will be happy to debug this issue
> further.
>
> > I noticed a couple of months ago someone reporting
> > very poor performance in global sums:
> >
> > http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2010-June/002876.html
> >
> > But the thread ends unresolved.
>
> Since the 1.5 release procedure was getting overlapped with the
> examination of this issue, we got context-switched. We will take a closer
> look at this issue with 1.5 version.
>
> > Has anyone else had these problems?
>
> Thanks,
>
> DK
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
--
Dan Kokron
Global Modeling and Assimilation Office
NASA Goddard Space Flight Center
Greenbelt, MD 20771
Daniel.S.Kokron at nasa.gov
Phone: (301) 614-5192
Fax: (301) 614-5304
-------------- next part --------------
ifort -V
Intel(R) Fortran Intel(R) 64 Compiler Professional for applications running on Intel(R) 64, Version 11.0 Build 20090318 Package ID: l_cprof_p_11.0.083
uname -a
Linux borgk126 2.6.16.60-0.42.5-smp #1 SMP Mon Aug 24 09:41:41 UTC 2009 x86_64 x86_64 x86_64 GNU/Linux
mvapich2-1.4.1 configured by the system group. They claim to have used 'defaults' for everything.
mvapich2-1.4-2010-05-25
./configure CC=icc CXX=icpc F77=ifort F90=ifort CFLAGS="-fpic -DRDMA_CM" CXXFLAGS="-fpic -DRDMA_CM" FFLAGS=-fpic F90FLAGS=-fpic --prefix=/discover/nobackup/dkokron/mv2-1.4.1_11.0.083 --enable-f77 --enable-f90 --enable-cxx --enable-mpe --enable-romio --enable-threads=multiple --with-rdma=gen2
mvapich2-1.5-2010-07-22
./configure CC=icc CXX=icpc F77=ifort F90=ifort CFLAGS="-fpic -DRDMA_CM" CXXFLAGS="-fpic -DRDMA_CM" FFLAGS=-fpic F90FLAGS=-fpic --prefix=/discover/nobackup/dkokron/mv2-1.5_11.0.083 --enable-f77 --enable-f90 --enable-cxx --enable-mpe --enable-romio --enable-threads=default --with-rdma=gen2 --with-hwloc
------------------------------------------------------------------------------------------------------------------------
Each node has two Nehalem sockets with four cores each. The nodes are connected to each other via DDR InfiniBand.
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5560 @ 2.80GHz
stepping : 5
cpu MHz : 2800.184
cache size : 8192 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr dca popcnt lahf_lm
bogomips : 5605.36
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
ibv_devinfo -v
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.6.648
node_guid: 0002:c903:0004:0f54
sys_image_guid: 0002:c903:0004:0f57
vendor_id: 0x02c9
vendor_part_id: 26418
hw_ver: 0xA0
board_id: IBM0010110008
phys_port_cnt: 2
max_mr_size: 0xffffffffffffffff
page_size_cap: 0xfffffe00
max_qp: 260032
max_qp_wr: 16351
device_cap_flags: 0x00fc9c76
max_sge: 32
max_sge_rd: 0
max_cq: 65408
max_cqe: 4194303
max_mr: 524272
max_pd: 32764
max_qp_rd_atom: 16
max_ee_rd_atom: 0
max_res_rd_atom: 4160512
max_qp_init_rd_atom: 128
max_ee_init_rd_atom: 0
atomic_cap: ATOMIC_HCA (1)
max_ee: 0
max_rdd: 0
max_mw: 0
max_raw_ipv6_qp: 0
max_raw_ethy_qp: 2
max_mcast_grp: 8192
max_mcast_qp_attach: 56
max_total_mcast_qp_attach: 458752
max_ah: 0
max_fmr: 0
max_srq: 65472
max_srq_wr: 16383
max_srq_sge: 31
max_pkeys: 128
local_ca_ack_delay: 15
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 1211
port_lid: 209
port_lmc: 0x00
link_layer: IB
max_msg_sz: 0x40000000
port_cap_flags: 0x02510868
max_vl_num: 8 (4)
bad_pkey_cntr: 0x0
qkey_viol_cntr: 0x0
sm_sl: 0
pkey_tbl_len: 128
gid_tbl_len: 128
subnet_timeout: 18
init_type_reply: 0
active_width: 4X (2)
active_speed: 5.0 Gbps (2)
phys_state: LINK_UP (5)
GID[ 0]: fe80:0000:0000:0000:0002:c903:0004:0f55
port: 2
state: PORT_DOWN (1)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: IB
max_msg_sz: 0x40000000
port_cap_flags: 0x02510868
max_vl_num: 8 (4)
bad_pkey_cntr: 0x0
qkey_viol_cntr: 0x0
sm_sl: 0
pkey_tbl_len: 128
gid_tbl_len: 128
subnet_timeout: 0
init_type_reply: 0
active_width: 4X (2)
active_speed: invalid speed (3)
phys_state: POLLING (2)
GID[ 0]: fe80:0000:0000:0000:0002:c903:0004:0f56
ofed_info
OFED-1.5.1:
compat-dapl:
http://www.openfabrics.org/downloads/dapl/compat-dapl-1.2.16.tar.gz
dapl:
http://www.openfabrics.org/downloads/dapl/dapl-2.0.27.tar.gz
ib-bonding:
http://www.openfabrics.org/~monis/ofed_1_5/ib-bonding-0.9.0-42.src.rpm
ibsim:
http://www.openfabrics.org/downloads/ibsim/ibsim-0.5-0.1.g327c3d8.tar.gz
ibutils:
http://www.openfabrics.org/downloads/ibutils/ibutils-1.5.4-0.1.g0464fe6.tar.gz
infiniband-diags:
http://www.openfabrics.org/downloads/management/infiniband-diags-1.5.5.tar.gz
libcxgb3:
http://www.openfabrics.org/downloads/cxgb3/libcxgb3-1.2.5.tar.gz
libehca:
http://www.openfabrics.org/downloads/libehca/libehca-1.2.1-0.1.g0a82a52.tar.gz
libibcm:
http://www.openfabrics.org/downloads/rdmacm/libibcm-1.0.5.tar.gz
libibmad:
http://www.openfabrics.org/downloads/management/libibmad-1.3.4.tar.gz
libibumad:
http://www.openfabrics.org/downloads/management/libibumad-1.3.4.tar.gz
libibverbs:
http://www.openfabrics.org/downloads/libibverbs/libibverbs-1.1.3-0.6.g932f1a2.tar.gz
libipathverbs:
http://www.openfabrics.org/downloads/libipathverbs/libipathverbs-1.2.tar.gz
libmlx4:
http://www.openfabrics.org/downloads/libmlx4/libmlx4-1.0-0.7.g2432360.tar.gz
libmthca:
http://www.openfabrics.org/downloads/libmthca/libmthca-1.0.5-0.1.gbe5eef3.tar.gz
libnes:
http://www.openfabrics.org/downloads/nes/libnes-1.0.1.tar.gz
librdmacm:
http://www.openfabrics.org/downloads/rdmacm/librdmacm-1.0.11.tar.gz
libsdp:
http://www.openfabrics.org/downloads/libsdp/libsdp-1.1.100-0.1.g920ea31.tar.gz
mpi-selector:
http://www.openfabrics.org/downloads/mpi-selector/mpi-selector-1.0.3-1.src.rpm
mpitests:
http://www.openfabrics.org/~pasha/ofed_1_5/mpitests/mpitests-3.2-916.src.rpm
mstflint:
http://www.openfabrics.org/downloads/mstflint/mstflint-1.4-0.3.gf304647.tar.gz
mvapich:
http://www.openfabrics.org/~pasha/ofed_1_5_1/mvapich/mvapich-1.2.0-3635.src.rpm
mvapich2:
http://www.openfabrics.org/~perkinjo/ofed_1_5/mvapich2-1.4.1-1.src.rpm
ofa_kernel:
git://git.openfabrics.org/ofed_1_5/linux-2.6.git ofed_kernel_1_5
commit 17badd753b40fb1046dc2e5474739357a921fb86
ofed-docs:
git://git.openfabrics.org/~tziporet/docs.git ofed_1_5
commit 4b7a81073731c630427e97a3013efd6cafa537ac
open-iscsi-generic1:
http://www.openfabrics.org/downloads/iscsi/open-iscsi-generic-2.0-754.1.src.rpm
open-iscsi-generic2:
http://www.openfabrics.org/downloads/iscsi/open-iscsi-generic-2.0-869.2.src.rpm
openmpi:
http://www.openfabrics.org/~jsquyres/ofed_1_5/openmpi-1.4.1-2ofed.src.rpm
opensm:
http://www.openfabrics.org/downloads/management/opensm-3.3.5.tar.gz
perftest:
http://www.openfabrics.org/downloads/perftest/perftest-1.2.3-0.10.g90b10d8.tar.gz
qlvnictools:
http://www.openfabrics.org/downloads/qlvnictools/qlvnictools-0.0.1-0.1.ge27eef7.tar.gz
qperf:
http://www.openfabrics.org/downloads/qperf/qperf-0.4.6-0.1.gb81434e.tar.gz
rds-tools:
http://www.openfabrics.org/~vlad/ofed_1_5/rds-tools/rds-tools-1.5-1.src.rpm
rnfs-utils:
http://www.openfabrics.org/~swise/ofed_1_5/rnfs-utils/rnfs-utils-1.1.5-10.OFED.src.rpm
sdpnetstat:
http://www.openfabrics.org/downloads/sdpnetstat/sdpnetstat-1.60-0.2.g8844f04.tar.gz
srptools:
http://www.openfabrics.org/downloads/srptools/srptools-0.0.4-0.1.gce1f64c.tar.gz
tgt-generic:
http://www.openfabrics.org/downloads/iscsi/tgt-generic-0.1-20080828.src.rpm
ofed-scripts:
git://git.openfabrics.org/~vlad/ofed_scripts.git ofed_1_5