[mvapich-discuss] Slow MV2_USE_SHMEM_COLL

Marcin Rogowski marcin.rogowski at gmail.com
Sat Oct 10 04:19:15 EDT 2015


Thank you for the quick reply, Jonathan.

The tests I ran before were with MVAPICH2 v2.0.1 on a cluster based on
24-core Xeons. Today I tried the same with MVAPICH2 v2.1 on a 20-core
Xeon-based machine and was able to reproduce the same behavior. Below are
the details of both machines and of the MPI builds:

$ mpiname -a
MVAPICH2 2.0.1 Thu Oct 30 20:00:00 EDT 2014 ch3:psm
Compilation
CC: icc    -g -O3
CXX: icpc   -g
F77: ifort   -g -O3
FC: ifort   -g

Configuration
--prefix=/apps/intel15/mvapich2/2.0.1 --with-device=ch3:psm --enable-romio
--enable-fast=O3 --enable-g=dbg --enable-sharedlibs=gcc --enable-debuginfo
--enable-shared --with-file-system=ufs+nfs CC=icc CXX=icpc FC=ifort
F77=ifort

$ mpiname -a
MVAPICH2 2.1 Fri Apr 03 20:00:00 EDT 2015 ch3:psm
Compilation
CC: icc    -g -O3
CXX: icpc   -g -O3
F77: ifort   -g -O3
FC: ifort   -g -O3

Configuration
--prefix=/apps/intel15/mvapich2/2.1 --with-device=ch3:psm --enable-romio
--enable-fast=O3 --enable-g=dbg --enable-sharedlibs=gcc --enable-debuginfo
--enable-shared --with-file-system=ufs+nfs CC=icc CXX=icpc FC=ifort
F77=ifort

20-core cluster

processor       : 19
vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
stepping        : 4
cpu MHz         : 2793.070
cache size      : 25600 KB
physical id     : 1
siblings        : 10
core id         : 12
cpu cores       : 10
apicid          : 56
initial apicid  : 56
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology
nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2
ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt
pln pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips        : 5585.73
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:


24-core cluster

processor       : 23
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
stepping        : 2
cpu MHz         : 2501.000
cache size      : 30720 KB
physical id     : 1
siblings        : 12
core id         : 13
cpu cores       : 12
apicid          : 58
initial apicid  : 58
fpu             : yes
fpu_exception   : yes
cpuid level     : 15
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology
nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2
ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb
xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase bmi1
avx2 smep bmi2 erms invpcid
bogomips        : 4999.27
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
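

For reference, the timed test is essentially the following (a minimal
sketch of the test case described in my original message below; the exact
code differs, and the names here are illustrative):

program gather_test
  use mpi
  implicit none
  integer, parameter :: niter = 51, msglen = 16
  character(len=msglen) :: sendbuf
  character(len=msglen), allocatable :: recvbuf(:)
  integer :: ierr, rank, nprocs, i
  double precision :: t0, t1

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  allocate(recvbuf(nprocs))
  sendbuf = 'x'

  call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! start all ranks together
  t0 = MPI_Wtime()
  do i = 1, niter
     ! gather 16 characters from every rank to rank 0
     call MPI_Gather(sendbuf, msglen, MPI_CHARACTER, &
                     recvbuf, msglen, MPI_CHARACTER, &
                     0, MPI_COMM_WORLD, ierr)
  end do
  t1 = MPI_Wtime()

  if (rank == 0) then
     print *, 'took', t1 - t0, 'seconds', (t1 - t0) / niter, 'per gather'
  end if
  call MPI_Finalize(ierr)
end program gather_test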


Thank you.


Regards,
Marcin Rogowski

On Fri, Oct 9, 2015 at 8:02 PM Jonathan Perkins <perkinjo at cse.ohio-state.edu>
wrote:

> In addition to the info requested previously, can you also try using
> MVAPICH2 v2.1 to see if the results are still degraded or not?
>
>
> On Fri, Oct 9, 2015 at 11:49 AM Jonathan Perkins <
> perkinjo at cse.ohio-state.edu> wrote:
>
>> Thanks for your note, Marcin.  We'll discuss this issue and get back to
>> you.
>>
>> In the meantime, can you send more information, such as the CPU model
>> name as reported by `cat /proc/cpuinfo`, as well as the options used to
>> build MVAPICH2 as reported by `mpiname -a`?  Thanks in advance.
>>
>> On Fri, Oct 9, 2015 at 11:39 AM Marcin Rogowski <
>> marcin.rogowski at gmail.com> wrote:
>>
>>>
>>> Hello,
>>>
>>> I have been trying to diagnose what causes a huge slowdown of one part
>>> of our application between MVAPICH2 1.9 and 2.0.1, and eventually came
>>> up with a test case that simply calls MPI_Gather of 16 MPI_CHARACTERs
>>> to process 0. Timings over 51 iterations are the following:
>>>
>>> (setenv MV2_USE_SHMEM_COLL 1)
>>> $ mpirun_rsh -np 3000 -hostfile nodes ./a.out
>>> took 20.7183160781860 seconds 0.406241491729138 per gather
>>>
>>> (setenv MV2_USE_SHMEM_COLL 0)
>>> $ mpirun_rsh -np 3000 -hostfile nodes ./a.out
>>> took 2.943396568298340E-002 seconds 5.771365820192823E-004 per gather
>>>
>>>
>>> Interestingly, if the hostfile contains unique host names (by default we
>>> use node1 repeated 'cpu cores' times, followed by node2, etc.), the
>>> observation does not hold - both gathers are fast.
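>>>
>>> For illustration (node names are examples only), the default hostfile
>>> repeats each host once per core:
>>>
>>> node1
>>> node1
>>> ... ('cpu cores' lines per node)
>>> node2
>>> node2
>>> ...
>>>
>>> while the unique-host variant lists each node once:
>>>
>>> node1
>>> node2
>>> ...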
>>>
>>> The problem does not seem to appear before release 2.0.1. An easy
>>> solution would be to disable the collective shared-memory optimizations
>>> or to use unique host lists; however, both workarounds slow down other
>>> parts of the application, on average exactly offsetting the benefits.
>>>
>>> Please let me know if you would like any further details of our cluster
>>> environment (24-core Xeons with QLogic InfiniBand). I would be really
>>> grateful if you could share any ideas and/or solutions as to what could
>>> be causing our problems and help us achieve optimal performance.
>>>
>>> Thank you.
>>>
>>>
>>> Regards,
>>> Marcin Rogowski
>>> Saudi Aramco
>>>
>>