[mvapich-discuss] Slow MV2_USE_SHMEM_COLL

Akshay Venkatesh akshay at cse.ohio-state.edu
Sun Oct 11 01:41:08 EDT 2015


Dr. Panda,

The version of the library Marcin is using was tuned for a system with 8 cores
per node, so the default tuning tables may not be ideal for his 24-core nodes,
which would explain the degradation. We could likely provide a patch with
better tuning if we had access to his system at a large enough scale.
Should we ask him for access?
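
In the meantime, assuming the timings Marcin reports below generalize to his
application, one interim workaround would be to pin the gather algorithm to one
of the fast options explicitly while keeping the shared-memory collectives
enabled, for example:

$ mpirun_rsh -np 3000 -hostfile nodes MV2_USE_SHMEM_COLL=1 MV2_INTER_GATHER_TUNING=1 ./a.out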


On Sun, Oct 11, 2015 at 12:50 AM, Marcin Rogowski <marcin.rogowski at gmail.com> wrote:

> Hello Akshay,
>
> I checked, and it seems that only one MV2_INTER_GATHER_TUNING option (2) is
> slow:
>
> INTER_GATHER_TUNING 1 USE_SHMEM_COLL 1
> $ mpirun_rsh -np 3000 -hostfile nodes ./a.out
>  took  0.270218133926392      seconds  5.298394782870424E-003 per gather
>
> INTER_GATHER_TUNING 2 USE_SHMEM_COLL 1
> $ mpirun_rsh -np 3000 -hostfile nodes ./a.out
>  took   23.2984428405762      seconds  0.456832212560317      per gather
>
> INTER_GATHER_TUNING 3 USE_SHMEM_COLL 1
> $ mpirun_rsh -np 3000 -hostfile nodes ./a.out
>  took  0.377309799194336      seconds  7.398231356751685E-003 per gather
>
>
> As expected, there is no effect when shared-memory optimizations are off:
>
> INTER_GATHER_TUNING 1 USE_SHMEM_COLL 0
> $ mpirun_rsh -np 3000 -hostfile nodes ./a.out
> took  6.511116027832031E-002 seconds  1.276689417221967E-003 per gather
>
> INTER_GATHER_TUNING 2 USE_SHMEM_COLL 0
> $ mpirun_rsh -np 3000 -hostfile nodes ./a.out
> took  4.047799110412598E-002 seconds  7.936861000809015E-004 per gather
>
> INTER_GATHER_TUNING 3 USE_SHMEM_COLL 0
> $ mpirun_rsh -np 3000 -hostfile nodes ./a.out
> took  5.628991127014160E-002 seconds  1.103723750394933E-003 per gather
>
>
> Regards,
> Marcin Rogowski
>
> On Sat, Oct 10, 2015 at 6:50 PM, Akshay Venkatesh <akshay at cse.ohio-state.edu> wrote:
>
>> Marcin,
>>
>> Can you try each of the following settings and see if any of them helps? (An
>> example command is shown after the list.)
>>
>> MV2_INTER_GATHER_TUNING=1
>>
>> or
>>
>> MV2_INTER_GATHER_TUNING=2
>>
>> or
>>
>> MV2_INTER_GATHER_TUNING=3
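>>
>> These can be set with setenv as in your earlier runs, or passed directly on
>> the mpirun_rsh command line, for example:
>>
>> $ mpirun_rsh -np 3000 -hostfile nodes MV2_INTER_GATHER_TUNING=1 ./a.out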
>>
>> Thanks
>>
>>
>> On Fri, Oct 9, 2015 at 11:38 AM, Marcin Rogowski <marcin.rogowski at gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I have been trying to diagnose what causes a huge slowdown in one part of
>>> our application between MVAPICH2 1.9 and 2.0.1, and eventually came up with
>>> a test case that simply calls MPI_Gather of 16 MPI_CHARACTER elements to
>>> process 0 (a minimal sketch of the test follows the timings below).
>>> Timings over 51 iterations are the following:
>>>
>>> (setenv MV2_USE_SHMEM_COLL 1)
>>> $ mpirun_rsh -np 3000 -hostfile nodes ./a.out
>>> took 20.7183160781860 seconds 0.406241491729138 per gather
>>>
>>> (setenv MV2_USE_SHMEM_COLL 0)
>>> $ mpirun_rsh -np 3000 -hostfile nodes ./a.out
>>> took 2.943396568298340E-002 seconds 5.771365820192823E-004 per gather
>>>
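>>> For reference, the timed loop is essentially the following (a minimal
>>> sketch, not our exact code):
>>>
>>> program gather_test
>>>   use mpi
>>>   implicit none
>>>   integer, parameter :: niter = 51, msglen = 16
>>>   character(len=msglen) :: sendbuf
>>>   character(len=msglen), allocatable :: recvbuf(:)
>>>   integer :: ierr, rank, nprocs, i
>>>   double precision :: t0, t1
>>>
>>>   call MPI_Init(ierr)
>>>   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
>>>   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
>>>
>>>   sendbuf = 'x'
>>>   allocate(recvbuf(nprocs))   ! root gathers 16 characters from every rank
>>>
>>>   call MPI_Barrier(MPI_COMM_WORLD, ierr)
>>>   t0 = MPI_Wtime()
>>>   do i = 1, niter
>>>     call MPI_Gather(sendbuf, msglen, MPI_CHARACTER, &
>>>                     recvbuf, msglen, MPI_CHARACTER, 0, MPI_COMM_WORLD, ierr)
>>>   end do
>>>   t1 = MPI_Wtime()
>>>
>>>   if (rank == 0) print *, 'took', t1 - t0, 'seconds', &
>>>                           (t1 - t0) / niter, 'per gather'
>>>
>>>   call MPI_Finalize(ierr)
>>> end program gather_test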
>>>
>>> Interestingly, if the hostfile contains unique host names (by default we list
>>> node1 repeated once per CPU core, followed by node2, etc.), the observation
>>> does not hold: both gathers are fast.
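>>>
>>> To illustrate, with our 24-core nodes the default hostfile looks like
>>>
>>> node1
>>> node1
>>> ... (24 lines per node)
>>> node2
>>> node2
>>> ...
>>>
>>> while the unique variant simply lists each node once:
>>>
>>> node1
>>> node2
>>> node3
>>> ...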
>>>
>>> The problem does not seem to appear before release 2.0.1. An easy workaround
>>> would be to disable the collective shared-memory optimizations or to use
>>> unique host lists; however, both slow down other parts of the application,
>>> on average exactly offsetting the benefits.
>>>
>>> Please let me know if you would like any details of our cluster environment
>>> (24-core Xeons with QLogic InfiniBand). I would be really grateful for any
>>> ideas or solutions as to what could be causing this, and for any help in
>>> achieving optimal performance.
>>>
>>> Thank you.
>>>
>>>
>>> Regards,
>>> Marcin Rogowski
>>> Saudi Aramco
>>>
>>> --001a11c25b487209240521adc7b1
>>> Content-Type: text/html; charset="UTF-8"
>>> Content-Transfer-Encoding: quoted-printable
>>>
>>> <div dir=3D"ltr">Hello,<br><br>I have been trying to diagnose what
>>> causes a=
>>>  huge slow down of one part of our application between MVAPICH2 1.9 and
>>> 2.0=
>>> .1 and eventually came up with a test case that simply calls MPI_Gather
>>> of =
>>> 16 MPI_CHARACTERS to process 0. Timings over 51 iterations are the
>>> followin=
>>> g:<br><br>(setenv MV2_USE_SHMEM_COLL 1)<br>$ mpirun_rsh -np 3000
>>> -hostfile =
>>> nodes ./a.out<br>took   20.7183160781860      seconds
>>> 0.406241491729138   =
>>>    per gather<br><br>(setenv MV2_USE_SHMEM_COLL 0)<br>$ mpirun_rsh -np
>>> 3000=
>>>  -hostfile nodes ./a.out<br>took  2.943396568298340E-002 seconds
>>> 5.7713658=
>>> 20192823E-004 per gather<br><br><br>Interestingly, if hostfile file
>>> contain=
>>> s unique host names (by default we use node1 repeated 'cpu
>>> cores' t=
>>> imes followed by node2 etc.) the observation does not hold - both
>>> Gathers a=
>>> re fast.<br><br>The problem does not seem to appear before the release
>>> 2.0.=
>>> 1. Easy solution would be to disable collective shared memory
>>> optimizations=
>>>  or use unique host lists however both solutions slow down different
>>> parts =
>>> of the application, on average exactly offsetting the
>>> benefits.<br><br>Plea=
>>> se let me know if you would like to know any details of our cluster
>>> environ=
>>> ment (24 core Xeons with QLogic's InfiniBand). I would be really
>>> gratef=
>>> ul if you can share any ideas and/or solutions to what could be causing
>>> our=
>>>  problems and help us achieve optimal performance.<br><br>Thank
>>> you.<br><br=
>>> ><br>Regards,<br>Marcin Rogowski<br>Saudi Aramco<br></div>
>>>
>>> --001a11c25b487209240521adc7b1--
>>>
>>> --===============0526641324750062117==
>>> Content-Type: text/plain; charset="us-ascii"
>>> MIME-Version: 1.0
>>> Content-Transfer-Encoding: 7bit
>>> Content-Disposition: inline
>>>
>>> _______________________________________________
>>> mvapich-discuss mailing list
>>> mvapich-discuss at cse.ohio-state.edu
>>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>
>>> --===============0526641324750062117==--
>>>
>>
>>
>>
>> --
>> - Akshay
>>
>
>


-- 
- Akshay