[mvapich-discuss] Slow MV2_USE_SHMEM_COLL

Marcin Rogowski marcin.rogowski at gmail.com
Sun Oct 11 00:50:56 EDT 2015


Hello Akshay,

I checked, and it seems that only one MV2_INTER_GATHER_TUNING option (2) is
the slow one. The labels below abbreviate the MV2_INTER_GATHER_TUNING and
MV2_USE_SHMEM_COLL settings used for each run:

INTER_GATHER_TUNING 1 USE_SHMEM_COLL 1
$ mpirun_rsh -np 3000 -hostfile nodes ./a.out
 took  0.270218133926392      seconds  5.298394782870424E-003 per gather

INTER_GATHER_TUNING 2 USE_SHMEM_COLL 1
$ mpirun_rsh -np 3000 -hostfile nodes ./a.out
 took   23.2984428405762      seconds  0.456832212560317      per gather

INTER_GATHER_TUNING 3 USE_SHMEM_COLL 1
$ mpirun_rsh -np 3000 -hostfile nodes ./a.out
 took  0.377309799194336      seconds  7.398231356751685E-003 per gather


As expected, the tuning option has no effect when the shared-memory
collective optimizations are off:

INTER_GATHER_TUNING 1 USE_SHMEM_COLL 0
$ mpirun_rsh -np 3000 -hostfile nodes ./a.out
took  6.511116027832031E-002 seconds  1.276689417221967E-003 per gather

INTER_GATHER_TUNING 2 USE_SHMEM_COLL 0
$ mpirun_rsh -np 3000 -hostfile nodes ./a.out
took  4.047799110412598E-002 seconds  7.936861000809015E-004 per gather

INTER_GATHER_TUNING 3 USE_SHMEM_COLL 0
$ mpirun_rsh -np 3000 -hostfile nodes ./a.out
took  5.628991127014160E-002 seconds  1.103723750394933E-003 per gather
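
For reference, each configuration above was selected by setting the two
environment variables before launching; with mpirun_rsh they can also be
passed on the command line as VAR=value pairs before the executable. A sketch
of the slow configuration (the exact invocation may differ from what we used):

$ mpirun_rsh -np 3000 -hostfile nodes \
    MV2_INTER_GATHER_TUNING=2 MV2_USE_SHMEM_COLL=1 ./a.out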

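The ./a.out test itself is the reproducer described in the quoted message
below: 51 timed MPI_Gather calls of 16 MPI_CHARACTERs to rank 0. A minimal
Fortran sketch of that kind of test (buffer contents, names, and output
format here are placeholders, not the exact program):

program gather_test
  use mpi
  implicit none
  integer, parameter :: niter = 51, msglen = 16
  integer :: ierr, rank, nprocs, i
  character(len=msglen) :: sendbuf
  character(len=msglen), allocatable :: recvbuf(:)
  double precision :: t0, t1

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  sendbuf = 'abcdefghijklmnop'
  allocate(recvbuf(nprocs))        ! only rank 0 actually uses the result

  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  t0 = MPI_Wtime()
  do i = 1, niter
     ! 16 characters gathered from every rank to rank 0
     call MPI_Gather(sendbuf, msglen, MPI_CHARACTER, &
                     recvbuf, msglen, MPI_CHARACTER, &
                     0, MPI_COMM_WORLD, ierr)
  end do
  t1 = MPI_Wtime()

  if (rank == 0) print *, 'took', t1 - t0, 'seconds', &
                          (t1 - t0) / niter, 'per gather'

  call MPI_Finalize(ierr)
end program gather_test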

Regards,
Marcin Rogowski

On Sat, Oct 10, 2015 at 6:50 PM, Akshay Venkatesh <akshay at cse.ohio-state.edu> wrote:

> Marcin,
>
> Can you try one of the following parameters and see if one of them helps?
>
> MV2_INTER_GATHER_TUNING=1
>
> or
>
> MV2_INTER_GATHER_TUNING=2
>
> or
>
> MV2_INTER_GATHER_TUNING=3
>
> Thanks
>
>
> On Fri, Oct 9, 2015 at 11:38 AM, Marcin Rogowski <marcin.rogowski at gmail.com> wrote:
>
>>
>> Hello,
>>
>> I have been trying to diagnose what causes a huge slowdown of one part of
>> our application between MVAPICH2 1.9 and 2.0.1, and eventually came up with
>> a test case that simply calls MPI_Gather of 16 MPI_CHARACTERs to process 0.
>> Timings over 51 iterations are the following:
>>
>> (setenv MV2_USE_SHMEM_COLL 1)
>> $ mpirun_rsh -np 3000 -hostfile nodes ./a.out
>> took 20.7183160781860 seconds 0.406241491729138 per gather
>>
>> (setenv MV2_USE_SHMEM_COLL 0)
>> $ mpirun_rsh -np 3000 -hostfile nodes ./a.out
>> took 2.943396568298340E-002 seconds 5.771365820192823E-004 per gather
>>
>>
>> Interestingly, if the hostfile contains unique host names (by default we
>> use node1 repeated 'cpu cores' times, followed by node2, etc.), the
>> observation does not hold - both Gathers are fast.
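>>
>> For illustration (a sketch with placeholder node names), the default
>> hostfile repeats each node name once per core,
>>
>>     node1
>>     node1
>>     ...   (24 entries per node)
>>     node2
>>     node2
>>     ...
>>
>> while the "unique host names" variant lists every node only once:
>>
>>     node1
>>     node2
>>     node3
>>     ...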
>>
>> The problem does not seem to appear before release 2.0.1. An easy solution
>> would be to disable the collective shared-memory optimizations or to use
>> unique host lists; however, both workarounds slow down other parts of the
>> application, on average exactly offsetting the benefits.
>>
>> Please let me know if you would like any details of our cluster
>> environment (24-core Xeons with QLogic InfiniBand). I would be really
>> grateful if you could share any ideas and/or solutions to what could be
>> causing our problems and help us achieve optimal performance.
>>
>> Thank you.
>>
>>
>> Regards,
>> Marcin Rogowski
>> Saudi Aramco
>>
>
>
>
> --
> - Akshay
>

