[mvapich-discuss] Slow MV2_USE_SHMEM_COLL

Akshay Venkatesh akshay at cse.ohio-state.edu
Sat Oct 10 11:50:23 EDT 2015


Marcin,

Can you try the following parameters, one at a time, and see if any of them helps? (An example invocation follows the list.)

MV2_INTER_GATHER_TUNING=1

or

MV2_INTER_GATHER_TUNING=2

or

MV2_INTER_GATHER_TUNING=3
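
For example, reusing the launch command from your mail (only the environment variable changes):

(setenv MV2_INTER_GATHER_TUNING 1)
$ mpirun_rsh -np 3000 -hostfile nodes ./a.out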

Thanks


On Fri, Oct 9, 2015 at 11:38 AM, Marcin Rogowski <marcin.rogowski at gmail.com>
wrote:

> Hello,
>
> I have been trying to diagnose what causes a huge slowdown of one part of
> our application between MVAPICH2 1.9 and 2.0.1, and eventually came up with
> a test case that simply calls MPI_Gather of 16 MPI_CHARACTER elements to
> process 0 (a sketch of the test case follows the timings below). Timings
> over 51 iterations are the following:
>
> (setenv MV2_USE_SHMEM_COLL 1)
> $ mpirun_rsh -np 3000 -hostfile nodes ./a.out
> took 20.7183160781860 seconds 0.406241491729138 per gather
>
> (setenv MV2_USE_SHMEM_COLL 0)
> $ mpirun_rsh -np 3000 -hostfile nodes ./a.out
> took 2.943396568298340E-002 seconds 5.771365820192823E-004 per gather
>
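> A minimal sketch of the test case mentioned above (the buffer contents,
> timing method, and output formatting here are illustrative; the actual
> code may differ in such details):
>
>       program gather_test
>       use mpi
>       implicit none
>       integer, parameter :: niter = 51, msglen = 16
>       character(len=1) :: sendbuf(msglen)
>       character(len=1), allocatable :: recvbuf(:)
>       integer :: ierr, rank, nprocs, i
>       double precision :: t0, t1
>
>       call MPI_Init(ierr)
>       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
>       call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
>
>       ! process 0 collects 16 characters from every rank
>       allocate(recvbuf(msglen*nprocs))
>       sendbuf = 'x'
>
>       call MPI_Barrier(MPI_COMM_WORLD, ierr)
>       t0 = MPI_Wtime()
>       do i = 1, niter
>          call MPI_Gather(sendbuf, msglen, MPI_CHARACTER, &
>                          recvbuf, msglen, MPI_CHARACTER, &
>                          0, MPI_COMM_WORLD, ierr)
>       end do
>       t1 = MPI_Wtime()
>
>       if (rank == 0) then
>          print *, 'took', t1 - t0, 'seconds', (t1 - t0) / niter, 'per gather'
>       end if
>
>       deallocate(recvbuf)
>       call MPI_Finalize(ierr)
>       end program gather_test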
>
> Interestingly, if the hostfile contains unique host names (by default we
> use node1 repeated 'cpu cores' times, followed by node2, etc., as
> illustrated below), the observation does not hold - both Gathers are fast.
>
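> To illustrate the two hostfile layouts (host names are hypothetical; our
> nodes have 24 cores, so the default hostfile repeats each name 24 times):
>
>     default hostfile (slow case):        unique hostfile (fast case):
>     node1                                node1
>     node1                                node2
>     ... (node1 repeated 24 times)        node3
>     node2                                ...
>     ...
>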
> The problem does not seem to appear before release 2.0.1. An easy solution
> would be to disable the collective shared-memory optimizations or to use
> unique host lists; however, both workarounds slow down different parts of
> the application, on average exactly offsetting the benefits.
>
> Please let me know if you would like any details of our cluster
> environment (24-core Xeons with QLogic InfiniBand). I would be really
> grateful if you could share any ideas and/or solutions to what could be
> causing our problems and help us achieve optimal performance.
>
> Thank you.
>
>
> Regards,
> Marcin Rogowski
> Saudi Aramco
>



-- 
- Akshay