[mvapich-discuss] Performance issue

Masahiro Nakao mnakao at ccs.tsukuba.ac.jp
Tue Jun 7 00:48:57 EDT 2011


Dear Devendar,

In the previous trial, I had set MV2_ENABLE_AFFINITY=0.
I have now tried with MV2_ENABLE_AFFINITY=1.

The results are as below.
---
single-rail
  2048 Byte:102.3 MByte/s
  4096 Byte:165.2 MByte/s
  8192 Byte:194.1 MByte/s
16384 Byte: 74.1 MByte/s
32768 Byte:134.3 MByte/s
---
4-rails
  2048 Byte: 92.4 MByte/s
  4096 Byte:163.6 MByte/s
  8192 Byte:190.9 MByte/s
16384 Byte: 38.1 MByte/s
32768 Byte: 72.7 MByte/s
---

The tendency is the same.

Regards,


(11/06/07 2:14), Devendar Bureddy wrote:
> Hi Masahiro,
>
> Can you please try your experiment with MV2_ENABLE_AFFINITY=1 (the default
> setting)? Please let us know the result.
>
> Thanks
> Devendar
>
> On Sun, Jun 5, 2011 at 10:48 AM, Dhabaleswar Panda
> <panda at cse.ohio-state.edu <mailto:panda at cse.ohio-state.edu>> wrote:
>
>     Thanks for your reply. We will take a look at it and get back to you.
>
>     Thanks,
>
>     DK
>
>     On Sun, 5 Jun 2011, Masahiro Nakao wrote:
>
>      > Dear Professor Dhabaleswar Panda,
>      >
>      > Thank you for your answer.
>      >
>      > 2011/6/5 Dhabaleswar Panda <panda at cse.ohio-state.edu
>     <mailto:panda at cse.ohio-state.edu>>:
>      > > Thanks for your note. Could you tell us whether you are using a
>      > > single-rail or multi-rail environment?
>      >
>      > I used a single-rail environment.
>      >
>      > The environment variables are as below.
>      > --
>      > - export MV2_NUM_HCAS=1
>      > - export MV2_USE_SHARED_MEM=1
>      > - export MV2_ENABLE_AFFINITY=0
>      > - export MV2_NUM_PORTS=1
>      > --
>      > The only compile option is "-O3".
>      > ---
>      > mpicc -O3 hoge.c -o hoge
>      > mpirun_rsh -np 2 -hostfile hosts hoge
>      > ---
>      >
>      > Actually, I had tried a 4-rail environment before.
>      > For that, I changed the following environment variable:
>      > MV2_NUM_HCAS=1 -> MV2_NUM_HCAS=4
>      > But the results were almost the same.
>      > ---
>      > 64 Byte:4.0 MByte/s
>      > 128 Byte:7.2 MByte/s
>      > 256 Byte:12.8 MByte/s
>      > 512 Byte:28.3 MByte/s
>      > 1024 Byte:57.3 MByte/s
>      > 2048 Byte:93.4 MByte/s
>      > 4096 Byte:136.3 MByte/s
>      > 8192 Byte:216.1 MByte/s
>      > 16384 Byte:40.0 MByte/s
>      > 32768 Byte:72.0 MByte/s
>      > ---
>      > Here is another question:
>      > why is the performance with 4 rails worse than with 1 rail?
>      >
>      > > Is this being run on the Tsukuba cluster?
>      >
>      > Yes, it is :).
>      >
>      > Regards,
>      >
>      > > DK
>      > >
>      > > On Sun, 5 Jun 2011, Masahiro Nakao wrote:
>      > >
>      > >> Dear all,
>      > >>
>      > >> I use mvapich2-1.7a.
>      > >> I tried to measure throughput performance,
>      > >> but I do not understand the values I got.
>      > >>
>      > >> The source code is as below.
>      > >> ---
>      > >> double tmp[SIZE];
>      > >>   :
>      > >> MPI_Barrier(MPI_COMM_WORLD);
>      > >> t1 = gettimeofday_sec();
>      > >> if(rank==0)  MPI_Send(tmp, SIZE, MPI_DOUBLE, 1, 999, MPI_COMM_WORLD);
>      > >> else  MPI_Recv(tmp, SIZE, MPI_DOUBLE, 0, 999, MPI_COMM_WORLD, &status);
>      > >> MPI_Barrier(MPI_COMM_WORLD);
>      > >> t2 = gettimeofday_sec();
>      > >>
>      > >> printf("%zu Byte:%.1f MByte/s\n", sizeof(tmp), sizeof(tmp)/(t2-t1)/1000000);
>      > >> ---
>      > >> This program was run on 2 nodes.
>      > >>
>      > >> Results are as below. SIZE = 1, 2, 4, ... , 4096
>      > >> ---
>      > >>     8 Byte:  0.6 MByte/s
>      > >>    16 Byte:  1.1 MByte/s
>      > >>    32 Byte:  2.1 MByte/s
>      > >>    64 Byte:  4.5 MByte/s
>      > >>   128 Byte:  8.0 MByte/s
>      > >>   256 Byte: 15.1 MByte/s
>      > >>   512 Byte: 30.2 MByte/s
>      > >>  1024 Byte: 68.2 MByte/s
>      > >>  2048 Byte:102.3 MByte/s
>      > >>  4096 Byte:157.6 MByte/s
>      > >>  8192 Byte:195.2 MByte/s
>      > >> 16384 Byte: 80.8 MByte/s
>      > >> 32768 Byte:142.4 MByte/s
>      > >> ---
>      > >>
>      > >> Why does the throughput drop when the transfer size is 16384 bytes?
>      > >>
>      > >> My environment is as below.
>      > >> ---
>      > >> - Quad-Core AMD Opteron(tm) Processor 8356
>      > >> - DDR Infiniband (Mellanox ConnectX)
>      > >> ---
>      > >>
>      > >> Regards,
>      > >> --
-- 
Masahiro NAKAO
Email : mnakao at ccs.tsukuba.ac.jp
Researcher
Center for Computational Sciences
UNIVERSITY OF TSUKUBA
