[mvapich-discuss] Performance issue

Dhabaleswar Panda panda at cse.ohio-state.edu
Sun Jun 5 10:48:50 EDT 2011


Thanks for your reply. We will take a look at it and get back to you.

Thanks,

DK

On Sun, 5 Jun 2011, Masahiro Nakao wrote:

> Dear Professor Dhabaleswar Panda,
>
> Thank you for your answer.
>
> 2011/6/5 Dhabaleswar Panda <panda at cse.ohio-state.edu>:
> > Thanks for your note. Could you tell us whether you are using a single-rail
> > or a multi-rail environment?
>
> I used a single-rail environment.
>
> The environment variables are as below.
> --
> - export MV2_NUM_HCAS=1
> - export MV2_USE_SHARED_MEM=1
> - export MV2_ENABLE_AFFINITY=0
> - export MV2_NUM_PORTS=1
> --
> The compile option is only "-O3".
> ---
> mpicc -O3 hoge.c -o hoge
> mpirun_rsh -np 2 -hostfile hosts hoge
> ---
>
> Actually, I tried a 4-rail environment earlier.
> For that run I changed one environment variable:
> MV2_NUM_HCAS=1 -> MV2_NUM_HCAS=4
> But the results were almost the same.
> ---
> 64 Byte:4.0 MByte/s
> 128 Byte:7.2 MByte/s
> 256 Byte:12.8 MByte/s
> 512 Byte:28.3 MByte/s
> 1024 Byte:57.3 MByte/s
> 2048 Byte:93.4 MByte/s
> 4096 Byte:136.3 MByte/s
> 8192 Byte:216.1 MByte/s
> 16384 Byte:40.0 MByte/s
> 32768 Byte:72.0 MByte/s
> ---
> I have another question.
> Why is the performance with 4 rails worse than with 1 rail?
>
> > Is this being run on the Tsukuba cluster?
>
> Yes, it is :).
>
> Regards,
>
> > DK
> >
> > On Sun, 5 Jun 2011, Masahiro Nakao wrote:
> >
> >> Dear all,
> >>
> >> I use mvapich2-1.7a.
> >> I tried to measure throughput,
> >> but I don't understand the results.
> >>
> >> The source code is as below.
> >> ---
> >> double tmp[SIZE];
> >>   :
> >> MPI_Barrier(MPI_COMM_WORLD);
> >> t1 = gettimeofday_sec();
> >> if(rank==0)   MPI_Send(tmp, SIZE, MPI_DOUBLE, 1, 999, MPI_COMM_WORLD );
> >> else  MPI_Recv(tmp, SIZE, MPI_DOUBLE, 0, 999, MPI_COMM_WORLD, &status );
> >> MPI_Barrier(MPI_COMM_WORLD);
> >> t2 = gettimeofday_sec();
> >>
> >> printf("%zu Byte:%.1f MByte/s\n", sizeof(tmp), sizeof(tmp)/(t2-t1)/1000000);
> >> ---
> >> This program was run on 2 nodes.
> >>
> >> Results are as below. SIZE = 1, 2, 4, ... , 4096
> >> ---
> >>     8 Byte:  0.6 MByte/s
> >>    16 Byte:  1.1 MByte/s
> >>    32 Byte:  2.1 MByte/s
> >>    64 Byte:  4.5 MByte/s
> >>   128 Byte:  8.0 MByte/s
> >>   256 Byte: 15.1 MByte/s
> >>   512 Byte: 30.2 MByte/s
> >>  1024 Byte: 68.2 MByte/s
> >>  2048 Byte:102.3 MByte/s
> >>  4096 Byte:157.6 MByte/s
> >>  8192 Byte:195.2 MByte/s
> >> 16384 Byte: 80.8 MByte/s
> >> 32768 Byte:142.4 MByte/s
> >> ---
> >>
> >> Why does the throughput drop when the transfer size is 16384 bytes?
> >>
> >> My environment is as below.
> >> ---
> >> - Quad-Core AMD Opteron(tm) Processor 8356
> >> - DDR Infiniband (Mellanox ConnectX)
> >> ---
> >>
> >> Regards,
> >> --
> >> Masahiro NAKAO
> >> Email : mnakao at ccs.tsukuba.ac.jp
> >> Researcher
> >> Center for Computational Sciences
> >> UNIVERSITY OF TSUKUBA
> >> _______________________________________________
> >> mvapich-discuss mailing list
> >> mvapich-discuss at cse.ohio-state.edu
> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>
> >
> >
>
>
>
> --
> Masahiro NAKAO
> Email : mnakao at ccs.tsukuba.ac.jp
> Researcher
> Center for Computational Sciences
> UNIVERSITY OF TSUKUBA
>
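
The MByte/s figures quoted above can be converted back into the elapsed
time (t2 - t1) of each single measurement, which makes the drop at
16384 bytes easier to read. The sketch below is only an illustration:
gettimeofday_sec() is an assumed definition (the thread does not show its
body), and mbytes_per_sec() and elapsed_usec() are hypothetical helper
names, not part of the original program or of MVAPICH2.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* ASSUMPTION: one plausible body for the gettimeofday_sec() helper that the
 * quoted program calls; the original definition is not shown in the thread.
 * Returns wall-clock time in seconds. */
double gettimeofday_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

/* Same formula as the quoted printf: bytes / seconds / 1000000. */
double mbytes_per_sec(size_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes / seconds / 1000000.0;
}

/* Inverse of the formula above (hypothetical helper). 1 MByte/s equals
 * 1 byte per microsecond, so elapsed usec = bytes / (MByte/s). */
double elapsed_usec(size_t bytes, double rate_mbytes)
{
    assert(rate_mbytes > 0.0);
    return (double)bytes / rate_mbytes;
}
```

Applied to the single-rail results, elapsed_usec(8192, 195.2) is about
42 usec, elapsed_usec(16384, 80.8) is about 203 usec, and
elapsed_usec(32768, 142.4) is about 230 usec. The inputs are rounded to one
decimal place, so these are approximate. Read this way, the measured time
jumps by roughly 160 usec between 8192 and 16384 bytes and then grows by
only about 27 usec for the next doubling, which looks more like a fixed
additional cost per message than a lower transfer rate. Note also that each
figure comes from one send and that the timed region includes the second
MPI_Barrier.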



