[mvapich-discuss] global summation very slow
Krishna Chaitanya Kandalla
kandalla at cse.ohio-state.edu
Mon Jun 7 13:37:23 EDT 2010
Hi Joachim,
Thanks for your report. We have run CPMD with the wat-32
datasets on our local systems, and I have included below some of the
performance data we observed with 32 processes on our cluster
(dual-socket Intel Nehalem, 2.4 GHz, with QDR InfiniBand). Could you
please elaborate a bit on the hardware of your cluster? Also, you
indicated that the performance is slightly better when you place as
many processes as possible on one node. How many processes are you
running the job with, and how are you mapping the processes of the
job to the compute nodes?
Here are some of the results we observed with the
wat-32-inp-1 dataset across 32 processes:
================================================================
= COMMUNICATION TASK  AVERAGE MESSAGE LENGTH   NUMBER OF CALLS =
= SEND/RECEIVE                  17409. BYTES             7967. =
= BROADCAST                     12045. BYTES              282. =
= GLOBAL SUMMATION              13065. BYTES              564. =
= GLOBAL MULTIPLICATION             0. BYTES                1. =
= ALL TO ALL COMM              214086. BYTES             4650. =
= PERFORMANCE                                       TOTAL TIME =
= SEND/RECEIVE                 1272.878 MB/S        0.109 SEC  =
= BROADCAST                     284.826 MB/S        0.012 SEC  =
= GLOBAL SUMMATION              355.396 MB/S        0.104 SEC  =
= GLOBAL MULTIPLICATION           0.000 MB/S        0.001 SEC  =
= ALL TO ALL COMM               148.371 MB/S        6.710 SEC  =
= SYNCHRONISATION                                   0.010 SEC  =
================================================================
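To isolate whether the slowdown lies in MPI_Allreduce itself (CPMD's global summation is presumably an allreduce) rather than in CPMD, a small timing loop along these lines could be run across your nodes. This is only a sketch: it needs an MPI installation (mpicc, and a launcher such as mpirun_rsh) to build and run, and the message size is chosen to roughly match the ~13-14 KB average reported in the tables:

```c
/* allreduce_bench.c -- time MPI_Allreduce at roughly CPMD's
 * "global summation" message size (~14 KB of doubles).
 * Build: mpicc -o allreduce_bench allreduce_bench.c
 * Run:   mpirun_rsh -np 32 -hostfile hosts ./allreduce_bench
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int n = 1792;            /* 1792 doubles = 14336 bytes */
    const int iters = 1000;
    double *in  = malloc(n * sizeof(double));
    double *out = malloc(n * sizeof(double));
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < n; i++) in[i] = 1.0;

    /* one warm-up call, then a synchronized timing loop */
    MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (i = 0; i < iters; i++)
        MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t = MPI_Wtime() - t0;

    if (rank == 0)
        printf("avg MPI_Allreduce time for %lu bytes: %g us\n",
               (unsigned long)(n * sizeof(double)),
               1e6 * t / iters);

    MPI_Finalize();
    free(in);
    free(out);
    return 0;
}
```

Comparing the average time for an all-on-one-node run against a spread-across-nodes run would show whether the inter-node allreduce path, independent of CPMD, is the problem.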
Thanks,
Krishna
Greipel.Joachim at mh-hannover.de wrote:
> Dear all,
>
> I compiled CPMD for use with mvapich2 (version 1.4.1) over
> InfiniBand. The program does not scale at all, because the global
> summation and, in part, the all-to-all communication are exceedingly
> slow when the processes run on different nodes. See the wat32 test of
> CPMD below as an example.
>
> ================================================================
> = COMMUNICATION TASK  AVERAGE MESSAGE LENGTH   NUMBER OF CALLS =
> = SEND/RECEIVE                  34813. BYTES             3855. =
> = BROADCAST                     12720. BYTES              267. =
> = GLOBAL SUMMATION              14293. BYTES              644. =
> = GLOBAL MULTIPLICATION             0. BYTES                1. =
> = ALL TO ALL COMM              419075. BYTES             6258. =
> = PERFORMANCE                                       TOTAL TIME =
> = SEND/RECEIVE                  265.685 MB/S        0.505 SEC  =
> = BROADCAST                      32.876 MB/S        0.103 SEC  =
> = GLOBAL SUMMATION                0.609 MB/S       60.432 SEC  =
> = GLOBAL MULTIPLICATION           0.000 MB/S        0.001 SEC  =
> = ALL TO ALL COMM                44.715 MB/S       58.651 SEC  =
> = SYNCHRONISATION                                   0.007 SEC  =
> ================================================================
> When I use as many processes on one node as possible, the results
> are slightly better:
>
> ================================================================
> = COMMUNICATION TASK  AVERAGE MESSAGE LENGTH   NUMBER OF CALLS =
> = SEND/RECEIVE                  69639. BYTES             1799. =
> = BROADCAST                     13112. BYTES              259. =
> = GLOBAL SUMMATION              14293. BYTES              644. =
> = GLOBAL MULTIPLICATION             0. BYTES                1. =
> = ALL TO ALL COMM              834296. BYTES             6258. =
> = PERFORMANCE                                       TOTAL TIME =
> = SEND/RECEIVE                  309.035 MB/S        0.405 SEC  =
> = BROADCAST                      89.488 MB/S        0.038 SEC  =
> = GLOBAL SUMMATION               17.658 MB/S        1.564 SEC  =
> = GLOBAL MULTIPLICATION           0.000 MB/S        0.001 SEC  =
> = ALL TO ALL COMM               173.341 MB/S       30.120 SEC  =
> = SYNCHRONISATION                                   0.010 SEC  =
> ================================================================
> But in no case is the global summation performance anywhere near
> satisfactory; it should be at least 50- to 100-fold higher than it
> is.
>
> Does anyone have a clue what is wrong?
>
> Regards,
> Joachim
>
>
> --
> Dr. rer. nat. Joachim Greipel
> Med. Hochschule Hannover
> Biophys. Chem. OE 4350
> Carl-Neuberg-Str. 1
> 30625 Hannover
> Germany
>
> Fon: +49-511-532-3718
> Fax: +49-511-532-8924
>