[mvapich-discuss] global summation very slow

Mon Jun 7 13:37:23 EDT 2010

Hi Joachim,

Thanks for your report. We have run CPMD with the wat-32 datasets on
our local systems, and I have included below some of the performance
data that we observed with 32 processes on our cluster (dual-socket
Intel Nehalem, 2.4 GHz, with IB QDR). Could you please elaborate on the
hardware of your cluster? You also indicated that performance is
slightly better when you place as many processes as possible on one
node. How many processes are you running the job with, and how are you
mapping the processes of the job to the compute nodes?

Here are some of the results that we observed with the wat-32-inp-1
dataset across 32 processes:

 ================================================================
 = COMMUNICATION TASK  AVERAGE MESSAGE LENGTH  NUMBER OF CALLS  =
 = SEND/RECEIVE               17409. BYTES               7967.  =
 = BROADCAST                  12045. BYTES                282.  =
 = GLOBAL SUMMATION           13065. BYTES                564.  =
 = GLOBAL MULTIPLICATION          0. BYTES                  1.  =
 = ALL TO ALL COMM           214086. BYTES               4650.  =
 =                             PERFORMANCE          TOTAL TIME  =
 = SEND/RECEIVE             1272.878  MB/S           0.109 SEC  =
 = BROADCAST                 284.826  MB/S           0.012 SEC  =
 = GLOBAL SUMMATION          355.396  MB/S           0.104 SEC  =
 = GLOBAL MULTIPLICATION       0.000  MB/S           0.001 SEC  =
 = ALL TO ALL COMM           148.371  MB/S           6.710 SEC  =
 = SYNCHRONISATION                                   0.010 SEC  =
 ================================================================
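
As a rough cross-check of these figures, the sketch below recomputes throughput as message length times number of calls divided by total time, and compares total times against the first table in the quoted report. It is only an illustration, not part of CPMD or MVAPICH2. The dictionary and function names are made up for this example, and 1 MB is assumed to be 10**6 bytes. That simple formula reproduces the MB/S column for SEND/RECEIVE, BROADCAST and ALL TO ALL COMM, but the GLOBAL SUMMATION rows appear to use a different scaling, so the raw value for that row will not match the table. Note also that the two runs may use different process counts, so the time ratios are indicative only.

```python
# Values copied from the CPMD communication summaries in this thread:
# (average message length in bytes, number of calls, total time in seconds).
LOCAL_32_PROCS = {
    "SEND/RECEIVE": (17409, 7967, 0.109),
    "BROADCAST": (12045, 282, 0.012),
    "GLOBAL SUMMATION": (13065, 564, 0.104),
    "ALL TO ALL COMM": (214086, 4650, 6.710),
}
REPORTED_MULTI_NODE = {
    "SEND/RECEIVE": (34813, 3855, 0.505),
    "BROADCAST": (12720, 267, 0.103),
    "GLOBAL SUMMATION": (14293, 644, 60.432),
    "ALL TO ALL COMM": (419075, 6258, 58.651),
}


def raw_throughput_mb_s(avg_bytes, calls, seconds):
    """Bytes moved per second in MB/s (assumption: 1 MB = 10**6 bytes)."""
    return avg_bytes * calls / seconds / 1e6


def time_ratio(task):
    """How many times longer the quoted multi-node run spent in this task."""
    return REPORTED_MULTI_NODE[task][2] / LOCAL_32_PROCS[task][2]


for task in LOCAL_32_PROCS:
    ours = raw_throughput_mb_s(*LOCAL_32_PROCS[task])
    theirs = raw_throughput_mb_s(*REPORTED_MULTI_NODE[task])
    print(f"{task:18s} {ours:10.3f} MB/s {theirs:10.3f} MB/s "
          f"time ratio {time_ratio(task):7.1f}x")
```

On these numbers, global summation takes roughly 580 times longer in the quoted multi-node run than in our 32-process run, while all-to-all communication is under 9 times slower, which suggests the collective reduction path is the place to look first.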


Thanks,
Krishna
              

Greipel.Joachim at mh-hannover.de wrote:
> Dear all,
>  
> I compiled CPMD for use with mvapich2 (version 1.4.1) over InfiniBand. 
> The program does not scale at all, because global summation and, in 
> part, all-to-all communication are exceedingly slow when the processes 
> run on different nodes. See the wat32 test of CPMD below as an 
> example.
>  
>  ================================================================
>  = COMMUNICATION TASK  AVERAGE MESSAGE LENGTH  NUMBER OF CALLS  =
>  = SEND/RECEIVE               34813. BYTES               3855.  =
>  = BROADCAST                  12720. BYTES                267.  =
>  = GLOBAL SUMMATION           14293. BYTES                644.  =
>  = GLOBAL MULTIPLICATION          0. BYTES                  1.  =
>  = ALL TO ALL COMM           419075. BYTES               6258.  =
>  =                             PERFORMANCE          TOTAL TIME  =
>  = SEND/RECEIVE              265.685  MB/S           0.505 SEC  =
>  = BROADCAST                  32.876  MB/S           0.103 SEC  =
>  = GLOBAL SUMMATION            0.609  MB/S          60.432 SEC  =
>  = GLOBAL MULTIPLICATION       0.000  MB/S           0.001 SEC  =
>  = ALL TO ALL COMM            44.715  MB/S          58.651 SEC  =
>  = SYNCHRONISATION                                   0.007 SEC  =
>  ================================================================
> When I place as many processes as possible on one node, the results 
> are slightly better:
>  
>  ================================================================
>  = COMMUNICATION TASK  AVERAGE MESSAGE LENGTH  NUMBER OF CALLS  =
>  = SEND/RECEIVE               69639. BYTES               1799.  =
>  = BROADCAST                  13112. BYTES                259.  =
>  = GLOBAL SUMMATION           14293. BYTES                644.  =
>  = GLOBAL MULTIPLICATION          0. BYTES                  1.  =
>  = ALL TO ALL COMM           834296. BYTES               6258.  =
>  =                             PERFORMANCE          TOTAL TIME  =
>  = SEND/RECEIVE              309.035  MB/S           0.405 SEC  =
>  = BROADCAST                  89.488  MB/S           0.038 SEC  =
>  = GLOBAL SUMMATION           17.658  MB/S           1.564 SEC  =
>  = GLOBAL MULTIPLICATION       0.000  MB/S           0.001 SEC  =
>  = ALL TO ALL COMM           173.341  MB/S          30.120 SEC  =
>  = SYNCHRONISATION                                   0.010 SEC  =
>  ================================================================
> But in neither case is the global summation performance anywhere near 
> satisfactory. It should be at least 50 to 100 times higher than it 
> is.
>  
> Does anyone have a clue what is wrong?
>  
> Regards,
> Joachim
>  
>  
> --
> Dr. rer. nat. Joachim Greipel
> Med. Hochschule Hannover
> Biophys. Chem. OE 4350
> Carl-Neuberg-Str. 1
> 30625 Hannover
> Germany
>  
> Fon: +49-511-532-3718
> Fax: +49-511-532-8924