[mvapich-discuss] mvapich2 slow with default mapping

Götz Waschk goetz.waschk at gmail.com
Fri Apr 29 07:41:28 EDT 2016


Dear Hari,

here is the output with default CPU binding, together with the PingPong
benchmark result. Note that several ranks end up bound to the same core
(for example, ranks 0 and 4 are both on CPU 4, and ranks 2 and 6 are both
on CPU 6), which looks consistent with the oversubscription you suspected:
-------------CPU AFFINITY-------------
RANK:0  CPU_SET:   4
RANK:2  CPU_SET:   6
RANK:4  CPU_SET:   4
RANK:6  CPU_SET:   6
RANK:8  CPU_SET:   1
RANK:10  CPU_SET:   7
RANK:12  CPU_SET:   1
RANK:14  CPU_SET:   7
-------------------------------------
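If it helps, the binding can also be cross-checked from outside MPI while
the job is running; the snippet below is only a generic sketch using
standard Linux tools, not something from my original run:

# print the current CPU affinity of every IMB-MPI1 rank on this node
for pid in $(pgrep -f IMB-MPI1); do taskset -cp "$pid"; done
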
#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 4.1, MPI-1 part
#------------------------------------------------------------
# Date                  : Fri Apr 29 13:36:33 2016
# Machine               : x86_64
# System                : Linux
# Release               : 3.10.0-327.10.1.el7.x86_64
# Version               : #1 SMP Tue Feb 16 06:09:11 CST 2016
# MPI Version           : 3.0
# MPI Thread Environment:

# New default behavior from Version 3.2 on:

# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time



# Calling sequence was:

# /opt/ohpc/pub/libs/gnu/mvapich2/imb/4.1/bin/IMB-MPI1

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# PingPong
# PingPing
# Sendrecv
# Exchange
# Allreduce
# Reduce
# Reduce_scatter
# Allgather
# Allgatherv
# Gather
# Gatherv
# Scatter
# Scatterv
# Alltoall
# Alltoallv
# Bcast
# Barrier

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
# ( 14 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000        39.98         0.00
            1         1000         1.59         0.60
            2         1000         1.55         1.23
            4         1000         1.54         2.48
            8         1000         1.55         4.91
           16         1000         1.63         9.33
           32         1000         1.66        18.43
           64         1000         1.69        36.09
          128         1000         1.92        63.60
          256         1000         3.10        78.71
          512         1000         3.37       144.87
         1024         1000         3.94       248.16
         2048         1000        15.67       124.60
         4096         1000        16.61       235.20
         8192         1000        19.52       400.14
        16384         1000        54.38       287.33
        32768         1000        44.83       697.01
        65536          640        62.69       996.96
       131072          320       119.91      1042.41
       262144          160       225.87      1106.84
       524288           80       789.38       633.41
      1048576           40       875.05      1142.79
      2097152           20      1784.90      1120.51
      4194304           10     18072.80       221.33
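
In case anyone wants to reproduce this, a launch along the following lines
produces the output above; the launcher and host file here are only an
example (adapt them to your environment), and with mpirun_rsh the MV2_*
variables are passed on the command line:

# sketch: 16 ranks over two nodes (8 cores each assumed), default binding;
# MV2_SHOW_CPU_BINDING=1 makes MVAPICH2 print the affinity table above
mpirun_rsh -np 16 -hostfile ./hosts MV2_SHOW_CPU_BINDING=1 \
    /opt/ohpc/pub/libs/gnu/mvapich2/imb/4.1/bin/IMB-MPI1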


This is the InfiniBand information for one node; the other node looks the same:
hca_id:    mlx4_0
    transport:            InfiniBand (0)
    fw_ver:                2.7.000
    node_guid:            0018:8b90:97fe:ef8d
    sys_image_guid:            0018:8b90:97fe:ef90
    vendor_id:            0x02c9
    vendor_part_id:            26428
    hw_ver:                0xA0
    board_id:            DEL08C0000009
    phys_port_cnt:            2
        port:    1
            state:            PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:        4096 (5)
            sm_lid:            318
            port_lid:        127
            port_lmc:        0x00
            link_layer:        InfiniBand

        port:    2
            state:            PORT_DOWN (1)
            max_mtu:        4096 (5)
            active_mtu:        4096 (5)
            sm_lid:            0
            port_lid:        0
            port_lmc:        0x00
            link_layer:        InfiniBand


Regards, Götz Waschk

On Thu, Apr 28, 2016 at 5:57 PM, Hari Subramoni <subramoni.1 at osu.edu> wrote:
> Hello Götz,
>
> It looks like some sort of oversubscription is happening here. Could you
> please send us the following information?
>
> 1. Output of program run after setting MV2_SHOW_CPU_BINDING=1
>
> 2. Output of ibv_devinfo executed on the system where you're seeing the
> degradation.
>
> Thanks,
> Hari.
>
> On Apr 28, 2016 10:11 AM, "Götz Waschk" <goetz.waschk at gmail.com> wrote:
>>
>> Dear Mvapich2 experts,
>>
>> I'm currently evaluating OpenHPC packages, including mvapich2 2.1.
>> I've tested the speed using the Intel MPI Benchmarks and noticed that
>> the first benchmark, PingPong, behaves differently when run on 16 cores
>> than on 2 cores, even though only two cores are actually in use and the
>> remaining processes simply wait. The full results are in OpenHPC's
>> issue tracker on GitHub:
>> https://github.com/openhpc/ohpc/issues/207#issuecomment-212319647
>>
>> As you can see there, setting these variables helped:
>>
>> export MV2_SHOW_CPU_BINDING=1
>> export MV2_CPU_MAPPING=0:1:2:3:4:5:6:7
>>
>> I still wonder why they have such an influence and why the default
>> setting isn't sufficient here.
>>
>> Regards,
>> Götz Waschk
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss at cse.ohio-state.edu
>> http://mailman.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss



-- 
AL I:40: Do what thou wilt shall be the whole of the Law.


