[mvapich-discuss] Very bad latency scaling on bisectional bandwidth test with OFED 1.3/MVAPICH 1.0.0

Chris Worley worleys at gmail.com
Fri Mar 28 12:06:25 EDT 2008


I just upgraded to OFED 1.3 and MVAPICH 1.0.0 (from OFED 1.2.5.5 and
MVAPICH 0.9.9).  I'm using ConnectX cards.  IB diagnostics show no
fabric issues.

I have a test for bisectional bandwidth and latency; the latency test
is showing very poor worst-case results repeatably as the node count
goes over ~100.  Other MPI implementations (that will remain nameless
as they normally don't perform as well as MVAPICH) don't have this
issue... so I don't think it's strictly an OFED 1.3 issue.

What I'm seeing is worst-case latency in the msecs(!) for all nodes.
Here's a sample of the current results (only testing ~120 nodes), all
times in usecs:

C-25-38: worst=1038.127000 (C-27-06,C-27-37), best=2.688000
(C-25-41,C-25-35), avg=31.816815
C-25-29: worst=1037.159000 (C-26-42,C-27-28), best=2.694000
(C-25-32,C-25-26), avg=31.645870
C-25-26: worst=1038.052000 (C-26-39,C-27-25), best=2.695000
(C-25-29,C-25-23), avg=31.757562
C-25-41: worst=1037.349000 (C-27-09,C-27-40), best=2.695000
(C-25-44,C-25-38), avg=31.776089
C-25-17: worst=1036.924000 (C-26-30,C-27-16), best=2.697000
(C-25-20,C-25-14), avg=31.664692
C-26-05: worst=1037.973000 (C-27-18,C-27-54), best=2.697000
(C-26-08,C-26-02), avg=31.809110
C-26-17: worst=1038.095000 (C-27-30,C-25-04), best=2.703000
(C-26-20,C-26-14), avg=31.774685
C-26-20: worst=1038.225000 (C-27-33,C-25-07), best=2.704000
(C-26-23,C-26-17), avg=30.208007
C-25-14: worst=1037.357000 (C-26-27,C-27-13), best=2.705000
(C-25-17,C-25-11), avg=31.576705
C-26-08: worst=1038.058000 (C-27-21,C-27-57), best=2.705000
(C-26-11,C-26-05), avg=31.639445
C-27-20: worst=1037.043000 (C-25-21,C-26-07), best=2.706000
(C-27-23,C-27-17), avg=31.819363
C-27-23: worst=1037.909000 (C-25-24,C-26-10), best=2.706000
(C-27-26,C-27-20), avg=31.714664
C-26-32: worst=1037.963000 (C-27-45,C-25-19), best=2.707000
(C-26-35,C-26-29), avg=31.694966
C-26-41: worst=1037.817000 (C-27-59,C-25-28), best=2.707000
(C-26-44,C-26-38), avg=31.674466
C-27-08: worst=1037.036000 (C-25-09,C-25-40), best=2.708000
(C-27-11,C-27-05), avg=31.781712
C-26-29: worst=1038.093000 (C-27-42,C-25-16), best=2.709000
(C-26-32,C-26-26), avg=31.812582
C-27-11: worst=1038.136000 (C-25-12,C-25-43), best=2.710000
(C-27-14,C-27-08), avg=31.653336
C-26-44: worst=1038.070000 (C-27-62,C-25-31), best=2.711000
(C-27-02,C-26-41), avg=31.666521
C-27-32: worst=1037.563000 (C-25-33,C-26-19), best=2.712000
(C-27-35,C-27-29), avg=31.778705
C-25-44: worst=1036.881000 (C-27-12,C-27-43), best=2.713000
(C-26-02,C-25-41), avg=31.752103

While other MPI implementations running under OFED 1.3 on the same
node set show more stability:

C-25-30: worst=35.822153 (C-27-39,C-27-19), best=3.398895
(C-25-10,C-26-34), avg=11.128738
C-26-11: worst=35.799026 (C-27-22,C-26-39), best=3.398180
(C-25-14,C-27-05), avg=10.864981
C-27-36: worst=35.802126 (C-26-02,C-25-11), best=3.396034
(C-26-34,C-27-38), avg=10.694269
C-25-16: worst=35.804033 (C-27-51,C-27-26), best=3.391981
(C-25-20,C-26-27), avg=11.112461
C-25-10: worst=35.800934 (C-27-37,C-27-53), best=3.388882
(C-26-23,C-25-30), avg=10.956828
C-27-08: worst=35.817862 (C-26-30,C-25-25), best=3.388882
(C-26-06,C-27-54), avg=10.765788
C-27-11: worst=35.810947 (C-25-31,C-26-15), best=3.386974
(C-27-56,C-26-23), avg=11.048172
C-26-10: worst=35.799026 (C-27-20,C-26-40), best=3.386974
(C-25-01,C-27-04), avg=10.720228
C-26-31: worst=37.193060 (C-25-33,C-26-43), best=3.386021
(C-27-27,C-25-20), avg=10.809506
C-26-23: worst=35.791874 (C-26-35,C-27-17), best=3.386021
(C-27-11,C-25-10), avg=10.747612
C-25-20: worst=35.769939 (C-27-28,C-27-61), best=3.385782
(C-26-31,C-25-16), avg=11.007588
C-27-37: worst=35.790205 (C-26-03,C-25-10), best=3.385067
(C-26-35,C-27-39), avg=10.998216
C-27-54: worst=35.786867 (C-25-29,C-25-19), best=3.384829
(C-27-08,C-27-35), avg=11.109948
C-27-43: worst=35.766125 (C-25-14,C-25-44), best=3.382921
(C-27-34,C-26-39), avg=11.048978
C-26-13: worst=35.825968 (C-27-18,C-26-04), best=3.382921
(C-25-28,C-27-10), avg=11.002756
C-25-44: worst=35.761833 (C-27-43,C-26-20), best=3.382921
(C-25-18,C-27-01), avg=10.673819
C-25-19: worst=35.789013 (C-27-54,C-27-06), best=3.382206
(C-25-25,C-26-05), avg=10.877252
C-27-56: worst=37.191868 (C-26-43,C-27-59), best=3.381014
(C-27-24,C-27-11), avg=10.994493
C-26-24: worst=35.806894 (C-26-42,C-27-24), best=3.381014
(C-27-03,C-25-12), avg=10.798348

Previously, MVAPICH 0.9.9 (with OFED 1.2.5.5) showed stability (this
example was from a ~280 node test):

C-25-35: worst=34.869000 (C-21-36,C-21-20), best=2.208000
(C-21-28,C-21-28), avg=5.739774
C-21-28: worst=34.914000 (C-25-43,C-25-27), best=2.210000
(C-25-35,C-25-35), avg=5.692484
C-27-63: worst=23.946000 (C-25-13,C-23-44), best=3.123000
(C-21-01,C-27-61), avg=5.673201
C-25-41: worst=34.792000 (C-21-42,C-21-26), best=3.177000
(C-25-43,C-25-39), avg=5.715597
C-25-45: worst=34.944000 (C-22-01,C-21-30), best=3.182000
(C-26-02,C-25-43), avg=5.734901
C-25-37: worst=34.829000 (C-21-38,C-21-22), best=3.183000
(C-25-39,C-25-35), avg=5.715226
C-25-33: worst=34.900000 (C-21-34,C-21-18), best=3.185000
(C-25-35,C-25-31), avg=5.726198
C-25-39: worst=34.939000 (C-21-40,C-21-24), best=3.185000
(C-25-45,C-25-37), avg=5.760403
C-25-43: worst=34.913000 (C-21-44,C-21-28), best=3.185000
(C-25-45,C-25-41), avg=5.774223
C-25-25: worst=34.834000 (C-21-26,C-21-10), best=3.187000
(C-25-27,C-25-23), avg=5.701191
C-25-29: worst=34.935000 (C-21-30,C-21-14), best=3.188000
(C-25-31,C-25-27), avg=5.732307
C-25-17: worst=34.905000 (C-21-18,C-21-02), best=3.193000
(C-25-19,C-25-15), avg=5.712527
C-25-09: worst=34.839000 (C-21-10,C-27-58), best=3.195000
(C-25-11,C-25-07), avg=5.719269
C-25-21: worst=34.826000 (C-21-22,C-21-06), best=3.195000
(C-25-23,C-25-19), avg=5.709025

The bandwidth portion of the test seems unaffected: I expect ~1.5GB/s
best, ~300MB/s worst, and ~600MB/s average.  Here's a sample from the
MVAPICH 1.0.0 test, values in MB/s, with about 120 nodes in the test:

C-27-28: worst=311.533788 (C-26-43,C-25-01), best=1528.480740
(C-27-27,C-27-29), avg=587.437197
C-27-27: worst=304.659190 (C-26-42,C-27-62), best=1528.480740
(C-27-26,C-27-28), avg=578.232542
C-27-26: worst=305.106860 (C-26-43,C-27-59), best=1528.480740
(C-27-25,C-27-27), avg=586.553406
C-27-25: worst=305.004799 (C-26-42,C-27-58), best=1528.369347
(C-27-24,C-27-26), avg=607.532908
C-27-22: worst=305.069134 (C-26-39,C-27-55), best=1528.369347
(C-27-21,C-27-23), avg=581.500478
C-27-21: worst=387.081961 (C-27-33,C-27-09), best=1528.369347
(C-27-20,C-27-22), avg=586.030643
C-27-24: worst=312.083157 (C-26-39,C-27-59), best=1528.257970
(C-27-23,C-27-25), avg=586.508800
C-27-19: worst=389.750871 (C-26-21,C-25-05), best=1528.202288
(C-27-18,C-27-20), avg=610.889360
C-27-23: worst=305.124616 (C-26-40,C-27-56), best=1528.146610
(C-27-22,C-27-24), avg=592.528629
C-27-20: worst=375.241912 (C-26-23,C-25-05), best=1528.146610
(C-27-19,C-27-21), avg=587.685815
C-27-18: worst=381.980984 (C-27-30,C-27-06), best=1528.035265
(C-27-17,C-27-19), avg=617.072824
C-25-34: worst=367.293139 (C-27-34,C-27-01), best=1527.534416
(C-25-33,C-25-35), avg=587.728400

Previous tests (280 nodes, in this case) look about the same...
here's a sample:

C-27-62: worst=324.551124 (C-25-28,C-23-27), best=1528.759294
(C-25-05,C-25-05), avg=513.596860
C-25-05: worst=286.193170 (C-27-20,C-21-35), best=1528.480740
(C-27-62,C-27-62), avg=531.348528
C-26-06: worst=309.232357 (C-22-17,C-21-26), best=1527.200699
(C-26-02,C-26-10), avg=526.550142
C-26-02: worst=308.702059 (C-22-13,C-21-22), best=1527.089492
(C-25-43,C-26-06), avg=517.635786
C-26-10: worst=303.692998 (C-27-26,C-23-39), best=1527.033895
(C-26-06,C-26-14), avg=526.867299
C-26-14: worst=302.966896 (C-27-29,C-23-44), best=1526.255959
(C-26-10,C-26-18), avg=521.675016
C-22-09: worst=353.618467 (C-25-12,C-27-20), best=1526.144890
(C-26-16,C-26-16), avg=564.268143
C-26-26: worst=311.455134 (C-27-18,C-25-34), best=1526.089361
(C-26-22,C-26-30), avg=547.530324
C-26-18: worst=304.969316 (C-27-36,C-23-45), best=1526.089361
(C-26-14,C-26-22), avg=515.463309
C-26-22: worst=302.152809 (C-25-02,C-27-42), best=1526.033837
(C-26-18,C-26-26), avg=529.672528
C-26-13: worst=299.106027 (C-27-28,C-23-43), best=1525.978316
(C-26-12,C-26-14), avg=549.159212
C-21-08: worst=292.192329 (C-25-11,C-25-19), best=1525.978316
(C-25-15,C-25-15), avg=539.609085
C-25-15: worst=270.248064 (C-21-04,C-21-12), best=1525.922800
(C-21-08,C-21-08), avg=523.654988
C-26-16: worst=293.111198 (C-21-30,C-22-33), best=1525.867288
(C-22-09,C-22-09), avg=488.649320
C-26-11: worst=300.619544 (C-27-29,C-23-38), best=1525.811779
(C-26-10,C-26-12), avg=496.530124
C-25-24: worst=317.670885 (C-22-08,C-27-40), best=1525.756275
(C-21-17,C-21-17), avg=510.542500

The test itself may be to blame.  The test is run with one rank per
node.  The idea is to get a bisectional test where each node is
exclusively sending to one node and exclusively receiving from
another, with all nodes sending/receiving simultaneously; note that
the send partner and receive partner will most likely differ in the
sendrecv call, which lets the test measure bisectional bandwidth even
with an odd number of nodes.  The latency test sends/receives zero
bytes 1000 times; the bandwidth test sends 4MB 10 times.  Over the
iterations, every rank eventually sends to and receives from every
other rank, but not every send/recv combination gets enumerated
(when nodes > 2).
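
To make the pairing concrete, here's a minimal standalone sketch (no
MPI; it isn't part of the test, and wsize is just an example value) of
the partner schedule that the loop further down produces, assuming the
send-to rank walks up and the receive-from rank walks down:

  #include <stdio.h>

  int main(void) {
      int wsize = 5;  /* example rank count; an odd count still pairs cleanly */

      /* At step k of the outer loop, rank r sends to (r+k)%wsize and
         receives from (r-k+wsize)%wsize.  The send map printed below is
         a permutation at every step, so each rank always has exactly one
         exclusive send partner and one exclusive receive partner. */
      for (int k = 1; k < wsize; k++) {
          printf("step %d:", k);
          for (int r = 0; r < wsize; r++)
              printf("  %d->%d", r, (r + k) % wsize);
          printf("\n");
      }
      return 0;
  }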

While you'd expect a fat-tree switch to deliver full bisectional
bandwidth, it never does; that's a problem with the static routing
done by the subnet manager (opensm).  Given that the average is ~1/3
of the best bandwidth, I interpret that to mean that, on average, a
rank is being blocked by two other ranks.  The worst case shows
roughly 5 or 6 ranks blocking each other.
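
As a sanity check on that interpretation, here's the arithmetic in a
trivial standalone form (the figures are just the rough
best/average/worst MB/s from the MVAPICH 1.0.0 sample above; best
divided by observed approximates how many flows share the congested
link):

  #include <stdio.h>

  int main(void) {
      double best  = 1528.0;   /* ~best MB/s from the sample above */
      double avg   =  587.0;   /* ~average MB/s                    */
      double worst =  305.0;   /* ~worst MB/s                      */

      /* best/observed ~= flows sharing a link */
      printf("average case: ~%.1f flows per link\n", best / avg);    /* ~2.6 */
      printf("worst case:   ~%.1f flows per link\n", best / worst);  /* ~5.0 */
      return 0;
  }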

The routine goes through a "for" loop starting at the current rank:
the send-to rank (lo) increases and the receive-from rank (hi)
decreases (both circularly) on each iteration until it wraps back
around to the current rank.  The core of the routine looks like:

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &wsize);
  MPI_Comm_rank(MPI_COMM_WORLD, &me);

  /* lo (the send-to rank) walks up from me+1 and hi (the receive-from
     rank) walks down from me-1, both wrapping circularly, until hi
     comes back around to me. */
  for (hi = (me == 0) ? wsize - 1 : me - 1,
       lo = (me + 1 == wsize) ? 0 : me + 1;
         hi != me;
           hi = (hi == 0) ? wsize - 1 : hi - 1,
           lo = (lo + 1 == wsize) ? 0 : lo + 1) {

        MPI_Barrier(MPI_COMM_WORLD);

        start = MPI_Wtime();

        /* time "iters" simultaneous exchanges with this partner pair;
           comBuf serves as both the send and the receive buffer */
        for (i = 0; i < iters; i++) {
          MPI_Sendrecv(&comBuf, size, MPI_CHAR, lo, 0,
                       &comBuf, size, MPI_CHAR, hi, 0,
                       MPI_COMM_WORLD, &stat);
        }
        diff = MPI_Wtime() - start;
        sum += diff;
        n++;

        /* track the fastest and slowest partner pairs for the report */
        if (diff < min) {
          minnode1 = lo;
          minnode2 = hi;
          min = diff;
        }
        if (diff > max) {
          maxnode1 = lo;
          maxnode2 = hi;
          max = diff;
        }
  }
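
For reference, the reported figures come out of each timed segment
roughly like this (a sketch only, reusing the names from the excerpt
above; the latency run uses size = 0 with iters = 1000, the bandwidth
run uses size = 4MB with iters = 10):

  #include <stdio.h>

  /* Illustrative conversion of a timed segment into the reported
     units: per-iteration time in usecs for the latency run, and bytes
     moved per second (as MB/s) for the bandwidth run. */
  static double to_usec(double diff, int iters) {
      return diff * 1e6 / iters;
  }

  static double to_mbps(double diff, int iters, double bytes) {
      return bytes * iters / diff / 1e6;
  }

  int main(void) {
      /* e.g. 1000 zero-byte exchanges in 2.7ms -> 2.7 usec each,
         and 10 4MB exchanges in 0.07s -> ~599 MB/s */
      printf("%.3f usec, %.1f MB/s\n",
             to_usec(0.0027, 1000),
             to_mbps(0.07, 10, 4.0 * 1024 * 1024));
      return 0;
  }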

At the end of the test, the best, worst, and average cases are
reported for each node/rank, along with the node names associated with
that best/worst event.  So, if there is an issue with a node, you'd
expect that node to show up in multiple reports, as a single reported
event only narrows the culprit down to two.

Any ideas would be appreciated.

Chris

