[mvapich-discuss] Very bad latency scaling on bisectional bandwidth test with OFED 1.3/MVAPICH 1.0.0

Chris Worley worleys at gmail.com
Sat Mar 29 18:04:04 EDT 2008


On Fri, Mar 28, 2008 at 10:34 PM, Dhabaleswar Panda
<panda at cse.ohio-state.edu> wrote:
> Chris,
>
>  Thanks for your note. We will take a look at it.

It seems one of your folks has provided a solution.

> Do you also see this
>  trend with mvapich 1.0.0 + OFED 1.2.5.5?

We didn't start using MVAPICH 1.0.0 until we upgraded to OFED 1.3.

>  Will it be possible to get
>  numbers for this combination? This will clearly tell us whether this issue
>  is happening because of mvapich 1.0.0 only or because of some interactions
>  between OFED 1.3 and mvapich 1.0.0.

After resetting my PATH and LD_LIBRARY_PATH back to the old MVAPICH 0.9.9
built with OFED 1.2.5.5 (without recompiling my executable), I see the
same effect, though not as pronounced.  For example:

C-23-07: worst=115.522000 (C-25-37,C-27-09), best=2.893000
(C-23-11,C-22-28), avg=8.501770
C-27-15: worst=115.545000 (C-23-13,C-26-21), best=2.772000
(C-27-18,C-27-13), avg=8.599698
C-26-18: worst=115.534000 (C-27-12,C-25-24), best=2.775000
(C-26-20,C-26-16), avg=8.548942
C-22-27: worst=115.556000 (C-25-32,C-27-04), best=2.806000
(C-23-04,C-22-25), avg=8.117748
C-22-23: worst=115.506000 (C-25-28,C-26-45), best=2.727000
(C-22-27,C-21-26), avg=8.668741
C-21-08: worst=115.543000 (C-25-19,C-26-36), best=2.783000
(C-21-25,C-21-04), avg=8.021223
C-27-20: worst=115.572000 (C-25-08,C-26-25), best=2.773000
(C-27-22,C-27-18), avg=7.997187
C-26-15: worst=115.516000 (C-27-09,C-25-21), best=2.796000
(C-26-19,C-26-11), avg=8.661647
C-21-25: worst=115.557000 (C-25-23,C-26-40), best=2.883000
(C-22-23,C-21-07), avg=8.459388

... those "worst case" times should be ~30 usecs.

Chris
>
>  Thanks,
>
>  DK
>
>
>
>  On Fri, 28 Mar 2008, Chris Worley wrote:
>
>  > I just upgraded to OFED 1.3 and MVAPICH 1.0.0 (from OFED 1.2.5.5 and
>  > MVAPICH 0.9.9).  I'm using ConnectX cards.  IB diagnostics show no
>  > fabric issues.
>  >
>  > I have a test for bisectional bandwidth and latency; the latency test
>  > is showing very poor worst-case results repeatably as the node count
>  > goes over ~100.  Other MPI implementations (that will remain nameless
>  > as they normally don't perform as well as MVAPICH) don't have this
>  > issue... so I don't think it's strictly an OFED 1.3 issue.
>  >
>  > What I'm seeing is worst-case latency in the msecs(!) for all
>  > nodes.  Here's a sample of the current results (only testing ~120
>  > nodes), all times in usecs:
>  >
>  > C-25-38: worst=1038.127000 (C-27-06,C-27-37), best=2.688000
>  > (C-25-41,C-25-35), avg=31.816815
>  > C-25-29: worst=1037.159000 (C-26-42,C-27-28), best=2.694000
>  > (C-25-32,C-25-26), avg=31.645870
>  > C-25-26: worst=1038.052000 (C-26-39,C-27-25), best=2.695000
>  > (C-25-29,C-25-23), avg=31.757562
>  > C-25-41: worst=1037.349000 (C-27-09,C-27-40), best=2.695000
>  > (C-25-44,C-25-38), avg=31.776089
>  > C-25-17: worst=1036.924000 (C-26-30,C-27-16), best=2.697000
>  > (C-25-20,C-25-14), avg=31.664692
>  > C-26-05: worst=1037.973000 (C-27-18,C-27-54), best=2.697000
>  > (C-26-08,C-26-02), avg=31.809110
>  > C-26-17: worst=1038.095000 (C-27-30,C-25-04), best=2.703000
>  > (C-26-20,C-26-14), avg=31.774685
>  > C-26-20: worst=1038.225000 (C-27-33,C-25-07), best=2.704000
>  > (C-26-23,C-26-17), avg=30.208007
>  > C-25-14: worst=1037.357000 (C-26-27,C-27-13), best=2.705000
>  > (C-25-17,C-25-11), avg=31.576705
>  > C-26-08: worst=1038.058000 (C-27-21,C-27-57), best=2.705000
>  > (C-26-11,C-26-05), avg=31.639445
>  > C-27-20: worst=1037.043000 (C-25-21,C-26-07), best=2.706000
>  > (C-27-23,C-27-17), avg=31.819363
>  > C-27-23: worst=1037.909000 (C-25-24,C-26-10), best=2.706000
>  > (C-27-26,C-27-20), avg=31.714664
>  > C-26-32: worst=1037.963000 (C-27-45,C-25-19), best=2.707000
>  > (C-26-35,C-26-29), avg=31.694966
>  > C-26-41: worst=1037.817000 (C-27-59,C-25-28), best=2.707000
>  > (C-26-44,C-26-38), avg=31.674466
>  > C-27-08: worst=1037.036000 (C-25-09,C-25-40), best=2.708000
>  > (C-27-11,C-27-05), avg=31.781712
>  > C-26-29: worst=1038.093000 (C-27-42,C-25-16), best=2.709000
>  > (C-26-32,C-26-26), avg=31.812582
>  > C-27-11: worst=1038.136000 (C-25-12,C-25-43), best=2.710000
>  > (C-27-14,C-27-08), avg=31.653336
>  > C-26-44: worst=1038.070000 (C-27-62,C-25-31), best=2.711000
>  > (C-27-02,C-26-41), avg=31.666521
>  > C-27-32: worst=1037.563000 (C-25-33,C-26-19), best=2.712000
>  > (C-27-35,C-27-29), avg=31.778705
>  > C-25-44: worst=1036.881000 (C-27-12,C-27-43), best=2.713000
>  > (C-26-02,C-25-41), avg=31.752103
>  >
>  > While other MPI implementations running under OFED 1.3 on the same
>  > node set show more stability:
>  >
>  > C-25-30: worst=35.822153 (C-27-39,C-27-19), best=3.398895
>  > (C-25-10,C-26-34), avg=11.128738
>  > C-26-11: worst=35.799026 (C-27-22,C-26-39), best=3.398180
>  > (C-25-14,C-27-05), avg=10.864981
>  > C-27-36: worst=35.802126 (C-26-02,C-25-11), best=3.396034
>  > (C-26-34,C-27-38), avg=10.694269
>  > C-25-16: worst=35.804033 (C-27-51,C-27-26), best=3.391981
>  > (C-25-20,C-26-27), avg=11.112461
>  > C-25-10: worst=35.800934 (C-27-37,C-27-53), best=3.388882
>  > (C-26-23,C-25-30), avg=10.956828
>  > C-27-08: worst=35.817862 (C-26-30,C-25-25), best=3.388882
>  > (C-26-06,C-27-54), avg=10.765788
>  > C-27-11: worst=35.810947 (C-25-31,C-26-15), best=3.386974
>  > (C-27-56,C-26-23), avg=11.048172
>  > C-26-10: worst=35.799026 (C-27-20,C-26-40), best=3.386974
>  > (C-25-01,C-27-04), avg=10.720228
>  > C-26-31: worst=37.193060 (C-25-33,C-26-43), best=3.386021
>  > (C-27-27,C-25-20), avg=10.809506
>  > C-26-23: worst=35.791874 (C-26-35,C-27-17), best=3.386021
>  > (C-27-11,C-25-10), avg=10.747612
>  > C-25-20: worst=35.769939 (C-27-28,C-27-61), best=3.385782
>  > (C-26-31,C-25-16), avg=11.007588
>  > C-27-37: worst=35.790205 (C-26-03,C-25-10), best=3.385067
>  > (C-26-35,C-27-39), avg=10.998216
>  > C-27-54: worst=35.786867 (C-25-29,C-25-19), best=3.384829
>  > (C-27-08,C-27-35), avg=11.109948
>  > C-27-43: worst=35.766125 (C-25-14,C-25-44), best=3.382921
>  > (C-27-34,C-26-39), avg=11.048978
>  > C-26-13: worst=35.825968 (C-27-18,C-26-04), best=3.382921
>  > (C-25-28,C-27-10), avg=11.002756
>  > C-25-44: worst=35.761833 (C-27-43,C-26-20), best=3.382921
>  > (C-25-18,C-27-01), avg=10.673819
>  > C-25-19: worst=35.789013 (C-27-54,C-27-06), best=3.382206
>  > (C-25-25,C-26-05), avg=10.877252
>  > C-27-56: worst=37.191868 (C-26-43,C-27-59), best=3.381014
>  > (C-27-24,C-27-11), avg=10.994493
>  > C-26-24: worst=35.806894 (C-26-42,C-27-24), best=3.381014
>  > (C-27-03,C-25-12), avg=10.798348
>  >
>  > Previously, MVAPICH 0.9.9 (with OFED 1.2.5.5) showed stability (this
>  > example was from a ~280-node test):
>  >
>  > C-25-35: worst=34.869000 (C-21-36,C-21-20), best=2.208000
>  > (C-21-28,C-21-28), avg=5.739774
>  > C-21-28: worst=34.914000 (C-25-43,C-25-27), best=2.210000
>  > (C-25-35,C-25-35), avg=5.692484
>  > C-27-63: worst=23.946000 (C-25-13,C-23-44), best=3.123000
>  > (C-21-01,C-27-61), avg=5.673201
>  > C-25-41: worst=34.792000 (C-21-42,C-21-26), best=3.177000
>  > (C-25-43,C-25-39), avg=5.715597
>  > C-25-45: worst=34.944000 (C-22-01,C-21-30), best=3.182000
>  > (C-26-02,C-25-43), avg=5.734901
>  > C-25-37: worst=34.829000 (C-21-38,C-21-22), best=3.183000
>  > (C-25-39,C-25-35), avg=5.715226
>  > C-25-33: worst=34.900000 (C-21-34,C-21-18), best=3.185000
>  > (C-25-35,C-25-31), avg=5.726198
>  > C-25-39: worst=34.939000 (C-21-40,C-21-24), best=3.185000
>  > (C-25-45,C-25-37), avg=5.760403
>  > C-25-43: worst=34.913000 (C-21-44,C-21-28), best=3.185000
>  > (C-25-45,C-25-41), avg=5.774223
>  > C-25-25: worst=34.834000 (C-21-26,C-21-10), best=3.187000
>  > (C-25-27,C-25-23), avg=5.701191
>  > C-25-29: worst=34.935000 (C-21-30,C-21-14), best=3.188000
>  > (C-25-31,C-25-27), avg=5.732307
>  > C-25-17: worst=34.905000 (C-21-18,C-21-02), best=3.193000
>  > (C-25-19,C-25-15), avg=5.712527
>  > C-25-09: worst=34.839000 (C-21-10,C-27-58), best=3.195000
>  > (C-25-11,C-25-07), avg=5.719269
>  > C-25-21: worst=34.826000 (C-21-22,C-21-06), best=3.195000
>  > (C-25-23,C-25-19), avg=5.709025
>  >
>  > The bandwidth portion of the test seems unaffected; I expect ~1.5GB/s
>  > best, ~300MB/s worst, and ~600MB/s average.  Here's a sample from the
>  > MVAPICH 1.0.0 test, values in MB/s, with about 120 nodes in the test:
>  >
>  > C-27-28: worst=311.533788 (C-26-43,C-25-01), best=1528.480740
>  > (C-27-27,C-27-29), avg=587.437197
>  > C-27-27: worst=304.659190 (C-26-42,C-27-62), best=1528.480740
>  > (C-27-26,C-27-28), avg=578.232542
>  > C-27-26: worst=305.106860 (C-26-43,C-27-59), best=1528.480740
>  > (C-27-25,C-27-27), avg=586.553406
>  > C-27-25: worst=305.004799 (C-26-42,C-27-58), best=1528.369347
>  > (C-27-24,C-27-26), avg=607.532908
>  > C-27-22: worst=305.069134 (C-26-39,C-27-55), best=1528.369347
>  > (C-27-21,C-27-23), avg=581.500478
>  > C-27-21: worst=387.081961 (C-27-33,C-27-09), best=1528.369347
>  > (C-27-20,C-27-22), avg=586.030643
>  > C-27-24: worst=312.083157 (C-26-39,C-27-59), best=1528.257970
>  > (C-27-23,C-27-25), avg=586.508800
>  > C-27-19: worst=389.750871 (C-26-21,C-25-05), best=1528.202288
>  > (C-27-18,C-27-20), avg=610.889360
>  > C-27-23: worst=305.124616 (C-26-40,C-27-56), best=1528.146610
>  > (C-27-22,C-27-24), avg=592.528629
>  > C-27-20: worst=375.241912 (C-26-23,C-25-05), best=1528.146610
>  > (C-27-19,C-27-21), avg=587.685815
>  > C-27-18: worst=381.980984 (C-27-30,C-27-06), best=1528.035265
>  > (C-27-17,C-27-19), avg=617.072824
>  > C-25-34: worst=367.293139 (C-27-34,C-27-01), best=1527.534416
>  > (C-25-33,C-25-35), avg=587.728400
>  >
>  > Previous tests (280 nodes, in this case) look about the same...
>  > here's a sample:
>  >
>  > C-27-62: worst=324.551124 (C-25-28,C-23-27), best=1528.759294
>  > (C-25-05,C-25-05), avg=513.596860
>  > C-25-05: worst=286.193170 (C-27-20,C-21-35), best=1528.480740
>  > (C-27-62,C-27-62), avg=531.348528
>  > C-26-06: worst=309.232357 (C-22-17,C-21-26), best=1527.200699
>  > (C-26-02,C-26-10), avg=526.550142
>  > C-26-02: worst=308.702059 (C-22-13,C-21-22), best=1527.089492
>  > (C-25-43,C-26-06), avg=517.635786
>  > C-26-10: worst=303.692998 (C-27-26,C-23-39), best=1527.033895
>  > (C-26-06,C-26-14), avg=526.867299
>  > C-26-14: worst=302.966896 (C-27-29,C-23-44), best=1526.255959
>  > (C-26-10,C-26-18), avg=521.675016
>  > C-22-09: worst=353.618467 (C-25-12,C-27-20), best=1526.144890
>  > (C-26-16,C-26-16), avg=564.268143
>  > C-26-26: worst=311.455134 (C-27-18,C-25-34), best=1526.089361
>  > (C-26-22,C-26-30), avg=547.530324
>  > C-26-18: worst=304.969316 (C-27-36,C-23-45), best=1526.089361
>  > (C-26-14,C-26-22), avg=515.463309
>  > C-26-22: worst=302.152809 (C-25-02,C-27-42), best=1526.033837
>  > (C-26-18,C-26-26), avg=529.672528
>  > C-26-13: worst=299.106027 (C-27-28,C-23-43), best=1525.978316
>  > (C-26-12,C-26-14), avg=549.159212
>  > C-21-08: worst=292.192329 (C-25-11,C-25-19), best=1525.978316
>  > (C-25-15,C-25-15), avg=539.609085
>  > C-25-15: worst=270.248064 (C-21-04,C-21-12), best=1525.922800
>  > (C-21-08,C-21-08), avg=523.654988
>  > C-26-16: worst=293.111198 (C-21-30,C-22-33), best=1525.867288
>  > (C-22-09,C-22-09), avg=488.649320
>  > C-26-11: worst=300.619544 (C-27-29,C-23-38), best=1525.811779
>  > (C-26-10,C-26-12), avg=496.530124
>  > C-25-24: worst=317.670885 (C-22-08,C-27-40), best=1525.756275
>  > (C-21-17,C-21-17), avg=510.542500
>  >
>  > The test itself may be to blame.  The test is run w/ one rank per
>  > node.  The idea is a bisectional test where each node is exclusively
>  > sending to one node and exclusively receiving from another, with all
>  > nodes sending/receiving simultaneously; note that the send and
>  > receive partners in the sendrecv call will most likely be different
>  > nodes, which allows testing bisectional bandwidth with an odd number
>  > of nodes.  The latency test sends/receives zero bytes 1000 times; the
>  > bandwidth test sends 4MB 10 times.  Iteratively, every rank will
>  > eventually send to and receive from every other rank, but not all
>  > send/recv partner combinations will be enumerated (where nodes>2).
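>  >
>  > For reference, a minimal sketch of how the reported per-pair numbers
>  > could be derived from one timed block of exchanges (the helper name
>  > and the exact unit conversions are assumptions for illustration, not
>  > taken from the actual test):
>  >
>  >   /* Hypothetical helper: convert the elapsed time of 'iters'
>  >    * exchanges of 'size' bytes into per-exchange latency (usecs)
>  >    * and per-direction bandwidth (MB/s). */
>  >   void to_metrics(double elapsed, int iters, long size,
>  >                   double *lat_usec, double *bw_MBps)
>  >   {
>  >     /* e.g. size = 0 and iters = 1000 for the latency test,
>  >      * size = 4*1024*1024 and iters = 10 for the bandwidth test */
>  >     *lat_usec = elapsed / (double)iters * 1.0e6;
>  >     *bw_MBps  = (double)size * iters / elapsed / 1.0e6;
>  >   }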
>  >
>  > While you'd expect a fat-tree switch to deliver full bisectional
>  > bandwidth, it never does; that's a problem with static routing in the
>  > subnet manager (opensm).  Given that the average is ~1/3 of the best
>  > bandwidth, I interpret that to mean that on average a rank is being
>  > blocked by two other ranks.  The worst case shows roughly 5 or 6
>  > ranks blocking each other.
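>  >
>  > As a rough sanity check on that reading, using the ~120-node
>  > bandwidth numbers above (best ~1528 MB/s, average ~590 MB/s,
>  > worst ~305 MB/s):
>  >
>  >   1528 / 590  ~= 2.6   -> on average ~3 flows share a link
>  >                           (2 other ranks blocking)
>  >   1528 / 305  ~= 5.0   -> worst case ~5 flows share a link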
>  >
>  > The routine goes through a "for" loop starting at the current rank:
>  > the send-partner rank increases and the recv-partner rank decreases
>  > (both circularly) on each iteration, until they wrap back around to
>  > the current rank.  The core of the routine looks like:
>  >
>  >   MPI_Init(&argc, &argv);
>  >   MPI_Comm_size(MPI_COMM_WORLD, &wsize);
>  >   MPI_Comm_rank(MPI_COMM_WORLD, &me);
>  >   for(hi = (me == 0) ? wsize - 1 : me - 1,
>  >       lo = (me + 1 == wsize) ? 0 : me + 1;
>  >         hi != me;
>  >           hi = (hi == 0) ? wsize - 1 : hi - 1,
>  >           lo = (lo + 1 == wsize) ? 0 : lo + 1) {
>  >
>  >         MPI_Barrier(MPI_COMM_WORLD);
>  >
>  >         start = MPI_Wtime();
>  >
>  >         for ( i = 0; i < iters; i++ ) {
>  >           MPI_Sendrecv(&comBuf, size, MPI_CHAR, lo, 0, &comBuf, size,
>  >                        MPI_CHAR, hi, 0, MPI_COMM_WORLD, &stat);
>  >         }
>  >         diff = MPI_Wtime() - start;
>  >         sum += diff;
>  >         n++;
>  >         if (diff < min) {
>  >           minnode1 = lo;
>  >           minnode2 = hi;
>  >           min = diff;
>  >         }
>  >         if (diff > max) {
>  >           maxnode1 = lo;
>  >           maxnode2 = hi;
>  >           max = diff;
>  >         }
>  >   }
>  >
>  > At the end of the test, the best, worst, and average cases are
>  > reported for each node/rank, along with the node names associated with
>  > that best/worst event.  So, if there is an issue with a node, you'd
>  > expect that node to show up in multiple reports, as a single reported
>  > event only narrows the culprit down to two.
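>  >
>  > For anyone who wants to try the same pattern locally, a minimal
>  > self-contained version of that loop might look like the sketch
>  > below.  Only the pairing loop is taken from the excerpt above; the
>  > variable declarations, the separate send/receive buffers, the
>  > size/iters values, and the final printf (which reports ranks rather
>  > than node names) are filled in as assumptions:
>  >
>  >   #include <mpi.h>
>  >   #include <stdio.h>
>  >   #include <stdlib.h>
>  >   #include <float.h>
>  >
>  >   int main(int argc, char **argv)
>  >   {
>  >     /* latency mode: zero-byte messages, 1000 iterations per pairing;
>  >      * for bandwidth use size = 4*1024*1024 and iters = 10 */
>  >     int size = 0, iters = 1000;
>  >     int wsize, me, lo, hi, i, n = 0;
>  >     int minnode1 = -1, minnode2 = -1, maxnode1 = -1, maxnode2 = -1;
>  >     double start, diff, sum = 0.0, min = DBL_MAX, max = 0.0;
>  >     char *sbuf, *rbuf;
>  >     MPI_Status stat;
>  >
>  >     MPI_Init(&argc, &argv);
>  >     MPI_Comm_size(MPI_COMM_WORLD, &wsize);
>  >     MPI_Comm_rank(MPI_COMM_WORLD, &me);
>  >
>  >     /* disjoint send/receive buffers (MPI_Sendrecv requires them) */
>  >     sbuf = malloc(size ? size : 1);
>  >     rbuf = malloc(size ? size : 1);
>  >
>  >     for (hi = (me == 0) ? wsize - 1 : me - 1,
>  >          lo = (me + 1 == wsize) ? 0 : me + 1;
>  >            hi != me;
>  >              hi = (hi == 0) ? wsize - 1 : hi - 1,
>  >              lo = (lo + 1 == wsize) ? 0 : lo + 1) {
>  >
>  >       MPI_Barrier(MPI_COMM_WORLD);
>  >       start = MPI_Wtime();
>  >       for (i = 0; i < iters; i++)
>  >         MPI_Sendrecv(sbuf, size, MPI_CHAR, lo, 0,
>  >                      rbuf, size, MPI_CHAR, hi, 0,
>  >                      MPI_COMM_WORLD, &stat);
>  >       diff = MPI_Wtime() - start;
>  >       sum += diff;
>  >       n++;
>  >       if (diff < min) { minnode1 = lo; minnode2 = hi; min = diff; }
>  >       if (diff > max) { maxnode1 = lo; maxnode2 = hi; max = diff; }
>  >     }
>  >
>  >     /* report per-exchange times in usecs (assumes at least 2 ranks) */
>  >     printf("rank %d: worst=%f (%d,%d), best=%f (%d,%d), avg=%f\n",
>  >            me, max / iters * 1e6, maxnode1, maxnode2,
>  >            min / iters * 1e6, minnode1, minnode2,
>  >            sum / n / iters * 1e6);
>  >
>  >     free(sbuf);
>  >     free(rbuf);
>  >     MPI_Finalize();
>  >     return 0;
>  >   }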
>  >
>  > Any ideas would be appreciated.
>  >
>  > Chris
>
>

