[mvapich-discuss] Very bad latency scaling on bisectional
bandwidth test with OFED 1.3/MVAPICH 1.0.0
Chris Worley
worleys at gmail.com
Sat Mar 29 18:04:04 EDT 2008
On Fri, Mar 28, 2008 at 10:34 PM, Dhabaleswar Panda
<panda at cse.ohio-state.edu> wrote:
> Chris,
>
> Thanks for your note. We will take a look at it.
It seems one of your folks has already provided a solution.
> Do you also see this
> trend with mvapich 1.0.0 + OFED 1.2.5.5?
We didn't start using MVAPICH 1.0.0 until we upgraded to OFED 1.3.
> Will it be possible to get
> numbers for this combination? This will clearly tell us whether this issue
> is happening because of mvapich 1.0.0 only or because of some interactions
> between OFED 1.3 and mvapich 1.0.0.
Resetting my PATH and LD_LIBRARY_PATH to the old MVAPICH 0.9.9 built
with OFED 1.2.5.5, but not recompiling my executable, I see the same
effect, though less pronounced. For example:
C-23-07: worst=115.522000 (C-25-37,C-27-09), best=2.893000
(C-23-11,C-22-28), avg=8.501770
C-27-15: worst=115.545000 (C-23-13,C-26-21), best=2.772000
(C-27-18,C-27-13), avg=8.599698
C-26-18: worst=115.534000 (C-27-12,C-25-24), best=2.775000
(C-26-20,C-26-16), avg=8.548942
C-22-27: worst=115.556000 (C-25-32,C-27-04), best=2.806000
(C-23-04,C-22-25), avg=8.117748
C-22-23: worst=115.506000 (C-25-28,C-26-45), best=2.727000
(C-22-27,C-21-26), avg=8.668741
C-21-08: worst=115.543000 (C-25-19,C-26-36), best=2.783000
(C-21-25,C-21-04), avg=8.021223
C-27-20: worst=115.572000 (C-25-08,C-26-25), best=2.773000
(C-27-22,C-27-18), avg=7.997187
C-26-15: worst=115.516000 (C-27-09,C-25-21), best=2.796000
(C-26-19,C-26-11), avg=8.661647
C-21-25: worst=115.557000 (C-25-23,C-26-40), best=2.883000
(C-22-23,C-21-07), avg=8.459388
... those "worst cases" should be ~30 usecs.
Chris
>
> Thanks,
>
> DK
>
>
>
> On Fri, 28 Mar 2008, Chris Worley wrote:
>
> > I just upgraded to OFED 1.3 and MVAPICH 1.0.0 (from OFED 1.2.5.5 and
> > MVAPICH 0.9.9). I'm using ConnectX cards. IB diagnostics show no
> > fabric issues.
> >
> > I have a test for bisectional bandwidth and latency; the latency test
> > is showing very poor worst-case results repeatably as the node count
> > goes over ~100. Other MPI implementations (that will remain nameless
> > as they normally don't perform as well as MVAPICH) don't have this
> > issue... so I don't think it's strictly an OFED 1.3 issue.
> >
> > What I'm seeing shows worst-case latency (in the msecs!) for all
> > nodes. Here's a sample of the current results (only testing ~120
> > nodes), all times in usecs:
> >
> > C-25-38: worst=1038.127000 (C-27-06,C-27-37), best=2.688000
> > (C-25-41,C-25-35), avg=31.816815
> > C-25-29: worst=1037.159000 (C-26-42,C-27-28), best=2.694000
> > (C-25-32,C-25-26), avg=31.645870
> > C-25-26: worst=1038.052000 (C-26-39,C-27-25), best=2.695000
> > (C-25-29,C-25-23), avg=31.757562
> > C-25-41: worst=1037.349000 (C-27-09,C-27-40), best=2.695000
> > (C-25-44,C-25-38), avg=31.776089
> > C-25-17: worst=1036.924000 (C-26-30,C-27-16), best=2.697000
> > (C-25-20,C-25-14), avg=31.664692
> > C-26-05: worst=1037.973000 (C-27-18,C-27-54), best=2.697000
> > (C-26-08,C-26-02), avg=31.809110
> > C-26-17: worst=1038.095000 (C-27-30,C-25-04), best=2.703000
> > (C-26-20,C-26-14), avg=31.774685
> > C-26-20: worst=1038.225000 (C-27-33,C-25-07), best=2.704000
> > (C-26-23,C-26-17), avg=30.208007
> > C-25-14: worst=1037.357000 (C-26-27,C-27-13), best=2.705000
> > (C-25-17,C-25-11), avg=31.576705
> > C-26-08: worst=1038.058000 (C-27-21,C-27-57), best=2.705000
> > (C-26-11,C-26-05), avg=31.639445
> > C-27-20: worst=1037.043000 (C-25-21,C-26-07), best=2.706000
> > (C-27-23,C-27-17), avg=31.819363
> > C-27-23: worst=1037.909000 (C-25-24,C-26-10), best=2.706000
> > (C-27-26,C-27-20), avg=31.714664
> > C-26-32: worst=1037.963000 (C-27-45,C-25-19), best=2.707000
> > (C-26-35,C-26-29), avg=31.694966
> > C-26-41: worst=1037.817000 (C-27-59,C-25-28), best=2.707000
> > (C-26-44,C-26-38), avg=31.674466
> > C-27-08: worst=1037.036000 (C-25-09,C-25-40), best=2.708000
> > (C-27-11,C-27-05), avg=31.781712
> > C-26-29: worst=1038.093000 (C-27-42,C-25-16), best=2.709000
> > (C-26-32,C-26-26), avg=31.812582
> > C-27-11: worst=1038.136000 (C-25-12,C-25-43), best=2.710000
> > (C-27-14,C-27-08), avg=31.653336
> > C-26-44: worst=1038.070000 (C-27-62,C-25-31), best=2.711000
> > (C-27-02,C-26-41), avg=31.666521
> > C-27-32: worst=1037.563000 (C-25-33,C-26-19), best=2.712000
> > (C-27-35,C-27-29), avg=31.778705
> > C-25-44: worst=1036.881000 (C-27-12,C-27-43), best=2.713000
> > (C-26-02,C-25-41), avg=31.752103
> >
> > While other MPI implementations running under OFED 1.3 on the same
> > node set show more stability:
> >
> > C-25-30: worst=35.822153 (C-27-39,C-27-19), best=3.398895
> > (C-25-10,C-26-34), avg=11.128738
> > C-26-11: worst=35.799026 (C-27-22,C-26-39), best=3.398180
> > (C-25-14,C-27-05), avg=10.864981
> > C-27-36: worst=35.802126 (C-26-02,C-25-11), best=3.396034
> > (C-26-34,C-27-38), avg=10.694269
> > C-25-16: worst=35.804033 (C-27-51,C-27-26), best=3.391981
> > (C-25-20,C-26-27), avg=11.112461
> > C-25-10: worst=35.800934 (C-27-37,C-27-53), best=3.388882
> > (C-26-23,C-25-30), avg=10.956828
> > C-27-08: worst=35.817862 (C-26-30,C-25-25), best=3.388882
> > (C-26-06,C-27-54), avg=10.765788
> > C-27-11: worst=35.810947 (C-25-31,C-26-15), best=3.386974
> > (C-27-56,C-26-23), avg=11.048172
> > C-26-10: worst=35.799026 (C-27-20,C-26-40), best=3.386974
> > (C-25-01,C-27-04), avg=10.720228
> > C-26-31: worst=37.193060 (C-25-33,C-26-43), best=3.386021
> > (C-27-27,C-25-20), avg=10.809506
> > C-26-23: worst=35.791874 (C-26-35,C-27-17), best=3.386021
> > (C-27-11,C-25-10), avg=10.747612
> > C-25-20: worst=35.769939 (C-27-28,C-27-61), best=3.385782
> > (C-26-31,C-25-16), avg=11.007588
> > C-27-37: worst=35.790205 (C-26-03,C-25-10), best=3.385067
> > (C-26-35,C-27-39), avg=10.998216
> > C-27-54: worst=35.786867 (C-25-29,C-25-19), best=3.384829
> > (C-27-08,C-27-35), avg=11.109948
> > C-27-43: worst=35.766125 (C-25-14,C-25-44), best=3.382921
> > (C-27-34,C-26-39), avg=11.048978
> > C-26-13: worst=35.825968 (C-27-18,C-26-04), best=3.382921
> > (C-25-28,C-27-10), avg=11.002756
> > C-25-44: worst=35.761833 (C-27-43,C-26-20), best=3.382921
> > (C-25-18,C-27-01), avg=10.673819
> > C-25-19: worst=35.789013 (C-27-54,C-27-06), best=3.382206
> > (C-25-25,C-26-05), avg=10.877252
> > C-27-56: worst=37.191868 (C-26-43,C-27-59), best=3.381014
> > (C-27-24,C-27-11), avg=10.994493
> > C-26-24: worst=35.806894 (C-26-42,C-27-24), best=3.381014
> > (C-27-03,C-25-12), avg=10.798348
> >
> > Previously, MVAPICH 0.9.9 (with OFED 1.2.5.5) showed stability (this
> > example was on ~280 node test):
> >
> > C-25-35: worst=34.869000 (C-21-36,C-21-20), best=2.208000
> > (C-21-28,C-21-28), avg=5.739774
> > C-21-28: worst=34.914000 (C-25-43,C-25-27), best=2.210000
> > (C-25-35,C-25-35), avg=5.692484
> > C-27-63: worst=23.946000 (C-25-13,C-23-44), best=3.123000
> > (C-21-01,C-27-61), avg=5.673201
> > C-25-41: worst=34.792000 (C-21-42,C-21-26), best=3.177000
> > (C-25-43,C-25-39), avg=5.715597
> > C-25-45: worst=34.944000 (C-22-01,C-21-30), best=3.182000
> > (C-26-02,C-25-43), avg=5.734901
> > C-25-37: worst=34.829000 (C-21-38,C-21-22), best=3.183000
> > (C-25-39,C-25-35), avg=5.715226
> > C-25-33: worst=34.900000 (C-21-34,C-21-18), best=3.185000
> > (C-25-35,C-25-31), avg=5.726198
> > C-25-39: worst=34.939000 (C-21-40,C-21-24), best=3.185000
> > (C-25-45,C-25-37), avg=5.760403
> > C-25-43: worst=34.913000 (C-21-44,C-21-28), best=3.185000
> > (C-25-45,C-25-41), avg=5.774223
> > C-25-25: worst=34.834000 (C-21-26,C-21-10), best=3.187000
> > (C-25-27,C-25-23), avg=5.701191
> > C-25-29: worst=34.935000 (C-21-30,C-21-14), best=3.188000
> > (C-25-31,C-25-27), avg=5.732307
> > C-25-17: worst=34.905000 (C-21-18,C-21-02), best=3.193000
> > (C-25-19,C-25-15), avg=5.712527
> > C-25-09: worst=34.839000 (C-21-10,C-27-58), best=3.195000
> > (C-25-11,C-25-07), avg=5.719269
> > C-25-21: worst=34.826000 (C-21-22,C-21-06), best=3.195000
> > (C-25-23,C-25-19), avg=5.709025
> >
> > The bandwidth portion of the test seems unaffected: I expect ~1.5GB/s
> > best, ~300MB/s worst, and ~600MB/s average. Here's a sample from the
> > MVAPICH 1.0.0 test, values in MB/s, with about 120 nodes in the test:
> >
> > C-27-28: worst=311.533788 (C-26-43,C-25-01), best=1528.480740
> > (C-27-27,C-27-29), avg=587.437197
> > C-27-27: worst=304.659190 (C-26-42,C-27-62), best=1528.480740
> > (C-27-26,C-27-28), avg=578.232542
> > C-27-26: worst=305.106860 (C-26-43,C-27-59), best=1528.480740
> > (C-27-25,C-27-27), avg=586.553406
> > C-27-25: worst=305.004799 (C-26-42,C-27-58), best=1528.369347
> > (C-27-24,C-27-26), avg=607.532908
> > C-27-22: worst=305.069134 (C-26-39,C-27-55), best=1528.369347
> > (C-27-21,C-27-23), avg=581.500478
> > C-27-21: worst=387.081961 (C-27-33,C-27-09), best=1528.369347
> > (C-27-20,C-27-22), avg=586.030643
> > C-27-24: worst=312.083157 (C-26-39,C-27-59), best=1528.257970
> > (C-27-23,C-27-25), avg=586.508800
> > C-27-19: worst=389.750871 (C-26-21,C-25-05), best=1528.202288
> > (C-27-18,C-27-20), avg=610.889360
> > C-27-23: worst=305.124616 (C-26-40,C-27-56), best=1528.146610
> > (C-27-22,C-27-24), avg=592.528629
> > C-27-20: worst=375.241912 (C-26-23,C-25-05), best=1528.146610
> > (C-27-19,C-27-21), avg=587.685815
> > C-27-18: worst=381.980984 (C-27-30,C-27-06), best=1528.035265
> > (C-27-17,C-27-19), avg=617.072824
> > C-25-34: worst=367.293139 (C-27-34,C-27-01), best=1527.534416
> > (C-25-33,C-25-35), avg=587.728400
> >
> > Previous tests (280 nodes, in this case), look about the same...
> > here's a sample:
> >
> > C-27-62: worst=324.551124 (C-25-28,C-23-27), best=1528.759294
> > (C-25-05,C-25-05), avg=513.596860
> > C-25-05: worst=286.193170 (C-27-20,C-21-35), best=1528.480740
> > (C-27-62,C-27-62), avg=531.348528
> > C-26-06: worst=309.232357 (C-22-17,C-21-26), best=1527.200699
> > (C-26-02,C-26-10), avg=526.550142
> > C-26-02: worst=308.702059 (C-22-13,C-21-22), best=1527.089492
> > (C-25-43,C-26-06), avg=517.635786
> > C-26-10: worst=303.692998 (C-27-26,C-23-39), best=1527.033895
> > (C-26-06,C-26-14), avg=526.867299
> > C-26-14: worst=302.966896 (C-27-29,C-23-44), best=1526.255959
> > (C-26-10,C-26-18), avg=521.675016
> > C-22-09: worst=353.618467 (C-25-12,C-27-20), best=1526.144890
> > (C-26-16,C-26-16), avg=564.268143
> > C-26-26: worst=311.455134 (C-27-18,C-25-34), best=1526.089361
> > (C-26-22,C-26-30), avg=547.530324
> > C-26-18: worst=304.969316 (C-27-36,C-23-45), best=1526.089361
> > (C-26-14,C-26-22), avg=515.463309
> > C-26-22: worst=302.152809 (C-25-02,C-27-42), best=1526.033837
> > (C-26-18,C-26-26), avg=529.672528
> > C-26-13: worst=299.106027 (C-27-28,C-23-43), best=1525.978316
> > (C-26-12,C-26-14), avg=549.159212
> > C-21-08: worst=292.192329 (C-25-11,C-25-19), best=1525.978316
> > (C-25-15,C-25-15), avg=539.609085
> > C-25-15: worst=270.248064 (C-21-04,C-21-12), best=1525.922800
> > (C-21-08,C-21-08), avg=523.654988
> > C-26-16: worst=293.111198 (C-21-30,C-22-33), best=1525.867288
> > (C-22-09,C-22-09), avg=488.649320
> > C-26-11: worst=300.619544 (C-27-29,C-23-38), best=1525.811779
> > (C-26-10,C-26-12), avg=496.530124
> > C-25-24: worst=317.670885 (C-22-08,C-27-40), best=1525.756275
> > (C-21-17,C-21-17), avg=510.542500
> >
> > The test itself may be to blame. The test is run with one rank per
> > node. The idea is a bisectional test where each node is exclusively
> > sending to one node and exclusively receiving from another, with all
> > nodes sending/receiving simultaneously; note that the send and
> > receive partners in the sendrecv call will usually differ, which lets
> > the test handle an odd number of nodes. The latency test
> > sends/receives zero bytes 1000 times; the bandwidth test sends 4MB 10
> > times. Iterating, every rank eventually sends to and receives from
> > every other rank, but not every send/recv pairing is enumerated
> > (where nodes>2).
> >
> > While you'd expect a fat-tree switch to deliver full bisectional
> > bandwidth, it never does; that's a problem with a static subnet
> > manager (opensm). Given that the average is ~1/3 of the best
> > bandwidth, I interpret that to mean that, on average, a rank is
> > being blocked by two other ranks. The worst case suggests roughly 5
> > or 6 ranks blocking each other.
> >
> > The routine goes through a "for" loop starting at the current rank:
> > send-ranks decrease and recv-ranks increase (both circularly) each
> > iteration until wrapping back around to the current rank. The core
> > of the routine looks like:
> >
> > MPI_Init(&argc, &argv);
> > MPI_Comm_size(MPI_COMM_WORLD, &wsize);
> > MPI_Comm_rank(MPI_COMM_WORLD, &me);
> > for(hi = (me == 0) ? wsize - 1 : me - 1,
> > lo = (me + 1 == wsize) ? 0 : me + 1;
> > hi != me;
> > hi = (hi == 0) ? wsize - 1 : hi - 1,
> > lo = (lo + 1 == wsize) ? 0 : lo + 1) {
> >
> > MPI_Barrier(MPI_COMM_WORLD);
> >
> > start = MPI_Wtime();
> >
> > for ( i = 0; i < iters; i++ ) {
> > MPI_Sendrecv(&comBuf, size, MPI_CHAR, lo, 0, &comBuf, size,
> > MPI_CHAR, hi, 0, MPI_COMM_WORLD, &stat);
> > }
> > diff = MPI_Wtime() - start;
> > sum += diff;
> > n++;
> > if (diff < min) {
> > minnode1 = lo;
> > minnode2 = hi;
> > min = diff;
> > }
> > if (diff > max) {
> > maxnode1 = lo;
> > maxnode2 = hi;
> > max = diff;
> > }
> > }
> >
> > At the end of the test, the best, worst, and average cases are
> > reported for each node/rank, along with the node names associated with
> > that best/worst event. So, if there is an issue with a node, you'd
> > expect that node to show up in multiple reports, as a single reported
> > event only narrows the culprit down to two.
> >
> > Any ideas would be appreciated.
> >
> > Chris
>
>
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss at cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >
>
>