[mvapich-discuss] Very bad latency scaling on bisectional bandwidth test with OFED 1.3/MVAPICH 1.0.0

Dhabaleswar Panda panda at cse.ohio-state.edu
Sat Mar 29 00:34:09 EDT 2008


Chris,

Thanks for your note. We will take a look at it. Do you also see this
trend with mvapich 1.0.0 + OFED 1.2.5.5? Would it be possible to get
numbers for this combination? That would tell us clearly whether this
issue is caused by mvapich 1.0.0 alone or by some interaction between
OFED 1.3 and mvapich 1.0.0.

Thanks,

DK

On Fri, 28 Mar 2008, Chris Worley wrote:

> I just upgraded to OFED 1.3 and MVAPICH 1.0.0 (from OFED 1.2.5.5 and
> MVAPICH 0.9.9).  I'm using ConnectX cards.  IB diagnostics show no
> fabric issues.
>
> I have a test for bisectional bandwidth and latency; the latency test
> is showing very poor worst-case results repeatably as the node count
> goes over ~100.  Other MPI implementations (that will remain nameless
> as they normally don't perform as well as MVAPICH) don't have this
> issue... so I don't think it's strictly an OFED 1.3 issue.
>
> What I'm seeing is worst-case latency (in the msecs!) for all
> nodes.  Here's a sample of the current results (only testing ~120
> nodes), all times in usecs:
>
> C-25-38: worst=1038.127000 (C-27-06,C-27-37), best=2.688000
> (C-25-41,C-25-35), avg=31.816815
> C-25-29: worst=1037.159000 (C-26-42,C-27-28), best=2.694000
> (C-25-32,C-25-26), avg=31.645870
> C-25-26: worst=1038.052000 (C-26-39,C-27-25), best=2.695000
> (C-25-29,C-25-23), avg=31.757562
> C-25-41: worst=1037.349000 (C-27-09,C-27-40), best=2.695000
> (C-25-44,C-25-38), avg=31.776089
> C-25-17: worst=1036.924000 (C-26-30,C-27-16), best=2.697000
> (C-25-20,C-25-14), avg=31.664692
> C-26-05: worst=1037.973000 (C-27-18,C-27-54), best=2.697000
> (C-26-08,C-26-02), avg=31.809110
> C-26-17: worst=1038.095000 (C-27-30,C-25-04), best=2.703000
> (C-26-20,C-26-14), avg=31.774685
> C-26-20: worst=1038.225000 (C-27-33,C-25-07), best=2.704000
> (C-26-23,C-26-17), avg=30.208007
> C-25-14: worst=1037.357000 (C-26-27,C-27-13), best=2.705000
> (C-25-17,C-25-11), avg=31.576705
> C-26-08: worst=1038.058000 (C-27-21,C-27-57), best=2.705000
> (C-26-11,C-26-05), avg=31.639445
> C-27-20: worst=1037.043000 (C-25-21,C-26-07), best=2.706000
> (C-27-23,C-27-17), avg=31.819363
> C-27-23: worst=1037.909000 (C-25-24,C-26-10), best=2.706000
> (C-27-26,C-27-20), avg=31.714664
> C-26-32: worst=1037.963000 (C-27-45,C-25-19), best=2.707000
> (C-26-35,C-26-29), avg=31.694966
> C-26-41: worst=1037.817000 (C-27-59,C-25-28), best=2.707000
> (C-26-44,C-26-38), avg=31.674466
> C-27-08: worst=1037.036000 (C-25-09,C-25-40), best=2.708000
> (C-27-11,C-27-05), avg=31.781712
> C-26-29: worst=1038.093000 (C-27-42,C-25-16), best=2.709000
> (C-26-32,C-26-26), avg=31.812582
> C-27-11: worst=1038.136000 (C-25-12,C-25-43), best=2.710000
> (C-27-14,C-27-08), avg=31.653336
> C-26-44: worst=1038.070000 (C-27-62,C-25-31), best=2.711000
> (C-27-02,C-26-41), avg=31.666521
> C-27-32: worst=1037.563000 (C-25-33,C-26-19), best=2.712000
> (C-27-35,C-27-29), avg=31.778705
> C-25-44: worst=1036.881000 (C-27-12,C-27-43), best=2.713000
> (C-26-02,C-25-41), avg=31.752103
>
> While other MPI implementations running under OFED 1.3 on the same
> node set show more stability:
>
> C-25-30: worst=35.822153 (C-27-39,C-27-19), best=3.398895
> (C-25-10,C-26-34), avg=11.128738
> C-26-11: worst=35.799026 (C-27-22,C-26-39), best=3.398180
> (C-25-14,C-27-05), avg=10.864981
> C-27-36: worst=35.802126 (C-26-02,C-25-11), best=3.396034
> (C-26-34,C-27-38), avg=10.694269
> C-25-16: worst=35.804033 (C-27-51,C-27-26), best=3.391981
> (C-25-20,C-26-27), avg=11.112461
> C-25-10: worst=35.800934 (C-27-37,C-27-53), best=3.388882
> (C-26-23,C-25-30), avg=10.956828
> C-27-08: worst=35.817862 (C-26-30,C-25-25), best=3.388882
> (C-26-06,C-27-54), avg=10.765788
> C-27-11: worst=35.810947 (C-25-31,C-26-15), best=3.386974
> (C-27-56,C-26-23), avg=11.048172
> C-26-10: worst=35.799026 (C-27-20,C-26-40), best=3.386974
> (C-25-01,C-27-04), avg=10.720228
> C-26-31: worst=37.193060 (C-25-33,C-26-43), best=3.386021
> (C-27-27,C-25-20), avg=10.809506
> C-26-23: worst=35.791874 (C-26-35,C-27-17), best=3.386021
> (C-27-11,C-25-10), avg=10.747612
> C-25-20: worst=35.769939 (C-27-28,C-27-61), best=3.385782
> (C-26-31,C-25-16), avg=11.007588
> C-27-37: worst=35.790205 (C-26-03,C-25-10), best=3.385067
> (C-26-35,C-27-39), avg=10.998216
> C-27-54: worst=35.786867 (C-25-29,C-25-19), best=3.384829
> (C-27-08,C-27-35), avg=11.109948
> C-27-43: worst=35.766125 (C-25-14,C-25-44), best=3.382921
> (C-27-34,C-26-39), avg=11.048978
> C-26-13: worst=35.825968 (C-27-18,C-26-04), best=3.382921
> (C-25-28,C-27-10), avg=11.002756
> C-25-44: worst=35.761833 (C-27-43,C-26-20), best=3.382921
> (C-25-18,C-27-01), avg=10.673819
> C-25-19: worst=35.789013 (C-27-54,C-27-06), best=3.382206
> (C-25-25,C-26-05), avg=10.877252
> C-27-56: worst=37.191868 (C-26-43,C-27-59), best=3.381014
> (C-27-24,C-27-11), avg=10.994493
> C-26-24: worst=35.806894 (C-26-42,C-27-24), best=3.381014
> (C-27-03,C-25-12), avg=10.798348
>
> Previously, MVAPICH 0.9.9 (with OFED 1.2.5.5) showed stability (this
> example is from a ~280-node test):
>
> C-25-35: worst=34.869000 (C-21-36,C-21-20), best=2.208000
> (C-21-28,C-21-28), avg=5.739774
> C-21-28: worst=34.914000 (C-25-43,C-25-27), best=2.210000
> (C-25-35,C-25-35), avg=5.692484
> C-27-63: worst=23.946000 (C-25-13,C-23-44), best=3.123000
> (C-21-01,C-27-61), avg=5.673201
> C-25-41: worst=34.792000 (C-21-42,C-21-26), best=3.177000
> (C-25-43,C-25-39), avg=5.715597
> C-25-45: worst=34.944000 (C-22-01,C-21-30), best=3.182000
> (C-26-02,C-25-43), avg=5.734901
> C-25-37: worst=34.829000 (C-21-38,C-21-22), best=3.183000
> (C-25-39,C-25-35), avg=5.715226
> C-25-33: worst=34.900000 (C-21-34,C-21-18), best=3.185000
> (C-25-35,C-25-31), avg=5.726198
> C-25-39: worst=34.939000 (C-21-40,C-21-24), best=3.185000
> (C-25-45,C-25-37), avg=5.760403
> C-25-43: worst=34.913000 (C-21-44,C-21-28), best=3.185000
> (C-25-45,C-25-41), avg=5.774223
> C-25-25: worst=34.834000 (C-21-26,C-21-10), best=3.187000
> (C-25-27,C-25-23), avg=5.701191
> C-25-29: worst=34.935000 (C-21-30,C-21-14), best=3.188000
> (C-25-31,C-25-27), avg=5.732307
> C-25-17: worst=34.905000 (C-21-18,C-21-02), best=3.193000
> (C-25-19,C-25-15), avg=5.712527
> C-25-09: worst=34.839000 (C-21-10,C-27-58), best=3.195000
> (C-25-11,C-25-07), avg=5.719269
> C-25-21: worst=34.826000 (C-21-22,C-21-06), best=3.195000
> (C-25-23,C-25-19), avg=5.709025
>
> The bandwidth portion of the test seems unaffected: I expect ~1.5GB/s
> best, ~300MB/s worst, and ~600MB/s average.  Here's a sample from the
> MVAPICH 1.0.0 test, values in MB/s, with about 120 nodes in the test:
>
> C-27-28: worst=311.533788 (C-26-43,C-25-01), best=1528.480740
> (C-27-27,C-27-29), avg=587.437197
> C-27-27: worst=304.659190 (C-26-42,C-27-62), best=1528.480740
> (C-27-26,C-27-28), avg=578.232542
> C-27-26: worst=305.106860 (C-26-43,C-27-59), best=1528.480740
> (C-27-25,C-27-27), avg=586.553406
> C-27-25: worst=305.004799 (C-26-42,C-27-58), best=1528.369347
> (C-27-24,C-27-26), avg=607.532908
> C-27-22: worst=305.069134 (C-26-39,C-27-55), best=1528.369347
> (C-27-21,C-27-23), avg=581.500478
> C-27-21: worst=387.081961 (C-27-33,C-27-09), best=1528.369347
> (C-27-20,C-27-22), avg=586.030643
> C-27-24: worst=312.083157 (C-26-39,C-27-59), best=1528.257970
> (C-27-23,C-27-25), avg=586.508800
> C-27-19: worst=389.750871 (C-26-21,C-25-05), best=1528.202288
> (C-27-18,C-27-20), avg=610.889360
> C-27-23: worst=305.124616 (C-26-40,C-27-56), best=1528.146610
> (C-27-22,C-27-24), avg=592.528629
> C-27-20: worst=375.241912 (C-26-23,C-25-05), best=1528.146610
> (C-27-19,C-27-21), avg=587.685815
> C-27-18: worst=381.980984 (C-27-30,C-27-06), best=1528.035265
> (C-27-17,C-27-19), avg=617.072824
> C-25-34: worst=367.293139 (C-27-34,C-27-01), best=1527.534416
> (C-25-33,C-25-35), avg=587.728400
>
> Previous tests (280 nodes, in this case) look about the same...
> here's a sample:
>
> C-27-62: worst=324.551124 (C-25-28,C-23-27), best=1528.759294
> (C-25-05,C-25-05), avg=513.596860
> C-25-05: worst=286.193170 (C-27-20,C-21-35), best=1528.480740
> (C-27-62,C-27-62), avg=531.348528
> C-26-06: worst=309.232357 (C-22-17,C-21-26), best=1527.200699
> (C-26-02,C-26-10), avg=526.550142
> C-26-02: worst=308.702059 (C-22-13,C-21-22), best=1527.089492
> (C-25-43,C-26-06), avg=517.635786
> C-26-10: worst=303.692998 (C-27-26,C-23-39), best=1527.033895
> (C-26-06,C-26-14), avg=526.867299
> C-26-14: worst=302.966896 (C-27-29,C-23-44), best=1526.255959
> (C-26-10,C-26-18), avg=521.675016
> C-22-09: worst=353.618467 (C-25-12,C-27-20), best=1526.144890
> (C-26-16,C-26-16), avg=564.268143
> C-26-26: worst=311.455134 (C-27-18,C-25-34), best=1526.089361
> (C-26-22,C-26-30), avg=547.530324
> C-26-18: worst=304.969316 (C-27-36,C-23-45), best=1526.089361
> (C-26-14,C-26-22), avg=515.463309
> C-26-22: worst=302.152809 (C-25-02,C-27-42), best=1526.033837
> (C-26-18,C-26-26), avg=529.672528
> C-26-13: worst=299.106027 (C-27-28,C-23-43), best=1525.978316
> (C-26-12,C-26-14), avg=549.159212
> C-21-08: worst=292.192329 (C-25-11,C-25-19), best=1525.978316
> (C-25-15,C-25-15), avg=539.609085
> C-25-15: worst=270.248064 (C-21-04,C-21-12), best=1525.922800
> (C-21-08,C-21-08), avg=523.654988
> C-26-16: worst=293.111198 (C-21-30,C-22-33), best=1525.867288
> (C-22-09,C-22-09), avg=488.649320
> C-26-11: worst=300.619544 (C-27-29,C-23-38), best=1525.811779
> (C-26-10,C-26-12), avg=496.530124
> C-25-24: worst=317.670885 (C-22-08,C-27-40), best=1525.756275
> (C-21-17,C-21-17), avg=510.542500
>
> The test itself may be to blame.  It is run with one rank per node.
> The idea is a bisectional test where each node is exclusively sending
> to one other node and exclusively receiving from another, with all
> nodes sending/receiving simultaneously; note that the send and receive
> partners in the sendrecv call will usually differ, which lets the test
> measure bisectional bandwidth even with an odd number of nodes.  The
> latency test sends/receives zero bytes 1000 times; the bandwidth test
> sends 4MB 10 times.  Iteratively, every rank eventually sends to and
> receives from every other rank, but not all send/recv combinations are
> enumerated (where nodes>2).
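>
> To make the pairing concrete, here's a small standalone sketch (no MPI
> needed) that just prints the send/recv partner sequence each rank
> walks through; it uses the same hi/lo update as the real loop quoted
> further down, and the world size of 5 is only an example:
>
>   #include <stdio.h>
>
>   int main(void) {
>     int wsize = 5;                               /* example world size */
>     for (int me = 0; me < wsize; me++) {
>       int hi = (me == 0) ? wsize - 1 : me - 1;   /* rank we receive from */
>       int lo = (me + 1 == wsize) ? 0 : me + 1;   /* rank we send to      */
>       printf("rank %d:", me);
>       for (; hi != me;
>              hi = (hi == 0) ? wsize - 1 : hi - 1,
>              lo = (lo + 1 == wsize) ? 0 : lo + 1)
>         printf("  send->%d recv<-%d", lo, hi);
>       printf("\n");
>     }
>     return 0;
>   }
>
> For wsize=5, rank 0 prints send->1 recv<-4, send->2 recv<-3,
> send->3 recv<-2, send->4 recv<-1, and similarly for the other ranks.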
>
> While you'd expect a fat-tree switch to deliver full bisectional
> bandwidth, it never does; that's a consequence of the static routing
> set up by the subnet manager (opensm).  Given that the average is ~1/3
> of the best bandwidth, I interpret that to mean that on average a rank
> is being blocked by two other ranks.  The worst case shows roughly 5
> or 6 ranks blocking each other.
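>
> As a rough sanity check on that arithmetic (the MB/s figures are taken
> from the sample above, and equal sharing of a congested link is the
> assumption behind the interpretation):
>
>   #include <stdio.h>
>
>   int main(void) {
>     double best = 1528.5, avg = 587.4, worst = 305.0;  /* MB/s, sample above */
>     printf("avg  : ~%.1f flows per link\n", best / avg);    /* ~2.6 -> blocked by ~2 others    */
>     printf("worst: ~%.1f flows per link\n", best / worst);  /* ~5.0 -> ~5 ranks sharing a link */
>     return 0;
>   }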
>
> The routine goes through a "for" loop starting at the current rank: on
> each iteration the rank we send to (lo) increases and the rank we
> receive from (hi) decreases, both circularly, until we get back to the
> current rank.  The core of the routine looks like:
>
>   MPI_Init(&argc, &argv);
>   MPI_Comm_size(MPI_COMM_WORLD, &wsize);
>   MPI_Comm_rank(MPI_COMM_WORLD, &me);
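>   /* hi walks downward (circularly) and is the rank we receive from;
>      lo walks upward (circularly) and is the rank we send to */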
>   for(hi = (me == 0) ? wsize - 1 : me - 1,
>       lo = (me + 1 == wsize) ? 0 : me + 1;
>         hi != me;
>           hi = (hi == 0) ? wsize - 1 : hi - 1,
>           lo = (lo + 1 == wsize) ? 0 : lo + 1) {
>
>         MPI_Barrier(MPI_COMM_WORLD);
>
>         start = MPI_Wtime();
>
>         for ( i = 0; i < iters; i++ ) {
>           MPI_Sendrecv(&comBuf, size, MPI_CHAR, lo, 0,
>                        &comBuf, size, MPI_CHAR, hi, 0,
>                        MPI_COMM_WORLD, &stat);
>         }
>         diff = MPI_Wtime() - start;
>         sum += diff;
>         n++;
>         if (diff < min) {
>           minnode1 = lo;
>           minnode2 = hi;
>           min = diff;
>         }
>         if (diff > max) {
>           maxnode1 = lo;
>           maxnode2 = hi;
>           max = diff;
>         }
>   }
>
> At the end of the test, the best, worst, and average cases are
> reported for each node/rank, along with the node names associated with
> the best/worst events.  So, if there is an issue with a particular
> node, you'd expect that node to show up in multiple reports, since a
> single reported event only narrows the culprit down to two nodes.
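>
> The per-node report lines above come out of a final step roughly along
> these lines; this is a simplified sketch rather than the exact code,
> and the name exchange via MPI_Allgather and the usec scaling are just
> one plausible way to do it:
>
>   /* needs <stdlib.h>; reuses wsize, iters, min, max, sum, n and the
>      minnode/maxnode variables from the loop above */
>   char myname[MPI_MAX_PROCESSOR_NAME];
>   char (*names)[MPI_MAX_PROCESSOR_NAME] =
>       malloc((size_t)wsize * MPI_MAX_PROCESSOR_NAME);
>   int len;
>
>   MPI_Get_processor_name(myname, &len);
>   /* every rank learns every other rank's node name */
>   MPI_Allgather(myname, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
>                 names, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD);
>
>   /* latency run: each timing covered `iters' zero-byte exchanges, so
>      scale total seconds to usecs per exchange */
>   printf("%s: worst=%f (%s,%s), best=%f (%s,%s), avg=%f\n",
>          myname,
>          max * 1e6 / iters, names[maxnode1], names[maxnode2],
>          min * 1e6 / iters, names[minnode1], names[minnode2],
>          (sum / n) * 1e6 / iters);
>   free(names);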
>
> Any ideas would be appreciated.
>
> Chris
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>


