Fwd: [mvapich-discuss] Very bad latency scaling on bisectional bandwidth test with OFED 1.3/MVAPICH 1.0.0

Chris Worley worleys at gmail.com
Sat Mar 29 17:36:03 EDT 2008


On Sat, Mar 29, 2008 at 9:37 AM, Matthew Koop <koop at cse.ohio-state.edu> wrote:
 > Chris,
 >
 >  We did some optimizations to SRQ for the 1.0 version of MVAPICH. In some
 >  short-running benchmarks it may show worse performance. Can you try
 >  running with the following ENV:
 >
 >  VIADEV_SRQ_SIZE=1024

 Excellent.  That fixes it.  The problem seems to kick in around
 125-132 nodes (running 1 process per node)... it's unstable in that
 range... you may or may not see the issue in multiple runs at node
 counts in that range.  Higher node counts repeatably exhibit the
 problem, and setting the above fixes it.

 Can you explain a bit more about how this only affects "short-running" benchmarks?

 Also, what is the cutoff for "short-running" (what could I do to the
 benchmark to keep this from happening)?

 Extra credit: why doesn't my benchmark show full bisectional bandwidth
 given a fat tree switch (this is not the fault of MVAPICH... but I'd
 like to know what I'm doing wrong)?

 Thanks!

 Chris


>
 >  e.g.
 >  mpirun_rsh -np X ... VIADEV_SRQ_SIZE=1024 ./exec
 >
 >  And see if performance goes back to 0.9.9 levels?
 >
 >  Thanks,
 >  Matt
 >
 >
 >  On Fri, 28 Mar 2008, Chris Worley wrote:
 >
 >
 >
 > > I just upgraded to OFED 1.3 and MVAPICH 1.0.0 (from OFED 1.2.5.5 and
 >  > MVAPICH 0.9.9).  I'm using ConnectX cards.  IB diagnostics show no
 >  > fabric issues.
 >  >
 >  > I have a test for bisectional bandwidth and latency; the latency test
 >  > is showing very poor worst-case results repeatably as the node count
 >  > goes over ~100.  Other MPI implementations (that will remain nameless
 >  > as they normally don't perform as well as MVAPICH) don't have this
 >  > issue... so I don't think it's strictly an OFED 1.3 issue.
 >  >
 >  > What I'm seeing shows worst-case latency (in the msecs!) for all
 >  > nodes.  Here's a sample of the current results (only testing ~120
 >  > nodes), all times in usecs:
 >  >
 >  > C-25-38: worst=1038.127000 (C-27-06,C-27-37), best=2.688000
 >  > (C-25-41,C-25-35), avg=31.816815
 >  > C-25-29: worst=1037.159000 (C-26-42,C-27-28), best=2.694000
 >  > (C-25-32,C-25-26), avg=31.645870
 >  > C-25-26: worst=1038.052000 (C-26-39,C-27-25), best=2.695000
 >  > (C-25-29,C-25-23), avg=31.757562
 >  > C-25-41: worst=1037.349000 (C-27-09,C-27-40), best=2.695000
 >  > (C-25-44,C-25-38), avg=31.776089
 >  > C-25-17: worst=1036.924000 (C-26-30,C-27-16), best=2.697000
 >  > (C-25-20,C-25-14), avg=31.664692
 >  > C-26-05: worst=1037.973000 (C-27-18,C-27-54), best=2.697000
 >  > (C-26-08,C-26-02), avg=31.809110
 >  > C-26-17: worst=1038.095000 (C-27-30,C-25-04), best=2.703000
 >  > (C-26-20,C-26-14), avg=31.774685
 >  > C-26-20: worst=1038.225000 (C-27-33,C-25-07), best=2.704000
 >  > (C-26-23,C-26-17), avg=30.208007
 >  > C-25-14: worst=1037.357000 (C-26-27,C-27-13), best=2.705000
 >  > (C-25-17,C-25-11), avg=31.576705
 >  > C-26-08: worst=1038.058000 (C-27-21,C-27-57), best=2.705000
 >  > (C-26-11,C-26-05), avg=31.639445
 >  > C-27-20: worst=1037.043000 (C-25-21,C-26-07), best=2.706000
 >  > (C-27-23,C-27-17), avg=31.819363
 >  > C-27-23: worst=1037.909000 (C-25-24,C-26-10), best=2.706000
 >  > (C-27-26,C-27-20), avg=31.714664
 >  > C-26-32: worst=1037.963000 (C-27-45,C-25-19), best=2.707000
 >  > (C-26-35,C-26-29), avg=31.694966
 >  > C-26-41: worst=1037.817000 (C-27-59,C-25-28), best=2.707000
 >  > (C-26-44,C-26-38), avg=31.674466
 >  > C-27-08: worst=1037.036000 (C-25-09,C-25-40), best=2.708000
 >  > (C-27-11,C-27-05), avg=31.781712
 >  > C-26-29: worst=1038.093000 (C-27-42,C-25-16), best=2.709000
 >  > (C-26-32,C-26-26), avg=31.812582
 >  > C-27-11: worst=1038.136000 (C-25-12,C-25-43), best=2.710000
 >  > (C-27-14,C-27-08), avg=31.653336
 >  > C-26-44: worst=1038.070000 (C-27-62,C-25-31), best=2.711000
 >  > (C-27-02,C-26-41), avg=31.666521
 >  > C-27-32: worst=1037.563000 (C-25-33,C-26-19), best=2.712000
 >  > (C-27-35,C-27-29), avg=31.778705
 >  > C-25-44: worst=1036.881000 (C-27-12,C-27-43), best=2.713000
 >  > (C-26-02,C-25-41), avg=31.752103
 >  >
 >  > While other MPI implementations running under OFED 1.3 on the same
 >  > node set show more stability:
 >  >
 >  > C-25-30: worst=35.822153 (C-27-39,C-27-19), best=3.398895
 >  > (C-25-10,C-26-34), avg=11.128738
 >  > C-26-11: worst=35.799026 (C-27-22,C-26-39), best=3.398180
 >  > (C-25-14,C-27-05), avg=10.864981
 >  > C-27-36: worst=35.802126 (C-26-02,C-25-11), best=3.396034
 >  > (C-26-34,C-27-38), avg=10.694269
 >  > C-25-16: worst=35.804033 (C-27-51,C-27-26), best=3.391981
 >  > (C-25-20,C-26-27), avg=11.112461
 >  > C-25-10: worst=35.800934 (C-27-37,C-27-53), best=3.388882
 >  > (C-26-23,C-25-30), avg=10.956828
 >  > C-27-08: worst=35.817862 (C-26-30,C-25-25), best=3.388882
 >  > (C-26-06,C-27-54), avg=10.765788
 >  > C-27-11: worst=35.810947 (C-25-31,C-26-15), best=3.386974
 >  > (C-27-56,C-26-23), avg=11.048172
 >  > C-26-10: worst=35.799026 (C-27-20,C-26-40), best=3.386974
 >  > (C-25-01,C-27-04), avg=10.720228
 >  > C-26-31: worst=37.193060 (C-25-33,C-26-43), best=3.386021
 >  > (C-27-27,C-25-20), avg=10.809506
 >  > C-26-23: worst=35.791874 (C-26-35,C-27-17), best=3.386021
 >  > (C-27-11,C-25-10), avg=10.747612
 >  > C-25-20: worst=35.769939 (C-27-28,C-27-61), best=3.385782
 >  > (C-26-31,C-25-16), avg=11.007588
 >  > C-27-37: worst=35.790205 (C-26-03,C-25-10), best=3.385067
 >  > (C-26-35,C-27-39), avg=10.998216
 >  > C-27-54: worst=35.786867 (C-25-29,C-25-19), best=3.384829
 >  > (C-27-08,C-27-35), avg=11.109948
 >  > C-27-43: worst=35.766125 (C-25-14,C-25-44), best=3.382921
 >  > (C-27-34,C-26-39), avg=11.048978
 >  > C-26-13: worst=35.825968 (C-27-18,C-26-04), best=3.382921
 >  > (C-25-28,C-27-10), avg=11.002756
 >  > C-25-44: worst=35.761833 (C-27-43,C-26-20), best=3.382921
 >  > (C-25-18,C-27-01), avg=10.673819
 >  > C-25-19: worst=35.789013 (C-27-54,C-27-06), best=3.382206
 >  > (C-25-25,C-26-05), avg=10.877252
 >  > C-27-56: worst=37.191868 (C-26-43,C-27-59), best=3.381014
 >  > (C-27-24,C-27-11), avg=10.994493
 >  > C-26-24: worst=35.806894 (C-26-42,C-27-24), best=3.381014
 >  > (C-27-03,C-25-12), avg=10.798348
 >  >
 >  > Previously, MVAPICH 0.9.9 (with OFED 1.2.5.5) showed stability (this
 >  > example was from a ~280-node test):
 >  >
 >  > C-25-35: worst=34.869000 (C-21-36,C-21-20), best=2.208000
 >  > (C-21-28,C-21-28), avg=5.739774
 >  > C-21-28: worst=34.914000 (C-25-43,C-25-27), best=2.210000
 >  > (C-25-35,C-25-35), avg=5.692484
 >  > C-27-63: worst=23.946000 (C-25-13,C-23-44), best=3.123000
 >  > (C-21-01,C-27-61), avg=5.673201
 >  > C-25-41: worst=34.792000 (C-21-42,C-21-26), best=3.177000
 >  > (C-25-43,C-25-39), avg=5.715597
 >  > C-25-45: worst=34.944000 (C-22-01,C-21-30), best=3.182000
 >  > (C-26-02,C-25-43), avg=5.734901
 >  > C-25-37: worst=34.829000 (C-21-38,C-21-22), best=3.183000
 >  > (C-25-39,C-25-35), avg=5.715226
 >  > C-25-33: worst=34.900000 (C-21-34,C-21-18), best=3.185000
 >  > (C-25-35,C-25-31), avg=5.726198
 >  > C-25-39: worst=34.939000 (C-21-40,C-21-24), best=3.185000
 >  > (C-25-45,C-25-37), avg=5.760403
 >  > C-25-43: worst=34.913000 (C-21-44,C-21-28), best=3.185000
 >  > (C-25-45,C-25-41), avg=5.774223
 >  > C-25-25: worst=34.834000 (C-21-26,C-21-10), best=3.187000
 >  > (C-25-27,C-25-23), avg=5.701191
 >  > C-25-29: worst=34.935000 (C-21-30,C-21-14), best=3.188000
 >  > (C-25-31,C-25-27), avg=5.732307
 >  > C-25-17: worst=34.905000 (C-21-18,C-21-02), best=3.193000
 >  > (C-25-19,C-25-15), avg=5.712527
 >  > C-25-09: worst=34.839000 (C-21-10,C-27-58), best=3.195000
 >  > (C-25-11,C-25-07), avg=5.719269
 >  > C-25-21: worst=34.826000 (C-21-22,C-21-06), best=3.195000
 >  > (C-25-23,C-25-19), avg=5.709025
 >  >
 >  > The bandwidth portion of the test seems unaffected: I expect ~1.5GB/s
 >  > best, ~300MB/s worst, and ~600MB/s average.  Here's a sample from the
 >  > MVAPICH 1.0.0 test, values in MB/s, with about 120 nodes in the test:
 >  >
 >  > C-27-28: worst=311.533788 (C-26-43,C-25-01), best=1528.480740
 >  > (C-27-27,C-27-29), avg=587.437197
 >  > C-27-27: worst=304.659190 (C-26-42,C-27-62), best=1528.480740
 >  > (C-27-26,C-27-28), avg=578.232542
 >  > C-27-26: worst=305.106860 (C-26-43,C-27-59), best=1528.480740
 >  > (C-27-25,C-27-27), avg=586.553406
 >  > C-27-25: worst=305.004799 (C-26-42,C-27-58), best=1528.369347
 >  > (C-27-24,C-27-26), avg=607.532908
 >  > C-27-22: worst=305.069134 (C-26-39,C-27-55), best=1528.369347
 >  > (C-27-21,C-27-23), avg=581.500478
 >  > C-27-21: worst=387.081961 (C-27-33,C-27-09), best=1528.369347
 >  > (C-27-20,C-27-22), avg=586.030643
 >  > C-27-24: worst=312.083157 (C-26-39,C-27-59), best=1528.257970
 >  > (C-27-23,C-27-25), avg=586.508800
 >  > C-27-19: worst=389.750871 (C-26-21,C-25-05), best=1528.202288
 >  > (C-27-18,C-27-20), avg=610.889360
 >  > C-27-23: worst=305.124616 (C-26-40,C-27-56), best=1528.146610
 >  > (C-27-22,C-27-24), avg=592.528629
 >  > C-27-20: worst=375.241912 (C-26-23,C-25-05), best=1528.146610
 >  > (C-27-19,C-27-21), avg=587.685815
 >  > C-27-18: worst=381.980984 (C-27-30,C-27-06), best=1528.035265
 >  > (C-27-17,C-27-19), avg=617.072824
 >  > C-25-34: worst=367.293139 (C-27-34,C-27-01), best=1527.534416
 >  > (C-25-33,C-25-35), avg=587.728400
 >  >
 >  > Previous tests (280 nodes, in this case) look about the same...
 >  > here's a sample:
 >  >
 >  > C-27-62: worst=324.551124 (C-25-28,C-23-27), best=1528.759294
 >  > (C-25-05,C-25-05), avg=513.596860
 >  > C-25-05: worst=286.193170 (C-27-20,C-21-35), best=1528.480740
 >  > (C-27-62,C-27-62), avg=531.348528
 >  > C-26-06: worst=309.232357 (C-22-17,C-21-26), best=1527.200699
 >  > (C-26-02,C-26-10), avg=526.550142
 >  > C-26-02: worst=308.702059 (C-22-13,C-21-22), best=1527.089492
 >  > (C-25-43,C-26-06), avg=517.635786
 >  > C-26-10: worst=303.692998 (C-27-26,C-23-39), best=1527.033895
 >  > (C-26-06,C-26-14), avg=526.867299
 >  > C-26-14: worst=302.966896 (C-27-29,C-23-44), best=1526.255959
 >  > (C-26-10,C-26-18), avg=521.675016
 >  > C-22-09: worst=353.618467 (C-25-12,C-27-20), best=1526.144890
 >  > (C-26-16,C-26-16), avg=564.268143
 >  > C-26-26: worst=311.455134 (C-27-18,C-25-34), best=1526.089361
 >  > (C-26-22,C-26-30), avg=547.530324
 >  > C-26-18: worst=304.969316 (C-27-36,C-23-45), best=1526.089361
 >  > (C-26-14,C-26-22), avg=515.463309
 >  > C-26-22: worst=302.152809 (C-25-02,C-27-42), best=1526.033837
 >  > (C-26-18,C-26-26), avg=529.672528
 >  > C-26-13: worst=299.106027 (C-27-28,C-23-43), best=1525.978316
 >  > (C-26-12,C-26-14), avg=549.159212
 >  > C-21-08: worst=292.192329 (C-25-11,C-25-19), best=1525.978316
 >  > (C-25-15,C-25-15), avg=539.609085
 >  > C-25-15: worst=270.248064 (C-21-04,C-21-12), best=1525.922800
 >  > (C-21-08,C-21-08), avg=523.654988
 >  > C-26-16: worst=293.111198 (C-21-30,C-22-33), best=1525.867288
 >  > (C-22-09,C-22-09), avg=488.649320
 >  > C-26-11: worst=300.619544 (C-27-29,C-23-38), best=1525.811779
 >  > (C-26-10,C-26-12), avg=496.530124
 >  > C-25-24: worst=317.670885 (C-22-08,C-27-40), best=1525.756275
 >  > (C-21-17,C-21-17), avg=510.542500
 >  >
 >  > The test itself may be to blame.  The test is run with one rank per
 >  > node.  The idea is a bisectional test in which each node is
 >  > exclusively sending to one node and exclusively receiving from
 >  > another, with all nodes sending/receiving simultaneously; note
 >  > that the send and receive partners will usually differ in the
 >  > sendrecv call, which lets the test measure bisectional bandwidth
 >  > for an odd number of nodes.  The latency test sends/receives zero
 >  > bytes 1000 times; the bandwidth test sends 4MB 10 times.  Over the
 >  > iterations, every rank eventually sends to and receives from every
 >  > other rank, but not all send/recv combinations are enumerated
 >  > (where nodes > 2).
 >  >
 >  > While you'd expect a fat-tree switch to deliver full bisectional
 >  > bandwidth, it never does; that's a problem with the static routing
 >  > done by the subnet manager (opensm).  Given that the average is
 >  > ~1/3 of the best bandwidth, I interpret that to mean that on
 >  > average a rank is being blocked by two other ranks.  The worst
 >  > case shows roughly 5 or 6 ranks blocking each other.
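 >  >
 >  > To make that arithmetic explicit (a rough, illustrative sketch only;
 >  > the MB/s figures are simply taken from the C-27-28 sample line above):
 >  >
 >  >   #include <stdio.h>
 >  >
 >  >   int main(void)
 >  >   {
 >  >     /* numbers from the C-27-28 report line above, in MB/s */
 >  >     double best = 1528.48, avg = 587.44, worst = 311.53;
 >  >
 >  >     /* flows sharing the bottleneck link ~= best rate / observed rate */
 >  >     printf("avg case:   ~%.1f flows sharing a link\n", best / avg);
 >  >     printf("worst case: ~%.1f flows sharing a link\n", best / worst);
 >  >     /* ~2.6 flows on average (this rank plus ~2 others) and ~4.9 flows
 >  >        in the worst case, consistent with the interpretation above */
 >  >     return 0;
 >  >   }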
 >  >
 >  > The routine goes through a "for" loop starting at the current rank:
 >  > the send-rank (lo) increases and the recv-rank (hi) decreases (both
 >  > circularly) each iteration until you get back to the current rank.
 >  > The core of the routine looks like:
 >  >
 >  >   MPI_Init(&argc, &argv);
 >  >   MPI_Comm_size(MPI_COMM_WORLD, &wsize);
 >  >   MPI_Comm_rank(MPI_COMM_WORLD, &me);
 >  >   for(hi = (me == 0) ? wsize - 1 : me - 1,
 >  >       lo = (me + 1 == wsize) ? 0 : me + 1;
 >  >         hi != me;
 >  >           hi = (hi == 0) ? wsize - 1 : hi - 1,
 >  >           lo = (lo + 1 == wsize) ? 0 : lo + 1) {
 >  >
 >  >         MPI_Barrier(MPI_COMM_WORLD);
 >  >
 >  >         start = MPI_Wtime();
 >  >
 >  >         for ( i = 0; i < iters; i++ ) {
 >  >           MPI_Sendrecv(&comBuf, size, MPI_CHAR, lo, 0,
 >  >                        &comBuf, size, MPI_CHAR, hi, 0,
 >  >                        MPI_COMM_WORLD, &stat);
 >  >         }
 >  >         diff = MPI_Wtime() - start;
 >  >         sum += diff;
 >  >         n++;
 >  >         if (diff < min) {
 >  >           minnode1 = lo;
 >  >           minnode2 = hi;
 >  >           min = diff;
 >  >         }
 >  >         if (diff > max) {
 >  >           maxnode1 = lo;
 >  >           maxnode2 = hi;
 >  >           max = diff;
 >  >         }
 >  >   }
 >  >
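 >  > For reference, here is a minimal, self-contained sketch of that loop
 >  > (buffer handling, the iteration count, and the report format below are
 >  > placeholders, not the exact values the real test uses):
 >  >
 >  >   /* pairtest.c: build with "mpicc -O2 pairtest.c -o pairtest",
 >  >      run with at least 2 ranks, one per node, e.g.
 >  >      "mpirun_rsh -np <nodes> ... ./pairtest" */
 >  >   #include <mpi.h>
 >  >   #include <stdio.h>
 >  >   #include <stdlib.h>
 >  >   #include <float.h>
 >  >
 >  >   int main(int argc, char **argv)
 >  >   {
 >  >     int wsize, me, hi, lo, i, n = 0;
 >  >     int iters = 10;                  /* assumed iteration count */
 >  >     int size = 4 * 1024 * 1024;      /* assumed message size (4MB) */
 >  >     double start, diff, sum = 0.0, min = DBL_MAX, max = 0.0;
 >  >     int minnode1 = -1, minnode2 = -1, maxnode1 = -1, maxnode2 = -1;
 >  >     char *sendbuf, *recvbuf;
 >  >     MPI_Status stat;
 >  >
 >  >     MPI_Init(&argc, &argv);
 >  >     MPI_Comm_size(MPI_COMM_WORLD, &wsize);
 >  >     MPI_Comm_rank(MPI_COMM_WORLD, &me);
 >  >
 >  >     /* separate buffers: MPI_Sendrecv requires disjoint send/recv bufs */
 >  >     sendbuf = malloc(size);
 >  >     recvbuf = malloc(size);
 >  >
 >  >     /* send partner (lo) walks up, recv partner (hi) walks down, both
 >  >        circularly, so this rank pairs with every other rank once */
 >  >     for (hi = (me == 0) ? wsize - 1 : me - 1,
 >  >          lo = (me + 1 == wsize) ? 0 : me + 1;
 >  >            hi != me;
 >  >              hi = (hi == 0) ? wsize - 1 : hi - 1,
 >  >              lo = (lo + 1 == wsize) ? 0 : lo + 1) {
 >  >
 >  >       MPI_Barrier(MPI_COMM_WORLD);
 >  >       start = MPI_Wtime();
 >  >       for (i = 0; i < iters; i++)
 >  >         MPI_Sendrecv(sendbuf, size, MPI_CHAR, lo, 0,
 >  >                      recvbuf, size, MPI_CHAR, hi, 0,
 >  >                      MPI_COMM_WORLD, &stat);
 >  >       diff = MPI_Wtime() - start;
 >  >
 >  >       sum += diff; n++;
 >  >       if (diff < min) { min = diff; minnode1 = lo; minnode2 = hi; }
 >  >       if (diff > max) { max = diff; maxnode1 = lo; maxnode2 = hi; }
 >  >     }
 >  >
 >  >     /* the real test maps ranks to node names; ranks are printed here */
 >  >     printf("rank %d: worst=%.3f MB/s (%d,%d), best=%.3f MB/s (%d,%d), "
 >  >            "avg=%.3f MB/s\n", me,
 >  >            (double)size * iters / max / 1e6, maxnode1, maxnode2,
 >  >            (double)size * iters / min / 1e6, minnode1, minnode2,
 >  >            (double)size * iters * n / sum / 1e6);
 >  >
 >  >     free(sendbuf); free(recvbuf);
 >  >     MPI_Finalize();
 >  >     return 0;
 >  >   }
 >  >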
 >  > At the end of the test, the best, worst, and average cases are
 >  > reported for each node/rank, along with the node names associated with
 >  > that best/worst event.  So, if there is an issue with a node, you'd
 >  > expect that node to show up in multiple reports, as a single reported
 >  > event only narrows the culprit down to two.
 >  >
 >  > Any ideas would be appreciated.
 >  >
 >  > Chris
 >
 >

