[mvapich-discuss] Very bad latency scaling on bisectional bandwidth test with OFED 1.3/MVAPICH 1.0.0

Matthew Koop koop at cse.ohio-state.edu
Sat Mar 29 11:37:09 EDT 2008


Chris,

We made some optimizations to the SRQ code for the 1.0 version of
MVAPICH. In some short-running benchmarks this may show worse
performance. Can you try running with the following environment
variable:

VIADEV_SRQ_SIZE=1024

e.g.
mpirun_rsh -np X ... VIADEV_SRQ_SIZE=1024 ./exec

And see if performance goes back to 0.9.9 levels?

Thanks,
Matt

On Fri, 28 Mar 2008, Chris Worley wrote:

> I just upgraded to OFED 1.3 and MVAPICH 1.0.0 (from OFED 1.2.5.5 and
> MVAPICH 0.9.9).  I'm using ConnectX cards.  IB diagnostics show no
> fabric issues.
>
> I have a test for bisectional bandwidth and latency; the latency test
> is showing very poor worst-case results repeatably as the node count
> goes over ~100.  Other MPI implementations (that will remain nameless
> as they normally don't perform as well as MVAPICH) don't have this
> issue... so I don't think it's strictly an OFED 1.3 issue.
>
> What I'm seeing shows worst-case latency (in the msecs!) for all
> nodes.  Here's a sample of the current results (only testing ~120
> nodes), all times in usecs:
>
> C-25-38: worst=1038.127000 (C-27-06,C-27-37), best=2.688000
> (C-25-41,C-25-35), avg=31.816815
> C-25-29: worst=1037.159000 (C-26-42,C-27-28), best=2.694000
> (C-25-32,C-25-26), avg=31.645870
> C-25-26: worst=1038.052000 (C-26-39,C-27-25), best=2.695000
> (C-25-29,C-25-23), avg=31.757562
> C-25-41: worst=1037.349000 (C-27-09,C-27-40), best=2.695000
> (C-25-44,C-25-38), avg=31.776089
> C-25-17: worst=1036.924000 (C-26-30,C-27-16), best=2.697000
> (C-25-20,C-25-14), avg=31.664692
> C-26-05: worst=1037.973000 (C-27-18,C-27-54), best=2.697000
> (C-26-08,C-26-02), avg=31.809110
> C-26-17: worst=1038.095000 (C-27-30,C-25-04), best=2.703000
> (C-26-20,C-26-14), avg=31.774685
> C-26-20: worst=1038.225000 (C-27-33,C-25-07), best=2.704000
> (C-26-23,C-26-17), avg=30.208007
> C-25-14: worst=1037.357000 (C-26-27,C-27-13), best=2.705000
> (C-25-17,C-25-11), avg=31.576705
> C-26-08: worst=1038.058000 (C-27-21,C-27-57), best=2.705000
> (C-26-11,C-26-05), avg=31.639445
> C-27-20: worst=1037.043000 (C-25-21,C-26-07), best=2.706000
> (C-27-23,C-27-17), avg=31.819363
> C-27-23: worst=1037.909000 (C-25-24,C-26-10), best=2.706000
> (C-27-26,C-27-20), avg=31.714664
> C-26-32: worst=1037.963000 (C-27-45,C-25-19), best=2.707000
> (C-26-35,C-26-29), avg=31.694966
> C-26-41: worst=1037.817000 (C-27-59,C-25-28), best=2.707000
> (C-26-44,C-26-38), avg=31.674466
> C-27-08: worst=1037.036000 (C-25-09,C-25-40), best=2.708000
> (C-27-11,C-27-05), avg=31.781712
> C-26-29: worst=1038.093000 (C-27-42,C-25-16), best=2.709000
> (C-26-32,C-26-26), avg=31.812582
> C-27-11: worst=1038.136000 (C-25-12,C-25-43), best=2.710000
> (C-27-14,C-27-08), avg=31.653336
> C-26-44: worst=1038.070000 (C-27-62,C-25-31), best=2.711000
> (C-27-02,C-26-41), avg=31.666521
> C-27-32: worst=1037.563000 (C-25-33,C-26-19), best=2.712000
> (C-27-35,C-27-29), avg=31.778705
> C-25-44: worst=1036.881000 (C-27-12,C-27-43), best=2.713000
> (C-26-02,C-25-41), avg=31.752103
>
> While other MPI implementations running under OFED 1.3 on the same
> node set show more stability:
>
> C-25-30: worst=35.822153 (C-27-39,C-27-19), best=3.398895
> (C-25-10,C-26-34), avg=11.128738
> C-26-11: worst=35.799026 (C-27-22,C-26-39), best=3.398180
> (C-25-14,C-27-05), avg=10.864981
> C-27-36: worst=35.802126 (C-26-02,C-25-11), best=3.396034
> (C-26-34,C-27-38), avg=10.694269
> C-25-16: worst=35.804033 (C-27-51,C-27-26), best=3.391981
> (C-25-20,C-26-27), avg=11.112461
> C-25-10: worst=35.800934 (C-27-37,C-27-53), best=3.388882
> (C-26-23,C-25-30), avg=10.956828
> C-27-08: worst=35.817862 (C-26-30,C-25-25), best=3.388882
> (C-26-06,C-27-54), avg=10.765788
> C-27-11: worst=35.810947 (C-25-31,C-26-15), best=3.386974
> (C-27-56,C-26-23), avg=11.048172
> C-26-10: worst=35.799026 (C-27-20,C-26-40), best=3.386974
> (C-25-01,C-27-04), avg=10.720228
> C-26-31: worst=37.193060 (C-25-33,C-26-43), best=3.386021
> (C-27-27,C-25-20), avg=10.809506
> C-26-23: worst=35.791874 (C-26-35,C-27-17), best=3.386021
> (C-27-11,C-25-10), avg=10.747612
> C-25-20: worst=35.769939 (C-27-28,C-27-61), best=3.385782
> (C-26-31,C-25-16), avg=11.007588
> C-27-37: worst=35.790205 (C-26-03,C-25-10), best=3.385067
> (C-26-35,C-27-39), avg=10.998216
> C-27-54: worst=35.786867 (C-25-29,C-25-19), best=3.384829
> (C-27-08,C-27-35), avg=11.109948
> C-27-43: worst=35.766125 (C-25-14,C-25-44), best=3.382921
> (C-27-34,C-26-39), avg=11.048978
> C-26-13: worst=35.825968 (C-27-18,C-26-04), best=3.382921
> (C-25-28,C-27-10), avg=11.002756
> C-25-44: worst=35.761833 (C-27-43,C-26-20), best=3.382921
> (C-25-18,C-27-01), avg=10.673819
> C-25-19: worst=35.789013 (C-27-54,C-27-06), best=3.382206
> (C-25-25,C-26-05), avg=10.877252
> C-27-56: worst=37.191868 (C-26-43,C-27-59), best=3.381014
> (C-27-24,C-27-11), avg=10.994493
> C-26-24: worst=35.806894 (C-26-42,C-27-24), best=3.381014
> (C-27-03,C-25-12), avg=10.798348
>
> Previously, MVAPICH 0.9.9 (with OFED 1.2.5.5) showed stability (this
> example was from a ~280-node test):
>
> C-25-35: worst=34.869000 (C-21-36,C-21-20), best=2.208000
> (C-21-28,C-21-28), avg=5.739774
> C-21-28: worst=34.914000 (C-25-43,C-25-27), best=2.210000
> (C-25-35,C-25-35), avg=5.692484
> C-27-63: worst=23.946000 (C-25-13,C-23-44), best=3.123000
> (C-21-01,C-27-61), avg=5.673201
> C-25-41: worst=34.792000 (C-21-42,C-21-26), best=3.177000
> (C-25-43,C-25-39), avg=5.715597
> C-25-45: worst=34.944000 (C-22-01,C-21-30), best=3.182000
> (C-26-02,C-25-43), avg=5.734901
> C-25-37: worst=34.829000 (C-21-38,C-21-22), best=3.183000
> (C-25-39,C-25-35), avg=5.715226
> C-25-33: worst=34.900000 (C-21-34,C-21-18), best=3.185000
> (C-25-35,C-25-31), avg=5.726198
> C-25-39: worst=34.939000 (C-21-40,C-21-24), best=3.185000
> (C-25-45,C-25-37), avg=5.760403
> C-25-43: worst=34.913000 (C-21-44,C-21-28), best=3.185000
> (C-25-45,C-25-41), avg=5.774223
> C-25-25: worst=34.834000 (C-21-26,C-21-10), best=3.187000
> (C-25-27,C-25-23), avg=5.701191
> C-25-29: worst=34.935000 (C-21-30,C-21-14), best=3.188000
> (C-25-31,C-25-27), avg=5.732307
> C-25-17: worst=34.905000 (C-21-18,C-21-02), best=3.193000
> (C-25-19,C-25-15), avg=5.712527
> C-25-09: worst=34.839000 (C-21-10,C-27-58), best=3.195000
> (C-25-11,C-25-07), avg=5.719269
> C-25-21: worst=34.826000 (C-21-22,C-21-06), best=3.195000
> (C-25-23,C-25-19), avg=5.709025
>
> The bandwidth portion of the test seems unaffected; I expect ~1.5GB/s
> best, ~300MB/s worst, and ~600MB/s average.  Here's a sample from the
> MVAPICH 1.0.0 test, values in MB/s, with about 120 nodes in the test:
>
> C-27-28: worst=311.533788 (C-26-43,C-25-01), best=1528.480740
> (C-27-27,C-27-29), avg=587.437197
> C-27-27: worst=304.659190 (C-26-42,C-27-62), best=1528.480740
> (C-27-26,C-27-28), avg=578.232542
> C-27-26: worst=305.106860 (C-26-43,C-27-59), best=1528.480740
> (C-27-25,C-27-27), avg=586.553406
> C-27-25: worst=305.004799 (C-26-42,C-27-58), best=1528.369347
> (C-27-24,C-27-26), avg=607.532908
> C-27-22: worst=305.069134 (C-26-39,C-27-55), best=1528.369347
> (C-27-21,C-27-23), avg=581.500478
> C-27-21: worst=387.081961 (C-27-33,C-27-09), best=1528.369347
> (C-27-20,C-27-22), avg=586.030643
> C-27-24: worst=312.083157 (C-26-39,C-27-59), best=1528.257970
> (C-27-23,C-27-25), avg=586.508800
> C-27-19: worst=389.750871 (C-26-21,C-25-05), best=1528.202288
> (C-27-18,C-27-20), avg=610.889360
> C-27-23: worst=305.124616 (C-26-40,C-27-56), best=1528.146610
> (C-27-22,C-27-24), avg=592.528629
> C-27-20: worst=375.241912 (C-26-23,C-25-05), best=1528.146610
> (C-27-19,C-27-21), avg=587.685815
> C-27-18: worst=381.980984 (C-27-30,C-27-06), best=1528.035265
> (C-27-17,C-27-19), avg=617.072824
> C-25-34: worst=367.293139 (C-27-34,C-27-01), best=1527.534416
> (C-25-33,C-25-35), avg=587.728400
>
> Previous tests (280 nodes in this case) look about the same...
> here's a sample:
>
> C-27-62: worst=324.551124 (C-25-28,C-23-27), best=1528.759294
> (C-25-05,C-25-05), avg=513.596860
> C-25-05: worst=286.193170 (C-27-20,C-21-35), best=1528.480740
> (C-27-62,C-27-62), avg=531.348528
> C-26-06: worst=309.232357 (C-22-17,C-21-26), best=1527.200699
> (C-26-02,C-26-10), avg=526.550142
> C-26-02: worst=308.702059 (C-22-13,C-21-22), best=1527.089492
> (C-25-43,C-26-06), avg=517.635786
> C-26-10: worst=303.692998 (C-27-26,C-23-39), best=1527.033895
> (C-26-06,C-26-14), avg=526.867299
> C-26-14: worst=302.966896 (C-27-29,C-23-44), best=1526.255959
> (C-26-10,C-26-18), avg=521.675016
> C-22-09: worst=353.618467 (C-25-12,C-27-20), best=1526.144890
> (C-26-16,C-26-16), avg=564.268143
> C-26-26: worst=311.455134 (C-27-18,C-25-34), best=1526.089361
> (C-26-22,C-26-30), avg=547.530324
> C-26-18: worst=304.969316 (C-27-36,C-23-45), best=1526.089361
> (C-26-14,C-26-22), avg=515.463309
> C-26-22: worst=302.152809 (C-25-02,C-27-42), best=1526.033837
> (C-26-18,C-26-26), avg=529.672528
> C-26-13: worst=299.106027 (C-27-28,C-23-43), best=1525.978316
> (C-26-12,C-26-14), avg=549.159212
> C-21-08: worst=292.192329 (C-25-11,C-25-19), best=1525.978316
> (C-25-15,C-25-15), avg=539.609085
> C-25-15: worst=270.248064 (C-21-04,C-21-12), best=1525.922800
> (C-21-08,C-21-08), avg=523.654988
> C-26-16: worst=293.111198 (C-21-30,C-22-33), best=1525.867288
> (C-22-09,C-22-09), avg=488.649320
> C-26-11: worst=300.619544 (C-27-29,C-23-38), best=1525.811779
> (C-26-10,C-26-12), avg=496.530124
> C-25-24: worst=317.670885 (C-22-08,C-27-40), best=1525.756275
> (C-21-17,C-21-17), avg=510.542500
>
> The test itself may be to blame.  The test is run with one rank per
> node.  The idea is a bisectional test where each node is exclusively
> sending to one node and exclusively receiving from another, with all
> nodes sending/receiving simultaneously; note that the send partner
> and receive partner in the sendrecv call will usually be different
> nodes, which lets the test measure bisectional bandwidth even with an
> odd number of nodes.  The latency test sends/receives zero bytes 1000
> times; the bandwidth test sends 4MB 10 times.  Iteratively, every
> rank eventually sends to and receives from every other rank, but not
> all send/recv combinations are enumerated (where nodes > 2).
>
> While you'd expect a fat-tree switch to deliver full bisectional
> bandwidth, it never does; that's a problem with the static routing
> used by the subnet manager (opensm).  Given that the average is ~1/3
> of the best bandwidth, I interpret that to mean that on average a
> rank is being blocked by two other ranks.  The worst case shows
> roughly 5 or 6 ranks blocking each other.
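>
> As a rough check against the ~120-node numbers above (back-of-the-
> envelope only, not a measurement):
>
>   1528 MB/s best / ~590 MB/s avg   ~= 2.6  -> about 3 flows per link
>   1528 MB/s best / ~305 MB/s worst ~= 5.0  -> 5-6 flows on the worst link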
>
> The routine goes through a "for" loop starting at the current rank:
> on each iteration the send-partner rank steps up and the recv-partner
> rank steps down (both circularly) until the loop gets back to the
> current rank.  The core of the routine looks like:
>
>   /* size and iters come from the driver: size = 0, iters = 1000 for
>    * the latency test; size = 4MB, iters = 10 for the bandwidth test. */
>   MPI_Init(&argc, &argv);
>   MPI_Comm_size(MPI_COMM_WORLD, &wsize);
>   MPI_Comm_rank(MPI_COMM_WORLD, &me);
>
>   /* Walk the ring: the send partner (lo) steps up and the receive
>    * partner (hi) steps down, both circularly, until we wrap back
>    * around to our own rank. */
>   for (hi = (me == 0) ? wsize - 1 : me - 1,
>        lo = (me + 1 == wsize) ? 0 : me + 1;
>        hi != me;
>        hi = (hi == 0) ? wsize - 1 : hi - 1,
>        lo = (lo + 1 == wsize) ? 0 : lo + 1) {
>
>     MPI_Barrier(MPI_COMM_WORLD);
>
>     start = MPI_Wtime();
>
>     for (i = 0; i < iters; i++) {
>       /* The same comBuf is used as both send and receive buffer;
>        * strictly, MPI requires disjoint buffers here, so
>        * MPI_Sendrecv_replace would be the conforming call. */
>       MPI_Sendrecv(&comBuf, size, MPI_CHAR, lo, 0, &comBuf, size,
>                    MPI_CHAR, hi, 0, MPI_COMM_WORLD, &stat);
>     }
>     diff = MPI_Wtime() - start;
>     sum += diff;
>     n++;
>     if (diff < min) {          /* fastest pairing seen by this rank */
>       minnode1 = lo;
>       minnode2 = hi;
>       min = diff;
>     }
>     if (diff > max) {          /* slowest pairing seen by this rank */
>       maxnode1 = lo;
>       maxnode2 = hi;
>       max = diff;
>     }
>   }
>
> At the end of the test, the best, worst, and average cases are
> reported for each node/rank, along with the node names associated with
> that best/worst event.  So, if there is an issue with a node, you'd
> expect that node to show up in multiple reports, as a single reported
> event only narrows the culprit down to two.
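>
> For what it's worth, the reporting at the end is just a per-rank
> printf; here is a minimal sketch (plus the usual stdio.h/stdlib.h
> includes), assuming the hostnames are gathered with
> MPI_Get_processor_name/MPI_Allgather and reusing min, max, sum, n,
> and the {min,max}node variables from the loop above.  The exact
> output format and unit conversion are illustrative, not the real
> code:
>
>   char myname[MPI_MAX_PROCESSOR_NAME], *names;
>   int namelen;
>
>   MPI_Get_processor_name(myname, &namelen);
>
>   /* collect every rank's hostname so the best/worst partner ranks
>    * can be reported by node name */
>   names = malloc((size_t) wsize * MPI_MAX_PROCESSOR_NAME);
>   MPI_Allgather(myname, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
>                 names, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
>                 MPI_COMM_WORLD);
>
>   /* latency pass: min/max/sum hold whole-loop times in seconds, so
>    * divide by iters and scale to usec */
>   printf("%s: worst=%f (%s,%s), best=%f (%s,%s), avg=%f\n",
>          myname,
>          max / iters * 1e6,
>          &names[maxnode1 * MPI_MAX_PROCESSOR_NAME],
>          &names[maxnode2 * MPI_MAX_PROCESSOR_NAME],
>          min / iters * 1e6,
>          &names[minnode1 * MPI_MAX_PROCESSOR_NAME],
>          &names[minnode2 * MPI_MAX_PROCESSOR_NAME],
>          sum / n / iters * 1e6);
>
>   free(names);
>   MPI_Finalize();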
>
> Any ideas would be appreciated.
>
> Chris
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss at cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>


