
bench: tpccbench regression between Feb 12th and 13th #62078

Closed
nvanbenschoten opened this issue Mar 16, 2021 · 36 comments

@nvanbenschoten
Member

nvanbenschoten commented Mar 16, 2021

Ref: https://roachperf.crdb.dev/?filter=&view=tpccbench%2Fnodes%3D3%2Fcpu%3D16&tab=aws

Between Feb 12th and Feb 13th, we saw a drop in max throughput of 16% on tpccbench running on AWS. We should determine the cause of this and resolve it.

roachtest run --cloud=aws tpccbench/nodes=3/cpu=16

The SHA before the drop was e9e3721 and after the drop was ba1a144.

nvanbenschoten added the C-performance (Perf of queries or internals. Solution not expected to change functional behavior.) and GA-blocker labels Mar 16, 2021
@nvanbenschoten
Member Author

@sumeerbhola looking at the code changes between these two days, the big ones that stand out to me are the lockTable changes (cee3bcf and 59bda0d). In the first commit, we began reading and writing to the separated lock table. In the second commit, we turned off writing to the separated lock table, but continued reading from it.

I know you've looked at various performance profiles and landed a few optimizations over the past few weeks, but I'm not sure whether you've made a ruling on whether these changes were or were not responsible for this top-level regression. Would you be able to help make a call on that? Can we temporarily disable reading from the separated lock table and compare that to what we're seeing each night?

@sumeerbhola
Collaborator

(for other cockroach labs folks) the previous discussion and profile is in https://cockroachlabs.slack.com/archives/C4X2J0RH6/p1615478361020000?thread_ts=1615397803.018300&cid=C4X2J0RH6

Can we temporarily disable reading from the separated lock table and compare that to what we're seeing each night?

Yes, I can try that.

@tbg
Member

tbg commented Mar 17, 2021

[image]

Just for my sanity, what's the orange line? The efficiency? Why isn't there a label?

@sumeerbhola
Collaborator

  • The kv workloads, like kv95, don't show a significant effect on these dates. I am assuming this is because they use a fixed number of workers for the workload, so they are not necessarily driving the machines to saturation (based on how ConnFlags.Concurrency is initialized and then used to decide how much to append to QueryLoad.WorkerFns). Is my understanding correct?
  • I ran tpccbench on master and with a modified intentInterleavingIter that constructs a proper EngineIterator but doesn't use it for anything useful and instead uses a fake EngineIterator that returns nothing. sumeerbhola@68468c1
    The highest warehouse count master passed at was 2530, while the fake iter passed at 2875. The master run with 2545 warehouses had very high latency across all operations:
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0          20538           34.2  10163.4   1610.6  38654.7  60129.5 103079.2  delivery
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0         200037          333.4  10392.1   1073.7  45097.2  94489.3 103079.2  newOrder
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0          20655           34.4   6203.2    100.7  28991.0  49392.1  85899.3  orderStatus
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0         204152          340.3   9123.7    570.4  40802.2  77309.4 103079.2  payment
_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  600.0s        0          20538           34.2   7035.6    209.7  32212.3  60129.5 103079.2  stockLevel

The high latency develops about 3 minutes into the run, after the 5-minute ramp-up is done.
Before the badness:

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  173.0s        0           74.0           51.8    260.0    906.0   1040.2   1040.2 delivery
  173.0s        0          608.0          479.2     71.3    469.8    637.5    872.4 newOrder
  173.0s        0           54.0           51.5      7.3     31.5     44.0     46.1 orderStatus
  173.0s        0          639.0          513.1     46.1    201.3    402.7    738.2 payment
  173.0s        0           45.0           51.1     22.0     71.3    117.4    117.4 stockLevel
  174.0s        0           42.0           51.7    604.0   1006.6   1140.9   1140.9 delivery
  174.0s        0          565.5          479.7    302.0    637.5    704.6    805.3 newOrder
  174.0s        0           56.9           51.6     25.2    104.9    142.6    151.0 orderStatus
  174.0s        0          521.5          513.1    192.9    453.0    570.4    671.1 payment
  174.0s        0           42.0           51.1     71.3    503.3    536.9    536.9 stockLevel
  175.0s        0           55.1           51.7    201.3    872.4   1073.7   1140.9 delivery
  175.0s        0          435.4          479.5     37.7    209.7    268.4    335.5 newOrder
  175.0s        0           53.1           51.6      7.6     46.1     50.3     75.5 orderStatus
  175.0s        0          468.5          512.9     23.1    130.0    192.9    318.8 payment
  175.0s        0           55.1           51.1     18.9     67.1     96.5    201.3 stockLevel
  176.0s        0           48.0           51.7    570.4    973.1   1208.0   1208.0 delivery
  176.0s        0          622.6          480.3    243.3    604.0    738.2   1006.6 newOrder
  176.0s        0           50.0           51.6     19.9     96.5    159.4    159.4 orderStatus
  176.0s        0          616.6          513.5    176.2    402.7    503.3    805.3 payment
  176.0s        0           62.0           51.2     60.8    260.0    268.4    285.2 stockLevel

After:

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
  237.0s        0           26.0           48.0  12348.0  23622.3  30064.8  30064.8 delivery
  237.0s        0          445.4          457.6  12348.0  23622.3  25769.8  28991.0 newOrder
  237.0s        0           24.0           48.3   8589.9  11811.2  11811.2  11811.2 orderStatus
  237.0s        0          284.6          477.3  11811.2  21474.8  23622.3  28991.0 payment
  237.0s        0           26.0           47.7  10200.5  13421.8  19327.4  19327.4 stockLevel
  238.0s        0           19.0           47.9  13421.8  25769.8  30064.8  30064.8 delivery
  238.0s        0          351.3          457.1  12884.9  22548.6  27917.3  30064.8 newOrder
  238.0s        0           29.0           48.2   9126.8  11811.2  12348.0  12348.0 orderStatus
  238.0s        0          284.3          476.5  12348.0  23622.3  25769.8  30064.8 payment
  238.0s        0           24.0           47.6   9663.7  12884.9  18253.6  18253.6 stockLevel
  239.0s        0            6.0           47.7  14495.5  22548.6  22548.6  22548.6 delivery
  239.0s        0          215.1          456.1  12348.0  24696.1  26843.5  30064.8 newOrder
  239.0s        0           18.0           48.1   9663.7  12348.0  12884.9  12884.9 orderStatus
  239.0s        0           74.0          474.8  11811.2  21474.8  24696.1  25769.8 payment
  239.0s        0           15.0           47.4  10200.5  13421.8  18253.6  18253.6 stockLevel
  240.0s        0           25.0           47.6  15569.3  26843.5  30064.8  30064.8 delivery
  240.0s        0          413.6          455.9  12884.9  24696.1  31138.5  32212.3 newOrder
  240.0s        0           32.0           48.0  10200.5  12884.9  13958.6  13958.6 orderStatus
  240.0s        0          296.7          474.0  12348.0  23622.3  26843.5  28991.0 payment
  240.0s        0           26.0           47.3   9126.8  12884.9  12884.9  12884.9 stockLevel
  • I then ran with 2700 warehouses on master in the hope of reproducing the badness. It worked fine. Here are summaries from 2 runs:
_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
 1800.0s    33232.4  95.7%    358.2    318.8    704.6    872.4   1208.0   3623.9
_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
 1200.0s    33013.6  95.1%    400.6    318.8    838.9   1073.7   1677.7   7516.2

A comparable run with 2700 warehouses with the fake iter

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  900.0s    33037.6  95.1%    230.5    201.3    469.8    570.4    805.3   2415.9

(Ignore the different elapsed times -- I was fiddling with them.) Note the lower latency with the fake iter, even though the efficiency numbers are comparable.
Looking at the cpu utilization on these master runs shows them running hot, with mean utilization of ~83% across the 3 nodes; the hottest node reaches 88%. My assumption is that no reasonable production setting will run their OLTP database so hot in steady state (even in OLAP settings I have not seen steady state > 75%, and those don't suffer from the OLTP issue of slowness due to cpu resulting in increased inter-transaction contention and/or retries). A profile shows 15% of the cpu in intentInterleavingIter and 5% in SeekEngineKey (there is only 0.1% in NewEngineIterator). The corresponding cpu utilization and profiles for the 2700 warehouse runs with the fake iter show ~79% utilization and 10% in intentInterleavingIter (because the 5% from SeekEngineKeyGE is eliminated), so these numbers are quite consistent.

Perhaps what is happening is that we are too close to the edge at 83% utilization (where one node got to 88%), so we could tip over into badness. Here is a screenshot of transaction restarts with master:
[screenshot: transaction restarts, master]
and the corresponding screenshot with the fake iter:
[screenshot: transaction restarts, fake iter]
I don't understand the nature of TPCC well enough to know whether collapse due to increased contention is a possibility. I am looking for guidance from folks who do understand it.

  • I looked into why SeekEngineKey is 5% of cpu despite the many Pebble optimizations intended to reduce its cost when iterators are reused. Based on logging of scan and put counts in BatchRequests, most such batches have 1 put or 1 scan. So there is very little batching in this workload, and hence little iterator reuse.

  • Another question is why tpccbench does not exceed 2530 on master, while I can run it separately at 2700. Perhaps there is some perturbation from previous runs in the search that were also running at high CPU. Pebble read amplification stayed low in the 2700 warehouse runs I did, but maybe it didn't in tpccbench. Or maybe there is something about loading 3000 warehouses and then only running with a subset as active warehouses. I could use some ideas on this.
    I did run with 2400 warehouses with the fake iter and cpu dropped from 79% to 74%. 74/79=0.94 while 2400/2700=0.88, which means cpu is not decreasing proportionally with the warehouse count (there is a fixed overhead). So it could be that even at lower warehouse counts master gets unlucky, with cpu utilization on some node getting high enough that badness ensues.

My next step will be to see if there is any scope for further optimizing SeekEngineKey, but I am not hopeful (there was nothing obvious in the profiles).

And I'm skeptical about the value of the tpccbench warehouse count metric, since probing so close to the edge means small changes in cpu can have a disproportionate impact on the warehouse count.

@sumeerbhola
Collaborator

The roachperf dashboard now shows a rise to 2815. Anyone know what changed there?

@irfansharif
Contributor

It could be due to #61777. There was a regression around Feb 20th (#62148) that we think was due to #59992. The partial mitigation for #59992 was #61777.

@erikgrinaker
Contributor

erikgrinaker commented Mar 18, 2021

It's interesting that the same benchmark on GCE just stopped working after Feb 14:

https://roachperf.crdb.dev/?filter=&view=tpccbench%2Fnodes%3D3%2Fcpu%3D16&tab=gce

Is this related to the TPCC VM overload issues you've been seeing @irfansharif?

@irfansharif
Contributor

irfansharif commented Mar 18, 2021

Yup, exactly. Looks like #62039 is still not good enough (#62145). I'll try to resuscitate this test after I'm through other tracing-related optimizations (#62118). I haven't tried it, but I bet sql.txn_stats.sample_rate = 0.0 would nurse this test back into health (by effectively disabling sampling/tracing). Sumeer, perhaps that's something you'd want to try if you're seeing these nodes fall over.

@irfansharif
Contributor

Sumeer, if you're able to find another benchmark that doesn't go so "close to the edge" like TPC-C does, one that also shows a regression between Feb 12th-13th, that might be a path of lesser resistance.

@sumeerbhola
Collaborator

Based on logging of scan and put counts in BatchRequests, most such batches have 1 put or 1 scan. So there is very little batching in this workload, and hence little iterator reuse.

I noticed there are 3316 replicas per node. I wonder if reducing the number of replicas could result in more batching.

My next step will be to see if there is any scope for further optimizing SeekEngineKey, but I am not hopeful

I found one thing. SeekPrefixGE, when it fails to match the bloom filter, can needlessly skip to the next file, which is a wasted cost and defeats the seek-avoidance optimization when a batch needs to do another SeekPrefixGE. One can see it in seekEmptyFileForward in the following profile -- this happens mainly for SeekEngineKeyGE since that is used for the locks, and usually there isn't any lock. This seems to account for almost half of the 5% we pay in SeekEngineKeyGE.

[screenshot: CPU profile showing seekEmptyFileForward]
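
To make the mechanism concrete, here is a minimal, hypothetical Go sketch (simplified types and a fake "bloom filter", not Pebble's actual levelIter code) of the behavior described above and the intended fix: on a bloom filter non-match, only advance to the next file when the current file's upper bound is not beyond the prefix, so the iterator stays positioned for the next SeekPrefixGE.

```go
package main

import (
	"bytes"
	"fmt"
)

// fileIter is a simplified stand-in for a per-sstable iterator. SeekPrefixGE
// returning false models a bloom filter non-match for the prefix.
type fileIter struct {
	keys       [][]byte // keys present in this "file"
	upperBound []byte   // upper bound of the file's key range
}

func (f *fileIter) SeekPrefixGE(prefix []byte) bool {
	for _, k := range f.keys {
		if bytes.HasPrefix(k, prefix) {
			return true
		}
	}
	return false
}

// levelIterSketch merges the files of one level. The fix: on a bloom filter
// non-match, if the current file's bounds still cover the prefix, report
// "not found" without skipping to the next file, so a later SeekPrefixGE can
// reuse the already-open file.
type levelIterSketch struct {
	files   []*fileIter
	current int
}

func (l *levelIterSketch) SeekPrefixGE(prefix []byte) bool {
	for l.current < len(l.files) {
		f := l.files[l.current]
		if f.SeekPrefixGE(prefix) {
			return true
		}
		if bytes.Compare(f.upperBound, prefix) > 0 {
			return false // prefix absent at this level; stay on this file
		}
		l.current++ // prefix lies beyond this file's upper bound
	}
	return false
}

func main() {
	li := &levelIterSketch{files: []*fileIter{
		{keys: [][]byte{[]byte("a1")}, upperBound: []byte("m")},
		{keys: [][]byte{[]byte("p1")}, upperBound: []byte("z")},
	}}
	fmt.Println(li.SeekPrefixGE([]byte("b")), li.current) // false 0: no needless skip to the next file
	fmt.Println(li.SeekPrefixGE([]byte("p")), li.current) // true 1
}
```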

sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Mar 19, 2021
When SeekPrefixGE on the underlying file returns false
due to a bloom filter non-match, levelIter would skip
to the next file. This is wasteful if the upper bound
of the file is beyond the prefix. Additionally, it
defeats the optimization for sparse key spaces like
CockroachDB's lock table, where we try to reuse the
current position of the iterator -- by skipping to the
next file the subsequent SeekPrefixGE will again need
to reload the previous file.

This behavior was first noticed when diagnosing tpcc
slowness in CockroachDB, where almost half the
overhead of seeking in the lock table could be
attributed to this (see
cockroachdb/cockroach#62078
for details).

The benchmark numbers for bloom=true/with-tombstone=false
are the ones intended to benefit from this change.

name                                                                        old time/op    new time/op    delta
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=false/with-tombstone=false-16     441ns ± 9%     445ns ± 7%     ~     (p=0.332 n=19+20)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=false/with-tombstone=true-16      299ns ± 3%     300ns ± 3%     ~     (p=0.455 n=20+20)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=true/with-tombstone=false-16     3.73µs ± 8%    0.82µs ± 2%  -78.02%  (p=0.000 n=20+16)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=true/with-tombstone=true-16      1.78µs ±73%    1.21µs ± 7%  -32.15%  (p=0.000 n=20+20)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=false/with-tombstone=false-16     484ns ±27%     427ns ± 2%  -11.83%  (p=0.000 n=19+19)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=false/with-tombstone=true-16      320ns ± 7%     300ns ± 3%   -6.11%  (p=0.000 n=16+19)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=true/with-tombstone=false-16     5.07µs ±41%    0.82µs ± 2%  -83.84%  (p=0.000 n=20+18)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=true/with-tombstone=true-16      1.76µs ±37%    1.21µs ± 9%  -30.92%  (p=0.000 n=20+20)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=false/with-tombstone=false-16     439ns ± 4%     436ns ± 6%     ~     (p=0.109 n=20+20)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=false/with-tombstone=true-16      435ns ±29%     307ns ± 5%  -29.40%  (p=0.000 n=20+19)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=true/with-tombstone=false-16     5.63µs ±19%    0.82µs ± 2%  -85.40%  (p=0.000 n=20+19)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=true/with-tombstone=true-16      1.87µs ±36%    1.24µs ± 8%  -33.66%  (p=0.000 n=20+20)

name                                                                        old alloc/op   new alloc/op   delta
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=false/with-tombstone=false-16     0.00B          0.00B          ~     (all equal)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=false/with-tombstone=true-16      0.00B          0.00B          ~     (all equal)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=true/with-tombstone=false-16      0.00B          0.00B          ~     (all equal)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=true/with-tombstone=true-16        271B ± 0%      271B ± 0%     ~     (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=false/with-tombstone=false-16     0.00B          0.00B          ~     (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=false/with-tombstone=true-16      0.00B          0.00B          ~     (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=true/with-tombstone=false-16      0.00B          0.00B          ~     (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=true/with-tombstone=true-16        271B ± 0%      271B ± 0%     ~     (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=false/with-tombstone=false-16     0.00B          0.00B          ~     (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=false/with-tombstone=true-16      0.00B          0.00B          ~     (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=true/with-tombstone=false-16      0.00B          0.00B          ~     (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=true/with-tombstone=true-16        271B ± 0%      271B ± 0%     ~     (all equal)

name                                                                        old allocs/op  new allocs/op  delta
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=false/with-tombstone=false-16      0.00           0.00          ~     (all equal)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=false/with-tombstone=true-16       0.00           0.00          ~     (all equal)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=true/with-tombstone=false-16       0.00           0.00          ~     (all equal)
IteratorSeqSeekPrefixGENotFound/skip=1/bloom=true/with-tombstone=true-16        1.00 ± 0%      1.00 ± 0%     ~     (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=false/with-tombstone=false-16      0.00           0.00          ~     (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=false/with-tombstone=true-16       0.00           0.00          ~     (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=true/with-tombstone=false-16       0.00           0.00          ~     (all equal)
IteratorSeqSeekPrefixGENotFound/skip=2/bloom=true/with-tombstone=true-16        1.00 ± 0%      1.00 ± 0%     ~     (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=false/with-tombstone=false-16      0.00           0.00          ~     (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=false/with-tombstone=true-16       0.00           0.00          ~     (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=true/with-tombstone=false-16       0.00           0.00          ~     (all equal)
IteratorSeqSeekPrefixGENotFound/skip=4/bloom=true/with-tombstone=true-16        1.00 ± 0%      1.00 ± 0%     ~     (all equal)
@sumeerbhola
Collaborator

The 2700 warehouse run with this SeekPrefixGE improvement (cockroachdb/pebble#1091) looks much better in terms of latency.

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
 1200.0s    33345.9  96.0%    142.6    117.4    318.8    369.1    520.1   1811.9

The cpu utilization is lower and the relative cost of SeekEngineKeyGE (compared to the profile screenshot in the previous comment) is reduced.

[screenshot: CPU profile]

sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Mar 19, 2021 (same commit message as above)
sumeerbhola added a commit to cockroachdb/pebble that referenced this issue Mar 19, 2021 (same commit message as above)
@sumeerbhola
Collaborator

@nvanbenschoten @petermattis
Now that the Pebble change is merged, we'll need to bump Pebble for the release-21.1 branch to pick it up. There is one other change to Pebble since the last bump (and that change precedes this one), which is the limited iteration mode (cockroachdb/pebble@639dfce). Even though it is intertwined in the same functions (via an optional limit parameter), the limited iteration mode is not going to be exercised by CockroachDB code in 21.1. So I think the risk of also including the limited iteration change is low.

Thoughts?

@nvanbenschoten
Member Author

Thank you for digging into this, Sumeer. Your analysis throughout this issue has been awe-inspiring. If landing cockroachdb/pebble#1091 in this release significantly reduces the cost we pay to scan the lock table and helps close the gap on TPC-C, and we do intend to keep the empty lock table scan, then I'd like to see it get in. This will set us up well for the release and for furthering the lock table project in the next release. That said, while I've followed along with your changes in Pebble, I'm not the right person to gauge their potential risk.

sumeerbhola added a commit to cockroachdb/pebble that referenced this issue Mar 19, 2021 (same commit message as above)
@erikgrinaker
Contributor

FYI, I did a few kv95/enc=false/nodes=1/cpu=32 runs to see how they were affected, doing 5 runs at each commit and taking the median result:

| Description | Commit | Ops/s |
| --- | --- | --- |
| Baseline | 9f97177 | 74510 |
| Sep. intents enabled | 98beba6 | 71406 |
| Sep. intents disabled (lock table reads) | 197d0cc | 71661 |
| With Pebble leveliter fix | 197d0cc + cockroachdb/pebble@92fbeeabfd | 72392 |

Seems like we took a ~4% hit with separated intents enabled. Disabling them again didn't make much difference, and the Pebble fix only recovered about 1/3 of the drop.

tbg assigned stevendanna and angelapwen and unassigned sumeerbhola Mar 24, 2021
@tbg
Member

tbg commented Mar 24, 2021

@angelapwen and @stevendanna are going to start a targeted bisection into this; notes (internal) are here.

@andy-kimball
Contributor

Before we resolve this issue, I think we need to explicitly state:

  1. The magnitude of the drop. There are a lot of numbers in the comments above, but I'm not sure what the final numbers are.
  2. The reasons why we're accepting that drop. I think we need to make an explicit, business-driven decision for why the benefits we get with the separated lock table justify a drop in this benchmark.

Looking at other layers to get offsetting improvements doesn't feel satisfying to me. That's because I was hoping/expecting to do those improvements (like #57223) to improve our performance, not just to claw our way back to where we were before. The perf drop we see in this issue will put a permanent "cap" on our max performance, and we should make sure all the right stakeholders have agreed it's the right thing to do (have the PMs weighed in?).

@tbg
Member

tbg commented Mar 31, 2021

For tpccbench/nodes=3/cpu=16, it looks like relative to the 20.2 release we went from ~1620 to ~2100 warehouses (gce) and ~2500 to ~2800 (aws), so we didn't actually go down at all. (cc @erikgrinaker to make sure I'm not saying things that are wrong for release-21.1)

The regression we last discussed is on kv95, which is essentially a contrived point-selects-only workload that is very sensitive. I would argue it isn't overly representative of anything realistic users might do, especially not in the regime at which these numbers are recorded (near full system utilization).

I was planning to bring this up in the next release-triage meeting on Monday, so I'll keep this issue open to make it pop up on the spreadsheet. I'm not sure which PM you would consult with or how they could make any kind of informed determination here, but feel free to rope someone in.

@erikgrinaker
Contributor

For tpccbench/nodes=3/cpu=16, it looks like relative to the 20.2 release we went from ~1620 to ~2100 warehouses (gce) and ~2500 to ~2800 (aws), so we didn't actually go down at all.

I think this is because the older TPCC benchmarks were capped at lower warehouse counts. We've had a hard time getting a good signal from TPCC runs, as individual runs vary quite a lot. If necessary we can do a full set of comparison runs, but it'll be fairly time-consuming.

The regression we last discussed is on kv95, which is essentially a contrived point-selects-only workload that is very sensitive. I would argue it isn't overly representative of anything realistic users might do

I agree, kv95 is extremely narrow, and basically the worst case for measuring overhead in the query path. I'll do a suite of all six YCSB workloads for 20.2, 21.1, and 21.1 without separated intents, which should give us a more varied picture.

erikgrinaker reopened this Mar 31, 2021
@tbg
Member

tbg commented Mar 31, 2021

I think this is because the older TPCC benchmarks were capped at lower warehouse counts. We've had a hard time getting a good signal from TPCC runs, as individual runs vary quite a lot. If necessary we can do a full set of comparison runs, but it'll be fairly time-consuming.

If this is necessary, we have to nominate someone else to do it. This group has done more than its fair share of the work, and we need to get back to actually fixing issues.

@erikgrinaker
Contributor

erikgrinaker commented Mar 31, 2021

Results from a few YCSB and kv95 benchmarks on GCE, doing 5 runs each and taking the median. Commit hashes:

Details of YCSB workloads here: https://github.com/brianfrankcooper/YCSB/wiki/Core-Workloads

Numbers are cumulative ops/s. The first delta is the regression from 20.2 to 21.1; the second delta is the improvement from disabling the lock table -- both relative to the 20.2 baseline.

| Workload | 20.2 | 21.1 | 21.1 w/o locktable | Regress | No locktable |
| --- | --- | --- | --- | --- | --- |
| ycsb/A/nodes=3 | 17860 | 17009 | 17122 | -4.8% | +0.6% |
| ycsb/B/nodes=3 | 33359 | 31076 | 31375 | -6.8% | +0.9% |
| ycsb/C/nodes=3 | 46690 | 41040 | 42463 | -12.1% | +3.1% |
| ycsb/D/nodes=3 | 37442 | 33554 | 34860 | -10.4% | +3.5% |
| ycsb/E/nodes=3 | 1873 | 1894 | 1874 | +1.1% | -1.1% |
| ycsb/F/nodes=3 | 8256 | 8030 | 8173 | -2.7% | +1.7% |
| kv95/enc=false/nodes=1/cpu=32 | 76313 | 71412 | 72735 | -6.4% | +1.7% |
| kv95/enc=false/nodes=3 | 40295 | 38108 | 39004 | -5.4% | +2.2% |
| kv95/enc=false/nodes=3/batch=16 | 4584 | 4598 | 4793 | +0.3% | +4.3% |

There is a bit of uncertainty here, e.g. kv95/enc=false/nodes=1/cpu=32 now shows a 1.7% improvement without locktable reads, while the earlier numbers showed a ~5.1% improvement without locktable reads -- I'd say the error is roughly ±2%. Intermediate numbers are here, and I have the raw logs available on request.

Even so, it's clear that we currently have a regression of about 5% or more. Looks like lock table reads make up about a third of this, and the rest is likely due to vectorized execution and tracing, as well as other minor regressions.

The roachperf graphs of performance over time may also be of interest: https://roachperf.crdb.dev/

@petermattis
Collaborator

We need to run the ycsb and kv95 numbers again in light of cockroachdb/pebble#1098 being backported to 21.1. It sounds like the delta on several ycsb workloads will be eliminated.

Is there anything to do to get a good determination of where we stand on tpcc vs 20.2? Tobi's comment (#62078 (comment)) suggests that tpcc performance might have improved, but we don't have a good stable signal from tpccbench. Is anyone concerned enough to advocate putting in the elbow grease to get a proper tpcc comparison? Or will we be satisfied with ycsb and kv95?

PS: Rather than taking the median of 5 runs, in the past I've done 10 runs on each workload and then transformed the result output into go-bench format using the following script:

#!/bin/bash
# Transform roachtest result logs under an artifacts directory into
# go-bench-style lines, so two runs can be compared with `benchstat old new`.

if [ "$#" -ne 1 ]; then
    echo "usage: $0 <artifacts-dir>"
    exit 1
fi

for file in $(find $1 -name test.log -o -name '*ycsb.log' -o -name '*kv.log'); do
    # Use the log file's grandparent directory (relative to the artifacts dir)
    # as the benchmark name.
    name=$(dirname $(dirname $(realpath --relative-to=$1 $file)))
    # Grab the summary line following each __result marker and emit selected
    # fields (throughput and latency percentiles) in go-bench format.
    grep -h -A1 __result $file \
        | grep -v '^--$' | grep -v __result | \
        awk "{printf \"Benchmark$name  1  %s ops/sec  %s p50  %s p95  %s p99\n\", \$4, \$6, \$7, \$8}"
done

You can then use benchstat old new to see the perf diff, which will throw out outlier numbers and give a +/- interval. I should probably check this script into the CRDB repo somewhere.

@erikgrinaker
Contributor

We need to run the ycsb and kv95 numbers again in light of cockroachdb/pebble#1098 being backported to 21.1. It sounds like the delta on several ycsb workloads will be eliminated.

These numbers were taken after that was backported (#62676). The YCSB numbers used to be far worse.

Is there anything to do to get a good determination of where we stand on tpcc vs 20.2?

Anecdotally, we've seen numbers that fit well with the kv95 and ycsb regression range (5-10%). Would be happy to apply some elbow grease if we think it's worth the effort.

I've done 10 runs on each workload and the transformed the result output into go-bench format

Nice, would be useful to have this checked in somewhere.

@petermattis
Collaborator

These numbers were taken after that was backported (#62676). The YCSB numbers used to be far worse.

You are correct about cockroachdb/pebble#1098, but there is also cockroachdb/pebble#1107. I'm not sure if that latter PR moves the needle on the CRDB-level benchmarks, though.

Anecdotally, we've seen numbers that fit well with the kv95 and ycsb regression range (5-10%). Would be happy to apply some elbow grease if we think it's worth the effort.

The tpcc numbers fit within that range?

Nice, would be useful to have this checked in somewhere.

I agree. Care to pick up this ball?

@erikgrinaker
Contributor

there is also cockroachdb/pebble#1107

Right, ok. I can do a few quick runs and see if it moves the needle.

The tpcc numbers fit within that range?

I didn't do any TPCC work myself, but I believe we'd seen numbers that had recovered to within 10% of the original baseline on individual runs. They're rather noisy though, so that may or may not be accurate.

Care to pick up this ball?

Will do.

@erikgrinaker
Contributor

Did 5 runs of the following benchmarks at release-21.1 @ cf1a794 (compared with the 20.2 and 21.1 results from above):

| Workload | 20.2 | 21.1 | Regression | Prev 21.1 |
| --- | --- | --- | --- | --- |
| ycsb/A/nodes=3 | 17860 | 16973 | -6.0% | 17009 |
| ycsb/C/nodes=3 | 46690 | 42438 | -9.1% | 41040 |
| kv95/enc=false/nodes=1/cpu=32 | 76313 | 72053 | -5.6% | 71412 |

The results are within the error margin of the previous 21.1 results, so it doesn't appear as if cockroachdb/pebble#1107 had any significant effect.

@petermattis
Collaborator

@erikgrinaker Did you use benchstat for this comparison? These benchmarks are quite noisy. Here is a view using benchstat for comparison:

~ benchstat -delta-test none out-v20.2 out-v21.1
name                             old ops/sec  new ops/sec  delta
kv95/enc=false/nodes=1/cpu=32     73.9k ± 4%   73.1k ± 1%   -1.10%
kv95/enc=false/nodes=3            36.1k ±11%   37.8k ± 6%   +4.60%
kv95/enc=false/nodes=3/batch=16   4.25k ± 9%   4.21k ± 3%   -0.90%
ycsb/A/nodes=3                    17.1k ±11%   16.7k ± 7%   -2.37%
ycsb/B/nodes=3                    32.4k ±11%   30.7k ±12%   -5.03%
ycsb/C/nodes=3                    44.1k ± 9%   41.0k ± 6%   -6.88%
ycsb/D/nodes=3                    35.6k ± 9%   34.6k ± 5%   -2.80%
ycsb/E/nodes=3                    1.82k ±13%   1.92k ± 1%   +5.69%
ycsb/F/nodes=3                    7.80k ±11%   7.72k ±15%   -1.00%

Definitely notice that variation per run. The above is with the delta-test disabled. If you use the default benchstat -delta-test (utest) we see:

~ benchstat out-v20.2 out-v21.1
name                             old ops/sec  new ops/sec  delta
kv95/enc=false/nodes=1/cpu=32     73.9k ± 4%   73.1k ± 1%     ~     (p=0.481 n=10+10)
kv95/enc=false/nodes=3            36.1k ±11%   37.8k ± 6%     ~     (p=0.193 n=7+10)
kv95/enc=false/nodes=3/batch=16   4.25k ± 9%   4.21k ± 3%     ~     (p=1.000 n=8+8)
ycsb/A/nodes=3                    17.1k ±11%   16.7k ± 7%     ~     (p=0.515 n=8+10)
ycsb/B/nodes=3                    32.4k ±11%   30.7k ±12%     ~     (p=0.059 n=8+9)
ycsb/C/nodes=3                    44.1k ± 9%   41.0k ± 6%     ~     (p=0.068 n=8+10)
ycsb/D/nodes=3                    35.6k ± 9%   34.6k ± 5%     ~     (p=0.743 n=8+9)
ycsb/E/nodes=3                    1.82k ±13%   1.92k ± 1%   +5.69%  (p=0.000 n=7+9)
ycsb/F/nodes=3                    7.80k ±11%   7.72k ±15%     ~     (p=0.515 n=8+10)

So benchstat thinks that most of the deltas are not statistically significant. We could probably tease out whether there is a difference with more runs. It certainly looks like there is a regression on ycsb/C, but don't ignore that noise. I spent a lot of time futzing with similar benchmarks when we were switching from RocksDB to Pebble and eventually settled on doing 20 runs per test to give myself confidence.

@erikgrinaker
Contributor

Did you use benchstat for this comparison?

No, in order to compare with the previous numbers I figured I shouldn't change up the methodology for this run.

Here is a view using benchstat for comparison:

Thanks for doing another set of runs, that's awesome! Definitely like the benchstat reporting, the uncertainty metrics are very useful. I've noticed it's the de facto format around here, and I'll be sure to adopt it as well.

don't ignore that noise. I spent a lot of time futzing with similar benchmarks when we were switching from RocksDB to Pebble and eventually settled on doing 20 runs per test to give myself confidence.

Absolutely. These numbers came on the tail end of a regression investigation that covered a five-month timeframe and a ~35% regression. We had to cover a lot of ground and were looking for large deltas, so a 5-run median (to ignore outliers) seemed appropriate. I agree that once we're zooming in on more specific comparisons we need to be more rigorous; I suppose I was just used to doing things a certain way at that point.

I am fairly confident that we're looking at a ~5% regression overall though (for some definition of overall), and the roachperf graphs give that impression as well. Using averages for these sample sizes tends to skew results downwards, since outliers tend to be negative (it's much more likely that a cluster anomaly causes a large slowdown than a large speedup). It might be worth discarding the outer results in either direction.
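
To illustrate the aggregation point (hypothetical helpers, not part of roachtest or benchstat): the median, or a trimmed mean that discards the outer result in either direction, is much less affected by one anomalously slow run than a plain average.

```go
package main

import (
	"fmt"
	"sort"
)

// median returns the middle value of the samples.
func median(samples []float64) float64 {
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	n := len(s)
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// trimmedMean drops the lowest and highest sample before averaging.
func trimmedMean(samples []float64) float64 {
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	s = s[1 : len(s)-1]
	var sum float64
	for _, v := range s {
		sum += v
	}
	return sum / float64(len(s))
}

func main() {
	// Five illustrative ops/sec samples, one dragged down by a cluster anomaly.
	runs := []float64{74510, 74210, 73950, 74330, 61000}
	var avg float64
	for _, v := range runs {
		avg += v
	}
	avg /= float64(len(runs))
	fmt.Printf("mean=%.0f median=%.0f trimmed=%.0f\n", avg, median(runs), trimmedMean(runs))
}
```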

@petermattis
Collaborator

Absolutely. These numbers came on the tail end of a regression investigation that covered a five-month timeframe and a ~35% regression. We had to cover a lot of ground and were looking for large deltas, so a 5-run median (to ignore outliers) seemed appropriate. I agree that once we're zooming in on more specific comparisons we need to be more rigorous; I suppose I was just used to doing things a certain way at that point.

Ack. Makes complete sense to do fewer runs when there are large discrepancies that are being looked for. We're likely going to do this sort of comparison again in 6 months. Perhaps it is worthwhile to create a script or playbook to describe how the testing should be done.

I am fairly confident that we're looking at a ~5% regression overall though (for some definition of overall), and the roachperf graphs give that impression as well. Using averages for these sample sizes tends to skew results downwards, since outliers tend to be negative (it's much more likely that a cluster anomaly causes a large slowdown than a large speedup). It might be worth discarding the outer results in either direction.

I believe benchstat does exclude outliers. That's why the n=7+9 values are not always n=10+10 even though I did 10 runs.

@erikgrinaker
Contributor

Perhaps it is worthwhile to create a script or playbook to describe how the testing should be done.

We started one, will amend it tomorrow.

I believe benchstat does exclude outliers. That's why the n=7+9 values are not always n=10+10 even though I did 10 runs.

Indeed, nifty!

@tbg
Member

tbg commented Apr 14, 2021

Ack. Makes complete sense to do fewer runs when there are large discrepancies that are being looked for. We're likely going to do this sort of comparison again in 6 months. Perhaps it is worthwhile to create a script or playbook to describe how the testing should be done.

We'll do this on a continuous basis going forward, though there are still details being ironed out on how that is organized.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 19, 2021
This commit introduces object pooling for `pebbleReadOnly` allocation avoidance.
I found this to be important both because it avoids the initial `pebbleReadOnly`
allocation and because it allows the memory recycling inside of each
`pebbleIterator` owned by a `pebbleReadOnly` to work correctly.

I found this while running a few microbenchmarks and noticing that the
lockTable's calls to `intentInterleavingReader` were resulting in a large number
of heap allocations in `(*pebbleReadOnly).NewEngineIterator`. This was because
`lowerBoundBuf` and `upperBoundBuf` were always nil and so each append (all 4 of
them) in `(*pebbleIterator).init` was causing an allocation.

```
name                          old time/op    new time/op    delta
KV/Scan/Native/rows=1-16        30.9µs ± 4%    29.9µs ± 6%   -3.29%  (p=0.000 n=20+20)
KV/Scan/Native/rows=100-16      54.2µs ± 4%    52.7µs ± 5%   -2.84%  (p=0.002 n=20+20)
KV/Scan/Native/rows=10-16       34.0µs ± 3%    33.1µs ± 6%   -2.64%  (p=0.001 n=20+20)
KV/Scan/Native/rows=1000-16      253µs ± 5%     255µs ± 5%     ~     (p=0.659 n=20+20)
KV/Scan/Native/rows=10000-16    2.16ms ± 4%    2.14ms ± 3%     ~     (p=0.072 n=20+20)

name                          old alloc/op   new alloc/op   delta
KV/Scan/Native/rows=1-16        8.69kB ± 0%    7.49kB ± 0%  -13.79%  (p=0.000 n=20+19)
KV/Scan/Native/rows=10-16       10.1kB ± 0%     8.9kB ± 0%  -11.87%  (p=0.000 n=20+18)
KV/Scan/Native/rows=100-16      22.7kB ± 0%    21.5kB ± 0%   -5.29%  (p=0.000 n=17+19)
KV/Scan/Native/rows=1000-16      174kB ± 0%     172kB ± 0%   -0.66%  (p=0.000 n=19+19)
KV/Scan/Native/rows=10000-16    1.51MB ± 0%    1.51MB ± 0%   -0.05%  (p=0.000 n=16+19)

name                          old allocs/op  new allocs/op  delta
KV/Scan/Native/rows=1-16          71.0 ± 0%      62.0 ± 0%  -12.68%  (p=0.000 n=20+20)
KV/Scan/Native/rows=10-16         75.0 ± 0%      66.0 ± 0%  -12.00%  (p=0.000 n=20+19)
KV/Scan/Native/rows=100-16        79.0 ± 0%      70.0 ± 0%  -11.39%  (p=0.000 n=19+19)
KV/Scan/Native/rows=1000-16       87.8 ± 1%      79.0 ± 0%   -9.97%  (p=0.000 n=20+16)
KV/Scan/Native/rows=10000-16       113 ± 2%       103 ± 2%   -8.19%  (p=0.000 n=17+19)
```

We may want to consider this as a candidate to backport to release-21.1, because
the lack of pooling here was even more detrimental with the separated lockTable,
which creates a separate EngineIterator. So this may have a small impact on cockroachdb#62078.

Release note (performance improvement): A series of heap allocations performed
when serving read-only queries are now avoided.
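
For readers unfamiliar with the pattern, the following is a minimal Go sketch of the kind of pooling and buffer recycling described above, assuming a `sync.Pool` plus append-into-retained-buffer scheme; all names (`readOnly`, `boundBuf`, `setBounds`) are illustrative and this is not the actual CockroachDB `pebbleReadOnly` code:

```
package main

import (
	"fmt"
	"sync"
)

// readOnly stands in for a pooled read-only engine handle. Its scratch
// buffers are retained across uses so that setting iterator bounds can
// reuse previously grown capacity instead of allocating each time.
type readOnly struct {
	lowerBoundBuf []byte
	upperBoundBuf []byte
}

var readOnlyPool = sync.Pool{
	New: func() interface{} { return &readOnly{} },
}

func getReadOnly() *readOnly { return readOnlyPool.Get().(*readOnly) }

// setBounds mimics the append-based bound copying: once the buffers have
// enough capacity, these appends no longer hit the heap.
func (r *readOnly) setBounds(lower, upper []byte) {
	r.lowerBoundBuf = append(r.lowerBoundBuf[:0], lower...)
	r.upperBoundBuf = append(r.upperBoundBuf[:0], upper...)
}

// release returns the object to the pool, keeping the grown buffers.
// (sync.Pool may still discard pooled objects at any time.)
func (r *readOnly) release() { readOnlyPool.Put(r) }

func main() {
	for i := 0; i < 3; i++ {
		r := getReadOnly()
		r.setBounds([]byte("a"), []byte("z"))
		fmt.Println(cap(r.lowerBoundBuf), cap(r.upperBoundBuf))
		r.release()
	}
}
```

The key point is that without pooling, each request starts with nil scratch buffers and every bound-setting append allocates; with pooling, only the first use of each pooled object pays that cost.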
@tbg
Copy link
Member

tbg commented Apr 20, 2021

We discussed this yesterday in the release triage meeting and decided that we would eat the hit as it occurs on synthetic workloads that are unlikely to be representative of real workloads, plus we were reluctant to pull either separated intents or vectorized execution and noted that there are caching improvements coming in the 21.2 cycle that will more than make up for lost ground.

@tbg tbg closed this as completed Apr 20, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 20, 2021
storage: pool pebbleReadOnly allocations
craig bot pushed a commit that referenced this issue Apr 21, 2021
63829: changefeedccl: Improve avro encoder performance r=miretskiy a=miretskiy

Avoid expensive allocations (maps) when encoding datums.
Improve encoder performance by ~40%, and significantly reduce
memory allocations per op.

```
BenchmarkEncodeInt-16                    1834214               665.4 ns/op            73 B/op          5 allocs/op
BenchmarkEncodeBool-16                   1975244               597.8 ns/op            33 B/op          3 allocs/op
BenchmarkEncodeFloat-16                  1773226               661.6 ns/op            73 B/op          5 allocs/op
BenchmarkEncodeBox2D-16                   628884              1740 ns/op             579 B/op         18 allocs/op
BenchmarkEncodeGeography-16              1734722               713.3 ns/op           233 B/op          5 allocs/op
BenchmarkEncodeGeometry-16               1495227              1208 ns/op            2737 B/op          5 allocs/op
BenchmarkEncodeBytes-16                  2171725               698.4 ns/op            64 B/op          5 allocs/op
BenchmarkEncodeString-16                 1847884               696.0 ns/op            49 B/op          4 allocs/op
BenchmarkEncodeDate-16                   2159253               701.6 ns/op            64 B/op          5 allocs/op
BenchmarkEncodeTime-16                   1857284               682.9 ns/op            81 B/op          6 allocs/op
BenchmarkEncodeTimeTZ-16                  833163              1405 ns/op             402 B/op         14 allocs/op
BenchmarkEncodeTimestamp-16              1623998               720.5 ns/op            97 B/op          6 allocs/op
BenchmarkEncodeTimestampTZ-16            1614201               719.0 ns/op            97 B/op          6 allocs/op
BenchmarkEncodeDecimal-16                 790902              1473 ns/op             490 B/op         23 allocs/op
BenchmarkEncodeUUID-16                   2216424               783.0 ns/op           176 B/op          6 allocs/op
BenchmarkEncodeINet-16                   1545225               817.6 ns/op           113 B/op          8 allocs/op
BenchmarkEncodeJSON-16                   2146824              1731 ns/op             728 B/op         21 allocs/op
```


Release Notes: None
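
As a generic illustration of the allocation avoidance described for the avro encoder above (reusing a scratch value rather than building a fresh map per datum), here is a hedged Go sketch; the `encoder` type and its method are invented for this example and are not the changefeedccl API:

```
package main

import "fmt"

// encoder keeps a reusable scratch map so that encoding a datum does not
// allocate a new map on every call. (Illustrative only.)
type encoder struct {
	scratch map[string]interface{}
}

func newEncoder() *encoder {
	return &encoder{scratch: make(map[string]interface{}, 1)}
}

// encodeInt clears and reuses the scratch map rather than allocating one per call.
func (e *encoder) encodeInt(field string, v int64) map[string]interface{} {
	for k := range e.scratch {
		delete(e.scratch, k)
	}
	e.scratch[field] = v
	return e.scratch
}

func main() {
	e := newEncoder()
	fmt.Println(e.encodeInt("long", 42))
	fmt.Println(e.encodeInt("long", 43))
}
```

The trade-off is that the returned map is only valid until the next encode call, which is fine when the caller serializes it immediately.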


63845: storage: pool pebbleReadOnly allocations r=nvanbenschoten a=nvanbenschoten

This commit introduces object pooling for `pebbleReadOnly` allocation avoidance. I found this to be important both because it avoids the initial `pebbleReadOnly` allocation, but also because it allows the memory recycling inside of each `pebbleIterator` owned by a `pebbleReadOnly` to work correctly.

I found this while running a few microbenchmarks and noticing that the lockTable's calls to `intentInterleavingReader` were resulting in a large number of heap allocations in `(*pebbleReadOnly).NewEngineIterator`. This was because `lowerBoundBuf` and `upperBoundBuf` were always nil and so each append (all 4 of them) in `(*pebbleIterator).init` was causing an allocation.

```
name                          old time/op    new time/op    delta
KV/Scan/Native/rows=1-16        30.9µs ± 4%    29.9µs ± 6%   -3.29%  (p=0.000 n=20+20)
KV/Scan/Native/rows=100-16      54.2µs ± 4%    52.7µs ± 5%   -2.84%  (p=0.002 n=20+20)
KV/Scan/Native/rows=10-16       34.0µs ± 3%    33.1µs ± 6%   -2.64%  (p=0.001 n=20+20)
KV/Scan/Native/rows=1000-16      253µs ± 5%     255µs ± 5%     ~     (p=0.659 n=20+20)
KV/Scan/Native/rows=10000-16    2.16ms ± 4%    2.14ms ± 3%     ~     (p=0.072 n=20+20)

name                          old alloc/op   new alloc/op   delta
KV/Scan/Native/rows=1-16        8.69kB ± 0%    7.49kB ± 0%  -13.79%  (p=0.000 n=20+19)
KV/Scan/Native/rows=10-16       10.1kB ± 0%     8.9kB ± 0%  -11.87%  (p=0.000 n=20+18)
KV/Scan/Native/rows=100-16      22.7kB ± 0%    21.5kB ± 0%   -5.29%  (p=0.000 n=17+19)
KV/Scan/Native/rows=1000-16      174kB ± 0%     172kB ± 0%   -0.66%  (p=0.000 n=19+19)
KV/Scan/Native/rows=10000-16    1.51MB ± 0%    1.51MB ± 0%   -0.05%  (p=0.000 n=16+19)

name                          old allocs/op  new allocs/op  delta
KV/Scan/Native/rows=1-16          71.0 ± 0%      62.0 ± 0%  -12.68%  (p=0.000 n=20+20)
KV/Scan/Native/rows=10-16         75.0 ± 0%      66.0 ± 0%  -12.00%  (p=0.000 n=20+19)
KV/Scan/Native/rows=100-16        79.0 ± 0%      70.0 ± 0%  -11.39%  (p=0.000 n=19+19)
KV/Scan/Native/rows=1000-16       87.8 ± 1%      79.0 ± 0%   -9.97%  (p=0.000 n=20+16)
KV/Scan/Native/rows=10000-16       113 ± 2%       103 ± 2%   -8.19%  (p=0.000 n=17+19)
```

We may want to consider this as a candidate to backport to release-21.1, because the lack of pooling here was even more detrimental with the separated lockTable, which creates a separate `EngineIterator` with lower and upper bounds. So this may have a small impact on #62078.

Release note (performance improvement): A series of heap allocations performed when serving read-only queries are now avoided.

63904: colbuilder: support 'CASE expr WHEN exprs' form r=yuzefovich a=yuzefovich

Previously, we didn't support the `CASE expr WHEN exprs` form of CASE
expression and had to fall back to the row-by-row engine. This form requires
just one additional step: performing an equality comparison of `expr`
against the projection of the current WHEN arm to decide whether that
arm matched. This commit does so.

Release note: None
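
Conceptually, the `CASE expr WHEN exprs` form reduces to an equality comparison of `expr` against each WHEN value in turn; the toy Go sketch below mirrors that scalar evaluation and is not the vectorized colbuilder implementation:

```
package main

import "fmt"

// caseWhen evaluates CASE expr WHEN v THEN r ... ELSE els END by comparing
// expr against each WHEN value in order, returning the matching arm's result.
func caseWhen(expr int, whens []int, thens []string, els string) string {
	for i, w := range whens {
		if expr == w {
			return thens[i]
		}
	}
	return els
}

func main() {
	// Equivalent to: CASE x WHEN 1 THEN 'one' WHEN 2 THEN 'two' ELSE 'other' END, with x = 2.
	fmt.Println(caseWhen(2, []int{1, 2}, []string{"one", "two"}, "other"))
}
```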

63947: execinfra: mark 'sql.distsql.temp_storage.workmem' as public r=yuzefovich a=yuzefovich

Release note (sql change): `sql.distsql.temp_storage.workmem` cluster
setting is now marked as public and is included into the documentation.
It determines how much RAM a single operation of a single query can use
before it must spill to temporary storage. Note that operations that
don't support disk spilling will ignore this setting and are subject
only to the `--max-sql-memory` startup argument.

Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 21, 2021
storage: pool pebbleReadOnly allocations