kvcoord: SingleRoundtripWithLatency performance regression #98887

Closed
nvanbenschoten opened this issue Mar 17, 2023 · 3 comments
Labels
A-kv: Anything in KV that doesn't belong in a more specific category.
branch-master: Failures and bugs on the master branch.
branch-release-23.1: Used to mark GA and release blockers, technical advisories, and bugs for 23.1.
C-performance: Perf of queries or internals. Solution not expected to change functional behavior.
GA-blocker
T-kv: KV Team

Comments


nvanbenschoten commented Mar 17, 2023

See #98068 and https://docs.google.com/spreadsheets/d/10GhYr_91CANCNKOM_gPy7Sk9hQkTyQGNgNwNgfHeUtI/edit#gid=4.

benchdiff --old=81a114c --new=master --post-checkout='dev generate go' --run='SingleRoundtripWithLatency' ./pkg/kv/kvclient/kvcoord
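benchdiff compiles and runs the Go benchmarks under the hood; to run the benchmark directly on a single checkout (after regenerating code with `dev generate go`), an invocation along these lines should work:

$ dev generate go
$ go test -run=^$ -bench=SingleRoundtripWithLatency -benchmem -count=10 ./pkg/kv/kvclient/kvcoord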

Jira issue: CRDB-25569

nvanbenschoten added the C-performance, A-kv, branch-master, GA-blocker, T-kv, and branch-release-23.1 labels Mar 17, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 20, 2023
Informs cockroachdb#98887.

Avoids mixing logs with benchmark results, which breaks benchdiff.

Release note: None
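
The commit above was merged below as #99023 ("kv: add log scope to BenchmarkSingleRoundtripWithLatency"). As a rough sketch, assuming the usual log.Scope idiom from pkg/util/log (the actual diff and package layout may differ), the change amounts to something like:

package kvcoord_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/util/log"
)

func BenchmarkSingleRoundtripWithLatency(b *testing.B) {
	// Redirect log output to a per-benchmark scope (a temporary directory)
	// so it does not interleave with the benchmark results that benchdiff
	// parses from stdout.
	defer log.Scope(b).Close(b)

	// ... existing benchmark body ...
}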
craig bot pushed a commit that referenced this issue Mar 20, 2023
98792: kvserver: unskip `TestNewTruncateDecision` r=erikgrinaker a=erikgrinaker

Passed after 10k stress runs. The test has been skipped since 2019; the underlying issue seems to have been fixed in the meantime.

Resolves #38584.

Epic: none
Release note: None

98855: roachtest: enable schema changes in acceptance/version-upgrade r=fqazi a=fqazi

Previously, schema changes were disabled inside the version-upgrade test due to flakes. This patch re-enables them, since we are confident that the workload itself is now stable in a mixed-version state.

Fixes: #58489
Release note: None

99023: kv: add log scope to BenchmarkSingleRoundtripWithLatency r=arulajmani a=nvanbenschoten

Informs #98887.

Avoids mixing logs with benchmark results, which breaks benchdiff.

Release note: None

99033: storepool: set last unavailable on gossip dead r=andrewbaptist a=kvoli

Previously, the `LastUnavailable` time was set in most parts of the storepool when a store was considered either `Unavailable`, `Dead`, `Decommissioned` or `Draining`. When `LastUnavailable` is within the last suspect duration (30s default), the node is treated as suspect by other nodes in the cluster.

`LastUnavailable` was not being set when a store was considered dead due to the store not gossiping its store descriptor. This commit updates the `status` storepool function to do just that.

Informs: #98928

Release note: None
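
An illustrative Go sketch of the suspect-window check described in the 99033 entry above, with hypothetical names (isSuspect and its parameters); the actual storepool code is more involved:

package main

import (
	"fmt"
	"time"
)

// isSuspect reports whether a store should be treated as suspect: it was
// last observed unavailable within the suspect duration (30s by default).
func isSuspect(now, lastUnavailable time.Time, suspectDuration time.Duration) bool {
	return now.Sub(lastUnavailable) < suspectDuration
}

func main() {
	now := time.Now()
	// A store marked unavailable 10s ago (e.g. dead because it stopped
	// gossiping its store descriptor) is still within the 30s suspect window.
	fmt.Println(isSuspect(now, now.Add(-10*time.Second), 30*time.Second)) // true
	// A store last unavailable 5 minutes ago is no longer suspect.
	fmt.Println(isSuspect(now, now.Add(-5*time.Minute), 30*time.Second)) // false
}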

99039: pkg/ccl/backupccl: Remove TestBackupRestoreControlJob r=benbardin a=benbardin

This test was marked skipped for flakiness in 2018.

Fixes: #24136

Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Ben Bardin <bardin@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue Mar 20, 2023
Informs #98887.

Avoids mixing logs with benchmark results, which breaks benchdiff.

Release note: None
nvanbenschoten (Member, Author) commented:

This regression looks real:

$ benchdiff --old=origin/release-22.2 --new=origin/master --post-checkout='dev generate go' --run='SingleRoundtripWithLatency' --count=10 ./pkg/kv/kvclient/kvcoord

  pkg=1/1 iter=10/10 cockroachdb/cockroach/pkg/kv/kvclient/kvcoord /

name                                        old time/op    new time/op    delta
SingleRoundtripWithLatency/latency=10ms-30    12.0ms ± 2%    12.0ms ± 1%     ~     (p=0.237 n=10+8)
SingleRoundtripWithLatency/latency=0s-30       123µs ± 3%     172µs ± 4%  +39.04%  (p=0.000 n=10+10)

name                                        old alloc/op   new alloc/op   delta
SingleRoundtripWithLatency/latency=0s-30      16.0kB ± 1%    17.7kB ± 0%  +10.40%  (p=0.000 n=10+8)
SingleRoundtripWithLatency/latency=10ms-30    19.7kB ± 8%    23.1kB ± 4%  +17.31%  (p=0.000 n=10+10)

name                                        old allocs/op  new allocs/op  delta
SingleRoundtripWithLatency/latency=0s-30         151 ± 0%       147 ± 0%   -2.84%  (p=0.000 n=10+10)
SingleRoundtripWithLatency/latency=10ms-30       188 ± 3%       191 ± 1%     ~     (p=0.133 n=10+9)

nvanbenschoten self-assigned this Mar 22, 2023
nvanbenschoten (Member, Author) commented:

We've had some recent performance improvements around Raft which I think are showing up here, but there is still a regression.

On my GCE worker:

name                                      old time/op    new time/op    delta
SingleRoundtripWithLatency/latency=0s-30     128µs ± 3%     170µs ± 4%  +32.85%  (p=0.000 n=20+19)

name                                      old alloc/op   new alloc/op   delta
SingleRoundtripWithLatency/latency=0s-30    16.0kB ± 1%    17.5kB ± 1%   +9.76%  (p=0.000 n=20+20)

name                                      old allocs/op  new allocs/op  delta
SingleRoundtripWithLatency/latency=0s-30       152 ± 0%       147 ± 0%   -2.97%  (p=0.000 n=20+20)

On my M1 Mac:

name                                      old time/op    new time/op    delta
SingleRoundtripWithLatency/latency=0s-10    52.8µs ± 6%    57.0µs ± 9%  +7.92%  (p=0.000 n=19+19)

name                                      old alloc/op   new alloc/op   delta
SingleRoundtripWithLatency/latency=0s-10    16.0kB ± 1%    17.3kB ± 1%  +8.58%  (p=0.000 n=19+20)

name                                      old allocs/op  new allocs/op  delta
SingleRoundtripWithLatency/latency=0s-10       151 ± 0%       146 ± 0%  -3.31%  (p=0.000 n=20+20)

nvanbenschoten (Member, Author) commented:

CPU profiles reveal that the new time is being spent in the raft scheduler.
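
For reference, with standard Go tooling a CPU profile of the zero-latency case can be collected roughly like so (the exact invocation used here is an assumption):

$ go test -run=^$ -bench='SingleRoundtripWithLatency/latency=0s' -cpuprofile=cpu.out ./pkg/kv/kvclient/kvcoord
$ go tool pprof cpu.out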

I took a look at whether async raft storage writes were the cause. However, I did not see an improvement when disabling kv.raft_log.non_blocking_synchronization.enabled.

Instead, I see that the regression was caused by #89632. That change made single-replica raft groups pass through the raft scheduler twice instead of once when performing a write. This benchmark is running with replication factor 1x, so it makes sense that we would see a small (~5µs) regression due to that change.

Comparing 2c1c354 and master:

name                                      old time/op    new time/op    delta
SingleRoundtripWithLatency/latency=0s-10    56.8µs ±11%    56.4µs ± 3%     ~     (p=0.573 n=10+8)

name                                      old alloc/op   new alloc/op   delta
SingleRoundtripWithLatency/latency=0s-10    18.2kB ± 1%    17.4kB ± 1%   -4.65%  (p=0.000 n=10+10)

name                                      old allocs/op  new allocs/op  delta
SingleRoundtripWithLatency/latency=0s-10       173 ± 0%       146 ± 0%  -15.61%  (p=0.000 n=9+10)

Comparing fad19d5 and master:

name                                      old time/op    new time/op    delta
SingleRoundtripWithLatency/latency=0s-10    52.5µs ± 7%    57.2µs ± 6%   +8.90%  (p=0.000 n=10+10)

name                                      old alloc/op   new alloc/op   delta
SingleRoundtripWithLatency/latency=0s-10    18.0kB ± 1%    17.3kB ± 0%   -3.50%  (p=0.000 n=10+8)

name                                      old allocs/op  new allocs/op  delta
SingleRoundtripWithLatency/latency=0s-10       169 ± 0%       146 ± 0%  -13.61%  (p=0.000 n=10+10)

This impact is within the expectations of that change, so I'll close this issue.
