kvcoord: SingleRoundtripWithLatency performance regression #98887

Closed
nvanbenschoten opened this issue Mar 17, 2023 · 3 comments
Labels
A-kv: Anything in KV that doesn't belong in a more specific category.
branch-master: Failures and bugs on the master branch.
branch-release-23.1: Used to mark GA and release blockers, technical advisories, and bugs for 23.1.
C-performance: Perf of queries or internals. Solution not expected to change functional behavior.
GA-blocker
T-kv: KV Team

Comments


nvanbenschoten commented Mar 17, 2023

See #98068 and https://docs.google.com/spreadsheets/d/10GhYr_91CANCNKOM_gPy7Sk9hQkTyQGNgNwNgfHeUtI/edit#gid=4.

benchdiff --old=81a114c --new=master --post-checkout='dev generate go' --run='SingleRoundtripWithLatency' ./pkg/kv/kvclient/kvcoord
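benchdiff compiles and runs the Go benchmarks under the hood; to run the benchmark directly on a single checkout (after regenerating code with `dev generate go`), an invocation along these lines should work:

$ dev generate go
$ go test -run=^$ -bench=SingleRoundtripWithLatency -benchmem -count=10 ./pkg/kv/kvclient/kvcoord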

Jira issue: CRDB-25569

nvanbenschoten added the C-performance, A-kv, branch-master, GA-blocker, T-kv, and branch-release-23.1 labels Mar 17, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 20, 2023
Informs cockroachdb#98887.

Avoids mixing logs with benchmark results, which breaks benchdiff.

Release note: None
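
The commit above was merged below as #99023 ("kv: add log scope to BenchmarkSingleRoundtripWithLatency"). As a rough sketch, assuming the usual log.Scope idiom from pkg/util/log (the actual diff and package layout may differ), the change amounts to something like:

package kvcoord_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/util/log"
)

func BenchmarkSingleRoundtripWithLatency(b *testing.B) {
	// Redirect log output to a per-benchmark scope (a temporary directory)
	// so it does not interleave with the benchmark results that benchdiff
	// parses from stdout.
	defer log.Scope(b).Close(b)

	// ... existing benchmark body ...
}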
craig bot pushed a commit that referenced this issue Mar 20, 2023
98792: kvserver: unskip `TestNewTruncateDecision` r=erikgrinaker a=erikgrinaker

Passed after 10k stress runs. The test has been skipped since 2019; the underlying issue seems to have been fixed in the meantime.

Resolves #38584.

Epic: none
Release note: None

98855: roachtest: enable schema changes in acceptance/version-upgrade r=fqazi a=fqazi

Previously, schema changes were disabled inside the version-upgrade test due to flakes. This patch re-enables them, since we are confident that the workload itself is now stable in a mixed-version state.

Fixes: #58489
Release note: None

99023: kv: add log scope to BenchmarkSingleRoundtripWithLatency r=arulajmani a=nvanbenschoten

Informs #98887.

Avoids mixing logs with benchmark results, which breaks benchdiff.

Release note: None

99033: storepool: set last unavailable on gossip dead r=andrewbaptist a=kvoli

Previously, the `LastUnavailable` time was set in most parts of the storepool when a store was considered either `Unavailable`, `Dead`, `Decommissioned` or `Draining`. When `LastUnavailable` is within the last suspect duration (30s default), the node is treated as suspect by other nodes in the cluster.

`LastUnavailable` was not being set when a store was considered dead due to the store not gossiping its store descriptor. This commit updates the `status` storepool function to do just that.

Informs: #98928

Release note: None
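
An illustrative Go sketch of the suspect-window check described in the 99033 entry above, with hypothetical names (isSuspect and its parameters); the actual storepool code is more involved:

package main

import (
	"fmt"
	"time"
)

// isSuspect reports whether a store should be treated as suspect: it was
// last observed unavailable within the suspect duration (30s by default).
func isSuspect(now, lastUnavailable time.Time, suspectDuration time.Duration) bool {
	return now.Sub(lastUnavailable) < suspectDuration
}

func main() {
	now := time.Now()
	// A store marked unavailable 10s ago (e.g. dead because it stopped
	// gossiping its store descriptor) is still within the 30s suspect window.
	fmt.Println(isSuspect(now, now.Add(-10*time.Second), 30*time.Second)) // true
	// A store last unavailable 5 minutes ago is no longer suspect.
	fmt.Println(isSuspect(now, now.Add(-5*time.Minute), 30*time.Second)) // false
}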

99039: pkg/ccl/backupccl: Remove TestBackupRestoreControlJob r=benbardin a=benbardin

This test was marked skipped for flakiness in 2018.

Fixes: #24136

Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Ben Bardin <bardin@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue Mar 20, 2023
Informs #98887.

Avoids mixing logs with benchmark results, which breaks benchdiff.

Release note: None
nvanbenschoten (Member, Author) commented:

This regression looks real:

$ benchdiff --old=origin/release-22.2 --new=origin/master --post-checkout='dev generate go' --run='SingleRoundtripWithLatency' --count=10 ./pkg/kv/kvclient/kvcoord

  pkg=1/1 iter=10/10 cockroachdb/cockroach/pkg/kv/kvclient/kvcoord /

name                                        old time/op    new time/op    delta
SingleRoundtripWithLatency/latency=10ms-30    12.0ms ± 2%    12.0ms ± 1%     ~     (p=0.237 n=10+8)
SingleRoundtripWithLatency/latency=0s-30       123µs ± 3%     172µs ± 4%  +39.04%  (p=0.000 n=10+10)

name                                        old alloc/op   new alloc/op   delta
SingleRoundtripWithLatency/latency=0s-30      16.0kB ± 1%    17.7kB ± 0%  +10.40%  (p=0.000 n=10+8)
SingleRoundtripWithLatency/latency=10ms-30    19.7kB ± 8%    23.1kB ± 4%  +17.31%  (p=0.000 n=10+10)

name                                        old allocs/op  new allocs/op  delta
SingleRoundtripWithLatency/latency=0s-30         151 ± 0%       147 ± 0%   -2.84%  (p=0.000 n=10+10)
SingleRoundtripWithLatency/latency=10ms-30       188 ± 3%       191 ± 1%     ~     (p=0.133 n=10+9)

nvanbenschoten self-assigned this Mar 22, 2023
nvanbenschoten (Member, Author) commented:

We've had some recent performance improvements around Raft which I think are showing up here, but there is still a regression.

On my GCE worker:

name                                      old time/op    new time/op    delta
SingleRoundtripWithLatency/latency=0s-30     128µs ± 3%     170µs ± 4%  +32.85%  (p=0.000 n=20+19)

name                                      old alloc/op   new alloc/op   delta
SingleRoundtripWithLatency/latency=0s-30    16.0kB ± 1%    17.5kB ± 1%   +9.76%  (p=0.000 n=20+20)

name                                      old allocs/op  new allocs/op  delta
SingleRoundtripWithLatency/latency=0s-30       152 ± 0%       147 ± 0%   -2.97%  (p=0.000 n=20+20)

On my M1 Mac:

name                                      old time/op    new time/op    delta
SingleRoundtripWithLatency/latency=0s-10    52.8µs ± 6%    57.0µs ± 9%  +7.92%  (p=0.000 n=19+19)

name                                      old alloc/op   new alloc/op   delta
SingleRoundtripWithLatency/latency=0s-10    16.0kB ± 1%    17.3kB ± 1%  +8.58%  (p=0.000 n=19+20)

name                                      old allocs/op  new allocs/op  delta
SingleRoundtripWithLatency/latency=0s-10       151 ± 0%       146 ± 0%  -3.31%  (p=0.000 n=20+20)

nvanbenschoten (Member, Author) commented:

CPU profiles reveal that the new time is being spent in the raft scheduler.
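
For reference, with standard Go tooling a CPU profile of the zero-latency case can be collected roughly like so (the exact invocation used here is an assumption):

$ go test -run=^$ -bench='SingleRoundtripWithLatency/latency=0s' -cpuprofile=cpu.out ./pkg/kv/kvclient/kvcoord
$ go tool pprof cpu.out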

I took a look at whether async raft storage writes were the cause. However, I did not see an improvement when disabling kv.raft_log.non_blocking_synchronization.enabled.

Instead, I see that the regression was caused by #89632. That change made single-replica raft groups pass through the raft scheduler twice instead of once when performing a write. This benchmark is running with replication factor 1x, so it makes sense that we would see a small (~5µs) regression due to that change.

Comparing 2c1c354 and master:

name                                      old time/op    new time/op    delta
SingleRoundtripWithLatency/latency=0s-10    56.8µs ±11%    56.4µs ± 3%     ~     (p=0.573 n=10+8)

name                                      old alloc/op   new alloc/op   delta
SingleRoundtripWithLatency/latency=0s-10    18.2kB ± 1%    17.4kB ± 1%   -4.65%  (p=0.000 n=10+10)

name                                      old allocs/op  new allocs/op  delta
SingleRoundtripWithLatency/latency=0s-10       173 ± 0%       146 ± 0%  -15.61%  (p=0.000 n=9+10)

Comparing fad19d5 and master:

name                                      old time/op    new time/op    delta
SingleRoundtripWithLatency/latency=0s-10    52.5µs ± 7%    57.2µs ± 6%   +8.90%  (p=0.000 n=10+10)

name                                      old alloc/op   new alloc/op   delta
SingleRoundtripWithLatency/latency=0s-10    18.0kB ± 1%    17.3kB ± 0%   -3.50%  (p=0.000 n=10+8)

name                                      old allocs/op  new allocs/op  delta
SingleRoundtripWithLatency/latency=0s-10       169 ± 0%       146 ± 0%  -13.61%  (p=0.000 n=10+10)

This impact is within the expectations of that change, so I'll close this issue.
