
perf: kv0/kv95 regression around Feb 25 [loosely coupled log truncation] #78412

Open · 24 tasks done
tbg opened this issue Mar 24, 2022 · 29 comments
Assignees: sumeerbhola
Labels: C-investigation Further steps needed to qualify. C-label will change. · T-kv KV Team

@tbg (Member) commented Mar 24, 2022

kv0: [graph]

kv95: [graph]

I'm investigating this.

$ git log origin/master --topo-order --merges --oneline --since '2022-02-24' --until '2022-02-26'

Bisecting via

GCE_PROJECT=andrei-jepsen ./pkg/cmd/roachtest/roachstress.sh -c 5 '^kv95/enc=false/nodes=3/size=64kb$'
for f in artifacts/9dc9eb034d/*/kv95/enc\=false/nodes\=3/size\=64kb/run_*/*run_kv.log; do tail -n 12 $f; done | grep 'read$'

Jira issue: CRDB-14119

tbg added the C-investigation Further steps needed to qualify. C-label will change. label Mar 24, 2022
tbg self-assigned this Mar 24, 2022
@tbg (Member, Author) commented Mar 24, 2022

The "winner" is #76902

 1800.0s        0       11939952         6633.3     16.0     11.0     48.2     71.3   2013.3  read
 1800.0s        0       11980208         6655.7     15.5     10.5     46.1     67.1   1677.7  read
 1800.0s        0       11874980         6597.2     15.9     11.0     48.2     71.3   1140.9  read
 1800.0s        0       10644279         5913.5     18.4     10.0     62.9    100.7   1744.8  read
 1800.0s        0       11759015         6532.8     15.8     11.0     46.1     71.3   1744.8  read

"Without regression" looks consistently like this:

 1800.0s        0       12552535         6973.6     15.0     10.5     44.0     65.0   1946.2  read
 1800.0s        0       12654748         7030.4     15.2     10.5     44.0     65.0   1409.3  read
 1800.0s        0       12558263         6976.8     15.0     10.0     44.0     65.0   2684.4  read
 1800.0s        0       12584422         6991.3     15.4     10.5     46.1     65.0   2147.5  read
 1800.0s        0       12380216         6877.9     15.1     10.0     44.0     67.1   1744.8  read

tbg assigned sumeerbhola and unassigned tbg Mar 24, 2022
tbg added the GA-blocker label Mar 24, 2022
tbg added the branch-release-22.1 (Used to mark GA and release blockers, technical advisories, and bugs for 22.1) and branch-master (Failures and bugs on the master branch.) labels Mar 24, 2022
@sumeerbhola (Collaborator)

I'm trying the kv0/enc=false/nodes=3/size=4kb reproduction.
This is a strange test: it does not overload the CPU or cause Pebble write amplification to increase. If the bottleneck is the number of workers running the workload, it would be worth revising this test, since having the throughput be this sensitive to small changes in latency is not ideal. The KV transaction throughput slowly decreases during the run due to a latency increase; I don't know the root cause of that increase (this was also the behavior before this regression). About 1% of CPU is spent in the truncator, mostly in loading the range applied state when there is a miss in the block cache (see the profile screenshots below).
[CPU profile screenshots]

@sumeerbhola (Collaborator)

Based on extra logging, the loosely-coupled truncations are never being merged, isDeltaTrusted is always true, and ~11% of the calls to tryEnactTruncations fail because the raft applied index has not advanced sufficiently. So loosely-coupled truncation is behaving as expected.
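
A minimal, hypothetical sketch of the gating condition described above (illustrative names only, not the CockroachDB implementation): a queued loosely-coupled truncation is enacted only once the range's raft applied index has reached the truncation's index; otherwise it stays queued for a later attempt.

package main

import "fmt"

// pendingTruncation is a queued, loosely-coupled truncation that wants to
// drop raft log entries up to and including Index.
type pendingTruncation struct {
	RangeID int64
	Index   uint64
}

// tryEnact reports whether the truncation can be enacted now: only if the
// range's applied index has advanced to (or past) the truncation index.
func tryEnact(appliedIndex uint64, t pendingTruncation) bool {
	return appliedIndex >= t.Index
}

func main() {
	pending := pendingTruncation{RangeID: 527, Index: 1200}
	fmt.Println(tryEnact(1199, pending)) // false: applied index has not advanced sufficiently
	fmt.Println(tryEnact(1250, pending)) // true: safe to truncate up to index 1200
}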

@sumeerbhola (Collaborator)

kv95/enc=false/nodes=3/size=64kb is similar, in that ~10% of tryEnactTruncations calls fail because the raft applied index is insufficiently advanced. This test does overload the store, so admission control is throttling via IO tokens -- but the truncator writes are not throttled. The CPU profile is similar, and utilization is ~25%.
Given the low CPU utilization, I'm going to ignore the 1% spent in the truncator.

In kv95, there are often ~3s intervals between groups of tryEnactTruncations, which suggests this is the interval of the durability-advanced callbacks (driven by Pebble flushes). Perhaps the lag is causing the raft entry cache to have more misses. I turned off loosely-coupled truncation at 15:38 in the graphs below and couldn't spot a difference (the second and third graphs are entrycache.size and entrycache.bytes).
[metrics screenshot: entrycache.size and entrycache.bytes, among others]

@tbg (Member, Author) commented Mar 28, 2022

If you turn the loosely coupled truncations off, do you see a difference in QPS? I'm still confused about how this is even happening (as are you, I assume).

@sumeerbhola (Collaborator)

I think we should try to find the root cause based on kv0/enc=false/nodes=3/size=4kb, or any other regressed roachtest that does not overload Pebble.
The reason: there is a possibility that the lag in truncation, which raises the steady-state number of raft log entries, is making the read amplification worse. That would be hard to pinpoint, and the probability is low (most of the bytes in the LSM should be in the state machine), so it would just make the investigation harder.

@sumeerbhola (Collaborator)

If you turn the loosely coupled truncations off, do you see a difference in QPS? I'm still confused about how this is even happening (as are you, I assume).

Nothing that I can spot by eyeballing the graphs (below is the kv95 run where loosely-coupled truncation was turned off at 15:38).
[kv95 metrics screenshot]

@nvanbenschoten (Member) commented Mar 28, 2022

The fact that we only see this on benchmarks with large writes is interesting. If we suspect that the lag in truncation is related, then we should look for effects which would be more pronounced with larger raft log entries.

I was thinking that maybe the quota pool could be involved somehow, but once we're truncating up to a log index, the quota pool should have already advanced past that index.

Looking at the raft entry cache hit rate is a good idea. However, once we're truncating up to a log index, even if that truncation hasn't been applied yet, we shouldn't ever need to go back and read entries below the truncation index, so those entries shouldn't be causing other entries above the truncation index to be getting evicted. So what effect could more entries in the entry cache have? One theory is that we efficiently evict entries from the entry cache once they are truncated (see raftentry.Cache.Clear). If we delay truncation, large entries below the in-progress truncation index will remain in the cache longer. If they then get evicted by the LRU policy due to memory pressure, this entry-by-entry eviction may be less efficient than the bulk eviction path during truncation. Do we see more CPU time spent in raftentry.Cache.evictLocked?
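
To make the eviction-cost theory concrete, here is a rough, hypothetical cache sketch (not the raftentry.Cache implementation; names and sizes are made up) contrasting the bulk eviction path at truncation time with per-entry eviction under memory pressure when truncation lags.

package main

import "fmt"

type entryCache struct {
	entries map[uint64][]byte // log index -> entry payload
	bytes   int
	limit   int
}

// clearTo models the bulk path: when a truncation is applied, everything at
// or below truncIndex is dropped in one pass.
func (c *entryCache) clearTo(truncIndex uint64) {
	for idx, e := range c.entries {
		if idx <= truncIndex {
			c.bytes -= len(e)
			delete(c.entries, idx)
		}
	}
}

// add models the memory-pressure path: if truncation lags and the cache is
// full, entries are evicted one at a time to make room for each new entry.
func (c *entryCache) add(idx uint64, e []byte) {
	for c.bytes+len(e) > c.limit && len(c.entries) > 0 {
		for victim, v := range c.entries { // pick an arbitrary victim per iteration
			c.bytes -= len(v)
			delete(c.entries, victim)
			break
		}
	}
	c.entries[idx] = e
	c.bytes += len(e)
}

func main() {
	c := &entryCache{entries: map[uint64][]byte{}, limit: 1 << 20}
	for i := uint64(1); i <= 300; i++ {
		c.add(i, make([]byte, 64<<10)) // 64 KiB entries, as in the size=64kb test
	}
	fmt.Println("entries resident after per-entry eviction:", len(c.entries))
	c.clearTo(290) // bulk eviction once a truncation finally lands
	fmt.Println("entries resident after bulk clear:", len(c.entries))
}

The point of the sketch is only that one bulk clear after an applied truncation drops many entries at once, whereas delayed truncation pushes that work onto the per-add eviction path.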

@sumeerbhola (Collaborator)

Do we see more CPU time spent in raftentry.Cache.evictLocked?

0.036% when running with loosely-coupled for kv0/enc=false/nodes=3/size=4kb. The full profile is in https://drive.google.com/file/d/1JNJgnJseHeMEsqTBgyiHY_uU-uhB2lQ9/view?usp=sharing

@erikgrinaker (Contributor)

@sumeerbhola Will you have a chance to dig into this further? The release isn't that far off, so we should try to resolve this.

@sumeerbhola (Collaborator)

I am out of ideas and would appreciate some help. Both tests have low CPU, and one of them does not overload the store with writes either. Loosely-coupled truncation is not adding additional writes, and it only adds a small amount of extra CPU on its own thread. So the small workload latency increase that must be causing the small throughput drop is hard to explain.

@erikgrinaker (Contributor)

No ideas off-the-cuff, but I'll have a closer look when I have time.

@erikgrinaker (Contributor)

This is a strange test: it does not overload the CPU or cause Pebble write amplification to increase. If the bottleneck is the number of workers running the workload, it would be worth revising this test, since having the throughput be this sensitive to small changes in latency is not ideal.

The bottleneck here is disk throughput, which is probably significant. The first runs here are with 192 workers (default for this test), the latest with 384 workers:

[screenshot: disk throughput, 192 vs. 384 workers]

@erikgrinaker (Contributor)

One theory is that we efficiently evict entries from the entry cache once they are truncated (see raftentry.Cache.Clear). If we delay truncation, large entries below the in-progress truncation index will remain in the cache longer. If they then get evicted by the LRU policy due to memory pressure, this entry-by-entry eviction may be less efficient than the bulk eviction path during truncation.

Doesn't seem related to the entry cache. I tried a couple of builds that either removed the raftEntryCache.Clear() call, or called it every time a log truncation command was applied. No difference in either case.

@erikgrinaker (Contributor) commented Apr 9, 2022

This seems to be a storage issue. Runs with kv.raft_log.loosely_coupled_truncation.enabled = true consistently show higher values of these metrics:

  • storage.l0-num-files: 429 vs. 243 at 16:45:30
  • storage.l0-sublevels: 20 vs. 13 at 16:45:00
  • rocksdb.compactions: 42 vs 40 at 16:45:00 (rate)

I wonder if this could have something to do with the fact that the old truncation uses the Raft batch for writes, while the loosely coupled truncation uses a new write batch (which adds about 1.4k additional batch commits per second). However, as far as I could tell, there didn't seem to be any other commands in the batches used for log truncation. Will dig into this a bit further.
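
As a rough, standalone illustration of that structural difference (using Pebble directly; the key layout, directory name, and NoSync choice are assumptions, and the actual truncator code paths are different): the strongly coupled path folds the truncation's range deletion into a batch that is being committed anyway, while the loosely coupled path commits a batch of its own, i.e. one extra batch commit per truncation.

package main

import (
	"encoding/binary"
	"log"

	"github.com/cockroachdb/pebble"
)

// raftLogKey builds an illustrative key for a raft log entry; this is not
// the CockroachDB key encoding.
func raftLogKey(rangeID, index uint64) []byte {
	k := make([]byte, 17)
	k[0] = 'l'
	binary.BigEndian.PutUint64(k[1:9], rangeID)
	binary.BigEndian.PutUint64(k[9:17], index)
	return k
}

func main() {
	db, err := pebble.Open("pebble-demo", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Strongly coupled shape: the truncation's DeleteRange rides in the same
	// batch as the state-machine write for the applied command.
	applyBatch := db.NewBatch()
	if err := applyBatch.Set([]byte("state-machine-key"), []byte("value"), nil); err != nil {
		log.Fatal(err)
	}
	if err := applyBatch.DeleteRange(raftLogKey(527, 0), raftLogKey(527, 1200), nil); err != nil {
		log.Fatal(err)
	}
	if err := applyBatch.Commit(pebble.NoSync); err != nil {
		log.Fatal(err)
	}

	// Loosely coupled shape: the truncator later commits its own batch that
	// contains only the DeleteRange, i.e. one additional batch commit.
	truncBatch := db.NewBatch()
	if err := truncBatch.DeleteRange(raftLogKey(527, 1200), raftLogKey(527, 2400), nil); err != nil {
		log.Fatal(err)
	}
	if err := truncBatch.Commit(pebble.NoSync); err != nil {
		log.Fatal(err)
	}
}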

Charts below. In each pair, the top chart is with the setting set to true, the bottom with false.

[paired metric screenshots: top = true, bottom = false]

@erikgrinaker (Contributor)

Another observation: I'm seeing the same slowdown even when I'm only using a single worker for the writes. In that case, we're not saturating the disks, nor do we see anything in L0.

However, I do see a higher number of Pebble compactions: 1187 vs. 1086 over a 3-minute benchmark. That's consistent with the other benchmarks and could possibly explain this.

@erikgrinaker (Contributor)

FWIW, I tried using a common batch for all pending truncations, but it didn't make any difference.

@erikgrinaker (Contributor)

Hmm. I disabled the Raft log truncation entirely (commented it out), and performance tanked, with L0 files and sublevels blowing up.

Here's what I'm thinking: the added log truncation delay means that Raft log entries stick around in L0 for longer, increasing the cost of L0 compactions. Does that check out @sumeerbhola?

@sumeerbhola (Collaborator)

Can we disable the loosely coupled truncation in 22.1, but leave the plumbing in place, and then enable it during the 22.2 upgrade when we also migrate to the separate log engine?

Yes, we can disable it without affecting future correctness.

// - v22.1 and v22.2: If the setting has been changed to false the v22.1 nodes
//   will do strongly coupled truncation and the v22.2 will do loosely
//   coupled. This co-existence is correct.
var looselyCoupledTruncationEnabled = func() *settings.BoolSetting {

But I think we should do one more step to confirm that writes are being distributed across all 1000 replicas, since we don't have an explanation for the truncation-before-flush behavior.
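
For reference, flipping the switch is just a cluster-setting change; a minimal sketch of doing so programmatically, assuming a local insecure node and the lib/pq driver (connection details are not from this issue):

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Assumed local, insecure single-node cluster, for illustration only.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Revert to strongly coupled truncation (the default after #80193).
	if _, err := db.Exec(
		"SET CLUSTER SETTING kv.raft_log.loosely_coupled_truncation.enabled = false",
	); err != nil {
		log.Fatal(err)
	}
}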

@erikgrinaker (Contributor) commented Apr 13, 2022

But I think we should do one more step to confirm that writes are being distributed across all 1000 replicas, since we don't have an explanation for the truncation-before-flush behavior.

The writes do end up evenly distributed, at least (with the old log truncation):

  • Stddev of 34 keys with 1200 average keys per range
  • Leaseholder QPS averaged 9.25 per range at 20 seconds in, with a stddev of 0.84

I'll be away on vacation for the next week and won't be able to dig into this any further. But I think we should disable the loosely coupled truncation on 22.1 for now, and we can re-enable it if we find an explanation.

@erikgrinaker (Contributor)

@sumeerbhola Looks like we're cutting rc1 on April 19th, can you make sure we flip the switch before then?

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Apr 19, 2022

kvserver: change default for kv.raft_log.loosely_coupled_truncation.enabled to false

This is due to the regression noticed in
cockroachdb#78412

Release note: None
craig bot pushed a commit that referenced this issue Apr 20, 2022

80193: kvserver: change default for kv.raft_log.loosely_coupled_truncation.enabled to false r=sumeerbhola a=sumeerbhola

This is due to the regression noticed in
#78412

Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
blathers-crl bot pushed commits that referenced this issue Apr 20, 2022 (same change: kvserver: change default for kv.raft_log.loosely_coupled_truncation.enabled to false)
erikgrinaker removed the GA-blocker, blocks-22.1.0-rc.1, branch-master, and branch-release-22.1 labels Apr 25, 2022
@sumeerbhola (Collaborator)

Though I am at a loss to explain why loosely-coupled=false is able to delete log entries as early as the flush: assuming the defaultRaftLogTruncationThreshold of 16MB, the kv workload doing 1000 splits, and the hashGenerator distributing the writes uniformly across ranges, each range would get 64MB/1000 of the memtable before the memtable is flushed, which is too tiny to trigger a raft log truncation before the flush.

My understanding was flawed. The raft log can be truncated after RaftLogQueueStaleSize = 64 << 10. I ran kv0 with additional instrumentation: commit-offset is the cumulative number of uncompressed bytes that have been written to the memtables and WAL, and delta is the size delta of the log being truncated. The delta is often near -64KB, which is consistent with RaftLogQueueStaleSize. The commit-offset change between consecutive truncations is often ~200MB. We have ~1000 ranges, so that means we wrote about 200KB to Pebble for each range, of which 64KB was the raft log and the rest went to the state machine -- this is plausible. And with a 64MB memtable and an even distribution of writes, 64MB/200MB = 0.32, i.e. ~32% of the raft log entries will be in the same memtable as the truncation that deletes them, so they won't need to be flushed.

I220506 18:21:15.076767 290 kv/kvserver/replica_application_state_machine.go:1292 ⋮ [n1,s1,r527/1:‹/Table/106/1/-61{2741…-0898…}›,raft] 1343  replica-truncation(527): delta: -61 KiB, commit-offset: 152 MiB
I220506 18:21:20.233908 310 kv/kvserver/replica_application_state_machine.go:1292 ⋮ [n1,s1,r527/1:‹/Table/106/1/-61{2741…-0898…}›,raft] 2559  replica-truncation(527): delta: -63 KiB, commit-offset: 322 MiB
I220506 18:21:25.921901 294 kv/kvserver/replica_application_state_machine.go:1292 ⋮ [n1,s1,r527/1:‹/Table/106/1/-61{2741…-0898…}›,raft] 3731  replica-truncation(527): delta: -75 KiB, commit-offset: 485 MiB
I220506 18:21:31.192371 282 kv/kvserver/replica_application_state_machine.go:1292 ⋮ [n1,s1,r527/1:‹/Table/106/1/-61{2741…-0898…}›,raft] 4870  replica-truncation(527): delta: -63 KiB, commit-offset: 636 MiB
I220506 18:21:37.699372 272 kv/kvserver/replica_application_state_machine.go:1292 ⋮ [n1,s1,r527/1:‹/Table/106/1/-61{2741…-0898…}›,raft] 6423  replica-truncation(527): delta: -59 KiB, commit-offset: 849 MiB
I220506 18:21:44.548491 271 kv/kvserver/replica_application_state_machine.go:1292 ⋮ [n1,s1,r527/1:‹/Table/106/1/-61{2741…-0898…}›,raft] 7758  replica-truncation(527): delta: -80 KiB, commit-offset: 1022 MiB
I220506 18:21:46.906498 276 kv/kvserver/replica_application_state_machine.go:1292 ⋮ [n1,s1,r527/1:‹/Table/106/1/-61{2741…-0898…}›,raft] 8470  replica-truncation(527): delta: -67 KiB, commit-offset: 1.1 GiB
I220506 18:21:52.793810 296 kv/kvserver/replica_application_state_machine.go:1292 ⋮ [n1,s1,r527/1:‹/Table/106/1/-61{2741…-0898…}›,raft] 9921  replica-truncation(527): delta: -130 KiB, commit-offset: 1.3 GiB
I220506 18:21:56.161456 205 kv/kvserver/replica_application_state_machine.go:1292 ⋮ [n1,s1,r527/1:‹/Table/106/1/-61{2741…-0898…}›,raft] 10830  replica-truncation(527): delta: -63 KiB, commit-offset: 1.5 GiB
I220506 18:21:58.976897 262 kv/kvserver/replica_application_state_machine.go:1292 ⋮ [n1,s1,r527/1:‹/Table/106/1/-61{2741…-0898…}›,raft] 11473  replica-truncation(527): delta: -75 KiB, commit-offset: 1.6 GiB
I220506 18:22:02.664556 280 kv/kvserver/replica_application_state_machine.go:1292 ⋮ [n1,s1,r527/1:‹/Table/106/1/-61{2741…-0898…}›,raft] 12331  replica-truncation(527): delta: -75 KiB, commit-offset: 1.7 GiB
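
A back-of-the-envelope restatement of the arithmetic above (the constants are taken from the comment and logs, not measured here):

package main

import "fmt"

func main() {
	const (
		memtableSize      = 64 << 20  // 64 MiB Pebble memtable
		raftLogStaleSize  = 64 << 10  // RaftLogQueueStaleSize: a range's log becomes truncatable at ~64 KiB
		numRanges         = 1000      // kv workload pre-splits
		commitPerTruncCyc = 200 << 20 // ~200 MiB of total commits between consecutive truncations of one range (from the log above)
	)

	// Bytes written per range between its consecutive truncations: roughly the
	// ~64 KiB of raft log (matching the stale-size threshold) plus the
	// corresponding state-machine writes.
	perRange := commitPerTruncCyc / numRanges
	fmt.Printf("per-range bytes per truncation cycle: ~%d KiB (threshold %d KiB)\n", perRange>>10, raftLogStaleSize>>10)

	// Fraction of raft log entries that share a memtable with the truncation
	// that deletes them, and so never need to be flushed to L0.
	frac := float64(memtableSize) / float64(commitPerTruncCyc)
	fmt.Printf("raft log entries deleted before flush: ~%.0f%%\n", frac*100)
}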

jlinder removed the sync-me-3 label May 27, 2022
exalate-issue-sync bot changed the title from "perf: kv0/kv95 regression around Feb 25" to "perf: kv0/kv95 regression around Feb 25 [loosely coupled log truncation]" Mar 29, 2023
exalate-issue-sync bot added the T-kv KV Team label and removed the T-kv-replication label Jun 28, 2024
github-project-automation bot moved this to Incoming in KV Aug 28, 2024