db: read-triggered compactions causing excessive write amplification #1143
If I'm reading the code correctly, we only append to the slice the first time an sstable's …
you are right, I misread!
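For context, a minimal sketch of the once-only enqueue behavior being discussed, with illustrative names standing in for Pebble's actual read-sampling fields (this is not the real implementation):

```go
package readsampling

import "sync/atomic"

// readCompaction and tableState are illustrative stand-ins for Pebble's
// internal types.
type readCompaction struct {
	level      int
	start, end []byte
}

type tableState struct {
	allowedSeeks int64 // per-sstable read budget (illustrative name)
}

// maybeQueueReadCompaction appends to the queue only on the read that
// drives the counter to zero; later reads of the same table do nothing.
func maybeQueueReadCompaction(t *tableState, queue *[]readCompaction, rc readCompaction) {
	if atomic.AddInt64(&t.allowedSeeks, -1) != 0 {
		return
	}
	*queue = append(*queue, rc)
}
```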
Looks as though this came up during our pre-release performance regression checks, starting around here. It's hard to piece together how we convinced ourselves at the time that there wasn't more to look at. The YCSB-A graphs definitely took a hit right when read compactions were introduced and never quite recovered:
@tbg That was a separate issue that we later addressed: #1098 . It's hard to evaluate the benefit of that change on that graph as it's just tagged as "21.1 regressions fixed", but the PR shows how it brought back the same level of performance. YCSB-A likely didn't have quite the same kind of workload that'd trigger this issue. The missing piece was likely a backup.
Never mind, I see you linked to the thread that talked about #1098. I guess it is worthwhile to confirm whether the remaining small difference in YCSB-A is emblematic of the same issue as on the telemetry cluster.
This change, which is *not* meant for merging into a release or master branch, is solely to add a logging event for when a read compaction is appended to the db's slice. This is to help inform #1143.
This change, which is *not* meant for merging into a release or master branch, is solely to add logging events when a read compaction is appended to the db's slice, and when a read compaction is picked off of the slice. This is to help inform cockroachdb#1143.
This change is not destined for any release. It's solely to add logging to inform cockroachdb/pebble#1143 Pulls in: cockroachdb/pebble#1151 Changes pulled in: ``` 85e0b39a8d1b695815781eb1f1fe006179177914 *: Log info about read compactions ``` Release note: None.
Looking at the debug logs after #1147 was deployed to the telemetry cluster, all the imbalanced compactions are read compactions. That effectively confirms this as a bug, so I'll go ahead and disable read compactions on 21.1.
This change sets the ReadSamplingMultiplier to a negative value, effectively disabling read sampling and therefore read-triggered compactions. This is a mitigation of cockroachdb/pebble#1143 to unblock the 21.1 release while we try to understand the issue. Release note (general change): Disable read-triggered compactions to avoid instances where the storage engine would compact excessively.
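For reference, a minimal sketch of that mitigation on the Pebble options side, assuming the knob sits under Options.Experimental as it does in Pebble:

```go
package storage

import "github.com/cockroachdb/pebble"

// readCompactionsDisabledOptions returns Pebble options with read sampling
// turned off, mirroring the mitigation described above.
func readCompactionsDisabledOptions() *pebble.Options {
	opts := &pebble.Options{}
	// A negative multiplier disables read sampling, and with it
	// read-triggered compactions.
	opts.Experimental.ReadSamplingMultiplier = -1
	return opts
}
```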
some quick observations from looking at the logs:
Looking at the badly balanced compactions, I can think of a couple solutions that would be worthwhile to implement:
And here's something worth exploring, but maybe not implementing right away:
Open to more ideas about this, but these were some I could think of.
(Just adding stuff from my notes, which overlap)
Read-triggered compactions are already disabled on 21.1. As the fixes to address known shortcomings with read-triggered compactions are a bit involved (see cockroachdb/pebble#1143 ), disable the feature on master until that issue is fixed. That prevents this known issue from getting in the way of performance experiments. Release note: None.
Have we tested read-triggered compactions on workloads with heterogeneous access patterns for various regions of the keyspace? I'm thinking about a read-heavy, write-little SQL table, flanked within the keyspace by two read-little, write-heavy SQL tables. Flushes to the write-heavy tables may increase read amplification for the reads, while read-triggered compactions would increase the write amplification for the write-heavy tables.
Haven't looked into this yet. @itsbilal
Both benchmarks were run for over 10 minutes
Note that this step leads to a waste of ~100 compactions, which were otherwise providing value, because it is those compactions which really improve the performance of the ycsb benchmark.
I think the solution is to make sure that we keep trying to compact if we add to the read compaction queue.
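A rough sketch of that idea, with a hypothetical scheduleCompaction hook standing in for whatever call actually kicks Pebble's compaction picker:

```go
package readcompact

import "sync"

type readCompaction struct {
	level      int
	start, end []byte
}

// db is a stand-in for Pebble's *DB; scheduleCompaction is a hypothetical
// hook representing the call that kicks the compaction picker.
type db struct {
	mu struct {
		sync.Mutex
		readCompactions []readCompaction
	}
	scheduleCompaction func()
}

// addReadCompaction enqueues the span and immediately nudges the compaction
// scheduler, so the queue cannot sit idle waiting for some unrelated event
// to trigger the next compaction.
func (d *db) addReadCompaction(rc readCompaction) {
	d.mu.Lock()
	d.mu.readCompactions = append(d.mu.readCompactions, rc)
	d.mu.Unlock()
	d.scheduleCompaction()
}
```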
If the queue empties out, what's likely happening is that the … One thing you could do to address this is to have the same reader that sees … Another option is to have all readers that see a negative …

Did you make the change to better account for overlapping spans instead of just looking for matching start+end strings? Are these findings after that suggested change?
No, it was only looking for perfect overlaps.
I was referring to the change I had advised on the limited-size queue PR, the one where you deduplicate by key spans using the comparator and not the string method. That would result in better deduplication as well as better comparison performance.
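A sketch of the span-based de-duplication being suggested, using a plain byte-wise comparison in place of Pebble's configured comparator; the type and function names here are illustrative:

```go
package readcompact

import "bytes"

type readCompaction struct {
	level      int
	start, end []byte
}

// overlaps reports whether two queued read compactions on the same level
// cover overlapping user-key spans, comparing the keys themselves rather
// than their String() forms.
func overlaps(a, b readCompaction) bool {
	return a.level == b.level &&
		bytes.Compare(a.start, b.end) <= 0 &&
		bytes.Compare(b.start, a.end) <= 0
}

// add appends rc to the queue unless it overlaps an existing entry, in
// which case the existing entry is widened to cover both spans.
func add(queue []readCompaction, rc readCompaction) []readCompaction {
	for i := range queue {
		if !overlaps(queue[i], rc) {
			continue
		}
		if bytes.Compare(rc.start, queue[i].start) < 0 {
			queue[i].start = rc.start
		}
		if bytes.Compare(rc.end, queue[i].end) > 0 {
			queue[i].end = rc.end
		}
		return queue
	}
	return append(queue, rc)
}
```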
Oops! Yeah, I edited that comment right when you posted this. The issue is that we stop doing compactions even though the queue has compactions. This happens because nothing is triggering the call of … I'll also change the …
We were seeing some issues with read compactions as described here: #1143. To prevent the issues, we came up with the following fixes:

1. We limit the size of the read compaction queue, and also add some de-duplication logic to the queue.
2. We prevent read compactions which are too wide.
3. We prevent read compactions where the file being compacted overlaps with too much of the output level, relative to its size.
4. We prevent read compactions if the file associated with a range has changed.

Results from a ycsb/size=64 run using the pebble ycsb roachtest:

```
name                old ops/sec  new ops/sec  delta
ycsb/E/values=64      229k ± 4%    221k ± 3%     ~     (p=0.151 n=5+5)
ycsb/A/values=64      825k ± 6%    825k ± 6%     ~     (p=1.000 n=5+5)
ycsb/D/values=64      877k ± 6%    858k ± 6%     ~     (p=0.310 n=5+5)
ycsb/F/values=64      237k ± 0%    237k ± 0%   -0.31%  (p=0.008 n=5+5)
ycsb/C/values=64     2.63M ±17%   2.48M ±20%     ~     (p=0.310 n=5+5)
ycsb/B/values=64     1.09M ± 5%   1.10M ± 6%     ~     (p=0.690 n=5+5)

name                old read     new read     delta
ycsb/E/values=64     21.5G ± 6%   20.0G ± 6%   -6.82%  (p=0.032 n=5+5)
ycsb/A/values=64     68.1G ± 3%   67.7G ± 3%     ~     (p=0.421 n=5+5)
ycsb/D/values=64     36.8G ± 7%   34.6G ± 7%     ~     (p=0.095 n=5+5)
ycsb/F/values=64      160G ± 1%    160G ± 1%     ~     (p=1.000 n=5+5)
ycsb/C/values=64     2.75G ± 4%   2.92G ± 4%     ~     (p=0.056 n=5+5)
ycsb/B/values=64     47.4G ±15%   22.7G ± 6%  -52.10%  (p=0.008 n=5+5)

name                old write    new write    delta
ycsb/E/values=64     22.8G ± 6%   21.3G ± 6%   -6.67%  (p=0.032 n=5+5)
ycsb/A/values=64     98.9G ± 4%   98.6G ± 4%     ~     (p=0.548 n=5+5)
ycsb/D/values=64     42.2G ± 7%   39.9G ± 6%     ~     (p=0.095 n=5+5)
ycsb/F/values=64      190G ± 0%    190G ± 0%     ~     (p=1.000 n=5+5)
ycsb/C/values=64     2.67G ± 4%   2.83G ± 4%   +6.31%  (p=0.032 n=5+5)
ycsb/B/values=64     51.5G ±14%   27.0G ± 6%  -47.71%  (p=0.008 n=5+5)

name                old r-amp    new r-amp    delta
ycsb/E/values=64      2.66 ± 2%    3.09 ± 2%  +16.34%  (p=0.008 n=5+5)
ycsb/A/values=64      6.57 ± 2%    6.61 ± 2%     ~     (p=0.452 n=5+5)
ycsb/D/values=64      4.13 ± 1%    4.21 ± 0%   +1.89%  (p=0.008 n=5+5)
ycsb/F/values=64      0.00         0.00          ~     (all equal)
ycsb/C/values=64      1.28 ±27%    1.55 ±36%     ~     (p=0.190 n=5+5)
ycsb/B/values=64      3.71 ± 2%    3.75 ± 9%     ~     (p=0.579 n=5+5)

name                old w-amp    new w-amp    delta
ycsb/E/values=64      26.9 ± 3%    26.0 ± 2%     ~     (p=0.095 n=5+5)
ycsb/A/values=64      3.24 ± 2%    3.23 ± 2%     ~     (p=0.246 n=5+5)
ycsb/D/values=64      13.0 ± 1%    12.6 ± 1%   -3.19%  (p=0.008 n=5+5)
ycsb/F/values=64      10.8 ± 1%    10.8 ± 0%     ~     (p=0.198 n=5+5)
ycsb/C/values=64      0.00         0.00          ~     (all equal)
ycsb/B/values=64      12.8 ±11%     6.6 ± 1%  -48.17%  (p=0.008 n=5+5)
```
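As an illustration of fix 3, a hedged sketch of the kind of guard it describes; the ratio constant and function name are assumptions, not the values Pebble actually uses:

```go
package readcompact

// maxOverlapRatio is a hypothetical cutoff, not Pebble's actual constant:
// skip a read compaction when the candidate file would pull in more than
// this many times its own size from the output level.
const maxOverlapRatio = 10

// allowReadCompaction sketches the guard described in fix 3.
func allowReadCompaction(fileSizeBytes, outputOverlapBytes uint64) bool {
	if fileSizeBytes == 0 {
		return false
	}
	return outputOverlapBytes <= maxOverlapRatio*fileSizeBytes
}
```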
Closing this out as I've tested the fixes with our benchmarks. I'll keep an eye on the telemetry cluster.
We observed the following in the CockroachDB telemetry cluster when switching from 20.2 to 21.1: the L4 shrank in size from 2GB to 120MB, and from then on we see hugely imbalanced compactions with tiny bytes in L4 and huge bytes in L5 (see example below). These keep happening despite the L4 score being low and compactions into L4 only adding ~30MB every ~3min to L4. The likely cause is read-triggered compactions. The write amp keeps growing to significantly > 100.
(internal links. more details: https://cockroachlabs.atlassian.net/browse/SREOPS-2002, discussion: https://cockroachlabs.slack.com/archives/CAC6K3SLU/p1621008284293900)
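For reference, the write amplification figure quoted above is the usual LSM metric, total bytes written by flushes and compactions per byte of user data written; a toy helper:

```go
package readcompact

// writeAmp returns total storage bytes written (flushes + compactions) per
// byte of user data written. Values well above 100, as observed here, mean
// each user byte is being rewritten more than 100 times.
func writeAmp(flushBytes, compactionBytes, userBytes float64) float64 {
	return (flushBytes + compactionBytes) / userBytes
}
```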
One mitigation is to change the ReadSamplingMultiplier for Pebble running in a CockroachDB node. To do this, pebble.Options.Parse needs to support this option.

(see pebble/compaction_picker.go, line 942 at 3ebefb1)
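A sketch of how such an override might look once Options.Parse supports it; the key name read_sampling_multiplier is an assumption, not something confirmed in this issue:

```go
package storage

import "github.com/cockroachdb/pebble"

// applyReadSamplingOverride sketches an OPTIONS-style override applied on
// top of existing options. The key name below is an assumption about how
// the option would be spelled once supported by Options.Parse.
func applyReadSamplingOverride(opts *pebble.Options) error {
	return opts.Parse("[Options]\n  read_sampling_multiplier=-1\n", nil)
}
```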
(graphs omitted: L4 size shrinks; imbalanced compaction; growth in write amp)