storage: slower incremental backups on Pebble #49710
Hi @petermattis, please add a C-ategory label to your issue. Check out the label system docs. While you're here, please consider adding an A- label to help keep our repository tidy. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Full backup on May 28 took 51m29s. Full backup on May 29 took 3h25m27s. So perhaps this isn’t an incremental backup problem, but simply a backup problem.
As part of the investigation into cockroachdb#49710, this change adds a benchmark for ExportToSst that tests both RocksDB and Pebble. Release note: None.
As part of the investigation into cockroachdb#49710, this change adds a benchmark for ExportToSst that tests both RocksDB and Pebble. Here are some example runs without contention (old = rocksdb, new = pebble):

name                                                old time/op    new time/op    delta
ExportToSst/rocksdb/numKeys=64/numRevisions=1-12      43.9µs ± 3%    34.5µs ± 4%   -21.33%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=64/numRevisions=10-12      281µs ± 3%     169µs ± 6%   -39.89%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=64/numRevisions=100-12    1.82ms ±22%    1.17ms ± 1%   -35.73%  (p=0.000 n=10+9)
ExportToSst/rocksdb/numKeys=512/numRevisions=1-12      212µs ± 6%     111µs ± 3%   -47.77%  (p=0.000 n=10+9)
ExportToSst/rocksdb/numKeys=512/numRevisions=10-12    1.91ms ± 1%    1.19ms ± 8%   -37.65%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=512/numRevisions=100-12   13.7ms ± 3%    10.1ms ±12%   -26.21%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=1024/numRevisions=1-12     390µs ± 1%     215µs ±12%   -44.94%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=1024/numRevisions=10-12   4.01ms ± 6%    2.40ms ±16%   -40.13%  (p=0.000 n=10+9)
ExportToSst/rocksdb/numKeys=1024/numRevisions=100-12  27.9ms ± 2%    20.8ms ± 2%   -25.48%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=8192/numRevisions=1-12    2.97ms ± 2%    1.42ms ± 5%   -52.24%  (p=0.000 n=9+10)
ExportToSst/rocksdb/numKeys=8192/numRevisions=10-12   32.8ms ± 7%    19.1ms ± 3%   -41.59%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=8192/numRevisions=100-12   224ms ± 3%     169ms ±25%   -24.64%  (p=0.000 n=9+10)
ExportToSst/rocksdb/numKeys=65536/numRevisions=1-12   23.7ms ± 4%    13.4ms ±20%   -43.65%  (p=0.000 n=9+10)
ExportToSst/rocksdb/numKeys=65536/numRevisions=10-12   264ms ± 4%     201ms ±24%   -23.92%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=65536/numRevisions=100-12  1.88s ± 6%     1.23s ± 8%   -34.70%  (p=0.000 n=10+8)

And some with contention=true:

name                                                                  old time/op    new time/op    delta
ExportToSst/rocksdb/numKeys=65536/numRevisions=10/contention=true-12    362ms ± 7%     168ms ± 3%   -53.60%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=65536/numRevisions=100/contention=true-12   2.24s ± 6%     1.24s ±10%   -44.50%  (p=0.000 n=10+10)

Release note: None.
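As a side note for anyone reproducing numbers of this shape: they come from a matrix of Go sub-benchmarks run against each engine and compared with benchstat. The sketch below is a hypothetical stand-in, not CockroachDB's actual ExportToSst benchmark; `buildDataset` and `exportAllKeys` are placeholder names, and the real benchmark exercises a RocksDB or Pebble engine instead of byte slices.

```go
package export_test

import (
	"fmt"
	"testing"
)

// sink prevents the compiler from optimizing the benchmarked work away.
var sink int

// buildDataset materializes numKeys keys with numRevisions revisions each.
// It is a placeholder for setting up a real engine with MVCC data.
func buildDataset(numKeys, numRevisions int) [][]byte {
	data := make([][]byte, 0, numKeys*numRevisions)
	for k := 0; k < numKeys; k++ {
		for r := 0; r < numRevisions; r++ {
			data = append(data, []byte(fmt.Sprintf("key-%08d@rev-%04d", k, r)))
		}
	}
	return data
}

// exportAllKeys stands in for the engine's export path (ExportToSst in the
// real benchmark); here it just walks the dataset.
func exportAllKeys(data [][]byte) int {
	n := 0
	for _, kv := range data {
		n += len(kv)
	}
	return n
}

func BenchmarkExportToSst(b *testing.B) {
	for _, numKeys := range []int{64, 512, 1024, 8192, 65536} {
		for _, numRevisions := range []int{1, 10, 100} {
			name := fmt.Sprintf("numKeys=%d/numRevisions=%d", numKeys, numRevisions)
			b.Run(name, func(b *testing.B) {
				data := buildDataset(numKeys, numRevisions)
				b.ResetTimer()
				for i := 0; i < b.N; i++ {
					sink = exportAllKeys(data)
				}
			})
		}
	}
}
```

Running `go test -bench ExportToSst -count 10` against each engine build and feeding the two outputs to `benchstat old.txt new.txt` produces old/new/delta tables like the one above.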
Based on my own testing today, this can be generalized to all backups, not just incremental backups. I spun up two clusters on AWS with these parameters: [...]. Both were running 20.1.1, one with Pebble and one with RocksDB. I then imported tpcc with 2000 warehouses into both clusters, started running the tpcc workload, and in parallel kicked off a backup to S3. The almost-identical backup took 35 minutes to finish on RocksDB, while it has been running on the Pebble cluster for 1h21m so far (and is still only ~60% done). I noticed that both RocksDB and Pebble took full advantage of the provisioned 1300 IOPS at the start of the backup before declining to just under the limit, but more interestingly, RocksDB had much higher read bytes/throughput numbers than Pebble did throughout the backup run (RocksDB on top, Pebble on the bottom in the screenshots). I'd say this is a good enough reproduction. Will profile and see what's actually going on tomorrow.
Yeah, this looks like a solid reproduction. The timings even match: it looks like the Pebble backup will take ~4x as long as the RocksDB one, which matches what we saw on the test cluster. Definitely curious what is limiting Pebble here.
Another day of investigation. I looked at CPU profiles but nothing really stood out; I tried out some improvements to [...]. Running the pebble and rocks clusters side-by-side and [...]. More interestingly, the first read in a "new" region of disk seems to be pretty large (always larger than 24 blocks) under RocksDB, after which we see many "small" sequential reads (8, 16, 24 blocks). With Pebble, the first read is also small, requiring many more IO operations to begin reading data blocks in the file. The summary stats under both engines also agree with what the admin UI showed: fewer IO events resulting in more data being returned. I haven't dived deep into the block-based table reader code in RocksDB yet, but at first glance there seems to be quite a bit of prefetching of index/filter blocks happening there; maybe that helps make read I/O more efficient under RocksDB?
Certainly possible. I assume you're referring to the code in
A reminder about cockroachdb/pebble#198. Pebble should be specifying the same
Here's the RocksDB PR where the dynamic read-ahead code landed: facebook/rocksdb#3282. |
TIL about the
49565: sql: serialize UDTs in expressions in a stable way r=otan,jordanlewis a=rohany

Fixes #49379. This PR ensures that serialized expressions stored durably in table descriptors are serialized in a format that is stable across changes to user defined types present in those expressions. An effect of this change is that these expressions must be reparsed and formatted in a human readable way before display in statements like `SHOW CREATE TABLE`.

Release note: None

49721: storage: Add rocksdb-vs-pebble benchmark for ExportToSst r=itsbilal a=itsbilal

As part of the investigation into #49710, this change adds a benchmark for ExportToSst that tests both RocksDB and Pebble. Here are some example runs without contention (old = rocksdb, new = pebble):

name                                                old time/op    new time/op    delta
ExportToSst/rocksdb/numKeys=64/numRevisions=1-12      43.9µs ± 3%    34.5µs ± 4%   -21.33%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=64/numRevisions=10-12      281µs ± 3%     169µs ± 6%   -39.89%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=64/numRevisions=100-12    1.82ms ±22%    1.17ms ± 1%   -35.73%  (p=0.000 n=10+9)
ExportToSst/rocksdb/numKeys=512/numRevisions=1-12      212µs ± 6%     111µs ± 3%   -47.77%  (p=0.000 n=10+9)
ExportToSst/rocksdb/numKeys=512/numRevisions=10-12    1.91ms ± 1%    1.19ms ± 8%   -37.65%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=512/numRevisions=100-12   13.7ms ± 3%    10.1ms ±12%   -26.21%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=1024/numRevisions=1-12     390µs ± 1%     215µs ±12%   -44.94%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=1024/numRevisions=10-12   4.01ms ± 6%    2.40ms ±16%   -40.13%  (p=0.000 n=10+9)
ExportToSst/rocksdb/numKeys=1024/numRevisions=100-12  27.9ms ± 2%    20.8ms ± 2%   -25.48%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=8192/numRevisions=1-12    2.97ms ± 2%    1.42ms ± 5%   -52.24%  (p=0.000 n=9+10)
ExportToSst/rocksdb/numKeys=8192/numRevisions=10-12   32.8ms ± 7%    19.1ms ± 3%   -41.59%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=8192/numRevisions=100-12   224ms ± 3%     169ms ±25%   -24.64%  (p=0.000 n=9+10)
ExportToSst/rocksdb/numKeys=65536/numRevisions=1-12   23.7ms ± 4%    13.4ms ±20%   -43.65%  (p=0.000 n=9+10)
ExportToSst/rocksdb/numKeys=65536/numRevisions=10-12   264ms ± 4%     201ms ±24%   -23.92%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=65536/numRevisions=100-12  1.88s ± 6%     1.23s ± 8%   -34.70%  (p=0.000 n=10+8)

And some with contention=true:

name                                                                  old time/op    new time/op    delta
ExportToSst/rocksdb/numKeys=65536/numRevisions=10/contention=true-12    362ms ± 7%     168ms ± 3%   -53.60%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=65536/numRevisions=100/contention=true-12   2.24s ± 6%     1.24s ±10%   -44.50%  (p=0.000 n=10+10)

Release note: None.

49815: roachpb: refuse nil desc in NewRangeKeyMismatchError r=andreimatei a=andreimatei

Since recently RangeKeyMismatchError does not support nil descriptors, but it still had code that pretended to deal with nils (even though a nil would have exploded a bit later). Only one test caller was passing a nil, and it turns out that was dead code.

Release note: None

Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Disabling the dynamic readahead code in RocksDB completely closed the backup performance/time gap in my tpcc-2k test setup, in the sense that RocksDB then became just as slow as Pebble. I've added a simple change to implement the same functionality in Pebble: cockroachdb/pebble#726. From early test runs with this change, I can see Pebble backups slightly outperform RocksDB (with its implicit automatic readahead added back in). Doing more testing to confirm my early results.
For sequential-like IO workloads where we read data blocks one after the other in quick succession, signalling the OS to asynchronously bring them into cache in advance can deliver significant savings in IOPS dispatched. In IOPS-bound workloads such as backup on an EBS disk, this delivers a 3x speedup. Presumably aggregate queries and compactions will be faster as well, though this hasn't been benchmarked in practice yet. This change maintains a counter for the number of data block reads performed in a singleLevelIterator, and when that count exceeds 2, a readahead system call is made on Linux. RocksDB has almost exactly the same behaviour, including the same min/max readahead sizes and read count thresholds. Will address cockroachdb/cockroach#49710 when it lands in cockroach.
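The mechanism described in that commit message is small enough to sketch. The following is only a rough illustration of the idea, not Pebble's actual implementation: the threshold and min/max readahead sizes are assumed values, the `readaheadState` type and the file path are hypothetical, and the real iterator also checks that reads really are sequential before ramping up. It uses the readahead(2) system call via golang.org/x/sys/unix, so it is Linux-only.

```go
//go:build linux

package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// Illustrative constants; the actual thresholds and min/max readahead sizes
// used by Pebble and RocksDB may differ.
const (
	sequentialReadThreshold = 2         // data-block reads before readahead kicks in
	minReadaheadSize        = 16 << 10  // 16 KB
	maxReadaheadSize        = 256 << 10 // 256 KB
)

// readaheadState tracks data-block reads for one sstable iterator and asks the
// OS to prefetch ahead of the current offset once reads look sequential.
type readaheadState struct {
	numReads      int
	readaheadSize int
}

// maybeReadahead is called before each data-block read at offset off.
func (rs *readaheadState) maybeReadahead(f *os.File, off int64) {
	rs.numReads++
	if rs.numReads <= sequentialReadThreshold {
		return
	}
	if rs.readaheadSize == 0 {
		rs.readaheadSize = minReadaheadSize
	}
	// Hint the kernel to asynchronously populate the page cache; the call is
	// advisory only, so errors are ignored here.
	_ = unix.Readahead(int(f.Fd()), off, rs.readaheadSize)
	// Ramp up the window, doubling until the maximum is reached.
	if rs.readaheadSize < maxReadaheadSize {
		rs.readaheadSize *= 2
	}
}

func main() {
	f, err := os.Open("/tmp/example.sst") // hypothetical file path
	if err != nil {
		fmt.Println(err)
		return
	}
	defer f.Close()

	rs := &readaheadState{}
	buf := make([]byte, 32<<10)
	for off := int64(0); ; off += int64(len(buf)) {
		rs.maybeReadahead(f, off)
		if _, err := f.ReadAt(buf, off); err != nil {
			break
		}
	}
}
```

Because the call is purely advisory, the kernel fills the page cache in the background and subsequent sequential reads hit cache instead of each issuing a small disk read, which is where the IOPS savings come from.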
Updates Pebble to the latest version in its crl-release-20.1 branch. Pulls in a readahead change in sstable/reader.go to eventually address cockroachdb#49710. Release note (performance improvement): Optimize reading of files when doing backups and storage-level compactions. Should deliver a performance improvement for some read-heavy operations on an IOPS-constrained device.
@itsbilal Can this be closed now?
@petermattis I was thinking of keeping this open until the fix is in 20.1 and we can confirm it works on the telemetry cluster. But I'm confident enough to close it now, since that may not happen for a long time, and the fix addresses the issue we saw in my test cluster.
Excited for the fix!
A series of test runs with tpcc-3k before/after the readahead change showed that the gap with RocksDB is almost closed: a RocksDB backup took 20-25 minutes, a Pebble (with readahead) backup took 25-30 minutes, and a Pebble backup without readahead took 1h-1h15m. The RocksDB-like, simpler-but-more-wasteful readahead implementation that I prototyped earlier performed slightly better on backups than the more conservative implementation that landed, but I didn't really benchmark other read-heavy workloads with that prototype (vs. the new one). I'm sure there are some cases where the RocksDB-like implementation would perform worse (reading ahead too early when it's not necessary) than the Pebble solution. I think it's fair to close this issue at this point.
In internal testing of Pebble on one node of an internal test cluster, we noticed increased read ops on the Pebble node (Pebble is `n3`, the node shown in red).

Further investigation is pointing the finger at the read op increase occurring during backups. Incremental backups when Pebble was enabled on `n3` were taking ~40min. After switching `n3` back to RocksDB, the next incremental backup took 7m12s. Nightly backup roachtests, which perform full backups, do not show any time difference between RocksDB and Pebble.

It is possible there is a bug in the time-bound iterator code, or in the `ExportToSst` code (which lies at the heart of backup), that is causing significantly increased reads and in turn slower backups. @dt will be doing some experimentation on another test cluster. @joshimhoff is pulling the sizes of recent Pebble and RocksDB generated incremental backups to see if they are similar in size.

This is a developing story and will be updated soon.
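For context on where backup spends its time, here is a heavily simplified sketch of the shape of an export loop over MVCC revisions. The `kv`, `iterator`, and `sstWriter` types below are illustrative stand-ins, not CockroachDB's actual MVCC or ExportToSst APIs.

```go
package export

// kv is a simplified MVCC key/value: a user key, a revision timestamp (wall
// nanos), and a value. The real code uses CockroachDB's MVCCKey and hlc.Timestamp.
type kv struct {
	key   []byte
	ts    int64
	value []byte
}

// iterator is a minimal stand-in for an engine iterator. A time-bound iterator
// additionally skips whole sstables whose keys all fall outside the requested
// time window, rather than filtering revision by revision.
type iterator interface {
	Valid() bool
	Next()
	Cur() kv
}

// sstWriter is a stand-in for an SST builder such as Pebble's sstable writer.
type sstWriter interface {
	Add(k kv) error
}

// exportToSST copies every revision in (startTS, endTS] from it into w. For a
// full backup startTS is zero; for an incremental backup it is the previous
// backup's end time, which is why time-bound iteration matters so much there.
func exportToSST(it iterator, w sstWriter, startTS, endTS int64) error {
	for ; it.Valid(); it.Next() {
		cur := it.Cur()
		if cur.ts <= startTS || cur.ts > endTS {
			continue // revision outside the requested time window
		}
		if err := w.Add(cur); err != nil {
			return err
		}
	}
	return nil
}
```

Because the loop itself is mostly sequential reads of data blocks, any per-block read inefficiency in the storage engine shows up directly as longer backups, which is what the readahead investigation above confirmed.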