storage: slower incremental backups on Pebble #49710
Hi @petermattis, please add a C-ategory label to your issue. Check out the label system docs. While you're here, please consider adding an A- label to help keep our repository tidy. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Full backup on May 28 took 51m29s. Full backup on May 29 took 3h25m27s. So perhaps this isn’t an incremental backup problem, but simply a backup problem.
As part of the investigation into cockroachdb#49710, this change adds a benchmark for ExportToSst that tests both RocksDB and Pebble. Release note: None.
As part of the investigation into cockroachdb#49710, this change adds a benchmark for ExportToSst that tests both RocksDB and Pebble. Here are some example runs without contention (old = rocksdb, new = pebble):

name                                                old time/op    new time/op    delta
ExportToSst/rocksdb/numKeys=64/numRevisions=1-12      43.9µs ± 3%    34.5µs ± 4%   -21.33%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=64/numRevisions=10-12      281µs ± 3%     169µs ± 6%   -39.89%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=64/numRevisions=100-12    1.82ms ±22%    1.17ms ± 1%   -35.73%  (p=0.000 n=10+9)
ExportToSst/rocksdb/numKeys=512/numRevisions=1-12      212µs ± 6%     111µs ± 3%   -47.77%  (p=0.000 n=10+9)
ExportToSst/rocksdb/numKeys=512/numRevisions=10-12    1.91ms ± 1%    1.19ms ± 8%   -37.65%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=512/numRevisions=100-12   13.7ms ± 3%    10.1ms ±12%   -26.21%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=1024/numRevisions=1-12     390µs ± 1%     215µs ±12%   -44.94%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=1024/numRevisions=10-12   4.01ms ± 6%    2.40ms ±16%   -40.13%  (p=0.000 n=10+9)
ExportToSst/rocksdb/numKeys=1024/numRevisions=100-12  27.9ms ± 2%    20.8ms ± 2%   -25.48%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=8192/numRevisions=1-12    2.97ms ± 2%    1.42ms ± 5%   -52.24%  (p=0.000 n=9+10)
ExportToSst/rocksdb/numKeys=8192/numRevisions=10-12   32.8ms ± 7%    19.1ms ± 3%   -41.59%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=8192/numRevisions=100-12   224ms ± 3%     169ms ±25%   -24.64%  (p=0.000 n=9+10)
ExportToSst/rocksdb/numKeys=65536/numRevisions=1-12   23.7ms ± 4%    13.4ms ±20%   -43.65%  (p=0.000 n=9+10)
ExportToSst/rocksdb/numKeys=65536/numRevisions=10-12   264ms ± 4%     201ms ±24%   -23.92%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=65536/numRevisions=100-12  1.88s ± 6%     1.23s ± 8%   -34.70%  (p=0.000 n=10+8)

And some with contention=true:

name                                                                  old time/op    new time/op    delta
ExportToSst/rocksdb/numKeys=65536/numRevisions=10/contention=true-12    362ms ± 7%     168ms ± 3%   -53.60%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=65536/numRevisions=100/contention=true-12   2.24s ± 6%     1.24s ±10%   -44.50%  (p=0.000 n=10+10)

Release note: None.
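As a side note for anyone reproducing numbers of this shape: they come from a matrix of Go sub-benchmarks run against each engine and compared with benchstat. The sketch below is a hypothetical stand-in, not CockroachDB's actual ExportToSst benchmark; `buildDataset` and `exportAllKeys` are placeholder names, and the real benchmark exercises a RocksDB or Pebble engine instead of byte slices.

```go
package export_test

import (
	"fmt"
	"testing"
)

// sink prevents the compiler from optimizing the benchmarked work away.
var sink int

// buildDataset materializes numKeys keys with numRevisions revisions each.
// It is a placeholder for setting up a real engine with MVCC data.
func buildDataset(numKeys, numRevisions int) [][]byte {
	data := make([][]byte, 0, numKeys*numRevisions)
	for k := 0; k < numKeys; k++ {
		for r := 0; r < numRevisions; r++ {
			data = append(data, []byte(fmt.Sprintf("key-%08d@rev-%04d", k, r)))
		}
	}
	return data
}

// exportAllKeys stands in for the engine's export path (ExportToSst in the
// real benchmark); here it just walks the dataset.
func exportAllKeys(data [][]byte) int {
	n := 0
	for _, kv := range data {
		n += len(kv)
	}
	return n
}

func BenchmarkExportToSst(b *testing.B) {
	for _, numKeys := range []int{64, 512, 1024, 8192, 65536} {
		for _, numRevisions := range []int{1, 10, 100} {
			name := fmt.Sprintf("numKeys=%d/numRevisions=%d", numKeys, numRevisions)
			b.Run(name, func(b *testing.B) {
				data := buildDataset(numKeys, numRevisions)
				b.ResetTimer()
				for i := 0; i < b.N; i++ {
					sink = exportAllKeys(data)
				}
			})
		}
	}
}
```

Running `go test -bench ExportToSst -count 10` against each engine build and feeding the two outputs to `benchstat old.txt new.txt` produces old/new/delta tables like the one above.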
Based on my own testing today, this can be generalized to all backups, not just incremental backups. I spun up two clusters on AWS with these parameters: [...]. Both were running 20.1.1, one with Pebble and one with RocksDB. I then imported tpcc with 2000 warehouses into both clusters, started running the tpcc workload, and in parallel kicked off a backup to S3. The almost-identical backup took 35 minutes to finish on RocksDB, while it has been running on the Pebble cluster for 1h21m so far (and is still only ~60% done). I noticed that both RocksDB and Pebble took full advantage of the provisioned 1300 IOPS at the start of the backup before declining to just under the limit, but more interestingly, RocksDB had much higher read bytes/throughput numbers than Pebble did throughout the backup run (RocksDB on top, Pebble on the bottom in the screenshots). I'd say this is a good enough reproduction. Will profile and see what's actually going on tomorrow.
Yeah, this looks like a solid reproduction. The timings even match: it looks like the Pebble backup will take ~4x as long as the RocksDB one, which matches what we saw on the test cluster. Definitely curious what is limiting Pebble here.
Another day of investigation. I looked at CPU profiles but nothing really stood out; I tried out some improvements to [...]. Running the pebble and rocks clusters side-by-side and [...]. More interestingly, the first read in a "new" region of disk seems to be pretty large (always larger than 24 blocks) under RocksDB, after which we see many "small" sequential reads (8, 16, 24 blocks). With Pebble, the first read is also small, requiring many more IO operations to begin reading data blocks in the file. The summary stats under both engines also agree with what the admin UI showed: fewer IO events resulting in more data being returned. I haven't dived deep into the block-based table reader code in RocksDB yet, but at first glance there seems to be quite a bit of prefetching of index/filter blocks happening there; maybe that helps make read I/O more efficient under RocksDB?
Certainly possible. I assume you're referring to the code in
A reminder about cockroachdb/pebble#198. Pebble should be specifying the same
Here's the RocksDB PR where the dynamic read-ahead code landed: facebook/rocksdb#3282. |
TIL about the
49565: sql: serialize UDTs in expressions in a stable way r=otan,jordanlewis a=rohany

Fixes #49379. This PR ensures that serialized expressions stored durably in table descriptors are serialized in a format that is stable across changes to user defined types present in those expressions. An effect of this change is that these expressions must be reparsed and formatted in a human readable way before display in statements like `SHOW CREATE TABLE`.

Release note: None

49721: storage: Add rocksdb-vs-pebble benchmark for ExportToSst r=itsbilal a=itsbilal

As part of the investigation into #49710, this change adds a benchmark for ExportToSst that tests both RocksDB and Pebble. Here are some example runs without contention (old = rocksdb, new = pebble):

name                                                old time/op    new time/op    delta
ExportToSst/rocksdb/numKeys=64/numRevisions=1-12      43.9µs ± 3%    34.5µs ± 4%   -21.33%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=64/numRevisions=10-12      281µs ± 3%     169µs ± 6%   -39.89%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=64/numRevisions=100-12    1.82ms ±22%    1.17ms ± 1%   -35.73%  (p=0.000 n=10+9)
ExportToSst/rocksdb/numKeys=512/numRevisions=1-12      212µs ± 6%     111µs ± 3%   -47.77%  (p=0.000 n=10+9)
ExportToSst/rocksdb/numKeys=512/numRevisions=10-12    1.91ms ± 1%    1.19ms ± 8%   -37.65%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=512/numRevisions=100-12   13.7ms ± 3%    10.1ms ±12%   -26.21%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=1024/numRevisions=1-12     390µs ± 1%     215µs ±12%   -44.94%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=1024/numRevisions=10-12   4.01ms ± 6%    2.40ms ±16%   -40.13%  (p=0.000 n=10+9)
ExportToSst/rocksdb/numKeys=1024/numRevisions=100-12  27.9ms ± 2%    20.8ms ± 2%   -25.48%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=8192/numRevisions=1-12    2.97ms ± 2%    1.42ms ± 5%   -52.24%  (p=0.000 n=9+10)
ExportToSst/rocksdb/numKeys=8192/numRevisions=10-12   32.8ms ± 7%    19.1ms ± 3%   -41.59%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=8192/numRevisions=100-12   224ms ± 3%     169ms ±25%   -24.64%  (p=0.000 n=9+10)
ExportToSst/rocksdb/numKeys=65536/numRevisions=1-12   23.7ms ± 4%    13.4ms ±20%   -43.65%  (p=0.000 n=9+10)
ExportToSst/rocksdb/numKeys=65536/numRevisions=10-12   264ms ± 4%     201ms ±24%   -23.92%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=65536/numRevisions=100-12  1.88s ± 6%     1.23s ± 8%   -34.70%  (p=0.000 n=10+8)

And some with contention=true:

name                                                                  old time/op    new time/op    delta
ExportToSst/rocksdb/numKeys=65536/numRevisions=10/contention=true-12    362ms ± 7%     168ms ± 3%   -53.60%  (p=0.000 n=10+10)
ExportToSst/rocksdb/numKeys=65536/numRevisions=100/contention=true-12   2.24s ± 6%     1.24s ±10%   -44.50%  (p=0.000 n=10+10)

Release note: None.

49815: roachpb: refuse nil desc in NewRangeKeyMismatchError r=andreimatei a=andreimatei

Since recently RangeKeyMismatchError does not support nil descriptors, but it still had code that pretended to deal with nils (even though a nil would have exploded a bit later). Only one test caller was passing a nil, and it turns out that was dead code.

Release note: None

Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Disabling the dynamic readahead code in RocksDB completely closed the backup performance/time gap in my tpcc-2k test setup, in the sense that RocksDB then became just as slow as Pebble. I've added a simple change to implement the same functionality in Pebble: cockroachdb/pebble#726. From early test runs with this change, I can see Pebble backups slightly outperform RocksDB (with its implicit automatic readahead added back in). Doing more testing to confirm my early results.
For sequential-like IO workloads where we read data blocks one after the other in quick succession, signalling the OS to asynchronously bring them into cache in advance can deliver significant savings in IOPS dispatched. In IOPS-bound workloads such as backup on an EBS disk, this delivers a 3x speedup. Presumably aggregate queries and compactions will be faster as well, though this hasn't been benchmarked in practice yet. This change maintains a counter for the number of data block reads performed in a singleLevelIterator, and when that count exceeds 2, a readahead system call is made on Linux. RocksDB has almost exactly the same behaviour, including the same min/max readahead sizes and read count thresholds. Will address cockroachdb/cockroach#49710 when it lands in cockroach.
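The mechanism described in that commit message is small enough to sketch. The following is only a rough illustration of the idea, not Pebble's actual implementation: the threshold and min/max readahead sizes are assumed values, the `readaheadState` type and the file path are hypothetical, and the real iterator also checks that reads really are sequential before ramping up. It uses the readahead(2) system call via golang.org/x/sys/unix, so it is Linux-only.

```go
//go:build linux

package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// Illustrative constants; the actual thresholds and min/max readahead sizes
// used by Pebble and RocksDB may differ.
const (
	sequentialReadThreshold = 2         // data-block reads before readahead kicks in
	minReadaheadSize        = 16 << 10  // 16 KB
	maxReadaheadSize        = 256 << 10 // 256 KB
)

// readaheadState tracks data-block reads for one sstable iterator and asks the
// OS to prefetch ahead of the current offset once reads look sequential.
type readaheadState struct {
	numReads      int
	readaheadSize int
}

// maybeReadahead is called before each data-block read at offset off.
func (rs *readaheadState) maybeReadahead(f *os.File, off int64) {
	rs.numReads++
	if rs.numReads <= sequentialReadThreshold {
		return
	}
	if rs.readaheadSize == 0 {
		rs.readaheadSize = minReadaheadSize
	}
	// Hint the kernel to asynchronously populate the page cache; the call is
	// advisory only, so errors are ignored here.
	_ = unix.Readahead(int(f.Fd()), off, rs.readaheadSize)
	// Ramp up the window, doubling until the maximum is reached.
	if rs.readaheadSize < maxReadaheadSize {
		rs.readaheadSize *= 2
	}
}

func main() {
	f, err := os.Open("/tmp/example.sst") // hypothetical file path
	if err != nil {
		fmt.Println(err)
		return
	}
	defer f.Close()

	rs := &readaheadState{}
	buf := make([]byte, 32<<10)
	for off := int64(0); ; off += int64(len(buf)) {
		rs.maybeReadahead(f, off)
		if _, err := f.ReadAt(buf, off); err != nil {
			break
		}
	}
}
```

Because the call is purely advisory, the kernel fills the page cache in the background and subsequent sequential reads hit cache instead of each issuing a small disk read, which is where the IOPS savings come from.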
Updates Pebble to the latest version in its crl-release-20.1 branch. Pulls in a readahead change in sstable/reader.go to eventually address cockroachdb#49710. Release note (performance improvement): Optimize reading of files when doing backups and storage-level compactions. Should deliver a performance improvement for some read-heavy operations on an IOPS-constrained device.
@itsbilal Can this be closed now?
@petermattis I was thinking of keeping this open until the fix is in 20.1 and we can confirm it works on the telemetry cluster. But I'm confident enough to close it now, since that may not happen for a long time, and the fix addresses the issue we saw in my test cluster.
Excited for the fix!
A series of test runs with tpcc-3k before/after the readahead change showed that the gap with RocksDB is almost closed: a RocksDB backup took 20-25 minutes, a Pebble (with readahead) backup took 25-30 minutes, and a Pebble backup without readahead took 1h-1h15m. The RocksDB-like, simpler-but-more-wasteful readahead implementation that I prototyped earlier performed slightly better on backups than the more conservative implementation that landed, but I didn't really benchmark other read-heavy workloads with that prototype (vs. the new one). I'm sure there are some cases where the RocksDB-like implementation would perform worse (reading ahead too early when it's not necessary) than the Pebble solution. I think it's fair to close this issue at this point.
In internal testing of Pebble on one node of an internal test cluster, we noticed increased read ops on the Pebble node (Pebble is `n3`, the node shown in red).

Further investigation is pointing the finger at the read op increase occurring during backups. Incremental backups when Pebble was enabled on `n3` were taking ~40min. After switching `n3` back to RocksDB, the next incremental backup took 7m12s. Nightly backup roachtests, which perform full backups, do not show any time difference between RocksDB and Pebble.

It is possible there is a bug in the time-bound iterator code, or in the `ExportToSst` code (which lies at the heart of backup), that is causing significantly increased reads and in turn slower backups. @dt will be doing some experimentation on another test cluster. @joshimhoff is pulling the sizes of recent Pebble and RocksDB generated incremental backups to see if they are similar in size.

This is a developing story and will be updated soon.
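For context on where backup spends its time, here is a heavily simplified sketch of the shape of an export loop over MVCC revisions. The `kv`, `iterator`, and `sstWriter` types below are illustrative stand-ins, not CockroachDB's actual MVCC or ExportToSst APIs.

```go
package export

// kv is a simplified MVCC key/value: a user key, a revision timestamp (wall
// nanos), and a value. The real code uses CockroachDB's MVCCKey and hlc.Timestamp.
type kv struct {
	key   []byte
	ts    int64
	value []byte
}

// iterator is a minimal stand-in for an engine iterator. A time-bound iterator
// additionally skips whole sstables whose keys all fall outside the requested
// time window, rather than filtering revision by revision.
type iterator interface {
	Valid() bool
	Next()
	Cur() kv
}

// sstWriter is a stand-in for an SST builder such as Pebble's sstable writer.
type sstWriter interface {
	Add(k kv) error
}

// exportToSST copies every revision in (startTS, endTS] from it into w. For a
// full backup startTS is zero; for an incremental backup it is the previous
// backup's end time, which is why time-bound iteration matters so much there.
func exportToSST(it iterator, w sstWriter, startTS, endTS int64) error {
	for ; it.Valid(); it.Next() {
		cur := it.Cur()
		if cur.ts <= startTS || cur.ts > endTS {
			continue // revision outside the requested time window
		}
		if err := w.Add(cur); err != nil {
			return err
		}
	}
	return nil
}
```

Because the loop itself is mostly sequential reads of data blocks, any per-block read inefficiency in the storage engine shows up directly as longer backups, which is what the readahead investigation above confirmed.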