storage: slower incremental backups on Pebble #49710

Closed
petermattis opened this issue May 29, 2020 · 14 comments
Labels: A-disaster-recovery, A-storage (Relating to our storage engine (Pebble) on-disk storage), C-performance (Perf of queries or internals; solution not expected to change functional behavior)

Comments

@petermattis (Collaborator)

While testing Pebble on one node of an internal test cluster, we noticed increased read ops on that node. (Pebble is running on n3, the node shown in red.)

[Screenshot: read ops graph, 2020-05-29]

Further investigation points to the read op increase occurring during backups. Incremental backups with Pebble enabled on n3 were taking ~40min; after switching n3 back to RocksDB, the next incremental backup took 7m12s.

Nightly backup roachtests which perform full backups do not show any time difference between RocksDB and Pebble.

It is possible there is a bug in the time-bound iterator code, or in the ExportToSst code (which lies at the heart of backup), that is causing significantly increased reads and in turn slower backups. @dt will be doing some experimentation on another test cluster. @joshimhoff is pulling the sizes of recent Pebble- and RocksDB-generated incremental backups to see if they are similar in size.

This is a developing story and will be updated soon.

@blathers-crl (bot) commented May 29, 2020

Hi @petermattis, please add a C-ategory label to your issue. Check out the label system docs.

While you're here, please consider adding an A- label to help keep our repository tidy.

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.

petermattis added the A-disaster-recovery, A-storage, and C-performance labels May 29, 2020
@petermattis (Collaborator, Author)

Full backup on May 28 took 51m29s. Full backup on May 29 took 3h25m27s. So perhaps this isn’t an incremental backup problem, but simply a backup problem.

itsbilal added a commit to itsbilal/cockroach that referenced this issue May 29, 2020
As part of the investigation into cockroachdb#49710, this change adds a
benchmark for ExportToSst that tests both RocksDB and Pebble.

Release note: None.
itsbilal added a commit to itsbilal/cockroach that referenced this issue Jun 1, 2020
As part of the investigation into cockroachdb#49710, this change adds a
benchmark for ExportToSst that tests both RocksDB and Pebble.

Here are some example runs without contention (old = rocksdb,
new = pebble):

	name                                                   old time/op  new time/op  delta
	ExportToSst/rocksdb/numKeys=64/numRevisions=1-12       43.9µs ± 3%  34.5µs ± 4%  -21.33%  (p=0.000 n=10+10)
	ExportToSst/rocksdb/numKeys=64/numRevisions=10-12       281µs ± 3%   169µs ± 6%  -39.89%  (p=0.000 n=10+10)
	ExportToSst/rocksdb/numKeys=64/numRevisions=100-12     1.82ms ±22%  1.17ms ± 1%  -35.73%  (p=0.000 n=10+9)
	ExportToSst/rocksdb/numKeys=512/numRevisions=1-12       212µs ± 6%   111µs ± 3%  -47.77%  (p=0.000 n=10+9)
	ExportToSst/rocksdb/numKeys=512/numRevisions=10-12     1.91ms ± 1%  1.19ms ± 8%  -37.65%  (p=0.000 n=10+10)
	ExportToSst/rocksdb/numKeys=512/numRevisions=100-12    13.7ms ± 3%  10.1ms ±12%  -26.21%  (p=0.000 n=10+10)
	ExportToSst/rocksdb/numKeys=1024/numRevisions=1-12      390µs ± 1%   215µs ±12%  -44.94%  (p=0.000 n=10+10)
	ExportToSst/rocksdb/numKeys=1024/numRevisions=10-12    4.01ms ± 6%  2.40ms ±16%  -40.13%  (p=0.000 n=10+9)
	ExportToSst/rocksdb/numKeys=1024/numRevisions=100-12   27.9ms ± 2%  20.8ms ± 2%  -25.48%  (p=0.000 n=10+10)
	ExportToSst/rocksdb/numKeys=8192/numRevisions=1-12     2.97ms ± 2%  1.42ms ± 5%  -52.24%  (p=0.000 n=9+10)
	ExportToSst/rocksdb/numKeys=8192/numRevisions=10-12    32.8ms ± 7%  19.1ms ± 3%  -41.59%  (p=0.000 n=10+10)
	ExportToSst/rocksdb/numKeys=8192/numRevisions=100-12    224ms ± 3%   169ms ±25%  -24.64%  (p=0.000 n=9+10)
	ExportToSst/rocksdb/numKeys=65536/numRevisions=1-12    23.7ms ± 4%  13.4ms ±20%  -43.65%  (p=0.000 n=9+10)
	ExportToSst/rocksdb/numKeys=65536/numRevisions=10-12    264ms ± 4%   201ms ±24%  -23.92%  (p=0.000 n=10+10)
	ExportToSst/rocksdb/numKeys=65536/numRevisions=100-12   1.88s ± 6%   1.23s ± 8%  -34.70%  (p=0.000 n=10+8)

And some with contention=true:

	name                                                                   old time/op  new time/op  delta
	ExportToSst/rocksdb/numKeys=65536/numRevisions=10/contention=true-12    362ms ± 7%   168ms ± 3%  -53.60%  (p=0.000 n=10+10)
	ExportToSst/rocksdb/numKeys=65536/numRevisions=100/contention=true-12   2.24s ± 6%   1.24s ±10%  -44.50%  (p=0.000 n=10+10)

Release note: None.
@itsbilal (Member) commented Jun 1, 2020

Based on my own testing today, this can be generalized to all backups, not just incremental backups.

I spun up two clusters on AWS with these parameters:
roachprod create $CLUSTER --aws-ebs-iops 1300 --aws-ebs-volume-size 250 --aws-ebs-volume-type io1 --local-ssd=false --aws-machine-type m5.xlarge -n 4 --clouds aws

Both were running 20.1.1, one with pebble and one with rocksdb.

I then imported tpcc with 2000 warehouses into both clusters, started running the tpcc workload, and in parallel kicked off a backup to S3. The backup took 35 minutes to finish on the rocksdb cluster, while the nearly identical backup on the pebble cluster has been running for 1h21m so far (and is still only ~60% done).

I noticed that both rocksdb and pebble took full advantage of the provisioned 1300 IOPS at the start of the backup before declining to just under the limit. More interestingly, rocksdb sustained much higher read bytes/throughput than pebble throughout the backup run:

[Screenshots: disk read throughput, 2020-06-01 (rocksdb on top, pebble on the bottom)]

I'd say this is a good enough reproduction. Will profile and see what's actually going on tomorrow.

@petermattis (Collaborator, Author)

I'd say this is a good enough reproduction. Will profile and see what's actually going on tomorrow.

Yeah, this looks like a solid reproduction. The timings line up too: it looks like the Pebble backup will take ~4x as long as the RocksDB one, which matches what we saw on the test cluster. Definitely curious what is limiting Pebble here.

@itsbilal (Member) commented Jun 2, 2020

Another day of investigation. I looked at CPU profiles but nothing really stood out; I tried some improvements to sstable.MemFile around pre-growing the bytes.Buffer up to targetSize, and that helped increase the CPU time used by the export goroutine, but not significantly.

Running the pebble and rocksdb clusters side by side and blktrace-ing the data directory showed something interesting: reads under pebble max out at 24 file blocks requested, while reads under rocksdb max out at a much higher block count (288).

More interestingly, the first read in a "new" region of disk seems to be pretty large (always larger than 24 blocks) under rocksdb, after which we see many "small" sequential reads (8, 16, 24 blocks). With pebble, the initial read is also small, so many more IO operations are required to start reading data blocks in the file.

The summary stats under both engines also agree with what the admin UI showed: under rocksdb, fewer IO events return more data.

I haven't dug deep into the block-based table reader code in RocksDB yet, but at first glance there seems to be quite a bit of prefetching of index/filter blocks happening there. Maybe that helps make read I/O more efficient under rocksdb?

itsbilal added a commit to itsbilal/cockroach that referenced this issue Jun 2, 2020
As part of the investigation into cockroachdb#49710, this change adds a
benchmark for ExportToSst that tests both RocksDB and Pebble.

Release note: None.
@petermattis (Collaborator, Author)

I haven't dived deep into the block based table reader code in RocksDB yet, but at first glance there seems to be quite a bit of prefetching of index/filter blocks happening there - maybe that helps make read I/O more efficient under rocksdb?

Certainly possible. I assume you're referring to the code in BlockBasedTableIterator<TBlockIter, TValue>::InitDataBlock. Rather than try to duplicate this code in Pebble, a quicker way to test your theory is to disable the implicit read-ahead code in RocksDB.

@petermattis (Collaborator, Author)

A reminder about cockroachdb/pebble#198: Pebble should be specifying the same fadvise settings as RocksDB. Worthwhile to double-check that, though. I'd do so by instrumenting the RocksDB code, since trying to trace through what it uses by code inspection alone is error prone.
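For context, here is a minimal sketch of what setting an fadvise hint from Go looks like, using golang.org/x/sys/unix. The file name and the choice of advice value are illustrative only; the exact settings RocksDB applies would need to be confirmed via the instrumentation suggested above.

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	f, err := os.Open("000123.sst") // hypothetical sstable file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Hint that the file will be read sequentially (length 0 means "to end
	// of file"), letting the kernel read ahead more aggressively.
	// FADV_RANDOM would instead disable kernel readahead, which is what
	// you'd want for point lookups.
	if err := unix.Fadvise(int(f.Fd()), 0, 0, unix.FADV_SEQUENTIAL); err != nil {
		log.Fatalf("fadvise: %v", err)
	}
}
```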

@petermattis (Collaborator, Author)

Here's the RocksDB PR where the dynamic read-ahead code landed: facebook/rocksdb#3282.

@petermattis (Collaborator, Author)

TIL about the readahead syscall. If this is the source of the perf discrepancy, adding support for this in Pebble should be straightforward.
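For reference, readahead(2) asks the kernel to asynchronously populate the page cache for a byte range of an open file. A minimal, Linux-only Go sketch via golang.org/x/sys/unix (the package and function names here are illustrative, not Pebble's actual code):

```go
package storage

import (
	"os"

	"golang.org/x/sys/unix"
)

// hintReadahead asks the kernel to start reading the next n bytes of f at
// offset off into the page cache. It is purely a best-effort hint, so the
// error is ignored; on non-Linux platforms this would simply be a no-op.
func hintReadahead(f *os.File, off int64, n int) {
	_ = unix.Readahead(int(f.Fd()), off, n)
}
```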

craig bot pushed a commit that referenced this issue Jun 3, 2020
49565: sql: serialize UDTs in expressions in a stable way r=otan,jordanlewis a=rohany

Fixes #49379.

This PR ensures that serialized expressions stored durably in table
descriptors are serialized in a format that is stable across changes to
user defined types present in those expressions. An effect of this
change is that these expressions must be reparsed and formatted in a
human readable way before display in statements like `SHOW CREATE
TABLE`. 

Release note: None

49721: storage: Add rocksdb-vs-pebble benchmark for ExportToSst r=itsbilal a=itsbilal

As part of the investigation into #49710, this change adds a
benchmark for ExportToSst that tests both RocksDB and Pebble.

Release note: None.

49815: roachpb: refuse nil desc in NewRangeKeyMismatchError r=andreimatei a=andreimatei

Since recently RangeKeyMismatchError does not support nil descriptors,
but it still had code that pretended to deal with nils (even though a
nil would have exploded a bit later). Only one test caller was passing a
nil, and it turns out that was dead code.

Release note: None

Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@itsbilal (Member) commented Jun 3, 2020

Disabling the dynamic readahead code in rocksdb completely closed the backup time gap in my tpcc-2k test setup; rocksdb was then just as slow as Pebble. I've added a simple change to implement the same functionality in Pebble: cockroachdb/pebble#726

From early test runs with this change, I can see Pebble backups slightly outperform RocksDB (with RocksDB's implicit automatic readahead added back in). Doing more testing to confirm these early results.

itsbilal added a commit to itsbilal/pebble that referenced this issue Jun 3, 2020
For sequential-like IO workloads where we read
data blocks one after the other in quick succession,
signalling the OS to asynchronously bring them to
cache in advance can deliver significant savings in IOPS
dispatched. In IOPS-bound workloads such as backup on
an EBS disk, this delivers a 3x speedup. Presumably
aggregate queries and compactions will be faster
as well, though this hasn't been benchmarked in
practice yet.

This change maintains a counter for the number of data
block reads performed in a singleLevelIterator, and
when that count exceeds 2, a readahead system
call is made on Linux. RocksDB has almost
the exact same behaviour, including the same
min/max readahead sizes and read count thresholds.

Will address cockroachdb/cockroach#49710
when it lands in cockroach.
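A minimal sketch of the counter-based heuristic this commit message describes, under stated assumptions: the type and field names and the 16 KB / 256 KB bounds below are placeholders, not Pebble's actual values (see cockroachdb/pebble#726 for the real implementation).

```go
package sstable

import "golang.org/x/sys/unix"

const (
	minReadaheadSize = 16 << 10  // placeholder lower bound
	maxReadaheadSize = 256 << 10 // placeholder upper bound
)

// readaheadState tracks how many data blocks an iterator has read so far
// and how large the next readahead request should be.
type readaheadState struct {
	numReads      int
	readaheadSize int64
}

// maybeReadahead is called before each data block read at the given file
// offset. Once more than two blocks have been read (i.e. the access
// pattern looks sequential), it issues readahead(2) and doubles the
// request size up to the maximum, mirroring RocksDB's behaviour at a
// high level.
func (rs *readaheadState) maybeReadahead(fd int, offset int64) {
	rs.numReads++
	if rs.numReads <= 2 {
		return // not enough evidence of sequential access yet
	}
	if rs.readaheadSize == 0 {
		rs.readaheadSize = minReadaheadSize
	}
	// Best-effort hint; errors are ignored since readahead is purely an
	// optimization (and Linux-only).
	_ = unix.Readahead(fd, offset, int(rs.readaheadSize))
	if rs.readaheadSize < maxReadaheadSize {
		rs.readaheadSize *= 2
	}
}
```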
itsbilal added a commit to cockroachdb/pebble that referenced this issue Jun 10, 2020
itsbilal added a commit to itsbilal/cockroach that referenced this issue Jun 11, 2020
Updates Pebble to the latest version in its crl-release-20.1 branch.
Pulls in a readahead change in sstable/reader.go to eventually
address cockroachdb#49710.

Release note (performance improvement): Optimize reading of files when doing backups
and storage-level compactions of files. Should deliver a performance
improvement for some read-heavy operations on an IOPS-constrained device.
@petermattis (Collaborator, Author)

@itsbilal Can this be closed now?

@itsbilal (Member)

@petermattis I was thinking of keeping this open until the fix is in 20.1 and we can confirm it works on the telemetry cluster. But I'm comfortable closing this now, since that may not happen for a while and the fix addresses the issue we saw in my test cluster.

@joshimhoff (Collaborator)

Excited for the fix!

@itsbilal (Member)

A series of test runs with tpcc-3k before/after the readahead change showed that the gap with rocksdb is almost closed: a RocksDB backup took 20-25 min, a Pebble (with readahead) backup took 25-30 min, and a Pebble backup without readahead took 1h-1h15m.

The simpler but more wasteful RocksDB-like readahead implementation that I prototyped earlier performed slightly better on backups than the more conservative implementation that landed, but I didn't benchmark other read-heavy workloads with that prototype versus the new one. There are likely cases where the RocksDB-like implementation performs worse than the Pebble solution (it reads ahead too early when it isn't necessary).

I think it's fair to close this issue at this point.
