libroach: Set bytes_per_sync to smooth out disk usage #20352
Conversation
This depends on also getting our build flags fixed -- `bytes_per_sync` doesn't work without `-DROCKSDB_RANGESYNC_PRESENT` being set.

The 512KB option is somewhat arbitrary, but was chosen as the best value for certain configurations by resources such as https://www.youtube.com/watch?v=pvUqbIeoPzM and https://www.percona.com/live/data-performance-conference-2016/sites/default/files/slides/Percona_RocksDB_v1.3.pdf

Also, I'm trying to upstream the related `max_background_jobs` change I mentioned in cockroachdb#19436: facebook/rocksdb#3208. If they don't take that upstream, I'll change it here.

Helps with cockroachdb#19436.

Release note (performance improvement): Smooth out disk usage under very write-heavy workloads by syncing to disk more frequently.
So happy that we're making progress.

Review status: 0 of 1 files reviewed at latest revision, all discussions resolved, all commit checks successful.
Me too, believe me. I'm going to wait to merge this until we have the build fixed up. Do you think it's worth checking in the disk contention test I've been using?
Perhaps. @arjunravinarayan is taking a look at my
cockroachdb#20352 configured rocksdb to sync every 512kb. This does the same for our sst sideload file writes.
38932: storage: build SSTs from KV_BATCH snapshot r=jeffrey-xiao a=jeffrey-xiao

Implements the SST snapshot strategy discussed in #16954 and partially implemented in #25134 and #38873, but only includes the logic on the receiver side for ease of testing and compatibility. This PR also handles the complications of subsumed replicas that are not fully contained by the current replica.

The maximum number of SSTs created using this strategy is 4 + SR + 2, where SR is the number of subsumed replicas:

- Three SSTs get streamed from the sender (range local keys, replicated range-id local keys, and data keys).
- One SST is constructed for the unreplicated range-id local keys.
- One SST is constructed for every subsumed replica to clear the range-id local keys. These SSTs consist of one range deletion tombstone and one `RaftTombstone` key.
- A maximum of two SSTs for all subsumed replicas to account for the case of not fully contained subsumed replicas. Note that currently, subsumed replicas can have keys to the right of the current replica, but not to the left, so there will be a maximum of one SST created for the range-local keys and one for the data keys. These SSTs consist of one range deletion tombstone.

This number can be further reduced to 3 + SR if we pass the file handles and SST writers from the receiving step to the application step: we can combine the SSTs for the unreplicated and replicated range-id local keys, and the range-local and data SSTs of the subsumed replicas. We probably don't want to do this optimization, since we'd have to undo it if we start constructing the SSTs from the sender or start chunking large SSTs into smaller SSTs.

Blocked by facebook/rocksdb#5649.

# Test Plan

- [x] Testing knob to inspect SSTs before ingestion. Ensure that expected SSTs for subsumed replicas are ingested.
- [x] Unit tests for `SSTSnapshotStorage`.

# Metrics and Evaluation

One way to evaluate this change is the following steps:

1. Set up a 3-node cluster.
2. Set the default Raft log truncation threshold to some low constant:
   ```go
   defaultRaftLogTruncationThreshold = envutil.EnvOrDefaultInt64(
       "COCKROACH_RAFT_LOG_TRUNCATION_THRESHOLD", 128<<10 /* 128 KB */)
   ```
3. Set `range_min_bytes` to 0 and `range_max_bytes` to some large number.
4. Increase `kv.snapshot_recovery.max_rate` and `kv.snapshot_rebalance.max_rate` to some large number.
5. Disable load-based splitting.
6. Stop node 2.
7. Run an insert-heavy workload (kv0) on the cluster.
8. Start node 2.
9. Time how long it takes for node 2 to have all the ranges.

Roachtest: https://gist.github.com/jeffrey-xiao/e69fcad04968822d603f6807ca77ef3b

We can have two independent variables:

1. Fixed total data size (4000000 ops; ~3.81 GiB), variable number of splits:
   - 32 splits (~121 MiB ranges)
   - 64 splits (~61.0 MiB ranges)
   - 128 splits (~31.2 MiB ranges)
   - 256 splits (~15.7 MiB ranges)
   - 512 splits (~7.9 MiB ranges)
   - 1024 splits (~3.9 MiB ranges)
2. Fixed number of splits (32), variable total data size:
   - 125000 ops (~3.7 MiB ranges)
   - 250000 ops (~7.5 MiB ranges)
   - 500000 ops (~15 MiB ranges)
   - 1000000 ops (~30 MiB ranges)
   - 2000000 ops (~60 MiB ranges)
   - 4000000 ops (~121 MiB ranges)

# Fsync Chunk Size

The size of the SST chunk that we write before fsync-ing impacts how fast node 2 has all the ranges. I experimented with 32 splits and a median range size of 121 MB: no fsync-ing (~27s recovery), fsync-ing in 8 MB chunks (~30s recovery), fsync-ing in 2 MB chunks (~40s recovery), and fsync-ing in 256 KB chunks (~42s recovery). The default bulk SST sync rate is 2 MB and #20352 sets `bytes_per_sync` to 512 KB, so something between those options is probably good. The reason we would want to fsync is to prevent the OS from accumulating such a large buffer that it blocks unrelated small/fast writes for a long time when it flushes.
# Impact on Foreground Traffic

For testing the impact on foreground traffic, I ran kv0 on a four-node cluster with the merge queue and split queue disabled, starting with a constant number of splits. After 5 minutes, I decommissioned node 1 so its replicas would drain to other nodes using snapshots.

Roachtest: https://gist.github.com/jeffrey-xiao/5d9443a37b0929884aca927f9c320b6c

**Average Range Size of 3 MiB**
- [Before](https://user-images.githubusercontent.com/8853434/62398633-41a2bb00-b547-11e9-9e3d-747ee724943b.png)
- [After](https://user-images.githubusercontent.com/8853434/62398634-41a2bb00-b547-11e9-85e7-445b7989d173.png)

**Average Range Size of 32 MiB**
- [Before](https://user-images.githubusercontent.com/8853434/62398631-410a2480-b547-11e9-9019-86d3bd2e6f73.png)
- [After](https://user-images.githubusercontent.com/8853434/62398632-410a2480-b547-11e9-9513-8763e132e76b.png)

**Average Range Size of 128 MiB**
- [Before](https://user-images.githubusercontent.com/8853434/62398558-15873a00-b547-11e9-8ab6-2e8e9bae658c.png)
- [After](https://user-images.githubusercontent.com/8853434/62398559-15873a00-b547-11e9-9c72-b3e90fce1acc.png)

We see p99 latency wins for larger range sizes and comparable performance for smaller range sizes.

Release note (performance improvement): Snapshots sent between replicas are now applied faster and use less memory.

Co-authored-by: Jeffrey Xiao <jeffrey.xiao1998@gmail.com>