
storage: investigate larger default range sizes #39717

Closed
jeffrey-xiao opened this issue Aug 16, 2019 · 6 comments · Fixed by #44440
Labels
C-investigation Further steps needed to qualify. C-label will change.

Comments

@jeffrey-xiao
Contributor

jeffrey-xiao commented Aug 16, 2019

#16954 was a major blocker to increasing the default range size from 64 MiB (another potential problem is #22348). Since #16954 has been closed, we can experiment with larger default range sizes.

I ran tpcc with 2500 warehouses on a 3-node cluster with 16 vCPUs, using an average range size of 500 MiB and a max range size of 1 GiB, and nothing seemed to blow up.

(screenshot omitted)

There were a couple of latency spikes during the ramp-up period of tpcc, but I think they were caused by the destruction of old replicas when the initial splits from setting up the workload were merged away (not too sure about this).

@petermattis
Collaborator

Not a blocker, but I recall there is a timeout for snapshot send operations. If we increase the range size significantly, we may need to increase that timeout. For future-proofing, it may be a good idea to adjust the timeout based on the size of the snapshot. Or, since we send the snapshot in chunks, put the timeout on the sending of each chunk instead of on the operation as a whole.
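
For illustration only, here is a minimal Go sketch of the per-chunk timeout idea. It is not CockroachDB's snapshot code; chunkSender, sendSnapshot, and the chunk/timeout values are hypothetical stand-ins for the real gRPC snapshot stream.

```go
// A minimal sketch of putting the timeout on each chunk send rather than on
// the snapshot operation as a whole; all names here are hypothetical.
package main

import (
	"context"
	"fmt"
	"time"
)

// chunkSender abstracts "send one chunk of a snapshot".
type chunkSender interface {
	Send(ctx context.Context, chunk []byte) error
}

// sendSnapshot applies the timeout per chunk, so the effective deadline no
// longer depends on the total snapshot (range) size.
func sendSnapshot(ctx context.Context, s chunkSender, data []byte, chunkSize int, perChunkTimeout time.Duration) error {
	for len(data) > 0 {
		n := chunkSize
		if n > len(data) {
			n = len(data)
		}
		chunkCtx, cancel := context.WithTimeout(ctx, perChunkTimeout)
		err := s.Send(chunkCtx, data[:n])
		cancel()
		if err != nil {
			return fmt.Errorf("sending snapshot chunk: %w", err)
		}
		data = data[n:]
	}
	return nil
}

// fakeSender just counts chunks so the sketch runs end to end.
type fakeSender struct{ chunks int }

func (f *fakeSender) Send(ctx context.Context, chunk []byte) error {
	f.chunks++
	return ctx.Err()
}

func main() {
	s := &fakeSender{}
	if err := sendSnapshot(context.Background(), s, make([]byte, 1<<20), 256<<10, 10*time.Second); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("chunks sent:", s.chunks)
}
```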

@awoods187 awoods187 added the C-investigation Further steps needed to qualify. C-label will change. label Aug 30, 2019
@ajwerner
Contributor

> Not a blocker, but I recall there is a timeout for snapshot send operations.

This timeout isn't directly on the snapshot itself; rather, it's due to the timeout on queue processing in the storage package. I'm going to type up a patch to allow queues to independently control that timeout. For the raft snapshot queue, my sense is that it'd be best to set the queue process timeout quite high and then do what you suggest with a timeout per chunk or per snapshot.

@ajwerner
Contributor

Err, we have the ability to set the timeout per queue, but it was set to the default of 1m.
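
For illustration, a sketch of the size-based alternative: scale the queue's process timeout with the amount of data being moved rather than using the flat 1m default. The minimum transfer rate and the clamping floor are assumptions made up for this sketch, not actual cluster settings.

```go
// Scale a (hypothetical) queue process timeout with the snapshot size instead
// of using a flat 1m default. minimumRate and minimumTimeout are assumptions.
package main

import (
	"fmt"
	"time"
)

const (
	minimumTimeout = 1 * time.Minute // never go below the old 1m default
	minimumRate    = 8 << 20         // assume a worst-case transfer rate of 8 MiB/s
)

// processTimeoutForSize returns a timeout proportional to the amount of data
// the queue expects to move for this replica, clamped to a floor.
func processTimeoutForSize(sizeBytes int64) time.Duration {
	t := time.Duration(sizeBytes/minimumRate) * time.Second
	if t < minimumTimeout {
		return minimumTimeout
	}
	return t
}

func main() {
	for _, size := range []int64{64 << 20, 512 << 20, 1 << 30} {
		fmt.Printf("%5d MiB -> %s\n", size>>20, processTimeoutForSize(size))
	}
}
```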

@ajwerner
Contributor

ajwerner commented Dec 13, 2019

Concerns:

  • Snapshot timeouts
  • Backup memory usage
    • Backups generate SSTs which are range sized. These SSTs are built in C++ and handed back to Go. A 512 MB slice is a big allocation in memory-constrained environments.
  • CDC catch-up scans
    • CDC uses export requests to pull large amounts of data to perform catch-up scans. We need a mechanism to limit the size of the data retrieved.
  • RESTORE
    • There might also be a memory blowup on the RESTORE side, since all the SSTs that cover a given span being restored need to be downloaded and opened as iterators that can be wrapped up in one merging iterator from which to produce the RESTORE’d span SST. The SST index is at the end, so we have to download the whole thing before we can create an iterator over it. Right now that means all the SSTs that contain keys for a given span being imported are held in RAM while we construct, in RAM, the resulting SST that covers that range (see the rough sketch after this list). We could spill them to disk, but encryption at rest makes that somewhat more involved.
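
To make the RESTORE bullet concrete, here is a back-of-the-envelope sketch of peak memory when restoring one span; the SST count and sizes below are illustrative assumptions, not measurements.

```go
// A rough estimate of the RESTORE memory concern: every input SST covering a
// span is held in RAM (the SST index is at the end, so each must be fully
// downloaded) while the output SST is also built in RAM. The SST counts and
// sizes are illustrative assumptions, not measured values.
package main

import "fmt"

func restorePeakBytes(inputSSTs int, inputSSTBytes, outputSSTBytes int64) int64 {
	return int64(inputSSTs)*inputSSTBytes + outputSSTBytes
}

func main() {
	const mib = 1 << 20
	// One span covered by three backup SSTs, at the old 64 MB export size
	// versus a hypothetical range-sized 512 MB export.
	fmt.Printf("64 MiB SSTs:  ~%d MiB peak\n", restorePeakBytes(3, 64*mib, 64*mib)/mib)
	fmt.Printf("512 MiB SSTs: ~%d MiB peak\n", restorePeakBytes(3, 512*mib, 512*mib)/mib)
}
```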

@dt
Member

dt commented Dec 13, 2019

Another reason RESTORE might prefer not to spill to disk is that disk bandwidth is often a hot commodity during a RESTORE or bulk-ingestion -- using any of it to write something other than the final ingestion SSTs could potentially make RESTORE slower.

@ajwerner
Contributor

@dt correct me if I'm wrong, but from the discussion on Slack it seems like we can mitigate all of the above concerns if we create SSTs during backup with some target size rather than sizing them to the entire range.

SSTs created prior to increasing the range size will target 64 MB. If we continued to target 64 MB exported SSTs for BACKUP, then that would continue to be the size used for RESTORE. The potential downside of this approach is that RESTORE will create smaller ranges. We're not too concerned about this downside in the long run, because the ranges will get merged together. The problem with waiting for the merge queue to merge the ranges is that it will need to move data around to collocate the ranges for merging, which could lead to a bunch of extra data movement following the RESTORE. @dt does limiting the size of the SSTs created during backup seem challenging?

ajwerner added a commit to ajwerner/cockroach that referenced this issue Jan 28, 2020
This commit extends the engine interface to take a targetSize parameter in
the ExportToSst method. The iteration will stop if the first version of a key
to be added to the SST would lead to targetSize being exceeded. If
exportAllRevisions is false, the targetSize will not be exceeded unless the
first kv pair exceeds it.

This commit additionally fixes a bug in the rocksdb implementation of
DBExportToSst whereby the first key in the export request would be skipped.
This case likely never occurred because the key passed to Export was rarely
exactly the first key to be included (see the change related to seek_key in
db.cc).

The exportccl.TestRandomKeyAndTimestampExport test was extended to exercise
various targetSize limits. Running that test under stress with the tee engine
inspires some confidence, and it did catch the above-mentioned bug. More
testing would likely be good.

This commit leaves the task of adopting the targetSize parameter for later.

Fixes cockroachdb#39717.

Release note: None
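
As a simplified, in-memory model of the pagination semantics described above (not the libroach/engine implementation), the cut-off rule might look like the following; each kv below stands in for a key together with the versions exported for it.

```go
// A toy model of targetSize pagination: a new key is admitted to the current
// SST only if it would not push the SST past targetSize, except that the very
// first key of an SST is always admitted (so a single oversized key can still
// exceed the target). This mirrors the rule described in the commit message,
// but the types and sizes here are made up for illustration.
package main

import "fmt"

type kv struct {
	key  string
	size int64 // bytes this key (and its exported versions) would add
}

func paginate(kvs []kv, targetSize int64) [][]kv {
	var out [][]kv
	var cur []kv
	var curSize int64
	for _, e := range kvs {
		if len(cur) > 0 && curSize+e.size > targetSize {
			out = append(out, cur)
			cur, curSize = nil, 0
		}
		cur = append(cur, e)
		curSize += e.size
	}
	if len(cur) > 0 {
		out = append(out, cur)
	}
	return out
}

func main() {
	kvs := []kv{{"a", 30}, {"b", 30}, {"c", 30}, {"d", 90}, {"e", 10}}
	for i, sst := range paginate(kvs, 64) {
		fmt.Printf("sst %d: %v\n", i, sst)
	}
}
```
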
craig bot pushed a commit that referenced this issue Jan 28, 2020
44440: libroach,engine: support pagination of ExportToSst r=ajwerner a=ajwerner


Fixes #39717.

Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
@craig craig bot closed this as completed in 05f91e5 Jan 28, 2020
@ajwerner ajwerner reopened this Jan 28, 2020
ajwerner added a commit to ajwerner/cockroach that referenced this issue Feb 3, 2020
This PR is motivated by the desire to get the memory usage of CDC under
control in the presence of much larger ranges. Currently, when a changefeed
decides it needs to do a backfill, it breaks the spans up along range
boundaries and then fetches the data (with some parallelism) for the backfill.

The memory overhead was somewhat bounded by the range size. If we want to make
the range size dramatically larger, the memory usage would become a function of
that new, much larger range size.

Fortunately, we don't have much need for these `ExportRequest`s any more.
Another fascinating revelation of late is that the `ScanResponse` does indeed
include MVCC timestamps (not that we necessarily needed them, but it's a good
idea to keep them for compatibility).

The `ScanRequest` currently permits a limit on `NumRows`, which this commit
utilizes. I wanted to get this change typed in anticipation of cockroachdb#44341,
which will provide a limit on `NumBytes`.

I retained the existing parallelism, as ScanRequests with limits are not
parallelized.

I would like to do some benchmarking but I feel pretty okay about the
testing we have in place already. @danhhz what do you want to see here?

Relates to cockroachdb#39717.

Release note: None.
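
A schematic sketch of the backfill loop described above: fetch the span with repeated limited scans so that memory is bounded by the row limit rather than by the range size. The scanner interface and helper names are hypothetical stand-ins, not the actual kvclient or changefeed API.

```go
// Paginated backfill over a span using limited scans; all names are made up
// for this sketch and do not correspond to CockroachDB internals.
package main

import (
	"context"
	"fmt"
)

type row struct {
	key, value string
}

// scanner returns up to maxRows rows with keys >= start, plus the key to
// resume from ("" once the span is exhausted).
type scanner interface {
	Scan(ctx context.Context, start string, maxRows int) (rows []row, resume string, err error)
}

// backfillSpan holds at most maxRows rows in memory at a time, so memory no
// longer grows with the range size.
func backfillSpan(ctx context.Context, s scanner, emit func(row) error, maxRows int) error {
	start := ""
	for {
		rows, resume, err := s.Scan(ctx, start, maxRows)
		if err != nil {
			return err
		}
		for _, r := range rows {
			if err := emit(r); err != nil {
				return err
			}
		}
		if resume == "" {
			return nil
		}
		start = resume
	}
}

// fakeScanner serves from an in-memory slice so the sketch runs end to end.
type fakeScanner struct{ data []row }

func (f *fakeScanner) Scan(_ context.Context, start string, maxRows int) ([]row, string, error) {
	i := 0
	for i < len(f.data) && f.data[i].key < start {
		i++
	}
	end := i + maxRows
	if end > len(f.data) {
		end = len(f.data)
	}
	resume := ""
	if end < len(f.data) {
		resume = f.data[end].key
	}
	return f.data[i:end], resume, nil
}

func main() {
	s := &fakeScanner{data: []row{{"a", "1"}, {"b", "2"}, {"c", "3"}, {"d", "4"}, {"e", "5"}}}
	err := backfillSpan(context.Background(), s, func(r row) error {
		fmt.Println(r.key, r.value)
		return nil
	}, 2)
	if err != nil {
		fmt.Println("error:", err)
	}
}
```
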
craig bot pushed a commit that referenced this issue Feb 4, 2020
44663: changefeedccl: use ScanRequest instead of ExportRequest during backfills r=danhhz a=ajwerner


Relates to #39717.

Release note: None.

44719: sql: add telemetry for uses of alter primary key r=otan a=rohany

Fixes #44716.

This PR adds a telemetry counter for uses
of the alter primary key command.

Release note (sql change): This PR adds collected telemetry
from clusters upon using the alter primary key command.

44721: vendor: bump pebble to 89adc50375ffd11c8e62f46f1a5c320012cffafe r=petermattis a=petermattis

* db: additional tweak to the sstable boundary generation
* db: add memTable.logSeqNum
* db: force flushing of overlapping queued memtables during ingestion
* tool: lsm visualization tool
* db: consistently handle key decoding failure
* cmd/pebble: fix lint for maxOpsPerSec's type inference
* tool: add "find" command
* cmd/pebble: fix --rate flag
* internal/metamorphic: use the strict MemFS and add an operation to reset the DB
* db: make DB.Close() wait for any ongoing deletion of obsolete files
* sstable: encode varints directly into buf in blockWriter.store
* sstable: micro-optimize Writer.addPoint()
* sstable: minor cleanup of Writer/blockWriter
* sstable: micro-optimize blockWriter.store

Fixes #44631 

Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
Co-authored-by: Peter Mattis <petermattis@gmail.com>