
storage: Don't require splits to complete before running replicate queue #25047

Closed
a-robinson opened this issue Apr 24, 2018 · 5 comments
Labels
A-kv-distribution Relating to rebalancing and leasing. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@a-robinson (Contributor)

Currently, the replicate queue requires that all splits complete before doing its thing:

```go
acceptsUnsplitRanges: store.TestingKnobs().ReplicateQueueAcceptsUnsplit,
```

```go
if !repl.store.splitQueue.Disabled() && repl.needsSplitBySize() {
```
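
To make the gating concrete, here is a minimal sketch of how a check like this sits in a queue's `shouldQueue` path; the types and method names (`replica`, `NeedsSplitAtZoneBoundary`, `MissingReplicas`, etc.) are simplified stand-ins, not the actual CockroachDB code:

```go
// Sketch only: simplified stand-ins for the real replicate queue types.
type replica interface {
	// NeedsSplitBySize reports whether the range has grown past its
	// configured maximum size and is waiting on the split queue.
	NeedsSplitBySize() bool
	// NeedsSplitAtZoneBoundary reports whether the range still spans a
	// zone config boundary that it should be split at.
	NeedsSplitAtZoneBoundary() bool
	// MissingReplicas is the number of replicas the range is short of.
	MissingReplicas() int
}

type replicateQueue struct {
	acceptsUnsplitRanges bool // normally only true via testing knobs
	splitQueueDisabled   bool
}

// shouldQueue mirrors the gating described above: a range that still
// needs a split is skipped entirely, even if it is under-replicated.
func (rq *replicateQueue) shouldQueue(repl replica) (shouldQ bool, priority float64) {
	// Zone-config-boundary check (acceptsUnsplitRanges).
	if !rq.acceptsUnsplitRanges && repl.NeedsSplitAtZoneBoundary() {
		return false, 0
	}
	// Size-based check (needsSplitBySize): let the split queue finish first.
	if !rq.splitQueueDisabled && repl.NeedsSplitBySize() {
		return false, 0
	}
	// Only now is up-replication considered.
	if n := repl.MissingReplicas(); n > 0 {
		return true, float64(n)
	}
	return false, 0
}
```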

This makes sense, because it's cheaper/less risky to send snapshots for smaller ranges than for ones that are bigger than the configured limit. However, this leaves the cluster susceptible to data unavailability/loss if a range fails to ever split for some reason. If we had a perfect track record of splits never getting stuck, this wouldn't be a big deal, but we have had some problems with ranges failing to split (e.g., #24966, #25036, #24896, #23310, #21357).

I suspect we'd be better off allowing replication even when a size-based split is still needed. Do you have opinions, @nvanbenschoten or @tschottdorf?

@a-robinson a-robinson added the A-kv-distribution Relating to rebalancing and leasing. label Apr 24, 2018
@tbg (Member) commented Apr 24, 2018

Snapshots are currently pulled into memory on the receiving side, which I think is one of the original motivations for this restriction. With @nvanbenschoten's upcoming change and the possible elimination of other memory-blowup-inducing code, we may be able to drop this coupling again, and I absolutely think we should do so once it becomes safe.

@nvanbenschoten (Member)

I don't know the exact reason for introducing this restriction, but I suspect there were two main factors:

  • ranges had no upper bound on size, so if they needed a split then that means they could have been arbitrarily large.
  • large snapshots are expensive and potentially deadly because they buffer the entire range in memory.

The first factor is similar to the reason we used to prevent snapshots when the range size was larger than 2x the max_range_size. We removed that restriction after introducing backpressure that attempts to bound the size of ranges. The second factor is still valid, but will be addressed in #16954. Since we now do a better job of bounding range size and will soon remove the effective snapshot size limit, I think we can make this change shortly.
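
As a rough illustration of the two mechanisms contrasted here, under the assumption that both reduce to simple size thresholds (the function and parameter names below are hypothetical, not CockroachDB identifiers):

```go
// Old-style guard (since removed): refuse to send a snapshot for a range
// that has grown past a multiple of its configured maximum size.
func snapshotAllowed(rangeBytes, maxRangeBytes int64) bool {
	return rangeBytes < 2*maxRangeBytes // illustrative threshold only
}

// Backpressure-style replacement: once a range is well past its maximum
// size, slow or block new writes so it cannot keep growing without bound
// while it waits to split.
func shouldBackpressureWrites(rangeBytes, maxRangeBytes int64) bool {
	const backpressureMultiplier = 2 // illustrative threshold only
	return rangeBytes > backpressureMultiplier*maxRangeBytes
}
```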

Whether we should make the change is another question. I personally would love to see it because it reduces the number of components that have dependencies on each other. As we saw in #20589, these dependencies can easily turn into cycles which create deadlocks, so it's almost always best to avoid them.

@a-robinson (Contributor, Author)

> I don't know the exact reason for introducing this restriction, but I suspect there were two main factors:

Those are definitely the factors -- there's a comment in the code I linked to that explicitly lays that out. Sorry for not including the full range of relevant lines in the original post:

```go
// If the range exceeds the split threshold, let that finish first.
// Ranges must fit in memory on both sender and receiver nodes while
// being replicated. This supplements the check provided by
// acceptsUnsplitRanges, which looks at zone config boundaries rather
// than data size.
```

> Whether we should make the change is another question.

That question is easy. Once the large snapshot problem is addressed, we should certainly remove the `repl.needsSplitBySize()` check. As it stands, a stuck split can lead to under-replication, unavailability, or data loss, because we will never up-replicate the range even if nodes are lost.
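
In terms of a simplified gate, the proposal amounts to keeping the zone-config-boundary check while dropping the size-based one; a minimal sketch with hypothetical names:

```go
// Proposed gating once snapshot memory usage is bounded: keep the
// zone-config-boundary check, but no longer skip ranges that merely
// need a size-based split.
func shouldQueueForReplication(
	acceptsUnsplitRanges, needsSplitAtZoneBoundary bool, missingReplicas int,
) bool {
	if !acceptsUnsplitRanges && needsSplitAtZoneBoundary {
		return false
	}
	return missingReplicas > 0
}
```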

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 27, 2018
Fixes cockroachdb#16954.
Related to cockroachdb#25047.

This depends on the following two upstream changes to RocksDB:
- facebook/rocksdb#3778
- facebook/rocksdb#3779

The change introduces a new snapshot strategy called "SST". This strategy
streams SST files consisting of all keys in a range from the sender to the
receiver. These SST files are then atomically ingested directly into RocksDB.
An important property of the strategy is that the amount of memory required
for a receiver using the strategy is constant with respect to the size of
a range, instead of linear as it is with the KV_BATCH strategy. This will
be critical for increasing the default range size and potentially for
increasing the number of concurrent snapshots allowed per node. The
strategy also seems to significantly speed up snapshots once ranges are
above a certain size (somewhere in the single digit MBs).

This is a WIP change. Before it can be merged it needs:
- to be cleaned up a bit
- more testing (unit test, testing knobs, maybe some chaos)
- proper version handling
- heuristic tuning
- decisions on questions like compactions after ingestion

Release note: None
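
To illustrate why the receiver's memory use stays constant under a strategy like the one described in the commit message above, here is a rough sketch of a receiver that writes incoming SST chunks straight to disk and ingests them atomically; the `Engine` interface and `sstChunk` message are hypothetical placeholders, not the actual CockroachDB or RocksDB Go API:

```go
package snapshotsketch

import (
	"fmt"
	"os"
	"path/filepath"
)

// Engine is a hypothetical stand-in for the storage engine's ingestion
// API (RocksDB supports ingesting externally built SST files; the exact
// wrapper used by CockroachDB is not shown here).
type Engine interface {
	IngestExternalFiles(paths []string) error
}

// sstChunk is a hypothetical wire message carrying a piece of one SST file.
type sstChunk struct {
	fileIndex int    // which SST file this chunk belongs to
	data      []byte // raw bytes of the file
	last      bool   // true once all files have been fully sent
}

// receiveSSTSnapshot writes incoming SST chunks straight to disk and then
// ingests the resulting files atomically. Memory use is bounded by the
// chunk size rather than the range size, unlike a KV_BATCH-style strategy
// that buffers every key/value pair before applying it.
func receiveSSTSnapshot(eng Engine, dir string, chunks <-chan sstChunk) error {
	files := map[int]*os.File{}
	var paths []string
	for c := range chunks {
		f, ok := files[c.fileIndex]
		if !ok {
			path := filepath.Join(dir, fmt.Sprintf("snap-%d.sst", c.fileIndex))
			var err error
			if f, err = os.Create(path); err != nil {
				return err
			}
			files[c.fileIndex] = f
			paths = append(paths, path)
		}
		if _, err := f.Write(c.data); err != nil {
			return err
		}
		if c.last {
			break
		}
	}
	for _, f := range files {
		if err := f.Close(); err != nil {
			return err
		}
	}
	// Atomic ingestion: either all SSTs become visible or none do.
	return eng.IngestExternalFiles(paths)
}
```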
@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jul 22, 2018
@tbg tbg added this to the 2.2 milestone Jul 22, 2018
@petermattis petermattis removed this from the 2.2 milestone Oct 5, 2018
@nvanbenschoten (Member) commented Jul 10, 2019

@andreimatei you just addressed this in #38529, right?

@andreimatei (Contributor)

yup
