
storage: Don't require splits to complete before running replicate queue #25047

Closed
a-robinson opened this issue Apr 24, 2018 · 5 comments
Labels
A-kv-distribution Relating to rebalancing and leasing. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@a-robinson (Contributor)

Currently, the replicate queue requires that all splits complete before doing its thing:

```go
acceptsUnsplitRanges: store.TestingKnobs().ReplicateQueueAcceptsUnsplit,
```

```go
if !repl.store.splitQueue.Disabled() && repl.needsSplitBySize() {
```
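
To make the gating concrete, here is a minimal sketch of how a check like this sits in a queue's `shouldQueue` path; the types and method names (`replica`, `NeedsSplitAtZoneBoundary`, `MissingReplicas`, etc.) are simplified stand-ins, not the actual CockroachDB code:

```go
// Sketch only: simplified stand-ins for the real replicate queue types.
type replica interface {
	// NeedsSplitBySize reports whether the range has grown past its
	// configured maximum size and is waiting on the split queue.
	NeedsSplitBySize() bool
	// NeedsSplitAtZoneBoundary reports whether the range still spans a
	// zone config boundary that it should be split at.
	NeedsSplitAtZoneBoundary() bool
	// MissingReplicas is the number of replicas the range is short of.
	MissingReplicas() int
}

type replicateQueue struct {
	acceptsUnsplitRanges bool // normally only true via testing knobs
	splitQueueDisabled   bool
}

// shouldQueue mirrors the gating described above: a range that still
// needs a split is skipped entirely, even if it is under-replicated.
func (rq *replicateQueue) shouldQueue(repl replica) (shouldQ bool, priority float64) {
	// Zone-config-boundary check (acceptsUnsplitRanges).
	if !rq.acceptsUnsplitRanges && repl.NeedsSplitAtZoneBoundary() {
		return false, 0
	}
	// Size-based check (needsSplitBySize): let the split queue finish first.
	if !rq.splitQueueDisabled && repl.NeedsSplitBySize() {
		return false, 0
	}
	// Only now is up-replication considered.
	if n := repl.MissingReplicas(); n > 0 {
		return true, float64(n)
	}
	return false, 0
}
```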

This makes sense, because it's cheaper/less risky to send snapshots for smaller ranges than for ones that are bigger than the configured limit. However, this leaves the cluster susceptible to data unavailability/loss if a range fails to ever split for some reason. If we had a perfect track record of splits never getting stuck, this wouldn't be a big deal, but we have had some problems with ranges failing to split (e.g., #24966, #25036, #24896, #23310, #21357).

I suspect we'd be better off allowing replication even when a size-based split is still needed. Do you have opinions, @nvanbenschoten or @tschottdorf?

@a-robinson a-robinson added the A-kv-distribution Relating to rebalancing and leasing. label Apr 24, 2018
@tbg (Member) commented Apr 24, 2018

Snapshots are currently pulled into memory on the receiving side, which I think is one of the original motivations for this restriction. With @nvanbenschoten's upcoming change and the possible elimination of other memory-blowup-inducing code, we may be able to drop this coupling again, and I absolutely think we should do so once it becomes safe.

@nvanbenschoten (Member)

I don't know the exact reason for introducing this restriction, but I suspect there were two main factors:

  • ranges had no upper bound on size, so if they needed a split then that means they could have been arbitrarily large.
  • large snapshots are expensive and potentially deadly because they buffer the entire range in memory.

The first factor is similar to the reason we used to prevent snapshots when the range size was larger than 2x the max_range_size. We removed that restriction after introducing backpressure that attempts to bound the size of ranges. The second factor is still valid, but will be addressed in #16954. Since we now do a better job of bounding range size and will soon remove the effective snapshot size limit, I think we can make this change shortly.
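
As a rough illustration of the two mechanisms contrasted here, under the assumption that both reduce to simple size thresholds (the function and parameter names below are hypothetical, not CockroachDB identifiers):

```go
// Old-style guard (since removed): refuse to send a snapshot for a range
// that has grown past a multiple of its configured maximum size.
func snapshotAllowed(rangeBytes, maxRangeBytes int64) bool {
	return rangeBytes < 2*maxRangeBytes // illustrative threshold only
}

// Backpressure-style replacement: once a range is well past its maximum
// size, slow or block new writes so it cannot keep growing without bound
// while it waits to split.
func shouldBackpressureWrites(rangeBytes, maxRangeBytes int64) bool {
	const backpressureMultiplier = 2 // illustrative threshold only
	return rangeBytes > backpressureMultiplier*maxRangeBytes
}
```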

Whether we should make the change is another question. I personally would love to see it because it reduces the number of components that have dependencies on each other. As we saw in #20589, these dependencies can easily turn into cycles which create deadlocks, so it's almost always best to avoid them.

@a-robinson (Contributor, Author)

> I don't know the exact reason for introducing this restriction, but I suspect there were two main factors:

Those are definitely the factors -- there's a comment in the code I linked to that explicitly lays that out. Sorry for not including the full range of relevant lines in the original post:

```go
// If the range exceeds the split threshold, let that finish first.
// Ranges must fit in memory on both sender and receiver nodes while
// being replicated. This supplements the check provided by
// acceptsUnsplitRanges, which looks at zone config boundaries rather
// than data size.
```

> Whether we should make the change is another question.

That question is easy. Once the large snapshot problem is addressed, we should certainly remove the `repl.needsSplitBySize()` check. As it stands, a stuck split can lead to under-replication, unavailability, or data loss, because we will never up-replicate the range even if nodes are lost.
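
In terms of a simplified gate, the proposal amounts to keeping the zone-config-boundary check while dropping the size-based one; a minimal sketch with hypothetical names:

```go
// Proposed gating once snapshot memory usage is bounded: keep the
// zone-config-boundary check, but no longer skip ranges that merely
// need a size-based split.
func shouldQueueForReplication(
	acceptsUnsplitRanges, needsSplitAtZoneBoundary bool, missingReplicas int,
) bool {
	if !acceptsUnsplitRanges && needsSplitAtZoneBoundary {
		return false
	}
	return missingReplicas > 0
}
```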

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 27, 2018
Fixes cockroachdb#16954.
Related to cockroachdb#25047.

This depends on the following two upstream changes to RocksDB:
- facebook/rocksdb#3778
- facebook/rocksdb#3779

The change introduces a new snapshot strategy called "SST". This strategy
streams SST files consisting of all keys in a range from the sender to the
receiver. These SST files are then atomically ingested directly into RocksDB.
An important property of the strategy is that the amount of memory required
for a receiver using the strategy is constant with respect to the size of
a range, instead of linear as it is with the KV_BATCH strategy. This will
be critical for increasing the default range size and potentially for
increasing the number of concurrent snapshots allowed per node. The
strategy also seems to significantly speed up snapshots once ranges are
above a certain size (somewhere in the single digit MBs).

This is a WIP change. Before it can be merged it needs:
- to be cleaned up a bit
- more testing (unit test, testing knobs, maybe some chaos)
- proper version handling
- heuristic tuning
- decisions on questions like compactions after ingestion

Release note: None
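
To illustrate why the receiver's memory use stays constant under a strategy like the one described in the commit message above, here is a rough sketch of a receiver that writes incoming SST chunks straight to disk and ingests them atomically; the `Engine` interface and `sstChunk` message are hypothetical placeholders, not the actual CockroachDB or RocksDB Go API:

```go
package snapshotsketch

import (
	"fmt"
	"os"
	"path/filepath"
)

// Engine is a hypothetical stand-in for the storage engine's ingestion
// API (RocksDB supports ingesting externally built SST files; the exact
// wrapper used by CockroachDB is not shown here).
type Engine interface {
	IngestExternalFiles(paths []string) error
}

// sstChunk is a hypothetical wire message carrying a piece of one SST file.
type sstChunk struct {
	fileIndex int    // which SST file this chunk belongs to
	data      []byte // raw bytes of the file
	last      bool   // true once all files have been fully sent
}

// receiveSSTSnapshot writes incoming SST chunks straight to disk and then
// ingests the resulting files atomically. Memory use is bounded by the
// chunk size rather than the range size, unlike a KV_BATCH-style strategy
// that buffers every key/value pair before applying it.
func receiveSSTSnapshot(eng Engine, dir string, chunks <-chan sstChunk) error {
	files := map[int]*os.File{}
	var paths []string
	for c := range chunks {
		f, ok := files[c.fileIndex]
		if !ok {
			path := filepath.Join(dir, fmt.Sprintf("snap-%d.sst", c.fileIndex))
			var err error
			if f, err = os.Create(path); err != nil {
				return err
			}
			files[c.fileIndex] = f
			paths = append(paths, path)
		}
		if _, err := f.Write(c.data); err != nil {
			return err
		}
		if c.last {
			break
		}
	}
	for _, f := range files {
		if err := f.Close(); err != nil {
			return err
		}
	}
	// Atomic ingestion: either all SSTs become visible or none do.
	return eng.IngestExternalFiles(paths)
}
```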
@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jul 22, 2018
@tbg tbg added this to the 2.2 milestone Jul 22, 2018
@petermattis petermattis removed this from the 2.2 milestone Oct 5, 2018
@nvanbenschoten (Member) commented Jul 10, 2019

@andreimatei you just addressed this in #38529, right?

@andreimatei (Contributor)

yup
