storage: Don't require splits to complete before running replicate queue #25047
Snapshots are currently pulled into memory on the receiving side, which I think is one of the original motivations for this restriction. With @nvanbenschoten's upcoming change and possible elimination of other memory-blowup inducing code, we may be able to drop this coupling again, and I absolutely think we should do so once it becomes safe.
I don't know the exact reason for introducing this restriction, but I suspect there were two main factors:

The first factor is similar to the reason why we used to prevent snapshots when the range size was larger than 2x the […]

Whether we should make the change is another question. I personally would love to see it, because it reduces the number of components that have dependencies on each other. As we saw in #20589, these dependencies can easily turn into cycles which create deadlocks, so it's almost always best to avoid them.
Those are definitely the factors -- there's a comment in the code I linked to that explicitly lays that out. Sorry for not including the full range of relevant lines in the original post:

cockroach/pkg/storage/replicate_queue.go
Lines 153 to 157 in 5d39b64
That question is easy. Once the large snapshot problem is addressed, we should certainly remove the […]
Fixes cockroachdb#16954. Related to cockroachdb#25047.

This depends on the following two upstream changes to RocksDB:
- facebook/rocksdb#3778
- facebook/rocksdb#3779

The change introduces a new snapshot strategy called "SST". This strategy streams SST files consisting of all keys in a range from the sender to the receiver. These SST files are then atomically ingested directly into RocksDB. An important property of the strategy is that the amount of memory required by a receiver using it is constant with respect to the size of a range, instead of linear as it is with the KV_BATCH strategy. This will be critical for increasing the default range size and potentially for increasing the number of concurrent snapshots allowed per node. The strategy also seems to significantly speed up snapshots once ranges are above a certain size (somewhere in the single-digit MBs).

This is a WIP change. Before it can be merged it needs:
- to be cleaned up a bit
- more testing (unit tests, testing knobs, maybe some chaos)
- proper version handling
- heuristic tuning
- decisions on questions like compactions after ingestion

Release note: None
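The constant-versus-linear memory claim above can be sketched in a few lines. This is a toy model, not the actual PR code: `kvBatchPeak` and `sstStreamPeak` are hypothetical names standing in for the two strategies, and the chunk size is made up. The point is only that a receiver which flushes fixed-size chunks to disk before ingestion has a peak memory footprint independent of range size, while one that buffers the whole range does not.

```go
package main

import "fmt"

// chunkSize is an illustrative streaming chunk size, not a real
// CockroachDB constant.
const chunkSize = 1 << 20 // 1 MiB

// kvBatchPeak models the KV_BATCH strategy: the receiver holds every
// key of the range in memory at once, so peak memory is linear in
// range size.
func kvBatchPeak(rangeBytes int) int { return rangeBytes }

// sstStreamPeak models the SST strategy: only one chunk is resident
// at a time before being written out and later ingested, so peak
// memory is bounded by the chunk size.
func sstStreamPeak(rangeBytes int) int {
	if rangeBytes < chunkSize {
		return rangeBytes
	}
	return chunkSize
}

func main() {
	for _, size := range []int{1 << 10, 64 << 20, 512 << 20} {
		fmt.Printf("range=%d kv_batch_peak=%d sst_peak=%d\n",
			size, kvBatchPeak(size), sstStreamPeak(size))
	}
}
```

Under this model, growing the default range size only grows the KV_BATCH column, which is why the SST strategy is described as a prerequisite for larger ranges.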
@andreimatei you just addressed this in #38529, right?

yup
Currently, the replicate queue requires that all splits complete before doing its thing:

cockroach/pkg/storage/replicate_queue.go
Line 117 in 5d39b64

cockroach/pkg/storage/replicate_queue.go
Line 152 in 5d39b64
This makes sense, because it's cheaper/less risky to send snapshots for smaller ranges than for ones that are bigger than the configured limit. However, this leaves the cluster susceptible to data unavailability/loss if a range fails to ever split for some reason. If we had a perfect track record of splits never getting stuck, this wouldn't be a big deal, but we have had some problems with ranges failing to split (e.g., #24966, #25036, #24896, #23310, #21357).
I suspect we'd be better off allowing replication even if size-based splitting was needed. Do you have opinions, @nvanbenschoten or @tschottdorf?
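To make the proposed change concrete, here is a hedged sketch of the coupling the issue describes. The names (`replica`, `needsSplitBySize`, `shouldQueue`, `allowLargeSnapshots`) are illustrative stand-ins, not the actual code in `replicate_queue.go`: today an oversized replica is skipped until its split completes; the proposal is to let the replicate queue process it anyway once large snapshots are safe to send.

```go
package main

import "fmt"

// replica is a toy stand-in for a range replica.
type replica struct {
	sizeBytes    int64
	maxRangeSize int64
}

// needsSplitBySize reports whether the range has outgrown its
// configured maximum and is waiting on a size-based split.
func (r replica) needsSplitBySize() bool {
	return r.sizeBytes > r.maxRangeSize
}

// shouldQueue mirrors the current behavior: the replicate queue
// declines to act on a replica with a pending size-based split.
// Setting allowLargeSnapshots models the proposed relaxation.
func shouldQueue(r replica, allowLargeSnapshots bool) bool {
	if r.needsSplitBySize() && !allowLargeSnapshots {
		// Current coupling: wait for the split before replicating.
		return false
	}
	return true
}

func main() {
	oversized := replica{sizeBytes: 3 << 26, maxRangeSize: 1 << 26}
	fmt.Println(shouldQueue(oversized, false)) // current behavior: skipped
	fmt.Println(shouldQueue(oversized, true))  // proposed: replicated anyway
}
```

The failure mode the issue worries about falls out of the first branch: if the split never completes (as in #24966, #25036, etc.), the replica is skipped forever and stays under-replicated.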