Range created by Split may conflict with Range created by multiraft #1644
Comments
"In this snapshot, Range1 still have key space [a,KeyMax)" in the right-bottom of the last picture should be "In this snapshot, Range1 still have key space [KeyMin, KeyMax)". |
What if we include the start and end keys of the range to a header of InternalRaftRequest, and ignore them if they refer to a range that overlaps with a range we already have? We might be able to limit this to just MsgSnap and MsgVote, but I think there's still a bit of an edge case with the MsgApp that precedes a MsgSnap. |
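A minimal sketch of the overlap test such a header would allow on the receiving side; the `keyRange` type and `overlaps` helper are illustrative names, not the actual cockroach API, and range bounds are treated as half-open intervals as elsewhere in this thread.

```go
package main

import (
	"bytes"
	"fmt"
)

// keyRange is a half-open interval [Start, End), mirroring how a range
// descriptor bounds its key space. The name is illustrative only.
type keyRange struct {
	Start, End []byte
}

// overlaps reports whether two half-open key ranges intersect. Ranges
// that merely touch at a boundary key do not conflict.
func (a keyRange) overlaps(b keyRange) bool {
	return bytes.Compare(a.Start, b.End) < 0 && bytes.Compare(b.Start, a.End) < 0
}

func main() {
	existing := keyRange{Start: []byte("a"), End: []byte("m")}
	// A message claiming [m, z) is fine; one claiming [k, z) is not.
	fmt.Println(existing.overlaps(keyRange{Start: []byte("m"), End: []byte("z")})) // false
	fmt.Println(existing.overlaps(keyRange{Start: []byte("k"), End: []byte("z")})) // true
}
```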
FYI, case 1 happens regularly in the Put acceptance test.
@bdarnell, what's the meaning of the InternalRaftMessage?
I meant Making
@bdarnell, you are right that:

```go
// RaftMessageRequest wraps a raft message.
type RaftMessageRequest struct {
	GroupID         proto.RaftID
	RangeDescriptor *proto.RangeDescriptor
	Message         raftpb.Message
}
```
Oops, yes, I meant RaftMessageRequest instead of InternalRaftCommand. Unconditionally sending MsgVoteResp is risky. If we always send
OK, we can do it like this.
@es-chow: @bdarnell and I have been discussing a more radical restructuring of things in order to eliminate the confusion we're currently dealing with in splitting and also in change-replica updates. The idea is to create a new type:

```go
type StorageKey struct {
	groupID         proto.RaftID
	replicaLogIndex uint64
	key             proto.Key
}
```

This composite key struct would be passed into the various storage/engine methods in place of a plain key. Thoughts?
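For concreteness, one hypothetical way such a composite key could be flattened into a single engine key is to prefix the user key with the group ID and replica log index; `encodeStorageKey` below is a sketch under that assumption, not the encoding the proposal actually specifies.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeStorageKey is a hypothetical flattening of the proposed composite
// key into a single engine key: 8 bytes of group ID, 8 bytes of replica
// log index, then the raw key, all big-endian so that data for one
// incarnation of a group sorts contiguously.
func encodeStorageKey(groupID uint64, replicaLogIndex uint64, key []byte) []byte {
	buf := make([]byte, 16, 16+len(key))
	binary.BigEndian.PutUint64(buf[0:8], groupID)
	binary.BigEndian.PutUint64(buf[8:16], replicaLogIndex)
	return append(buf, key...)
}

func main() {
	// Two incarnations of group 2 store the same user key under different
	// engine keys, so GC of the old incarnation cannot race with the new one.
	old := encodeStorageKey(2, 10, []byte("a"))
	cur := encodeStorageKey(2, 25, []byte("a"))
	fmt.Printf("old incarnation: %x\nnew incarnation: %x\n", old, cur)
}
```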
This StorageKey solution will be helpful for #768 as it differentiates each incarnation of the range data in the store.
There are two parts to the proposal. One is the introduction of StorageKey, to guarantee that the snapshots arriving on range 2 don't affect data owned by range 1. The second is that splitTrigger must copy all the data from range 1 to range 2, and this copy is like handling an incoming snapshot (in fact, we may just have splitTrigger call ApplySnapshot). This copy will be skipped if the range has already been initialized.
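A rough sketch of that skip-if-initialized behavior; `replica`, `applySplitTrigger`, and `copyFromParent` are placeholders, not the real cockroach types or methods.

```go
package main

import "fmt"

// replica is a placeholder for the store's per-range state.
type replica struct {
	rangeID     int64
	initialized bool
}

// applySplitTrigger sketches the proposed behavior: treat the split
// trigger like an incoming snapshot for the new right-hand range, but
// skip the copy if that range was already initialized (for example by a
// real snapshot that arrived first).
func applySplitTrigger(rhs *replica, copyFromParent func() error) error {
	if rhs.initialized {
		// The right-hand range already holds consistent data; copying
		// from the parent now would clobber newer state.
		return nil
	}
	if err := copyFromParent(); err != nil {
		return err
	}
	rhs.initialized = true
	return nil
}

func main() {
	rhs := &replica{rangeID: 2}
	err := applySplitTrigger(rhs, func() error {
		fmt.Println("copying the parent's data as if applying a snapshot")
		return nil
	})
	fmt.Println("initialized:", rhs.initialized, "err:", err)
}
```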
Could you briefly review how that solves all the racy problems in #768 (i.e. short pseudo-code)? If we remove a replica and then receive a stale message, how do we prevent re-adding the group? How will receiving a Raft message before a split has been executed on a replica be handled in that case? Will new information be sent along with the Raft messages? You say about removal that
It doesn't have to be atomic though, isn't it enough to have the tombstone write precede the Raft removal, and couldn't we arrange for that? I'm uncomfortable having ranges go from a logical slicing of the key space to something physically separated. For one, that makes the keyspace scattered (we'll never look at it without tooling, so that's probably ok, but finding a certain key just from the encoded keys would now mean a scan of everything), though that's mostly a concern of taste. But also, I've noticed that applying a Split can already take more than a second (I'm talking about applying the Raft command, nothing else - haven't looked into why), and it could be a price to pay to have to copy data around on top of that. If the
Another proposal for the removal and re-add issue: can we add a
Please help check.
For the disruptive node issue mentioned in #768 section 1, we may add a
@tschottdorf If we get a stale message we'll still re-add the group. The difference is that if a range is deleted and re-added, the new incarnation of the range will have a new StorageKey, so the GC of the old range will not race with the new one. It will no longer be possible for a deleted range incarnation to become alive again. Regarding performance, one advantage of the new scheme would be that the copying could be asynchronous with respect to the original EndTransaction call (we just need to start the rocksdb snapshot during the split trigger). Concretely, the plan is to change
I've written up a design doc for this proposal and put it on the cockroach wiki. One correction to my last comment: the copying cannot be asynchronous with respect to the
Have we made any progress on dealing with this?
Yes, the entire ReplicaID/StorageKey series of RFCs was partially motivated by this issue. We now have all the major pieces in place so we should be able to fix this soon.
I was really wondering what the fix would be, but the rejection notices in the storage key proposal were instructive:
When a range is split, followers of that range may receive a snapshot from the right-hand side of the split before they have caught up and processed the left-hand side where the split originated. This results in a "range already exists" panic. The solution is to silently drop any snapshots which would cause a conflict. They will be retried and will succeed once the left-hand range has performed its split. Fixes cockroachdb#1644.
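In outline, the conflict check behind that fix looks roughly like the following; `rangeDesc`, `store`, and `canApplySnapshot` are simplified stand-ins for the real cockroach types, shown only to illustrate the drop-and-retry idea.

```go
package main

import (
	"bytes"
	"fmt"
)

// rangeDesc is a simplified stand-in for proto.RangeDescriptor.
type rangeDesc struct {
	RangeID    int64
	Start, End []byte
}

// store tracks the descriptors of the replicas it already holds.
type store struct {
	replicas []rangeDesc
}

// canApplySnapshot reports whether a snapshot may be applied. Snapshots
// for a replica we already have are always allowed; snapshots for a new
// range are dropped if their key space overlaps an existing replica,
// because the split that carves out that space has not yet been applied
// locally. Raft will retry the snapshot later and it will then succeed.
func (s *store) canApplySnapshot(snap rangeDesc) bool {
	for _, r := range s.replicas {
		if r.RangeID == snap.RangeID {
			return true
		}
		if bytes.Compare(snap.Start, r.End) < 0 && bytes.Compare(r.Start, snap.End) < 0 {
			return false
		}
	}
	return true
}

func main() {
	s := &store{replicas: []rangeDesc{{RangeID: 1, Start: []byte("a"), End: []byte("z")}}}
	snap := rangeDesc{RangeID: 2, Start: []byte("m"), End: []byte("z")}
	// Dropped until range 1 performs its split and shrinks to [a, m).
	fmt.Println("apply snapshot for range 2:", s.canApplySnapshot(snap))
}
```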
FYI, I'm seeing the error
yeah, that's why this issue exists. #2944 should take care of it.
When a range is split, followers of that range may receive a snapshot from the right-hand side of the split before they have caught up and processed the left-hand side where the split originated. This results in a "range already exists" panic. The solution is to silently drop any snapshots which would cause a conflict. They will be retried and will succeed once the left-hand range has performed its split. Fixes cockroachdb#1644. Also check destination stopper in multiTestContext.rpcSend
For a normal range split, the in-memory Range struct is created by Split before any Raft message for it is received.
But there are corner cases where the Range created by Split may conflict with the Range created by multiraft when a Raft message is received:
2. Before an EndTransactionRequest(split_trigger) is applied, multiple Raft messages, including a MsgSnap from another node, have been received; the Range will then fail in Range.ApplySnapshot because its key space conflicts with existing key space in the store, as on node 3 in the following picture.
3. Since applying a Raft command is asynchronous with writing into storage, some Raft messages may be delayed in applying even on the Raft leader. This also causes a key space conflict, as on node 3.
One way to resolve cases 1 and 2 is to intercept MsgVote and MsgApp messages when the multiraft group cannot be found in multiraft, and to delay group and range creation until a MsgSnap is received whose RangeDescriptor does not conflict with the key space in the store; if it conflicts, the multiraft group and range are not created, and we wait for the EndTransaction(split_trigger) to finish. But this does not work for case 3.