RFC: In-memory Pessimistic Locks #77
Conversation
Can you write a section reasoning about correctness explicitly?
There is a notable difference between pipelined pessimistic lock and in-memory pessimistic lock: with pipelined pessimistic lock, a pessimistic lock that has been written is never lost, while with in-memory pessimistic lock it may be lost. Can you elaborate on that?
Compare d4c5da7 to e772d05
The new commit adds more detailed handling of leader transfer and some explanations about correctness.
- Pessimistic locks are written into a region-level lock table.
- Pessimistic locks are sent to other peers before a voluntary leader transfer.
- Pessimistic locks in the source region are sent to the target region before a region merge.
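For illustration only, a minimal sketch of what such a region-level lock table might look like; all type and field names here are assumptions, not the actual TiKV definitions:

```rust
use std::collections::BTreeMap;

// Hypothetical per-region table of in-memory pessimistic locks.
// `valid` is cleared before a voluntary leader transfer so that no new
// locks are inserted while the existing ones are being handed over.
struct PeerPessimisticLocks {
    locks: BTreeMap<Vec<u8>, PessimisticLock>, // key -> lock
    valid: bool,        // whether new in-memory locks may be inserted
    term: u64,          // leader term this table belongs to
    memory_size: usize, // total encoded size of all locks
}

// Hypothetical lock payload; the real lock carries more fields.
struct PessimisticLock {
    primary: Vec<u8>,
    start_ts: u64,
    ttl: u64,
    min_commit_ts: u64,
}
```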
How about a TiKV store-level failure?
What if the leader encounters network isolation?
The behavior under these store-level failures is no different from using pipelined pessimistic lock. Pessimistic transactions are not guaranteed to commit successfully, but they can commit if no write conflicts have happened.
Is pipelined pessimistic lock off by default? So this feature will not be enabled by default either?
Pipelined pessimistic lock is enabled by default since 5.0. So, this feature will be enabled by default as well.
By default, a Raft message has a size limit of 1 MiB. We will guarantee that the total size of in-memory pessimistic locks in a single region will not exceed the limit. This will be discussed later.
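As an illustration of that bound, here is a hedged sketch continuing the illustrative `PeerPessimisticLocks` type from the earlier sketch; the constant and method names are assumptions:

```rust
const RAFT_MSG_MAX_SIZE: usize = 1024 * 1024; // 1 MiB, the default Raft message size limit

impl PeerPessimisticLocks {
    // Record a pessimistic lock in memory only if the table stays within the
    // Raft message size limit; otherwise return false so the caller can fall
    // back to proposing the lock through Raft, as pipelined pessimistic
    // locking does.
    fn try_insert(&mut self, key: Vec<u8>, lock: PessimisticLock, encoded_size: usize) -> bool {
        if !self.valid || self.memory_size + encoded_size > RAFT_MSG_MAX_SIZE {
            return false;
        }
        self.memory_size += encoded_size;
        self.locks.insert(key, lock);
        true
    }
}
```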
If the transfer leader message is lost or rejected, we need to revert `valid` to `true`. But it is not possible for the leader to know. So, we have to use a timeout to implement it. That means, we can revert `valid` to `true` if the leader is still not transferred after some period. And if the leader receives the `MsgTransferLeader` response from the follower after the timeout, it should ignore the message and not trigger a leader transfer.
I cannot understand. Because Raft will only append and apply log entries in the order they are sent, we just need to propose the `MsgTransferLeader` entry after the `lock` entries. Just like this:
// Sketch: propose all in-memory locks before proposing the transfer leader
// message, so the new leader applies them first (Raft applies entries in order).
fn propose_transfer_leader(&mut self, msg: RaftMessage) {
    if !self.locks.is_empty() {
        let mut lock_msg = RaftMessage::new();
        for lock in self.locks.take() {
            lock_msg.append(lock);
            // Keep each proposal under the 1 MiB Raft message size limit.
            if lock_msg.size() > 1024 * 1024 {
                self.propose(lock_msg);
                lock_msg = RaftMessage::new();
            }
        }
        if !lock_msg.is_empty() {
            self.propose(lock_msg);
        }
    }
    self.propose(msg);
}
Well, my initial thought was to not change the transfer leader procedure. Using a proposal is simpler.
Another difference is the impact of leader transfer on ongoing pessimistic transactions, such as the fallback processing described in this document. If the impact is acceptable, we could use the simpler way and make it a whole; that is easier to maintain.
The timeout mechanism is still needed. The leader transfer timeout is the election timeout in Raft. We can check it on each tick.
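A rough sketch of that tick-based check, under assumed names (`TransferLeaderState`, `on_tick`) that are not the actual raftstore API:

```rust
use std::time::{Duration, Instant};

// Hypothetical per-peer state recorded when a leader transfer starts.
struct TransferLeaderState {
    // Set when `valid` is flipped to false and the locks are proposed.
    transfer_leader_start: Option<Instant>,
    election_timeout: Duration,
}

impl TransferLeaderState {
    // Called on every raftstore tick. If the peer is still leader after the
    // election timeout, assume the transfer failed: revert `valid` to true
    // and ignore any late MsgTransferLeader response.
    fn on_tick(&mut self, still_leader: bool, locks_valid: &mut bool) {
        if let Some(start) = self.transfer_leader_start {
            if still_leader && start.elapsed() > self.election_timeout {
                *locks_valid = true;
                self.transfer_leader_start = None;
            }
        }
    }
}
```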
LGTM
LGTM
- Reading exactly the key whose pessimistic lock is lost is not affected, because the pessimistic lock is totally invisible to the reader.
- If a secondary 2PC lock is read while the primary lock is still in the pessimistic stage, the reader will call `CheckTxnStatus` on the primary lock:
  - If the primary lock exists, `min_commit_ts` of the lock is advanced, so the reader will not be blocked. **This operation must be replicated through Raft.** Otherwise, if the primary lock is lost, we may allow a smaller commit TS, breaking snapshot isolation.
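To make the reader-side path above concrete, here is a hedged sketch; the function names and the proposal stub are assumptions, and it reuses the illustrative `PessimisticLock` type from the earlier sketch:

```rust
// Hypothetical error type and Raft proposal stub, only for illustration.
struct Error;

fn propose_lock_update_through_raft(_lock: &PessimisticLock) -> Result<(), Error> {
    Ok(()) // stands in for proposing a command that rewrites the lock in the lock CF
}

// Sketch of CheckTxnStatus on a primary lock that is still pessimistic.
fn check_txn_status(primary: &mut PessimisticLock, caller_start_ts: u64) -> Result<(), Error> {
    if primary.min_commit_ts <= caller_start_ts {
        // Advance min_commit_ts so the reader is not blocked. The update must
        // be replicated through Raft: if it lived only in the in-memory table
        // and the lock were lost, the transaction could later commit with a
        // smaller ts and break snapshot isolation.
        primary.min_commit_ts = caller_start_ts + 1;
        propose_lock_update_through_raft(primary)?;
    }
    Ok(())
}
```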
> This operation must be replicated through Raft.

Maybe this can get some inspiration from CRDB... (every read cmd needing to go through Raft is a little expensive.)
It seems they also have a similar push-txn mechanism, but they only update the timestamp in an in-memory tsCache instead of updating a field in the lock (write intent), and the new leader also cannot get the old pushed timestamp info because the memory is lost.
In the previous version, it seems they chose to update the tsCache low-water ts to the new lease time to make conflicting write txns report an error and retry: https://github.com/cockroachdb/cockroach/blob/df826cdf700a79948d083827ca67967016a1a1af/pkg/kv/kvserver/replica_proposal.go#L383 (it's unfriendly to the user but safe.)
But it seems that this year they made some new improvements (cockroachdb/cockroach@a7472e3) to transfer the ts cache during leader transfer and range merge, so there may be less chance of forcing the user to retry in the commit phase. (PS: I haven't read the new code carefully, so I may be wrong 😄)
This does not mean every read cmd will go through Raft. It is rare in practice to update the `min_commit_ts` in the pessimistic lock. Keys are prewritten in parallel; modifying the pessimistic lock only happens when some secondary keys are prewritten while the primary key is not yet prewritten.
I haven't understood the new CRDB approach yet. We will lose the pessimistic locks under unexpected crashes. But then, the ts cache cannot be transferred either... Currently, I don't get how it will decrease the failure rate...
> This does not mean every read cmd will go through Raft.

Thanks for explaining 😄

> I don't get how it will decrease the failure rate

I mean it will help decrease the failure rate for "transfer leader": it seems that in the previous version, a leader transfer would also set the low-water mark to newLease.Start, just like a crash. But for real "unexpected crashes", the PR should bring no improvement to the failure rate.
- A different transaction can resolve the pessimistic lock when it encounters the pessimistic lock in `AcquirePessimisticLock` or `Prewrite`. So, if the lock is lost, `PessimisticRollback` will find no lock and do nothing. No change is needed.
- `TxnHeartBeat` will fail after the loss of pessimistic locks. But it will not affect correctness.
### Lock migration
When TiKV adds a learner for this region, it will send a snapshot to the learner node. But how can we keep consistency between the lock CF and the write CF?
I mean that this memory structure needs to provide a snapshot-like result, so that no transaction will affect the result while we are scanning the data to send to the learner node.
Pessimistic locks are totally ignored by all read requests. So, we only need to care about the leader.
* RFC: In-memory Pessimistic Locks
* clarify where to delete memory locks after writing a lock CF KV
* Elaborate transfer leader handlings and add correctness section
* add an addition step of proposing pessimistic locks before transferring leader
* clarify about new leaders of region split
* Add tracking issue link
* update design and correctness analysis of lock migration
* add configurations
This is a more aggressive optimization than pipelined pessimistic lock. It tries to keep pessimistic locks only in memory and not replicate them through Raft, while not decreasing the success rate of pessimistic transactions.
According to preliminary tests, this optimization reduces disk write flow and raftstore CPU by about 20%.
cc @HunDunDM @youjiali1995 @gengliqi @cfzjywxk @MyonKeminta @longfangsong