
RFC: In-memory Pessimistic Locks #77

Merged 8 commits into tikv:master on Apr 6, 2022

Conversation

sticnarf (Contributor):

This is a more aggressive optimization than pipelined pessimistic lock. It keeps pessimistic locks only in memory and does not replicate them through Raft, while not decreasing the success rate of pessimistic transactions.

According to preliminary tests, this optimization reduces disk write flow and raftstore CPU by about 20%.

cc @HunDunDM @youjiali1995 @gengliqi @cfzjywxk @MyonKeminta @longfangsong

@BusyJay (Member) commented on Oct 14, 2021:

Can you write a section reasoning about correctness explicitly?

@youjiali1995 left a comment:

There is a notable difference between pipelined pessimistic lock and in-memory pessimistic lock: with pipelined pessimistic lock, a pessimistic lock is never lost once it is written, but with in-memory pessimistic lock it may be lost. Can you elaborate on that?

@sticnarf (Contributor, Author):

The new commit adds more detailed handling of leader transfer and some explanations about correctness.


> - Pessimistic locks are written into a region-level lock table.
> - Pessimistic locks are sent to other peers before a voluntary leader transfer.
> - Pessimistic locks in the source region are sent to the target region before a region merge.
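As a rough illustration of the region-level lock table described in the excerpt above, here is a minimal sketch. The type name, fields, fallback behavior, and the 1 MiB bound are illustrative assumptions, not the actual TiKV implementation:

```rust
use std::collections::BTreeMap;

/// Illustrative region-level lock table, kept only on the leader.
struct PeerPessimisticLocks {
    /// Key -> encoded pessimistic lock.
    locks: BTreeMap<Vec<u8>, Vec<u8>>,
    /// Running total of the encoded size, so the whole table can be shipped
    /// in a single Raft message before a leader transfer or region merge.
    memory_size: usize,
    /// Cleared while the locks are being handed over (leader transfer,
    /// merge); callers are then assumed to fall back to writing locks
    /// through Raft instead of keeping them in memory.
    valid: bool,
}

/// Keep the table within the default Raft message size limit (1 MiB).
const MEMORY_LIMIT: usize = 1024 * 1024;

impl PeerPessimisticLocks {
    /// Returns false if the lock cannot be kept in memory; the caller is
    /// expected to fall back to proposing the lock through Raft (assumed).
    fn insert(&mut self, key: Vec<u8>, lock: Vec<u8>) -> bool {
        let added = key.len() + lock.len();
        if !self.valid || self.memory_size + added > MEMORY_LIMIT {
            return false;
        }
        self.memory_size += added;
        self.locks.insert(key, lock);
        true
    }
}
```

Since readers ignore pessimistic locks entirely, only the leader needs this table; followers only receive its contents when it is handed over for a leader transfer or region merge.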
Member:

How about TiKV store-level failures?

Member:

How about the case where the leader encounters network isolation?

@sticnarf (Contributor, Author):

The behavior under these store-level failures is no different from using pipelined pessimistic lock. The pessimistic transactions are not guaranteed to commit successfully, but they can still commit if no write conflicts have happened.

Member:

Pipelined pessimistic lock is off by default? So this feature will not be enabled by default either?

@sticnarf (Contributor, Author):

Pipelined pessimistic lock has been enabled by default since 5.0. So, this feature will be enabled by default as well.


> By default, a Raft message has a size limit of 1 MiB. We will guarantee that the total size of in-memory pessimistic locks in a single region will not exceed the limit. This will be discussed later.
>
> If the transfer leader message is lost or rejected, we need to revert `valid` to `true`. But it is not possible for the leader to know. So, we have to use a timeout to implement it. That means, we can revert `valid` to `true` if the leader is still not transferred after some period. And if the leader receives the `MsgTransferLeader` response from the follower after the timeout, it should ignore the message and not trigger a leader transfer.


I don't quite understand. Because Raft only appends and applies logs in the order they are sent, we just need to propose the `MsgTransferLeader` entry after the locks. Like this:

```rust
fn propose_transfer_leader(&mut self, msg: RaftMessage) {
    if !self.locks.is_empty() {
        // Propose the in-memory pessimistic locks first, splitting them
        // into batches that stay under the 1 MiB Raft message size limit.
        let mut lock_msg = RaftMessage::new();
        for lock in self.locks.take() {
            lock_msg.append(lock);
            if lock_msg.size() > 1024 * 1024 {   // 1 MiB
                self.propose(lock_msg);
                lock_msg = RaftMessage::new();
            }
        }
        if !lock_msg.is_empty() {
            self.propose(lock_msg);
        }
    }
    // The transfer leader message is proposed after all lock proposals.
    self.propose(msg);
}
```

@sticnarf (Contributor, Author) commented on Nov 22, 2021:

Well, my initial thought was to not change the transfer leader procedure. Using a proposal is simpler.


Another difference is the impact of leader transfer on ongoing pessimistic transactions, such as the fallback processing described in this document. If the impact is acceptable, we could use the simpler way and make it a whole; that is easier to maintain.


The timeout mechanism is still needed. The leader transfer timeout is the election timeout in Raft. We can check it on tick.
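A rough sketch of such a tick-based fallback follows; the struct, field, and function names are hypothetical and only illustrate the idea of reverting `valid` after an election-timeout-like period, not the real raftstore tick handling:

```rust
/// Hypothetical bookkeeping for a proposed leader transfer.
struct TransferLeaderState {
    /// Tick at which the locks were handed over and `valid` was cleared.
    started_at_tick: Option<u64>,
    /// Roughly the Raft election timeout, measured in ticks.
    timeout_ticks: u64,
}

/// Illustrative per-tick check (not the actual raftstore hook).
fn on_tick(now: u64, state: &mut TransferLeaderState, locks_valid: &mut bool, still_leader: bool) {
    if let Some(start) = state.started_at_tick {
        if still_leader && now.saturating_sub(start) >= state.timeout_ticks {
            // The transfer leader message was probably lost or rejected:
            // start accepting in-memory pessimistic locks again and forget
            // the pending transfer, so that a late MsgTransferLeader
            // response from the follower is ignored.
            *locks_valid = true;
            state.started_at_tick = None;
        }
    }
}
```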

@youjiali1995 left a comment:

LGTM

@cfzjywxk left a comment:

LGTM


> - Reading exactly the key whose pessimistic lock is lost is not affected, because the pessimistic lock is totally invisible to the reader.
> - If a secondary 2PC lock is read while the primary lock is still in the pessimistic stage, the reader will call `CheckTxnStatus` on the primary lock:
>   - If the primary lock exists, `min_commit_ts` of the lock is advanced, so the reader will not be blocked. **This operation must be replicated through Raft.** Otherwise, if the primary lock is lost, we may allow a smaller commit TS, breaking snapshot isolation.
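To make the bolded requirement concrete, here is a minimal sketch; it is not TiKV's actual `CheckTxnStatus` code, and the names and the push rule are illustrative assumptions:

```rust
/// Illustrative outcome of checking the primary pessimistic lock.
enum MinCommitTsPush {
    /// Nothing to change, no proposal needed.
    Unchanged,
    /// Advance min_commit_ts and replicate the updated lock through Raft so
    /// that a new leader, which has lost the in-memory lock table, can never
    /// allow a commit_ts below the pushed value.
    Replicate { new_min_commit_ts: u64 },
}

fn push_min_commit_ts(lock_min_commit_ts: u64, caller_start_ts: u64) -> MinCommitTsPush {
    if caller_start_ts >= lock_min_commit_ts {
        // Push past the reader's start_ts so the reader is not blocked,
        // and record the change durably instead of only in memory.
        MinCommitTsPush::Replicate { new_min_commit_ts: caller_start_ts + 1 }
    } else {
        MinCommitTsPush::Unchanged
    }
}
```

As the discussion below points out, this replicated path is rare in practice.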


> This operation must be replicated through Raft.

Maybe this can get some inspiration from CRDB... (having every read cmd go through Raft is a little expensive).

It seems they also have similar push-txn mechanics, but they only update the timestamp in the in-memory tsCache instead of updating a field in the lock (write intent), so the new leader also cannot get the old pushed timestamp info because the memory is lost.

In the previous version, it seems they chose to update the tsCache low-water ts to the new lease time to let conflicting write txns report an error and retry: https://github.com/cockroachdb/cockroach/blob/df826cdf700a79948d083827ca67967016a1a1af/pkg/kv/kvserver/replica_proposal.go#L383 (it's unfriendly to the user but safe).

But it seems this year they made some new improvements (cockroachdb/cockroach@a7472e3) to transfer the ts cache during leader transfer and range merge, so there may be less chance of forcing the user to retry in the commit phase. (PS: I haven't read this new code carefully, so I may be wrong 😄)

@sticnarf (Contributor, Author):

This does not mean every read cmd will go through Raft. It is rare in practice to update the min_commit_ts in the pessimistic lock. Keys are prewritten in parallel, so modifying the pessimistic lock only happens when secondary keys have been prewritten while the primary key has not.

I haven't understood the new CRDB approach yet. We will lose the pessimistic lock under unexpected crashes, but then the ts cache cannot be transferred either... Currently, I don't get how it will decrease the failure rate...


> This does not mean every read cmd will go through Raft.

Thanks for explaining 😄

> I don't get how it will decrease the failure rate

I mean it will help decrease the failure rate for "transfer leader": it seems that in the previous version, transferring the leader would also set the low-water mark to newLease.Start, just like a crash.

But for real "unexpected crashes", the PR should bring no improvement to the failure rate.

> - A different transaction can resolve the pessimistic lock when it encounters the pessimistic lock in `AcquirePessimisticLock` or `Prewrite`. So, if the lock is lost, `PessimisticRollback` will find no lock and do nothing. No change is needed.
> - `TxnHeartBeat` will fail after the loss of pessimistic locks. But it will not affect correctness.
>
> ### Lock migration


When TiKV adds a learner for this region, it will send a snapshot to the learner node. But how can we keep consistency between the lock CF and the write CF?


I mean that this memory structure needs to give a snapshot-like result, so that no transaction can affect the result while data is being scanned and sent to the learner node.

@sticnarf (Contributor, Author):

Pessimistic locks are totally ignored by all read requests. So, we only need to care about the leader.

@sticnarf merged commit 2334fd4 into tikv:master on Apr 6, 2022.
pingyu pushed a commit to pingyu/tikv-rfcs that referenced this pull request on Nov 4, 2022:

* RFC: In-memory Pessimistic Locks
* clarify where to delete memory locks after writing a lock CF KV
* Elaborate transfer leader handlings and add correctness section
* add an addition step of proposing pessimistic locks before transferring leader
* clarify about new leaders of region split
* Add tracking issue link
* update design and correctness analysis of lock migration
* add configurations

Signed-off-by: Yilin Chen <sticnarf@gmail.com>
Signed-off-by: pingyu <yuping@pingcap.com>
pingyu added a commit that referenced this pull request on Nov 8, 2022, batching several RFC updates, including:

* RFC: RawKV Batch Export (#76)
* rawkv bulk load: add description for pause merge (#74)
* RFC: Improve the Scalability of TSO Service (#78)
* make region size dynamic (#82)
* RFC: In-memory Pessimistic Locks (#77)
* propose online unsafe recovery (#91)
* physical isolation between region (#93)