Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: range merge can result in serializability violation if applied through Raft snapshot on leaseholder #60520

Closed
nvanbenschoten opened this issue Feb 12, 2021 · 0 comments · Fixed by #60521
Assignees
Labels
A-kv-distribution Relating to rebalancing and leasing. A-kv-replication Relating to Raft, consensus, and coordination. A-kv-transactions Relating to MVCC and the transactional model. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-0-visible-logical-error Database stores inconsistent data in some cases, or queries return invalid results silently.

Comments

@nvanbenschoten
Copy link
Member

We prevent the post-merged range from serving writes below any reads previously served by the RHS by bumping the LHS’s timestamp cache to the RHS’s freeze time. That all works great when the LHS’s leaseholder applies the range merge trigger through normal Raft log application. But what if the LHS’s leaseholder applies the range merge trigger through a Raft snapshot? In these cases where the RHS is "subsumed" during the snapshot, the LHS's leaseholder does not bump its timestamp cache. This allows the post-merged range to invalidate reads served by the pre-merge RHS range.

For context, at one point, we had a bug related to learning about becoming the leaseholder through a snapshot, and so we now handle that case in Replica.applySnapshot by calling leasePostApply.

The reason why we've likely never actually seen this before is that it is extremely difficult to get a leaseholder to apply a range merge through a Raft snapshot. Here is a comment from a Slack thread that explains what is needed to create this situation:

Getting a LHS leaseholder to apply a merge through a snapshot is not easy. It requires a leader/leaseholder split, of course. But then it also requires the leaseholder to be partitioned from the leader and then the log to be truncated ahead of the leaseholder’s log. But for all intents and purposes, the leaseholder is the only one that ever makes the decision to truncate. So we need to hit one of the few edge cases where the log is truncated ahead of the leaseholder’s last log index.

But that’s not all. The LHS leaseholder is intimately involved in the merge process. So it needs to have been part of the quorum and properly applying log entries right up to the point of the merge trigger. Otherwise, the merge txn would have gotten stuck. But if it somehow gets to the merge trigger, forwards the proposal to the leader, gets partitioned, the log gets truncated, then it receives a snapshot that includes the merge, it won’t properly bumps its timestamp cache (if the LHS and RHS leaseholders were on different nodes) and it can then serve writes that rewrite history.

@nvanbenschoten nvanbenschoten added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-replication Relating to Raft, consensus, and coordination. A-kv-distribution Relating to rebalancing and leasing. A-kv-transactions Relating to MVCC and the transactional model. S-0-visible-logical-error Database stores inconsistent data in some cases, or queries return invalid results silently. labels Feb 12, 2021
@nvanbenschoten nvanbenschoten self-assigned this Feb 12, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 12, 2021
Fixes cockroachdb#57688.
Fixes cockroachdb#59679.
Fixes cockroachdb#60520.

This commit introduces new logic to ship summaries of a leaseholders
timestamp cache through lease transfers and range merges. For lease
transfers, the read summary is sent from the outgoing leaseholder to the
incoming leaseholder. For range merges, the read summary is sent from
the right-hand side leaseholder (through the SubsumeResponse), to the
left-hand side leaseholder (through the MergeTrigger).

The read summaries perform the role of the lease start time and merge
freeze time used to play for lease transfers and range merges,
respectively - the summaries instruct the post-operation leaseholder on
how to update its timestamp cache to ensure that no future writes are
allowed to invalidate prior reads.

Read summaries have two distinct advantages over the old approach:
1. they can transfer a higher-resolution snapshot of the reads on the
    range through a lease transfer, to make the lease transfers less
    disruptive to writes because the timestamp cache won't be bumped as
    high. This avoids transaction aborts and retries after lease
    transfers and merges.
2. they can transfer information about reads with synthetic timestamps,
    which are not otherwise captured by the new lease's start time.
    Because of this, they are needed for correctness on `global_read`
    ranges, which can serve reads in the future.

This commit does not realize the first benefit, because it uses very
low-resolution read summaries. However, it sets up the infrastructure
that will allow us to realize the benefit in the future by capturing and
shipping higher-resolution read summaries. The commit does realize the
second benefit, as it fixes correctness issues around future time reads.

----

The commit also fixes a related bug that was revealed during the
development of this patch. As explained in cockroachdb#60520, it was possible for a
range merge to be applied to the leaseholder of the LHS of the merge
through a Raft snapshot. In such cases, we were not properly updating
the leaseholder's timestamp cache to reflect the reads served on the RHS
range. This could allow the post-merged range to invalidate reads served
by the pre-merge RHS range.

This commit fixes this bug using the new read summary infrastructure.
Merge triggers now write to the left-hand side's prior read summary with
a read summary gathered from the right-hand side during subsumption.
Later, upon ingesting a Raft snapshot, we check if we subsumed any
replicas and if we are the leaseholder. If both of those conditions are
true, we forward the replica's timestamp cache to the read summary on
the range. Since this read summary must have been updated by the merge
trigger, it will include all reads served on the pre-merge RHS range.

----

Release note (bug fix): Fixes a very rare, possible impossible in
practice, bug where a range merge that applied through a Raft snapshot
on the left-hand side range's leaseholder could allow that leaseholder
to serve writes that invalidated reads from before the merge on the
right-hand side.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 16, 2021
Fixes cockroachdb#57688.
Fixes cockroachdb#59679.
Fixes cockroachdb#60520.

This commit introduces new logic to ship summaries of a leaseholders
timestamp cache through lease transfers and range merges. For lease
transfers, the read summary is sent from the outgoing leaseholder to the
incoming leaseholder. For range merges, the read summary is sent from
the right-hand side leaseholder (through the SubsumeResponse), to the
left-hand side leaseholder (through the MergeTrigger).

The read summaries perform the role of the lease start time and merge
freeze time used to play for lease transfers and range merges,
respectively - the summaries instruct the post-operation leaseholder on
how to update its timestamp cache to ensure that no future writes are
allowed to invalidate prior reads.

Read summaries have two distinct advantages over the old approach:
1. they can transfer a higher-resolution snapshot of the reads on the
    range through a lease transfer, to make the lease transfers less
    disruptive to writes because the timestamp cache won't be bumped as
    high. This avoids transaction aborts and retries after lease
    transfers and merges.
2. they can transfer information about reads with synthetic timestamps,
    which are not otherwise captured by the new lease's start time.
    Because of this, they are needed for correctness on `global_read`
    ranges, which can serve reads in the future.

This commit does not realize the first benefit, because it uses very
low-resolution read summaries. However, it sets up the infrastructure
that will allow us to realize the benefit in the future by capturing and
shipping higher-resolution read summaries. The commit does realize the
second benefit, as it fixes correctness issues around future time reads.

----

The commit also fixes a related bug that was revealed during the
development of this patch. As explained in cockroachdb#60520, it was possible for a
range merge to be applied to the leaseholder of the LHS of the merge
through a Raft snapshot. In such cases, we were not properly updating
the leaseholder's timestamp cache to reflect the reads served on the RHS
range. This could allow the post-merged range to invalidate reads served
by the pre-merge RHS range.

This commit fixes this bug using the new read summary infrastructure.
Merge triggers now write to the left-hand side's prior read summary with
a read summary gathered from the right-hand side during subsumption.
Later, upon ingesting a Raft snapshot, we check if we subsumed any
replicas and if we are the leaseholder. If both of those conditions are
true, we forward the replica's timestamp cache to the read summary on
the range. Since this read summary must have been updated by the merge
trigger, it will include all reads served on the pre-merge RHS range.

----

Release note (bug fix): Fixes a very rare, possible impossible in
practice, bug where a range merge that applied through a Raft snapshot
on the left-hand side range's leaseholder could allow that leaseholder
to serve writes that invalidated reads from before the merge on the
right-hand side.
craig bot pushed a commit that referenced this issue Feb 24, 2021
60521: kv: ship timestamp cache summary during lease transfers and range merges r=nvanbenschoten a=nvanbenschoten

Fixes #57688.
Fixes #59679.
Fixes #60520.

This commit introduces new logic to ship summaries of a leaseholders timestamp cache through lease transfers and range merges. For lease transfers, the read summary is sent from the outgoing leaseholder to the incoming leaseholder. For range merges, the read summary is sent from the right-hand side leaseholder (through the SubsumeResponse), to the left-hand side leaseholder (through the MergeTrigger).

The read summaries perform the role of the lease start time and merge freeze time used to play for lease transfers and range merges, respectively - the summaries instruct the post-operation leaseholder on how to update its timestamp cache to ensure that no future writes are allowed to invalidate prior reads.

Read summaries have two distinct advantages over the old approach:
1. they can transfer a higher-resolution snapshot of the reads on the range through a lease transfer, to make the lease transfers less disruptive to writes because the timestamp cache won't be bumped as high. This avoids transaction aborts and retries after lease transfers and merges.
2. they can transfer information about reads with synthetic timestamps, which are not otherwise captured by the new lease's start time. Because of this, they are needed for correctness on `global_read` ranges, which can serve reads in the future.

This commit does not realize the first benefit, because it uses very low-resolution read summaries. However, it sets up the infrastructure that will allow us to realize the benefit in the future by capturing and shipping higher-resolution read summaries. The commit does realize the second benefit, as it fixes correctness issues around future time reads.

----

The commit also fixes a related bug that was revealed during the development of this patch. As explained in #60520, it was possible for a range merge to be applied to the leaseholder of the LHS of the merge through a Raft snapshot. In such cases, we were not properly updating the leaseholder's timestamp cache to reflect the reads served on the RHS range. This could allow the post-merged range to invalidate reads served by the pre-merge RHS range.

This commit fixes this bug using the new read summary infrastructure. Merge triggers now write to the left-hand side's prior read summary with a read summary gathered from the right-hand side during subsumption. Later, upon ingesting a Raft snapshot, we check if we subsumed any replicas and if we are the leaseholder. If both of those conditions are true, we forward the replica's timestamp cache to the read summary on the range. Since this read summary must have been updated by the merge trigger, it will include all reads served on the pre-merge RHS range.

The existence of this bug was verified by a new variant of `TestStoreRangeMergeTimestampCache`, which only passes with the rest of this commit.

----

Release note (bug fix): Fixes a very rare, possible impossible in practice, bug where a range merge that applied through a Raft snapshot on the left-hand side range's leaseholder could allow that leaseholder to serve writes that invalidated reads from before the merge on the right-hand side.

Release justification: bug fix

61013: sql/pgwire: fix encoding of int4 and bpchar in tuples  r=otan a=rafiss

fixes #58069

This includes 3 commits:

### sql/pgwire: make PGTest pass against PostgresSQL

There are a few minor differences: dataTypeSize is different for tuples,
and PG does not show seconds offset for times if it is zero.

### sql/pgwire: fix binary encoding of ints in tuples

### sql/pgwire: fix text encoding of bpchar in tuples

### sql/pgwire: fix encoding of collated strings

Release note (bug fix): Integers inside of tuples were not being encoded
properly when using the binary format for retrieving data. This is now
fixed, and the proper integer width is reported.

Release note (bug fix): Blank-padded chars (e.g. CHAR(3)) were not being
encoded correctly when returning results to the client. Now they
correctly include blank-padding when appropriate.

Release note (bug fix): Collated strings were not encoded with the
proper type OID when sending results to the client if the OID was
for the `char` type. This is now fixed.


61027: authors: add Rachit Srivastava to authors r=rachitgsrivastava a=rachitgsrivastava

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Rachit Srivastava <rachit@cockroachlabs.com>
@craig craig bot closed this as completed in a7472e3 Feb 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-distribution Relating to rebalancing and leasing. A-kv-replication Relating to Raft, consensus, and coordination. A-kv-transactions Relating to MVCC and the transactional model. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-0-visible-logical-error Database stores inconsistent data in some cases, or queries return invalid results silently.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant