Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: disable observed timestamp captured before range merge from applying to merged range #73292

Closed
nvanbenschoten opened this issue Nov 30, 2021 · 2 comments · Fixed by #83345
Assignees
Labels
A-kv-transactions Relating to MVCC and the transactional model. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team

Comments

@nvanbenschoten
Copy link
Member

nvanbenschoten commented Nov 30, 2021

85d4627 introduced logic to handle the where after a lease changes, observed timestamps captured on the incoming leaseholder's node cannot be used to ignore uncertainty for data written by the former leaseholder. This logic currently lives here:

// The lease's start time is also taken into consideration to ensure that a
// lease transfer does not result in the observed timestamp for this node being
// inapplicable to data previously written by the former leaseholder. To wit:
//
// 1. put(k on leaseholder n1), gateway chooses t=1.0
// 2. begin; read(unrelated key on n2); gateway chooses t=0.98
// 3. pick up observed timestamp for n2 of t=0.99
// 4. n1 transfers lease for range with k to n2 @ t=1.1
// 5. read(k) on leaseholder n2 at ReadTimestamp=0.98 should get
// ReadWithinUncertaintyInterval because of the write in step 1, so
// even though we observed n2's timestamp in step 3 we must expand
// the uncertainty interval to the lease's start time, which is
// guaranteed to be greater than any write which occurred under
// the previous leaseholder.

And is tested in TestRangeLocalUncertaintyLimitAfterNewLease.

While working on some refactors to fix #58459 and eventually #36431, I noticed a similar case after a range merge, where observed timestamps from before the merge cannot be used to ignore uncertainty for data written by the former RHS range. A comment describing this was added to #73244:

// A similar hazard applies to range merges, where we cannot apply observed
// timestamps captured from the leaseholder of the left-hand side of the merge
// to data written on the right-hand side of the merge, even after the merge has
// completed. To wit:
//
// 1. put(k2 on n2, r2); gateway chooses t=1.0
// 2. begin; read(k on n1, r1); gateway chooses t=0.98
// 3. pick up observed timestamp for n1 of t=0.99
// 4. r1 merged right-hand neighbor r2 @ t=1.1
// 5. read(k2) on joint range at ReadTimestamp=0.98 should get
//    ReadWithinUncertaintyInterval because of the write in step 1, so
//    even though we observed n1's timestamp in step 3 we must expand
//    the uncertainty interval to the range merge's freeze time, which
//    is guaranteed to be greater than any write which occurred on the
//    right-hand side.

We currently have no protection here, and so we are susceptible to a stale read in this case.

Epic CRDB-1514

Jira issue: CRDB-11522

@nvanbenschoten nvanbenschoten added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-transactions Relating to MVCC and the transactional model. labels Nov 30, 2021
@nvanbenschoten nvanbenschoten self-assigned this Nov 30, 2021
@blathers-crl blathers-crl bot added the T-kv KV Team label Nov 30, 2021
@andrewbaptist andrewbaptist self-assigned this May 26, 2022
@andrewbaptist
Copy link
Collaborator

There may be a few different ways to solve this, and all have some drawbacks:

  1. Force a lease transfer prior to a merge so that the merging ranges have aligned leaseholders. If the transfer lease code correctly prevents stale reads, then merging after the lease transfer will also prevent stale reads as the local clock on both sides of the merge will be synchronized since they are the same node. There is a note (that I don't fully understand) in the merge document where it states this is infeasible due to possible failures of leaseholders and leaseholder transfers. Assuming aligned leaseholders are still infeasible, this option is also infeasible.

  2. Force a reset of all observed timestamps due to a merge operation on both sides of the merge by updating the lease on the LHS as part of the merge operation. The merge operation will necessarily align the clocks after the merge. The impact of this is that some transactions will fail with a ReadWithUncertaintyInterval that should not have, however as this is a somewhat rare operation, it is likely acceptable to have extra failures here. I'm not sure if this could be made to work without delaying the merge until the previous lease end time (which might be unacceptable)

  3. Instead of storing the ObservedTimestamps map of <NodeID, timestamp> in the Transaction, store a map from <Key, (Lease, timestamp)>. This would correctly handle both lease transfers and merges as the store could determine if the Lease it was originally issued under is the same. This is the "more precise" information and the stores can correctly map the leases to themselves to determine if there is an uncertainty window. The drawback of this approach is that it could make the ObservedTimestamps map much larger. This could be done with Spans, like the Read timestamp cache.

  4. Include additional information as part of the AdminRelocateRange response message that would allow the LHS to correctly process transactions that had an observed. TBD - what information would need to be transferred?

@nvanbenschoten
Copy link
Member Author

Force a lease transfer prior to a merge so that the merging ranges have aligned leaseholders. If the transfer lease code correctly prevents stale reads, then merging after the lease transfer will also prevent stale reads as the local clock on both sides of the merge will be synchronized since they are the same node.

For this to work given the current lease transfer protection, we would need the LHS of the merge to both be:

  1. collocated with the RHS
  2. have a lease start timestamp above the lease start timestamp of the RHS

Collocating the LHS to the RHS's node is necessary (as opposed to the other way around) because we'd be trying to piggyback the merge protection on top of the lease transfer protection, which is provided by the lease start time. The merge's LHS subsumes the RHS and retains its lease, so it's the LHS that we'd need to worry about.

The first condition here is straightforward — it ensures that an observed timestamp served before the merge by the node after it holds the two leaseholders forms an upper bound with all local timestamps in the RHS range.

The second condition is more subtle — it ensures that if the leaseholders were already collocated, then the lease transfer protection for the LHS inherits the lease transfer protection from the RHS. In other words, it ensures that an observed timestamp served before the merge and before the collocation by the node that later holds the two leaseholders cannot be assumed to form an upper bound with all local timestamps in the RHS range. Without this, we could see the following hazard:

1. put(k2 on n2, r2); gateway chooses t=1.0
2. begin; read(k on n1, r1); gateway chooses t=0.98
3. pick up observed timestamp for n1 of t=0.99
4. lease transfer unrelated to merge of r2 from n2->n1. Lease start time 1.1
5. r1 and r2 already collocated on n1
6. r1 merged right-hand neighbor r2 @ t=1.2
7. read(k2) on joint range at ReadTimestamp=0.98. Stale read because observe timestamp < put's local timestamp

Force a reset of all observed timestamps due to a merge operation on both sides of the merge by updating the lease on the LHS as part of the merge operation. The merge operation will necessarily align the clocks after the merge. The impact of this is that some transactions will fail with a ReadWithUncertaintyInterval that should not have, however as this is a somewhat rare operation, it is likely acceptable to have extra failures here. I'm not sure if this could be made to work without delaying the merge until the previous lease end time (which might be unacceptable)

This is roughly what I had in mind. If we remembered when the largest freeze timestamp of any merge into a range, we could use this as a lower bound for all local uncertainty limits later assigned by that range. This would be symetrical with our handling of new leases.

Care would need to be taken to propagate this information to the RHS of a range split.

This feels like the cleanest solution to me. It requires a small amount of extra persistent state, but it avoids coupling range merges to our leasing protocol.

Instead of storing the ObservedTimestamps map of <NodeID, timestamp> in the Transaction, store a map from <Key, (Lease, timestamp)>. This would correctly handle both lease transfers and merges as the store could determine if the Lease it was originally issued under is the same. This is the "more precise" information and the stores can correctly map the leases to themselves to determine if there is an uncertainty window. The drawback of this approach is that it could make the ObservedTimestamps map much larger. This could be done with Spans, like the Read timestamp cache

The other drawback to this approach is that we would only pick up observed timestamps for a single range when visiting a node, instead of an observed timestamp for all current ranges on that node. So for instance, if we visited a node with r1 and r2 to read a key from r1, we wouldn't collect an observed timestamp which could be used to limit uncertainty if we later returned to read a key from r2.

Hypothetically we could collect a separate observed timestamp from all ranges on a node when visiting it, but that would be very expensive.

Include additional information as part of the AdminRelocateRange response message that would allow the LHS to correctly process transactions that had an observed. TBD - what information would need to be transferred?

I don't understand this solution. Do you mind spelling it out in more detail?

andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue Jun 23, 2022
Previously a lease could be transferred using AdminTransferLease
to a different node, but a call to transfer back to the original
node was a no-op. This is done to prepare the work for the fix
to cockroachdb#73292. Specifically the fix there is to self-transfer the lease
to restart the lease timestamp and get correct protection on merges.

A few unit tests needed to be modified to make this work, as they
were assuming that AdminTransferLease was idempotent, when in
fact it was only lucky that they worked.

Release note: None
@andrewbaptist andrewbaptist linked a pull request Jun 27, 2022 that will close this issue
@craig craig bot closed this as completed in #83345 Jul 1, 2022
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue Jan 24, 2023
Previously a lease could be transferred using AdminTransferLease
to a different node, but a call to transfer back to the original
node was a no-op. This is done to prepare the work for the fix
to cockroachdb#73292. Specifically the fix there is to self-transfer the lease
to restart the lease timestamp and get correct protection on merges.

A few unit tests needed to be modified to make this work, as they
were assuming that AdminTransferLease was idempotent, when in
fact it was only lucky that they worked.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-transactions Relating to MVCC and the transactional model. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants