kv: disable observed timestamp captured before range merge from applying to merged range #73292

nvanbenschoten · 2021-11-30T04:51:54Z

85d4627 introduced logic to handle the where after a lease changes, observed timestamps captured on the incoming leaseholder's node cannot be used to ignore uncertainty for data written by the former leaseholder. This logic currently lives here:

cockroach/pkg/kv/kvserver/observedts/limit.go

Lines 27 to 40 in 7097a90

    
           // The lease's start time is also taken into consideration to ensure that a 
        
           // lease transfer does not result in the observed timestamp for this node being 
        
           // inapplicable to data previously written by the former leaseholder. To wit: 
        
           // 
        
           //  1. put(k on leaseholder n1), gateway chooses t=1.0 
        
           //  2. begin; read(unrelated key on n2); gateway chooses t=0.98 
        
           //  3. pick up observed timestamp for n2 of t=0.99 
        
           //  4. n1 transfers lease for range with k to n2 @ t=1.1 
        
           //  5. read(k) on leaseholder n2 at ReadTimestamp=0.98 should get 
        
           //     ReadWithinUncertaintyInterval because of the write in step 1, so 
        
           //     even though we observed n2's timestamp in step 3 we must expand 
        
           //     the uncertainty interval to the lease's start time, which is 
        
           //     guaranteed to be greater than any write which occurred under 
        
           //     the previous leaseholder.

And is tested in TestRangeLocalUncertaintyLimitAfterNewLease.

While working on some refactors to fix #58459 and eventually #36431, I noticed a similar case after a range merge, where observed timestamps from before the merge cannot be used to ignore uncertainty for data written by the former RHS range. A comment describing this was added to #73244:

// A similar hazard applies to range merges, where we cannot apply observed
// timestamps captured from the leaseholder of the left-hand side of the merge
// to data written on the right-hand side of the merge, even after the merge has
// completed. To wit:
//
// 1. put(k2 on n2, r2); gateway chooses t=1.0
// 2. begin; read(k on n1, r1); gateway chooses t=0.98
// 3. pick up observed timestamp for n1 of t=0.99
// 4. r1 merged right-hand neighbor r2 @ t=1.1
// 5. read(k2) on joint range at ReadTimestamp=0.98 should get
//    ReadWithinUncertaintyInterval because of the write in step 1, so
//    even though we observed n1's timestamp in step 3 we must expand
//    the uncertainty interval to the range merge's freeze time, which
//    is guaranteed to be greater than any write which occurred on the
//    right-hand side.

We currently have no protection here, and so we are susceptible to a stale read in this case.

Epic CRDB-1514

Jira issue: CRDB-11522

The text was updated successfully, but these errors were encountered:

andrewbaptist · 2022-06-07T18:59:58Z

There may be a few different ways to solve this, and all have some drawbacks:

Force a lease transfer prior to a merge so that the merging ranges have aligned leaseholders. If the transfer lease code correctly prevents stale reads, then merging after the lease transfer will also prevent stale reads as the local clock on both sides of the merge will be synchronized since they are the same node. There is a note (that I don't fully understand) in the merge document where it states this is infeasible due to possible failures of leaseholders and leaseholder transfers. Assuming aligned leaseholders are still infeasible, this option is also infeasible.
Force a reset of all observed timestamps due to a merge operation on both sides of the merge by updating the lease on the LHS as part of the merge operation. The merge operation will necessarily align the clocks after the merge. The impact of this is that some transactions will fail with a ReadWithUncertaintyInterval that should not have, however as this is a somewhat rare operation, it is likely acceptable to have extra failures here. I'm not sure if this could be made to work without delaying the merge until the previous lease end time (which might be unacceptable)
Instead of storing the ObservedTimestamps map of <NodeID, timestamp> in the Transaction, store a map from <Key, (Lease, timestamp)>. This would correctly handle both lease transfers and merges as the store could determine if the Lease it was originally issued under is the same. This is the "more precise" information and the stores can correctly map the leases to themselves to determine if there is an uncertainty window. The drawback of this approach is that it could make the ObservedTimestamps map much larger. This could be done with Spans, like the Read timestamp cache.
Include additional information as part of the AdminRelocateRange response message that would allow the LHS to correctly process transactions that had an observed. TBD - what information would need to be transferred?

nvanbenschoten · 2022-06-08T16:33:20Z

Force a lease transfer prior to a merge so that the merging ranges have aligned leaseholders. If the transfer lease code correctly prevents stale reads, then merging after the lease transfer will also prevent stale reads as the local clock on both sides of the merge will be synchronized since they are the same node.

For this to work given the current lease transfer protection, we would need the LHS of the merge to both be:

collocated with the RHS
have a lease start timestamp above the lease start timestamp of the RHS

Collocating the LHS to the RHS's node is necessary (as opposed to the other way around) because we'd be trying to piggyback the merge protection on top of the lease transfer protection, which is provided by the lease start time. The merge's LHS subsumes the RHS and retains its lease, so it's the LHS that we'd need to worry about.

The first condition here is straightforward — it ensures that an observed timestamp served before the merge by the node after it holds the two leaseholders forms an upper bound with all local timestamps in the RHS range.

The second condition is more subtle — it ensures that if the leaseholders were already collocated, then the lease transfer protection for the LHS inherits the lease transfer protection from the RHS. In other words, it ensures that an observed timestamp served before the merge and before the collocation by the node that later holds the two leaseholders cannot be assumed to form an upper bound with all local timestamps in the RHS range. Without this, we could see the following hazard:

1. put(k2 on n2, r2); gateway chooses t=1.0
2. begin; read(k on n1, r1); gateway chooses t=0.98
3. pick up observed timestamp for n1 of t=0.99
4. lease transfer unrelated to merge of r2 from n2->n1. Lease start time 1.1
5. r1 and r2 already collocated on n1
6. r1 merged right-hand neighbor r2 @ t=1.2
7. read(k2) on joint range at ReadTimestamp=0.98. Stale read because observe timestamp < put's local timestamp

Force a reset of all observed timestamps due to a merge operation on both sides of the merge by updating the lease on the LHS as part of the merge operation. The merge operation will necessarily align the clocks after the merge. The impact of this is that some transactions will fail with a ReadWithUncertaintyInterval that should not have, however as this is a somewhat rare operation, it is likely acceptable to have extra failures here. I'm not sure if this could be made to work without delaying the merge until the previous lease end time (which might be unacceptable)

This is roughly what I had in mind. If we remembered when the largest freeze timestamp of any merge into a range, we could use this as a lower bound for all local uncertainty limits later assigned by that range. This would be symetrical with our handling of new leases.

Care would need to be taken to propagate this information to the RHS of a range split.

This feels like the cleanest solution to me. It requires a small amount of extra persistent state, but it avoids coupling range merges to our leasing protocol.

Instead of storing the ObservedTimestamps map of <NodeID, timestamp> in the Transaction, store a map from <Key, (Lease, timestamp)>. This would correctly handle both lease transfers and merges as the store could determine if the Lease it was originally issued under is the same. This is the "more precise" information and the stores can correctly map the leases to themselves to determine if there is an uncertainty window. The drawback of this approach is that it could make the ObservedTimestamps map much larger. This could be done with Spans, like the Read timestamp cache

The other drawback to this approach is that we would only pick up observed timestamps for a single range when visiting a node, instead of an observed timestamp for all current ranges on that node. So for instance, if we visited a node with r1 and r2 to read a key from r1, we wouldn't collect an observed timestamp which could be used to limit uncertainty if we later returned to read a key from r2.

Hypothetically we could collect a separate observed timestamp from all ranges on a node when visiting it, but that would be very expensive.

Include additional information as part of the AdminRelocateRange response message that would allow the LHS to correctly process transactions that had an observed. TBD - what information would need to be transferred?

I don't understand this solution. Do you mind spelling it out in more detail?

Previously a lease could be transferred using AdminTransferLease to a different node, but a call to transfer back to the original node was a no-op. This is done to prepare the work for the fix to cockroachdb#73292. Specifically the fix there is to self-transfer the lease to restart the lease timestamp and get correct protection on merges. A few unit tests needed to be modified to make this work, as they were assuming that AdminTransferLease was idempotent, when in fact it was only lucky that they worked. Release note: None

nvanbenschoten added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-transactions Relating to MVCC and the transactional model. labels Nov 30, 2021

nvanbenschoten self-assigned this Nov 30, 2021

blathers-crl bot added the T-kv KV Team label Nov 30, 2021

andrewbaptist self-assigned this May 26, 2022

exalate-issue-sync bot unassigned nvanbenschoten May 31, 2022

andrewbaptist mentioned this issue Jun 21, 2022

kv: Allow leases to be transfered back to self #83157

Closed

andrewbaptist linked a pull request Jun 27, 2022 that will close this issue

kv: Track the min valid observed timestamp #83345

Merged

craig bot closed this as completed in #83345 Jul 1, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kv: disable observed timestamp captured before range merge from applying to merged range #73292

kv: disable observed timestamp captured before range merge from applying to merged range #73292

nvanbenschoten commented Nov 30, 2021 •

edited by cockroach-jira-scripts

Loading

andrewbaptist commented Jun 7, 2022

nvanbenschoten commented Jun 8, 2022

kv: disable observed timestamp captured before range merge from applying to merged range #73292

kv: disable observed timestamp captured before range merge from applying to merged range #73292

Comments

nvanbenschoten commented Nov 30, 2021 • edited by cockroach-jira-scripts Loading

andrewbaptist commented Jun 7, 2022

nvanbenschoten commented Jun 8, 2022

nvanbenschoten commented Nov 30, 2021 •

edited by cockroach-jira-scripts

Loading