kv: consider intent in uncertainty interval to be uncertain, not a write-read conflict #57136
Conversation
Nice!
And when I read this explanation, it sounds so obvious that I am scratching my head about why we did it the other, more complicated way to begin with.
Reviewed 8 of 8 files at r6, 3 of 3 files at r7.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)
pkg/storage/pebble_mvcc_scanner.go, line 388 at r4 (raw file):
```go
ownIntent := p.txn != nil && p.meta.Txn.ID.Equal(p.txn.ID)
otherIntentVisible := metaTS.LessEq(p.ts) ||
	(p.checkUncertainty && p.isUncertainValue(metaTS)) ||
```
why this change in meaning of `otherIntentVisible` and the associated change below?

This code is delaying discovery of the uncertainty until the MVCC scan: wouldn't it be beneficial in the future, with the separated lock table, to discover these early? We are already going to be iterating over the lock table key space to find locks with timestamp < read timestamp -- we may as well also inform the reader about the uncertainty so it can push itself right away.
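To make the question concrete, here is a minimal, runnable sketch of the two semantics being compared (the types and names are illustrative stand-ins, not the actual pebble_mvcc_scanner code):

```go
package main

import "fmt"

// Timestamp is a minimal stand-in for hlc.Timestamp.
type Timestamp struct{ WallTime int64 }

func (t Timestamp) LessEq(o Timestamp) bool { return t.WallTime <= o.WallTime }

// classifyIntent shows the change under discussion: under the old
// semantics, an intent in the uncertainty interval (readTS, maxTS] was
// "visible" and surfaced as a write-read conflict; under the new
// semantics it is simply uncertain, and only intents at or below
// readTS conflict.
func classifyIntent(intentTS, readTS, maxTS Timestamp, newSemantics bool) string {
	switch {
	case intentTS.LessEq(readTS):
		return "write-read conflict" // push the intent's transaction
	case intentTS.LessEq(maxTS):
		if newSemantics {
			return "uncertain" // refresh/restart above intentTS
		}
		return "write-read conflict" // old behavior: push, then likely restart anyway
	default:
		return "invisible" // above the uncertainty interval
	}
}

func main() {
	readTS, maxTS := Timestamp{10}, Timestamp{20}
	intentTS := Timestamp{15} // inside the uncertainty interval (10, 20]
	fmt.Println(classifyIntent(intentTS, readTS, maxTS, false)) // write-read conflict
	fmt.Println(classifyIntent(intentTS, readTS, maxTS, true))  // uncertain
}
```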
Force-pushed eca6833 to 7c5fb0a
Thanks for the review! I've split this off from #57077 so we can push this ahead.
I did need to make one small change here. I had hoped that this change would allow us to no longer declare read latches all the way up to a read's uncertainty interval, but this doesn't actually appear to be true. Reads still need to continue declaring latches up through their uncertainty interval so that they are properly synchronized with earlier writes that may have a happened-before relationship with the read. These writes could not have completed and returned to the client until they were durable in the Range's Raft log. However, they may not have been applied to the replica's state machine by the time the write was acknowledged, because Raft entry application occurs asynchronously with respect to the writer (see AckCommittedEntriesBeforeApplication). Latching is the only mechanism that ensures that any observers of the write wait for it to apply before reading.
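A minimal sketch of that latching rationale, assuming hypothetical names (latchSpan, declareReadLatch) rather than the real batcheval/declare.go API:

```go
package main

import "fmt"

// latchSpan is a hypothetical, timestamp-aware latch declaration: a
// read latch at timestamp TS conflicts with write latches at or below TS.
type latchSpan struct {
	Key string
	TS  int64
}

// declareReadLatch declares the read's latch at maxTS (the top of its
// uncertainty interval) rather than at readTS. A write in (readTS, maxTS]
// may already be durable in the Raft log and acknowledged to its client,
// yet not applied to the state machine; latching at maxTS forces the
// read to wait for that write to apply before scanning.
func declareReadLatch(key string, readTS, maxTS int64) latchSpan {
	_ = readTS // evaluation still happens at readTS
	return latchSpan{Key: key, TS: maxTS}
}

func main() {
	fmt.Printf("%+v\n", declareReadLatch("k", 10, 20)) // {Key:k TS:20}
}
```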
This code is delaying discovery of the uncertainty until the MVCC scan: wouldn't it be beneficial in the future, with the separated lock table, to discover these early? We are already going to be iterating over the lock table key space to find locks with timestamp < read timestamp -- we may as well also inform the reader about the uncertainty so it can push itself right away.
Yeah, I think that's a good point. I had hoped that we wouldn't need to introduce any of the complications of #57077 into the lockTable, but this might be reason enough to do so.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/storage/pebble_mvcc_scanner.go, line 388 at r4 (raw file):
Previously, sumeerbhola wrote…
why this change in meaning of `otherIntentVisible` and the associated change below?
We're no longer considering an intent in a read's uncertainty interval to be "visible", for some meaning of "visible". Instead, we consider ourselves reading below the intent even if it is in our uncertainty interval, but then later check that it's not. I renamed this variable to `intentVisible` because "other" wasn't quite correct. I'm thinking of renaming it again to `conflictingIntent` to avoid the ambiguity of the term "visible". What do you think?
I still owe this change some YCSB numbers. I'll try to collect those tonight.
Force-pushed 7c5fb0a to 3bb27d8
Reviewed 25 of 31 files at r1, 26 of 26 files at r8, 3 of 3 files at r9.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/batcheval/declare.go, line 63 at r8 (raw file):
```go
// However, they may not have been applied to the replica's state
// machine by the time the write was acknowledged, because Raft
// entry application occur be asynchronously with respect to the
```
... occurs asynchronously ...
pkg/storage/pebble_mvcc_scanner.go, line 388 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We're no longer considering an intent in a read's uncertainty interval to be "visible", for some meaning of "visible". Instead, we consider ourselves reading below the intent even if it is in our uncertainty interval, but then later check that it's not. I renamed this variable to `intentVisible` because "other" wasn't quite correct. I'm thinking of renaming it again to `conflictingIntent` to avoid the ambiguity of the term "visible". What do you think?
`conflictingIntent` sounds fine.
pkg/storage/pebble_mvcc_scanner.go, line 390 at r8 (raw file):
```go
return p.uncertaintyError(metaTS)
}
```
nit: can you remove the empty line
This could use a comment, such as
// The intent is not within the uncertainty interval, but there could be an uncertain committed value, so seek and check uncertainty using MaxTimestamp.
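A runnable sketch of the control flow this comment refers to; the names are stand-ins, and the real scanner seeks within Pebble rather than looping over a slice:

```go
package main

import "fmt"

// scanner is a stand-in for pebbleMVCCScanner: readTS is the read
// timestamp, maxTS the uncertainty ceiling, and versions holds the
// committed version timestamps under the key, newest first.
type scanner struct {
	readTS, maxTS int64
	versions      []int64
}

// getOne sketches the step under review: if the intent itself is in
// (readTS, maxTS], fail with an uncertainty error immediately;
// otherwise there could still be an uncertain committed value below
// it, so check the versions using maxTS.
func (p *scanner) getOne(intentTS int64) (string, error) {
	if intentTS > p.readTS && intentTS <= p.maxTS {
		return "", fmt.Errorf("uncertainty error: intent at ts=%d", intentTS)
	}
	for _, ts := range p.versions {
		switch {
		case ts > p.maxTS:
			continue // too new even for the uncertainty check
		case ts > p.readTS:
			return "", fmt.Errorf("uncertainty error: committed value at ts=%d", ts)
		default:
			return fmt.Sprintf("value at ts=%d", ts), nil
		}
	}
	return "", nil
}

func main() {
	p := &scanner{readTS: 10, maxTS: 20, versions: []int64{15, 5}}
	fmt.Println(p.getOne(25)) // intent invisible; committed ts=15 is uncertain
}
```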
This commit is a partial revert of cockroachdb#40600 and cockroachdb#46830. It solves the same problem that those PRs were solving, but in a different way.

Those two PRs were handling the case where a reading transaction observes an intent in its uncertainty interval. Before those fixes, we were not considering intents in a scan's uncertainty interval to be uncertain. This had the potential to cause stale reads, because an unresolved intent doesn't indicate that its transaction hasn't been committed and is not a causal ancestor of the scan.

The approach the first PR took was to throw a WriteIntentError on intents in a scan's uncertainty interval. Effectively, it made scans consider intents in their uncertainty interval to be write-read conflicts. This had the benefit that the scan would push the intent and eventually resolve the conflict, either by aborting the intent, pushing it out of the read's uncertainty interval, or waiting for it to commit. In this last case (which is by far the most common), the scan would then throw an uncertainty error, because it now had a committed value in its uncertainty interval.

The second PR introduced some awkward code to deal with the fallout from this decision. Once we started trying to keep the lock table in sync with the MVCC state, we needed to use latches to ensure proper synchronization between any operations that could conflict, because such conflicts would influence the state of the lock table. This meant that we needed to start holding latches across a read's uncertainty interval, because intent writes in this interval were considered write-read conflicts. This led to some moderately gross code and always felt a little wrong.

Now that we are complicating the logic around uncertainty intervals even further, this becomes even more of a burden. This motivates a reworking of these interactions. This commit replaces the logic that considers intents in a transaction's uncertainty interval to be write-read conflicts with logic that considers such intents to be... uncertain. Doing so means that a transaction will immediately refresh/restart above the uncertain timestamp and will only then begin conflicting with the intent.

This has a number of nice benefits:
1. it keeps all considerations of read uncertainty intervals down in MVCC iteration logic. The lock table no longer needs to be aware of it. This is a big win now and an even bigger win once we introduce synthetic timestamps.
2. readers that are almost certainly bound to hit an uncertainty error and need to restart will now do so sooner. In rare cases, this may result in wasted work. In the vast majority of cases, this will allow the reader to be more responsive to the commit of its conflict.
3. uncertainty errors will only be thrown for locks in the uncertainty interval of a read that are protecting a provisional write (intents). Before, any form of a lock in a read's uncertainty interval would be considered a write-read conflict, which was pessimistic and not needed for correctness.

In a future with a fully segregated lock table, the change in semantic meaning here becomes even more clear. Instead of detecting the lock associated with an intent in a read's uncertainty interval and declaring a write-read conflict, the read will instead pass through the lock table untouched, detect the provisional value associated with the intent, and declare uncertainty. This seems closer to what we're actually trying to say about these interactions.
Before making this change, I intend to validate the hypothesis that it will not affect performance (or may even slightly improve performance) by running it on the YCSB benchmark suite.
…plit Now that the readConflictTimestamp does not include a read's uncertainty interval, this distinction is immaterial. We were already not using writeConflictTimestamp in the lockTable anyway.
This commit updates the check in tryBumpBatchTimestamp to prevent a batch from bumping its timestamp if any read latches are held, regardless of whether they are held at a high-enough timestamp to "protect" the read. Even if a request holds read latches with high enough timestamps to fully cover ("protect") the batch at the new timestamp, we still don't want to allow the bump. This is because a batch with read spans and a higher timestamp may now conflict with locks that it previously did not. However, server-side retries don't re-scan the lock table. This can lead to requests missing unreplicated locks in the lock table that they should have seen or discovering replicated intents in MVCC that they should not have seen (from the perspective of the lock table's AddDiscoveredLock method). This was discovered by kvnemesis, which saw a "discovered non-conflicting lock" assertion failure.
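A minimal sketch of that updated check, using illustrative types rather than the real concurrency-guard plumbing:

```go
package main

import "fmt"

// guard is a stand-in for a request's concurrency guard.
type guard struct {
	holdsReadLatches bool
}

// tryBumpBatchTimestamp sketches the updated check: refuse any
// server-side bump while read latches are held, regardless of their
// timestamps, because the lock table was consulted at the old
// timestamp and a higher one could conflict with locks the request
// never saw (tripping AddDiscoveredLock's assertions).
func tryBumpBatchTimestamp(g *guard, cur, target int64) (int64, bool) {
	if target <= cur {
		return cur, true // nothing to do
	}
	if g != nil && g.holdsReadLatches {
		return cur, false // force a full retry that re-scans the lock table
	}
	return target, true
}

func main() {
	fmt.Println(tryBumpBatchTimestamp(&guard{holdsReadLatches: true}, 10, 15))  // 10 false
	fmt.Println(tryBumpBatchTimestamp(&guard{holdsReadLatches: false}, 10, 15)) // 15 true
}
```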
Force-pushed 3bb27d8 to 0850deb
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @sumeerbhola)
pkg/kv/kvserver/batcheval/declare.go, line 63 at r8 (raw file):
Previously, sumeerbhola wrote…
... occurs asynchronously ...
Done.
pkg/storage/pebble_mvcc_scanner.go, line 388 at r4 (raw file):
Previously, sumeerbhola wrote…
conflictingIntent sounds fine.
Done.
pkg/storage/pebble_mvcc_scanner.go, line 390 at r8 (raw file):
Previously, sumeerbhola wrote…
nit: can you remove the empty line
This could use a comment, such as
// The intent is not within the uncertainty interval, but there could be an uncertain committed value, so seek and check uncertainty using MaxTimestamp.
Done.
bors r=sumeerbhola
Build succeeded.
Needed for cockroachdb#57688.

This commit reworks interactions between range leases and requests, pulling the consultation of a replica's lease down below the level of latching while keeping heavy-weight operations like lease acquisitions above the level of latching. Doing so comes with several benefits, some related specifically to non-blocking transactions and some more general.

Background

Before discussing the change here, let's discuss how lease checks, lease acquisitions, lease redirection, and lease transfers currently work. Today, requests consult a replica's range lease before acquiring latches. If the lease is good to go, the request proceeds to acquire latches. If the lease is not currently held by any replica, the lease is acquired (again, above latches) through a coalesced `RequestLeaseRequest`. If the lease is currently held by a different replica, the request is redirected to that replica using a `NotLeaseHolderError`. Finally, if the lease check notices a lease transfer in progress, the request is optimistically redirected to the prospective new leaseholder.

This all works, but only because it's been around for so long. Due to the lease check above latching, we're forced to go to great lengths to get the synchronization with in-flight requests right, which leads to very subtle logic. This is most apparent with lease transfers, which properly synchronize with ongoing requests through a delicate dance with the HLC clock and some serious "spooky action at a distance". Every request bumps the local HLC clock in `Store.Send`, then grabs the replica mutex, checks for an ongoing lease transfer, drops the replica mutex, then evaluates. Lease transfers grab the replica mutex, grab a clock reading from the local HLC clock, bump the minLeaseProposedTS to stop using the current lease, drop the replica mutex, then propose a new lease using this clock reading as its start time. This works only because each request bumps the HLC clock _before_ checking the lease, so the HLC clock can serve as an upper bound on every request that has made it through the lease check by the time the lease transfer begins. This structure is inflexible, subtle, and falls over as soon as we try to extend it.

Motivation

The primary motivation for pulling lease checks and transfers below latching is that the interaction between requests and lease transfers is incompatible with future-time operations, a key part of the non-blocking transaction project. This is because the structure relies on the HLC clock providing an upper bound on the time of any request served by an outgoing leaseholder, which is attached to lease transfers to ensure that the new leaseholder does not violate any request served on the old leaseholder. But this is quickly violated once we start serving future-time operations, which don't bump the HLC clock. So we quickly need to look elsewhere for this information.

The obvious place to look for this information is the timestamp cache, which records the upper-bound read time of each key span in a range, even if this upper-bound time is synthetic. If we could scan the timestamp cache and attach the maximum read time to a lease transfer (through a new field, not as the lease start time), we'd be good. But this runs into a problem, because if we just read the timestamp cache under the lease transfer's lock, we can't be sure we didn't miss any in-progress operations that had passed the lease check previously but had not yet bumped the timestamp cache. Maybe they are still reading? So the custom locking quickly runs into problems (I said it was inflexible!).

Solution

The solution here is to stop relying on custom locking for lease transfers by pulling the lease check below latching and by pulling the determination of the transfer's start time below latching. This ensures that during a lease transfer, we not only block new requests but also flush out in-flight requests. This means that by the time we look at the timestamp cache during the evaluation of a lease transfer, we know it has already been updated by any request that will be served under the current lease. This commit doesn't make the switch from consulting the HLC clock to consulting the timestamp cache during TransferLease request evaluation, but a future commit will.

Other benefits

Besides this primary change, a number of other benefits fall out of this restructuring.

1. we avoid relying on custom synchronization around leases, instead relying on the more general latching mechanism.
2. we more closely align `TransferLeaseRequest` and `SubsumeRequest`, which now both grab clock readings during evaluation and will both need to forward their clock reading by the upper bound of a range's portion of the timestamp cache. It makes sense that these two requests would be very similar, as both are responsible for renouncing the current leaseholder's powers and passing them elsewhere.
3. we more closely align the lease acquisition handling with the handling of `MergeInProgressError` by classifying a new `InvalidLeaseError` as a "concurrencyRetryError" (see isConcurrencyRetryError). This fits the existing structure of: grab latches, check range state, drop latches and wait if necessary, retry.
4. in doing so, we fuse the critical section of lease checks and the rest of the checks in `checkExecutionCanProceed`. So we grab the replica read lock one fewer time in the request path.
5. we move one step closer to a world where we can "ship a portion of the timestamp cache" during lease transfers (and range merges) to avoid retry errors / transaction aborts on the new leaseholder. This commit will be followed up by one that ships a very basic summary of a leaseholder's timestamp cache during lease transfers. However, this would now be trivial to extend with higher-resolution information, given some size limit. Perhaps we prioritize the local portion of the timestamp cache to avoid txn aborts?
6. now that leases are checked below latching, we no longer have the potential for an arbitrary delay due to latching and waiting on locks between when the lease is checked and when a request evaluates, so we no longer need checks like [this](https://github.com/cockroachdb/cockroach/blob/7bcb2cef794da56f6993f1b27d5b6a036016242b/pkg/kv/kvserver/replica_write.go#L119).
7. we pull observed timestamp handling a layer down, which will be useful to address plumbing comments on cockroachdb#57077.

Other behavioral changes

There are two auxiliary behavioral changes made by this commit that deserve attention.

The first is that during a lease transfer, operations now block on the outgoing leaseholder instead of immediately redirecting to the expected next leaseholder. This has trade-offs. On one hand, this delays redirection, which may make lease transfers more disruptive to ongoing traffic. On the other, we've seen in the past that the optimistic redirection is not an absolute win. In many cases, it can lead to thrashing and lots of wasted work, as the outgoing leaseholder and the incoming leaseholder both point at each other and requests ping-pong between them. We've seen this cause serious issues like cockroachdb#22837 and cockroachdb#32367, which we addressed by adding exponential backoff in the client in 89d349a. So while this change may make average-case latency during lease transfers slightly worse, it will keep things much more orderly, avoid wasted work, and reduce worst-case latency during lease transfers.

The other behavioral change made by this commit is that observed timestamps are no longer applied to a request to reduce its MaxOffset until after latching and locking, rather than before. This sounds concerning, but it's actually not, for two reasons. First, as of cockroachdb#57136, a transaction's uncertainty interval is no longer considered by the lock table because locks in a transaction's uncertainty interval are no longer considered write-read conflicts. Instead, those locks' provisional values are considered at evaluation time to be uncertain. Second, the fact that the observed timestamp-limited MaxOffset was being used for latching is no longer correct in a world with synthetic timestamps (see cockroachdb#57077), so we would have had to make this change anyway. So put together, this behavioral change isn't meaningful.
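The ordering argument at the heart of this commit can be sketched in a few lines (a toy model; none of these names are the real kvserver API):

```go
package main

import "fmt"

// replica is a toy model of the ordering change described above.
type replica struct {
	latched          bool  // stand-in for an all-keys latch held by the transfer
	tsCacheHighWater int64 // max read time recorded for the range's keyspan
}

// transferLease sketches the new flow: evaluate the transfer under
// latches that conflict with every read and write, so all in-flight
// requests have bumped the timestamp cache before we read it. The
// value read here upper-bounds every request served under the old
// lease, including future-time reads that never bumped the HLC clock.
func (r *replica) transferLease() int64 {
	r.latched = true // 1. block new requests, flush out in-flight ones
	defer func() { r.latched = false }()
	return r.tsCacheHighWater // 2. now safe to consult the timestamp cache
}

func main() {
	r := &replica{tsCacheHighWater: 42}
	fmt.Println("new leaseholder must not violate reads up to:", r.transferLease())
}
```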
Did you happen to collect the YCSB numbers?
I did, but I see that I forgot to post them. There wasn't much movement. The largest statistically significant change was a 4% regression in throughput on
This also seems to indicate that reducing maximum clock uncertainty would positively impact YCSB throughput. That would be an interesting experiment to run.
59086: kv: move range lease checks and transfers below latching r=nvanbenschoten a=nvanbenschoten

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
This PR is a partial revert of #40600 and #46830. It solves the same problem that those PRs were solving, but in a different way.
Those two PRs were handling the case where a reading transaction observes an intent in its uncertainty interval. Before those fixes, we were not considering intents in a scan's uncertainty interval to be uncertain. This had the potential to cause stale reads because an unresolved intent doesn't indicate that its transaction hasn’t been committed and is not a causal ancestor of the scan.
The approach the first PR took was to throw a WriteIntentError on intents in a scan's uncertainty interval. Effectively, it made scans consider intents in their uncertainty interval to be write-read conflicts. This had the benefit that the scan would push the intent and eventually resolve the conflict, either by aborting the intent, pushing it out of the read's uncertainty interval, or waiting for it to commit. In this last case (which is by far the most common), the scan would then throw an uncertainty error, because it now had a committed value in its uncertainty interval.
The second PR introduced some awkward code to deal with the fallout from this decision. Once we started trying to keep the lock table in sync with the MVCC state, we needed to use latches to ensure proper synchronization between any operations that could conflict because such conflicts would influence the state of the lock table. This meant that we needed to start holding latches across a read's uncertainty interval, because intent writes in this interval were considered write-read conflicts. This led to some moderately gross code and always felt a little wrong.
Now that we are complicating the logic around uncertainty intervals even further, this becomes even more of a burden. This motivates a reworking of these interactions. This commit replaces the logic that considers intents in a transaction's uncertainty interval to be write-read conflicts with logic that considers such intents to be... uncertain. Doing so means that a transaction will immediately refresh/restart above the uncertain timestamp and will only then begin conflicting with the intent.
This has a number of nice benefits:
1. it keeps all considerations of read uncertainty intervals down in MVCC iteration logic. The lock table no longer needs to be aware of it. This is a big win now and an even bigger win once we introduce synthetic timestamps.
2. readers that are almost certainly bound to hit an uncertainty error and need to restart will now do so sooner. In rare cases, this may result in wasted work. In the vast majority of cases, this will allow the reader to be more responsive to the commit of its conflict.
3. uncertainty errors will only be thrown for locks in the uncertainty interval of a read that are protecting a provisional write (intents). Before, any form of a lock in a read's uncertainty interval would be considered a write-read conflict, which was pessimistic and not needed for correctness.
In a future with a fully segregated lock table, the change in semantic meaning here becomes even more clear. Instead of detecting the lock associated with an intent in a read's uncertainty interval and declaring a write-read conflict, the read will instead pass through the lock table untouched, detect the provisional value associated with the intent, and declare uncertainty. This seems closer to what we're actually trying to say about these interactions.
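A small sketch of this end state, with hypothetical types standing in for the lock table and MVCC evaluation:

```go
package main

import "fmt"

// lock is a hypothetical lock-table entry; ts is the lock's timestamp.
type lock struct{ ts int64 }

// conflictsInLockTable: under the new semantics, only locks at or
// below the read timestamp are write-read conflicts; locks in the
// read's uncertainty interval pass through the lock table untouched.
func conflictsInLockTable(l lock, readTS int64) bool {
	return l.ts <= readTS
}

// uncertainAtEvaluation: the provisional value under such a lock is
// instead detected during MVCC iteration and declared uncertain.
func uncertainAtEvaluation(provisionalTS, readTS, maxTS int64) bool {
	return provisionalTS > readTS && provisionalTS <= maxTS
}

func main() {
	l := lock{ts: 15}
	fmt.Println(conflictsInLockTable(l, 10))       // false: read passes through
	fmt.Println(uncertainAtEvaluation(15, 10, 20)) // true: uncertainty error
}
```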
Before making this change, I intend to validate the hypothesis that it will not affect performance (or may even slightly improve performance) by running it on the YCSB benchmark suite.