kv/storage: introduce local timestamps for MVCC versions in MVCCValue #80706
Conversation
Force-pushed from c46d218 to 204c01e.
@sumeerbhola @aayushshah15 @erikgrinaker this is an alternative to #77342 that stores the new LocalTimestamp in a key-value's value instead of its key. We've discussed the difference between these two options in a few different venues. At a high level, the trade-off is between the two approaches: Option 1: store the LocalTimestamp in the key (#77342). Option 2: store the LocalTimestamp in the value (this PR). Both approaches touch performance-sensitive code, so both have to be very careful not to introduce regressions.

After writing this out and thinking about the trade-offs, I'm leaning towards the approach taken by this PR. The main reason for that is that I don't think the impact of this on key-value separation will be pronounced. Storing the local timestamp in the value means that we need to fetch the value of keys seen by scans to check uncertainty if 1) the scan has an uncertainty interval and 2) the scan is reading at a timestamp below the value. I'm going to assume that we don't care about cases where the value actually is uncertain, because this leads to a much more expensive uncertainty restart and is also a cost paid once (i.e. it doesn't compound because we stop scanning once we see an uncertain value). So let's focus on the case where the values are above the read timestamp but not uncertain.

First off, is this a common situation in workloads that we care about? I'm not sure that it is, because any historical AOST read does not have an uncertainty interval, and any consistent read will rarely see keys at timestamps above its read timestamp, as they would have needed to be written roughly concurrently with the read. So this is really only a problem for long-running consistent scans operating over keyspaces that see high write traffic.

If this does cause issues and we do find that avoiding a fetch of the value of keys above a consistent read's timestamp is important for performance, we could avoid some of the cost by splitting uncertainty checking into two phases. We could first perform an imprecise check using just the version timestamp from the key and comparing that to the scan's global uncertainty limit. Only if this check fails would we need to pull the value to perform the precise uncertainty check using the local timestamp and the local uncertainty limit. That means that long-running scans will have a cheap first-pass check in the common case and will only need to fetch values when the imprecise check fails.

Meanwhile, I think there are immediately meaningful benefits to this approach. If you've already reviewed #77342 then only the last two commits here are new. Thanks for working with me through the multiple iterations of this patch.
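To make the two-phase idea concrete, here is a minimal sketch in Go of how such a check could be structured. This is not code from this PR: the function name and the fetchValue callback are hypothetical, and the synthetic-timestamp special case is omitted for brevity; Interval, MVCCValue, and IsUncertain are the types discussed in this review.

```go
package example

import (
	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/uncertainty"
	"github.com/cockroachdb/cockroach/pkg/storage"
	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// maybeUncertain sketches a two-phase uncertainty check: a cheap check using
// only the key's version timestamp, followed by a value fetch only when the
// cheap check cannot rule out uncertainty.
func maybeUncertain(
	in uncertainty.Interval,
	readTs, versionTs hlc.Timestamp,
	fetchValue func() (storage.MVCCValue, error),
) (bool, error) {
	// Phase 1: imprecise check using only the key's version timestamp. Values
	// at or below the read timestamp are visible, and values above the global
	// uncertainty limit can never be uncertain. Neither case needs the value.
	if versionTs.LessEq(readTs) || in.GlobalLimit.Less(versionTs) {
		return false, nil
	}
	// Phase 2: precise check. Fetch the value to read the local timestamp and
	// compare it against the local uncertainty limit (observed timestamp).
	val, err := fetchValue()
	if err != nil {
		return false, err
	}
	localTs := val.LocalTimestamp
	if localTs.IsEmpty() {
		// When omitted, the local timestamp defaults to the version timestamp
		// (ignoring the synthetic-timestamp special case for brevity).
		localTs = hlc.ClockTimestamp(versionTs)
	}
	return in.IsUncertain(versionTs, localTs), nil
}
```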
I added benchmark results demonstrating the impact of this change on performance to the PR description. The benchmarks are from three different levels: microbenchmarks around MVCC value encoding and decoding, end-to-end microbenchmarks in pkg/sql/tests, and the YCSB benchmark suite.
Force-pushed from ffe93b5 to ba07e50.
Also, when Pebble starts storing older versions in a value block, and there is exactly one version above this read timestamp, that version's value will not be in the value block.
// D0 ————————————————————————————————————————————————
//
// MVCCKey
TODO(@nvanbenschoten): fix this, now that the local timestamp is in the value.
(still reading -- flushing some comments)
Reviewed 14 of 64 files at r11.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @erikgrinaker, @nvanbenschoten, @stevendanna, and @sumeerbhola)
pkg/storage/mvcc.go
line 400 at r11 (raw file):
// Straightforward: old contribution goes, new contribution comes, and we're done.
ms.SysBytes -= origMetaKeySize + origMetaValSize + orig.KeyBytes + orig.ValBytes
ms.SysBytes += metaKeySize + metaValSize + meta.KeyBytes + meta.ValBytes
Was this a bug in the existing code since it was not using orig.{KeyBytes,ValBytes} and meta.{KeyBytes,ValBytes}?
The KeyBytes is always MVCCVersionTimestampSize, yes?
pkg/storage/mvcc_value.go
line 60 at r11 (raw file):
//
// In either encoding, the suffix corresponding to the roachpb.Value can be
// omitted, indicating a deletion tombstone. For the simple encoding, this
This is a bit confusing, partly because roachpb.Value is already encoded, in that Value.RawBytes is already the encoded value. How about spelling this out a bit more, with something like:
// For a deletion tombstone, the encoding of roachpb.Value is special cased to be empty, i.e., no checksum, tag or encoded-data. In that case the extended encoding above is
// simply <4-byte-header-len><1-byte-sentinel>.
pkg/storage/mvcc_value.go
line 103 at r11 (raw file):
if v.LocalTimestamp.IsEmpty() {
	if k.Timestamp.Synthetic {
		return hlc.MinClockTimestamp
Is this to ensure that we can never ignore this value based on observed timestamp, and have to use the global uncertainty limit? Could use a code comment.
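For context, here is a hedged completion of the truncated snippet above. The method shape and fallback behavior reflect my reading of this PR rather than a verbatim copy of pkg/storage/mvcc_value.go:

```go
// GetLocalTimestamp sketches how the value's local timestamp is derived when
// the header omits it (within pkg/storage).
func (v MVCCValue) GetLocalTimestamp(k MVCCKey) hlc.ClockTimestamp {
	if v.LocalTimestamp.IsEmpty() {
		if k.Timestamp.Synthetic {
			// A synthetic version timestamp makes no claim about the leaseholder's
			// clock, so return the minimum clock timestamp. Readers then cannot
			// rule the value out via observed timestamps and must fall back to
			// the global uncertainty limit.
			return hlc.MinClockTimestamp
		}
		// Otherwise the version timestamp was derived from the leaseholder's
		// clock and can stand in for the omitted local timestamp.
		return hlc.ClockTimestamp(k.Timestamp)
	}
	return v.LocalTimestamp
}
```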
pkg/storage/mvcc_value.go
line 170 at r11 (raw file):
	return nil, errors.Wrap(err, "marshaling MVCCValueHeader")
}
// <4-byte-checksum><1-byte-tag><encoded-data>
or empty for a tombstone?
pkg/storage/enginepb/mvcc.go
line 266 at r11 (raw file):
if meta.LocalTimestamp == nil {
	if meta.Timestamp.ToTimestamp().Synthetic {
		return hlc.MinClockTimestamp
I assume this is to ensure that we can never ignore this value based on observed timestamp, and have to use the global uncertainty limit. Is that correct? Could use a code comment.
pkg/storage/enginepb/mvcc.proto
line 40 at r11 (raw file):
// The local timestamp of the most recent versioned value if this is a
// value that may have multiple versions. For values which may have only
// one version, this timestamp is set to nil. See MVCCValueHeader for
"may have only version" is a bit confusing to me. Is this talking about this particular key or the category of versioned values? I assume it is the former, since this comment is about versioned values and that means it is always possible to have multiple versions.
Why do we need this at all -- it seems duplication from what we have stored in the provisional value?
pkg/storage/enginepb/mvcc.proto
line 72 at r11 (raw file):
// Value is the value written to the key as part of the transaction at
// the above Sequence. Value uses the roachpb.Value encoding.
optional bytes value = 2;
So the local timestamp is not stored for the older sequence of values written by this transaction?
And if one of those was restored due to a savepoint rollback, is it behaving like a new write in terms of the local timestamp?
A code comment would be useful.
Apart from a few tests, I've finished reading the last 2 commits. Looks good!
Reviewed 40 of 64 files at r11, 11 of 15 files at r12.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @erikgrinaker, @nvanbenschoten, and @stevendanna)
pkg/kv/kvserver/batcheval/cmd_add_sstable.go
line 408 at r7 (raw file):
	return errors.AssertionFailedf("SST contains inline value or intent for key %s", key)
}
if len(value) == 0 {
just curious: what is the confidence level that all such code has been found and changed?
Tests that don't write a local timestamp would still succeed with buggy code, yes?
Should we have some way to force all tests to write local timestamps?
pkg/kv/kvserver/uncertainty/interval.go
line 55 at r11 (raw file):
// version timestamp and with the specified uncertainty interval.
func (in Interval) IsUncertain(valueTs hlc.Timestamp, localTs hlc.ClockTimestamp) bool {
	if !in.LocalLimit.IsEmpty() && in.LocalLimit.Less(localTs) {
Is everything required to have a non-empty localTs now, whether it be the version timestamp itself or hlc.MinTimestamp for the synthetic case?
Should we assert !localTs.IsEmpty() to make sure we have not forgotten something?
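A sketch of what the suggested assertion could look like in context. The GlobalLimit fallthrough is my reading of the function, not a verbatim quote of the PR:

```go
func (in Interval) IsUncertain(valueTs hlc.Timestamp, localTs hlc.ClockTimestamp) bool {
	if localTs.IsEmpty() {
		// Suggested assertion: every caller should supply a local timestamp,
		// either from the value's header or derived from the version timestamp.
		panic("empty local timestamp")
	}
	if !in.LocalLimit.IsEmpty() && in.LocalLimit.Less(localTs) {
		// The value was written after the reader's observed timestamp from this
		// leaseholder, so it could not have causally preceded the reader.
		return false
	}
	// Otherwise, the value is uncertain if it falls within the global limit.
	return valueTs.LessEq(in.GlobalLimit)
}
```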
pkg/storage/mvcc.go
line 3087 at r10 (raw file):
if err != nil {
	return false, err
}
This change looks like it is mainly code movement, since we need to figure out newValue.LocalTimestamp before updating and writing meta. Is that correct?
Do you think we have good test coverage of the various paths in this function?
pkg/storage/mvcc.go
line 858 at r11 (raw file):
	iterAlreadyPositioned bool,
	meta *enginepb.MVCCMetadata,
	localTs *hlc.ClockTimestamp,
is localTS for allocation avoidance? is it required that meta.localTS be populated if the return value of ok is true?
A code comment would help.
pkg/storage/mvcc.go
line 1495 at r11 (raw file):
}
if haveNextVersion {
	prevVal, err := DecodeMVCCValue(prevUnsafeVal)
is this rare enough to use this less optimized function?
pkg/storage/mvcc.go
line 3045 at r11 (raw file):
// TODO(nvanbenschoten): this is an awkward interface. We shouldn't
// be mutating meta and we shouldn't be restoring the previous value
// here. Instead, this should all be handled down below.
This whole intent resolution function is quite confusing, though we probably need to shore up our randomized testing before making any significant changes here. There is also a TODO mentioned in mvcc_test.go (may be stale now)
// TODO(sumeer): mvccResolveWriteIntent has a bug when the txn is being
// ABORTED and there are IgnoredSeqNums that are causing a partial rollback.
// It does the partial rollback and does not actually resolve the intent.
// This does not affect correctness since the intent resolution will get
// retried.
pkg/storage/pebble.go
line 1310 at r12 (raw file):
// a cluster are at or beyond clusterversion.TODO, different nodes will see the
// version state transition at different times. Nodes that have not yet seen the
// transition may remove the local timestamp from an intent that has one during
comment needs updating
…tents

Related to cockroachdb#80706. Related to cockroachdb#66485.

This commit makes a slight modification to `pebbleMVCCScanner` that changes how it determines whether an intent is uncertain or not. Instead of consulting the version timestamp in the `MVCCMetadata` struct and comparing that against the scan's uncertainty interval, the `pebbleMVCCScanner` now scans through the uncertainty interval and uses the intent's provisional value's timestamp to determine uncertainty.

The `pebbleMVCCScanner` was actually already doing this scan to compute uncertainty for other committed values in its uncertainty interval if it found that the intent was not uncertain. However, after this change, it also relies on the scan to compute uncertainty for the intent itself. This is safe, because the calling code knows that the intent has a higher timestamp than the scan, so there is no risk that the scan adds the provisional value to its result set.

This change is important for two reasons:

1. it avoids the need to store the `local_timestamp` field (introduced in cockroachdb#80706) in the `MVCCMetadata` struct.
2. it moves us closer in the direction of using `MVCCMetadata` values (ts=0, essentially locks protecting provisional values) to determine read-write conflicts but then using versioned provisional values to determine uncertainty. Doing so allows us to decompose a KV scan into a separate lock table scan to detect read-write conflicts and a MVCC scan to accumulate a result set while checking for uncertainty. This will be important for cockroachdb#66485.
Force-pushed from ba07e50 to 8e87495.
Force-pushed from 95aae23 to 841c16e.
This commit adds a cluster version gate and a cluster setting for local timestamps, to assist with their migration into an existing cluster. This fixes mixed-version clusters' interaction with local timestamps.
This commit switches from storing encoded `roachpb.Value`s to storing encoded `storage.MVCCValue`s in `MVCCMetadata`'s `SequencedIntent`s. Doing so ensures that MVCCValue headers are not lost when an intent is rolled back. This is important to avoid losing the local timestamp of values in a key's intent history. Failure to do so could allow for stale reads.
This commit adds an assertion to `Interval.IsUncertain` that the provided value and local timestamps are non-zero.
…testing

This commit adds a metamorphic knob that randomly disables the simple MVCC value encoding scheme. Doing so ensures that code which interacts with encoded MVCC values does not mistake these values for encoded roachpb values. This could take place in two different ways:

1. broken code could assign an encoded MVCC value directly to a roachpb.Value's `RawBytes` field. This typically caused the test to fail with an error.
2. broken code could assume that a non-zero-length value was not a tombstone. This caused tests to fail in more obscure ways.

The commit then fixes broken tests in one of three ways:

- it fixes incorrect assumptions about the MVCC value encoding being equivalent to the roachpb value encoding.
- it updates a few important tests (mostly related to MVCC stats) to work with and without the simple encoding.
- it skips a few unimportant tests when the simple encoding scheme is disabled.
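For readers unfamiliar with metamorphic constants, here is a sketch of what such a knob can look like, assuming the util.ConstantWithMetamorphicTestBool helper used elsewhere in the codebase. The variable name, constant name, and the encodeExtendedMVCCValue helper are illustrative, not the names used by this commit:

```go
// simpleValueEncodingEnabled is randomly set to false in some test runs,
// forcing the extended MVCCValue encoding even for values with empty headers.
var simpleValueEncodingEnabled = util.ConstantWithMetamorphicTestBool(
	"mvcc-value-simple-encoding", true /* defaultValue */)

func encodeMVCCValueForTest(v MVCCValue) ([]byte, error) {
	if !simpleValueEncodingEnabled {
		// Exercise code paths that would otherwise only ever see plain
		// roachpb.Value bytes.
		return encodeExtendedMVCCValue(v) // hypothetical helper
	}
	return EncodeMVCCValue(v)
}
```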
Force-pushed from 081c53a to 720c8cf.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @erikgrinaker, @stevendanna, and @sumeerbhola)
pkg/kv/kvserver/uncertainty/doc.go
line 200 at r27 (raw file):
Previously, sumeerbhola wrote…
MVCC key-value pairs track their ...
Done.
pkg/kv/kvserver/uncertainty/doc.go
line 204 at r27 (raw file):
Previously, sumeerbhola wrote…
should this be: with a greater or equal timestamp than the intent's local timestamp?
my understanding was that the local clock for a node is not advanced if a write has sampled it.
Kind of. HLC clock readings are strictly monotonically increasing, meaning that no two clock readings will be identical. So if a local timestamp is assigned from a reading of the leaseholder's clock, any future clock reading will be greater than this value.
pkg/storage/mvcc.go
line 1434 at r27 (raw file):
Previously, sumeerbhola wrote…
I think of intent == MVCCMetadata. How about calling this curProvisionalValRaw?
Done.
pkg/storage/mvcc.go
line 1500 at r27 (raw file):
Previously, sumeerbhola wrote…
is this inlining unsafeNextVersion, or something more subtle?
Yes, just inlining unsafeNextVersion, which was only used in one place and not worth keeping.
pkg/storage/mvcc.go
line 3167 at r27 (raw file):
Previously, sumeerbhola wrote…
I am trying to make sure I don't overlook something regarding what we need for older versions, for the Pebble design that keeps older versions in value blocks. Earlier, it was keeping just the value size together with the key. Now I think we need an additional bit regarding whether the underlying roachpb.Value is empty or not. Is my understanding correct?
That depends on why it was keeping the value size together with the key. Was that because it was using this size to determine which keys were pointing at tombstones without looking at the value? We can still introduce a fast-path to avoid looking at the value block when the value size is 0, but we will no longer be able to say that a value is not a tombstone by just looking at its value size.
I feel like I may be missing something here. Could you explain why we would need an additional bit regarding whether the underlying roachpb.Value is empty or not?
pkg/storage/pebble.go
line 1310 at r12 (raw file):
Previously, sumeerbhola wrote…
I still see a couple of TODOs in the comment.
Heh, I thought you were talking about s/key/value/ and missed the TODOs. Done.
pkg/storage/enginepb/mvcc.proto
line 72 at r11 (raw file):
Though I suppose one could use the local timestamp from the latest provisional value, since that is the latest local timestamp we know that the txn is still active?
Yes, this would be correct, but I agree with you that it's not worth optimizing for at this point if doing so leads to more complex code.
Thanks for the review, Sumeer! I'm going to go ahead and merge this because it's at risk of skewing with other changes on master if it waits too long. Aayush and I spent some time together walking through the changes last week. We were in agreement on the approach that this PR takes and nothing major came out of the code walk-through that hasn't since been addressed. I still think this could use a careful eye from someone on KV and I'm happy to address feedback in a follow-up PR.

bors r=sumeerbhola
Reviewed 1 of 39 files at r37, 1 of 7 files at r39.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @erikgrinaker, @nvanbenschoten, @stevendanna, and @sumeerbhola)
pkg/storage/mvcc.go
line 3167 at r27 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
That depends on why it was keeping the value size together with the key. Was that because it was using this size to determine which keys were pointing at tombstones without looking at the value? We can still introduce a fast-path to avoid looking at the value block when the value size is 0, but we will no longer be able to say that a value is not a tombstone by just looking at its value size.
I feel like I may be missing something here. Could you explain why we would need an additional bit regarding whether the underlying roachpb.Value is empty or not?
I need to reread some stuff to remember, but the rough idea had been that there are various places (gc, stats etc.) that only needed the size of an older version. At that time I didn't pay full attention to what it did with the size, but now I think it used the size both for this "is-value" determination and for the actual size when it was > 0. So if we want to be able to do similar things without reading the value block, we could keep the "is-value" too. And yes, we can also just do a fast-path which relies on most of the !is-value cases being ones where there is no explicit local timestamp, and the slow-path would read the value.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @erikgrinaker, @nvanbenschoten, @stevendanna, and @sumeerbhola)
pkg/storage/enginepb/mvcc.proto
line 72 at r11 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Though I suppose one could use the local timestamp from the latest provisional value, since that is the latest local timestamp we know that the txn is still active?
Yes, this would be correct, but I agree with you that it's not worth optimizing for at this point if doing so leads to more complex code.
How about a TODO, since we are leaving on the table some fast-forwarding opportunity of the local timestamp? Future PR is fine since this is already running through bors.
Build succeeded.
// Furthermore, even if our pushType is not PUSH_ABORT, we may have ended up
// with the responsibility to abort the intents (for example if we find the
// transaction aborted). To do better here, we need per-intent information
// on whether we need to poison.
resolve := roachpb.MakeLockUpdate(pusheeTxn, roachpb.Span{Key: ws.key})
if pusheeTxn.Status == roachpb.PENDING {
	// The pushee was still PENDING at the time that the push observed its
@nvanbenschoten: To confirm that I understand the idea here, we're saying "we want to move the local timestamp of an intent as close as possible to its mvcc commit timestamp". Is that fair to say?
In other words, with this commit we're leveraging the fact that before every Push, we have the opportunity to look at the intent leaseholder's local clock and update the intent's local timestamp if its txn is found to be PENDING
(in order to move the intent out of the pusher's uncertainty window, since the intent's txn could not have causally preceded the pusher).
If the above sounds good to you, what I don't understand is what would happen if we didn't have this "optimization". I recall from our meeting that you'd alluded to this being more than just an optimization. Without this optimization, a reader might redundantly block on a txn that commits way later and doesn't causally precede the reader, yes?
@aayushshah15 and I talked about this in person and we're on the same page now. Summarizing the discussion below.
In other words, with this commit we're leveraging the fact that before every Push, we have the opportunity to look at the intent leaseholder's local clock and update the intent's local timestamp if its txn is found to be PENDING (in order to move the intent out of the pusher's uncertainty window, since the intent's txn could not have causally preceded the pusher).
Yes, this is correct.
I recall from our meeting that you'd alluded to this being more than just an optimization. Without this optimization, a reader might redundantly block on a txn that commits way later and doesn't causally precede the reader, yes?
This is also mostly correct. Without this, a high-priority pusher that pushes the timestamp of another transaction would still see the pushee's intent as uncertain when it returned to read because the intent would retain its local timestamp. It would then ratchet its read timestamp to that of the pushee and end up in the same situation when it attempted to read again. This would continue indefinitely. In effect, this would allow a high-priority reader to block on a lower-priority writer — a form of priority inversion.
Before this change, we were updating the local clock with each BatchResponse's WriteTimestamp. This was meant to handle cases where the batch request timestamp was forwarded during evaluation. This was unnecessary for two reasons.

The first is that a BatchResponse can legitimately carry an operation timestamp that leads the local HLC clock on the leaseholder that evaluated the request. This has been true since cockroachdb#80706, which introduced the concept of a "local timestamp". This allowed us to remove the (broken) attempt at ensuring that the HLC on a leaseholder always leads the MVCC timestamp of all values in the leaseholder's keyspace (see the update to `pkg/kv/kvserver/uncertainty/doc.go` in that PR).

The second was that it was not even correct. The idea behind bumping the HLC on the response path was to ensure that if a batch request was forwarded to a newer timestamp during evaluation and then completed a write, that forwarded timestamp would be reflected in the leaseholder's HLC. However, this ignored the fact that any forwarded timestamp must have either come from an existing value in the range or from the leaseholder's clock. So if those didn't lead the clock, the derived timestamp wouldn't either. It also ignored the fact that the clock bump here was too late (post-latch release) and if it had actually been needed (it wasn't), it wouldn't have even ensured that the timestamp on any lease transfer led the maximum time of any response served by the outgoing leaseholder.

There are no mixed-version migration concerns of this change, because cockroachdb#80706 ensured that any future-time operation will still continue to use the synthetic bit until all nodes are running v22.2 or later.
…ncasting

This commit adds an explicit `ClockTimestamp` field called `Now` to the `BatchRequest` header, which mirrors the `Now` field on the `BatchResponse` header. In doing so, it removes the last instance where we downcasted a `Timestamp` to a `ClockTimestamp` using the `TryToClockTimestamp` method.

With this change, MVCC ("operation") timestamps never flow back into HLC clocks as clock signals. This was enabled by cockroachdb#80706 and sets the groundwork to remove synthetic timestamps in v23.1 — the role they played in dynamic typing of clock timestamps is now entirely fulfilled by statically typed `ClockTimestamp` channels. This is an important step in separating out the MVCC timestamp domain from the clock timestamp domain and clarifying the roles of the two layers. In turn, this layering opens the door for CockroachDB to start thinking about dynamic clock synchronization error bounds.
63416: sql: emit point deletes during delete fastpath r=yuzefovich a=jordanlewis

Previously, the "deleteRange" SQL operator, which is meant to be a fast-path for cases in which an entire range of keys can be deleted, always did what it said: emitted DeleteRange KV operations. This precludes a crucial optimization: sending point deletes when the list of deleted keys is exactly known.

For example, a query like `DELETE FROM kv WHERE k = 10000` uses the "fast path" delete, since it has a contiguous set of keys to delete, and it doesn't need to know the values that were deleted. But, in this case, the performance is actually worse if we use a DeleteRange KV operation for various reasons (see #53939), because:

- ranged KV writes (DeleteRangeRequest) cannot be pipelined because an enumeration of the intents that they will leave cannot be known ahead of time. They must therefore perform evaluation and replication synchronously.
- ranged KV writes (DeleteRangeRequest) result in ranged intent resolution, which is less efficient (although this became less important since we re-enabled time-bound iterators).

The reason we couldn't previously emit point deletes in this case is that SQL needs to know whether it deleted something or not. This means we can't do a "blind put" of a deletion: we need to actually understand whether there was something that we were "overwriting" with our delete.

This commit modifies the DeleteResponse to always return a boolean indicating whether a key from the DeleteRequest was actually deleted. Additionally, the deleteRange SQL operator detects when it can emit single-key deletes, and does so.

Closes #53939.

Release note (performance improvement): point deletes in SQL are now more efficient during concurrent workloads.

76233: kv: remove clock update on BatchResponse r=nvanbenschoten a=nvanbenschoten

Before this change, we were updating the local clock with each BatchResponse's WriteTimestamp. This was meant to handle cases where the batch request timestamp was forwarded during evaluation. This was unnecessary for two reasons.

The first is that a BatchResponse can legitimately carry an operation timestamp that leads the local HLC clock on the leaseholder that evaluated the request. This has been true since #80706, which introduced the concept of a "local timestamp". This allowed us to remove the (broken) attempt at ensuring that the HLC on a leaseholder always leads the MVCC timestamp of all values in the leaseholder's keyspace (see the update to `pkg/kv/kvserver/uncertainty/doc.go` in that PR).

The second was that it was not even correct. The idea behind bumping the HLC on the response path was to ensure that if a batch request was forwarded to a newer timestamp during evaluation and then completed a write, that forwarded timestamp would be reflected in the leaseholder's HLC. However, this ignored the fact that any forwarded timestamp must have either come from an existing value in the range or from the leaseholder's clock. So if those didn't lead the clock, the derived timestamp wouldn't either. It also ignored the fact that the clock bump here was too late (post-latch release) and if it had actually been needed (it wasn't), it wouldn't have even ensured that the timestamp on any lease transfer led the maximum time of any response served by the outgoing leaseholder.

There are no mixed-version migration concerns of this change, because #80706 ensured that any future-time operation will still continue to use the synthetic bit until all nodes are running v22.2 or later.
85350: insights: ingester r=matthewtodd a=matthewtodd

Closes #81021.

Here we begin observing statements and transactions asynchronously, to avoid slowing down the hot sql execution path as much as possible.

Release note: None

85440: colmem: improve memory-limiting behavior of the accounting helpers r=yuzefovich a=yuzefovich

**colmem: introduce a helper method when no memory limit should be applied**

This commit is a pure mechanical change.

Release note: None

**colmem: move some logic of capacity-limiting into the accounting helper**

This commit moves the logic that was duplicated across each user of the SetAccountingHelper into the helper itself. Clearly, this allows us to de-duplicate some code, but it'll make it easier to refactor the code which is done in the following commit. Additionally, this commit makes a tiny change to make the resetting behavior in the hash aggregator more precise.

Release note: None

**colmem: improve memory-limiting behavior of the accounting helpers**

This commit fixes an oversight in how we are allocating batches of the "dynamic" capacity. We have two related ways for reallocating batches, and both of them work by growing the capacity of the batch until the memory limit is exceeded, and then the batch would be reused until the end of the query execution. This is a reasonable heuristic under the assumption that all tuples in the data stream are roughly equal in size, but this might not be the case. In particular, consider an example when 10k small rows of 1KiB are followed by 10k large rows of 1MiB. According to our heuristic, we happily grow the batch until 1024 in capacity, and then we do not shrink the capacity of that batch, so once the large rows start appearing, we put 1GiB worth of data into a single batch, significantly exceeding our memory limit (usually 64MiB with the default `workmem` setting).

This commit introduces a new heuristic as follows:

- the first time a batch exceeds the memory limit, its capacity is memorized, and from now on that capacity will determine the upper bound on the capacities of the batches allocated through the helper;
- if at any point in time a batch exceeds the memory limit by at least a factor of two, then that batch is discarded, and the capacity will never exceed half of the capacity of the discarded batch;
- if the memory limit is not reached, then the behavior of the dynamic growth of the capacity provided by `Allocator.ResetMaybeReallocate` is still applicable (i.e. the capacities will grow exponentially until coldata.BatchSize()).

Note that this heuristic does not have an ability to grow the maximum capacity once it's been set although it might make sense to do so (say, if after shrinking the capacity, the next five times we see that the batch is using less than half of the memory limit). This is a conscious omission since I want this change to be backported, and never growing seems like a safer choice. Thus, this improvement is left as a TODO.

Also, we still might create batches that are too large in memory footprint in those places that don't use the SetAccountingHelper (e.g. in the columnarizer) since we perform the memory limit check at the batch granularity. However, this commit improves things there so that we don't reuse that batch on the next iteration and will use half of the capacity on the next iteration.

Fixes: #76464.

Release note (bug fix): CockroachDB now more precisely respects the `distsql_workmem` setting which improves the stability of each node and makes OOMs less likely.
**colmem: unexport Allocator.ResetMaybeReallocate**

This commit is a mechanical change to unexport `Allocator.ResetMaybeReallocate` so that the users would be forced to use the method with the same name from the helpers. This required splitting off the tests into two files.

Release note: None

85492: backupccl: remap all restored tables r=dt a=dt

This PR has a few changes, broken down into separate commits:

a) stop restoring tmp tables and remove the special-case code to synthesize their special schemas; These were previously restored only to be dropped so that restored jobs that referenced them would not be broken, but we stopped restoring jobs.
b) synthesize type-change jobs during cluster restore; this goes with not restoring jobs.
c) fix some assumptions in tests/other code about what IDs restored tables have.
d) finally, always assign new IDs to all restored objects, even during cluster restore, removing the need to carefully move conflicting tables or other things around.

Commit-by-commit review recommended.

85930: jobs: make expiration use intended txn priority r=ajwerner a=rafiss

In aed014f these operations were supposed to be changed to use MinUserPriority. However, they weren't using the appropriate txn, so it didn't have the intended effect.

Release note: None

Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Matthew Todd <todd@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
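As an illustration of the capacity-limiting heuristic described in the 85440 "colmem: improve memory-limiting behavior" message above, here is a self-contained sketch. The type and function names are illustrative and this is not the actual SetAccountingHelper code:

```go
// capacityLimiter tracks the discovered upper bound on batch capacity.
type capacityLimiter struct {
	maxCapacity int // 0 means no cap has been discovered yet
}

// nextCapacity picks the capacity for the next batch given the previous
// batch's capacity and memory footprint, following the heuristic above.
func (l *capacityLimiter) nextCapacity(
	prevCap, prevFootprint, memLimit, hardMaxCap int,
) (newCap int, discardPrev bool) {
	if prevFootprint >= 2*memLimit {
		// The batch blew well past the limit: discard it and never allow more
		// than half of its capacity again.
		l.maxCapacity = prevCap / 2
		if l.maxCapacity < 1 {
			l.maxCapacity = 1
		}
		return l.maxCapacity, true
	}
	if prevFootprint >= memLimit {
		// First time at or above the limit: memorize this capacity as the cap.
		if l.maxCapacity == 0 || prevCap < l.maxCapacity {
			l.maxCapacity = prevCap
		}
		return l.maxCapacity, false
	}
	// Below the limit: keep growing exponentially, up to the hard maximum
	// (coldata.BatchSize() in the real code) and any previously discovered cap.
	newCap = prevCap * 2
	if newCap > hardMaxCap {
		newCap = hardMaxCap
	}
	if l.maxCapacity > 0 && newCap > l.maxCapacity {
		newCap = l.maxCapacity
	}
	return newCap, false
}
```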
85764: kv: pass explicit Now timestamp on BatchRequest, remove timestamp downcasting r=nvanbenschoten a=nvanbenschoten

This commit adds an explicit `ClockTimestamp` field called `Now` to the `BatchRequest` header, which mirrors the `Now` field on the `BatchResponse` header. In doing so, it removes the last instance where we downcasted a `Timestamp` to a `ClockTimestamp` using the `TryToClockTimestamp` method.

With this change, MVCC ("operation") timestamps never flow back into HLC clocks as clock signals. This was enabled by #80706 and sets the groundwork to remove synthetic timestamps in v23.1 — the role they played in dynamic typing of clock timestamps is now entirely fulfilled by statically typed ClockTimestamp channels. This is an important step in separating out the MVCC timestamp domain from the clock timestamp domain and clarifying the roles of the two layers. In turn, this layering opens the door for CockroachDB to start thinking about dynamic clock synchronization error bounds.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Informs cockroachdb#101938.

This commit removes logic in mvcc key decoding routines to decode synthetic timestamps. We retain the ability to decode keys with the synthetic timestamp bit set, but we simply ignore its presence. As discussed in the previous commit, the role of these synthetic timestamp markers was eliminated in cockroachdb#80706 by the local_timestamp field in the mvcc value header, which was first present in v22.2. v23.2 does not require compatibility with v22.2, so it can rely on the fact that any txn that has a synthetic timestamp (because it writes in the future) will also write local timestamps into each of its values.

Release note: None
Fixes #36431.
Fixes #49360.
Replaces #72121.
Replaces #77342.
NOTE: this is an alternative to #77342 that stores the new LocalTimestamp in a key-value's value instead of its key.
This commit fixes the potential for a stale read as detailed in #36431 using the "remember when intents were written" approach described in #36431 (comment) and later expanded on in #72121 (comment).
This bug requires a combination of skewed clocks, multi-key transactions split across ranges whose leaseholders are stored on different nodes, a transaction read refresh, and the use of observed timestamps to avoid an uncertainty restart. With the combination of these four factors, it was possible to construct an ordering of events that violated real-time ordering and allowed a transaction to observe a stale read. Upon the discovery of the bug, we introduced the multi-register test to the Jepsen test suite, and have since observed the test fail when combined with the strobe-skews nemesis due to this bug in #49360 (and a few issues linked to that one). This commit stabilizes that test.

Explanation
The combination of all of the factors listed above can lead to the stale read because it breaks one of the invariants that the observed timestamp infrastructure1 relied upon for correctness. Specifically, observed timestamps relied on the guarantee that a leaseholder's clock must always be equal to or greater than the version timestamp of all writes that it has served. However, this guarantee did not always hold. It does hold for non-transactional writes. It also holds for transactions that perform all of their intent writes at the same timestamp and then commit at this timestamp. However, it does not hold for transactions which move their commit timestamp forward over their lifetime before committing, writing intents at different timestamps along the way and "pulling them up" to the commit timestamp after committing.
In violating the invariant, this third case reveals an ambiguity in what it means for a leaseholder to "serve a write at a timestamp". The meaning of this phrase is straightforward for non-transactional writes. However, for an intent write whose original timestamp is provisional and whose eventual commit timestamp is stored indirectly in its transaction record at its time of commit, the meaning is less clear. This reconciliation to move the intent write's timestamp up to its transaction's commit timestamp is asynchronous from the transaction commit (and after it has been externally acknowledged). So even if a leaseholder has only served writes with provisional timestamps up to timestamp 100 (placing a lower bound on its clock of 100), it can be in possession of intents that, when resolved, will carry a timestamp of 200. To uphold the real-time ordering property, this value must be observed by any transaction that begins after the value's transaction committed and was acknowledged. So for observed timestamps to be correct as currently written, we would need a guarantee that this value's leaseholder would never return an observed timestamp < 200 at any point after the transaction commits. But with the transaction commit possibly occurring on another node and with communication to resolve the intent occurring asynchronously, this seems like an impossible guarantee to make.
This would appear to undermine observed timestamps to the point where they cannot be used. However, we can claw back correctness without sacrificing performance by recognizing that only a small fraction2 of transactions commit at a different timestamps than the one they used while writing intents. We can also recognize that if we were to compare observed timestamps against the timestamp that a committed value was originally written (its provisional value if it was once an intent) instead of the timestamp that it had been moved to on commit, then the invariant would hold.
This commit exploits this second observation by adding a second timestamp to each MVCC key-value version called the "local timestamp". The existing version timestamp dictates the key-value's visibility to readers and is tied to the writer's commit timestamp. The local clock timestamp records the value of the local HLC clock on the leaseholder when the key was originally written. It is used to make claims about the relative real time ordering of the key's writer and readers when comparing a reader's uncertainty interval (and observed timestamps) to the key. Ignoring edge cases, readers with an observed timestamp from the key's leaseholder that is greater than the local clock timestamp stored in the key cannot make claims about real time ordering and must consider it possible that the key's write occurred before the read began. However, readers with an observed timestamp from the key's leaseholder that is less than the clock timestamp can claim that the reader captured that observed timestamp before the key was written and therefore can consider the key's write to have been concurrent with the read. In doing so, the reader can avoid an uncertainty restart.
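To make the comparison concrete, here is a small illustration with made-up timestamps, written against the Interval.IsUncertain signature discussed earlier in this review. The numbers and the scenario are hypothetical:

```go
package main

import (
	"fmt"

	"github.com/cockroachdb/cockroach/pkg/kv/kvserver/uncertainty"
	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

func main() {
	// A writer on node N lays down an intent when N's clock reads 10 (local
	// timestamp 10), then commits at timestamp 20 and acknowledges the client.
	// A reader later begins at timestamp 15 with a max clock offset of 10 and
	// an observed timestamp of 12 from N.
	localTs := hlc.ClockTimestamp{WallTime: 10}
	versionTs := hlc.Timestamp{WallTime: 20}
	in := uncertainty.Interval{
		GlobalLimit: hlc.Timestamp{WallTime: 25},      // read timestamp + max offset
		LocalLimit:  hlc.ClockTimestamp{WallTime: 12}, // observed timestamp from N
	}
	// Comparing the observed timestamp (12) against the version timestamp (20)
	// would wrongly conclude the write began after the read and allow a stale
	// read. Comparing it against the local timestamp (10 <= 12) correctly
	// leaves the value uncertain, forcing the reader to restart above 20.
	fmt.Println(in.IsUncertain(versionTs, localTs)) // true
}
```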
For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to each key-value pair, the commit optimizes the clock timestamp away in the common case where it leads the version timestamp. Only in the rare cases where the local timestamp trails the version timestamp (e.g. future-time writes, async intent resolution with a new commit timestamp) does the local timestamp need to be explicitly represented in the value encoding. This is possible because it is safe for the local clock timestamp to be rounded down, as this will simply lead to additional uncertainty restarts. However, it is not safe for the local clock timestamp to be rounded up, as this could lead to stale reads.
MVCCValue
To store the local timestamp, the commit introduces a new MVCCValue type to parallel the MVCCKey type. MVCCValue wraps a roachpb.Value and extends it with MVCC-level metadata which is stored in an enginepb.MVCCValueHeader protobuf struct. To this point, the MVCC layer has treated versioned values as opaque blobs of bytes and has not enforced any structure on them. Now that MVCC will use the value to store metadata, it needs to enforce more structure on the values provided to it. This is the cause of some testing churn, but is otherwise not a problem, as all production code paths were already passing values in the roachpb.Value encoding.

To further avoid any performance hit, MVCCValue has a "simple" and an "extended" encoding scheme, depending on whether the value's header is empty or not. If the value's header is empty, it is omitted in the encoding and the MVCCValue's encoding is identical to that of roachpb.Value. This provides backwards compatibility and ensures that the MVCCValue optimizes away in the common case. If the value's header is not empty, it is prepended to the roachpb.Value encoding. The two encoding scheme variants are distinguished using the 5th byte, which is either the roachpb.Value tag (which has many possible values) or a sentinel tag not used by the roachpb.Value encoding, which indicates the extended encoding scheme.

Care was taken to ensure that the encoding and decoding routines for the "simple" encoding are fast by avoiding heap allocations, memory copies, or function calls by exploiting mid-stack inlining. See microbenchmarks below.
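Since the layout list itself did not survive in this description, here is a sketch of the two variants and of how the 5th-byte check distinguishes them, reconstructed from the encoding fragments quoted earlier in this review. The sentinel value and the exact meaning of the length prefix are assumptions, not verbatim from the PR:

```go
package example

import (
	"encoding/binary"
	"errors"
)

// Simple encoding (identical to roachpb.Value):
//   <4-byte-checksum><1-byte-tag><encoded-data>
// Extended encoding (MVCCValueHeader prepended):
//   <4-byte-header-len><1-byte-sentinel><mvcc-value-header><4-byte-checksum><1-byte-tag><encoded-data>
const extendedEncodingSentinel = 0xe6 // assumed value; any byte unused by roachpb.Value tags

// decodeMVCCValueSketch splits an encoded MVCC value into its optional header
// bytes and the trailing roachpb.Value bytes. It assumes the 4-byte big-endian
// prefix counts the sentinel plus the encoded header.
func decodeMVCCValueSketch(buf []byte) (header, rawValue []byte, err error) {
	if len(buf) < 5 || buf[4] != extendedEncodingSentinel {
		// Simple encoding (or an empty buffer, i.e. a simple-encoded tombstone).
		return nil, buf, nil
	}
	hdrLen := int(binary.BigEndian.Uint32(buf[:4]))
	if hdrLen < 1 || 4+hdrLen > len(buf) {
		return nil, nil, errors.New("malformed extended MVCCValue")
	}
	return buf[5 : 4+hdrLen], buf[4+hdrLen:], nil
}
```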
Future improvements
As noted in #72121 (comment), this commit paves a path towards the complete removal of synthetic timestamps, which were originally introduced in support of non-blocking transactions and GLOBAL tables.
The synthetic bit's first role of providing dynamic typing for ClockTimestamps is no longer necessary now that we never need to "push" transaction-domain timestamps into HLC clocks. Instead, the invariant that underpins observed timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC clock.

The synthetic bit's second role of disabling observed timestamps is replaced by the generalization provided by "local timestamps". Local timestamps precisely track when an MVCC version was written in the leaseholder's clock timestamp domain. This establishes a total ordering across clock observations (local timestamp assignment for writers and observed timestamps for readers) and establishes a partial ordering between writer and reader transactions. As a result, the use of observed timestamps during uncertainty checking becomes a comparison between two ClockTimestamps: the version's local timestamp and the reader's observed timestamp.

Correctness testing
I was not able to stress jepsen/multi-register/strobe-skews hard enough to cause it to fail, even on master. We've only seen the test fail a handful of times over the past few years, so this isn't much of a surprise. Still, this prevents us from saying anything concrete about a reduced failure rate.

However, the commit does add a new test called TestTxnReadWithinUncertaintyIntervalAfterIntentResolution, which controls manual clocks directly and was able to deterministically reproduce the stale read before this fix in a few different ways. After this fix, the test passes.

Performance analysis
This correctness fix will lead to an increased rate of transaction retries under some workloads.
MVCCValue Encoding and Decoding Microbenchmarks
pkg/sql/tests End-To-End Microbenchmarks
YCSB benchmark suite
Release note (bug fix): fixed a rare race condition that could allow for a transaction to serve a stale read and violate real-time ordering under moderate clock skew.
Footnotes
1. See pkg/kv/kvserver/observedts/doc.go for an explanation of the role of observed timestamps in the transaction model. This commit updates that documentation to include this fix.
2. See the analysis in https://github.com/cockroachdb/cockroach/issues/36431#issuecomment-714221846.