storage/engine: expose FailOnMoreRecent MVCC{Get/Scan}Option #44473

nvanbenschoten · 2020-01-29T05:43:25Z

Relates to #40205.

This change introduces a new option to MVCCGetOptions and MVCCScanOptions that causes reads to throw an error if they observe an MVCC version at a timestamp above their read timestamp. Specifically, when the option is enabled, a WriteTooOldError will be returned if a scan observes an MVCC version with a timestamp above the read. Similarly, a WriteIntentError will be returned if a scan observes another transaction's intent, even if that intent has a timestamp above the scan's read timestamp.

This option will be used in the future by ScanRequests and ReverseScanRequests that intend to acquire unreplicated key-level locks, which will power SELECT FOR UPDATE. These scans do not want to ignore existing MVCC versions at higher timestamps like traditional scans do. Instead, they want to throw errors and force their transaction to increase its timestamp through either a refresh or a retry. This was previously prototyped in 4a8e8dc.

Interestingly, this is not new logic to the MVCC layer. This behavior is exactly the same as that of the initial key-value lookup performed during MVCC writes (see mvccPutInternal). It's fitting that behavior needed for SELECT FOR UPDATE would mirror that already exhibited by the read portion of read-write operations. This also hints at an opportunity to potentially use this option to merge the two implementations and get rid of custom logic like mvccGetInternal that only exists on the write path. We'd need to be careful about doing so though, as this code is heavily tuned.

Release note: None

cockroach-teamcity · 2020-01-29T05:43:32Z

This change is

This commit commits the changes from `make c-deps-fmt`. This hadn't been done in a while. I don't want the diff leaking into the next commit. I was surprised to see that we don't lint against this.

itsbilal

save for a discussion about an unclear interaction with inconsistent.

Folding mvccGetInternal can be possible now that the entire scanner is back in Go, but mvccGetInternal is still more inlined and likely more performant as a result.

Reviewed 20 of 20 files at r1, 9 of 9 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)

c-deps/libroach/mvcc.h, line 287 at r2 (raw file):

      }

      if (fail_on_more_recent_) {

Is it worth adding a && !inconsistent here? Having fail_on_more_recent == true and inconsistent == true makes little sense, so we might as well have fail_on_more_recent_ not do anything in inconsistent mode (and the comment on MVCCGet/MVCCScan would also need to say which bool takes priority).

pkg/storage/engine/rocksdb.go, line 2763 at r2 (raw file):

func writeTooOldToError(readTS hlc.Timestamp, existingCTS C.DBTimestamp) error {
	if existingCTS.wall_time != 0 || existingCTS.logical != 0 {

Nit: Might be cleaner to do the conversion first, then existingTS != hlc.Timestamp{}

pkg/storage/engine/testdata/mvcc_histories/read_fail_on_more_recent, line 27 at r2 (raw file):

#   for op in (get, scan):
#     for ts in (9, 10, 11):
#       for failOnMoreRecent in (false, true):

Might be a good idea to also toggle inconsistent in the k=k1, ts = 9, failOnMoreRecent = true cases.

sumeerbhola

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)

pkg/storage/engine/mvcc.go, line 2462 at r2 (raw file):

// When scanning in "fail on more recent" mode, a WriteTooOldError will be
// returned if the scan observes a version with a timestamp above the read
// timestamp. Similarly, a WriteIntentError will be returned if the scan

just making sure I understand: the reason we distinguish these two errors is that for the latter the transaction needs to wait behind the transaction with the intent (and possibly push it), so in the new code this will result in an addDiscoveredLock() call on lockTable. Is that correct?

pkg/storage/engine/pebble_mvcc_scanner.go, line 370 at r2 (raw file):

	if !ownIntent {
		// 8. The key contains an intent which was not written by our
		// transaction and our read timestamp is newer than that of the

is this comment stale in light of the changes above, in that the read timestamp is not necessarily newer than the intent?

Relates to cockroachdb#40205. This change introduces a new option to `MVCCGetOptions` and `MVCCScanOptions` that causes reads to throw an error if they observe an MVCC version at a timestamp above their read timestamp. Specifically, when the option is enabled, a `WriteTooOldError` will be returned if a scan observes an MVCC version with a timestamp above the read. Similarly, a `WriteIntentError` will be returned if a scan observes another transaction's intent, even if that intent has a timestamp above the scan's read timestamp. This option will be used in the future by `ScanRequests` and `ReverseScanRequests` that intend to acquire unreplicated key-level locks, which will power `SELECT FOR UPDATE`. These scans do not want to ignore existing MVCC versions at higher timestamps like traditional scans do. Instead, they want to throw errors and force their transaction to increase its timestamp through either a refresh or a retry. This was previously prototyped in 4a8e8dc. Interestingly, this is not new logic to the MVCC layer. This behavior is exactly the same as that of the initial key-value lookup performed during MVCC writes (see `mvccPutInternal`). It's fitting that behavior needed for `SELECT FOR UPDATE` would mirror that already exhibited by the read portion of read-write operations. This also hints at an opportunity to potentially use this option to merge the two implementations and get rid of custom logic like `mvccGetInternal` that only exists on the write path. We'd need to be careful about doing so though, as this code is heavily tuned. Release note: None

nvanbenschoten

TFTRs!

Folding mvccGetInternal can be possible now that the entire scanner is back in Go, but mvccGetInternal is still more inlined and likely more performant as a result.

Yes, I don't think we'll be able to get it to work in practice without having an impact on performance. Instead, I'd like to see us move the other way and explore blind writes into the MVCC keyspace through a combination of the segregated lock table, a reintroduction of the write timestamp cache, and inexact MVCC stats. This will become more pressing as we look towards a "cloud-native" storage hierarchy.

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal and @sumeerbhola)

c-deps/libroach/mvcc.h, line 287 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Is it worth adding a && !inconsistent here? Having fail_on_more_recent == true and inconsistent == true makes little sense, so we might as well have fail_on_more_recent_ not do anything in inconsistent mode (and the comment on MVCCGet/MVCCScan would also need to say which bool takes priority).

Very good point. This combination wasn't working before and doesn't really make sense. It's unclear whether it should fail on intents at higher timestamps or return them without failing. I don't want to ignore the combination because that seems error-prone, but I think it's fine to reject it. Done.

pkg/storage/engine/mvcc.go, line 2462 at r2 (raw file):

Previously, sumeerbhola wrote…

just making sure I understand: the reason we distinguish these two errors is that for the latter the transaction needs to wait behind the transaction with the intent (and possibly push it), so in the new code this will result in an addDiscoveredLock() call on lockTable. Is that correct?

Yes, that's exactly correct. A WriteTooOldError just means that the transaction found a write above the time that it was trying to write, so it needs to bump its timestamp. A WriteIntentError means that the transaction found another transaction's intent, so it needs to hit the new addDiscoveredLock code and hop into the lockTable.

pkg/storage/engine/pebble_mvcc_scanner.go, line 370 at r2 (raw file):

Previously, sumeerbhola wrote…

is this comment stale in light of the changes above, in that the read timestamp is not necessarily newer than the intent?

Yes, it was stale for a few reasons. Fixed.

pkg/storage/engine/rocksdb.go, line 2763 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Nit: Might be cleaner to do the conversion first, then existingTS != hlc.Timestamp{}

Done.

pkg/storage/engine/testdata/mvcc_histories/read_fail_on_more_recent, line 27 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Might be a good idea to also toggle inconsistent in the k=k1, ts = 9, failOnMoreRecent = true cases.

Done.

cmdScan was already doing this, so we should be consistent or it becomes confusing why scan results show up on errors cases but not get results.

nvanbenschoten · 2020-02-01T00:59:53Z

bors r+

44473: storage/engine: expose FailOnMoreRecent MVCC{Get/Scan}Option r=nvanbenschoten a=nvanbenschoten Relates to #40205. This change introduces a new option to `MVCCGetOptions` and `MVCCScanOptions` that causes reads to throw an error if they observe an MVCC version at a timestamp above their read timestamp. Specifically, when the option is enabled, a `WriteTooOldError` will be returned if a scan observes an MVCC version with a timestamp above the read. Similarly, a `WriteIntentError` will be returned if a scan observes another transaction's intent, even if that intent has a timestamp above the scan's read timestamp. This option will be used in the future by `ScanRequests` and `ReverseScanRequests` that intend to acquire unreplicated key-level locks, which will power `SELECT FOR UPDATE`. These scans do not want to ignore existing MVCC versions at higher timestamps like traditional scans do. Instead, they want to throw errors and force their transaction to increase its timestamp through either a refresh or a retry. This was previously prototyped in 4a8e8dc. Interestingly, this is not new logic to the MVCC layer. This behavior is exactly the same as that of the initial key-value lookup performed during MVCC writes (see `mvccPutInternal`). It's fitting that behavior needed for `SELECT FOR UPDATE` would mirror that already exhibited by the read portion of read-write operations. This also hints at an opportunity to potentially use this option to merge the two implementations and get rid of custom logic like `mvccGetInternal` that only exists on the write path. We'd need to be careful about doing so though, as this code is heavily tuned. Release note: None Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

craig · 2020-02-01T01:28:28Z

Build succeeded

GitHub CI (Cockroach)

sumeerbhola

Yes, I don't think we'll be able to get it to work in practice without having an impact on performance. Instead, I'd like to see us move the other way and explore blind writes into the MVCC keyspace through a combination of the segregated lock table, a reintroduction of the write timestamp cache, and inexact MVCC stats. This will become more pressing as we look towards a "cloud-native" storage hierarchy.

If we are going to do blind writes we will need to support a mode where the exclusive lock is acquired only when the preceding read/upgrade/exclusive lock is still held, otherwise we could miss an intervening write since the read, yes?
And why is this useful for "cloud-native" storage, by which you presumably mean higher latency since the files being read may not be local. My understanding is that we currently do read-modify-writes for puts and cputs in order to check the condition and to modify only certain columns in a row, and hold latches while doing so -- is the blind write idea motivated by reducing the duration of latches (though that does not quite matter since non-locking reads at lower timestamps can be concurrent)?

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal and @sumeerbhola)

nvanbenschoten · 2020-02-04T04:49:08Z

If we are going to do blind writes we will need to support a mode where the exclusive lock is acquired only when the preceding read/upgrade/exclusive lock is still held, otherwise we could miss an intervening write since the read, yes?

This isn't how I was picturing things working, but that does seem like an option. A more straightforward approach would be to reintroduce the write timestamp cache and insert all writes into it. Remember that the timestamp cache is a concurrent in-memory data structure with a strict memory limit and a guarantee that if a key span is inserted with a timestamp, all overlapping queries into the structure will return a timestamp equal to or greater than the insert timestamp. In other words, it efficiently provides an upper bound on all timestamps inserted in a given key range.

by which you presumably mean higher latency since the files being read may not be local.

Yes, that is exactly what I was meaning.

My understanding is that we currently do read-modify-writes for puts and cputs in order to check the condition and to modify only certain columns in a row, and hold latches while doing so

The condition check is true of cputs, but not of puts. Puts perform a read-modify-write strictly to check for conflicting intents and to ensure monotonicity across writes to a given key. The lockTable can now serve this first purpose (and we expect its LSM blocks to always be cached locally) and the write timestamp cache can serve the second purpose (which is necessary for single-key linearizability). So Puts could become blind writes with respect to the LSM, which could speed up workloads like bulk UPSERTs significantly.

is the blind write idea motivated by reducing the duration of latches (though that does not quite matter since non-locking reads at lower timestamps can be concurrent)?

It's not particularly motivated by contention, no. It's primarily motivated by write-heavy workloads like the classic timeseries use-case.

nvanbenschoten requested review from itsbilal and sumeerbhola January 29, 2020 05:43

c-deps/libroach: run clang-format

6e7fa8f

This commit commits the changes from `make c-deps-fmt`. This hadn't been done in a while. I don't want the diff leaking into the next commit. I was surprised to see that we don't lint against this.

nvanbenschoten force-pushed the nvanbenschoten/wtoScan branch from 10b529f to 964640e Compare January 30, 2020 05:50

itsbilal approved these changes Jan 30, 2020

View reviewed changes

sumeerbhola reviewed Jan 30, 2020

View reviewed changes

nvanbenschoten force-pushed the nvanbenschoten/wtoScan branch from 964640e to a25a9c9 Compare January 31, 2020 03:35

nvanbenschoten commented Jan 31, 2020

View reviewed changes

storage/engine: output result even on cmdGet error

7759868

cmdScan was already doing this, so we should be consistent or it becomes confusing why scan results show up on errors cases but not get results.

craig bot merged commit 7759868 into cockroachdb:master Feb 1, 2020

sumeerbhola reviewed Feb 1, 2020

View reviewed changes

nvanbenschoten deleted the nvanbenschoten/wtoScan branch February 4, 2020 00:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

storage/engine: expose FailOnMoreRecent MVCC{Get/Scan}Option #44473

storage/engine: expose FailOnMoreRecent MVCC{Get/Scan}Option #44473

nvanbenschoten commented Jan 29, 2020

cockroach-teamcity commented Jan 29, 2020

itsbilal left a comment

sumeerbhola left a comment

nvanbenschoten left a comment

nvanbenschoten commented Feb 1, 2020

craig bot commented Feb 1, 2020

sumeerbhola left a comment

nvanbenschoten commented Feb 4, 2020

storage/engine: expose FailOnMoreRecent MVCC{Get/Scan}Option #44473

storage/engine: expose FailOnMoreRecent MVCC{Get/Scan}Option #44473

Conversation

nvanbenschoten commented Jan 29, 2020

cockroach-teamcity commented Jan 29, 2020

itsbilal left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

nvanbenschoten left a comment

Choose a reason for hiding this comment

nvanbenschoten commented Feb 1, 2020

craig bot commented Feb 1, 2020

Build succeeded

sumeerbhola left a comment

Choose a reason for hiding this comment

nvanbenschoten commented Feb 4, 2020