Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage/engine: expose FailOnMoreRecent MVCC{Get/Scan}Option #44473

Merged
merged 3 commits into from
Feb 1, 2020

Conversation

nvanbenschoten
Copy link
Member

Relates to #40205.

This change introduces a new option to MVCCGetOptions and MVCCScanOptions that causes reads to throw an error if they observe an MVCC version at a timestamp above their read timestamp. Specifically, when the option is enabled, a WriteTooOldError will be returned if a scan observes an MVCC version with a timestamp above the read. Similarly, a WriteIntentError will be returned if a scan observes another transaction's intent, even if that intent has a timestamp above the scan's read timestamp.

This option will be used in the future by ScanRequests and ReverseScanRequests that intend to acquire unreplicated key-level locks, which will power SELECT FOR UPDATE. These scans do not want to ignore existing MVCC versions at higher timestamps like traditional scans do. Instead, they want to throw errors and force their transaction to increase its timestamp through either a refresh or a retry. This was previously prototyped in 4a8e8dc.

Interestingly, this is not new logic to the MVCC layer. This behavior is exactly the same as that of the initial key-value lookup performed during MVCC writes (see mvccPutInternal). It's fitting that behavior needed for SELECT FOR UPDATE would mirror that already exhibited by the read portion of read-write operations. This also hints at an opportunity to potentially use this option to merge the two implementations and get rid of custom logic like mvccGetInternal that only exists on the write path. We'd need to be careful about doing so though, as this code is heavily tuned.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

This commit commits the changes from `make c-deps-fmt`. This hadn't been
done in a while. I don't want the diff leaking into the next commit.

I was surprised to see that we don't lint against this.
Copy link
Member

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:LGTM: save for a discussion about an unclear interaction with inconsistent.

Folding mvccGetInternal can be possible now that the entire scanner is back in Go, but mvccGetInternal is still more inlined and likely more performant as a result.

Reviewed 20 of 20 files at r1, 9 of 9 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)


c-deps/libroach/mvcc.h, line 287 at r2 (raw file):

      }

      if (fail_on_more_recent_) {

Is it worth adding a && !inconsistent here? Having fail_on_more_recent == true and inconsistent == true makes little sense, so we might as well have fail_on_more_recent_ not do anything in inconsistent mode (and the comment on MVCCGet/MVCCScan would also need to say which bool takes priority).


pkg/storage/engine/rocksdb.go, line 2763 at r2 (raw file):

func writeTooOldToError(readTS hlc.Timestamp, existingCTS C.DBTimestamp) error {
	if existingCTS.wall_time != 0 || existingCTS.logical != 0 {

Nit: Might be cleaner to do the conversion first, then existingTS != hlc.Timestamp{}


pkg/storage/engine/testdata/mvcc_histories/read_fail_on_more_recent, line 27 at r2 (raw file):

#   for op in (get, scan):
#     for ts in (9, 10, 11):
#       for failOnMoreRecent in (false, true):

Might be a good idea to also toggle inconsistent in the k=k1, ts = 9, failOnMoreRecent = true cases.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/storage/engine/mvcc.go, line 2462 at r2 (raw file):

// When scanning in "fail on more recent" mode, a WriteTooOldError will be
// returned if the scan observes a version with a timestamp above the read
// timestamp. Similarly, a WriteIntentError will be returned if the scan

just making sure I understand: the reason we distinguish these two errors is that for the latter the transaction needs to wait behind the transaction with the intent (and possibly push it), so in the new code this will result in an addDiscoveredLock() call on lockTable. Is that correct?


pkg/storage/engine/pebble_mvcc_scanner.go, line 370 at r2 (raw file):

	if !ownIntent {
		// 8. The key contains an intent which was not written by our
		// transaction and our read timestamp is newer than that of the

is this comment stale in light of the changes above, in that the read timestamp is not necessarily newer than the intent?

Relates to cockroachdb#40205.

This change introduces a new option to `MVCCGetOptions` and `MVCCScanOptions`
that causes reads to throw an error if they observe an MVCC version at a
timestamp above their read timestamp. Specifically, when the option is enabled,
a `WriteTooOldError` will be returned if a scan observes an MVCC version with a
timestamp above the read. Similarly, a `WriteIntentError` will be returned if a
scan observes another transaction's intent, even if that intent has a timestamp
above the scan's read timestamp.

This option will be used in the future by `ScanRequests` and `ReverseScanRequests`
that intend to acquire unreplicated key-level locks, which will power `SELECT FOR
UPDATE`. These scans do not want to ignore existing MVCC versions at higher timestamps
like traditional scans do. Instead, they want to throw errors and force their
transaction to increase its timestamp through either a refresh or a retry. This was
previously prototyped in 4a8e8dc.

Interestingly, this is not new logic to the MVCC layer. This behavior is exactly
the same as that of the initial key-value lookup performed during MVCC writes
(see `mvccPutInternal`). It's fitting that behavior needed for `SELECT FOR UPDATE`
would mirror that already exhibited by the read portion of read-write operations.
This also hints at an opportunity to potentially use this option to merge the two
implementations and get rid of custom logic like `mvccGetInternal` that only exists
on the write path. We'd need to be careful about doing so though, as this code is
heavily tuned.

Release note: None
Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

Folding mvccGetInternal can be possible now that the entire scanner is back in Go, but mvccGetInternal is still more inlined and likely more performant as a result.

Yes, I don't think we'll be able to get it to work in practice without having an impact on performance. Instead, I'd like to see us move the other way and explore blind writes into the MVCC keyspace through a combination of the segregated lock table, a reintroduction of the write timestamp cache, and inexact MVCC stats. This will become more pressing as we look towards a "cloud-native" storage hierarchy.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal and @sumeerbhola)


c-deps/libroach/mvcc.h, line 287 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Is it worth adding a && !inconsistent here? Having fail_on_more_recent == true and inconsistent == true makes little sense, so we might as well have fail_on_more_recent_ not do anything in inconsistent mode (and the comment on MVCCGet/MVCCScan would also need to say which bool takes priority).

Very good point. This combination wasn't working before and doesn't really make sense. It's unclear whether it should fail on intents at higher timestamps or return them without failing. I don't want to ignore the combination because that seems error-prone, but I think it's fine to reject it. Done.


pkg/storage/engine/mvcc.go, line 2462 at r2 (raw file):

Previously, sumeerbhola wrote…

just making sure I understand: the reason we distinguish these two errors is that for the latter the transaction needs to wait behind the transaction with the intent (and possibly push it), so in the new code this will result in an addDiscoveredLock() call on lockTable. Is that correct?

Yes, that's exactly correct. A WriteTooOldError just means that the transaction found a write above the time that it was trying to write, so it needs to bump its timestamp. A WriteIntentError means that the transaction found another transaction's intent, so it needs to hit the new addDiscoveredLock code and hop into the lockTable.


pkg/storage/engine/pebble_mvcc_scanner.go, line 370 at r2 (raw file):

Previously, sumeerbhola wrote…

is this comment stale in light of the changes above, in that the read timestamp is not necessarily newer than the intent?

Yes, it was stale for a few reasons. Fixed.


pkg/storage/engine/rocksdb.go, line 2763 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Nit: Might be cleaner to do the conversion first, then existingTS != hlc.Timestamp{}

Done.


pkg/storage/engine/testdata/mvcc_histories/read_fail_on_more_recent, line 27 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Might be a good idea to also toggle inconsistent in the k=k1, ts = 9, failOnMoreRecent = true cases.

Done.

cmdScan was already doing this, so we should be consistent or it becomes
confusing why scan results show up on errors cases but not get results.
@nvanbenschoten
Copy link
Member Author

bors r+

craig bot pushed a commit that referenced this pull request Feb 1, 2020
44473: storage/engine: expose FailOnMoreRecent MVCC{Get/Scan}Option r=nvanbenschoten a=nvanbenschoten

Relates to #40205.

This change introduces a new option to `MVCCGetOptions` and `MVCCScanOptions` that causes reads to throw an error if they observe an MVCC version at a timestamp above their read timestamp. Specifically, when the option is enabled, a `WriteTooOldError` will be returned if a scan observes an MVCC version with a timestamp above the read. Similarly, a `WriteIntentError` will be returned if a scan observes another transaction's intent, even if that intent has a timestamp above the scan's read timestamp.

This option will be used in the future by `ScanRequests` and `ReverseScanRequests` that intend to acquire unreplicated key-level locks, which will power `SELECT FOR UPDATE`. These scans do not want to ignore existing MVCC versions at higher timestamps like traditional scans do. Instead, they want to throw errors and force their transaction to increase its timestamp through either a refresh or a retry. This was previously prototyped in 4a8e8dc.

Interestingly, this is not new logic to the MVCC layer. This behavior is exactly the same as that of the initial key-value lookup performed during MVCC writes (see `mvccPutInternal`). It's fitting that behavior needed for `SELECT FOR UPDATE` would mirror that already exhibited by the read portion of read-write operations. This also hints at an opportunity to potentially use this option to merge the two implementations and get rid of custom logic like `mvccGetInternal` that only exists on the write path. We'd need to be careful about doing so though, as this code is heavily tuned.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig
Copy link
Contributor

craig bot commented Feb 1, 2020

Build succeeded

@craig craig bot merged commit 7759868 into cockroachdb:master Feb 1, 2020
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I don't think we'll be able to get it to work in practice without having an impact on performance. Instead, I'd like to see us move the other way and explore blind writes into the MVCC keyspace through a combination of the segregated lock table, a reintroduction of the write timestamp cache, and inexact MVCC stats. This will become more pressing as we look towards a "cloud-native" storage hierarchy.

If we are going to do blind writes we will need to support a mode where the exclusive lock is acquired only when the preceding read/upgrade/exclusive lock is still held, otherwise we could miss an intervening write since the read, yes?
And why is this useful for "cloud-native" storage, by which you presumably mean higher latency since the files being read may not be local. My understanding is that we currently do read-modify-writes for puts and cputs in order to check the condition and to modify only certain columns in a row, and hold latches while doing so -- is the blind write idea motivated by reducing the duration of latches (though that does not quite matter since non-locking reads at lower timestamps can be concurrent)?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal and @sumeerbhola)

@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/wtoScan branch February 4, 2020 00:53
@nvanbenschoten
Copy link
Member Author

If we are going to do blind writes we will need to support a mode where the exclusive lock is acquired only when the preceding read/upgrade/exclusive lock is still held, otherwise we could miss an intervening write since the read, yes?

This isn't how I was picturing things working, but that does seem like an option. A more straightforward approach would be to reintroduce the write timestamp cache and insert all writes into it. Remember that the timestamp cache is a concurrent in-memory data structure with a strict memory limit and a guarantee that if a key span is inserted with a timestamp, all overlapping queries into the structure will return a timestamp equal to or greater than the insert timestamp. In other words, it efficiently provides an upper bound on all timestamps inserted in a given key range.

by which you presumably mean higher latency since the files being read may not be local.

Yes, that is exactly what I was meaning.

My understanding is that we currently do read-modify-writes for puts and cputs in order to check the condition and to modify only certain columns in a row, and hold latches while doing so

The condition check is true of cputs, but not of puts. Puts perform a read-modify-write strictly to check for conflicting intents and to ensure monotonicity across writes to a given key. The lockTable can now serve this first purpose (and we expect its LSM blocks to always be cached locally) and the write timestamp cache can serve the second purpose (which is necessary for single-key linearizability). So Puts could become blind writes with respect to the LSM, which could speed up workloads like bulk UPSERTs significantly.

is the blind write idea motivated by reducing the duration of latches (though that does not quite matter since non-locking reads at lower timestamps can be concurrent)?

It's not particularly motivated by contention, no. It's primarily motivated by write-heavy workloads like the classic timeseries use-case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants