
storage: separated lock-table keyspace #41720

Closed
nvanbenschoten opened this issue Oct 18, 2019 · 28 comments
Assignees
Labels
A-kv-transactions Relating to MVCC and the transactional model. C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@nvanbenschoten
Member

nvanbenschoten commented Oct 18, 2019

Summary

Move lock portion of write intents to a separate lock-table keyspace instead of storing it inline with MVCC data.

Motivation

Write intents are currently made up of an MVCCMetadata key and a provisional value key. Both are stored inline in the global MVCC keyspace, with the former stored at timestamp 0 of the write intent's key and the latter stored at the transaction's provisional commit timestamp at the time of writing the record.

Here's what this looks like visually for an intent at key "b":

/Table/"a"/123: {val1}
/Table/"b"/0:   {txn: txn1, ts: 999}
/Table/"b"/999: {val2}
/Table/"b"/500: {val3}
/Table/"c"/700: {val4}

We can think of the information stored in the MVCCMetadata key as a "write lock" and the information stored in the provisional value key as the "value under lock".

Storing the provisional value where it likely will remain after the transaction has committed is desirable because we only need to write it once. However, storing the write lock inline with the MVCC data is less of a clear win. Here are the only benefits I can think of for doing so:

  1. can discover locks and read data without needing to read from two different iterators.
  2. ... that's it?

I propose we move it to a separate "lock-table" keyspace for all of the following reasons:

  1. simplifies MVCC logic significantly. A scan can check for conflicts first before looking at data. After this, scanning the data keyspace doesn't need to be aware of transaction isolation at all. We shouldn't even need to push transaction information into MVCC methods, just the read/write timestamp. This complexity has bitten us in the past and continue to (see time-bound iterator complications)
  2. avoids cluttering the MVCC keyspace with RocksDB deletion tombstones at timestamp zero of each key. This slows down scans
  3. opens up the possibility to "blind-write" into the MVCC keyspace (with cooperation from the write timestamp cache if we need to maintain single-key linearizability). This would be a significant win for write-heavy workloads into large LSMs
  4. opens up the possibility to store multiple write intents on a single key (see storage: allow multiple intents #5861)
  5. makes it significantly cleaner to store point and ranged read locks
  6. makes it easier to ignore closed timestamp/timestamp cache when a transaction is updating its own intent (see storage: adaptive closed timestamp target duration #36478 (comment))
  7. makes it easier to augment the durable lock-table with a cache of known committed transactions, which might allow us to resolve intents (release locks) without holding latches. This would reduce the contention footprint of transactions (see kv: block less on intent resolutions, reduce transaction contention footprint #22349)
  8. makes optimizations like using SingleDelete when resolving intents easier to make in isolation (see kvserver: investigate SingleDelete for raft log truncation #8979)
  9. makes scanning the intents on a Range cheap. Multiple other proposals have wanted this access pattern to be improved
  10. @bdarnell believes this would be good for filling up SSTs in the local keyspace

In addition to all of these benefits, I don't think the one downside (multiple storage engine scans) is as bad as it sounds. The lock table's RocksDB block will likely always remain in cache, so it's unlikely that this will have a noticeable impact on performance. In fact, the speedup due to the simplification of MVCC code (not to mention the other optimizations this allows) will likely outweigh the overhead of separated lock checking.

Strawman Proposal

  1. Remove Txn and IntentHistory fields from MVCCMetadata. Remove isolation concerns from MVCC logic.
  2. Create new range-local lock-table keyspace. Store locks at keys with the structure: /Local/<key>/lock/<txnID>. This ends up being in-line with transaction record keys, which isn't really a problem. It hints at the equivalence between these two concepts.
  3. Adjust read and write path to check lock-table after acquiring latches. The spans declared for latches would probably be the right input to check for conflicting locks.
  4. Address one or more of the improvements listed above that can be made with this new structure.
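The key construction in step 2 could be sketched roughly like this (the literal prefixes and separators are illustrative placeholders for the real range-local key encoding in the keys package):

```go
package main

import "fmt"

// lockTableKey builds a key of the proposed shape /Local/<key>/lock/<txnID>.
// The string prefixes here are placeholders; a real implementation would
// use the keys package's binary encodings for range-local keys.
func lockTableKey(key, txnID string) string {
	return "/Local/" + key + "/lock/" + txnID
}

func main() {
	// A lock held by txn1 on key "b" sorts near the transaction record keys.
	fmt.Println(lockTableKey(`"b"`, "txn1"))
}
```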

This will require a fairly serious migration, but it seems like an important prereq for a number of desirable projects.

gz#9005

@nvanbenschoten nvanbenschoten added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. A-kv-transactions Relating to MVCC and the transactional model. labels Oct 18, 2019
@nvanbenschoten nvanbenschoten self-assigned this Oct 18, 2019
@petermattis
Collaborator

This complexity has bitten us in the past and continue to (see time-bound iterator complications)

This sentence appears to be incomplete. Many of the time-bound iterator complications stem from deletion of the metadata key not carrying any timestamp information. How would this proposal address that?

In addition to all of these benefits, I don't think the one downside (multiple storage engine scans) is as bad as it sounds. The lock table's RocksDB block will likely always remain in cache, so it's unlikely that this will have a noticeable impact on performance.

We could test this hypothesis by changing the existing MVCC code to always do a separate seek for the metadata and version key (I'm thinking of MVCCPut here). I suspect the performance hit will be higher than you expect. This would be the worst case, though. I'd be curious to know what this performance hit is so we could judge how many of the improvements need to be done to offset it.

Strawman Proposal

The interleaving of the metadata records with the versioned records is embedded in a lot of MVCC code. It would be worthwhile to enumerate the places that need to get adjusted, such as MVCCComputeStats, MVCCScan, MVCCPut, etc. Many of those routines take a single iterator. With a segregated lock table we'd need them to take two iterators, but presumably at the same snapshot (which is a bit of overhead with RocksDB, though something we could solve with Pebble).

@nvanbenschoten
Member Author

nvanbenschoten commented Oct 21, 2019

How would this proposal address that?

This proposal would get rid of (non-synthetic) metadata keys outside of their use for inline kvs (which have no timestamp). Transaction isolation would be completely separate from the MVCC keyspace. A user of time-bound iterators (along with all other requests) would first check for locks at their desired timestamp. This would all be done based on the declared spans of the request. If none exists, they could scan with a time-bound iterator without needing to worry about whether they should observe intents or not.

With a segregated lock table we'd need them to take two iterators, but presumably at the same snapshot (which is a bit of overhead with RocksDB, though something we could solve with Pebble).

My strawman should have made it more clear that we wouldn't be trying to merge iterators or keep the current MVCC logic. We would check for locks in the lock table before doing the scan. We wouldn't need to scan at the same RocksDB snapshot because we already have in-mem latches to coordinate conflicting requests.

The only case where this would result in different semantics is with limited scans. Right now, a limited scan will ignore an intent if it never reaches it. This proposal as is wouldn't know that it can avoid an intent that it never reaches. We already have this issue with latches though, so I don't think this is really a new issue (see #9521 + #33373).
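Roughly, the read path being described might look like the following sketch (all types and names here are hypothetical, and the lock table is modeled as a plain map rather than a real lock-table iterator):

```go
package main

import "fmt"

// span models a request's declared span; lockTable maps a key to the
// txnID holding a lock on it. Both are simplifications for illustration.
type span struct{ start, end string }

type lockTable map[string]string // key -> holder txnID

// conflicts reports whether any lock falls inside the span. A real
// implementation would seek a lock-table iterator, not walk a map.
func (lt lockTable) conflicts(sp span) (string, bool) {
	for k, txn := range lt {
		if k >= sp.start && k < sp.end {
			return txn, true
		}
	}
	return "", false
}

// read checks the lock table before scanning; latches are assumed to be
// held by the caller. On conflict it would hand off to contention
// handling (push/wait) rather than scan.
func read(lt lockTable, sp span, scan func(span) []string) ([]string, error) {
	if txn, ok := lt.conflicts(sp); ok {
		return nil, fmt.Errorf("conflicting lock held by %s", txn)
	}
	// No conflicts: the scan needs only a read timestamp, no txn info.
	return scan(sp), nil
}

func main() {
	lt := lockTable{"b": "txn1"}
	_, err := read(lt, span{"a", "c"}, func(span) []string { return nil })
	fmt.Println(err)
}
```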

@petermattis
Collaborator

My strawman should have made it more clear that we wouldn't be trying to merge iterators or keep the current MVCC logic. We would check for locks in the lock table before doing the scan. We wouldn't need to scan at the same RocksDB snapshot because we already have in-mem latches to coordinate conflicting requests.

That works for MVCC scan, but what about MVCC put?

@nvanbenschoten
Member Author

It would be roughly the same thing. We'd check for conflicting locks in the lock table before performing the put. If there were no conflicts then the put would write the provisional version kv and add a lock to the lock table. This is how we could get blind writes into the MVCC keyspace. Does that answer your question or am I missing something?

One complication here is when a transaction updates its own write. We'd need to store the sequence history on the lock itself.
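A minimal sketch of that write path, with a hypothetical in-memory lock table and the sequence-history complication modeled on the lock itself:

```go
package main

import "fmt"

// A real lock would be durable and carry epochs and timestamps; seqHistory
// models the complication above, where a transaction rewriting its own
// intent must keep prior sequence numbers on the lock.
type lock struct {
	txnID      string
	seq        int32
	seqHistory []int32 // sequence numbers of overwritten provisional values
}

type store struct {
	locks map[string]*lock
	data  map[string]string // MVCC keyspace, simplified to one version
}

func (s *store) put(key, txnID, val string, seq int32) error {
	if l, ok := s.locks[key]; ok {
		if l.txnID != txnID {
			return fmt.Errorf("conflicting lock held by %s", l.txnID)
		}
		// Rewriting our own intent: push the old seqnum into the history.
		l.seqHistory = append(l.seqHistory, l.seq)
		l.seq = seq
	} else {
		s.locks[key] = &lock{txnID: txnID, seq: seq}
	}
	// No conflict: blind-write the provisional value into the MVCC keyspace.
	s.data[key] = val
	return nil
}

func main() {
	s := &store{locks: map[string]*lock{}, data: map[string]string{}}
	fmt.Println(s.put("b", "txn1", "val2", 1)) // acquires the lock
	fmt.Println(s.put("b", "txn2", "val5", 1)) // blocked by txn1's lock
}
```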

@petermattis
Collaborator

Does that answer your question or am I missing something?

Yes, though I was thinking about the complication you mention below.

One complication here is when a transaction updates its own write. We'd need to store the sequence history on the lock itself.

I was imagining this complication as well. Essentially you're proposing to decompose MVCC operations into lock checking phases and then execution phases. With MVCC put we may need to be passing info from the lock check phase to the execution phase. I only vaguely see how this would be done, though it doesn't seem infeasible.

@sumeerbhola
Collaborator

@nvanbenschoten Regarding "This proposal as is wouldn't know that it can avoid an intent that it never reaches. We already have this issue with latches though ..."
IIUC, latch lifetime is always short, while write intents will live for the duration of the transaction which could be long. Will this change imply one of (a) evaluate the limited scan and then see if it picked up a "value under lock", or (b) pass all the locks down to the limited scan so it can return early if it is seeing a "value under lock"?

@nvanbenschoten
Member Author

latch lifetime is always short, while write intents will live for the duration of the transaction which could be long.

Latch lifetime is often on the order of the latency of replication, so it's not always short. In fact, in a lot of cases, it's roughly the same as the lifetime of a transaction.

Will this change imply one of (a) evaluate the limited scan and then see if it picked up a "value under lock", or (b) pass all the locks down to the limited scan so it can return early if it is seeing a "value under lock"?

I wasn't imagining the change trying to do either of those proposals immediately, but they seem like good follow-on optimizations.

@nvanbenschoten
Member Author

Here are a few other thoughts/clarifications from last week that are now floating somewhere around the Bay Area.

(1) One of the largest remaining abstraction violations of this proposal (which @petermattis picked up on above) is that a transaction will need a delicate interaction with its locks when rewriting intents and when reading below the latest sequence number. This is because we'll likely need to continue storing the intent history on the lock key, which will now move to the lock table keyspace. This could all be made simpler if we included a sequence number suffix in the MVCC keyspace directly. If we did this then keys would look like key/timestamp/sequence.

With this structure, we would then be able to let a transaction write multiple versions of a key directly into the keyspace and not require any coupling with an intent history when reading or writing. If a transaction was reading at an earlier sequence then it would naturally observe only sequence numbers up to the sequence it desires. If we did this then we wouldn't even need to remove old sequence numbers during intent resolution, although I think we'd want to in practice. I think @andy-kimball's AcidLib did something like this, and I can see why.

The big thing I haven't figured out here is how to deal with timestamp collisions between txns. To get the desired effect, I think we would need the portion of the keyspace before the sequence to be unique to a given transaction so that a transaction's own sequence number would never influence the values it can observe from other transactions. Something like key/timestamp/txnid/sequence would do the trick. Of course, this would be too expensive to do if the txnid was a UUID.

This would also be another migration, although it's slightly less daunting than the main proposal here because we could just default missing seqnums to 0.

(2) This proposal would naturally lead to us releasing read latches before performing the read itself. A read could simply grab a read latch, check locks (triggering contention handling if necessary), bump the timestamp cache, and then grab a RocksDB iterator before dropping its latches. The actual MVCC iteration could be done completely outside of latches. This would address a question that @andreimatei has had for a long time.

(3) Allowing intents to be resolved without latching would be beneficial for a few reasons that are listed above. The biggest of these is that conflicting transactions wouldn't need to wait for the intent resolution to go through Raft before proceeding.

In addition to this, I think the change would have another benefit which could simplify conflict resolution. Right now, requests that hit WriteIntentErrors release their latches before handling the conflict. They then push the conflicting transaction or enter the contentionQueue.

If we could efficiently check for intents/locks before evaluation then I think we could have tighter interaction between latching and locking and simplify the transaction pushing interaction. After a transaction has acquired latches it would check for conflicting locks. If it finds any, it could enter a contentionQueue-like queue before dropping its latches and pushing. This would end up working like a condition variable, where the condition is !conflicting_locks && !conflicting_contention_queues and the condition variable's lock is the latch.Manager.

This came up recently in a SELECT FOR UPDATE prototype. Maintaining an in-memory lock structure for "upgrade locks" had the desired effect of reducing transaction retries. However, I was also seeing it push out tail latencies. This appeared to be due to fairness issues around the contentionQueue. Because requests don't maintain latches while in the queue, other requests can slip in ahead of the head of the queue after the transaction holding the lock finishes. This led to a tremendous amount of barging. Some transactions would queue up in the contentionQueue while others would slip by and grab the lock, leading to a fast class of txns and a slow class of txns. Tying together latching and conflict resolution would avoid these issues and improve fairness while also making the interactions easier to think about.
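The condition-variable analogy could be sketched with Go's sync.Cond, where the latch manager's mutex plays the role of the cond var's lock (names here are illustrative, and the condition is reduced to just the lock count):

```go
package main

import (
	"fmt"
	"sync"
)

// Waiters keep their place in line under the latch manager's lock instead
// of dropping out and barging back in later.
type queue struct {
	mu               sync.Mutex
	cond             *sync.Cond
	conflictingLocks int
}

func newQueue() *queue {
	q := &queue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// waitForLocks blocks until no conflicting locks remain, never giving up
// its position: the latch (q.mu) is the condition variable's lock.
func (q *queue) waitForLocks() {
	q.mu.Lock()
	defer q.mu.Unlock()
	for q.conflictingLocks > 0 {
		q.cond.Wait()
	}
}

// releaseLock is called when a conflicting lock is resolved.
func (q *queue) releaseLock() {
	q.mu.Lock()
	q.conflictingLocks--
	q.mu.Unlock()
	q.cond.Broadcast()
}

func main() {
	q := newQueue()
	q.conflictingLocks = 1
	done := make(chan struct{})
	go func() { q.waitForLocks(); close(done) }()
	q.releaseLock()
	<-done
	fmt.Println("acquired after lock release")
}
```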

@petermattis
Collaborator

This would also be another migration, although it's slightly less daunting than the main proposal here because we could just default missing seqnums to 0.

How would we encode the seqnum? MVCC keys are currently encoded as:

<key>\x00[<wall_time>[<logical>]]<#timestamp-bytes>

The last byte in the key is always either 0, 8, or 12, indicating no timestamp, walltime-only, or walltime+logical, respectively. We can't easily append anything else to this format without rewriting all of the keys. I think we can add an optional sequence number. This would give the following meanings to the timestamp-bytes value:

  • 0 - no timestamp or txn-seqnum
  • 8 - walltime, no txn-seqnum
  • 12 - walltime+logical, no txn-seqnum
  • 16 - walltime+txn-seqnum
  • 20 - walltime+logical+txn-seqnum

I think that works. Not exactly what I'd design if starting from scratch, but it does mean we wouldn't have to rewrite all of the data on a store for a migration.
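A sketch of that suffix encoding, assuming an 8-byte txn-seqnum so that the trailing length byte (0, 8, 12, 16, 20) unambiguously identifies which fields are present; the byte layout is illustrative, not the exact production encoding:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeTimestampSuffix sketches the backward-compatible suffix
// [<wall_time>[<logical>][<txn_seqnum>]]<#timestamp-bytes>. The final
// byte records how many timestamp bytes precede it, which is what lets
// old keys (0, 8, 12) coexist with new ones (16, 20).
func encodeTimestampSuffix(wall uint64, logical uint32, seq uint64, hasLogical, hasSeq bool) []byte {
	var buf []byte
	if wall != 0 || hasLogical || hasSeq {
		buf = binary.BigEndian.AppendUint64(buf, wall)
		if hasLogical {
			buf = binary.BigEndian.AppendUint32(buf, logical)
		}
		if hasSeq {
			buf = binary.BigEndian.AppendUint64(buf, seq)
		}
	}
	return append(buf, byte(len(buf))) // trailing length byte
}

func main() {
	for _, s := range [][]byte{
		encodeTimestampSuffix(0, 0, 0, false, false), // inline
		encodeTimestampSuffix(1, 0, 0, false, false), // walltime
		encodeTimestampSuffix(1, 2, 0, true, false),  // +logical
		encodeTimestampSuffix(1, 0, 3, false, true),  // +seqnum
		encodeTimestampSuffix(1, 2, 3, true, true),   // all fields
	} {
		fmt.Println(s[len(s)-1])
	}
}
```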

@nvanbenschoten
Member Author

@petermattis That backward-compatible encoding does seem workable if we go this route. At the moment though, I don't think the rest of the idea is fully workable. The primary blockers to it are:

  1. The "timestamp collisions between txns" issue mentioned above
  2. How read-your-writes would interact with transactions that read and write at different timestamps.

The way I have this working in my prototype is that we have a lockAwareIterator. This iterator can be thought of as a merging iterator pointing into MVCC keys and into the lock table, but which has a few fast-paths for transactions with 0 or a small number of locks (discovered during lock checking). The iterator is configured with a roachpb.Transaction (for the id, epoch, seq_num, and read_timestamp). When MVCC code iterates over a provisional value owned by the reading transaction and associated with a lock in the lock table, the iterator performs a translation step which:

  1. logically moves the provisional value to the read timestamp. This ensures that transactions always read-their-writes in MVCC code, even if they don't realize that they are their own
  2. replaces the value with a value from the sequence history if necessary, based on the seq nums of the lock and the reading transaction

makes it easier to augment the durable lock-table with a cache of known committed transactions, which might allow us to resolve intents (release locks) without holding latches. This would reduce the contention footprint of transactions (see #22349)

The idea here would have reduced the contention footprint of transactions from 3 rounds of distributed consensus down to 2. We could mark locks as COMMITTED before going through the process of actually removing them. Other transactions would then ignore these locks during their lock checking phase. The round of Raft to remove the locks could then be done without holding latches.

There would need to be a translation step for provisional values that are being moved to new timestamps during the intent resolution process. The lockAwareIterator could help out here as well. It could translate the provisional values for COMMITTED-but-not-removed locks to higher timestamps, if necessary.

Reducing the contention footprint from 3 writes to 2 is great, but @tbg just had a fantastic idea that would allow us to reduce it even further, down to just 1 write. The idea would address the longstanding desire to resolve intents while a transaction is implicitly committed (STAGING) instead of explicitly committed (COMMITTED). While in this phase, we could mark the lock as COMMITTED_BUT_NOT_REMOVABLE as soon as we're aware that the transaction's parallel commit has finished. Other transactions would ignore the lock just as before, but no one would be allowed to remove the lock until the lock's owner became explicitly committed. In this way, we could get the lock out of the way of other transactions without removing evidence of an implicitly committed transaction.
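The lock states being discussed could be sketched as a small state machine (names hypothetical; only the blocking and removability predicates matter):

```go
package main

import "fmt"

// Only PENDING locks block other transactions. A lock marked committed,
// whether explicitly or implicitly via a finished parallel commit, is
// ignored by readers; but in the _BUT_NOT_REMOVABLE state it must not be
// deleted yet, since it is still evidence of the STAGING transaction.
type lockState int

const (
	PENDING lockState = iota
	COMMITTED_BUT_NOT_REMOVABLE // implicitly committed (STAGING txn)
	COMMITTED                   // explicitly committed
)

func blocksReaders(s lockState) bool { return s == PENDING }
func removable(s lockState) bool     { return s == COMMITTED }

func main() {
	fmt.Println(blocksReaders(COMMITTED_BUT_NOT_REMOVABLE))
	fmt.Println(removable(COMMITTED_BUT_NOT_REMOVABLE))
	fmt.Println(removable(COMMITTED))
}
```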

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Nov 6, 2019
This commit removes the MVCCGet and MVCCScan methods from engine.Iterator and
uses the rest of the interface to implement these methods as free functions.
This restructuring allows the MVCC operations to support polymorphism of the
iterator, which is what the code was intending to do when originally written.

The code was moved to the current structure as a way of avoiding cgo calls when
using RocksDB's iterator implementation. This is an important optimization when
using RocksDB (but not Pebble) so the commit retains it through optional
specializations of MVCCGet and MVCCScan. RocksDB's iterator implements this
specialization but Pebble's does not need to.

This isn't quite ready for a review. I'm mainly pushing it in case others want
to take a look. It will be used to get the prototype of cockroachdb#41720 up and running.
Benchmarks show about a 0-1% performance regression due to this change. More
testing should be done if we actually want to productionize this.
@sumeerbhola
Collaborator

sumeerbhola commented Nov 6, 2019

@nvanbenschoten I don't fully understand the 3 rounds of distributed consensus and the optimizations to get it down to 2 rounds and then to 1 round.
My understanding of parallel commits is that the 3 rounds are

  • transaction becomes STAGED (this is written with the last batch)
  • transaction becomes COMMITTED
  • intent resolution/lock removal

After the STAGED round and all the writes being successful, the client can be informed of the commit but there are two more rounds before intents are removed. Is that correct? Is the leaseholder for the range with the transaction record supposed to drive the intent resolution? And the second and third rounds cannot be concurrent since they may be at different ranges and if the third races ahead of the second and the second needs to be retried we would have lost the intents that confirmed that the STAGED transaction could be COMMITTED, and would incorrectly think it had failed and remove all its writes?
IIUC, getting it down to 2 rounds relies on having multiple intents/locks, of which at most one has a transaction that is not COMMITTED. Marking these locks COMMITTED would happen in the second round after which new locks can be allowed. And the third round would remove these COMMITTED locks. And the final optimization marks these locks as COMMITTED(_BUT_NOT_REMOVABLE) without a round of consensus (this transition would be best-effort since it is not using raft), the second round of consensus would guarantee that these would be marked COMMITTED and the third round would remove them. Is my understanding correct?

@nvanbenschoten
Member Author

After the STAGED...

Yes, your understanding is correct for all three questions in this paragraph.

Marking these locks COMMITTED would happen in the second round after which new locks can be allowed. And the third round would remove these COMMITTED locks. And the final optimization marks these locks as COMMITTED(_BUT_NOT_REMOVABLE) without a round of consensus (this transition would be best-effort since it is not using raft)

Neither transition needs to go through Raft. This information could all be held in memory while we wait to be able to remove the locks and during the process of removing the locks. I don't think the locks would ever be marked as COMMITTED durably. The important part is that the leaseholder can begin letting new requests observe the result of the known-committed transaction immediately after it hears that the transaction is committed, even if it doesn't remove the lock immediately.

So getting the contention footprint down from 3 rounds to 2 is a clear win. Getting it down from 2 to 1 would require more communication ("a txn is committed so release its locks, but don't remove them"), so it's not obvious that it's a win in all cases, but it certainly would be in contended cases.

@tbg
Member

tbg commented Nov 6, 2019

I don't know if this idea already came up somewhere (it looks like we've discussed it only for non-versioned keys), but if we had a key encoding scheme which contains the seqno and txnid, we could ensure that the versions are never overwritten and thus we'd be able to use SingleDelete when we eventually delete them. I don't know if this would really give us a lot of benefit, though, since versions are only deleted when GC runs, at which point the original write has probably sunk down in the LSM to some degree (definitely with today's 25h TTL).

@petermattis
Collaborator

I don't know if this would really give us a lot of benefit, though, since versions are only deleted when GC runs, at which point the original write has probably sunk down in the LSM to some degree (definitely with today's 25h TTL).

There has been a small bit of water-cooler discussion about performing GC during compactions which would completely eliminate any overhead for GC'ing old data.

@tbg
Member

tbg commented Nov 6, 2019

There has been a small bit of water-cooler discussion

It's here: #16132

@sumeerbhola
Collaborator

@tbg Adding to what Peter said, GC during compactions is one of the ways we can exploit knowledge of the timestamp in the key at the Pebble level. It would need to know which time intervals are needed for backups etc. (similar to how snapshot-aware GC of sequence numbers works in RocksDB and Pebble) in order to know which "dead" versions can be thrown away. When there are dead and live versions in the same sstable, one could track the amount of dead data in each sstable with a coarse age histogram. This could be used to trigger sstable rewrites (independent of normal compactions) based on a simple cost-benefit function. This won't clean all dead versions -- those that can only be known to be dead by looking at a different sstable that does not participate in compactions with this sstable will need to be deleted by a periodic scan of the key space. Another optimization could be to make these key scans cheap by segregating keys and values inside a file.

@nvanbenschoten
Member Author

nvanbenschoten commented Nov 6, 2019

I wonder if the ability to mark locks as non-active without removing them could allow us to get rid of the AbortSpan. A poisoning PushTxnRequest could mark the lock as ABORTED instead of removing it. It could then defer to the transaction itself to actually remove the lock. When scanning the lock table, transactions would take notice of whether their own locks have been ABORTED, which would provide the same protection against zombie transactions that the AbortSpan currently does.

If we got rid of the AbortSpan lookups then this change wouldn't even be introducing any new RocksDB iterations. cc. @andreimatei.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Nov 12, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Nov 13, 2019
craig bot pushed a commit that referenced this issue Nov 13, 2019
42210: storage/engine: lift Go MVCC implementation above Iterator interface r=nvanbenschoten a=nvanbenschoten

This commit removes the `MVCCGet` and `MVCCScan` methods from `engine.Iterator` and uses the rest of the interface to implement these methods as free functions. This restructuring allows the MVCC operations to support polymorphism of the iterator, which is what the code was intending to do when originally written.

The code was moved to the current structure as a way of avoiding cgo calls when using RocksDB's iterator implementation. This is an important optimization when using RocksDB (but not Pebble) so the commit retains it through optional specializations of `MVCCGet` and `MVCCScan`. RocksDB's iterator implements this specialization but Pebble's does not need to.

This isn't quite ready for a review. I'm mainly pushing it in case others want to take a look. It will be used to get the prototype of #41720 up and running. Benchmarks show about a 0-1% performance regression due to this change. More testing should be done if we actually want to productionize this.

cc. @petermattis @itsbilal 

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@sumeerbhola
Collaborator

The segregated lock table also helps bulk operations, based on a discussion with @dt (that I've tried to capture below):

  • The timebounds property collector is not being used to constrain the sstables being read for an incremental backup because of the known bug that the deletion of the MVCCMetadata at timestamp 0 is not seen. This makes incremental backups more expensive. While there is a workaround implemented (but not enabled because of worries about bugs), the segregated lock table will make this simple: only constrain the data scan based on the sstable time bounds, and scan the whole segregated lock table (since it will be small).

  • When a backup encounters an intent, it resolves it and repeats the work, so if it encounters n intents the cost can approach O(n^2) relative to a single data scan. One could reduce this to 2 data scans: one scan to find all the intents and resolve them, then a second scan for the backup. This is simpler with the segregated lock table, since one would first scan the lock table (this scan is cheap because the locks are segregated) and, after resolving intents, scan the data once.

  • For imports and new indices, that are going to add an sstable at a particular timestamp t:

    • it needs to validate that there are no provisional values < t. This requires an expensive scan of the data. With the segregated lock table one only needs a cheap scan of the lock table.
    • the add cannot be transactional since any concurrent transactions that have timestamp > t may miss the data being added. For imports this requires taking the table offline. With the lock table the import could acquire an exclusive lock on the span of the sstable being imported. Importing into a part of the table key space without taking the whole table offline is a feature that has been requested.
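The two-scan backup pattern described above, as a rough sketch with stand-in functions for the real machinery:

```go
package main

import "fmt"

// backup first scans the (small, likely cached) segregated lock table to
// find and resolve intents, then makes exactly one pass over the data
// keyspace. lockKeys, resolve, and scanData are illustrative stand-ins.
func backup(lockKeys []string, resolve func(string), scanData func() int) int {
	for _, k := range lockKeys {
		resolve(k) // cheap: no data scan needed to discover intents
	}
	return scanData() // single data scan, regardless of intent count
}

func main() {
	scans := 0
	rows := backup([]string{"b", "q"},
		func(string) {},
		func() int { scans++; return 42 })
	fmt.Println(rows, scans)
}
```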

itsbilal added a commit to itsbilal/cockroach that referenced this issue Jul 19, 2021
Adds a long running migration, SeparatedIntentsMigration,
that does the following steps:

1) Iterates through all range descriptors using a RangeIterator
2) Calls a new ranged write command, Barrier, that is a no-op
  during evaluation but waits for any other writes on that range
  to finish before returning
3) Calls ScanInterleavedIntents, a new ranged read command, to
  return interleaved intents encountered over the key range
4) Iterates over the returned intents, pushes their txns, and
  resolves any intents returned

The barrier in step 2 ensures that no future writes after that
point will write interleaved intents. The actual disabling
of the setting to write interleaved intents happens in the
next commit.

Part of cockroachdb#41720.

Release note: None.
craig bot pushed a commit that referenced this issue Aug 13, 2021
66445: kvserver: Add migration for interleaved intents to separated r=nvanbenschoten a=itsbilal

Adds a long running migration, SeparatedIntentsMigration,
that does the following steps:

1) Iterates through all range descriptors using a RangeIterator
2) Calls a new ranged write command, Barrier, that is a no-op
  during evaluation but waits for any other writes on that range
  to finish before returning
3) Calls ScanInterleavedIntents, a new ranged read command, to
  return interleaved intents encountered over the key range
4) Iterates over the returned intents, pushes their txns, and
  resolves any intents returned

The barrier in step 2 ensures that no future writes after that
point will write interleaved intents. The actual disabling
of the setting to write interleaved intents happens in the
next commit.

Part of #41720.

Release note: None.

68252: sql: add metric for schema change job user/database errors r=postamar a=fqazi

Fixes: #68111

Previously, we had no way of tracking in our metrics
why a schema change job failed. To address this, this patch
adds metrics for tracking failures due to constraint
violations and database errors.

Release note (sql change): Add new metrics to track schema
change job failures
(sql.schema_changer.errors.all,
sql.schema_changer.errors.constraint_violation,
sql.schema_changer.errors.uncategorized), surfaced in the
crdb_internal.feature_usage table.

Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
@ajwerner
Contributor

Now that the migration is checked in, is it time to close this issue?

@lunevalex
Collaborator

Closing this, as it seems we did all we are going to do in 21.2.
