-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rfc: Non-Blocking Transactions #52745
rfc: Non-Blocking Transactions #52745
Conversation
d5a8f11
to
6e142e9
Compare
One thing I am missing from the guide-level explanation is how this looks like from the perspective of a SQL client: how does one engage a non-blocking transaction? |
@knz the guide-level explanation includes this paragraph:
Do you mind expanding on what you think is missing from that? One thing that certainly is missing in the RFC right now is a full derivation of how long we expect the Commit Wait stage of one of these non-blocking transactions to be in a global cluster. This is touched upon in "Tuning non_blocking_duration", but it would be helpful to give some first-order approximation of what we expect this delay to be in practice. |
I gave this a quick skim this evening and it seems pretty exciting. I think we should design with the SQL in mind from day one as @knz was inquiring about and I like |
|
||
As defined above, "non-blocking transactions" perform locking in such a way that | ||
contending reads do not need to wait on their locks. In practice, this means | ||
that non-blocking reads **perform their writes at a timestamp in advance of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/reads/writes/ or /transactions
Commit Wait rule then it would have been delayed before committing and the stale | ||
read would not be possible. | ||
|
||
But relying on Commit Wait here violates a desirable property of our transaction |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope I'm misunderstanding something here, but isn't this basically saying that our read latencies will spike to ~30ms (uncertainty interval) for each non-blocking write when it becomes visible?
For read-only txns (or non-blocking read-write txns) I hoped we would not need to apply uncertainty to writes that it knows came from a non-blocking txn, but it doesn't seem to quite work. If the non-blocking writer committed the intents only after making sure that the commit timestamp is no longer synthetic (i.e. its clock is past that timestamp), then we're resolving intents too late and readers contend. If we resolve them early and then wait, the readers have to deal with uncertainty.
See if you can polish this section a bit - I had some trouble following "second read", "first read-only transaction", maybe an explicit itemized example with timestamps would be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think/hope I'm just confused here. If it's t=100 now and a transaction future-writes-and-commits at t=1000, only readers with t \in [970,999)
have to worry about uncertainty. But if reader's timestamp was < 1000
when it was chosen (non-synthetically), it can't causally depend on the previous writer if that writer would have waited for its local clock to hit 1030 (future write ts + uncertainty) to ack the commit, which means our dear reader couldn't have started after the writer returned and gotten a timestamp less than 1000.
tracker that is leading "present time" by some duration. As it turns out, | ||
configuring this closed timestamp tracker to lead "present time" by | ||
`non_blocking_duration` is enough to implement "non-blocking transactions" | ||
without any other server-side changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are still changes here, no? We don't do per-range closed timestamps at the moment, so we need the infra for that. But it'd be nice if those were the only changes. I think the quiescence stuff will need some work as it's tied to the store-wide closed timestamp.
Any standard transaction that writes to a "non-blocking Range" will naturally | ||
get pushed into the future. If it had performed writes in the past, these will | ||
eventually get moved up to the new commit timestamp. If it had performed reads | ||
in the past, these will eventually get refreshed up to the new commit timestamp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so they'll get refreshed to a synthetic timestamp? And updating a synthetic timestamp with a non-synthetic one with a >= walltime will result in a non-synthetic one, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for writing this up! When the parallels to Global Optimistic Transactions came up in our last conversation I was intrigued and I'm glad this wasn't just a temporary confusion. I agree that this seems much more reasonable in scope than the other proposals we have discussed (consistent read replicas + global optimistic reads) and at the same time it has what I would call intuitive interactions with our existing transactions.
Are you at all worried about the need for readers to continue to wait out the uncertainty? If you want ~3s local reads but on every write readers are forced into ~30ms (uncertainty) stalls, this is still a ~10x latency spike. Sure better than a 100x one, but still - beyond non-blocking txns, do you foresee us still lowering the consistency levels further still?
Thanks for the reads!
Yes, absolutely. It shouldn't even be that expensive. It's also worth reiterating that both this proposal and the consistent read replicas proposal are applicable for single-region use cases, as they both can provide a form of load balancing that may be important for read-heavy data.
I agree! It was eye-opening to me to dig into how these interact with our standard transaction model.
Good question. I'll say yes and no. We should be very conscious that committing to this proposal is committing to making a long-term commitment to investing in better clock synchronization. Up to this point, clock synchronization has mostly been out of the critical path of performance, outside of impacting txn retry susceptibility. This proposal brings it directly into the equation. I am certainly concerned that reads will still need to wait out a maximum of the clock uncertainty (strawman is 30ms) under contention. Ideally, read/write contention would have no effect on read latency, but without dropping the consistency level, that doesn't seem possible. So obviously, there still is some blocking here in certain cases. I used of the term "non-blocking" here to convey the fact that with this approach, reads will block on the order of clock uncertainty instead of on the order of communication latency. This seems like a highly desirable property for a globally distributed database like Cockroach, because clock uncertainty seems like an easier opponent to fight than the speed of light. So as we think about the future of CockroachDB and Cockroach Cloud, this is where I'd rather place a bet. We know clouds are starting to offer better time services (ref: Amazon Time Sync Service, Cloud TrueTime, etc.) and we're also aware innovation in this area (ref: Tick Tock Networks, etc.).
I'll first point out, because I made the mistake a few times while thinking through this myself, that consistency levels != isolation levels here. Even the lowest isolation levels don't allow stale reads, to the extent that they make any claim about real-time ordering. And so if we want to maintain strong consistency, even at lower isolation levels, there will be times that we need to block to establish ordering between causally related events. The question is whether we block on clock synchronization or communication. So then back to the question of consistency levels: I'm not sure. I think there is a good reason to expose different consistency levels, and yet, it gets very confusing to work with. I can't get through https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels without confusing myself, and I'm not trying to build an application on top of these guarantees. It also gets challenging to answer that question when we start talking about lowering the consistency level of operations within CRDB itself that may be trying to maintain referential integrity. For instance, how do we let users configure foreign key checks to use reduced consistency levels and what does that do to our optimizer's ability to trust this referential integrity. Making foreign key checks work was actually an important litmus test I used when considering different approaches. So to the degree that we can, I'd prefer to expose understandable consistency levels to users (strong consistency + exact/bounded staleness) but optimize for and push the boundaries of strong consistency. I think that's what makes Cockroach special. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
docs/RFCS/20200811_non_blocking_txns.md, line 630 at r1 (raw file):
Complexity. But not overly so, especially compared to Consistent Read Replicas. Only effective if we can reduce the clock uncertainty interval dramatically.
Why? I see that transactions that write to non-blocking ranges need to wait for a duration that is a function of the uncertainty interval. But as long as reads are unaffected by uncertainty (which I think is the case but I'm not certain I've understood this correctly), I think there'd still be demand for this. There are a lot of reference table use cases that could tolerate write times measured in seconds as long as reads were fast.
Reducing the uncertainty interval and speeding up non-blocking txn writes would of course be great, but it gets more difficult/expensive as you approach zero. We could easily knock a few hundred milliseconds off our uncertainty bound (this is safe today in many environments). Getting down to 30ms is probably doable, but if it only affects writes I'm not sure it's necessary, and if it affects reads I'm not sure it's enough.
6e142e9
to
000779b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, and @tbg)
docs/RFCS/20200811_non_blocking_txns.md, line 352 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
s/reads/writes/ or /transactions
Done.
docs/RFCS/20200811_non_blocking_txns.md, line 501 at r1 (raw file):
I hope I'm misunderstanding something here, but isn't this basically saying that our read latencies will spike to ~30ms (uncertainty interval) for each non-blocking write when it becomes visible?
Yes, it is. This wasn't clear to either you or Ben, so the RFC is definitely not making this clear. I added two transaction latency comparison tables to the guide-level explanation to make this more clear.
See if you can polish this section a bit - I had some trouble following "second read", "first read-only transaction", maybe an explicit itemized example with timestamps would be helpful.
Added an example timeline here.
Hmm, I think/hope I'm just confused here. If it's t=100 now and a transaction future-writes-and-commits at t=1000, only readers with t \in [970,999) have to worry about uncertainty. But if reader's timestamp was < 1000 when it was chosen (non-synthetically), it can't causally depend on the previous writer if that writer would have waited for its local clock to hit 1030 (future write ts + uncertainty) to ack the commit, which means our dear reader couldn't have started after the writer returned and gotten a timestamp less than 1000.
The writer will only wait until its local clock hits 1000 before acking the commit, so a causally dependent reader can have a clock as low as 970.
docs/RFCS/20200811_non_blocking_txns.md, line 556 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
There are still changes here, no? We don't do per-range closed timestamps at the moment, so we need the infra for that. But it'd be nice if those were the only changes. I think the quiescence stuff will need some work as it's tied to the store-wide closed timestamp.
Yep, definitely some engineering work here. On one end of the spectrum, we could give every range their own closed timestamp subsystem (clearly unrealistic). On the other, we could have one for normal ranges and one for non-blocking Ranges. Or maybe something in the middle. @ajwerner has been talking about configuring closed timestamps differently for different ranges/groups of ranges for some time.
We can flesh more of this out if we decide to push the rest of this proposal forward. For now, it seems like a known quantity.
docs/RFCS/20200811_non_blocking_txns.md, line 561 at r1 (raw file):
so they'll get refreshed to a synthetic timestamp?
Yes, added.
And updating a synthetic timestamp with a non-synthetic one with a >= walltime will result in a non-synthetic one, right?
Good point, added.
docs/RFCS/20200811_non_blocking_txns.md, line 630 at r1 (raw file):
which I think is the case but I'm not certain I've understood this correctly
No, this, unfortunately, isn't the case. This tripped up you and Tobi, so it definitely wasn't clear enough. I added two transaction latency comparison tables to the guide-level explanation to make this more clear and an example in the Uncertainty Intervals: To Wait or Not to Wait
section to demonstrate why readers have to wait in the cases where they see a value in their uncertainty interval.
As such, the uncertainty bounds necessary for this proposal to make sense are | ||
much lower than what we currently use, but are also well within the realm of | ||
possibility for software-only solutions. Atomic clocks and TrueTime are not | ||
prerequisites! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe point out that Amazon has its Time Sync Service, which should enable us to take advantage of hardware in CC:
Today we’re launching Amazon Time Sync Service, a time synchronization service
delivered over Network Time Protocol (NTP) which uses a fleet of redundant satellite-
connected and atomic clocks in each region to deliver a highly accurate reference clock.
Could also link to the Huygens paper: https://www.usenix.org/system/files/conference/nsdi18/nsdi18-geng.pdf
badly pessimizes write latency. On the surface, this seems to be the expected | ||
tradeoff. The problem is that when reads and writes contend, reads can get | ||
blocked on writes and have to wait for writing transactions to complete. If the | ||
writing transaction is implicit, this results in at least 1 WAN RTT of blocking. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"implicit" meaning no BEGIN TRANSACTION?
the standard read-write transaction that performs locking in a manner such that | ||
contending reads by other transactions can avoid waiting on its locks. | ||
|
||
The RFC then introduces the term "non-blocking Range", which is a Range in which |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the term be a non-blocking keyrange
instead? When I hear "range", I immediately wonder if splits/merges can change the boundaries of a non-blocking Range. But I think what this proposal means by a "non-blocking Range" is just a (start,end) keyrange that is independent of splits/merges. Even if "Range" is the accepted KV terminology for this concept, at the very least add some clarifying language here for those not well-versed in KV-speak.
> In addition to tables that are immutable, it is common for schemas to have some | ||
> tables that are mutable but only rarely mutated. An example of this is a user | ||
> profile table in an application for a global user base. This table may be read | ||
> frequently across the world and it expected to be updated infrequently. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"and yet it is expected"
|
||
Merging a "synthetic" and "real" timestamp, typically done using a `Forward` | ||
operation obeys the following rule: the `synthetic` bit from the larger of the | ||
two timestamps is carrier over to the result. In the case of a tie between a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
carrier => carried
##### Timestamp.Forward | ||
|
||
Merging a "synthetic" and "real" timestamp, typically done using a `Forward` | ||
operation obeys the following rule: the `synthetic` bit from the larger of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain your reasons as to why this is the right algorithm?
For "real" timestamps, it is cheap to implement this operation by ratcheting a | ||
few integers within the HLC because we know that the timestamp must have come | ||
from another node, which itself must have been ahead of the slowest node by less | ||
that `mox_offset`, so the ratcheting operation will not push our HLC further |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mox_offset => max_offset
between the idea of writing in the future and closing time in the future. It is | ||
interesting to point out that in systems like Spanner and YB which assign | ||
monotonically increasing timestamps to all writes within a replication group and | ||
therefore "instantaneously" close out time, these two operations are one in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
=> "one and the same"
For cases where RTT << uncertainty interval (e.g. because environment requires 100's ms interval, or because RTT is local within region), do we really need to wait out that entire interval when we encounter a write? Couldn't the follower replica send a message to the leaseholder in order to ensure its current time is >= leaseholder current time (equivalent to reading from the leaseholder)? And in the worst case, the follower's transaction would block for <= uncertainty interval, no matter how many non-stale follower reads it did. Reason I ask is b/c, if possible, I'd like this proposal to work at least as well as the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For cases where RTT << uncertainty interval (e.g. because environment requires 100's ms interval, or because RTT is local within region), do we really need to wait out that entire interval when we encounter a write? Couldn't the follower replica send a message to the leaseholder in order to ensure its current time is >= leaseholder current time (equivalent to reading from the leaseholder)?
So you're suggesting essentially to start redirecting all the reads to the leaseholder when they're uncertain? That's an interesting idea I think, although it kinda messes with the load-balancing hopes of these proposals.
One thing I'd like to elevate here is the manner in which the synthetic timestamps spread through the system - they're not just localized to the non-blocking ranges. This is illustrated the non-blocking txn's read / standard write
cell in the txn interaction table: reads from non-blocking txns can now disrupt "standard writes" on random ranges. Basically we don't just have "writes in the future" to worry about, but also "reads in the future". Perhaps this isn't the biggest concern in the world given how "non-blocking txns" are supposed to be pretty rare...
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, @nvanbenschoten, and @tbg)
docs/RFCS/20200811_non_blocking_txns.md, line 158 at r2 (raw file):
If the writing transaction is doing other work, this blocking can be unbounded.
To make it fair, I'd add an asterisk here explaining that this unbounded wait is a problem for all proposals, that seems to be addressable also for all proposals similarly.
docs/RFCS/20200811_non_blocking_txns.md, line 706 at r2 (raw file):
standard read | both bump ts cache, no interaction | bump ts cache, may bump write ts, but still below present time | both bump ts cache, no interaction | bump ts cache, no interaction standard write | read ignores write if below, read waits on write if above | 2nd write waits on 1st write, bumps write ts above 1st write | read waits on write | non-blocking write waits on standard write non-blocking txn's read | both bump ts cache, no interaction | **bump ts cache, bump write ts, standard write becomes non-blocking write, must wait on commit** | both bump ts cache, no interaction | bump ts cache, may bump write ts
This case is interesting, isn't it. I kinda woke up thinking about it too, as I was thinking about how the synthetic timestamps spread. Standard writers become infected by previous reads, and so now they have to do the commit-wait. And this commit wait is RTT+max_offset
, not just one or the other (right?), so they can be quite disruptive to traffic. Non-blocking txns are all about how their writes are less disruptive, but now their reads can be quite disruptive... Not taking any sides, but I'd call this out in the comparison section. A saving grace is that, as the previous sections says, a transaction only becomes a "non-blocking" txn when it does some writing, so read-only txns to not cause disruption.
Similarly for the non-blocking txn's write / standard write case below, although that one is less concerning to me.
I don't think the load-balancing story will be materially impacted. Assuming these tables are read-mostly, with very infrequent writes, there should be minimal leaseholder traffic caused by these timestamp sync'ing calls (these calls are only made when uncertain reads happen, after all). And the messages that do get sent should be tiny; it's just a timestamp sync; no rows are actually read nor returned. I'd expect a tiny overhead from doing this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, @nvanbenschoten, and @tbg)
docs/RFCS/20200811_non_blocking_txns.md, line 501 at r1 (raw file):
The writer will only wait until its local clock hits 1000 before acking the commit, so a causally dependent reader can have a clock as low as 970.
Right, but what if we waited out the clock uncertainty at the writer as well?
000779b
to
3685609
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For cases where RTT << uncertainty interval (e.g. because environment requires 100's ms interval, or because RTT is local within region), do we really need to wait out that entire interval when we encounter a write? Couldn't the follower replica send a message to the leaseholder in order to ensure its current time is >= leaseholder current time (equivalent to reading from the leaseholder)? And in the worst case, the follower's transaction would block for <= uncertainty interval, no matter how many non-stale follower reads it did.
This is an interesting idea. At first, I thought it would work, but now, I'm not so sure. On its own, it works because the leaseholder's clock is monotonic and so we preserve linearizability. However, it doesn't appear to compose with the current proposal because the leaseholder's clock is not guaranteed to lead all followers' clocks. Here's an example of how this could go wrong:
clocks:
leaseholder: 50
follower A (close, cheaper to communicate): 50
follower B (far away, cheaper to wait): 75
max_clock_offset: 50
value @ 100
1. read on follower B
-- observes value in uncertainty interval
-- waits 25s
-- returns value and commits
clocks:
leaseholder: 75
follower A (close): 75
follower B (far away): 100
max_clock_offset: 50
2. dependent read on follower A
-- observes value in uncertainty interval
-- forwards to leaseholder
-- value not in uncertainty interval
-- ignores value
-- linearizability violated!
So even when we consult with the leaseholder, we still need to respect uncertainty intervals. For this kind of this to work, I think we'd need to switch to waiting out uncertainty on commit like Spanner does. So instead of saying that after a commit, all clocks will be within max_clock_offset
away from any values written, we'd need to be able to say that after a commit, all clocks will be equal to or above any values written.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @andy-kimball, @bdarnell, @nvanbenschoten, and @tbg)
docs/RFCS/20200811_non_blocking_txns.md, line 501 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
The writer will only wait until its local clock hits 1000 before acking the commit, so a causally dependent reader can have a clock as low as 970.
Right, but what if we waited out the clock uncertainty at the writer as well?
That still wouldn't be enough. Linearizability doesn't just talk about the relationship between writes and reads. It also talks about the relationship between multiple reads. If we were to wait out the clock uncertainty on the write then we would ensure that all causally dependent events observe the write. However, we wouldn't ensure that two concurrent reads respect their real-time ordering with each other. The new example here actually demonstrates exactly this.
docs/RFCS/20200811_non_blocking_txns.md, line 59 at r2 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
Maybe point out that Amazon has its Time Sync Service, which should enable us to take advantage of hardware in CC:
Today we’re launching Amazon Time Sync Service, a time synchronization service delivered over Network Time Protocol (NTP) which uses a fleet of redundant satellite- connected and atomic clocks in each region to deliver a highly accurate reference clock.
Could also link to the Huygens paper: https://www.usenix.org/system/files/conference/nsdi18/nsdi18-geng.pdf
Done.
docs/RFCS/20200811_non_blocking_txns.md, line 155 at r2 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
"implicit" meaning no BEGIN TRANSACTION?
Yes, just a single-statement transaction.
docs/RFCS/20200811_non_blocking_txns.md, line 158 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
If the writing transaction is doing other work, this blocking can be unbounded.
To make it fair, I'd add an asterisk here explaining that this unbounded wait is a problem for all proposals, that seems to be addressable also for all proposals similarly.
I just removed the sentence.
docs/RFCS/20200811_non_blocking_txns.md, line 301 at r2 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
"and yet it is expected"
Done.
docs/RFCS/20200811_non_blocking_txns.md, line 460 at r2 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
Can you explain your reasons as to why this is the right algorithm?
Done.
docs/RFCS/20200811_non_blocking_txns.md, line 461 at r2 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
carrier => carried
Done.
docs/RFCS/20200811_non_blocking_txns.md, line 491 at r2 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
mox_offset => max_offset
Done.
docs/RFCS/20200811_non_blocking_txns.md, line 651 at r2 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
=> "one and the same"
Done.
docs/RFCS/20200811_non_blocking_txns.md, line 706 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
This case is interesting, isn't it. I kinda woke up thinking about it too, as I was thinking about how the synthetic timestamps spread. Standard writers become infected by previous reads, and so now they have to do the commit-wait. And this commit wait is
RTT+max_offset
, not just one or the other (right?), so they can be quite disruptive to traffic. Non-blocking txns are all about how their writes are less disruptive, but now their reads can be quite disruptive... Not taking any sides, but I'd call this out in the comparison section. A saving grace is that, as the previous sections says, a transaction only becomes a "non-blocking" txn when it does some writing, so read-only txns to not cause disruption.Similarly for the non-blocking txn's write / standard write case below, although that one is less concerning to me.
Absolutely, everything you said is true. A non-blocking transaction is disruptive to conflicting writes both on non-blocking ranges and off non-blocking ranges. But it should never be disruptive to conflicting reads (by read-only or read-write txns).
I'll add to the discussion section.
3685609
to
0da9c08
Compare
0da9c08
to
0243f65
Compare
@bdarnell @petermattis I've updated this RFC for clarity and to better contextualize it within the rest of the thinking around "Making Global Easy". I'm pretty happy about how the proposal is turning out. It's the first one that I've liked more after sitting on it for a week, rather than less. I'd love to get your thoughts on it when you get a chance. I'll also draw attention to the Implementation Touch Points section, which highlights how little work this ends up being. This is a testament to how this approach extends existing concepts in the CockroachDB architecture (i.e. uncertainty intervals, read refreshes, closed timestamps, learner replicas) instead of introducing new concepts. Notably, there are no new RPCs and no new liveness mechanisms. The few extensions that the proposal does make to existing concepts are universally applicable and reduce to trivial no-ops for the standard transaction model. I heard there was concern about this proposal being complex, but I'd argue it's actually the least complex of the ideas floated so far for solving the problem of global, consistent reads. |
Informs cockroachdb#52745. This commit introduces the concept of a RangeClosedTimestampPolicy, which represents the policy used by the leaseholder of a range to establish and publish closed timestamps. The policy dictates how far in the past (lag) or in the future (lead) MVCC history is closed off at. Currently, there are two RangeClosedTimestampPolicy: - LAG_BY_CLUSTER_SETTING - LEAD_FOR_GLOBAL_READS After introducing these policies, the commit teaches the RangeCache about this information. In addition to a range's descriptor and lease, the cache will now maintain an up-to-date understanding of each range's closed timestamp policy. Finally, the commit adds the policy to ClientRangeInfo and RangeInfo, so that the client <-> server RangeInfo protocol will ensure that the kv client is informed of each Range's closed timestamp policy and kept up to date when its cached information has gone stale. Now that the kv client is aware of which ranges have configured their closed timestamps to serve global reads, it will be able to use this information in `CanSendToFollower` to target follower replicas for non-stale reads in read-only and read-write transactions. Release note: None
Informs cockroachdb#52745. This commit introduces the concept of a RangeClosedTimestampPolicy, which represents the policy used by the leaseholder of a range to establish and publish closed timestamps. The policy dictates how far in the past (lag) or in the future (lead) MVCC history is closed off at. Currently, there are two RangeClosedTimestampPolicy: - LAG_BY_CLUSTER_SETTING - LEAD_FOR_GLOBAL_READS After introducing these policies, the commit teaches the RangeCache about this information. In addition to a range's descriptor and lease, the cache will now maintain an up-to-date understanding of each range's closed timestamp policy. Finally, the commit adds the policy to ClientRangeInfo and RangeInfo, so that the client <-> server RangeInfo protocol will ensure that the kv client is informed of each Range's closed timestamp policy and kept up to date when its cached information has gone stale. Now that the kv client is aware of which ranges have configured their closed timestamps to serve global reads, it will be able to use this information in `CanSendToFollower` to target follower replicas for non-stale reads in read-only and read-write transactions. Release note: None
Informs cockroachdb#52745. This commit introduces the concept of a RangeClosedTimestampPolicy, which represents the policy used by the leaseholder of a range to establish and publish closed timestamps. The policy dictates how far in the past (lag) or in the future (lead) MVCC history is closed off at. Currently, there are two RangeClosedTimestampPolicy: - LAG_BY_CLUSTER_SETTING - LEAD_FOR_GLOBAL_READS After introducing these policies, the commit teaches the RangeCache about this information. In addition to a range's descriptor and lease, the cache will now maintain an up-to-date understanding of each range's closed timestamp policy. Finally, the commit adds the policy to ClientRangeInfo and RangeInfo, so that the client <-> server RangeInfo protocol will ensure that the kv client is informed of each Range's closed timestamp policy and kept up to date when its cached information has gone stale. Now that the kv client is aware of which ranges have configured their closed timestamps to serve global reads, it will be able to use this information in `CanSendToFollower` to target follower replicas for non-stale reads in read-only and read-write transactions. Release note: None
59505: kv: cache RangeClosedTimestampPolicy in RangeCache, keep up to date r=nvanbenschoten a=nvanbenschoten Informs #52745. This commit introduces the concept of a RangeClosedTimestampPolicy, which represents the policy used by the leaseholder of a range to establish and publish closed timestamps. The policy dictates how far in the past (lag) or in the future (lead) MVCC history is closed off. Currently, there are two RangeClosedTimestampPolicy: - `LAG_BY_CLUSTER_SETTING` - `LEAD_FOR_GLOBAL_READS` After introducing these policies, the commit teaches the RangeCache about this information. In addition to a range's descriptor and lease, the cache will now maintain an up-to-date understanding of each range's closed timestamp policy. Finally, the commit adds the policy to ClientRangeInfo and RangeInfo, so that the client <-> server RangeInfo protocol will ensure that the kv client is informed of each Range's closed timestamp policy and kept up to date when its cached information has gone stale. Now that the kv client is aware of which ranges have configured their closed timestamps to serve global reads, it will be able to use this information in `CanSendToFollower` to target follower replicas for non-stale reads in read-only and read-write transactions. Release note: None Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Informs cockroachdb#52745. This commit introduces the concept of a RangeClosedTimestampPolicy, which represents the policy used by the leaseholder of a range to establish and publish closed timestamps. The policy dictates how far in the past (lag) or in the future (lead) MVCC history is closed off at. Currently, there are two RangeClosedTimestampPolicy: - LAG_BY_CLUSTER_SETTING - LEAD_FOR_GLOBAL_READS After introducing these policies, the commit teaches the RangeCache about this information. In addition to a range's descriptor and lease, the cache will now maintain an up-to-date understanding of each range's closed timestamp policy. Finally, the commit adds the policy to ClientRangeInfo and RangeInfo, so that the client <-> server RangeInfo protocol will ensure that the kv client is informed of each Range's closed timestamp policy and kept up to date when its cached information has gone stale. Now that the kv client is aware of which ranges have configured their closed timestamps to serve global reads, it will be able to use this information in `CanSendToFollower` to target follower replicas for non-stale reads in read-only and read-write transactions. Release note: None
57077: roachpb/storage: introduce localUncertaintyLimit, teach MVCC about synthetic timestamps r=nvanbenschoten a=nvanbenschoten Informs #52745. Informs #36431. This PR introduces a new `localUncertaintyLimit` parameter to MVCC. The PR also renames the existing `Txn.MaxTimestamp` to `Txn.GlobalUncertaintyLimit`. `localUncertaintyLimit` is introduced alongside `GlobalUncertaintyLimit` for two reasons: 1. it allows us to stop modifying `global_uncertainty_limit` (formerly `max_timestamp`) when applying an `observed_timestamp` to a transaction when evaluating on a replica. Instead, we can keep the `localUncertaintyLimit` directly on the stack, which better represents its lifetime. 2. it allows MVCC logic to distinguish between the pre- and post-observed timestamp uncertainty interval, depending on whether a possibly-uncertain value has a standard or synthetic timestamp. The field is defined as follows: > LocalUncertaintyLimit is the transaction's GlobalUncertaintyLimit, reduce > by any observed timestamp that the transaction has acquired from the node > that holds the lease for the range that the transaction is evaluating on. > > The local uncertainty limit can reduce the uncertainty interval applied > to most values on a range. This can lead to values that would otherwise > be considered uncertain by the original global uncertainty limit to be > considered "certainly concurrent", and thus not causally related, with > the transaction due to observed timestamps. > > However, the local uncertainty limit does not apply to all values on a > range. Specifically, values with "synthetic timestamps" must use the > transaction's original global uncertainty limit for the purposes of > uncertainty, because observed timestamps do not apply to values with > synthetic timestamps. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Fixes cockroachdb#57687. Related to cockroachdb#52745. This commit introduces a "commit-wait" sleep stage after a transaction commits, which is entered if doing so is deemed necessary for consistency. By default, commit-wait is only necessary for transactions that commit with a future-time timestamp that leads the local HLC clock. This is because CockroachDB's consistency model depends on all transactions waiting until their commit timestamp is below their gateway clock. In doing so, transactions ensure that at the time that they complete, all other clocks in the system (i.e. on all possible gateways) will be no more than the max_offset below the transaction's commit timestamp. This property ensures that all causally dependent transactions will have an uncertainty interval (see GlobalUncertaintyLimit) that exceeds the original transaction's commit timestamp, preventing stale reads. Without the wait, it would be possible for a read-write transaction to write a future-time value and then for a causally dependent transaction to read below that future-time value, violating "read your writes". The property must also hold for read-only transactions, which may have a commit timestamp in the future due to an uncertainty restart after observing a future-time value in their uncertainty interval. In such cases, the property that the transaction must wait for the local HLC clock to exceed its commit timestamp is not necessary to prevent stale reads, but it is necessary to ensure monotonic reads. Without the wait, it would be possible for a read-only transaction coordinated on a gateway with a fast clock to return a future-time value and then for a causally dependent read-only transaction coordinated on a gateway with a slow clock to read below that future-time value, violating "monotonic reads". In practice, most transactions do not need to wait at all, because their commit timestamps were pulled from an HLC clock (i.e. are not synthetic) and so they will be guaranteed to lead the local HLC's clock, assuming proper HLC time propagation. Only transactions whose commit timestamps were pushed into the future will need to wait, like those who wrote to a global_read range and got bumped by the closed timestamp or those who conflicted (write-read or write-write) with an existing future-time value. However, CockroachDB also supports a stricter model of consistency through its "linearizable" flag. When in linearizable mode (also known as "strict serializable" mode), all writing transactions (but not read-only transactions) must wait an additional max_offset after committing to ensure that their commit timestamp is below the current HLC clock time of any other node in the system. In doing so, all causally dependent transactions are guaranteed to start with higher timestamps, regardless of the gateway they use. This ensures that all causally dependent transactions commit with higher timestamps, even if their read and writes sets do not conflict with the original transaction's. This obviates the need for uncertainty intervals and prevents the "causal reverse" anamoly which can be observed by a third, concurrent transaction. For more, see https://www.cockroachlabs.com/blog/consistency-model/ and docs/RFCS/20200811_non_blocking_txns.md. Release notes: None Release justification: New functionality.
Fixes cockroachdb#57687. Related to cockroachdb#52745. This commit introduces a "commit-wait" sleep stage after a transaction commits, which is entered if doing so is deemed necessary for consistency. By default, commit-wait is only necessary for transactions that commit with a future-time timestamp that leads the local HLC clock. This is because CockroachDB's consistency model depends on all transactions waiting until their commit timestamp is below their gateway clock. In doing so, transactions ensure that at the time that they complete, all other clocks in the system (i.e. on all possible gateways) will be no more than the max_offset below the transaction's commit timestamp. This property ensures that all causally dependent transactions will have an uncertainty interval (see GlobalUncertaintyLimit) that exceeds the original transaction's commit timestamp, preventing stale reads. Without the wait, it would be possible for a read-write transaction to write a future-time value and then for a causally dependent transaction to read below that future-time value, violating "read your writes". The property must also hold for read-only transactions, which may have a commit timestamp in the future due to an uncertainty restart after observing a future-time value in their uncertainty interval. In such cases, the property that the transaction must wait for the local HLC clock to exceed its commit timestamp is not necessary to prevent stale reads, but it is necessary to ensure monotonic reads. Without the wait, it would be possible for a read-only transaction coordinated on a gateway with a fast clock to return a future-time value and then for a causally dependent read-only transaction coordinated on a gateway with a slow clock to read below that future-time value, violating "monotonic reads". In practice, most transactions do not need to wait at all, because their commit timestamps were pulled from an HLC clock (i.e. are not synthetic) and so they will be guaranteed to lead the local HLC's clock, assuming proper HLC time propagation. Only transactions whose commit timestamps were pushed into the future will need to wait, like those who wrote to a global_read range and got bumped by the closed timestamp or those who conflicted (write-read or write-write) with an existing future-time value. However, CockroachDB also supports a stricter model of consistency through its "linearizable" flag. When in linearizable mode (also known as "strict serializable" mode), all writing transactions (but not read-only transactions) must wait an additional max_offset after committing to ensure that their commit timestamp is below the current HLC clock time of any other node in the system. In doing so, all causally dependent transactions are guaranteed to start with higher timestamps, regardless of the gateway they use. This ensures that all causally dependent transactions commit with higher timestamps, even if their read and writes sets do not conflict with the original transaction's. This obviates the need for uncertainty intervals and prevents the "causal reverse" anamoly which can be observed by a third, concurrent transaction. For more, see https://www.cockroachlabs.com/blog/consistency-model/ and docs/RFCS/20200811_non_blocking_txns.md. Release notes: None Release justification: New functionality.
61110: kv: implement commit-wait for future time transaction commits r=nvanbenschoten a=nvanbenschoten Fixes #57687. Related to #52745. This PR introduces a "commit-wait" sleep stage after a transaction commits, which is entered if doing so is deemed necessary for consistency. By default, commit-wait is only necessary for transactions that commit with a future-time timestamp that leads the local HLC clock. This is because CockroachDB's consistency model depends on all transactions waiting until their commit timestamp is below their gateway clock. In doing so, transactions ensure that at the time that they complete, all other clocks in the system (i.e. on all possible gateways) will be no more than the max_offset below the transaction's commit timestamp. This property ensures that all causally dependent transactions will have an uncertainty interval (see GlobalUncertaintyLimit) that exceeds the original transaction's commit timestamp, preventing stale reads. Without the wait, it would be possible for a read-write transaction to write a future-time value and then for a causally dependent transaction to read below that future-time value, violating "read your writes". The property must also hold for read-only transactions, which may have a commit timestamp in the future due to an uncertainty restart after observing a future-time value in their uncertainty interval. In such cases, the property that the transaction must wait for the local HLC clock to exceed its commit timestamp is not necessary to prevent stale reads, but it is necessary to ensure monotonic reads. Without the wait, it would be possible for a read-only transaction coordinated on a gateway with a fast clock to return a future-time value and then for a causally dependent read-only transaction coordinated on a gateway with a slow clock to read below that future-time value, violating "monotonic reads". In practice, most transactions do not need to wait at all, because their commit timestamps were pulled from an HLC clock (i.e. are not synthetic) and so they will be guaranteed to lead the local HLC's clock, assuming proper HLC time propagation. Only transactions whose commit timestamps were pushed into the future will need to wait, like those who wrote to a global_read range and got bumped by the closed timestamp or those who conflicted (write-read or write-write) with an existing future-time value. However, CockroachDB also supports a stricter model of consistency through its "linearizable" flag. When in linearizable mode (also known as "strict serializable" mode), all writing transactions (but not read-only transactions) must wait an additional max_offset after committing to ensure that their commit timestamp is below the current HLC clock time of any other node in the system. In doing so, all causally dependent transactions are guaranteed to start with higher timestamps, regardless of the gateway they use. This ensures that all causally dependent transactions commit with higher timestamps, even if their read and writes sets do not conflict with the original transaction's. This obviates the need for uncertainty intervals and prevents the "causal reverse" anamoly which can be observed by a third, concurrent transaction. For more, see https://www.cockroachlabs.com/blog/consistency-model/ and [docs/RFCS/20200811_non_blocking_txns.md](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20200811_non_blocking_txns.md). ---- The PR also fixes a bug by properly marking read-only txn as aborted on rollback, which was missed in a85115a. We were assuming that all calls to `commitReadOnlyTxnLocked` were for `EndTxn` requests with the Commit flag set to true, but this is not the case. This was not only confusing, but it was also leading to the `txn.commit` metric being incremented on rollback of a read-only transaction, instead of the `txn.aborts` metric. Release justification: New functionality. 61170: kvserver: remove `kv.atomic_replication_changes.enabled` setting r=aayushshah15 a=aayushshah15 This setting was added in 19.2 to provide a fallback against atomic replication changes. They've now been a part of CockroachDB for over 3 releases. They're also a requirement for non-voting replicas. Release note (backward-incompatible change): Removed the `kv.atomic_replication_changes.enabled` cluster setting. All replication changes on a range now use joint-consensus. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Informs cockroachdb#59680. Informs cockroachdb#52745. This commit updates `closedts.TargetForPolicy` to calculate a target closed timestamp that leads present time for ranges with the LEAD_FOR_GLOBAL_READS closed timestamp policy. This is needed for non-blocking transactions, which require ranges to closed time in the future. TargetForPolicy's LEAD_FOR_GLOBAL_READS calculation is more complex than its LAG_BY_CLUSTER_SETTING calculation. Instead of the policy defining an offset from the publisher's perspective, the policy defines a goal from the consumer's perspective - the goal being that present time reads (with a possible uncertainty interval) can be served from all followers. To accomplish this, we must work backwards to establish a lead time to publish closed timestamps at. The calculation looks something like the following: ``` // this should be sufficient for any present-time transaction, // because its global uncertainty limit should be <= this time. // For more, see (*Transaction).RequiredFrontier. closed_ts_at_follower = now + max_offset // the sender must account for the time it takes to propagate a // closed timestamp update to its followers. closed_ts_at_sender = closed_ts_at_follower + propagation_time // closed timestamps propagate in two ways. Both need to make it to // followers in time. propagation_time = max(raft_propagation_time, side_propagation_time) // raft propagation takes 3 network hops to go from a leader proposing // a write (with a closed timestamp update) to the write being applied. // 1. leader sends MsgProp with entry // 2. followers send MsgPropResp with vote // 3. leader sends MsgProp with higher commit index // // we also add on a small bit of overhead for request evaluation, log // sync, and state machine apply latency. raft_propagation_time = max_network_rtt * 1.5 + raft_overhead // side-transport propagation takes 1 network hop, as there is no voting. // However, it is delayed by the full side_transport_close_interval in // the worst-case. side_propagation_time = max_network_rtt * 0.5 + side_transport_close_interval // put together, we get the following result closed_ts_at_sender = now + max_offset + max( max_network_rtt * 1.5 + raft_overhead, max_network_rtt * 0.5 + side_transport_close_interval, ) ``` While writing this, I explored what it would take to use dynamic network latency measurements in this calculation to complete cockroachdb#59680. The code for that wasn't too bad, but brought up a number of questions, including how far into the tail we care about and whether we place upper and lower bounds on this value. To avoid needing to immediately answer these questions, the commit hardcodes a maximum network RTT of 150ms, which should be an overestimate for almost any cluster we expect to run on. The commit also adds a new `kv.closed_timestamp.lead_for_global_reads_override` cluster setting, which, if nonzero, overrides the lead time that global_read ranges use to publish closed timestamps. The cluster setting is hidden, but should provide an escape hatch for cases where we get the calculation (especially when it becomes dynamic) wrong. Release justification: needed for new functionality
Informs cockroachdb#59680. Informs cockroachdb#52745. This commit updates `closedts.TargetForPolicy` to calculate a target closed timestamp that leads present time for ranges with the LEAD_FOR_GLOBAL_READS closed timestamp policy. This is needed for non-blocking transactions, which require ranges to closed time in the future. TargetForPolicy's LEAD_FOR_GLOBAL_READS calculation is more complex than its LAG_BY_CLUSTER_SETTING calculation. Instead of the policy defining an offset from the publisher's perspective, the policy defines a goal from the consumer's perspective - the goal being that present time reads (with a possible uncertainty interval) can be served from all followers. To accomplish this, we must work backwards to establish a lead time to publish closed timestamps at. The calculation looks something like the following: ``` // this should be sufficient for any present-time transaction, // because its global uncertainty limit should be <= this time. // For more, see (*Transaction).RequiredFrontier. closed_ts_at_follower = now + max_offset // the sender must account for the time it takes to propagate a // closed timestamp update to its followers. closed_ts_at_sender = closed_ts_at_follower + propagation_time // closed timestamps propagate in two ways. Both need to make it to // followers in time. propagation_time = max(raft_propagation_time, side_propagation_time) // raft propagation takes 3 network hops to go from a leader proposing // a write (with a closed timestamp update) to the write being applied. // 1. leader sends MsgProp with entry // 2. followers send MsgPropResp with vote // 3. leader sends MsgProp with higher commit index // // we also add on a small bit of overhead for request evaluation, log // sync, and state machine apply latency. raft_propagation_time = max_network_rtt * 1.5 + raft_overhead // side-transport propagation takes 1 network hop, as there is no voting. // However, it is delayed by the full side_transport_close_interval in // the worst-case. side_propagation_time = max_network_rtt * 0.5 + side_transport_close_interval // put together, we get the following result closed_ts_at_sender = now + max_offset + max( max_network_rtt * 1.5 + raft_overhead, max_network_rtt * 0.5 + side_transport_close_interval, ) ``` While writing this, I explored what it would take to use dynamic network latency measurements in this calculation to complete cockroachdb#59680. The code for that wasn't too bad, but brought up a number of questions, including how far into the tail we care about and whether we place upper and lower bounds on this value. To avoid needing to immediately answer these questions, the commit hardcodes a maximum network RTT of 150ms, which should be an overestimate for almost any cluster we expect to run on. The commit also adds a new `kv.closed_timestamp.lead_for_global_reads_override` cluster setting, which, if nonzero, overrides the lead time that global_read ranges use to publish closed timestamps. The cluster setting is hidden, but should provide an escape hatch for cases where we get the calculation (especially when it becomes dynamic) wrong. Release justification: needed for new functionality
61207: cli: don't refresh prompt on line continuations r=jordanlewis a=jordanlewis Previously, the CLI would recalculate its prompt (including any required network roundtrips to determine transaction status and current database) on every newline, even if the newline came in the middle of a SQL statement. This was wasteful, of course, but it's also very confusing for users when entering a large multi-line statement via paste: the prompt refreshes defer until enter is pressed for the first time, and there is a lot of unexplained latency during the statement that doesn't reflect the actual SQL being executed. It also doesn't match the reported execution time, leading to mass confusion and sadness. Now, we don't bother refreshing the prompt during line continuations. Closes #61095 Release note (cli change): optimize handling of multi-line SQL strings to avoid unwanted extra server roundtrips. Release justification: bug fixes and low-risk updates to new functionality 61305: kvserver: add closedts side-transport consumer r=andreimatei a=andreimatei Add the consumer of closed timestamps communicated by the side transport (i.e. the gRPC server for our new push-based streaming protocol). This side-transport consumer accumulates closed timestamps communicated to it by other nodes (the leaseholders of the respective ranges). Its state is queried whenever a range needs a higher closed timestamp than what it has locally in the Replica state, at which point the Replica's state is lazily updated. Release note: None Release justification: Needed for GLOBAL tables. 61386: kv: configure leading closed timestamp target for global_read ranges r=nvanbenschoten a=nvanbenschoten Informs #59680. Informs #52745. This commit updates `closedts.TargetForPolicy` to calculate a target closed timestamp that leads present time for ranges with the LEAD_FOR_GLOBAL_READS closed timestamp policy. This is needed for non-blocking transactions, which require ranges to closed time in the future. TargetForPolicy's LEAD_FOR_GLOBAL_READS calculation is more complex than its LAG_BY_CLUSTER_SETTING calculation. Instead of the policy defining an offset from the publisher's perspective, the policy defines a goal from the consumer's perspective - the goal being that present time reads (with a possible uncertainty interval) can be served from all followers. To accomplish this, we must work backwards to establish a lead time to publish closed timestamps at. The calculation looks something like the following: ``` // this should be sufficient for any present-time transaction, // because its global uncertainty limit should be <= this time. // For more, see (*Transaction).RequiredFrontier. closed_ts_at_follower = now + max_offset // the sender must account for the time it takes to propagate a // closed timestamp update to its followers. closed_ts_at_sender = closed_ts_at_follower + propagation_time // closed timestamps propagate in two ways. Both need to make it to // followers in time. propagation_time = max(raft_propagation_time, side_propagation_time) // raft propagation takes 3 network hops to go from a leader proposing // a write (with a closed timestamp update) to the write being applied. // 1. leader sends MsgProp with entry // 2. followers send MsgPropResp with vote // 3. leader sends MsgProp with higher commit index // // we also add on a small bit of overhead for request evaluation, log // sync, and state machine apply latency. raft_propagation_time = max_network_rtt * 1.5 + raft_overhead // side-transport propagation takes 1 network hop, as there is no voting. // However, it is delayed by the full side_transport_close_interval in // the worst-case. side_propagation_time = max_network_rtt * 0.5 + side_transport_close_interval // put together, we get the following result closed_ts_at_sender = now + max_offset + max( max_network_rtt * 1.5 + raft_overhead, max_network_rtt * 0.5 + side_transport_close_interval, ) ``` While writing this, I explored what it would take to use dynamic network latency measurements in this calculation to complete #59680. The code for that wasn't too bad, but brought up a number of questions, including how far into the tail we care about and whether we place upper and lower bounds on this value. To avoid needing to immediately answer these questions, the commit hardcodes a maximum network RTT of 150ms, which should be an overestimate for almost any cluster we expect to run on. The commit also adds a new `kv.closed_timestamp.lead_for_global_reads_override` cluster setting, which, if nonzero, overrides the lead time that global_read ranges use to publish closed timestamps. The cluster setting is hidden, but should provide an escape hatch for cases where we get the calculation (especially when it becomes dynamic) wrong. Release justification: needed for new functionality 61499: sql: Require FORCE to modify protected zone config fields r=otan a=ajstorm With the introduction of the multi-region simplification in 21.1, there are a set of fields in the zone configurations which are protected by the system. These fields are transparently modified by the system when certain multi-region abstractions are used (e.g. when a region is added to the database, the constraints field is modified to add the new region). Due the protected status of these field, we want to prevent users from setting them if they're not aware of the impact in doing so (that they could be overwritten by the system, and that they could result in sub-optimal performance). To make this more explicit to users, we block modifications to these fields and if the users wish to modify them anyway, they must provide a FORCE argument along with the modification statement. This impacts both the setting of the field in the zone configuration, as well as any eventual multi-region statement which follows and will result in over-writing the user's zone configuration update. Release justification: Prevent bad user experience with new feature. Release note (sql change): Updates to certain fields in the zone configurations are blocked for multi-region enabled databases. This block can be overridden through the use of the FORCE keyword on the blocked statement. Resolves #60447 61556: sql: read db descriptor from store when constructing mr zone configs r=arulajmani a=arulajmani Previously, when constructing multi-region zone configs for a database in the type schema changer, we were doing so using a leased version of the database descriptor. This meant it could be the case that we were constructing a stale zone configuration, which manifested itself in some CI failures. This patch fixes that issue by always reading the database descriptor from the store when constructing zone configurations in the type schema changer. Previously, this was failing consistently under stress in ~30s. I have this thing running perfectly for the last 10 minutes. Fixes: #61320 Release justification: bug fix to new functionality Release note: None Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com> Co-authored-by: Andrei Matei <andrei@cockroachlabs.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Adam Storm <storm@cockroachlabs.com> Co-authored-by: arulajmani <arulajmani@gmail.com>
Link to text of RFC
Non-blocking transactions are a variant of CockroachDB's standard read-write transaction protocol that permit low-latency, global reads of read-mostly and read-only (excluding maintenance events) data. The transaction protocol and the replication schema that it is paired with differ from standard read-write transactions in two important ways:
The ability to serve reads from follower and/or learner replicas is beneficial both because it can reduce read latency in geo-distributed deployments and because it can serve as a form of load-balancing for concentrated read traffic in order to reduce tail latencies. The ability to serve consistent (non-stale) reads from any replica in a Range makes the functionality accessible to a larger class of read-only transactions and accessible for the first time to read-write transactions.
The ability to perform writes on read-heavy data without causing conflicting reads to block is beneficial for providing predictable read latency. Such predictability is doubly important in global deployments, where the cost of read/write contention can delay reads for 100's of ms as they are forced to navigate wide-area network latencies in order to resolve conflicts.
These properties combine to prioritize read latency over write latency for some configurable subset of data, recognizing that there exists a sizable class of data which is heavily skewed towards read traffic.
Non-blocking transactions are provided through extensions to existing concepts in the CockroachDB architecture (i.e. uncertainty intervals, read refreshes, closed timestamps, learner replicas) and compose with CockroachDB's standard transaction protocol intuitively and effectively.
This proposal serves as an alternative to the Consistent Read Replicas proposal. Whereas the Consistent Read Replicas proposal enforces consistency through communication, this proposal enforces consistency through semi-synchronized clocks with bounded uncertainty.