
rfc: Non-Blocking Transactions #52745

Merged

Conversation

nvanbenschoten
Member

@nvanbenschoten nvanbenschoten commented Aug 13, 2020

Link to text of RFC


Non-blocking transactions are a variant of CockroachDB's standard read-write transaction protocol that permits low-latency, global reads of read-mostly and read-only (excluding maintenance events) data. The transaction protocol and the replication scheme that it is paired with differ from standard read-write transactions in two important ways:

  • non-blocking transactions support a replication scheme over Ranges that they operate on which allows all followers in these Ranges to serve consistent (non-stale) follower reads.
  • non-blocking transactions are minimally disruptive to reads over the data that they modify, even in the presence of read/write contention.

The ability to serve reads from follower and/or learner replicas is beneficial both because it can reduce read latency in geo-distributed deployments and because it can serve as a form of load-balancing for concentrated read traffic in order to reduce tail latencies. The ability to serve consistent (non-stale) reads from any replica in a Range makes the functionality accessible to a larger class of read-only transactions and accessible for the first time to read-write transactions.

The ability to perform writes on read-heavy data without causing conflicting reads to block is beneficial for providing predictable read latency. Such predictability is doubly important in global deployments, where the cost of read/write contention can delay reads for 100's of ms as they are forced to navigate wide-area network latencies in order to resolve conflicts.

These properties combine to prioritize read latency over write latency for some configurable subset of data, recognizing that there exists a sizable class of data which is heavily skewed towards read traffic.

Non-blocking transactions are provided through extensions to existing concepts in the CockroachDB architecture (i.e. uncertainty intervals, read refreshes, closed timestamps, learner replicas) and compose with CockroachDB's standard transaction protocol intuitively and effectively.

This proposal serves as an alternative to the Consistent Read Replicas proposal. Whereas the Consistent Read Replicas proposal enforces consistency through communication, this proposal enforces consistency through semi-synchronized clocks with bounded uncertainty.

@nvanbenschoten nvanbenschoten requested a review from a team as a code owner August 13, 2020 00:08

@knz
Contributor

knz commented Aug 13, 2020

One thing I am missing from the guide-level explanation is what this looks like from the perspective of a SQL client: how does one engage a non-blocking transaction?

@nvanbenschoten
Member Author

@knz the guide-level explanation includes this paragraph:

For now, it is sufficient to say that "non-blocking transactions" are completely hidden from users. Instead, users interact with "non-blocking Ranges" by choosing which key ranges should consist of non-blocking Ranges. These Ranges are configured using zone configurations, though we'll likely end up adding a nicer abstraction on top of this (e.g. CREATE REFERENCE TABLE ...).

Do you mind expanding on what you think is missing from that?

One thing that certainly is missing in the RFC right now is a full derivation of how long we expect the Commit Wait stage of one of these non-blocking transactions to be in a global cluster. This is touched upon in "Tuning non_blocking_duration", but it would be helpful to give some first-order approximation of what we expect this delay to be in practice.

@awoods187
Contributor

I gave this a quick skim this evening and it seems pretty exciting. I think we should design with the SQL in mind from day one, as @knz was inquiring about, and I like the CREATE REFERENCE TABLE ... syntax. I assume that we will also be able to do something like ALTER TABLE foo MAKE REFERENCE TABLE to allow users to upgrade from single-region to multi-region (or vice versa). This would be an expensive operation but one that would likely be necessary.


As defined above, "non-blocking transactions" perform locking in such a way that
contending reads do not need to wait on their locks. In practice, this means
that non-blocking reads **perform their writes at a timestamp in advance of

s/reads/writes/ or /transactions

Commit Wait rule then it would have been delayed before committing and the stale
read would not be possible.

But relying on Commit Wait here violates a desirable property of our transaction

I hope I'm misunderstanding something here, but isn't this basically saying that our read latencies will spike to ~30ms (uncertainty interval) for each non-blocking write when it becomes visible?

For read-only txns (or non-blocking read-write txns) I hoped we would not need to apply uncertainty to writes that it knows came from a non-blocking txn, but it doesn't seem to quite work. If the non-blocking writer committed the intents only after making sure that the commit timestamp is no longer synthetic (i.e. its clock is past that timestamp), then we're resolving intents too late and readers contend. If we resolve them early and then wait, the readers have to deal with uncertainty.

See if you can polish this section a bit - I had some trouble following "second read", "first read-only transaction", maybe an explicit itemized example with timestamps would be helpful.


Hmm, I think/hope I'm just confused here. If it's t=100 now and a transaction future-writes-and-commits at t=1000, only readers with t \in [970,999) have to worry about uncertainty. But if reader's timestamp was < 1000 when it was chosen (non-synthetically), it can't causally depend on the previous writer if that writer would have waited for its local clock to hit 1030 (future write ts + uncertainty) to ack the commit, which means our dear reader couldn't have started after the writer returned and gotten a timestamp less than 1000.

tracker that is leading "present time" by some duration. As it turns out,
configuring this closed timestamp tracker to lead "present time" by
`non_blocking_duration` is enough to implement "non-blocking transactions"
without any other server-side changes.

There are still changes here, no? We don't do per-range closed timestamps at the moment, so we need the infra for that. But it'd be nice if those were the only changes. I think the quiescence stuff will need some work as it's tied to the store-wide closed timestamp.

Any standard transaction that writes to a "non-blocking Range" will naturally
get pushed into the future. If it had performed writes in the past, these will
eventually get moved up to the new commit timestamp. If it had performed reads
in the past, these will eventually get refreshed up to the new commit timestamp

so they'll get refreshed to a synthetic timestamp? And updating a synthetic timestamp with a non-synthetic one with a >= walltime will result in a non-synthetic one, right?


@tbg tbg left a comment


Thanks for writing this up! When the parallels to Global Optimistic Transactions came up in our last conversation I was intrigued and I'm glad this wasn't just a temporary confusion. I agree that this seems much more reasonable in scope than the other proposals we have discussed (consistent read replicas + global optimistic reads) and at the same time it has what I would call intuitive interactions with our existing transactions.
Are you at all worried about the need for readers to continue to wait out the uncertainty? If you want ~3ms local reads but on every write readers are forced into ~30ms (uncertainty) stalls, this is still a ~10x latency spike. Sure better than a 100x one, but still - beyond non-blocking txns, do you foresee us lowering the consistency levels further still?

@nvanbenschoten
Member Author

Thanks for the reads!

I assume that we will also be able to do something like ALTER TABLE foo MAKE REFERENCE TABLE or something like that to allow users to upgrade from a single-region to multi-region (or vice versa). This would be an expensive operation but one that would likely be necessary.

Yes, absolutely. It shouldn't even be that expensive.

It's also worth reiterating that both this proposal and the consistent read replicas proposal are applicable for single-region use cases, as they both can provide a form of load balancing that may be important for read-heavy data.

at the same time it has what I would call intuitive interactions with our existing transactions

I agree! It was eye-opening to me to dig into how these interact with our standard transaction model.

Are you at all worried about the need for readers to continue to wait out the uncertainty? If you want ~3ms local reads but on every write readers are forced into ~30ms (uncertainty) stalls, this is still a ~10x latency spike.

Good question. I'll say yes and no. We should be very conscious that committing to this proposal means making a long-term commitment to investing in better clock synchronization. Up to this point, clock synchronization has mostly been out of the critical path of performance, outside of impacting txn retry susceptibility. This proposal brings it directly into the equation.

I am certainly concerned that reads will still need to wait out a maximum of the clock uncertainty (strawman is 30ms) under contention. Ideally, read/write contention would have no effect on read latency, but without dropping the consistency level, that doesn't seem possible. So obviously, there still is some blocking here in certain cases. I used the term "non-blocking" here to convey the fact that with this approach, reads will block on the order of clock uncertainty instead of on the order of communication latency. This seems like a highly desirable property for a globally distributed database like Cockroach, because clock uncertainty seems like an easier opponent to fight than the speed of light.

So as we think about the future of CockroachDB and Cockroach Cloud, this is where I'd rather place a bet. We know clouds are starting to offer better time services (ref: Amazon Time Sync Service, Cloud TrueTime, etc.) and we're also aware of innovation in this area (ref: Tick Tock Networks, etc.).

Sure better than a 100x one, but still - beyond non-blocking txns, do you foresee us lowering the consistency levels further still?

I'll first point out, because I made the mistake a few times while thinking through this myself, that consistency levels != isolation levels here. Even the lowest isolation levels don't allow stale reads, to the extent that they make any claim about real-time ordering. And so if we want to maintain strong consistency, even at lower isolation levels, there will be times that we need to block to establish ordering between causally related events. The question is whether we block on clock synchronization or communication.

So then back to the question of consistency levels: I'm not sure. I think there is a good reason to expose different consistency levels, and yet, it gets very confusing to work with. I can't get through https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels without confusing myself, and I'm not trying to build an application on top of these guarantees. It also gets challenging to answer that question when we start talking about lowering the consistency level of operations within CRDB itself that may be trying to maintain referential integrity. For instance, how do we let users configure foreign key checks to use reduced consistency levels, and what does that do to our optimizer's ability to trust this referential integrity? Making foreign key checks work was actually an important litmus test I used when considering different approaches.

So to the degree that we can, I'd prefer to expose understandable consistency levels to users (strong consistency + exact/bounded staleness) but optimize for and push the boundaries of strong consistency. I think that's what makes Cockroach special.

Contributor

@bdarnell bdarnell left a comment



docs/RFCS/20200811_non_blocking_txns.md, line 630 at r1 (raw file):

Complexity. But not overly so, especially compared to Consistent Read Replicas.

Only effective if we can reduce the clock uncertainty interval dramatically.

Why? I see that transactions that write to non-blocking ranges need to wait for a duration that is a function of the uncertainty interval. But as long as reads are unaffected by uncertainty (which I think is the case but I'm not certain I've understood this correctly), I think there'd still be demand for this. There are a lot of reference table use cases that could tolerate write times measured in seconds as long as reads were fast.

Reducing the uncertainty interval and speeding up non-blocking txn writes would of course be great, but it gets more difficult/expensive as you approach zero. We could easily knock a few hundred milliseconds off our uncertainty bound (this is safe today in many environments). Getting down to 30ms is probably doable, but if it only affects writes I'm not sure it's necessary, and if it affects reads I'm not sure it's enough.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/nonBlockingTxnsRfc branch from 6e142e9 to 000779b Compare August 14, 2020 21:58
Member Author

@nvanbenschoten nvanbenschoten left a comment



docs/RFCS/20200811_non_blocking_txns.md, line 352 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

s/reads/writes/ or /transactions

Done.


docs/RFCS/20200811_non_blocking_txns.md, line 501 at r1 (raw file):

I hope I'm misunderstanding something here, but isn't this basically saying that our read latencies will spike to ~30ms (uncertainty interval) for each non-blocking write when it becomes visible?

Yes, it is. This wasn't clear to either you or Ben, so the RFC is definitely not making it clear enough. I added two transaction latency comparison tables to the guide-level explanation to make this clearer.

See if you can polish this section a bit - I had some trouble following "second read", "first read-only transaction", maybe an explicit itemized example with timestamps would be helpful.

Added an example timeline here.

Hmm, I think/hope I'm just confused here. If it's t=100 now and a transaction future-writes-and-commits at t=1000, only readers with t \in [970,999) have to worry about uncertainty. But if reader's timestamp was < 1000 when it was chosen (non-synthetically), it can't causally depend on the previous writer if that writer would have waited for its local clock to hit 1030 (future write ts + uncertainty) to ack the commit, which means our dear reader couldn't have started after the writer returned and gotten a timestamp less than 1000.

The writer will only wait until its local clock hits 1000 before acking the commit, so a causally dependent reader can have a clock as low as 970.
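To make the arithmetic concrete, here is a toy illustration in Go of the timeline above (it uses the values from the example; illustrative only, not code from this PR):

```go
package main

import "fmt"

func main() {
	const (
		commitTS  = 1000 // synthetic, future-time commit timestamp
		maxOffset = 30   // assumed maximum clock offset between nodes
	)
	// The writer acks the commit once its local clock reaches commitTS,
	// not commitTS + maxOffset.
	writerAckClock := commitTS
	// A causally dependent reader's clock may lag the writer's by up to
	// maxOffset, so its read timestamp can be as low as:
	minReaderTS := writerAckClock - maxOffset
	// That reader's uncertainty interval [ts, ts+maxOffset] still reaches
	// commitTS, so the value at 1000 is uncertain for readers in [970, 1000).
	fmt.Println(minReaderTS, minReaderTS+maxOffset) // prints: 970 1000
}
```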


docs/RFCS/20200811_non_blocking_txns.md, line 556 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

There are still changes here, no? We don't do per-range closed timestamps at the moment, so we need the infra for that. But it'd be nice if those were the only changes. I think the quiescence stuff will need some work as it's tied to the store-wide closed timestamp.

Yep, definitely some engineering work here. On one end of the spectrum, we could give every range its own closed timestamp subsystem (clearly unrealistic). On the other, we could have one for normal ranges and one for non-blocking Ranges. Or maybe something in the middle. @ajwerner has been talking about configuring closed timestamps differently for different ranges/groups of ranges for some time.

We can flesh more of this out if we decide to push the rest of this proposal forward. For now, it seems like a known quantity.


docs/RFCS/20200811_non_blocking_txns.md, line 561 at r1 (raw file):

so they'll get refreshed to a synthetic timestamp?

Yes, added.

And updating a synthetic timestamp with a non-synthetic one with a >= walltime will result in a non-synthetic one, right?

Good point, added.


docs/RFCS/20200811_non_blocking_txns.md, line 630 at r1 (raw file):

which I think is the case but I'm not certain I've understood this correctly

No, this, unfortunately, isn't the case. This tripped up both you and Tobi, so it definitely wasn't clear enough. I added two transaction latency comparison tables to the guide-level explanation to make this clearer, and an example in the "Uncertainty Intervals: To Wait or Not to Wait" section to demonstrate why readers have to wait in the cases where they see a value in their uncertainty interval.

As such, the uncertainty bounds necessary for this proposal to make sense are
much lower than what we currently use, but are also well within the realm of
possibility for software-only solutions. Atomic clocks and TrueTime are not
prerequisites!

Maybe point out that Amazon has its Time Sync Service, which should enable us to take advantage of hardware in CC:

Today we’re launching Amazon Time Sync Service, a time synchronization service
delivered over Network Time Protocol (NTP) which uses a fleet of redundant satellite-
connected and atomic clocks in each region to deliver a highly accurate reference clock.

Could also link to the Huygens paper: https://www.usenix.org/system/files/conference/nsdi18/nsdi18-geng.pdf

badly pessimizes write latency. On the surface, this seems to be the expected
tradeoff. The problem is that when reads and writes contend, reads can get
blocked on writes and have to wait for writing transactions to complete. If the
writing transaction is implicit, this results in at least 1 WAN RTT of blocking.

"implicit" meaning no BEGIN TRANSACTION?

the standard read-write transaction that performs locking in a manner such that
contending reads by other transactions can avoid waiting on its locks.

The RFC then introduces the term "non-blocking Range", which is a Range in which

Should the term be a non-blocking keyrange instead? When I hear "range", I immediately wonder if splits/merges can change the boundaries of a non-blocking Range. But I think what this proposal means by a "non-blocking Range" is just a (start,end) keyrange that is independent of splits/merges. Even if "Range" is the accepted KV terminology for this concept, at the very least add some clarifying language here for those not well-versed in KV-speak.

> In addition to tables that are immutable, it is common for schemas to have some
> tables that are mutable but only rarely mutated. An example of this is a user
> profile table in an application for a global user base. This table may be read
> frequently across the world and it expected to be updated infrequently.

"and yet it is expected"


Merging a "synthetic" and "real" timestamp, typically done using a `Forward`
operation obeys the following rule: the `synthetic` bit from the larger of the
two timestamps is carrier over to the result. In the case of a tie between a

carrier => carried
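For concreteness, here is a minimal sketch of the rule as quoted (hypothetical `Timestamp` type; the tie-breaking case is cut off in the excerpt above, so its handling below is an assumption):

```go
package sketch

// Timestamp is a hypothetical HLC timestamp with a synthetic bit.
type Timestamp struct {
	WallTime  int64
	Logical   int32
	Synthetic bool
}

// Less orders timestamps by wall time, then logical counter.
func (t Timestamp) Less(o Timestamp) bool {
	return t.WallTime < o.WallTime ||
		(t.WallTime == o.WallTime && t.Logical < o.Logical)
}

// Forward returns the maximum of t and o, carrying the synthetic bit from
// the larger of the two. On a tie, the result is assumed to be synthetic
// only if both inputs are: a "real" timestamp at the same value proves
// that wall time has actually been reached somewhere.
func (t Timestamp) Forward(o Timestamp) Timestamp {
	if t.Less(o) {
		return o
	}
	if o.Less(t) {
		return t
	}
	t.Synthetic = t.Synthetic && o.Synthetic
	return t
}
```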

##### Timestamp.Forward

Merging a "synthetic" and "real" timestamp, typically done using a `Forward`
operation obeys the following rule: the `synthetic` bit from the larger of the

Can you explain your reasons as to why this is the right algorithm?

For "real" timestamps, it is cheap to implement this operation by ratcheting a
few integers within the HLC because we know that the timestamp must have come
from another node, which itself must have been ahead of the slowest node by less
that `mox_offset`, so the ratcheting operation will not push our HLC further

mox_offset => max_offset

between the idea of writing in the future and closing time in the future. It is
interesting to point out that in systems like Spanner and YB which assign
monotonically increasing timestamps to all writes within a replication group and
therefore "instantaneously" close out time, these two operations are one in the

=> "one and the same"

@andy-kimball
Contributor

For cases where RTT << uncertainty interval (e.g. because environment requires 100's ms interval, or because RTT is local within region), do we really need to wait out that entire interval when we encounter a write? Couldn't the follower replica send a message to the leaseholder in order to ensure its current time is >= leaseholder current time (equivalent to reading from the leaseholder)? And in the worst case, the follower's transaction would block for <= uncertainty interval, no matter how many non-stale follower reads it did.

Reason I ask is b/c, if possible, I'd like this proposal to work at least as well as the Consistent Read Replicas proposal. In that proposal, contended reads can block on WAN RTTs, so reducing this proposal's worst case to MIN(RTT, uncertainty interval) should make this proposal competitive in all the important cases.


@andreimatei andreimatei left a comment


For cases where RTT << uncertainty interval (e.g. because environment requires 100's ms interval, or because RTT is local within region), do we really need to wait out that entire interval when we encounter a write? Couldn't the follower replica send a message to the leaseholder in order to ensure its current time is >= leaseholder current time (equivalent to reading from the leaseholder)?

So you're suggesting essentially to start redirecting all the reads to the leaseholder when they're uncertain? That's an interesting idea I think, although it kinda messes with the load-balancing hopes of these proposals.

One thing I'd like to elevate here is the manner in which the synthetic timestamps spread through the system - they're not just localized to the non-blocking ranges. This is illustrated by the non-blocking txn's read / standard write cell in the txn interaction table: reads from non-blocking txns can now disrupt "standard writes" on random ranges. Basically we don't just have "writes in the future" to worry about, but also "reads in the future". Perhaps this isn't the biggest concern in the world given how "non-blocking txns" are supposed to be pretty rare...



docs/RFCS/20200811_non_blocking_txns.md, line 158 at r2 (raw file):

If the writing transaction is doing other work, this blocking can be
unbounded.

To make it fair, I'd add an asterisk here explaining that this unbounded wait is a problem for all proposals, and it seems to be similarly addressable in all of them.


docs/RFCS/20200811_non_blocking_txns.md, line 706 at r2 (raw file):

| first op ↓ \ second op → | standard read | standard write | non-blocking txn's read | non-blocking txn's write |
| --- | --- | --- | --- | --- |
| standard read | both bump ts cache, no interaction | bump ts cache, may bump write ts, but still below present time | both bump ts cache, no interaction | bump ts cache, no interaction |
| standard write | read ignores write if below, read waits on write if above | 2nd write waits on 1st write, bumps write ts above 1st write | read waits on write | non-blocking write waits on standard write |
| non-blocking txn's read | both bump ts cache, no interaction | **bump ts cache, bump write ts, standard write becomes non-blocking write, must wait on commit** | both bump ts cache, no interaction | bump ts cache, may bump write ts |

This case is interesting, isn't it? I kinda woke up thinking about it too, as I was thinking about how the synthetic timestamps spread. Standard writers become infected by previous reads, and so now they have to do the commit-wait. And this commit wait is RTT+max_offset, not just one or the other (right?), so they can be quite disruptive to traffic. Non-blocking txns are all about how their writes are less disruptive, but now their reads can be quite disruptive... Not taking any sides, but I'd call this out in the comparison section. A saving grace is that, as the previous section says, a transaction only becomes a "non-blocking" txn when it does some writing, so read-only txns do not cause disruption.

Similarly for the non-blocking txn's write / standard write case below, although that one is less concerning to me.

@andy-kimball
Contributor

So you're suggesting essentially to start redirecting all the reads to the leaseholder when they're uncertain? That's an interesting idea I think, although it kinda messes with the load-balancing hopes of these proposals.

I don't think the load-balancing story will be materially impacted. Assuming these tables are read-mostly, with very infrequent writes, there should be minimal leaseholder traffic caused by these timestamp sync'ing calls (these calls are only made when uncertain reads happen, after all). And the messages that do get sent should be tiny; it's just a timestamp sync; no rows are actually read nor returned. I'd expect a tiny overhead from doing this.

@tbg tbg requested review from ajwerner and bdarnell August 18, 2020 12:04
Member

@tbg tbg left a comment



docs/RFCS/20200811_non_blocking_txns.md, line 501 at r1 (raw file):

The writer will only wait until its local clock hits 1000 before acking the commit, so a causally dependent reader can have a clock as low as 970.

Right, but what if we waited out the clock uncertainty at the writer as well?

@tbg tbg self-requested a review August 18, 2020 12:04
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/nonBlockingTxnsRfc branch from 000779b to 3685609 Compare August 18, 2020 14:49
Member Author

@nvanbenschoten nvanbenschoten left a comment


For cases where RTT << uncertainty interval (e.g. because environment requires 100's ms interval, or because RTT is local within region), do we really need to wait out that entire interval when we encounter a write? Couldn't the follower replica send a message to the leaseholder in order to ensure its current time is >= leaseholder current time (equivalent to reading from the leaseholder)? And in the worst case, the follower's transaction would block for <= uncertainty interval, no matter how many non-stale follower reads it did.

This is an interesting idea. At first, I thought it would work, but now, I'm not so sure. On its own, it works because the leaseholder's clock is monotonic and so we preserve linearizability. However, it doesn't appear to compose with the current proposal because the leaseholder's clock is not guaranteed to lead all followers' clocks. Here's an example of how this could go wrong:

```
clocks:
  leaseholder: 50
  follower A (close, cheaper to communicate): 50
  follower B (far away, cheaper to wait): 75
  max_clock_offset: 50

value @ 100

1. read on follower B
-- observes value in uncertainty interval
-- waits 25s
-- returns value and commits

clocks:
  leaseholder: 75
  follower A (close): 75
  follower B (far away): 100
  max_clock_offset: 50

2. dependent read on follower A
-- observes value in uncertainty interval
-- forwards to leaseholder
-- value not in uncertainty interval
-- ignores value
-- linearizability violated!
```

So even when we consult with the leaseholder, we still need to respect uncertainty intervals. For this kind of thing to work, I think we'd need to switch to waiting out uncertainty on commit like Spanner does. So instead of saying that after a commit, all clocks will be within max_clock_offset away from any values written, we'd need to be able to say that after a commit, all clocks will be equal to or above any values written.



docs/RFCS/20200811_non_blocking_txns.md, line 501 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

The writer will only wait until its local clock hits 1000 before acking the commit, so a causally dependent reader can have a clock as low as 970.

Right, but what if we waited out the clock uncertainty at the writer as well?

That still wouldn't be enough. Linearizability doesn't just talk about the relationship between writes and reads. It also talks about the relationship between multiple reads. If we were to wait out the clock uncertainty on the write then we would ensure that all causally dependent events observe the write. However, we wouldn't ensure that two concurrent reads respect their real-time ordering with each other. The new example here actually demonstrates exactly this.


docs/RFCS/20200811_non_blocking_txns.md, line 59 at r2 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Maybe point out that Amazon has its Time Sync Service, which should enable us to take advantage of hardware in CC:

Today we’re launching Amazon Time Sync Service, a time synchronization service
delivered over Network Time Protocol (NTP) which uses a fleet of redundant satellite-
connected and atomic clocks in each region to deliver a highly accurate reference clock.

Could also link to the Huygens paper: https://www.usenix.org/system/files/conference/nsdi18/nsdi18-geng.pdf

Done.


docs/RFCS/20200811_non_blocking_txns.md, line 155 at r2 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

"implicit" meaning no BEGIN TRANSACTION?

Yes, just a single-statement transaction.


docs/RFCS/20200811_non_blocking_txns.md, line 158 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…
If the writing transaction is doing other work, this blocking can be
unbounded.

To make it fair, I'd add an asterisk here explaining that this unbounded wait is a problem for all proposals, and it seems to be similarly addressable in all of them.

I just removed the sentence.


docs/RFCS/20200811_non_blocking_txns.md, line 301 at r2 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

"and yet it is expected"

Done.


docs/RFCS/20200811_non_blocking_txns.md, line 460 at r2 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

Can you explain your reasons as to why this is the right algorithm?

Done.


docs/RFCS/20200811_non_blocking_txns.md, line 461 at r2 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

carrier => carried

Done.


docs/RFCS/20200811_non_blocking_txns.md, line 491 at r2 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

mox_offset => max_offset

Done.


docs/RFCS/20200811_non_blocking_txns.md, line 651 at r2 (raw file):

Previously, andy-kimball (Andy Kimball) wrote…

=> "one and the same"

Done.


docs/RFCS/20200811_non_blocking_txns.md, line 706 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

This case is interesting, isn't it? I kinda woke up thinking about it too, as I was thinking about how the synthetic timestamps spread. Standard writers become infected by previous reads, and so now they have to do the commit-wait. And this commit wait is RTT+max_offset, not just one or the other (right?), so they can be quite disruptive to traffic. Non-blocking txns are all about how their writes are less disruptive, but now their reads can be quite disruptive... Not taking any sides, but I'd call this out in the comparison section. A saving grace is that, as the previous section says, a transaction only becomes a "non-blocking" txn when it does some writing, so read-only txns do not cause disruption.

Similarly for the non-blocking txn's write / standard write case below, although that one is less concerning to me.

Absolutely, everything you said is true. A non-blocking transaction is disruptive to conflicting writes both on non-blocking ranges and off non-blocking ranges. But it should never be disruptive to conflicting reads (by read-only or read-write txns).

I'll add to the discussion section.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/nonBlockingTxnsRfc branch from 3685609 to 0da9c08 Compare August 31, 2020 00:15
@nvanbenschoten nvanbenschoten changed the title [WIP] rfc: Non-Blocking Transactions rfc: Non-Blocking Transactions Aug 31, 2020
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/nonBlockingTxnsRfc branch from 0da9c08 to 0243f65 Compare August 31, 2020 00:19
@nvanbenschoten
Copy link
Member Author

@bdarnell @petermattis I've updated this RFC for clarity and to better contextualize it within the rest of the thinking around "Making Global Easy". I'm pretty happy about how the proposal is turning out. It's the first one that I've liked more after sitting on it for a week, rather than less. I'd love to get your thoughts on it when you get a chance.

I'll also draw attention to the Implementation Touch Points section, which highlights how little work this ends up being. This is a testament to how this approach extends existing concepts in the CockroachDB architecture (i.e. uncertainty intervals, read refreshes, closed timestamps, learner replicas) instead of introducing new concepts. Notably, there are no new RPCs and no new liveness mechanisms. The few extensions that the proposal does make to existing concepts are universally applicable and reduce to trivial no-ops for the standard transaction model. I heard there was concern about this proposal being complex, but I'd argue it's actually the least complex of the ideas floated so far for solving the problem of global, consistent reads.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jan 29, 2021
Informs cockroachdb#52745.

This commit introduces the concept of a RangeClosedTimestampPolicy,
which represents the policy used by the leaseholder of a range to
establish and publish closed timestamps. The policy dictates how far in
the past (lag) or in the future (lead) MVCC history is closed off.
Currently, there are two RangeClosedTimestampPolicy values:
- LAG_BY_CLUSTER_SETTING
- LEAD_FOR_GLOBAL_READS

After introducing these policies, the commit teaches the RangeCache
about this information. In addition to a range's descriptor and lease,
the cache will now maintain an up-to-date understanding of each range's
closed timestamp policy.

Finally, the commit adds the policy to ClientRangeInfo and RangeInfo, so
that the client <-> server RangeInfo protocol will ensure that the kv
client is informed of each Range's closed timestamp policy and kept up
to date when its cached information has gone stale.

Now that the kv client is aware of which ranges have configured their
closed timestamps to serve global reads, it will be able to use this
information in `CanSendToFollower` to target follower replicas for
non-stale reads in read-only and read-write transactions.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 1, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 5, 2021
craig bot pushed a commit that referenced this pull request Feb 6, 2021
59505: kv: cache RangeClosedTimestampPolicy in RangeCache, keep up to date r=nvanbenschoten a=nvanbenschoten


Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
andreimatei pushed a commit to andreimatei/cockroach that referenced this pull request Feb 9, 2021
craig bot pushed a commit that referenced this pull request Feb 11, 2021
57077: roachpb/storage: introduce localUncertaintyLimit, teach MVCC about synthetic timestamps r=nvanbenschoten a=nvanbenschoten

Informs #52745.
Informs #36431.

This PR introduces a new `localUncertaintyLimit` parameter to MVCC. The PR also renames the existing `Txn.MaxTimestamp` to `Txn.GlobalUncertaintyLimit`.

`localUncertaintyLimit` is introduced alongside `GlobalUncertaintyLimit` for two reasons:
1. it allows us to stop modifying `global_uncertainty_limit` (formerly `max_timestamp`) when applying an `observed_timestamp` to a transaction when evaluating on a replica. Instead, we can keep the `localUncertaintyLimit` directly on the stack, which better represents its lifetime.
2. it allows MVCC logic to distinguish between the pre- and post-observed timestamp uncertainty interval, depending on whether a possibly-uncertain value has a standard or synthetic timestamp.

The field is defined as follows:

> LocalUncertaintyLimit is the transaction's GlobalUncertaintyLimit, reduced
> by any observed timestamp that the transaction has acquired from the node
> that holds the lease for the range that the transaction is evaluating on.
>
> The local uncertainty limit can reduce the uncertainty interval applied
> to most values on a range. This can lead to values that would otherwise
> be considered uncertain by the original global uncertainty limit to be
> considered "certainly concurrent", and thus not causally related, with
> the transaction due to observed timestamps.
>
> However, the local uncertainty limit does not apply to all values on a
> range. Specifically, values with "synthetic timestamps" must use the
> transaction's original global uncertainty limit for the purposes of
> uncertainty, because observed timestamps do not apply to values with
> synthetic timestamps.
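A toy sketch of that definition (timestamps as plain int64 nanos, hypothetical helpers; not the actual storage code):

```go
package sketch

// localUncertaintyLimit is the global uncertainty limit, reduced by the
// observed timestamp acquired from the leaseholder's node.
func localUncertaintyLimit(globalUncertaintyLimit, observedTS int64) int64 {
	if observedTS < globalUncertaintyLimit {
		return observedTS
	}
	return globalUncertaintyLimit
}

// uncertaintyLimitFor picks the limit to apply to a given value: values
// with synthetic timestamps must use the global limit, because observed
// timestamps do not apply to them.
func uncertaintyLimitFor(valueIsSynthetic bool, globalLimit, localLimit int64) int64 {
	if valueIsSynthetic {
		return globalLimit
	}
	return localLimit
}
```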

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 25, 2021
Fixes cockroachdb#57687.
Related to cockroachdb#52745.

This commit introduces a "commit-wait" sleep stage after a transaction
commits, which is entered if doing so is deemed
necessary for consistency.

By default, commit-wait is only necessary for transactions that commit
with a future-time timestamp that leads the local HLC clock. This is
because CockroachDB's consistency model depends on all transactions
waiting until their commit timestamp is below their gateway clock. In
doing so, transactions ensure that at the time that they complete, all
other clocks in the system (i.e. on all possible gateways) will be no
more than the max_offset below the transaction's commit timestamp. This
property ensures that all causally dependent transactions will have an
uncertainty interval (see GlobalUncertaintyLimit) that exceeds the
original transaction's commit timestamp, preventing stale reads. Without
the wait, it would be possible for a read-write transaction to write a
future-time value and then for a causally dependent transaction to read
below that future-time value, violating "read your writes".

The property must also hold for read-only transactions, which may have a
commit timestamp in the future due to an uncertainty restart after
observing a future-time value in their uncertainty interval. In such
cases, the property that the transaction must wait for the local HLC
clock to exceed its commit timestamp is not necessary to prevent stale
reads, but it is necessary to ensure monotonic reads. Without the wait,
it would be possible for a read-only transaction coordinated on a
gateway with a fast clock to return a future-time value and then for a
causally dependent read-only transaction coordinated on a gateway with a
slow clock to read below that future-time value, violating "monotonic
reads".

In practice, most transactions do not need to wait at all, because their
commit timestamps were pulled from an HLC clock (i.e. are not synthetic)
and so they will be guaranteed to lead the local HLC's clock, assuming
proper HLC time propagation. Only transactions whose commit timestamps
were pushed into the future will need to wait, like those who wrote to a
global_read range and got bumped by the closed timestamp or those who
conflicted (write-read or write-write) with an existing future-time
value.

However, CockroachDB also supports a stricter model of consistency
through its "linearizable" flag. When in linearizable mode (also known
as "strict serializable" mode), all writing transactions (but not
read-only transactions) must wait an additional max_offset after
committing to ensure that their commit timestamp is below the current
HLC clock time of any other node in the system. In doing so, all
causally dependent transactions are guaranteed to start with higher
timestamps, regardless of the gateway they use. This ensures that all
causally dependent transactions commit with higher timestamps, even if
their read and writes sets do not conflict with the original
transaction's. This obviates the need for uncertainty intervals and
prevents the "causal reverse" anamoly which can be observed by a third,
concurrent transaction.

For more, see https://www.cockroachlabs.com/blog/consistency-model/ and
docs/RFCS/20200811_non_blocking_txns.md.

Release notes: None

Release justification: New functionality.
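A minimal sketch of the commit-wait rule described in that commit message (clocks as int64 nanos and a hypothetical `now` closure; not the actual kv code):

```go
package sketch

import "time"

// commitWait sleeps until the local clock exceeds the commit timestamp.
// Per the message above, this only triggers when the commit timestamp
// leads the local clock (i.e. is synthetic); in linearizable mode,
// writers additionally wait out the maximum clock offset so that every
// other node's clock also exceeds the commit timestamp.
func commitWait(now func() int64, commitTS, maxOffset int64, linearizable bool) {
	waitUntil := commitTS
	if linearizable {
		waitUntil += maxOffset
	}
	for now() < waitUntil {
		time.Sleep(time.Duration(waitUntil - now()))
	}
}
```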
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 27, 2021
craig bot pushed a commit that referenced this pull request Feb 27, 2021
61110: kv: implement commit-wait for future time transaction commits r=nvanbenschoten a=nvanbenschoten

Fixes #57687.
Related to #52745.


The PR also fixes a bug by properly marking read-only txn as aborted on rollback, which was missed in a85115a.

We were assuming that all calls to `commitReadOnlyTxnLocked` were for `EndTxn` requests with the Commit flag set to true, but this is not the case. This was not only confusing, but it was also leading to the `txn.commit` metric being incremented on rollback of a read-only transaction, instead of the `txn.aborts` metric.

Release justification: New functionality.

61170: kvserver: remove `kv.atomic_replication_changes.enabled` setting r=aayushshah15 a=aayushshah15

This setting was added in 19.2 to provide a fallback against atomic
replication changes. They've now been a part of CockroachDB for over 3
releases. They're also a requirement for non-voting replicas.

Release note (backward-incompatible change): Removed the
`kv.atomic_replication_changes.enabled` cluster setting. All replication
changes on a range now use joint-consensus.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Mar 4, 2021
Informs cockroachdb#59680.
Informs cockroachdb#52745.

This commit updates `closedts.TargetForPolicy` to calculate a target closed
timestamp that leads present time for ranges with the LEAD_FOR_GLOBAL_READS
closed timestamp policy. This is needed for non-blocking transactions, which
require ranges to close timestamps in the future.

TargetForPolicy's LEAD_FOR_GLOBAL_READS calculation is more complex than its
LAG_BY_CLUSTER_SETTING calculation. Instead of the policy defining an offset
from the publisher's perspective, the policy defines a goal from the consumer's
perspective - the goal being that present-time reads (with a possible
uncertainty interval) can be served from all followers. To accomplish this, we
must work backwards to establish a lead time to publish closed timestamps at.

The calculation looks something like the following:
```
// this should be sufficient for any present-time transaction,
// because its global uncertainty limit should be <= this time.
// For more, see (*Transaction).RequiredFrontier.
closed_ts_at_follower = now + max_offset

// the sender must account for the time it takes to propagate a
// closed timestamp update to its followers.
closed_ts_at_sender = closed_ts_at_follower + propagation_time

// closed timestamps propagate in two ways. Both need to make it to
// followers in time.
propagation_time = max(raft_propagation_time, side_propagation_time)

// raft propagation takes 3 network hops to go from a leader proposing
// a write (with a closed timestamp update) to the write being applied.
// 1. leader sends MsgProp with entry
// 2. followers send MsgPropResp with vote
// 3. leader sends MsgProp with higher commit index
//
// we also add on a small bit of overhead for request evaluation, log
// sync, and state machine apply latency.
raft_propagation_time = max_network_rtt * 1.5 + raft_overhead

// side-transport propagation takes 1 network hop, as there is no voting.
// However, it is delayed by the full side_transport_close_interval in
// the worst-case.
side_propagation_time = max_network_rtt * 0.5 + side_transport_close_interval

// put together, we get the following result
closed_ts_at_sender = now + max_offset + max(
	max_network_rtt * 1.5 + raft_overhead,
	max_network_rtt * 0.5 + side_transport_close_interval,
)
```

While writing this, I explored what it would take to use dynamic network latency
measurements in this calculation to complete cockroachdb#59680. The code for that wasn't
too bad, but it brought up a number of questions, including how far into
the latency tail we care about and whether to place upper and lower
bounds on this value. To
avoid needing to immediately answer these questions, the commit hardcodes a
maximum network RTT of 150ms, which should be an overestimate for almost any
cluster we expect to run on.

The commit also adds a new `kv.closed_timestamp.lead_for_global_reads_override`
cluster setting, which, if nonzero, overrides the lead time that global_read
ranges use to publish closed timestamps. The cluster setting is hidden, but
should provide an escape hatch for cases where we get the calculation
(especially when it becomes dynamic) wrong.
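
For illustration, the full target calculation, including the override escape hatch, might be sketched in Go as follows; the constant values and names are assumptions for the sketch, not the actual implementation:

```go
package closedts

import "time"

// Constants assumed for this sketch; the real values live in cluster
// settings and code constants.
const (
	maxNetworkRTT              = 150 * time.Millisecond // hardcoded overestimate
	raftOverhead               = 20 * time.Millisecond  // evaluation, log sync, apply (assumed)
	sideTransportCloseInterval = 200 * time.Millisecond // assumed setting value
)

// leadTargetForGlobalReads computes the closed timestamp target for a
// LEAD_FOR_GLOBAL_READS range: far enough in the future that, after
// propagating to followers, it still covers a present-time read's
// uncertainty interval.
func leadTargetForGlobalReads(now time.Time, maxOffset, override time.Duration) time.Time {
	if override != 0 {
		// kv.closed_timestamp.lead_for_global_reads_override escape hatch.
		return now.Add(override)
	}
	raftPropagation := maxNetworkRTT*3/2 + raftOverhead
	sidePropagation := maxNetworkRTT/2 + sideTransportCloseInterval
	propagation := raftPropagation
	if sidePropagation > propagation {
		propagation = sidePropagation
	}
	return now.Add(maxOffset + propagation)
}
```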

Release justification: needed for new functionality
craig bot pushed a commit that referenced this pull request Mar 5, 2021
61207: cli: don't refresh prompt on line continuations r=jordanlewis a=jordanlewis

Previously, the CLI would recalculate its prompt (including any required
network roundtrips to determine transaction status and current database)
on every newline, even if the newline came in the middle of a SQL
statement. This was wasteful, of course, but it's also very confusing
for users when entering a large multi-line statement via paste: the
prompt refreshes are deferred until enter is pressed for the first time,
producing a burst of unexplained latency that doesn't reflect the actual
SQL being executed. It also doesn't match the reported execution time,
leading to mass confusion and sadness.

Now, we don't bother refreshing the prompt during line continuations.

Closes #61095

Release note (cli change): optimize handling of multi-line SQL strings
to avoid unwanted extra server roundtrips.

Release justification: bug fixes and low-risk updates to new functionality
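
A minimal sketch of the behavioral change, with hypothetical names (the actual CLI code is structured differently):

```go
package cli

// maybeRefreshPrompt skips the potentially network-bound prompt
// recomputation (txn status, current database) while the user is still
// mid-statement; line continuations keep the cheap continuation prompt.
func maybeRefreshPrompt(isContinuation bool, refresh func()) {
	if isContinuation {
		return
	}
	refresh()
}
```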

61305:  kvserver: add closedts side-transport consumer  r=andreimatei a=andreimatei

Add the consumer of closed timestamps communicated by the side transport
(i.e. the gRPC server for our new push-based streaming protocol).

This side-transport consumer accumulates closed timestamps communicated
to it by other nodes (the leaseholders of the respective ranges). Its
state is queried whenever a range needs a higher closed timestamp than
what it has locally in the Replica state, at which point the Replica's
state is lazily updated.

Release note: None
Release justification: Needed for GLOBAL tables.
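
As a rough sketch of the consumer's role, keyed per range for simplicity (the real protocol groups ranges by sender node and uses internal types; all names here are assumptions):

```go
package sidetransport

import (
	"sync"
	"time"
)

// consumer accumulates the latest closed timestamp communicated for
// each range by the side transport. Replicas query it lazily when
// their locally applied closed timestamp is not high enough.
type consumer struct {
	mu     sync.Mutex
	closed map[int64]time.Time // rangeID -> closed timestamp
}

func newConsumer() *consumer {
	return &consumer{closed: make(map[int64]time.Time)}
}

func (c *consumer) update(rangeID int64, ts time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if ts.After(c.closed[rangeID]) {
		c.closed[rangeID] = ts
	}
}

// maxClosed returns the highest closed timestamp known for the range; a
// Replica compares this against its local state and updates it lazily.
func (c *consumer) maxClosed(rangeID int64) time.Time {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.closed[rangeID]
}
```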

61386: kv: configure leading closed timestamp target for global_read ranges r=nvanbenschoten a=nvanbenschoten

Informs #59680.
Informs #52745.

This commit updates `closedts.TargetForPolicy` to calculate a target closed
timestamp that leads present time for ranges with the LEAD_FOR_GLOBAL_READS
closed timestamp policy. This is needed for non-blocking transactions, which
require ranges to close timestamps in the future.

Release justification: needed for new functionality

61499: sql: Require FORCE to modify protected zone config fields r=otan a=ajstorm

With the introduction of the multi-region simplification in 21.1, there
are a set of fields in the zone configurations which are protected by the
system. These fields are transparently modified by the system when
certain multi-region abstractions are used (e.g. when a region is added
to the database, the constraints field is modified to add the new
region). Due to the protected status of these fields, we want to prevent
users from setting them if they're not aware of the impact of doing so
(that they could be overwritten by the system, and that they could
result in sub-optimal performance). To make this more explicit to users,
we block modifications to these fields; if users wish to modify them
anyway, they must provide a FORCE argument along with the modification
statement. This applies both to setting these fields directly in the
zone configuration and to any subsequent multi-region statement that
would overwrite the user's zone configuration update.

Release justification: Prevent bad user experience with new feature.

Release note (sql change): Updates to certain fields in the zone
configurations are blocked for multi-region enabled databases. This
block can be overridden through the use of the FORCE keyword on the
blocked statement.

Resolves #60447 

61556: sql: read db descriptor from store when constructing mr zone configs r=arulajmani a=arulajmani

Previously, when constructing multi-region zone configs for a database
in the type schema changer, we were doing so using a leased version of
the database descriptor. This meant it could be the case that we were
constructing a stale zone configuration, which manifested itself in
some CI failures. This patch fixes that issue by always reading the
database descriptor from the store when constructing zone
configurations in the type schema changer.

Previously, this was failing consistently under stress within ~30s. I've
had it running perfectly under stress for the last 10 minutes.

Fixes: #61320

Release justification: bug fix to new functionality
Release note: None

Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Adam Storm <storm@cockroachlabs.com>
Co-authored-by: arulajmani <arulajmani@gmail.com>
Labels
A-multiregion Related to multi-region