
Add closed timestamps to architecture docs #11762


Merged · merged 1 commit · Sep 30, 2021
4 changes: 2 additions & 2 deletions v21.1/architecture/replication-layer.md
@@ -69,12 +69,12 @@ After loading the snapshot, the node gets up to date by replaying all actions fr

A single node in the Raft group acts as the leaseholder, which is the only node that can serve reads or propose writes to the Raft group leader (both actions are received as `BatchRequests` from [`DistSender`](distribution-layer.html#distsender)).

When serving reads, leaseholders bypass Raft; for the leaseholder's writes to have been committed in the first place, they must have already achieved consensus, so a second consensus on the same data is unnecessary. This has the benefit of not incurring networking round trips required by Raft and greatly increases the speed of reads (without sacrificing consistency).

CockroachDB attempts to elect a leaseholder who is also the Raft group leader, which can also optimize the speed of writes.

If there is no leaseholder, any node receiving a request will attempt to become the leaseholder for the range. To prevent two nodes from acquiring the lease, the requester includes a copy of the last valid lease it had; if another node became the leaseholder, its request is ignored.

When serving [strongly-consistent (aka "non-stale") reads](transaction-layer.html#reading), leaseholders bypass Raft; for the leaseholder's writes to have been committed in the first place, they must have already achieved consensus, so a second consensus on the same data is unnecessary. This has the benefit of not incurring latency from networking round trips required by Raft and greatly increases the speed of reads (without sacrificing consistency).
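Why this path is cheap is easiest to see in code form. Below is a minimal Go sketch of the leaseholder read fast path, using toy types; none of these names are CockroachDB's actual API:

```go
package main

import "fmt"

// replica is a toy stand-in for a range replica.
type replica struct {
	holdsLease bool
	leaseExp   int64             // lease expiration timestamp
	mvcc       map[string]string // committed MVCC state (toy model)
}

// serveRead sketches the leaseholder fast path: no Raft round trip is needed,
// because everything visible in local MVCC state already achieved consensus
// when it was written. Writes, by contrast, must be proposed through Raft.
func (r *replica) serveRead(key string, readTS int64) (string, error) {
	if !r.holdsLease || readTS > r.leaseExp {
		return "", fmt.Errorf("not the leaseholder; retry via DistSender")
	}
	return r.mvcc[key], nil // local read, no consensus round trip
}

func main() {
	r := &replica{holdsLease: true, leaseExp: 200, mvcc: map[string]string{"k": "v"}}
	fmt.Println(r.serveRead("k", 100)) // v <nil>
}
```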

#### Co-location with Raft leadership

The range lease is completely separate from Raft leadership, and so without further efforts, Raft leadership and the Range lease might not be held by the same replica. However, we can optimize query performance by making the same node both Raft leader and the leaseholder; it reduces network round trips if the leaseholder receiving the requests can simply propose the Raft commands to itself, rather than communicating them to another node.
27 changes: 25 additions & 2 deletions v21.1/architecture/transaction-layer.md
@@ -42,6 +42,11 @@ If transactions fail for other reasons, such as failing to pass a SQL constraint

If the transaction has not been aborted, the transaction layer begins executing read operations. If a read only encounters standard MVCC values, everything is fine. However, if it encounters any write intents, the operation must be resolved as a [transaction conflict](#transaction-conflicts).

CockroachDB provides the following types of reads:

- Strongly-consistent (aka "non-stale") reads: These are the default and most common type of read. These reads go through the [leaseholder](replication-layer.html#leases) and see all writes performed by writers that committed before the reading transaction started. They always return data that is correct and up-to-date.
- Stale reads: These are useful in situations where you can afford to read data that is slightly stale in exchange for faster reads. They can only be used in read-only transactions that use the [`AS OF SYSTEM TIME`](../as-of-system-time.html) clause. They do not need to go through the leaseholder, since they ensure consistency by reading from a local replica at a timestamp that is never higher than the [closed timestamp](#closed-timestamps). For more information about how to use stale reads from SQL, see [Follower Reads](../follower-reads.html).
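As a usage sketch, a stale read issued from a Go client might look like the following. The `accounts` table, connection string, and choice of the `lib/pq` driver are assumptions for illustration; `follower_read_timestamp()` is an enterprise feature in this version:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // any PostgreSQL-wire driver works; lib/pq is an assumption
)

func main() {
	// Connection string is illustrative; substitute your own cluster address.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// A stale (follower) read: AS OF SYSTEM TIME pins the read to a recent
	// past timestamp at or below the closed timestamp, so a nearby replica
	// can serve it without going through the leaseholder.
	rows, err := db.Query(
		"SELECT id, balance FROM accounts AS OF SYSTEM TIME follower_read_timestamp()")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var id int
		var balance float64
		if err := rows.Scan(&id, &balance); err != nil {
			log.Fatal(err)
		}
		fmt.Println(id, balance)
	}
}
```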

### Commits (phase 2)

CockroachDB checks the running transaction's record to see if it's been `ABORTED`; if it has, it restarts the transaction.
@@ -93,8 +98,26 @@ For more detail about the risks that large clock offsets can cause, see [What ha

As part of providing serializability, whenever an operation reads a value, we store the operation's timestamp in a timestamp cache, which shows the high-water mark for values being read.

The timestamp cache is a data structure used to store information about the reads performed by [leaseholders](replication-layer.html#leases). This is used to ensure that once some transaction *t1* reads a row, another transaction *t2* that comes along and tries to write to that row will be ordered after *t1*, thus ensuring a serial order of transactions, aka serializability.

Whenever a write occurs, its timestamp is checked against the timestamp cache. If the timestamp is less than the timestamp cache's latest value, we attempt to push the timestamp for its transaction forward to a later time. Pushing the timestamp might cause the transaction to restart in the second phase of the transaction (see [read refreshing](#read-refreshing)).
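A minimal Go sketch of the idea, assuming integer timestamps and a flat per-key map rather than CockroachDB's actual interval-based structure:

```go
package main

import "fmt"

// tsCache records, per key, the highest timestamp at which the key was read.
type tsCache struct {
	highWater map[string]int64 // key -> latest read timestamp
}

// noteRead records that key was read at ts.
func (c *tsCache) noteRead(key string, ts int64) {
	if ts > c.highWater[key] {
		c.highWater[key] = ts
	}
}

// checkWrite returns the timestamp a write must use: if the proposed
// timestamp is at or below the cache's high-water mark for the key, the
// writing transaction is pushed to just above it (which may later force a
// read refresh at commit time).
func (c *tsCache) checkWrite(key string, proposed int64) int64 {
	if readTS, ok := c.highWater[key]; ok && proposed <= readTS {
		return readTS + 1
	}
	return proposed
}

func main() {
	c := &tsCache{highWater: map[string]int64{}}
	c.noteRead("row1", 100)               // t1 reads row1 at ts=100
	fmt.Println(c.checkWrite("row1", 90)) // t2's write at ts=90 is pushed to 101
}
```

This push is exactly what places *t2* after *t1* in the serial order.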

### Closed timestamps

Each CockroachDB range tracks a property called its _closed timestamp_, which means that no new writes can ever be introduced below that timestamp. The closed timestamp advances continuously, and lags the current time by some target interval. If the range receives a write at a timestamp less than its closed timestamp, the write is forced to change its timestamp, which might result in a transaction retry error (see [read refreshing](#read-refreshing)).

In other words, a closed timestamp is a promise by the range's [leaseholder](replication-layer.html#leases) to its follower replicas that it will not accept writes below that timestamp. Generally speaking, the leaseholder continuously closes timestamps a few seconds in the past.

The closed timestamp subsystem propagates information from leaseholders to followers by piggybacking closed timestamps onto Raft commands, so that the replication stream is synchronized with timestamp closing. This means that a follower replica can start serving reads with timestamps at or below the closed timestamp as soon as it has applied all of the Raft commands up to the position in the [Raft log](replication-layer.html#raft-logs) specified by the leaseholder.

Once the follower replica has applied the abovementioned Raft commands, it has all the data necessary to serve reads with timestamps less than or equal to the closed timestamp.
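A sketch of the eligibility check, with illustrative names (the real implementation tracks this state per range and per replica):

```go
package main

import "fmt"

// canServeFollowerRead sketches the two conditions a follower needs before
// serving a read locally: the read timestamp must be at or below the closed
// timestamp learned from the leaseholder, and the replica must have applied
// the Raft log up to the position the leaseholder tied to that closed
// timestamp.
func canServeFollowerRead(readTS, closedTS int64, appliedIndex, requiredIndex uint64) bool {
	return readTS <= closedTS && appliedIndex >= requiredIndex
}

func main() {
	// A read at ts=95 against a closed timestamp of 100, on a follower that
	// has applied past the required log position, can be served locally.
	fmt.Println(canServeFollowerRead(95, 100, 42, 40)) // true
}
```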

Note that closed timestamps are valid even if the leaseholder changes, since they are preserved across [lease transfers](replication-layer.html#epoch-based-leases-table-data). Once a lease transfer occurs, the new leaseholder will not break the closed timestamp promise made by the old leaseholder.

Closed timestamps provide the guarantees that support low-latency historical (stale) reads, also known as [Follower Reads](../follower-reads.html). Follower reads can be particularly useful in [multi-region deployments](../multiregion-overview.html).

For more information about the implementation of closed timestamps and Follower Reads, see our blog post [An Epic Read on Follower Reads](https://www.cockroachlabs.com/blog/follower-reads-stale-data/).

### client.Txn and TxnCoordSender

As we mentioned in the SQL layer's architectural overview, CockroachDB converts all SQL statements into key-value (KV) operations, which is how data is ultimately stored and accessed.
@@ -267,7 +290,7 @@ If there is a deadlock between transactions (i.e., they're each blocked by each

### Read refreshing

Whenever a transaction's timestamp has been pushed, additional checks are required before allowing it to commit at the pushed timestamp: any values which the transaction previously read must be checked to verify that no writes have subsequently occurred between the original transaction timestamp and the pushed transaction timestamp. This check prevents serializability violation. The check is done by keeping track of all the reads using a dedicated `RefreshRequest`. If this succeeds, the transaction is allowed to commit (transactions perform this check at commit time if they've been pushed by a different transaction or by the timestamp cache, or they perform the check whenever they encounter a `ReadWithinUncertaintyIntervalError` immediately, before continuing).
Whenever a transaction's timestamp has been pushed, additional checks are required before allowing it to commit at the pushed timestamp: any values which the transaction previously read must be checked to verify that no writes have subsequently occurred between the original transaction timestamp and the pushed transaction timestamp. This check prevents serializability violations. The check is done by keeping track of all the reads using a dedicated `RefreshRequest`. If this succeeds, the transaction is allowed to commit (transactions perform this check at commit time if they've been pushed by a different transaction or by the [timestamp cache](#timestamp-cache), or they perform the check immediately upon encountering a [`ReadWithinUncertaintyIntervalError`](../transaction-retry-error-reference.html#readwithinuncertaintyinterval), before continuing).
If the refreshing is unsuccessful, then the transaction must be retried at the pushed timestamp.
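A Go sketch of the refresh check; `latestWrite` stands in for the KV lookup a real `RefreshRequest` performs, and all names are illustrative:

```go
package main

import "fmt"

// refresh re-checks every span the transaction read before committing at the
// pushed timestamp. If any span saw a write in the window (origTS, pushedTS],
// committing at the pushed timestamp would violate serializability, so the
// refresh fails and the transaction must be retried.
func refresh(readSpans []string, origTS, pushedTS int64, latestWrite func(string) int64) bool {
	for _, span := range readSpans {
		if w := latestWrite(span); w > origTS && w <= pushedTS {
			return false // conflicting write landed in the window; retry
		}
	}
	return true // safe to commit at pushedTS
}

func main() {
	writes := map[string]int64{"a": 50, "b": 120} // span -> latest write timestamp
	lookup := func(span string) int64 { return writes[span] }
	fmt.Println(refresh([]string{"a", "b"}, 100, 150, lookup)) // false: "b" written at 120
}
```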

### Transaction pipelining
@@ -386,7 +409,7 @@ The consistency guarantees offered by non-blocking transactions are enforced thr
Non-blocking transactions are implemented via _non-blocking ranges_. Every non-blocking range has the following properties:

- Any transaction that writes to this range has its write timestamp pushed into the future.
- The range is able to propagate a [closed timestamp](../follower-reads.html#how-follower-reads-work) in the future of present time.
- The range is able to propagate a [closed timestamp](#closed-timestamps) in the future of present time.
- A transaction that writes to this range and commits with a future-time commit timestamp needs to wait until the HLC advances past its commit timestamp. This process is known as _"commit-wait"_. Essentially, the HLC waits until it advances past the future timestamp on its own, or it advances due to timestamp updates received from other nodes (see the sketch after this list).
- A transaction that reads a future-time write to this range can have its commit timestamp bumped into the future as well, if the write falls in the read's uncertainty window (this is dictated by the [maximum clock offset](#max-clock-offset-enforcement) configured for the cluster). Such transactions (a.k.a. "conflicting readers") may also need to commit-wait.
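A sketch of commit-wait, assuming a toy clock in place of the node's HLC (a real implementation waits on clock updates rather than polling):

```go
package main

import (
	"fmt"
	"time"
)

// commitWait blocks until the local clock has advanced past the future-time
// commit timestamp; only then may the commit be acknowledged to the client.
func commitWait(commitTS int64, hlcNow func() int64) {
	for hlcNow() < commitTS {
		time.Sleep(time.Millisecond)
	}
}

func main() {
	start := time.Now()
	now := func() int64 { return int64(time.Since(start)) } // toy clock starting at 0
	commitWait(int64(5*time.Millisecond), now)              // commit timestamp 5ms in the future
	fmt.Println("commit acknowledged after commit-wait")
}
```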

2 changes: 1 addition & 1 deletion v21.1/follower-reads.md
@@ -22,7 +22,7 @@ This is an [{{ site.data.products.enterprise }} feature](enterprise-licensing.ht

## How follower reads work

Each CockroachDB node tracks a property called its "closed timestamp", which means that no new writes can ever be introduced below that timestamp. The closed timestamp advances forward by some target interval behind the current time. If the replica receives a write at a timestamp less than its closed timestamp, it rejects the write.
Each CockroachDB range tracks a property called its [_closed timestamp_](architecture/transaction-layer.html#closed-timestamps), which means that no new writes can ever be introduced below that timestamp. The closed timestamp advances continuously, trailing the current time by some target interval. If the range receives a write at a timestamp less than its closed timestamp, it rejects the write.

With [follower reads enabled](#enable-disable-follower-reads), any replica on a node can serve a read for a key as long as the time at which the operation is performed (i.e., the [`AS OF SYSTEM TIME`](as-of-system-time.html) value) is less than or equal to the node's closed timestamp.

6 changes: 3 additions & 3 deletions v21.1/transaction-retry-error-reference.md
@@ -90,7 +90,7 @@ The `RETRY_SERIALIZABLE` error occurs in the following cases:

2. When a [high-priority transaction](transactions.html#transaction-priorities) _A_ does a read that runs into a write intent from another lower-priority transaction _B_, and some other transaction _C_ writes to a key that _B_ has already read. Transaction _B_ will get this error when it tries to commit, because _A_ has already read some of the data touched by _B_ and returned results to the client, and _C_ has written data previously read by _B_.

3. When a transaction _A_ is forced to refresh (i.e., change its timestamp) due to hitting the maximum _closed timestamp_ interval (closed timestamps enable [Follower Reads](follower-reads.html#how-follower-reads-work) and [Change Data Capture (CDC)](stream-data-out-of-cockroachdb-using-changefeeds.html)). This can happen when transaction _A_ is a long-running transaction, and there is a write by another transaction to data that _A_ has already read. If this is the cause of the error, the solution is to increase the `kv.closed_timestamp.target_duration` setting to a higher value. Unfortunately, there is no indication from this error code that a too-low closed timestamp setting is the issue. Therefore, you may need to rule out cases 1 and 2 (or experiment with increasing the closed timestamp interval, if that is possible for your application - see the note below).
3. When a transaction _A_ is forced to refresh (i.e., change its timestamp) due to hitting the maximum [_closed timestamp_](architecture/transaction-layer.html#closed-timestamps) interval (closed timestamps enable [Follower Reads](follower-reads.html#how-follower-reads-work) and [Change Data Capture (CDC)](stream-data-out-of-cockroachdb-using-changefeeds.html)). This can happen when transaction _A_ is a long-running transaction, and there is a write by another transaction to data that _A_ has already read. If this is the cause of the error, the solution is to increase the `kv.closed_timestamp.target_duration` setting to a higher value. Unfortunately, there is no indication from this error code that a too-low closed timestamp setting is the issue. Therefore, you may need to rule out cases 1 and 2 (or experiment with increasing the closed timestamp interval, if that is possible for your application - see the note below).
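For case 3, raising the setting from a Go client might look like the following; the connection string, driver, and the `10s` value are illustrative (pick a value suited to your longest-running transactions, and weigh it against staler follower reads):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // any PostgreSQL-wire driver works; lib/pq is an assumption
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// Raising the target duration gives long-running transactions more room
	// before they hit the closed timestamp interval.
	if _, err := db.Exec(
		"SET CLUSTER SETTING kv.closed_timestamp.target_duration = '10s'"); err != nil {
		log.Fatal(err)
	}
}
```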

_Action_:

@@ -175,7 +175,7 @@ This error occurs in the cases described below.

2. When a [high-priority transaction](transactions.html#transaction-priorities) _A_ does a read that runs into a write intent from another lower-priority transaction _B_. Transaction _B_ may get this error when it tries to commit, because _A_ has already read some of the data touched by _B_ and returned results to the client.

3. When a transaction _A_ is forced to refresh (change its timestamp) due to hitting the maximum _closed timestamp_ interval (closed timestamps enable [Follower Reads](follower-reads.html#how-follower-reads-work) and [Change Data Capture (CDC)](stream-data-out-of-cockroachdb-using-changefeeds.html)). This can happen when transaction _A_ is a long-running transaction, and there is a write by another transaction to data that _A_ has already read.
3. When a transaction _A_ is forced to refresh (change its timestamp) due to hitting the maximum [_closed timestamp_](architecture/transaction-layer.html#closed-timestamps) interval (closed timestamps enable [Follower Reads](follower-reads.html#how-follower-reads-work) and [Change Data Capture (CDC)](stream-data-out-of-cockroachdb-using-changefeeds.html)). This can happen when transaction _A_ is a long-running transaction, and there is a write by another transaction to data that _A_ has already read.

_Action_:

@@ -280,4 +280,4 @@ Retry transaction _A_ as described in [client-side retry handling](transactions.
- [Client-side retry handling](transactions.html#client-side-intervention)
- [Understanding and avoiding transaction contention](performance-best-practices-overview.html#understanding-and-avoiding-transaction-contention)
- [DB Console Transactions Page](ui-transactions-page.html)
- [Architecture - Transaction Layer](architecture/transaction-layer.html)
- [Architecture - Transaction Layer](architecture/transaction-layer.html)
4 changes: 2 additions & 2 deletions v21.2/architecture/replication-layer.md
@@ -69,12 +69,12 @@ After loading the snapshot, the node gets up to date by replaying all actions fr

A single node in the Raft group acts as the leaseholder, which is the only node that can serve reads or propose writes to the Raft group leader (both actions are received as `BatchRequests` from [`DistSender`](distribution-layer.html#distsender)).

When serving reads, leaseholders bypass Raft; for the leaseholder's writes to have been committed in the first place, they must have already achieved consensus, so a second consensus on the same data is unnecessary. This has the benefit of not incurring networking round trips required by Raft and greatly increases the speed of reads (without sacrificing consistency).

CockroachDB attempts to elect a leaseholder who is also the Raft group leader, which can also optimize the speed of writes.

If there is no leaseholder, any node receiving a request will attempt to become the leaseholder for the range. To prevent two nodes from acquiring the lease, the requester includes a copy of the last valid lease it had; if another node became the leaseholder, its request is ignored.

When serving [strongly-consistent (aka "non-stale") reads](transaction-layer.html#reading), leaseholders bypass Raft; for the leaseholder's writes to have been committed in the first place, they must have already achieved consensus, so a second consensus on the same data is unnecessary. This has the benefit of not incurring latency from networking round trips required by Raft and greatly increases the speed of reads (without sacrificing consistency).

#### Co-location with Raft leadership

The range lease is completely separate from Raft leadership, and so without further efforts, Raft leadership and the Range lease might not be held by the same replica. However, we can optimize query performance by making the same node both Raft leader and the leaseholder; it reduces network round trips if the leaseholder receiving the requests can simply propose the Raft commands to itself, rather than communicating them to another node.