sql: expose MVCC timestamps to SQL #50102

Closed
ajwerner opened this issue Jun 11, 2020 · 11 comments · Fixed by #51494
Labels
A-sql-execution Relating to SQL execution. A-sql-optimizer SQL logical planning and optimizations. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@ajwerner
Contributor

Is your feature request related to a problem? Please describe.

Cockroach associates each row in its KV store with an HLC timestamp for the purposes of MVCC. Currently this information is not accessible to client applications. Making it accessible would be valuable because it would allow client applications to detect when data has changed; this has been a customer request on more than one occasion.

It also loosely relates to #39275, as using these timestamps might make detecting differences between rows easier. It also seems like this task would be something of a prerequisite for implementing #19749.

Describe the solution you'd like

I'd like to see cockroach expose a system column that a SQL query can select to retrieve the MVCC timestamp of a row. Postgres offers such system columns (see https://www.postgresql.org/docs/current/ddl-system-columns.html) that provide introspection into its concurrency control.

One wrinkle in all of this is that different column families in a row can have different MVCC timestamps. Perhaps the name of the column will need to somehow refer to the name of the column family. Perhaps then we'd need some way to refer to the upper bound of all of the column families.

Another related problem has to do with retrieving this column from indexes. Only in some cases will we know that the write timestamp on a secondary index is the same as the timestamp on the primary index. Figuring out when the timestamp read off of the secondary index can be used vs. when we need to go back and consult the primary index may also be something of a complex problem. I suspect for many cases it's not a big deal -- the need to do an index join back on the primary index when using this feature would likely be acceptable for v0.
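To make the desired client experience concrete, here is a minimal Go sketch of what querying such a system column could look like, using the column name that ultimately shipped (`crdb_internal_mvcc_timestamp`, per the release note and closing comment later in this thread). The connection string, table `t`, and column `k` are placeholders.

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	// Placeholder connection string; adjust for a cluster that has the feature.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The system column is hidden, so it only appears when selected explicitly.
	rows, err := db.Query("SELECT k, crdb_internal_mvcc_timestamp FROM t")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var k int
		var ts string // decimal-encoded HLC timestamp
		if err := rows.Scan(&k, &ts); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("k=%d mvcc_timestamp=%s\n", k, ts)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}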

Describe alternatives you've considered

Not much. Perhaps some sort of auto-updating column that is updated on every write to the row? Triggers could relieve the need for this in some cases, but that's not ideal.

Additional context

The changes required for this should be pretty isolated. The row fetcher already retrieves the MVCC timestamps in every get and scan request. It's a matter of telling the row fetcher to decode and use them, and of planning the retrieval of these columns. This feels squarely like a SQL features project and might even make for a good intern project.

The remaining complication, as far as I'm concerned, is that today schema change backfills will update rows and thus their MVCC timestamps. The solution to this will be to construct new indexes and swap over to them (#47989). That issue doesn't necessarily need to precede the work to expose MVCC columns, but it has a major bearing on their semantics. I suspect the work should proceed in parallel.

@ajwerner ajwerner added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-sql-optimizer SQL logical planning and optimizations. A-sql-execution Relating to SQL execution. labels Jun 11, 2020
@ajwerner
Contributor Author

Another proposal is to expose the MVCC timestamp as a builtin. It's not clear exactly what the signature of such a builtin would be, but it seems conceivable that we could invent one. I have a harder time envisioning how such a builtin would get planned or implemented.

@rohany
Contributor

rohany commented Jun 18, 2020

One wrinkle in all of this is that different column families in a row can have different MVCC timestamps. Perhaps the name of the column will need to somehow refer to the name of the column family. Perhaps then we'd need some way to refer to the upper bound of all of the column families.

We already "collect" information about each KV read when buffering data for a row. It wouldn't be hard to keep a running max MVCC timestamp that is reset every time a row is emitted from the fetcher.
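A minimal sketch of that idea, assuming the `Forward` helper on `hlc.Timestamp` from `pkg/util/hlc`; the tracker type is hypothetical and not the actual `row.Fetcher` code.

package fetcherexample

import "github.com/cockroachdb/cockroach/pkg/util/hlc"

// rowTSTracker keeps a running maximum of the MVCC timestamps observed
// across the KVs (column families) that make up a single row.
// Hypothetical helper for illustration only.
type rowTSTracker struct {
	maxTS hlc.Timestamp
}

// observe folds in the timestamp of one KV belonging to the current row.
func (t *rowTSTracker) observe(ts hlc.Timestamp) {
	t.maxTS.Forward(ts) // advances maxTS only if ts is larger
}

// finishRow returns the max timestamp for the row and resets the tracker
// so it can be reused for the next row emitted by the fetcher.
func (t *rowTSTracker) finishRow() hlc.Timestamp {
	ts := t.maxTS
	t.maxTS = hlc.Timestamp{}
	return ts
}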

the need to do an index join back on the primary index when using this feature would likely be acceptable for v0.

This seems reasonable.

I might hack on this and see how possible it is. I'll update the thread with results if I get anywhere.

@rohany
Contributor

rohany commented Jun 18, 2020

Turns out the fetcher-level implementation of this is really easy -- it already tracks timestamps and buffers them across families for each row:

--- a/pkg/sql/row/kv_fetcher.go
+++ b/pkg/sql/row/kv_fetcher.go
@@ -17,6 +17,7 @@ import (
        "github.com/cockroachdb/cockroach/pkg/roachpb"
        "github.com/cockroachdb/cockroach/pkg/sql/sqlbase"
        "github.com/cockroachdb/cockroach/pkg/storage/enginepb"
+       "github.com/cockroachdb/cockroach/pkg/util/hlc"
        "github.com/cockroachdb/errors"
 )

@@ -73,14 +74,16 @@ func (f *KVFetcher) NextKV(
                        var key []byte
                        var rawBytes []byte
                        var err error
-                       key, rawBytes, f.batchResponse, err = enginepb.ScanDecodeKeyValueNoTS(f.batchResponse)
+                       var ts hlc.Timestamp
+                       key, ts, rawBytes, f.batchResponse, err = enginepb.ScanDecodeKeyValue(f.batchResponse)
                        if err != nil {
                                return false, kv, false, err
                        }
                        return true, roachpb.KeyValue{
                                Key: key,
                                Value: roachpb.Value{
-                                       RawBytes: rawBytes,
+                                       RawBytes:  rawBytes,
+                                       Timestamp: ts,
                                },
                        }, newSpan, nil
                }

The next step is just planning -- we need to advertise this timestamp column in the catalog, ensure that only the primary index contains it (which forces an index join), and then tell the fetcher in what column slot to place the timestamp. @RaduBerinde do you think you could give me a pointer on where to look to do this in the optimizer, and whether this is something that would be easily done?

@ajwerner
Contributor Author

One question is whether decoding the timestamp has some cost here (I anticipate it does), and thus whether we want to only decode it if we need it.

@rohany
Contributor

rohany commented Jun 19, 2020

There is going to be some small amount of overhead, but it looks pretty easy to only decode the timestamp when we need it.
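Following the shape of the diff above, a hedged sketch of decoding the timestamp only when it is needed; `needTS` stands in for whatever flag the planner would set, and the helper itself is hypothetical (the real change would live in `(*KVFetcher).NextKV`).

package fetcherexample

import (
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/storage/enginepb"
	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// decodeNextKV decodes one KV from a batch response, paying for timestamp
// decoding only when the caller asked for it.
func decodeNextKV(batch []byte, needTS bool) (roachpb.KeyValue, []byte, error) {
	var key, rawBytes, rest []byte
	var ts hlc.Timestamp
	var err error
	if needTS {
		key, ts, rawBytes, rest, err = enginepb.ScanDecodeKeyValue(batch)
	} else {
		key, rawBytes, rest, err = enginepb.ScanDecodeKeyValueNoTS(batch)
	}
	if err != nil {
		return roachpb.KeyValue{}, nil, err
	}
	return roachpb.KeyValue{
		Key: key,
		Value: roachpb.Value{
			RawBytes:  rawBytes,
			Timestamp: ts, // left as the zero timestamp when needTS is false
		},
	}, rest, nil
}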

@rohany
Contributor

rohany commented Jun 22, 2020

cc @RaduBerinde, did you get a chance to look at this? It seems like this is a pretty highly desired feature among customers, so we might look at changing its priority.

@RaduBerinde
Member

For the optimizer, it should be as easy as changing the catalog (sql/opt_catalog.go) to advertise an extra, hidden column and put it in the primary index. We'd want a similar change in the test catalog (sql/opt/testutils/testcat/test_catalog.go) to be able to have related opt tests.

@rohany
Contributor

rohany commented Jun 23, 2020

Hmm, it doesn't seem to be as easy as just changing the optTable. Query planning outside of the optimizer doesn't use the catalog interfaces and instead operates directly on descriptors, meaning that this extra column has to escape the optTable and live on the ImmutableTableDescriptor (and on the columns in the TableDescriptor inside it). There's also the question of what ID to assign to this new column.

@RaduBerinde
Member

It's up to the ConstructScan opt_factory code to create whatever metadata is needed to scan this column. E.g., it could be converted to a special flag in the scanNode, or whatever.

rohany added a commit to rohany/cockroach that referenced this issue Jun 29, 2020
Fixes cockroachdb#50102.

This PR adds introspection into the KV layer's concurrency control from
the SQL level. In particular, we now expose the MVCC HLC timestamp of a
row as a special system column on every table. This system column is
exposed as a decimal, and is computed as `wall time * 10^10 + logical time`.

To accomplish this, this PR adds planning and execution infrastructure
for implicit system columns that can be produced by the execution layer
for a particular row.

Release note (sql change): Expose the MVCC timestamp of each row as a
system column on tables.
rohany added a commit to rohany/cockroach that referenced this issue Jul 16, 2020
Fixes cockroachdb#50102.

This PR adds introspection into the KV layer's concurrency control from
the SQL level. In particular, we now expose the MVCC HLC timestamp of a
row as a special system column on every table. This system column is
exposed as a decimal, and is computed as `wall time * 10^10 + logical time`.

To accomplish this, this PR adds planning and execution infrastructure
for implicit system columns that can be produced by the execution layer
for a particular row.

Release note (sql change): Expose the MVCC timestamp of each row as a
system column on tables. This column is named
`crdb_internal_mvcc_timestamp` and is accessible only in a limited set
of contexts.
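To make the timestamp encoding in this commit message concrete, here is a small illustrative Go sketch that computes the quantity `wall time * 10^10 + logical time` for an `hlc.Timestamp`. It is not the production conversion code, and the shipped column's exact decimal representation may differ in scale.

package main

import (
	"fmt"
	"math/big"

	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// toMVCCDecimal computes wall time (in nanoseconds) * 10^10 + logical,
// the quantity described in the commit message. Illustrative only.
func toMVCCDecimal(ts hlc.Timestamp) *big.Int {
	d := big.NewInt(ts.WallTime)
	d.Mul(d, big.NewInt(1e10))
	return d.Add(d, big.NewInt(int64(ts.Logical)))
}

func main() {
	ts := hlc.Timestamp{WallTime: 1595277777000000000, Logical: 3}
	fmt.Println(toMVCCDecimal(ts)) // prints the wide decimal quantity
}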
rohany added a commit to rohany/cockroach that referenced this issue Jul 16, 2020
rohany added a commit to rohany/cockroach that referenced this issue Jul 16, 2020
rohany added a commit to rohany/cockroach that referenced this issue Jul 17, 2020
rohany added a commit to rohany/cockroach that referenced this issue Jul 20, 2020
craig bot pushed a commit that referenced this issue Jul 21, 2020
51494: sql: expose mvcc timestamps to SQL r=raduberinde a=rohany

Fixes #50102.

This PR adds introspection into the KV layer's concurrency control from
the SQL level. In particular, we now expose the MVCC HLC timestamp of a
row as a special system column on every table. This system column is
exposed as a decimal, and is computed as `wall time * 10^10 + logical time`.

To accomplish this, this PR adds planning and execution infrastructure
for implicit system columns that can be produced by the execution layer
for a particular row.

Release note (sql change): Expose the MVCC timestamp of each row as a
system column on tables.

Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
@craig craig bot closed this as completed in aed4dfa Jul 21, 2020
ajwerner added a commit to ajwerner/cockroach that referenced this issue Aug 4, 2020
Before this commit, we'd return a zero-value MVCC timestamp when reading an
intent from the intent history. This was problematic because elsewhere in the
code we assume that we will always get a non-zero MVCC timestamp when a read
returns a value. This is especially bizarre given that a read of the latest
intent will return its write timestamp.

The semantics here are such that we'll return the timestamp of the MVCC
metadata for the row. I think this is the most reasonable thing to do as
that timestamp ought to reflect the timestamp we return when we return the
current intent and furthermore is the only timestamp we really have around.
We could return the transaction's current read or write timestamp but those
both seem like worse choices.

It's also worth noting that in the case where we need to read a non-zero
value, we don't really care what that value is and the fact that we are
reading this intent itself is somewhat suspect. That being said, I still
think we should make this change in addition to any change made to prevent
the code referenced in cockroachdb#49266 from needing it.

Fixes cockroachdb#49266.
Informs cockroachdb#50102.

Release note: None
ajwerner added a commit to ajwerner/cockroach that referenced this issue Aug 5, 2020
ajwerner added a commit to ajwerner/cockroach that referenced this issue Aug 5, 2020
craig bot pushed a commit that referenced this issue Aug 5, 2020
48842: sql: fix portals after exhausting rows r=yuzefovich a=yuzefovich

Previously, we would erroneously restart the execution from the very
beginning of empty, unclosed portals after they have been fully
exhausted when we should be returning no rows or an error in such
scenarios. This is now fixed by tracking whether a portal is exhausted
or not and intercepting the calls to `execStmt` when the conn executor
state machine is in an open state.

Note that the current solution has known deviations from Postgres:
- when attempting to execute portals of statements that don't return row
sets, on the second and subsequent attempts PG returns an error while we
are silently doing nothing (meaning we do not run the statement at all
and return 0 rows)
- we incorrectly populate "command tag" field of pgwire messages of some
rows-returning statements after the portal suspension (for example,
a suspended UPDATE RETURNING in PG will return the total count of "rows
affected" while we will return the count since the last suspension).

These deviations are deemed acceptable since this commit fixes a much
worse problem - re-executing an exhausted portal (which could be
a mutation meaning, previously, we could have executed a mutation
multiple times).

The reasons for why this commit does not address these deviations are:
- Postgres has a concept of "portal strategy"
(see https://github.com/postgres/postgres/blob/2f9661311b83dc481fc19f6e3bda015392010a40/src/include/utils/portal.h#L89).
- Postgres has a concept of "command" type (these are things like
SELECTs, UPDATEs, INSERTs, etc,
see https://github.com/postgres/postgres/blob/1aac32df89eb19949050f6f27c268122833ad036/src/include/nodes/nodes.h#L672).

CRDB doesn't have these concepts, and without them any attempt to
simulate Postgres results in a very error-prone and brittle code.

Fixes: #48448.

Release note (bug fix): Previously, CockroachDB would erroneously
restart the execution of empty, unclosed portals after they have been
fully exhausted, and this has been fixed.

52098: sql: better distribute distinct processors r=yuzefovich a=yuzefovich

**sql: better distribute distinct processors**

The distinct processors are planned in two stages - first, a "local"
stage is planned on the same nodes as the previous stage, then,
a "global" stage is added. Previously, the global stage would be planned
on the gateway, and this commit changes that to make it distributed - by
planning "global" distinct processors on the same nodes as the "local"
ones and connecting them via a hash router hashing the distinct columns.

Release note: None

**sql: implement ConstructDistinct in the new factory**

Addresses: #47473.

Release note: None

52358: engine: set the MVCC timestamp on reads due to historical intents r=ajwerner a=ajwerner

Before this commit, we'd return a zero-value MVCC timestamp when reading an
intent from the intent history. This was problematic because elsewhere in the
code we assume that we will always get a non-zero MVCC timestamp when a read
returns a value. This is especially bizarre given that a read of the latest
intent will return its write timestamp.

The semantics here are such that we'll return the timestamp of the MVCC
metadata for the row. I think this is the most reasonable thing to do as
that timestamp ought to reflect the timestamp we return when we return the
current intent and furthermore is the only timestamp we really have around.
We could return the transaction's current read or write timestamp but those
both seem like worse choices.

It's also worth noting that in the case where we need to read a non-zero
value, we don't really care what that value is and the fact that we are
reading this intent itself is somewhat suspect. That being said, I still
think we should make this change in addition to any change made to prevent
the code referenced in #49266 from needing it.

Fixes #49266.
Informs #50102.

Release note: None

52384: sql: properly reset extraTxnState in COPY r=ajwerner a=ajwerner

Apparently we support some sort of COPY protocol that I know nothing about.
We allow operations in that protocol in the open state and in the noTxn state
in the connExecutor. In the noTxn state we let the `copyMachine` handle its
transaction lifecycle all on its own. In that case, we also have the
`connExecutor` in a fresh state when calling `execCopyIn()`. During the
execution of `COPY`, it seems like sometimes we can pick up table descriptor
leases. In the noTxn case we'd like to drop those leases and generally leave
the executor in the fresh state in which it was handed to us. To deal with
that, we call `resetExtraTxnState` before returning in the happy noTxn case.

Fixes #52065.

Release note (bug fix): Fixed a bug when using the COPY protocol which could
prevent schema changes for up to 5 minutes.

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 29, 2020
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 29, 2020
@andreimatei
Contributor

For posterity, the hidden column that was added here is crdb_internal_mvcc_timestamp.
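Tying this back to the original motivation of detecting changed rows, here is a hedged Go sketch of a client-side poller built on this column. The table `t`, the helper name, and the polling approach are illustrative only; comparing the decimal values client-side avoids depending on which SQL contexts the column can be used in.

package mvccpoll

import (
	"database/sql"
	"log"
	"math/big"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

// pollChanges scans table t, reports rows whose MVCC timestamp is newer
// than the caller's high-water mark, and returns the new mark. The decimal
// encoding is monotonic in (wall time, logical), so comparing the raw
// decimal values is enough; no HLC decoding is required.
func pollChanges(db *sql.DB, lastSeen *big.Rat) (*big.Rat, error) {
	rows, err := db.Query("SELECT k, crdb_internal_mvcc_timestamp FROM t")
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	max := new(big.Rat).Set(lastSeen)
	for rows.Next() {
		var k int
		var tsStr string
		if err := rows.Scan(&k, &tsStr); err != nil {
			return nil, err
		}
		ts, ok := new(big.Rat).SetString(tsStr)
		if !ok {
			continue // skip values we cannot parse
		}
		if ts.Cmp(lastSeen) > 0 {
			log.Printf("row k=%d changed at MVCC timestamp %s", k, tsStr)
		}
		if ts.Cmp(max) > 0 {
			max.Set(ts)
		}
	}
	return max, rows.Err()
}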
