sql: expose MVCC timestamps to SQL #50102

Closed
ajwerner opened this issue Jun 11, 2020 · 11 comments · Fixed by #51494
Labels
A-sql-execution Relating to SQL execution. A-sql-optimizer SQL logical planning and optimizations. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@ajwerner
Contributor

Is your feature request related to a problem? Please describe.

Cockroach associates each row in its KV store with an HLC timestamp for the purposes of MVCC. Currently this information is not accessible to client applications. Making it accessible would be valuable because it would allow client applications to detect when data has changed; this has been a customer request on more than one occasion.

It also loosely relates to #39275, as using these timestamps might make detecting differences between rows easier. It also seems like this task would be something of a prerequisite for implementing #19749.

Describe the solution you'd like

I'd like to see cockroach expose a system column that a SQL query can select to retrieve the MVCC timestamp of a row. Postgres offers such system columns (see https://www.postgresql.org/docs/current/ddl-system-columns.html) that provide introspection into its concurrency control.

One wrinkle in all of this is that different column families in a row can have different MVCC timestamps. Perhaps the name of the column will need to somehow refer to the name of the column family. Perhaps then we'd need some way to refer to the upper bound of all of the column families.

Another related problem has to do with retrieving this column from indexes. Only in some cases will we know that the write timestamp on a secondary index is the same as the timestamp on the primary index. Figuring out when the timestamp read off of the secondary index can be used vs. when we need to go back and consult the primary index may also be something of a complex problem. I suspect for many cases it's not a big deal -- the need to do an index join back on the primary index when using this feature would likely be acceptable for v0.
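To make the desired client experience concrete, here is a minimal Go sketch of what querying such a system column could look like, using the column name that ultimately shipped (`crdb_internal_mvcc_timestamp`, per the release note and closing comment later in this thread). The connection string, table `t`, and column `k` are placeholders.

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	// Placeholder connection string; adjust for a cluster that has the feature.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The system column is hidden, so it only appears when selected explicitly.
	rows, err := db.Query("SELECT k, crdb_internal_mvcc_timestamp FROM t")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var k int
		var ts string // decimal-encoded HLC timestamp
		if err := rows.Scan(&k, &ts); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("k=%d mvcc_timestamp=%s\n", k, ts)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}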

Describe alternatives you've considered

Not much. Perhaps some sort of auto-updating column that is updated on every write to the row? Triggers could relieve the need for this in some cases, but that's not ideal.

Additional context

The changes required for this should be pretty isolated. The row fetcher already retrieves the MVCC timestamps in every get and scan request. It's a matter of telling the row fetcher to decode and use them, and of planning the retrieval of these columns. This feels squarely like a SQL features project and might even make for a good intern project.

The remaining complication, as far as I'm concerned, is that today schema change backfills will update rows and thus their MVCC timestamps. The solution to this will be to construct new indexes and swap over to them (#47989). That issue doesn't necessarily need to precede the work to expose MVCC columns, but it has a major bearing on their semantics. I suspect the work should proceed in parallel.

@ajwerner ajwerner added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-sql-optimizer SQL logical planning and optimizations. A-sql-execution Relating to SQL execution. labels Jun 11, 2020
@ajwerner
Contributor Author

Another proposal is to expose the MVCC timestamp as a builtin. It's not clear exactly what the signature of such a builtin would be, but it seems conceivable that we could invent one. I have a harder time envisioning how such a builtin would get planned or implemented.

@rohany
Contributor

rohany commented Jun 18, 2020

One wrinkle in all of this is that different column families in a row can have different MVCC timestamps. Perhaps the name of the column will need to somehow refer to the name of the column family. Perhaps then we'd need some way to refer to the upper bound of all of the column families.

We already "collect" information about each KV read when buffering data for a row. It wouldn't be hard to keep a running max MVCC timestamp that is reset every time a row is emitted from the fetcher.
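A minimal sketch of that idea, assuming the `Forward` helper on `hlc.Timestamp` from `pkg/util/hlc`; the tracker type is hypothetical and not the actual `row.Fetcher` code.

package fetcherexample

import "github.com/cockroachdb/cockroach/pkg/util/hlc"

// rowTSTracker keeps a running maximum of the MVCC timestamps observed
// across the KVs (column families) that make up a single row.
// Hypothetical helper for illustration only.
type rowTSTracker struct {
	maxTS hlc.Timestamp
}

// observe folds in the timestamp of one KV belonging to the current row.
func (t *rowTSTracker) observe(ts hlc.Timestamp) {
	t.maxTS.Forward(ts) // advances maxTS only if ts is larger
}

// finishRow returns the max timestamp for the row and resets the tracker
// so it can be reused for the next row emitted by the fetcher.
func (t *rowTSTracker) finishRow() hlc.Timestamp {
	ts := t.maxTS
	t.maxTS = hlc.Timestamp{}
	return ts
}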

the need to do an index join back on the primary index when using this feature would likely be acceptable for v0.

This seems reasonable.

I might hack on this and see how possible it is. I'll update the thread with results if I get anywhere.

@rohany
Contributor

rohany commented Jun 18, 2020

Turns out the fetcher-level implementation of this is really easy -- it already tracks timestamps and buffers them across families for each row:

--- a/pkg/sql/row/kv_fetcher.go
+++ b/pkg/sql/row/kv_fetcher.go
@@ -17,6 +17,7 @@ import (
        "github.com/cockroachdb/cockroach/pkg/roachpb"
        "github.com/cockroachdb/cockroach/pkg/sql/sqlbase"
        "github.com/cockroachdb/cockroach/pkg/storage/enginepb"
+       "github.com/cockroachdb/cockroach/pkg/util/hlc"
        "github.com/cockroachdb/errors"
 )

@@ -73,14 +74,16 @@ func (f *KVFetcher) NextKV(
                        var key []byte
                        var rawBytes []byte
                        var err error
-                       key, rawBytes, f.batchResponse, err = enginepb.ScanDecodeKeyValueNoTS(f.batchResponse)
+                       var ts hlc.Timestamp
+                       key, ts, rawBytes, f.batchResponse, err = enginepb.ScanDecodeKeyValue(f.batchResponse)
                        if err != nil {
                                return false, kv, false, err
                        }
                        return true, roachpb.KeyValue{
                                Key: key,
                                Value: roachpb.Value{
-                                       RawBytes: rawBytes,
+                                       RawBytes:  rawBytes,
+                                       Timestamp: ts,
                                },
                        }, newSpan, nil
                }

The next step is just planning -- we need to advertise this timestamp column in the catalog, ensure that only the primary index contains it (which forces an index join), and then tell the fetcher in what column slot to place the timestamp. @RaduBerinde do you think you could give me a pointer on where to look to do this in the optimizer, and whether this is something that would be easily done?

@ajwerner
Contributor Author

One question is whether decoding the timestamp has some cost here (I anticipate it does), and thus whether we want to only decode it if we need it.

@rohany
Contributor

rohany commented Jun 19, 2020

There is going to be some small amount of overhead, but it looks pretty easy to only decode the timestamp when we need it.
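Following the shape of the diff above, a hedged sketch of decoding the timestamp only when it is needed; `needTS` stands in for whatever flag the planner would set, and the helper itself is hypothetical (the real change would live in `(*KVFetcher).NextKV`).

package fetcherexample

import (
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/storage/enginepb"
	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// decodeNextKV decodes one KV from a batch response, paying for timestamp
// decoding only when the caller asked for it.
func decodeNextKV(batch []byte, needTS bool) (roachpb.KeyValue, []byte, error) {
	var key, rawBytes, rest []byte
	var ts hlc.Timestamp
	var err error
	if needTS {
		key, ts, rawBytes, rest, err = enginepb.ScanDecodeKeyValue(batch)
	} else {
		key, rawBytes, rest, err = enginepb.ScanDecodeKeyValueNoTS(batch)
	}
	if err != nil {
		return roachpb.KeyValue{}, nil, err
	}
	return roachpb.KeyValue{
		Key: key,
		Value: roachpb.Value{
			RawBytes:  rawBytes,
			Timestamp: ts, // left as the zero timestamp when needTS is false
		},
	}, rest, nil
}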

@rohany
Contributor

rohany commented Jun 22, 2020

cc @RaduBerinde, did you get a chance to look at this? It seems like this is a pretty highly desired feature among customers, so we might look at changing its priority.

@RaduBerinde
Member

For the optimizer, it should be as easy as changing the catalog (sql/opt_catalog.go) to advertise an extra, hidden column and put it in the primary index. We'd want a similar change in the test catalog (sql/opt/testutils/testcat/test_catalog.go) to be able to have related opt tests.

@rohany
Contributor

rohany commented Jun 23, 2020

Hmm, it doesn't seem to be as easy as just changing the optTable. Query planning outside of the optimizer doesn't use the catalog interfaces and instead operates directly on descriptors, meaning that this extra column has to escape the optTable and live on the ImmutableTableDescriptor (and on the columns in the TableDescriptor inside it). There's also the question of what ID to assign to this new column.

@RaduBerinde
Member

It's up to the ConstructScan opt_factory code to create whatever metadata is needed to scan this column. E.g., it could be converted to a special flag in the scanNode, or whatever.

rohany added a commit to rohany/cockroach that referenced this issue Jun 29, 2020
Fixes cockroachdb#50102.

This PR adds introspection into the KV layer's concurrency control from
the SQL level. In particular, we now expose the MVCC HLC timestamp of a
row as a special system column on every table. This system column is
exposed as a decimal, and is computed as `wall time * 10^10 + logical time`.

To accomplish this, this PR adds planning and execution infrastructure
for implicit system columns that can be produced by the execution layer
for a particular row.

Release note (sql change): Expose the MVCC timestamp of each row as a
system column on tables.
rohany added a commit to rohany/cockroach that referenced this issue Jul 16, 2020
Fixes cockroachdb#50102.

This PR adds introspection into the KV layer's concurrency control from
the SQL level. In particular, we now expose the MVCC HLC timestamp of a
row as a special system column on every table. This system column is
exposed as a decimal, and is computed as `wall time * 10^10 + logical time`.

To accomplish this, this PR adds planning and execution infrastructure
for implicit system columns that can be produced by the execution layer
for a particular row.

Release note (sql change): Expose the MVCC timestamp of each row as a
system column on tables. This column is named
`crdb_internal_mvcc_timestamp` and is accessible only in a limited set
of contexts.
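To make the timestamp encoding in this commit message concrete, here is a small illustrative Go sketch that computes the quantity `wall time * 10^10 + logical time` for an `hlc.Timestamp`. It is not the production conversion code, and the shipped column's exact decimal representation may differ in scale.

package main

import (
	"fmt"
	"math/big"

	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// toMVCCDecimal computes wall time (in nanoseconds) * 10^10 + logical,
// the quantity described in the commit message. Illustrative only.
func toMVCCDecimal(ts hlc.Timestamp) *big.Int {
	d := big.NewInt(ts.WallTime)
	d.Mul(d, big.NewInt(1e10))
	return d.Add(d, big.NewInt(int64(ts.Logical)))
}

func main() {
	ts := hlc.Timestamp{WallTime: 1595277777000000000, Logical: 3}
	fmt.Println(toMVCCDecimal(ts)) // prints the wide decimal quantity
}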
rohany added a commit to rohany/cockroach that referenced this issue Jul 16, 2020
rohany added a commit to rohany/cockroach that referenced this issue Jul 16, 2020
rohany added a commit to rohany/cockroach that referenced this issue Jul 17, 2020
rohany added a commit to rohany/cockroach that referenced this issue Jul 20, 2020
craig bot pushed a commit that referenced this issue Jul 21, 2020
51494: sql: expose mvcc timestamps to SQL r=raduberinde a=rohany

Fixes #50102.

This PR adds introspection into the KV layer's concurrency control from
the SQL level. In particular, we now expose the MVCC HLC timestamp of a
row as a special system column on every table. This system column is
exposed as a decimal, and is computed as `wall time * 10^10 + logical time`.

To accomplish this, this PR adds planning and execution infrastructure
for implicit system columns that can be produced by the execution layer
for a particular row.

Release note (sql change): Expose the MVCC timestamp of each row as a
system column on tables.

Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
@craig craig bot closed this as completed in aed4dfa Jul 21, 2020
ajwerner added a commit to ajwerner/cockroach that referenced this issue Aug 4, 2020
Before this commit, we'd return a zero-value MVCC timestamp when reading an
intent from the intent history. This was problematic because elsewhere in the
code we assume that we will always get a non-zero MVCC timestamp when a read
returns a value. This is especially bizarre given that a read of the latest
intent will return its write timestamp.

The semantics here are such that we'll return the timestamp of the MVCC
metadata for the row. I think this is the most reasonable thing to do as
that timestamp ought to reflect the timestamp we return when we return the
current intent and furthermore is the only timestamp we really have around.
We could return the transaction's current read or write timestamp but those
both seem like worse choices.

It's also worth noting that in the case where we need to read a non-zero
value, we don't really care what that value is and the fact that we are
reading this intent itself is somewhat suspect. That being said, I still
think we should make this change in addition to any change made to prevent
the code referenced in cockroachdb#49266 from needing it.

Fixes cockroachdb#49266.
Informs cockroachdb#50102.

Release note: None
ajwerner added a commit to ajwerner/cockroach that referenced this issue Aug 5, 2020
ajwerner added a commit to ajwerner/cockroach that referenced this issue Aug 5, 2020
craig bot pushed a commit that referenced this issue Aug 5, 2020
48842: sql: fix portals after exhausting rows r=yuzefovich a=yuzefovich

Previously, we would erroneously restart the execution from the very
beginning of empty, unclosed portals after they have been fully
exhausted when we should be returning no rows or an error in such
scenarios. This is now fixed by tracking whether a portal is exhausted
or not and intercepting the calls to `execStmt` when the conn executor
state machine is in an open state.

Note that the current solution has known deviations from Postgres:
- when attempting to execute portals of statements that don't return row
sets, on the second and subsequent attempts PG returns an error while we
are silently doing nothing (meaning we do not run the statement at all
and return 0 rows)
- we incorrectly populate "command tag" field of pgwire messages of some
rows-returning statements after the portal suspension (for example,
a suspended UPDATE RETURNING in PG will return the total count of "rows
affected" while we will return the count since the last suspension).

These deviations are deemed acceptable since this commit fixes a much
worse problem - re-executing an exhausted portal (which could be
a mutation meaning, previously, we could have executed a mutation
multiple times).

The reasons for why this commit does not address these deviations are:
- Postgres has a concept of "portal strategy"
(see https://github.com/postgres/postgres/blob/2f9661311b83dc481fc19f6e3bda015392010a40/src/include/utils/portal.h#L89).
- Postgres has a concept of "command" type (these are things like
SELECTs, UPDATEs, INSERTs, etc,
see https://github.com/postgres/postgres/blob/1aac32df89eb19949050f6f27c268122833ad036/src/include/nodes/nodes.h#L672).

CRDB doesn't have these concepts, and without them any attempt to
simulate Postgres results in a very error-prone and brittle code.

Fixes: #48448.

Release note (bug fix): Previously, CockroachDB would erroneously
restart the execution of empty, unclosed portals after they have been
fully exhausted, and this has been fixed.

52098: sql: better distribute distinct processors r=yuzefovich a=yuzefovich

**sql: better distribute distinct processors**

The distinct processors are planned in two stages - first, a "local"
stage is planned on the same nodes as the previous stage, then,
a "global" stage is added. Previously, the global stage would be planned
on the gateway, and this commit changes that to make it distributed - by
planning "global" distinct processors on the same nodes as the "local"
ones and connecting them via a hash router hashing the distinct columns.

Release note: None

**sql: implement ConstructDistinct in the new factory**

Addresses: #47473.

Release note: None

52358: engine: set the MVCC timestamp on reads due to historical intents r=ajwerner a=ajwerner

Before this commit, we'd return a zero-value MVCC timestamp when reading an
intent from the intent history. This was problematic because elsewhere in the
code we assume that we will always get a non-zero MVCC timestamp when a read
returns a value. This is especially bizarre given that a read of the latest
intent will return its write timestamp.

The semantics here are such that we'll return the timestamp of the MVCC
metadata for the row. I think this is the most reasonable thing to do as
that timestamp ought to reflect the timestamp we return when we return the
current intent and furthermore is the only timestamp we really have around.
We could return the transaction's current read or write timestamp but those
both seem like worse choices.

It's also worth noting that in the case where we need to read a non-zero
value, we don't really care what that value is and the fact that we are
reading this intent itself is somewhat suspect. That being said, I still
think we should make this change in addition to any change made to prevent
the code referenced in #49266 from needing it.

Fixes #49266.
Informs #50102.

Release note: None

52384: sql: properly reset extraTxnState in COPY r=ajwerner a=ajwerner

Apparently we support some sort of COPY protocol that I know nothing about.
We allow operations in that protocol in the open state and in the noTxn state
in the connExecutor. In the noTxn state we let the `copyMachine` handle its
transaction lifecycle all on its own. In that case, we also have the
`connExecutor` in a fresh state when calling `execCopyIn()`. During the
execution of `COPY`, it seems like sometimes we can pick up table descriptor
leases. In the noTxn case we'd like to drop those leases and generally leave
the executor in the fresh state in which it was handed to us. To deal with
that, we call `resetExtraTxnState` before returning in the happy noTxn case.

Fixes #52065.

Release note (bug fix): Fixed a bug when using the COPY protocol which could
prevent schema changes for up to 5 minutes.

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 29, 2020
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 29, 2020
@andreimatei
Contributor

For posterity, the hidden column that was added here is crdb_internal_mvcc_timestamp.
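Tying this back to the original motivation of detecting changed rows, here is a hedged Go sketch of a client-side poller built on this column. The table `t`, the helper name, and the polling approach are illustrative only; comparing the decimal values client-side avoids depending on which SQL contexts the column can be used in.

package mvccpoll

import (
	"database/sql"
	"log"
	"math/big"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

// pollChanges scans table t, reports rows whose MVCC timestamp is newer
// than the caller's high-water mark, and returns the new mark. The decimal
// encoding is monotonic in (wall time, logical), so comparing the raw
// decimal values is enough; no HLC decoding is required.
func pollChanges(db *sql.DB, lastSeen *big.Rat) (*big.Rat, error) {
	rows, err := db.Query("SELECT k, crdb_internal_mvcc_timestamp FROM t")
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	max := new(big.Rat).Set(lastSeen)
	for rows.Next() {
		var k int
		var tsStr string
		if err := rows.Scan(&k, &tsStr); err != nil {
			return nil, err
		}
		ts, ok := new(big.Rat).SetString(tsStr)
		if !ok {
			continue // skip values we cannot parse
		}
		if ts.Cmp(lastSeen) > 0 {
			log.Printf("row k=%d changed at MVCC timestamp %s", k, tsStr)
		}
		if ts.Cmp(max) > 0 {
			max.Set(ts)
		}
	}
	return max, rows.Err()
}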
