Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Virtual schema docs #8909

Merged
merged 7 commits into from
Mar 24, 2021
Merged

Virtual schema docs #8909

merged 7 commits into from
Mar 24, 2021

Conversation

ericharmeling
Copy link
Contributor

@ericharmeling ericharmeling commented Nov 12, 2020

Fixes #2957.
Fixes #8706.
Fixes #9569.
Fixes #9578.

Update of #5454.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@ericharmeling
Copy link
Contributor Author

@glennfawcett FYI

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before I go into the details of the contents, there are three high-level concerns here:

  1. I am surprised that pg_catalog does not get documented. We are committed to supporting pg_catalog, and in fact it gives access to more useful data than info_schema / crdb_internal. Moreover, it happens to help more with applications already developed for postgres. What is the plan about that?

  2. there's been recent development, which I hope @piyush-singh can help clarify: that we do not wish to promote crdb_internal as a supported SQL API for use by end-users. This means that documentation, if any (we can discuss if it should be present at all) should disclaim loudly that all this information does not come with any guarantee of stability or preservation across versions. Hence the following questions:

    • is it desirable to have crdb_internal documented here at all?
    • if yes, what is the best way to properly communicate that this is provided with many strings attached?
  3. each of the crdb_internal vtables come with a CPU, memory and latency cost associated, which in some cases can impact cluster stability or performance upon use. This must absolutely be called out if we decide to document this virtual schema. (the cost functions are displayed in the table comments)

Reviewed 10 of 10 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling)

Copy link
Contributor Author

@ericharmeling ericharmeling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing these concerns forward, @knz !

Some quick responses to your points:

  1. I am more than happy to add docs for pg_catalog to this PR.

  2. I think product should make the call on whether or not we want to encourage our end users to use crdb_internal. If we want to discourage/not encourage its use, then we definitely shouldn't document the schemas contents in detail.

  3. Okay! If we do choose to document the schema, I can port all of the information from those table comments (how fortuitous) into the document.

Some additional thoughts:

Throughout our docs, we use crdb_internal tables in examples that require certain information about a cluster that is not exposed by SQL statements or other virtual schema tables. There might have been some new statements introduced over the past few releases that make these crdb_internal tables less useful.

If we want to not document crdb_internal, then I can open an issue to remove (or replace, where possible) references to crdb_internal tables. We can then limit its documentation to brief mentions in the existing Name Resolution page (and possibly the proposed Virtual Schemas overview page).

If we do want to document crdb_internal, then I can modify the documentation introduced in this PR to have no stability distinctions among tables in crdb_internal. I can also add a warning to the top of the page calling out the lack of stability guarantees. And also additional documentation (a column in the table of tables, and possibly a short paragraph) concerning the costs associated with using crdb_internal tables.

I'll wait on the final word from @piyush-singh before moving forward on this.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@ericharmeling
Copy link
Contributor Author

@piyush-singh Ping on whether or not we want to document crdb_internal.

@ericharmeling
Copy link
Contributor Author

@knz

  • I added a page on pg_catalog. I'm not sure how much more information we want to provide on that schema.
  • I updated the page on crdb_internal to include a warning re: stability and performance, and less detailed information about the tables in the schema.

@piyush-singh
Ping on the questions in the comment above re: crdb_internal docs.

Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throughout our docs, we use crdb_internal tables in examples that require certain information about a cluster that is not exposed by SQL statements or other virtual schema tables.

This is a strong signal that something is amiss in our product management. If there was a user need to see some information exposed, the proper "next action" was to ensure we created some SHOW statement or did some research to see if pg_catalog could be used. Adding documentation on how to use crdb_internal was not the right approach.

I would recommend some investigation by PM about this. @piyush-singh what do you think?

There might have been some new statements introduced over the past few releases that make these crdb_internal tables less useful.

Yes that might be the case too. But then if you have not been informed of this, that is perhaps another defect in our process as we should have had github/jira tickets to address these missing features and you would have been informed via release notes.

I added a page on pg_catalog. I'm not sure how much more information we want to provide on that schema.

I'd say we want to document everything we support, and highlight in particular those columns which exist in our schema but which we have not yet populated (i.e. they contain NULL values), because they may be sources of incompatibilities with postgres.

We also want to remind users to talk to us if they find they would like to use pg_catalog but some aspect of it is still missing.

Reviewed 1 of 3 files at r2, 8 of 8 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling)


v20.1/crdb-internal.md, line 15 at r3 (raw file):

{{site.data.alerts.callout_danger}}
We do not recommend using `crdb_internal` tables in production environments for the following reasons:
- The contents of `crdb_internal` schema are unstable, and subject to change in new releases of CockroachDB.

to change without prior notice


v20.1/crdb-internal.md, line 16 at r3 (raw file):

We do not recommend using `crdb_internal` tables in production environments for the following reasons:
- The contents of `crdb_internal` schema are unstable, and subject to change in new releases of CockroachDB.
- There are memory and latency costs associated with each table in `crdb_internal`. Accessing the tables in the schema can impact cluster stability and performance.

This particular point is also generally true of SHOW statements and other features. I would rather see callouts about memory usage / performance for each individual feature that incurs them.

@jseldess
Copy link
Contributor

jseldess commented Mar 1, 2021

@ericharmeling, what's the current state of this PR?

@ericharmeling
Copy link
Contributor Author

@jseldess

what's the current state of this PR?

TL;DR:

It has been on hold, pending further direction/prioritization from product management. I just synced with @vy-ton and now have a clearer direction for this work, as a result of this 21.1 product roadmap item.

More detailed response:

I originally started working on virtual schema docs in 2019, in response to several requests for crdb_internal documentation. That work was not a priority, so I closed my first PR on virtual schema docs (#5454). I then received several more requests for crdb_internal docs, and so I opened a new PR for virtual schema docs (i.e., this PR).

Despite the longstanding requests for crdb_internal docs, whether we want to document crdb_internal at all is still in question. @knz gave some legitimate reasons why we would not want to expose crdb_internal virtual schema in docs. In summary: crdb_internal is an interface to internals of the product, its tables are likely to change across releases, and there are better virtual schemas to which to point users (e.g., pg_catalog). So I am thinking that we do not want to document crdb_internal in detail.

On the topic of virtual schemas, we definitely do want to document pg_catalog. It seems to be the most important virtual schema for end users, as it provides useful information (as noted above by @knz), its tables correspond to the system catalogs in PostgreSQL, and improvements to the table are part of the 21.1 product roadmap (see https://cockroachlabs.atlassian.net/browse/CRDB-1523). This PR includes a page on pg_catalog, but it could use more detail. Note that in 21.1, we are adding tables to the virtual schema that will be empty. Users should know that these empty tables are included for PG compatibility, but remain unimplemented.

So the plan moving forward is:

  1. Remove the crdb_internal page from this PR.
  2. Add more detail to the pg_catalog page about exactly which tables we support, and which ones we do not support.
  3. Add a note to the 21.1 pg_catalog page about the empty tables.

@jseldess
Copy link
Contributor

jseldess commented Mar 1, 2021

Thanks for this update, @ericharmeling!

@ericharmeling
Copy link
Contributor Author

Updates to v21.1/pg-catalog.md are pending the finalization of the new system catalog tables added to pg_catalog in 21.1.

See here for a tracking doc (still working on this doc, so let me know if you want to add anything/change anything before you do): https://docs.google.com/spreadsheets/d/1azfgQJQJriUii1dd8_RwhgHBqhlrwkjao1rISR8YAOc/edit#gid=1321389238

@ericharmeling
Copy link
Contributor Author

There will also be subsequent PRs to update v21.1/information-schema.md. I'm probably going to wait until all work on information_schema has been finalized before starting on that.

@ericharmeling
Copy link
Contributor Author

See cockroachdb/cockroach#61750.

@ericharmeling
Copy link
Contributor Author

@RichardJCai

Adding you to review v21.1/pg-catalog.md.

@ericharmeling
Copy link
Contributor Author

@jordanlewis @vy-ton

Adding both of you for a general review, following our conversation today (3/15/21).

@RichardJCai
Copy link

@RichardJCai Is information_schema what powers SHOW commands?

crdb_internal.create_statements powers the SHOW commands for tables.

-----------------------------|--------------
`pg_aggregate` | `pg_aggregate`
`pg_am` | `pg_am`
`pg_amop` | `pg_amop` (unimplemented)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but piggybacking on Vy's comment, I think we should clarify that unimplemented means that the table exists but the rows are not populated.

Copy link
Contributor

@vy-ton vy-ton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling, @jordanlewis, and @knz)


v21.1/information-schema.md, line 13 at r5 (raw file):

Previously, vy-ton (Vy Ton) wrote…

@RichardJCai Is information_schema what powers SHOW commands?

Based on Richard's answer, I was confused why this table is nested under information_schema, since I read it as information_schema tables are what backs the SHOW commands`.

I think it's worth putting the SHOW table in its own top-level page (we have individual pages for each SHOW command but no over) or stick it under the top Virtual schemas page

Copy link
Contributor Author

@ericharmeling ericharmeling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR @vy-ton and @RichardJCai !

I think I've addressed all of your comments.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling, @jordanlewis, @knz, @RichardJCai, and @vy-ton)


_includes/sidebar-data-v20.2.json, line 2422 at r5 (raw file):

Previously, vy-ton (Vy Ton) wrote…

2 suggestions:

  • Put this closer to Name Resolution as it's the closest relative
  • We should get Engineering input but I've been referring to this as System Catalog. Developers might not understand "virtual".

I like both of those suggestions.

  1. Done

  2. I updated the nav title, the title of the overview page ("Virtual Schemas" -> "System Catalogs"), the name of the file (virtual-schemas.md-> system-catalogs.md), and most of the instances of "virtual schema" to "system catalog" (where it made sense).


v20.1/crdb-internal.md, line 15 at r3 (raw file):

Previously, knz (kena) wrote…

to change without prior notice

Done.


v20.2/information-schema.md, line 9 at r5 (raw file):

Previously, vy-ton (Vy Ton) wrote…

can therefore be relied on to remain stable over time

We're not quite here yet with information_schema. I'm looking into this so will have more details soon

I removed this entire paragraph.


v20.2/pg-catalog.md, line 7 at r5 (raw file):

Previously, vy-ton (Vy Ton) wrote…

They correspond to both system catalog and system views, maybe use https://www.postgresql.org/docs/13/catalogs.html as the link

Done.


v20.2/pg-catalog.md, line 13 at r5 (raw file):

Previously, vy-ton (Vy Ton) wrote…

Looking into the None tables here, since I expected to be able to query empty pg_catalog tables

See response below.


v20.2/pg-catalog.md, line 13 at r5 (raw file):

Previously, vy-ton (Vy Ton) wrote…

We should clarify what None here means. We have added tables like pg_amop as empty tables. So they are unimplemented but they do output in SHOW TABLES and you can query the empty table

This is a list of the pg_catalog tables in the official 20.2.6 release. We should only document what is in the latest production release, for production-release docs (i.e., anything with the URL prefix https://www.cockroachlabs.com/docs/stable).

I just built cockroach from the latest commit on the release-20.2 branch, and see that all of the new, unimplemented tables are in pg_catalog. I'll follow up with an update after 20.2.7 is released (which is scheduled for March 29, but still has a few blockers).

The 21.1 beta (scheduled for Monday 3/22) should include all of the new, unimplemented pg_catalog tables. But IMO, because 21.1 is a testing release, the docs can include features that aren't in an "official release" in any case.


v21.1/information-schema.md, line 13 at r5 (raw file):

Is information_schema what powers SHOW commands?

crdb_internal is what powers SHOW commands.

I was confused why this table is nested under information_schema...

Which table?

I think it's worth putting the SHOW table in its own top-level page (we have individual pages for each SHOW command but no over)

SHOW TABLES is a SQL statement, with its own page. Unless you are talking about an actual virtual table? I don't see a table called "SHOW". Or are you talking about crdb_internal.create_statements?

or stick it under the top Virtual schemas page

I'd prefer to keep just the system catalog schemas at the top level of the Virtual Schemas page (now called the "System Catalogs" page)


v21.1/pg-catalog.md, line 13 at r5 (raw file):

Previously, vy-ton (Vy Ton) wrote…

How are you determining the list for the first column? We might be falling into a trap of duplicating the list of Postgres pg_catalog. Instead of just documenting the ones we do implement.

The list for the first column is exactly what is listed on this page (which includes all of the tables in pg_catalog, as well as a handful of system views): https://www.postgresql.org/docs/13/catalogs.html

The list for the second column is what is returned when I run SHOW TABLES FROM pg_catalog from a 3/17 build on the `release-21.1 branch.

The "None" values in the second columns indicate that a table/view in the postgres system catalog (https://www.postgresql.org/docs/13/catalogs.html) is missing from CRDB's pg_catalog schema in that 3/17 21.1 build.

The "(unimplemented)" indicates that a table is empty (I've changed "unimplemented"->"empty"). These are tables that are not in pg_catalog in 20.2.0-20.2.6, but are in pg_catalog in 21.1.beta-1 and 20.2.7. I assume that all of the tables recently added are empty. @RichardJCai Please correct me if I am wrong.


v21.1/pg-catalog.md, line 17 at r5 (raw file):

Previously, RichardJCai (Richard Cai) wrote…

LGTM but piggybacking on Vy's comment, I think we should clarify that unimplemented means that the table exists but the rows are not populated.

I changed "unimplemented" to "empty". "unimplemented" suggests that that will at one point be implemented. "empty" is a little clearer (i.e., they have no contents).

@RichardJCai
Copy link

I changed "unimplemented" to "empty". "unimplemented" suggests that that will at one point be implemented. "empty" is a little clearer (i.e., they have no contents).

SGTM!

@ericharmeling
Copy link
Contributor Author

TFTR, @ianjevans !

@vy-ton

I'm going to go ahead and merge this, so we have the pages live. If we need to make any updates after it's been published, it should be pretty easy to do so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants