Add docs on indexing of REGIONAL BY ROW tables #10280

rmloveland · 2021-04-07T18:35:10Z

Fixes #9273.
Fixes #9274.
Fixes #9740.
Fixes #9743.
Fixes #9885.
Fixes #10136.

Relates to #9739.

Summary of changes:

Update 'ADD CONSTRAINT' docs to show how to add unique indexes to
REGIONAL BY ROW tables.
- Also update 'ADD CONSTRAINT' with an example showing the differences
  in behavior between explicitly partitioned vs. implicitly
  partitioned indexes.
Update 'CREATE INDEX' docs to note the existence of automatically
partitioned indexes, and mention that most users should probably use
REGIONAL BY ROW tables and add indexes to them normally, instead of
explicit partitioning.
Update 'Indexes' doc to mention:
- Index partitioning in general, which was not mentioned on this page.
- Automatic index partitioning that comes for free on REGIONAL BY ROW
  tables, which most users should use instead.
Update the 'Multi-Region Overview' page with a new section called
'Additional Features', where we list how indexes work on RBR tables as
one of the new features (this will grow to add mention of other
CockroachDB features that are designed to work with multi-region as we
go)
Update the 'Unique Constraint' page to talk about indexes on RBR
tables, as well as linking to the usage example on the 'ADD
CONSTRAINT' page (described above).

cockroach-teamcity · 2021-04-07T18:35:16Z

This change is

rmloveland · 2021-04-07T19:16:23Z

Hi Becca!

Please ignore the busted build, it's because this PR depends on a link introduced in #10235 - it will turn green once that one is in

I have a few questions I would appreciate your thoughts on in addition to anything else you want to provide feedback on:

Is there anything that is obviously incorrect or misleading?
Does the general emphasis I have added of "hey most people probably want to use this, NOT explicit partitioning" make sense from your POV?
Anything else users should know about this feature that I've just completely missed?

Thank you!

rytaft

Thank you, @rmloveland!

Is there anything that is obviously incorrect or misleading?

I added some comments inline in a few places

Does the general emphasis I have added of "hey most people probably want to use this, NOT explicit partitioning" make sense from your POV?

Yes, but as I mentioned in some inline comments below, I think the message should be "use REGIONAL BY ROW tables, not explicit partitioning".

Anything else users should know about this feature that I've just completely missed?

I think it might be worth changing the name of this feature to "UNIQUE indexes in REGIONAL BY ROW tables". Although implicitly partitioned unique indexes technically exist, I believe that similar to UNIQUE WITHOUT INDEX, we've decided not to expose the feature to users in 21.1 (@otan, can you confirm?). You can enable it with SET experimental_enable_implicit_column_partitioning = true. Implicitly partitioned unique indexes are used to support UNIQUE indexes in REGIONAL BY ROW tables, but that's an implementation detail that users don't need to know about.

Instead, you should probably say that all indexes, including UNIQUE indexes, are partitioned on crdb_region in REGIONAL BY ROW tables. These indexes can either include the partitioning key (crdb_region) as the first column in the index definition, or exclude it. If it is included, a UNIQUE index will enforce uniqueness on the set of columns, just like it would in a non-partitioned table. But if crdb_region is excluded from the index definition, that serves as a signal that the database should enforce uniqueness on just the columns in the index definition. The index alone cannot enforce uniqueness on columns that are not a prefix of the index columns, so any time rows are inserted or updated in a REGIONAL BY ROW table that has an implicitly partitioned UNIQUE index, the optimizer must add uniqueness checks. Maybe the docs could show the EXPLAIN output of an INSERT or UPDATE statement to show what these checks look like.

I'm not sure if you want to go into detail about this, but the optimizer will also try to remove those checks whenever it is safe to do so. The checks do add latency overhead since they need to visit every region to be sure that the new value doesn't already exist in any other region. There are some specific scenarios where checks can be removed that I can describe if you'd like.

Despite the overhead, the advantage of these checks is:

They enforce uniqueness constraints that the user has specified, maintaining consistency in the database
They allow the optimizer to take advantage of the guaranteed uniqueness of these columns to perform "locality optimized search" when searching for a row with a fixed value for the unique columns. By using this optimization, the execution engine can avoid visiting remote nodes if the row is found locally. (See inline comments for an example.)

~

Ok sorry that was a lot! Also happy to schedule some time to chat about this if it would be helpful.

Reviewed 5 of 5 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rmloveland)

v21.1/add-constraint.md, line 235 at r1 (raw file):

to ensure that you get the best performance possible on uniqueness checks in multi-region deployments.

This seems a bit misleading since it's not really about performance; it's about correctness. If you want a partitioned UNIQUE index where the partitioning column is not part of the UNIQUE constraint, this is currently the only way to do it. The optimizer adds uniqueness checks to ensure correctness, and will try to choose the most performant plan possible for those checks. But the fact that there are checks at all means that by definition it will be less performant than an explicitly partitioned or non-partitioned UNIQUE index.

v21.1/add-constraint.md, line 278 at r1 (raw file):

uniqueness checks happen in the region where the user is located, which will lead to reduced query latency.

This isn't correct. The uniqueness checks, which are added when a row is inserted or the email column of an existing row is updated, will need to query every region. The performance benefit happens for queries that read users and select a single email address (e.g., SELECT * FROM users WHERE email = 'anemailaddress@gmail.com'). If 'anemailaddress@gmail.com' is found in the local region, there is no need to search remote regions.

v21.1/add-constraint.md, line 296 at r1 (raw file):

                |            |                |              |                         |                  |
                |            |                |              |                         |                  |
                |            |                |              |                         |                  |

why all the empty spaces?

v21.1/create-index.md, line 50 at r1 (raw file):

`ASC` or `DESC`| Sort the column in ascending (`ASC`) or descending (`DESC`) order in the index. How columns are sorted affects query results, particularly when using `LIMIT`.<br><br>__Default:__ `ASC`
`STORING ...`| Store (but do not sort) each column whose name you include.<br><br>For information on when to use `STORING`, see  [Store Columns](#store-columns).  Note that columns that are part of a table's [`PRIMARY KEY`](primary-key.html) cannot be specified as `STORING` columns in secondary indexes on the table.<br><br>`COVERING` and `INCLUDE` are aliases for `STORING` and work identically.
`opt_partition_by` | An [enterprise-only](enterprise-licensing.html) option that lets you [define index partitions at the row level](partitioning.html). As of CockroachDB v21.1 and later, most users should use [implicitly partitioned unique indexes](multiregion-overview.html#implicitly-partitioned-unique-indexes) instead.

This partition by clause isn't for only UNIQUE indexes. I would either take this comment out or just say users should use REGIONAL BY ROW tables.

v21.1/indexes.md, line 22 at r1 (raw file):

 To index [spatial data](spatial-data.html), CockroachDB uses *spatial indexes*. For more information about spatial indexes, see [Spatial Indexes](spatial-indexes.html).

To support [multi-region deployments](multiregion-overview.html), indexes can be [partitioned](partitioning.html) based on region.  However, as of CockroachDB v21.1 and later, most users should use [implicitly partitioned unique indexes](multiregion-overview.html#implicitly-partitioned-unique-indexes) instead.

Ditto -- say they should use REGIONAL BY ROW tables

v21.1/multiregion-overview.md, line 173 at r1 (raw file):

When you add a `UNIQUE` constraint to a column in a multi-region database,

This only applies for REGIONAL BY ROW tables

to ensure that you get the best performance possible on uniqueness checks in multi-region deployments.

See my comments above about performance v. correctness

v21.1/unique.md, line 24 at r1 (raw file):

- You can define the `UNIQUE` constraint when [creating a table](#syntax), or you can add it to existing tables through [`ADD CONSTRAINT`](add-constraint.html#add-the-unique-constraint).

- <span class="version-tag">New in v21.1:</span> CockroachDB has support for _implicitly partitioned unique indexes_ in [multi-region databases](multiregion-overview.html).  When you add a `UNIQUE` constraint to a column in a multi-region database, CockroachDB automatically takes care of [partitioning](partitioning.html) the index for you to ensure that you get the best performance possible on uniqueness checks in multi-region deployments.

ditto comments above

otan · 2021-04-09T03:43:19Z

Although implicitly partitioned unique indexes technically exist, I believe that similar to UNIQUE WITHOUT INDEX, we've decided not to expose the feature to users in 21.1 (@otan, can you confirm?).

Yeah I think it's better not to publicise it.

rmloveland

Thanks for all of this information Becca! Sorry for the slow response, I think I opened too many PRs at once last week.

I have changed things around a bit to focus more on the "don't do explicit partitioning, instead use indexes on REGIONAL BY ROW tables" angle, as you said to

Also followed your guidance to move the focus away from UNIQUE indexes to all indexes on REGIONAL BY ROW tables, and noted they are automagically partitioned on crdb_region (with a link to some more details on that column)

w.r.t. info re: including/excluding crdb_region info, I am thinking of adding a note about the exclude case to the "Add a unique index to a REGIONAL BY ROW table" example in add-constraint.md, since we already discuss when the region column is implicitly included, like "normal" tables - but the use case for excluding it is not as clear to me. Do you mind explaining a use case for doing that? (Sorry if it's implied by what you wrote, I am not sure I understood that section)

Re: your other comments below, I hope they are mostly addressed by the latest edits, but PTAL.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rytaft)

v21.1/add-constraint.md, line 235 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

to ensure that you get the best performance possible on uniqueness checks in multi-region deployments.
This seems a bit misleading since it's not really about performance; it's about correctness. If you want a partitioned UNIQUE index where the partitioning column is not part of the UNIQUE constraint, this is currently the only way to do it. The optimizer adds uniqueness checks to ensure correctness, and will try to choose the most performant plan possible for those checks. But the fact that there are checks at all means that by definition it will be less performant than an explicitly partitioned or non-partitioned UNIQUE index.

Thanks for calling this out. I updated this language here and elsewhere where it was used. Not sure if it's 100% correct though, please let me know

v21.1/add-constraint.md, line 278 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

uniqueness checks happen in the region where the user is located, which will lead to reduced query latency.
This isn't correct. The uniqueness checks, which are added when a row is inserted or the email column of an existing row is updated, will need to query every region. The performance benefit happens for queries that read users and select a single email address (e.g., SELECT * FROM users WHERE email = 'anemailaddress@gmail.com'). If 'anemailaddress@gmail.com' is found in the local region, there is no need to search remote regions.

Thanks - I have updated this section with this info.

Is this ability to keep the query local for things that are known to be unique also known as "locality optimized search" ? I took a shot at adding that to the doc. PTAL.

v21.1/add-constraint.md, line 296 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

why all the empty spaces?

Brain malfunction - removed :-)

v21.1/create-index.md, line 50 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

This partition by clause isn't for only UNIQUE indexes. I would either take this comment out or just say users should use REGIONAL BY ROW tables.

I see - per your comment above I guess it's any and all indexes, right?

Here is what I updated it with, it's a bit more than your suggestion, please let me know what you think. Also happy to keep it to "use REGIONAL BY ROW" if that's better than this.

"As of CockroachDB v21.1 and later, most users should use REGIONAL BY ROW tables. Indexes against regional by row tables are automatically partitioned, so explicit index partitioning is not required."

v21.1/indexes.md, line 22 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Ditto -- say they should use REGIONAL BY ROW tables

Updated! I ended up using an "include file" since this info is repeated in like 4 places. The text should now be updated to be more general, recommending REGIONAL BY ROW and just creating "normal" indexes against such tables, and recommending against explicit partitioning for most users

v21.1/multiregion-overview.md, line 173 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

When you add a `UNIQUE` constraint to a column in a multi-region database,
This only applies for REGIONAL BY ROW tables
to ensure that you get the best performance possible on uniqueness checks in multi-region deployments.
See my comments above about performance v. correctness

Updated to rename this section to 'Indexes on REGIONAL BY ROW tables'

Revised this whole opening paragraph as discussed above, saying most users should use REGIONAL BY ROW tables and add indexes normally, no explicit partitioning required

Re: performance, we link to the UNIQUE constraint example which is the only place now we discuss performance, and it's updated based on your earlier comment

v21.1/unique.md, line 24 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

ditto comments above

Updated as mentioned in previous comment (these are all included from one source file now, since it's used in 4 places - if you have additional comments you won't have to leave them 4 times now, sorry about that!)

rytaft

Looks much better!

Here's an example for what I was talking about with the implicit v. explicit partitioning. To start, run ./cockroach demo --geo-partitioned-replicas. Next, create a multi-region DB and an employees table as follows:

CREATE DATABASE multi_region_test_db PRIMARY REGION "europe-west1" REGIONS "us-west1", "us-east1";
USE multi_region_test_db;
CREATE TABLE employee (
  id INT PRIMARY KEY,
  email STRING UNIQUE,
  desk_id INT,
  UNIQUE (crdb_region, desk_id)
) LOCALITY REGIONAL BY ROW;

In this example, the schema guarantees that both id and email are globally unique, while desk_id is only unique per region. There are three indexes in the table, all UNIQUE and all partitioned by crdb_region. The indexes on id and email are implicitly partitioned, while the index on crdb_region, desk_id is explicitly partitioned. UNIQUE indexes can only directly enforce uniqueness on all columns in the index, including partitioning columns, so each of these indexes only enforce uniqueness for id, email, and desk_id per region. Therefore, the database needs to do additional work to enforce global uniqueness for id and email. This "additional work" is in the form of "uniqueness checks" that the optimizer adds as part of mutation queries. You can see the checks if you run EXPLAIN on any mutation query affecting the table:

demo@127.0.0.1:26257/multi_region_test_db> EXPLAIN INSERT INTO employee VALUES (1, 'joe@example.com', 1);
                                         info
--------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • root
  │
  ├── • insert
  │   │ into: employee(id, email, desk_id, crdb_region)
  │   │
  │   └── • buffer
  │       │ label: buffer 1
  │       │
  │       └── • values
  │             size: 5 columns, 1 row
  │
  ├── • constraint-check
  │   │
  │   └── • error if rows
  │       │
  │       └── • lookup join (semi)
  │           │ table: employee@primary
  │           │ equality: (lookup_join_const_col_@15, column1) = (crdb_region,id)
  │           │ equality cols are key
  │           │ pred: column10 != crdb_region
  │           │
  │           └── • cross join
  │               │ estimated row count: 3
  │               │
  │               ├── • values
  │               │     size: 1 column, 3 rows
  │               │
  │               └── • scan buffer
  │                     label: buffer 1
  │
  └── • constraint-check
      │
      └── • error if rows
          │
          └── • lookup join (semi)
              │ table: employee@employee_email_key
              │ equality: (lookup_join_const_col_@25, column2) = (crdb_region,email)
              │ equality cols are key
              │ pred: (column1 != id) OR (column10 != crdb_region)
              │
              └── • cross join
                  │ estimated row count: 3
                  │
                  ├── • values
                  │     size: 1 column, 3 rows
                  │
                  └── • scan buffer
                        label: buffer 1
(51 rows)

Time: 44ms total (execution 43ms / network 1ms)

For this insert query, the optimizer has added two "constraint-check" post queries to check the uniqueness of id and email. There is no check needed for crdb_region, desk_id, since that constraint is automatically enforced by the index.

Here is an update example where we update the email column. We only need a uniqueness check for email since we're not updating any other columns:

demo@127.0.0.1:26257/multi_region_test_db> EXPLAIN UPDATE employee SET email = 'joe1@exaple.com' WHERE id = 1;
                                                  info
--------------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • root
  │
  ├── • update
  │   │ table: employee
  │   │ set: email
  │   │
  │   └── • buffer
  │       │ label: buffer 1
  │       │
  │       └── • render
  │           │ estimated row count: 1
  │           │
  │           └── • union all
  │               │ estimated row count: 1
  │               │ limit: 1
  │               │
  │               ├── • scan
  │               │     estimated row count: 1 (100% of the table; stats collected 1 minute ago)
  │               │     table: employee@primary
  │               │     spans: [/'us-east1'/1 - /'us-east1'/1]
  │               │
  │               └── • scan
  │                     estimated row count: 1 (100% of the table; stats collected 1 minute ago)
  │                     table: employee@primary
  │                     spans: [/'europe-west1'/1 - /'europe-west1'/1] [/'us-west1'/1 - /'us-west1'/1]
  │
  └── • constraint-check
      │
      └── • error if rows
          │
          └── • lookup join (semi)
              │ table: employee@employee_email_key
              │ equality: (lookup_join_const_col_@18, email_new) = (crdb_region,email)
              │ equality cols are key
              │ pred: (id != id) OR (crdb_region != crdb_region)
              │
              └── • cross join
                  │ estimated row count: 3
                  │
                  ├── • values
                  │     size: 1 column, 3 rows
                  │
                  └── • scan buffer
                        label: buffer 1
(47 rows)

Time: 4ms total (execution 4ms / network 0ms)

If we only update desk_id, no uniqueness checks are needed at all:

demo@127.0.0.1:26257/multi_region_test_db> EXPLAIN UPDATE employee SET desk_id = 2 WHERE id = 1;
                                              info
------------------------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • update
  │ table: employee
  │ set: desk_id
  │ auto commit
  │
  └── • render
      │ estimated row count: 1
      │
      └── • union all
          │ estimated row count: 1
          │ limit: 1
          │
          ├── • scan
          │     estimated row count: 1 (100% of the table; stats collected 2 minutes ago)
          │     table: employee@primary
          │     spans: [/'us-east1'/1 - /'us-east1'/1]
          │
          └── • scan
                estimated row count: 1 (100% of the table; stats collected 2 minutes ago)
                table: employee@primary
                spans: [/'europe-west1'/1 - /'europe-west1'/1] [/'us-west1'/1 - /'us-west1'/1]
(24 rows)

Time: 1ms total (execution 1ms / network 0ms)

Hopefully this example makes clear that whether or not to explicitly include crdb_region in the index definition depends on the context: if you only need to enforce uniqueness at the region-level, then including crdb_region in the UNIQUE index definition will enforce these semantics and allow you to get better performance on INSERTs, UPDATEs, and UPSERTs, since there won't be any added latency from uniqueness checks. However, if you need to enforce global uniqueness, you should not include crdb_region in the UNIQUE (or PRIMARY KEY) index definition, and the database will automatically ensure that the constraint is enforced.

Let me know if this helps or if you'd like me to clarify anything. Thanks!

Reviewed 7 of 7 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rmloveland)

_includes/v21.1/sql/unique-indexes-regional-by-row.md, line 1 at r2 (raw file):

<span class="version-tag">New in v21.1:</span> In [multi-region deployments](multiregion-overview.html), most users should use [`REGIONAL BY ROW` tables](multiregion-overview.html#regional-by-row-tables) instead of explicit index [partitioning](partitioning.html). When you add an index to a `REGIONAL BY ROW` table, it is automatically partitioned on the [`crdb_region` column](set-locality.html#crdb_region).  Explicit index partitioning is not required.

This sounds good to me, but it's no longer specific to unique indexes, so maybe consider changing the name of this page to just indexes-regional-by-row.md?

v21.1/add-constraint.md, line 278 at r1 (raw file):

Is this ability to keep the query local for things that are known to be unique also known as "locality optimized search" ?

Yep, exactly! The new text looks good.

v21.1/add-constraint.md, line 239 at r2 (raw file):

This example assumes you have a simulated multi-region database running on your local machine following the steps described in [Low Latency Reads and Writes in a Multi-Region Cluster](demo-low-latency-multi-region-deployment.html).

To show how the automatic partitioning of indexes on `REGIONAL BY ROW` tables works, we will:

nit: you've included "unique" in the title of this section, so you might want to make this a bit more verbose and say that this shows how a unique index is partitioned, but it's also similar to how all indexes are partitioned.

v21.1/add-constraint.md, line 278 at r2 (raw file):

~~~

Next, issue the [`SHOW PARTITIONS`](show-partitions.html) statement. The output below (which is edited for length) will verify that the unique index was automatically [partitioned](partitioning.html) for you. It shows that the `user_email_unique` index is now partitioned by the database regions `europe-west1`, `us-east1`, and `us-west1`.  This ensures that the uniqueness constraint is enforced properly across regions when rows are inserted, or the `email` column of an existing row is updated.

nit: it's not the partitioning that ensures the constraint is enforced properly. It's actually sort of the opposite: Without the partitioning, the unique index would be able to enforce the constraint without any additional help from the optimizer. But with an implicitly partitioned unique index, the database needs to do additional work to enforce the constraint. As a result, when a new unique index is added to a REGIONAL BY ROW table, two things happen automatically:

We run a one-time-only validation query to ensure that the existing data in the table satisfies the unique constraint.
Thereafter, the optimizer will automatically add a "uniqueness check" when necessary to any INSERT, UPDATE, or UPSERT statement affecting the columns in the unique constraint.

v21.1/add-constraint.md, line 280 at r2 (raw file):

Next, issue the [`SHOW PARTITIONS`](show-partitions.html) statement. The output below (which is edited for length) will verify that the unique index was automatically [partitioned](partitioning.html) for you. It shows that the `user_email_unique` index is now partitioned by the database regions `europe-west1`, `us-east1`, and `us-west1`.  This ensures that the uniqueness constraint is enforced properly across regions when rows are inserted, or the `email` column of an existing row is updated.

There is also a performance benefit for queries that select a single email address (e.g., `SELECT * FROM users WHERE email = 'anemailaddress@gmail.com'`). If `'anemailaddress@gmail.com'` is found in the local region, there is no need to search remote regions.  This feature, whereby the SQL engine will avoid sending requests to nodes in other regions when it can read a unique column that is stored locally, is known as _locality optimized search_.

super nit: "when it can read a unique column that is stored locally" isn't quite right, since the column spans all regions. Perhaps "when it can read a value from a unique column that is stored locally" would be more accurate.

v21.1/create-index.md, line 50 at r1 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

I see - per your comment above I guess it's any and all indexes, right?

Here is what I updated it with, it's a bit more than your suggestion, please let me know what you think. Also happy to keep it to "use REGIONAL BY ROW" if that's better than this.

"As of CockroachDB v21.1 and later, most users should use REGIONAL BY ROW tables. Indexes against regional by row tables are automatically partitioned, so explicit index partitioning is not required."

LGTM

rmloveland

Thanks for the review Becca! Also for the chat re: this stuff, and the long example in your comment. Super helpful!

In addition to addressing your specific comments below, I also added a new example on explicit vs. implicitly partitioned indexes which is an edited version of what you provided here. PTAL

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rytaft)

_includes/v21.1/sql/unique-indexes-regional-by-row.md, line 1 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

This sounds good to me, but it's no longer specific to unique indexes, so maybe consider changing the name of this page to just indexes-regional-by-row.md?

Updated!

v21.1/add-constraint.md, line 278 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Is this ability to keep the query local for things that are known to be unique also known as "locality optimized search" ?

Yep, exactly! The new text looks good.

Yay thanks!

v21.1/add-constraint.md, line 239 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: you've included "unique" in the title of this section, so you might want to make this a bit more verbose and say that this shows how a unique index is partitioned, but it's also similar to how all indexes are partitioned.

Makes sense. I've added some words to that effect

v21.1/add-constraint.md, line 278 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: it's not the partitioning that ensures the constraint is enforced properly. It's actually sort of the opposite: Without the partitioning, the unique index would be able to enforce the constraint without any additional help from the optimizer. But with an implicitly partitioned unique index, the database needs to do additional work to enforce the constraint. As a result, when a new unique index is added to a REGIONAL BY ROW table, two things happen automatically:

We run a one-time-only validation query to ensure that the existing data in the table satisfies the unique constraint.

Thereafter, the optimizer will automatically add a "uniqueness check" when necessary to any INSERT, UPDATE, or UPSERT statement affecting the columns in the unique constraint.

Thanks - I reworked this section to hopefully capture this info. PTAL

v21.1/add-constraint.md, line 280 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

super nit: "when it can read a unique column that is stored locally" isn't quite right, since the column spans all regions. Perhaps "when it can read a value from a unique column that is stored locally" would be more accurate.

Super nit appreciated! Always good to use more correct language. Updated!

v21.1/create-index.md, line 50 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

LGTM

Sweet thanks!

rytaft

Cool! I think the example looks great. Just one remaining nit below.

Reviewed 5 of 5 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rmloveland)

v21.1/add-constraint.md, line 302 at r3 (raw file):

### Using implicit vs. explicit index partitioning in `REGIONAL BY ROW` tables

In `REGIONAL BY ROW` tables, all indexes are partitioned on the region column (usually called [`crdb_region`](set-locality.html#crdb_region), but it can also be [computed](computed-columns.html)).

nit: the name of the column and whether or not it is computed are orthogonal. The standard vanilla approach results in a non-computed hidden column named crdb_region, but we allow for lots of flexibility. Here are some examples for different ways to create the partitioning column (sorry if you already knew all of this):

-- This creates a hidden `crdb_region` column that is not computed, but has a default
-- value of the gateway region. If you run SHOW CREATE TABLE, you'll see that the column
-- looks like this:
--   crdb_region public.crdb_internal_region NOT VISIBLE NOT NULL DEFAULT 
--     default_to_database_primary_region(gateway_region())::public.crdb_internal_region,
CREATE TABLE t (
  id INT PRIMARY KEY
) LOCALITY REGIONAL BY ROW;

-- This creates a non-computed column with the same default value as the automatically-
-- created `crdb_region` column, but with a different name (note the "AS" after LOCALITY
-- REGIONAL BY ROW).
CREATE TABLE t (
  id INT PRIMARY KEY,
  cr crdb_internal_region NOT NULL DEFAULT 
     default_to_database_primary_region(gateway_region())::public.crdb_internal_region
) LOCALITY REGIONAL BY ROW AS "cr";

-- This creates a computed column with the name cr.
CREATE TABLE t (
  id INT PRIMARY KEY,
  cr crdb_internal_region AS (
      CASE
        WHEN id <= 10 THEN 'us-east-1'
        ELSE 'ap-southeast-2'
      END
    ) STORED
) LOCALITY REGIONAL BY ROW AS "cr";

-- This creates a computed column with the the default name of `crdb_region`,
-- so there's no need to identify it with "AS".
CREATE TABLE t (
  id INT PRIMARY KEY,
  crdb_region crdb_internal_region AS (
      CASE
        WHEN id <= 10 THEN 'us-east-1'
        ELSE 'ap-southeast-2'
      END
    ) STORED
) LOCALITY REGIONAL BY ROW;

It's also possible to specify in the schema that the column is NOT VISIBLE, which means it won't show up when you run SELECT * FROM t. This is an option whether or not the column is computed or has a different name. (This new NOT VISIBLE syntax can also be used in regular, non-multi-region tables.)

Anyway, you definitely don't need to give all this information here, but I think it would be good to make sure it's documented somewhere. Also, as I mentioned when we chatted yesterday, we should explain that using a computed region column is one way to eliminate uniqueness checks and therefore reduce the latency of mutation operations. Another way to eliminate these checks is to use type UUID with default gen_random_uuid() for one of the columns in the unique constraint (but if the user still wants checks in that case, there is a cluster setting to enable them called sql.optimizer.uniqueness_checks_for_gen_random_uuid.enabled). Another way to eliminate uniqueness checks is to just use explicit partitioning (as your example now describes) if you don't actually need to guarantee global uniqueness.

None of this stuff needs to be included in the current PR (other than fixing your implication that computed columns have a different name), but I just wanted to write it down so you can use it for a future PR.

rmloveland

Thanks for the great review Becca! I think I have fixed the last error - PTAL

(Also filed some follow-up issues based on your comments below)

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rytaft)

v21.1/add-constraint.md, line 302 at r3 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: the name of the column and whether or not it is computed are orthogonal. The standard vanilla approach results in a non-computed hidden column named crdb_region, but we allow for lots of flexibility. Here are some examples for different ways to create the partitioning column (sorry if you already knew all of this):
-- This creates a hidden `crdb_region` column that is not computed, but has a default
-- value of the gateway region. If you run SHOW CREATE TABLE, you'll see that the column
-- looks like this:
--   crdb_region public.crdb_internal_region NOT VISIBLE NOT NULL DEFAULT 
--     default_to_database_primary_region(gateway_region())::public.crdb_internal_region,
CREATE TABLE t (
  id INT PRIMARY KEY
) LOCALITY REGIONAL BY ROW;

-- This creates a non-computed column with the same default value as the automatically-
-- created `crdb_region` column, but with a different name (note the "AS" after LOCALITY
-- REGIONAL BY ROW).
CREATE TABLE t (
  id INT PRIMARY KEY,
  cr crdb_internal_region NOT NULL DEFAULT 
     default_to_database_primary_region(gateway_region())::public.crdb_internal_region
) LOCALITY REGIONAL BY ROW AS "cr";

-- This creates a computed column with the name cr.
CREATE TABLE t (
  id INT PRIMARY KEY,
  cr crdb_internal_region AS (
      CASE
        WHEN id <= 10 THEN 'us-east-1'
        ELSE 'ap-southeast-2'
      END
    ) STORED
) LOCALITY REGIONAL BY ROW AS "cr";

-- This creates a computed column with the the default name of `crdb_region`,
-- so there's no need to identify it with "AS".
CREATE TABLE t (
  id INT PRIMARY KEY,
  crdb_region crdb_internal_region AS (
      CASE
        WHEN id <= 10 THEN 'us-east-1'
        ELSE 'ap-southeast-2'
      END
    ) STORED
) LOCALITY REGIONAL BY ROW;
It's also possible to specify in the schema that the column is NOT VISIBLE, which means it won't show up when you run SELECT * FROM t. This is an option whether or not the column is computed or has a different name. (This new NOT VISIBLE syntax can also be used in regular, non-multi-region tables.)

Anyway, you definitely don't need to give all this information here, but I think it would be good to make sure it's documented somewhere. Also, as I mentioned when we chatted yesterday, we should explain that using a computed region column is one way to eliminate uniqueness checks and therefore reduce the latency of mutation operations. Another way to eliminate these checks is to use type UUID with default gen_random_uuid() for one of the columns in the unique constraint (but if the user still wants checks in that case, there is a cluster setting to enable them called sql.optimizer.uniqueness_checks_for_gen_random_uuid.enabled). Another way to eliminate uniqueness checks is to just use explicit partitioning (as your example now describes) if you don't actually need to guarantee global uniqueness.

None of this stuff needs to be included in the current PR (other than fixing your implication that computed columns have a different name), but I just wanted to write it down so you can use it for a future PR.

Thanks for this info Becca. I have removed the incorrect reference to computed columns vs. naming.

I have also filed the following issues (I definitely did not know all this lol):

Update "Computed Columns" page with examples of new multi-region columns and syntax #10401 (update computed columns doc)
Document how to avoid the cost of optimizer uniqueness checks in multi-region settings #10402 (how to avoid uniqueness checks)
Document the non-computed ways to give the partitioned region column a different name #10403 (how to have a different name etc. for region column when it's not computed)

rytaft

thanks for all your work on this!!

Reviewed 1 of 1 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @rmloveland)

v21.1/add-constraint.md, line 302 at r3 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

Thanks for this info Becca. I have removed the incorrect reference to computed columns vs. naming.

I have also filed the following issues (I definitely did not know all this lol):

Update "Computed Columns" page with examples of new multi-region columns and syntax #10401 (update computed columns doc)

Document how to avoid the cost of optimizer uniqueness checks in multi-region settings #10402 (how to avoid uniqueness checks)

Document the non-computed ways to give the partitioned region column a different name #10403 (how to have a different name etc. for region column when it's not computed)

Awesome -- thanks for filing these issues! And the new text here looks good.

I assume NOT VISIBLE is already documented (or has an issue open to be documented) elsewhere?

rmloveland

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @rytaft)

v21.1/add-constraint.md, line 302 at r3 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Awesome -- thanks for filing these issues! And the new text here looks good.

I assume NOT VISIBLE is already documented (or has an issue open to be documented) elsewhere?

Great, thanks!

Re: NOT VISIBLE, I did not find any open docs issues, nor did grepping "not visible" suffice. So I opened #10407 as well

lnhsingh

Just a few edits, but !

Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @rmloveland and @rytaft)

v21.1/add-constraint.md, line 241 at r4 (raw file):

To show how the automatic partitioning of indexes on `REGIONAL BY ROW` tables works, we will:

1. [Add a column](add-column.html) to the `users` table in the [MovR dataset](movr.html).

nit: make this ordered list all start with 1.

v21.1/add-constraint.md, line 286 at r4 (raw file):

~~~
  database_name | table_name | partition_name | column_names |       index_name        | partition_value  |

Is the output edited for width length too (i.e, are these all the columns)? It's a little weird that the last column has the | closing it. Normally when I edit a response for length, I include ... to clearly show where it's been truncated, but that is just a suggestion!

v21.1/add-constraint.md, line 295 at r4 (raw file):

To ensure that the uniqueness constraint is enforced properly across regions when rows are inserted, or the `email` column of an existing row is updated, the database needs to do the following additional work when indexes are partitioned as shown above:

1. Run a one-time-only validation query to ensure that the existing data in the table satisfies the unique constraint.

nit: same comment re: ordered lists

v21.1/add-constraint.md, line 309 at r4 (raw file):

- If `crdb_region` is excluded from the index definition, that serves as a signal that CockroachDB should enforce uniqueness on only the columns in the index definition.

In the latter case, the index alone cannot enforce uniqueness on columns that are not a prefix of the index columns, so any time rows are inserted or updated in a `REGIONAL BY ROW` table that has an implicitly partitioned `UNIQUE` index, the [optimizer](cost-based-optimizer.html) must add uniqueness checks.

Not sure if it's necessary, but might be nice to cross-link to INSERT / UPDATE docs

v21.1/add-constraint.md, line 313 at r4 (raw file):

Whether or not to explicitly include `crdb_region` in the index definition depends on the context:

- If you only need to enforce uniqueness at the region level, then including `crdb_region` in the `UNIQUE` index definition will enforce these semantics and allow you to get better performance on `INSERT`s, `UPDATE`s, and `UPSERT`s, since there won't be any added latency from uniqueness checks.

Same comment here about cross-linking

_includes/v21.1/sql/indexes-regional-by-row.md, line 1 at r4 (raw file):

<span class="version-tag">New in v21.1:</span> In [multi-region deployments](multiregion-overview.html), most users should use [`REGIONAL BY ROW` tables](multiregion-overview.html#regional-by-row-tables) instead of explicit index [partitioning](partitioning.html). When you add an index to a `REGIONAL BY ROW` table, it is automatically partitioned on the [`crdb_region` column](set-locality.html#crdb_region).  Explicit index partitioning is not required.

nit: double spaces (applies to other docs too) 😬

rytaft

Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @rmloveland)

v21.1/add-constraint.md, line 302 at r3 (raw file):

Previously, rmloveland (Rich Loveland) wrote…

Great, thanks!

Re: NOT VISIBLE, I did not find any open docs issues, nor did grepping "not visible" suffice. So I opened #10407 as well

Thanks!

rmloveland

Thanks Becca!

Thanks for the review Lauren! I think I updated all the things

Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @lnhsingh and @rytaft)

v21.1/add-constraint.md, line 302 at r3 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Thanks!

👍

v21.1/add-constraint.md, line 241 at r4 (raw file):

Previously, lnhsingh (Lauren Hirata Singh) wrote…

nit: make this ordered list all start with 1.

Done!

v21.1/add-constraint.md, line 286 at r4 (raw file):

Previously, lnhsingh (Lauren Hirata Singh) wrote…

Is the output edited for width length too (i.e, are these all the columns)? It's a little weird that the last column has the | closing it. Normally when I edit a response for length, I include ... to clearly show where it's been truncated, but that is just a suggestion!

Good call - added some dots n stuff

v21.1/add-constraint.md, line 295 at r4 (raw file):

Previously, lnhsingh (Lauren Hirata Singh) wrote…

nit: same comment re: ordered lists

Done. I will eventually get with the program on this. (It is aesthetically displeasing tho)

v21.1/add-constraint.md, line 309 at r4 (raw file):

Previously, lnhsingh (Lauren Hirata Singh) wrote…

Not sure if it's necessary, but might be nice to cross-link to INSERT / UPDATE docs

Added links!

v21.1/add-constraint.md, line 313 at r4 (raw file):

Previously, lnhsingh (Lauren Hirata Singh) wrote…

Same comment here about cross-linking

Moar links

_includes/v21.1/sql/indexes-regional-by-row.md, line 1 at r4 (raw file):

Previously, lnhsingh (Lauren Hirata Singh) wrote…

nit: double spaces (applies to other docs too) 😬

Fixed, here and in the 'Add Constraint' page.

Fixes #9273. Fixes #9274. Fixes #9740. Fixes #9743. Fixes #9885. Fixes #10136. Relates to #9739. Summary of changes: - Update 'ADD CONSTRAINT' docs to show how to add unique indexes to REGIONAL BY ROW tables. - Also update 'ADD CONSTRAINT' with an example showing the differences in behavior between explicitly partitioned vs. implicitly partitioned indexes. - Update 'CREATE INDEX' docs to note the existence of automatically partitioned indexes, and mention that most users should probably use REGIONAL BY ROW tables and add indexes to them normally, instead of explicit partitioning. - Update 'Indexes' doc to mention: - Index partitioning in general, which was not mentioned on this page. - Automatic index partitioning that comes for free on REGIONAL BY ROW tables, which most users should use instead. - Update the 'Multi-Region Overview' page with a new section called 'Additional Features', where we list how indexes work on RBR tables as one of the new features (this will grow to add mention of other CockroachDB features that are designed to work with multi-region as we go) - Update the 'Unique Constraint' page to talk about indexes on RBR tables, as well as linking to the usage example on the 'ADD CONSTRAINT' page (described above).

netlify · 2021-04-26T20:00:58Z

Netlify Preview

Built with commit 8b0b4c7

https://deploy-preview-10280--cockroachdb-docs.netlify.app

rmloveland requested a review from rytaft April 7, 2021 19:15

rmloveland mentioned this pull request Apr 8, 2021

Update indexes docs to reflect updated multi-region capabilities #9273

Closed

1 task

rytaft requested changes Apr 9, 2021

View reviewed changes

rmloveland force-pushed the 20210407-implicitly-partitioned-unique-indexes branch from ecc826e to e3515b4 Compare April 15, 2021 20:20

rmloveland commented Apr 15, 2021

View reviewed changes

rytaft reviewed Apr 19, 2021

View reviewed changes

rmloveland mentioned this pull request Apr 21, 2021

Update Cost Based Optimizer docs section "Preferring the nearest index" to describe new reality #10391

Closed

rmloveland force-pushed the 20210407-implicitly-partitioned-unique-indexes branch from e3515b4 to 3519c59 Compare April 21, 2021 22:34

rmloveland commented Apr 21, 2021

View reviewed changes

rmloveland changed the title ~~Add docs on implicitly partitioned unique indexes~~ Add docs on indexing of REGIONAL BY ROW tables Apr 21, 2021

rytaft reviewed Apr 22, 2021

View reviewed changes

rmloveland force-pushed the 20210407-implicitly-partitioned-unique-indexes branch from 3519c59 to be83d5f Compare April 22, 2021 18:19

rmloveland commented Apr 22, 2021

View reviewed changes

rytaft approved these changes Apr 22, 2021

View reviewed changes

rmloveland commented Apr 22, 2021

View reviewed changes

rmloveland requested a review from lnhsingh April 22, 2021 21:00

lnhsingh reviewed Apr 22, 2021

View reviewed changes

rytaft approved these changes Apr 22, 2021

View reviewed changes

rmloveland force-pushed the 20210407-implicitly-partitioned-unique-indexes branch from be83d5f to 6f0e6c9 Compare April 26, 2021 15:38

rmloveland commented Apr 26, 2021

View reviewed changes

rmloveland force-pushed the 20210407-implicitly-partitioned-unique-indexes branch from 6f0e6c9 to 8b0b4c7 Compare April 26, 2021 19:57

rmloveland merged commit 0919b29 into master Apr 26, 2021

rmloveland deleted the 20210407-implicitly-partitioned-unique-indexes branch April 26, 2021 20:02

This was referenced Apr 26, 2021

sql: handle CONSTRAINT...UNIQUE with PARTITION BY ALL / REGIONAL BY ROW #9739

Closed

sql: correctly handle CREATE INDEX on REGIONAL BY ROW tables #9624

Closed

rmloveland mentioned this pull request May 3, 2021

ccl: allow partitioned inverted indexes #10066

Closed

rmloveland mentioned this pull request May 11, 2021

Inverted index on RBR table is auto-partitioned #10536

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add docs on indexing of REGIONAL BY ROW tables #10280

Add docs on indexing of REGIONAL BY ROW tables #10280

rmloveland commented Apr 7, 2021 •

edited

Loading

cockroach-teamcity commented Apr 7, 2021

rmloveland commented Apr 7, 2021

rytaft left a comment

otan commented Apr 9, 2021

rmloveland left a comment

rytaft left a comment

rmloveland left a comment

rytaft left a comment

rmloveland left a comment

rytaft left a comment

rmloveland left a comment

lnhsingh left a comment

rytaft left a comment

rmloveland left a comment

netlify bot commented Apr 26, 2021

Add docs on indexing of REGIONAL BY ROW tables #10280

Add docs on indexing of REGIONAL BY ROW tables #10280

Conversation

rmloveland commented Apr 7, 2021 • edited Loading

cockroach-teamcity commented Apr 7, 2021

rmloveland commented Apr 7, 2021

rytaft left a comment

Choose a reason for hiding this comment

otan commented Apr 9, 2021

rmloveland left a comment

Choose a reason for hiding this comment

rytaft left a comment

Choose a reason for hiding this comment

rmloveland left a comment

Choose a reason for hiding this comment

rytaft left a comment

Choose a reason for hiding this comment

rmloveland left a comment

Choose a reason for hiding this comment

rytaft left a comment

Choose a reason for hiding this comment

rmloveland left a comment

Choose a reason for hiding this comment

lnhsingh left a comment

Choose a reason for hiding this comment

rytaft left a comment

Choose a reason for hiding this comment

rmloveland left a comment

Choose a reason for hiding this comment

netlify bot commented Apr 26, 2021

rmloveland commented Apr 7, 2021 •

edited

Loading