Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for schema-based-sharding via a GUC #6866

Merged
merged 31 commits into from
May 26, 2023

Conversation

onurctirtir
Copy link
Member

@onurctirtir onurctirtir commented Apr 18, 2023

DESCRIPTION: Adds citus.enable_schema_based_sharding GUC that allows sharding the database based on schemas when enabled.

  • Refactor the logic that automatically creates Citus managed tables

  • Refactor CreateSingleShardTable() to allow specifying colocation id instead

  • Add support for schema-based-sharding via a GUC

What this PR is about:

Add citus.enable_schema_based_sharding GUC to enable schema-based
sharding. Each schema created while this GUC is ON will be considered
as a tenant schema. Later on, regardless of whether the GUC is ON or
OFF, any table created in a tenant schema will be converted to a
single shard distributed table (without a shard key). All the tenant
tables that belong to a particular schema will be co-located with each
other and will have a shard count of 1.

We introduce a new metadata table --pg_dist_tenant_schema-- to do the
bookkeeping for tenant schemas:

psql> \d pg_dist_tenant_schema
          Table "pg_catalog.pg_dist_tenant_schema"
┌───────────────┬─────────┬───────────┬──────────┬─────────┐
│    Column     │  Type   │ Collation │ Nullable │ Default │
├───────────────┼─────────┼───────────┼──────────┼─────────┤
│ schemaid      │ oid     │           │ not null │         │
│ colocationid  │ integer │           │ not null │         │
└───────────────┴─────────┴───────────┴──────────┴─────────┘
Indexes:
    "pg_dist_tenant_schema_pkey" PRIMARY KEY, btree (schemaid)
    "pg_dist_tenant_schema_unique_colocationid_index" UNIQUE, btree (colocationid)

psql> table pg_dist_tenant_schema;
┌───────────┬───────────────┐
│ schemaid  │ colocationid  │
├───────────┼───────────────┤
│     4196391 │
│     4196290 │
└───────────┴───────────────┘
(2 rows)

Colocation id column of pg_dist_tenant_schema can never be NULL even
for the tenant schemas that don't have a tenant table yet. This is
because, we assign colocation ids to tenant schemas as soon as they
are created. That way, we can keep associating tenant schemas with
particular colocation groups even if all the tenant tables of a tenant
schema are dropped and recreated later on.

When a tenant schema is dropped, we delete the corresponding row from
pg_dist_tenant_schema. In that case, we delete the corresponding
colocation group from pg_dist_colocation as well.

Future work for 12.0 release:

We're building schema-based sharding on top of the infrastructure that
adds support for creating distributed tables without a shard key (#6867).
However, not all the operations that can be done on distributed tables
without a shard key necessarily make sense (in the same way) in the
context of schema-based sharding. For example, we need to think about
what happens if user attempts altering schema of a tenant table. We
will tackle such scenarios in a future PR.

We will also add a new UDF --citus.schema_tenant_set() or such-- to
allow users to use an existing schema as a tenant schema, and another
one --citus.schema_tenant_unset() or such-- to stop using a schema as
a tenant schema in future PRs.

@onurctirtir onurctirtir force-pushed the schema-based-sharding-via-guc branch 6 times, most recently from e9365fa to a286958 Compare April 19, 2023 10:45
@onurctirtir onurctirtir marked this pull request as ready for review April 19, 2023 10:47
@onurctirtir onurctirtir force-pushed the schema-based-sharding-via-guc branch 4 times, most recently from dab92d7 to 31efdde Compare April 19, 2023 14:46
@onurctirtir onurctirtir force-pushed the null-shard-key-devel-protected branch 3 times, most recently from 7f02706 to dd346f6 Compare April 19, 2023 23:36
@onurctirtir onurctirtir force-pushed the schema-based-sharding-via-guc branch 2 times, most recently from a18b16b to fc3cccd Compare April 19, 2023 23:59
@onurctirtir onurctirtir force-pushed the null-shard-key-devel-protected branch from 400f58d to 65d2b72 Compare April 24, 2023 10:20
@onurctirtir onurctirtir force-pushed the schema-based-sharding-via-guc branch from fc3cccd to 9494c6a Compare April 24, 2023 10:23
@onurctirtir onurctirtir force-pushed the schema-based-sharding-via-guc branch 2 times, most recently from 3c7050e to 0c26df3 Compare April 26, 2023 09:27
@onurctirtir onurctirtir force-pushed the schema-based-sharding-via-guc branch 7 times, most recently from d26084a to de2ff1f Compare April 26, 2023 15:12
Copy link
Member

@marcocitus marcocitus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good overall, after addressing open comments this seems ready to merge.

* Avoid unnecessarily adding the table into metadata if we will
* distribute it as a tenant table later.
*/
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still kind of wondering whether we can call ConvertNewTableIfNecessary from here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm maybe we can do but would might mean a somewhat big change in the context of this PR --need to understand why we didn't do so for ConvertToManagedTable before.

src/backend/distributed/commands/utility_hook.c Outdated Show resolved Hide resolved
"by using colocate_with option."),
errhint("Consider using \"ALTER TABLE %s SET SCHEMA %s;\" "
"to move the table to the same schema with the "
"tenant table.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is that possible?

Copy link
Member Author

@onurctirtir onurctirtir May 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not possible today. But a time ago, I was thinking that we would definitely have such a logic for ALTER TABLE SET SCHEMA, in 12.0. Let me improve the hint and then we can adjust it if we decide making ALTER TABLE SET SCHEMA clever in this release.

@citusdata citusdata deleted a comment from codecov bot May 23, 2023
@codecov
Copy link

codecov bot commented May 23, 2023

Codecov Report

Merging #6866 (8eefa99) into main (2c7beee) will decrease coverage by 0.01%.
The diff coverage is 91.58%.

@@            Coverage Diff             @@
##             main    #6866      +/-   ##
==========================================
- Coverage   93.26%   93.26%   -0.01%     
==========================================
  Files         269      271       +2     
  Lines       57558    57813     +255     
==========================================
+ Hits        53684    53919     +235     
- Misses       3874     3894      +20     

Copy link
Member

@marcocitus marcocitus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need to do something about this syntax, since it currently results in a local table existing in a tenant schema:

create schema t2 create table test (x int, y int);

@aykut-bozkurt
Copy link
Contributor

aykut-bozkurt commented May 24, 2023

nit about weird syntax:

SET citus.enable_schema_based_sharding TO off;
CREATE SCHEMA "CiTuS.TeeN";

This fails like below:

NOTICE:  issuing SELECT pg_catalog.citus_internal_add_tenant_schema("CiTuS.TeeN"::regnamespace, 1)
DETAIL:  on server aykutbozkurt@localhost:9701 connectionId: 1
ERROR:  column "CiTuS.TeeN" does not exist

We can use quote_literal_cstr(quote_identifier(schemaName)) at RemoteSchemaIdExpression to fix.

$$);

DROP OWNED BY test_other_super_user;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could use a CREATE FOREIGN TABLE Test, also we may need to update the error/hint

create foreign table t6.foreign_test (x int, y int) server local_server options (schema_name 'public', table_name 'test');
ERROR:  foreign tables cannot be distributed
HINT:  Can add foreign table "foreign_test" to metadata by running: SELECT citus_add_local_table_to_metadata($$t6.foreign_test$$);

CREATE SCHEMA tenant_10;
CREATE TABLE tenant_10.tbl_1(a int, b text);
CREATE TABLE tenant_10.tbl_2(a int, b text);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider adding a CREATE TABLE tenant_10.tbl_3 AS SELECT s, s FROM generate_series(1,100) s; test

This currently results in a local table being created in the schema.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now we're throwing an error for that

@aykut-bozkurt
Copy link
Contributor

aykut-bozkurt commented May 24, 2023

I had 2 questions. First one looks like works and is resolved. The second one is a problem. We can error and defer it to the second phase or see the solution I suggest.

  1. What we do when we alter a tenant schema? Looks like it just works we do not hardcode any schema's name into any table. Verified with below.
ALTER SCHEMA tenant1 RENAME TO <NEW_SCHEMA_NAME>;
  1. What happens if we alter a tenant table's schema? Looks like queries still work and table colocation and other metadata are fine. But feels like bad to have the tables in different schemas with the same colocation entry.
ALTER TABLE tenant1.test1 SET SCHEMA <NEW_SCHEMA_NAME>;
  • we may delete metadata if we change the tenant schema to a regular schema. (undistribute or leave it single shard?)
  • we may update metadata and move the single shard table if we change the tenant schema to another tenant schema. (This can be useful if user chooses a tenant schema by mistake)

@onurctirtir
Copy link
Member Author

I had 2 questions. First one looks like works and is resolved. The second one is a problem. We can error and defer it to the second phase or see the solution I suggest.

  1. What we do when we alter a tenant schema? Looks like it just works we do not hardcode any schema's name into any table. Verified with below.
ALTER SCHEMA tenant1 RENAME TO <NEW_SCHEMA_NAME>;
  1. What happens if we alter a tenant table's schema? Looks like queries still work and table colocation and other metadata are fine. But feels like bad to have the tables in different schemas with the same colocation entry.
ALTER TABLE tenant1.test1 SET SCHEMA <NEW_SCHEMA_NAME>;
  • we may delete metadata if we change the tenant schema to a regular schema. (undistribute or leave it single shard?)
  • we may update metadata and move the single shard table if we change the tenant schema to another tenant schema. (This can be useful if user chooses a tenant schema by mistake)

Scenarios like 2) will be tackled in a separate PR.

emelsimsek pushed a commit that referenced this pull request May 24, 2023
We mark objects as distributed objects in Citus metadata only if we need
to propagate given the command that creates it to worker nodes. For this
reason, we were not doing this for the objects that are created while
pg_dist_node is empty.

One implication of doing so is that we defer the schema propagation to
the time when user creates the first distributed table in the schema.
However, this doesn't help for schema-based sharding (#6866) because we
want to sync pg_dist_tenant_schema to the worker nodes even for empty
schemas too.

* Support test dependencies for isolation tests without a schedule

* Comment out a test due to a known issue (#6901)

* Also, reduce the verbosity for some log messages and make some
   tests compatible with run_test.py.
@onurctirtir
Copy link
Member Author

@marcocitus, I looked into special variants of CREATE TABLE / CREATE SCHEMA commands and made some changes accordingly.

  • CREATE TABLE is now allowed to use IF EXISTS syntax when creating a tenant table.
  • CREATE TABLE commands are allowed to use LIKE syntax when creating a tenant table based on any other table.

However, we do not allow the following;

  • Creating tenant tables by using CREATE TABLE AS / SELECT INTO commands.
  • Creating tenant tables that inherit from other tables, and similarly, we do not allow creating regular tables that inherit from a tenant table.
    • @aykut-bozkurt: For citus_set_tenant_schema(), need to throw an error for the tables involved in an inheritance relationship.
  • Creating tenant tables by using CREATE TABLE .. OF TYPE .. syntax.
    • @aykut-bozkurt: For citus_set_tenant_schema(), need to throw an error for the tables that explicitly depends on a type, i.e., created by using that syntax earlier.
  • Creating the tenant table and the tenant schema at the same time, by using CREATE SCHEMA .. CREATE TABLE .. syntax. However, if time permits, we want to add support for this by using the infra that is being implemented for citus_set_tenant_schema(). (cc: @aykut-bozkurt)

@onurctirtir onurctirtir merged commit 246b054 into main May 26, 2023
@onurctirtir onurctirtir deleted the schema-based-sharding-via-guc branch May 26, 2023 07:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants