Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distribution key as varchar causing problems #6543

Closed
hghammoud opened this issue Dec 5, 2022 · 8 comments · Fixed by #6550
Closed

Distribution key as varchar causing problems #6543

hghammoud opened this issue Dec 5, 2022 · 8 comments · Fixed by #6550

Comments

@hghammoud
Copy link

hghammoud commented Dec 5, 2022

I am currently facing an issue with tenant_id being a varchar instead of integer. I simplified the example 👍

This works

CREATE TABLE event (
  tenant_id int,
  id bigint,
  primary key (tenant_id, id)
);

CREATE TABLE page (
  tenant_id int,
  id int,
  primary key (tenant_id, id)
);

SELECT create_distributed_table('event', 'tenant_id');
SELECT create_distributed_table('page', 'tenant_id', colocate_with => 'event');
alter table page add constraint fk21 foreign key (tenant_id, id) references event;

this does NOT work

CREATE TABLE event (
  tenant_id varchar,
  id bigint,
  primary key (tenant_id, id)
);

CREATE TABLE page (
  tenant_id varchar,
  id int,
  primary key (tenant_id, id)
);

SELECT create_distributed_table('event', 'tenant_id');
SELECT create_distributed_table('page', 'tenant_id', colocate_with => 'event');
alter table page add constraint fk21 foreign key (tenant_id, id) references event;

I am getting the following error

ERROR:  complex joins are only supported when all distributed tables are co-located and joined on their distribution columns
CONTEXT:  SQL statement "SELECT fk."tenant_id", fk."id" FROM ONLY "public"."page" fk LEFT OUTER JOIN ONLY "public"."event" pk ON ( pk."tenant_id"::pg_catalog.text OPERATOR(pg_catalog.=) fk."tenant_id"::pg_catalog.text AND pk."id"::pg_catalog.text OPERATOR(pg_catalog.=) fk."id"::pg_catalog.text) WHERE pk."tenant_id" IS NULL AND (fk."tenant_id" IS NOT NULL AND fk."id" IS NOT NULL)"
while executing command on docker-citus_worker-1:5432

I am using the docker-compose found on the root of the github (citus 11.1)

What am I missing?

thank you.

@JelteF JelteF transferred this issue from citusdata/citus_docs Dec 5, 2022
@TsinghuaLucky912
Copy link
Contributor

TsinghuaLucky912 commented Dec 6, 2022

Dear Hackers, I have investigated this issue, and now I will tell you the results of my investigation for your reference:

  1. Environment: Ubuntu22+pg15.0+citus main
  2. I think the root cause of the error is: there is no pg_operator tuple for varchar type in postgresql (if the two tables here use text type, there is no such problem, because pg_operator has the content of text=text and its oid =98)
  3. Because of the reason in 1, the generate_operator_clause function converts the varchar type to the text type, which leads to the following error. If you execute this select command (manually replace the text type with the varchar type, the query can be successful)
  4. If you have any new ideas, please let me know, and I will see if I can solve this problem. Thank you!
src/backend/utils/adt/ruleutils.c

void
generate_operator_clause(StringInfo buf,
						 const char *leftop, Oid leftoptype,
						 Oid opoid,
						 const char *rightop, Oid rightoptype)
{
    // here
}

调查结果2

@marcocitus
Copy link
Member

marcocitus commented Dec 6, 2022

I feel like there are 2 separate issues here:

  1. Co-located LEFT JOIN is erroring out for varchar distribution columns, edit: +when casting the column to text
  2. We should not be doing the LEFT JOIN to begin with because we should skip validation when creating the foreign key on shell tables

@TsinghuaLucky912
Copy link
Contributor

Hello, I don't quite understand your meaning. At present, such SQL statements (left join) do not report errors

image

@marcocitus
Copy link
Member

I think the issue is specifically with the cast from varchar to text in the join clause.

SELECT fk."tenant_id", fk."id" FROM ONLY "public"."page" fk LEFT OUTER JOIN ONLY "public"."event" pk ON ( pk."tenant_id"::pg_catalog.text OPERATOR(pg_catalog.=) fk."tenant_id"::pg_catalog.text AND pk."id"::pg_catalog.text OPERATOR(pg_catalog.=) fk."id"::pg_catalog.text) WHERE pk."tenant_id" IS NULL AND (fk."tenant_id" IS NOT NULL AND fk."id" IS NOT NULL)
ERROR:  complex joins are only supported when all distributed tables are co-located and joined on their distribution columns

@TsinghuaLucky912
Copy link
Contributor

TsinghuaLucky912 commented Dec 7, 2022

I think the issue is specifically with the cast from varchar to text in the join clause.

SELECT fk."tenant_id", fk."id" FROM ONLY "public"."page" fk LEFT OUTER JOIN ONLY "public"."event" pk ON ( pk."tenant_id"::pg_catalog.text OPERATOR(pg_catalog.=) fk."tenant_id"::pg_catalog.text AND pk."id"::pg_catalog.text OPERATOR(pg_catalog.=) fk."id"::pg_catalog.text) WHERE pk."tenant_id" IS NULL AND (fk."tenant_id" IS NOT NULL AND fk."id" IS NOT NULL)
ERROR:  complex joins are only supported when all distributed tables are co-located and joined on their distribution columns

Yes, if you still keep the varchar type, there is no problem. But this doesn't seem to be caused by citus, and maybe I haven't found it (if you know, you can tell me)

@TsinghuaLucky912
Copy link
Contributor

@marcocitus My feeling is that the problem is not citus, but PostgreSQL's missing handling of varchar types

1670408501677_4CA3B77F-D6BC-478c-A608-7E00DAF061B0

lQLPJwGCYg6DfCzNA-3NB32wsBZA4SMCYtUDjKXKhMDqAA_1917_1005

lQLPJwZIvpRCK6zNA-3NB3mw0CV86y37L_sDjKXIdEBVAA_1913_1005

@JelteF
Copy link
Contributor

JelteF commented Dec 7, 2022

The commit that introduced this regression is: eadc88a

JelteF added a commit that referenced this issue Dec 7, 2022
In commit eadc88a we changed how we skip foreign key validation. The
goal was to skip it in more cases. However, one change had the
unintended regression of introducing failures when trying to create
certain foreign keys. This reverts that part of the change.

The new way of skipping validation of foreign keys skipped the
validation during execution. The reason that this caused this regression
was because some foreign key validation queries already fail during
planning. In those cases it never gets to the execution step where it
would later be skipped.

Fixes #6543
JelteF added a commit that referenced this issue Dec 7, 2022
In commit eadc88a we changed how we skip foreign key validation. The
goal was to skip it in more cases. However, one change had the
unintended regression of introducing failures when trying to create
certain foreign keys. This reverts that part of the change.

The new way of skipping validation of foreign keys skipped the
validation during execution. The reason that this caused this regression
was because some foreign key validation queries already fail during
planning. In those cases it never gets to the execution step where it
would later be skipped.

Fixes #6543
JelteF added a commit that referenced this issue Dec 7, 2022
In commit eadc88a we changed how we skip foreign key validation. The
goal was to skip it in more cases. However, one change had the
unintended regression of introducing failures when trying to create
certain foreign keys. This reverts that part of the change.

The new way of skipping validation of foreign keys skipped the
validation during execution. The reason that this caused this regression
was because some foreign key validation queries already fail during
planning. In those cases it never gets to the execution step where it
would later be skipped.

Fixes #6543
JelteF added a commit that referenced this issue Dec 7, 2022
In commit eadc88a we changed how we skip foreign key validation. The
goal was to skip it in more cases. However, one change had the
unintended regression of introducing failures when trying to create
certain foreign keys. This reverts that part of the change.

The new way of skipping validation of foreign keys skipped the
validation during execution. The reason that this caused this regression
was because some foreign key validation queries already fail during
planning. In those cases it never gets to the execution step where it
would later be skipped.

Fixes #6543
@JelteF
Copy link
Contributor

JelteF commented Dec 7, 2022

I have a fix ready in #6550

JelteF added a commit that referenced this issue Dec 16, 2022
In commit eadc88a we changed how we skip foreign key validation. The
goal was to skip it in more cases. However, one change had the
unintended regression of introducing failures when trying to create
certain foreign keys. This reverts that part of the change.

The new way of skipping validation of foreign keys skipped the
validation during execution. The reason that this caused this regression
was because some foreign key validation queries already fail during
planning. In those cases it never gets to the execution step where it
would later be skipped.

Fixes #6543
JelteF added a commit that referenced this issue Jan 20, 2023
In commit eadc88a we changed how we skip foreign key validation. The
goal was to skip it in more cases. However, one change had the
unintended regression of introducing failures when trying to create
certain foreign keys. This reverts that part of the change.

The new way of skipping validation of foreign keys skipped the
validation during execution. The reason that this caused this regression
was because some foreign key validation queries already fail during
planning. In those cases it never gets to the execution step where it
would later be skipped.

Fixes #6543
JelteF added a commit that referenced this issue Jan 24, 2023
In commit eadc88a we changed how we skip foreign key validation. The
goal was to skip it in more cases. However, one change had the
unintended regression of introducing failures when trying to create
certain foreign keys. This reverts that part of the change.

The new way of skipping validation of foreign keys skipped the
validation during execution. The reason that this caused this regression
was because some foreign key validation queries already fail during
planning. In those cases it never gets to the execution step where it
would later be skipped.

Fixes #6543
JelteF added a commit that referenced this issue Jan 24, 2023
In eadc88a we started completely skipping foreign key constraint
validation at the end of a non blocking shard move, instead of only for
foreign keys to reference tables. However, it turns out that this didn't
work at all because of a hard to notice bug: By resetting the
SkipConstraintValidation flag at the end of our utility hook, we
actually make the SET command that sets it a no-op.

This fixes that bug by removing the code that resets it. This is fine
because #6543 removed the only place where we set the flag in C code. So
the resetting of the flag has no purpose anymore. This PR also adds a
regression test, because it turned out we didn't have any otherwise we
would have caught that the feature was completely broken.

It also moves the constraint validation skipping to the utility hook.
The reason is that #6543 showed us that this is the better place to skip
it, because it will also skip the planning phase and not just the
execution.
JelteF added a commit that referenced this issue Jan 24, 2023
In eadc88a we started completely skipping foreign key constraint
validation at the end of a non blocking shard move, instead of only for
foreign keys to reference tables. However, it turns out that this didn't
work at all because of a hard to notice bug: By resetting the
SkipConstraintValidation flag at the end of our utility hook, we
actually make the SET command that sets it a no-op.

This fixes that bug by removing the code that resets it. This is fine
because #6543 removed the only place where we set the flag in C code. So
the resetting of the flag has no purpose anymore. This PR also adds a
regression test, because it turned out we didn't have any otherwise we
would have caught that the feature was completely broken.

It also moves the constraint validation skipping to the utility hook.
The reason is that #6543 showed us that this is the better place to skip
it, because it will also skip the planning phase and not just the
execution.
JelteF added a commit that referenced this issue Jan 24, 2023
DESCRIPTION: Fix regression in allowed foreign keys on distributed
tables

In commit eadc88a we changed how we skip foreign key validation. The
goal was to skip it in more cases. However, one change had the
unintended regression of introducing failures when trying to create
certain foreign keys. This reverts that part of the change.

The way of skipping validation of foreign keys that was introduced in
eadc88a was skipping validation during execution. The reason that
this caused this regression was because some foreign key validation
queries already fail during planning. In those cases it never gets to
the execution step where it would later be skipped.

Fixes #6543
JelteF added a commit that referenced this issue Jan 24, 2023
DESCRIPTION: Fix regression in allowed foreign keys on distributed
tables

In commit eadc88a we changed how we skip foreign key validation. The
goal was to skip it in more cases. However, one change had the
unintended regression of introducing failures when trying to create
certain foreign keys. This reverts that part of the change.

The way of skipping validation of foreign keys that was introduced in
eadc88a was skipping validation during execution. The reason that
this caused this regression was because some foreign key validation
queries already fail during planning. In those cases it never gets to
the execution step where it would later be skipped.

Fixes #6543
JelteF added a commit that referenced this issue Jan 24, 2023
In eadc88a we started completely skipping foreign key constraint
validation at the end of a non blocking shard move, instead of only for
foreign keys to reference tables. However, it turns out that this didn't
work at all because of a hard to notice bug: By resetting the
SkipConstraintValidation flag at the end of our utility hook, we
actually make the SET command that sets it a no-op.

This fixes that bug by removing the code that resets it. This is fine
because #6543 removed the only place where we set the flag in C code. So
the resetting of the flag has no purpose anymore. This PR also adds a
regression test, because it turned out we didn't have any otherwise we
would have caught that the feature was completely broken.

It also moves the constraint validation skipping to the utility hook.
The reason is that #6543 showed us that this is the better place to skip
it, because it will also skip the planning phase and not just the
execution.
JelteF added a commit that referenced this issue Jan 24, 2023
In eadc88a we started completely skipping foreign key constraint
validation at the end of a non blocking shard move, instead of only for
foreign keys to reference tables. However, it turns out that this didn't
work at all because of a hard to notice bug: By resetting the
SkipConstraintValidation flag at the end of our utility hook, we
actually make the SET command that sets it a no-op.

This fixes that bug by removing the code that resets it. This is fine
because #6543 removed the only place where we set the flag in C code. So
the resetting of the flag has no purpose anymore. This PR also adds a
regression test, because it turned out we didn't have any otherwise we
would have caught that the feature was completely broken.

It also moves the constraint validation skipping to the utility hook.
The reason is that #6543 showed us that this is the better place to skip
it, because it will also skip the planning phase and not just the
execution.
JelteF added a commit that referenced this issue Jan 24, 2023
In eadc88a we started completely skipping foreign key constraint
validation at the end of a non blocking shard move, instead of only for
foreign keys to reference tables. However, it turns out that this didn't
work at all because of a hard to notice bug: By resetting the
SkipConstraintValidation flag at the end of our utility hook, we
actually make the SET command that sets it a no-op.

This fixes that bug by removing the code that resets it. This is fine
because #6543 removed the only place where we set the flag in C code. So
the resetting of the flag has no purpose anymore. This PR also adds a
regression test, because it turned out we didn't have any otherwise we
would have caught that the feature was completely broken.

It also moves the constraint validation skipping to the utility hook.
The reason is that #6543 showed us that this is the better place to skip
it, because it will also skip the planning phase and not just the
execution.
JelteF added a commit that referenced this issue Jan 25, 2023
In eadc88a we started completely skipping foreign key constraint
validation at the end of a non blocking shard move, instead of only for
foreign keys to reference tables. However, it turns out that this didn't
work at all because of a hard to notice bug: By resetting the
SkipConstraintValidation flag at the end of our utility hook, we
actually make the SET command that sets it a no-op.

This fixes that bug by removing the code that resets it. This is fine
because #6543 removed the only place where we set the flag in C code. So
the resetting of the flag has no purpose anymore. This PR also adds a
regression test, because it turned out we didn't have any otherwise we
would have caught that the feature was completely broken.

It also moves the constraint validation skipping to the utility hook.
The reason is that #6543 showed us that this is the better place to skip
it, because it will also skip the planning phase and not just the
execution.
JelteF added a commit that referenced this issue Jan 27, 2023
In eadc88a we started completely skipping foreign key constraint
validation at the end of a non blocking shard move, instead of only for
foreign keys to reference tables. However, it turns out that this didn't
work at all because of a hard to notice bug: By resetting the
SkipConstraintValidation flag at the end of our utility hook, we
actually make the SET command that sets it a no-op.

This fixes that bug by removing the code that resets it. This is fine
because #6543 removed the only place where we set the flag in C code. So
the resetting of the flag has no purpose anymore. This PR also adds a
regression test, because it turned out we didn't have any otherwise we
would have caught that the feature was completely broken.

It also moves the constraint validation skipping to the utility hook.
The reason is that #6543 showed us that this is the better place to skip
it, because it will also skip the planning phase and not just the
execution.
JelteF added a commit that referenced this issue Jan 27, 2023
DESCRIPTION: Fix foreign key validation skip at the end of shard move

In eadc88a we started completely skipping foreign key constraint
validation at the end of a non blocking shard move, instead of only for
foreign keys to reference tables. However, it turns out that this didn't
work at all because of a hard to notice bug: By resetting the
SkipConstraintValidation flag at the end of our utility hook, we
actually make the SET command that sets it a no-op.

This fixes that bug by removing the code that resets it. This is fine
because #6543 removed the only place where we set the flag in C code. So
the resetting of the flag has no purpose anymore. This PR also adds a
regression test, because it turned out we didn't have any otherwise we
would have caught that the feature was completely broken.

It also moves the constraint validation skipping to the utility hook.
The reason is that #6550 showed us that this is the better place to skip
it, because it will also skip the planning phase and not just the
execution.
JelteF added a commit that referenced this issue Feb 10, 2023
DESCRIPTION: Fix regression in allowed foreign keys on distributed
tables

In commit eadc88a we changed how we skip foreign key validation. The
goal was to skip it in more cases. However, one change had the
unintended regression of introducing failures when trying to create
certain foreign keys. This reverts that part of the change.

The way of skipping validation of foreign keys that was introduced in
eadc88a was skipping validation during execution. The reason that
this caused this regression was because some foreign key validation
queries already fail during planning. In those cases it never gets to
the execution step where it would later be skipped.

Fixes #6543

(cherry picked from commit 7a7880a)
JelteF added a commit that referenced this issue Feb 10, 2023
DESCRIPTION: Fix foreign key validation skip at the end of shard move

In eadc88a we started completely skipping foreign key constraint
validation at the end of a non blocking shard move, instead of only for
foreign keys to reference tables. However, it turns out that this didn't
work at all because of a hard to notice bug: By resetting the
SkipConstraintValidation flag at the end of our utility hook, we
actually make the SET command that sets it a no-op.

This fixes that bug by removing the code that resets it. This is fine
because #6543 removed the only place where we set the flag in C code. So
the resetting of the flag has no purpose anymore. This PR also adds a
regression test, because it turned out we didn't have any otherwise we
would have caught that the feature was completely broken.

It also moves the constraint validation skipping to the utility hook.
The reason is that #6550 showed us that this is the better place to skip
it, because it will also skip the planning phase and not just the
execution.

(cherry picked from commit 81dcddd)
rajeshkt78 pushed a commit that referenced this issue Feb 17, 2023
DESCRIPTION: Fix regression in allowed foreign keys on distributed
tables

In commit eadc88a we changed how we skip foreign key validation. The
goal was to skip it in more cases. However, one change had the
unintended regression of introducing failures when trying to create
certain foreign keys. This reverts that part of the change.

The way of skipping validation of foreign keys that was introduced in
eadc88a was skipping validation during execution. The reason that
this caused this regression was because some foreign key validation
queries already fail during planning. In those cases it never gets to
the execution step where it would later be skipped.

Fixes #6543
emelsimsek added a commit that referenced this issue Feb 17, 2023
…6714)

When auto_explain module is loaded and configured, EXPLAIN will be
implicitly run for all the supported commands. Postgres does not support
`EXPLAIN` for `ALTER` command. However, auto_explain will try to
`EXPLAIN` other supported commands internally triggered by `ALTER`.

 For instance, 
`ALTER TABLE target_table ADD CONSTRAINT fkey_167 FOREIGN KEY (col_1)
REFERENCES ref_table(key) ... `

command may trigger a SELECT command in the following form for foreign
key validation purpose:
`SELECT fk.col_1 FROM ONLY target_table fk LEFT OUTER JOIN ONLY
ref_table pk ON ( pk.key OPERATOR(pg_catalog.=) fk.col_1) WHERE pk.key
IS NULL AND (fk.col_1 IS NOT NULL) `

For Citus tables, the Citus utility hook should ensure that constraint
validation is skipped for shell tables but they are done for shard
tables. The reason behind this design choice can be summed up as:

- An ALTER TABLE command via coordinator node is run in a distributed
transaction.
- Citus does not support nested distributed transactions. 
- A SELECT query on a distributed table (aka shell table) is also run in
a distributed transaction.
- Therefore, Citus does not support running a SELECT query on a shell
table while an ALTER TABLE command is running.

With
eadc88a
a bug is introduced breaking the skip constraint validation behaviour of
Citus. With this change, we see that validation queries on distributed
tables are triggered within `ALTER` command adding constraints with
validation check. This regression did not cause an issue for regular use
cases since the citus executor hook blocks those queries heuristically
when there is an ALTER TABLE command in progress.

The issue is surfaced as a crash (#6424 Workers, when configured to use
auto_explain, crash during distributed transactions.) when auto_explain
is enabled. This is due to auto_explain trying to execute the SELECT
queries in a nested distributed transaction.

Now since the regression with constraint validation is fixed in
#6543, we should be able to
remove the workaround.
emelsimsek added a commit that referenced this issue Mar 6, 2023
…6714)

When auto_explain module is loaded and configured, EXPLAIN will be
implicitly run for all the supported commands. Postgres does not support
`EXPLAIN` for `ALTER` command. However, auto_explain will try to
`EXPLAIN` other supported commands internally triggered by `ALTER`.

 For instance, 
`ALTER TABLE target_table ADD CONSTRAINT fkey_167 FOREIGN KEY (col_1)
REFERENCES ref_table(key) ... `

command may trigger a SELECT command in the following form for foreign
key validation purpose:
`SELECT fk.col_1 FROM ONLY target_table fk LEFT OUTER JOIN ONLY
ref_table pk ON ( pk.key OPERATOR(pg_catalog.=) fk.col_1) WHERE pk.key
IS NULL AND (fk.col_1 IS NOT NULL) `

For Citus tables, the Citus utility hook should ensure that constraint
validation is skipped for shell tables but they are done for shard
tables. The reason behind this design choice can be summed up as:

- An ALTER TABLE command via coordinator node is run in a distributed
transaction.
- Citus does not support nested distributed transactions. 
- A SELECT query on a distributed table (aka shell table) is also run in
a distributed transaction.
- Therefore, Citus does not support running a SELECT query on a shell
table while an ALTER TABLE command is running.

With
eadc88a
a bug is introduced breaking the skip constraint validation behaviour of
Citus. With this change, we see that validation queries on distributed
tables are triggered within `ALTER` command adding constraints with
validation check. This regression did not cause an issue for regular use
cases since the citus executor hook blocks those queries heuristically
when there is an ALTER TABLE command in progress.

The issue is surfaced as a crash (#6424 Workers, when configured to use
auto_explain, crash during distributed transactions.) when auto_explain
is enabled. This is due to auto_explain trying to execute the SELECT
queries in a nested distributed transaction.

Now since the regression with constraint validation is fixed in
#6543, we should be able to
remove the workaround.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants