Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: Ensure table_diff works consistently with multiple join keys #3281

Merged
merged 2 commits into from
Oct 23, 2024

Conversation

erindru
Copy link
Collaborator

@erindru erindru commented Oct 23, 2024

Prior to this change, sqlmesh table_diff didnt really work outside of duckdb when multiple join keys were specified. In addition, it was was only being tested on duckdb.

This PR:

  • Introduces a RowDiffMixin that contains the logic for turning arbitrary columns into strings and building an expression to combine multiple columns into a single column
  • Utilises this to build a single join key and switches table_diff to FULL JOIN on the single key instead of trying to build an expression that takes all the key fields into account while also handling null keys (which didn't work on engines like Postgres)
  • Introduces integration tests that run the table_diff across every engine adapter implementing RowDiffMixin to help prevent regressions in future. The tests are for single-column grain, multi-column grain and arbitrary "on" condition with where clause

The following will need to be addressed in follow-up PR's:

  • Clickhouse support. This wasn't trivial to tack onto this PR due to how Clickhouse handles nulls

@erindru erindru force-pushed the erin/table-diff-count-distinct branch 2 times, most recently from 28b78ac to df70a93 Compare October 23, 2024 04:49
name.set("catalog", value=self.default_catalog)
name.set("catalog", exp.to_identifier(self.default_catalog))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tobymao should we use exp.parse_identifier(self.default_catalog) here instead of calling to_identifier? What if there are quotes/unsafe characters in the name?

(I think another place where we do this is in the bigquery adapter, that'd need to be fixed too probably?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change because when running on BigQuery with our tobiko-1 dataset, it was outputting values like tobiko-1.`schema`.`table` (i.e not quoting the tobiko-1 which produced an error).

However, I never know when to use to_identifier or parse_identifier. I guess parse_identifier is more appropriate here because this value can come from the user connection config, i'll update it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main difference between the two:

>>> from sqlglot import exp
>>> exp.to_identifier('"foo"')
Identifier(this="foo", quoted=True)
>>> exp.to_identifier('"foo"').sql()
'"""foo"""'
>>> exp.parse_identifier('"foo"').sql()
'"foo"'

Basically, to_identifier accepts a string and it instantiates an Identifier object. If it has "unsafe" chracters, then it quotes it. This happens using a regex match. On the other hand, parse_identifier parses the string as its name suggests.

If you came across an error after using the latter then there may be additional context around that choice, I'm not 100% sure, just observed based on hunch.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, no the error came from not doing anything at all and just shoving in the str that self.default_catalog returned without doing anything to turn it into an Identifier so it could get quoted correctly.

I think parse_identifier is correct here, thanks for explaining!

sqlmesh/core/table_diff.py Show resolved Hide resolved
@erindru erindru force-pushed the erin/table-diff-count-distinct branch from 3d601d4 to 87ade49 Compare October 23, 2024 19:47
@erindru erindru merged commit 993edfa into main Oct 23, 2024
22 checks passed
@erindru erindru deleted the erin/table-diff-count-distinct branch October 23, 2024 20:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants