Fix: Ensure table_diff works consistently with multiple join keys #3281
sqlmesh/core/engine_adapter/base.py
Outdated
name.set("catalog", value=self.default_catalog)
name.set("catalog", exp.to_identifier(self.default_catalog))
@tobymao should we use exp.parse_identifier(self.default_catalog) here instead of calling to_identifier? What if there are quotes/unsafe characters in the name?
(I think another place where we do this is in the BigQuery adapter, that'd need to be fixed too probably?)
sure
I made this change because when running on BigQuery with our tobiko-1 dataset, it was outputting values like tobiko-1.`schema`.`table` (i.e. not quoting the tobiko-1, which produced an error).
However, I never know when to use to_identifier or parse_identifier. I guess parse_identifier is more appropriate here because this value can come from the user connection config. I'll update it.
This is the main difference between the two:
>>> from sqlglot import exp
>>> exp.to_identifier('"foo"')
Identifier(this="foo", quoted=True)
>>> exp.to_identifier('"foo"').sql()
'"""foo"""'
>>> exp.parse_identifier('"foo"').sql()
'"foo"'
Basically, to_identifier accepts a string and instantiates an Identifier object. If the string has "unsafe" characters, it quotes it; this is decided by a regex match. On the other hand, parse_identifier parses the string, as its name suggests.
If you came across an error after using the latter then there may be additional context around that choice. I'm not 100% sure, this is just a hunch.
Oh, no, the error came from not doing anything at all and just shoving in the str that self.default_catalog returned, without turning it into an Identifier so it could get quoted correctly.
I think parse_identifier is correct here, thanks for explaining!
Prior to this change, sqlmesh table_diff didn't really work outside of DuckDB when multiple join keys were specified. In addition, it was only being tested on DuckDB.
This PR:
- Adds a RowDiffMixin that contains the logic for turning arbitrary columns into strings and building an expression to combine multiple columns into a single column
- Performs the FULL JOIN on the single key instead of trying to build an expression that takes all the key fields into account while also handling null keys (which didn't work on engines like Postgres)
- Adds tests for RowDiffMixin to help prevent regressions in future. The tests cover single-column grain, multi-column grain and an arbitrary "on" condition with a where clause
The following will need to be addressed in follow-up PRs: