Fix: Ensure table_diff works consistently with multiple join keys #3281

erindru · 2024-10-23T03:49:52Z

Prior to this change, sqlmesh table_diff didnt really work outside of duckdb when multiple join keys were specified. In addition, it was was only being tested on duckdb.

This PR:

Introduces a RowDiffMixin that contains the logic for turning arbitrary columns into strings and building an expression to combine multiple columns into a single column
Utilises this to build a single join key and switches table_diff to FULL JOIN on the single key instead of trying to build an expression that takes all the key fields into account while also handling null keys (which didn't work on engines like Postgres)
Introduces integration tests that run the table_diff across every engine adapter implementing RowDiffMixin to help prevent regressions in future. The tests are for single-column grain, multi-column grain and arbitrary "on" condition with where clause

The following will need to be addressed in follow-up PR's:

Clickhouse support. This wasn't trivial to tack onto this PR due to how Clickhouse handles nulls

sqlmesh/core/engine_adapter/duckdb.py

georgesittas · 2024-10-23T09:31:54Z

sqlmesh/core/engine_adapter/base.py

-            name.set("catalog", value=self.default_catalog)
+            name.set("catalog", exp.to_identifier(self.default_catalog))


@tobymao should we use exp.parse_identifier(self.default_catalog) here instead of calling to_identifier? What if there are quotes/unsafe characters in the name?

(I think another place where we do this is in the bigquery adapter, that'd need to be fixed too probably?)

I made this change because when running on BigQuery with our tobiko-1 dataset, it was outputting values like tobiko-1.`schema`.`table` (i.e not quoting the tobiko-1 which produced an error).

However, I never know when to use to_identifier or parse_identifier. I guess parse_identifier is more appropriate here because this value can come from the user connection config, i'll update it

This is the main difference between the two:

>>> from sqlglot import exp >>> exp.to_identifier('"foo"') Identifier(this="foo", quoted=True) >>> exp.to_identifier('"foo"').sql() '"""foo"""' >>> exp.parse_identifier('"foo"').sql() '"foo"'

Basically, to_identifier accepts a string and it instantiates an Identifier object. If it has "unsafe" chracters, then it quotes it. This happens using a regex match. On the other hand, parse_identifier parses the string as its name suggests.

If you came across an error after using the latter then there may be additional context around that choice, I'm not 100% sure, just observed based on hunch.

Oh, no the error came from not doing anything at all and just shoving in the str that self.default_catalog returned without doing anything to turn it into an Identifier so it could get quoted correctly.

I think parse_identifier is correct here, thanks for explaining!

sqlmesh/core/table_diff.py

tobymao reviewed Oct 23, 2024

View reviewed changes

sqlmesh/core/engine_adapter/duckdb.py Outdated Show resolved Hide resolved

tobymao approved these changes Oct 23, 2024

View reviewed changes

erindru force-pushed the erin/table-diff-count-distinct branch 2 times, most recently from 28b78ac to df70a93 Compare October 23, 2024 04:49

georgesittas reviewed Oct 23, 2024

View reviewed changes

erindru added 2 commits October 23, 2024 19:43

Fix: Ensure table_diff works consistently with multiple join keys

289c953

PR feedback

87ade49

erindru force-pushed the erin/table-diff-count-distinct branch from 3d601d4 to 87ade49 Compare October 23, 2024 19:47

erindru merged commit 993edfa into main Oct 23, 2024
22 checks passed

erindru deleted the erin/table-diff-count-distinct branch October 23, 2024 20:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: Ensure table_diff works consistently with multiple join keys #3281

Fix: Ensure table_diff works consistently with multiple join keys #3281

erindru commented Oct 23, 2024 •

edited

Loading

georgesittas Oct 23, 2024

tobymao Oct 23, 2024

erindru Oct 23, 2024

georgesittas Oct 23, 2024

erindru Oct 23, 2024

		name.set("catalog", value=self.default_catalog)
		name.set("catalog", exp.to_identifier(self.default_catalog))

Fix: Ensure table_diff works consistently with multiple join keys #3281

Fix: Ensure table_diff works consistently with multiple join keys #3281

Conversation

erindru commented Oct 23, 2024 • edited Loading

georgesittas Oct 23, 2024

Choose a reason for hiding this comment

tobymao Oct 23, 2024

Choose a reason for hiding this comment

erindru Oct 23, 2024

Choose a reason for hiding this comment

georgesittas Oct 23, 2024

Choose a reason for hiding this comment

erindru Oct 23, 2024

Choose a reason for hiding this comment

erindru commented Oct 23, 2024 •

edited

Loading