feat(dryrun): Add --validate-union-schemas option (DENG-7746) #6916

Open
wants to merge 3 commits into base: main


Contributor

@sean-rose sean-rose commented Jan 30, 2025

Description

This PR adds a --validate-union-schemas option to the bqetl dryrun command, which attempts to dryrun subqueries that are unioned together and check whether there are any discrepancies between their schemas which might indicate the union result is incorrect.

In manual testing this seems to work pretty well for the majority of ETLs/views, but there are some known cases where it doesn't work properly:

  • The column names and types from the first query in a union get automatically applied to subsequent queries in the union, so some of those subsequent union queries omit column names or types, which produces false positives in these validation checks (e.g. NULL without an explicit cast will default to being treated as an integer instead of the type imposed by the union). IMO it'd be good to update all such queries to always use explicit column names and types.
  • This doesn't generate valid dryrun queries for unions in correlated subqueries (i.e. where the subquery being unioned is selecting from an outer scope that isn't a separate CTE). This might be solvable with some clever query rewriting, but I didn't want to make this already complex PR even more complex for this relatively minor edge case.
  • SQLGlot mangles type-annotated array literals (issue), resulting in some invalid dryrun queries (though that has been fixed and should be included in the next release).
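
To make the first caveat concrete, here is a minimal sketch of a positional schema comparison between two union branches. The function name `union_schema_discrepancies` and the hand-written schemas are assumptions for illustration only (they are not taken from this PR's code or from real dry-run output); the INT64 type for a bare NULL follows the behavior described in the first bullet.

```python
def union_schema_discrepancies(first_schema, other_schema):
    """Compare two union branch schemas position by position.

    Each schema is a list of (column_name, column_type) tuples.
    Hypothetical helper, not the implementation in this PR.
    """
    notes = []
    if len(first_schema) != len(other_schema):
        notes.append(
            f"Column count differs: {len(first_schema)} vs {len(other_schema)}"
        )
    pairs = zip(first_schema, other_schema)
    for position, ((name, type_), (other_name, other_type)) in enumerate(pairs, start=1):
        if name != other_name:
            notes.append(f"Column #{position}: name {name} vs {other_name}")
        if type_ != other_type:
            notes.append(f"Column #{position} ({name}): {type_} vs {other_type}")
    return notes


# Branch 1: SELECT client_id, submission_date FROM ...
first = [("client_id", "STRING"), ("submission_date", "DATE")]

# Branch 2 dry-run on its own: SELECT client_id, NULL AS submission_date
# (assumed to be typed INT64 in isolation, per the caveat above)
bare_null = [("client_id", "STRING"), ("submission_date", "INT64")]

# Branch 2 rewritten: SELECT client_id, CAST(NULL AS DATE) AS submission_date
explicit_cast = [("client_id", "STRING"), ("submission_date", "DATE")]

print(union_schema_discrepancies(first, bare_null))      # one type mismatch (false positive)
print(union_schema_discrepancies(first, explicit_cast))  # no discrepancies
```

The explicit cast removes the false positive without changing the union's result, which is why always spelling out column names and types in every branch is attractive.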

Related Tickets & Documents

  • DENG-7746: Some ETLs and views may be silently unioning data incorrectly

┆Issue is synchronized with this Jira Task

@sean-rose sean-rose force-pushed the validate-union-schemas branch from 67c90e9 to 9c11c35 Compare January 30, 2025 01:14
@sean-rose sean-rose force-pushed the validate-union-schemas branch from 9c11c35 to 43930ec Compare January 30, 2025 01:15
err=True,
)
success = False
return success, sqlfile
Contributor Author

I mimicked the behavior of the existing --validate-schemas option above, where if the option is set the normal dryrun validation doesn't get run. I'm not sure if that was the intended behavior for the --validate-schemas option, but I figured I wouldn't change the status quo for that in this PR.

This also means the --validate-schemas and --validate-union-schemas options are mutually exclusive (with the --validate-schemas option taking precedence).
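
The precedence described here can be sketched as a small decision function. This is an illustrative assumption about the control flow, not the actual bqetl code; the function name `select_dryrun_check` and the returned labels are hypothetical.

```python
def select_dryrun_check(validate_schemas: bool, validate_union_schemas: bool) -> str:
    """Return which single check runs for a given flag combination.

    Hypothetical sketch: exactly one check runs per invocation.
    """
    if validate_schemas:
        # --validate-schemas wins, even if --validate-union-schemas is also set.
        return "validate_schemas"
    if validate_union_schemas:
        return "validate_union_schemas"
    # Neither option set: the normal dryrun validation runs.
    return "dryrun"


for flags in [(True, True), (False, True), (False, False)]:
    print(flags, "->", select_dryrun_check(*flags))
```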

Contributor

I think this check would fit better in bqetl query validate, since it's validation that doesn't (AFAICT) need to dry run the whole query. The validate command can optionally invoke dry run. The validate_schema() method does seem to do a dry run of the whole query.

Contributor Author

I had assumed bqetl query and its subcommands only worked on ETLs, and with this we also want to check views, but looking at the implementation of bqetl query validate it seems like it would also process views by default, so I guess that's workable. @scholtzan do you have an opinion on this?

(on a semi-related note, I don't love that bqetl query validate defaults to automatically running bqetl format, which can mess with your working copy files when running locally unless you remember to specify --skip-format-sql)

Collaborator

Yeah, slight preference to move it to bqetl query validate.

"DATE": "2019-01-01",
"DATETIME": "2019-01-01 00:00:00",
"TIMESTAMP": "2019-01-01 00:00:00",
TMP_DATASET = "bigquery-etl-integration-test.tmp"
Contributor Author

@sean-rose sean-rose Jan 30, 2025

@sean-rose sean-rose requested review from scholtzan and BenWu January 30, 2025 16:58
Contributor Author

sean-rose commented Jan 30, 2025

I've described the results of an initial test run of this code in DENG-7746 (tldr: it found three incorrect unions).

Comment on lines +300 to +302
raise SchemaAssertError(
    f"Field #{field_number}{parent_field_note} has different names ({field['name']} vs {other_field['name']})."
)
Contributor

Could this be changed to something like

Suggested change
- raise SchemaAssertError(
-     f"Field #{field_number}{parent_field_note} has different names ({field['name']} vs {other_field['name']})."
- )
+ errors.append(
+     f"Field #{field_number}{parent_field_note} has different names ({field['name']} vs {other_field['name']})."
+ )
+ continue

with an exception being raised at the end, so that the error contains all the conflicting columns? That would result in dupes like "a != b" and "b != a", but it would be convenient to see everything in one go.

e.g. In #6878 there were two of these errors in the struct.
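
A minimal sketch of the collect-then-raise idea follows, assuming BigQuery-style field dicts with "name", "type", and optional nested "fields". The `SchemaAssertError` class here is a local stand-in for the one named in the diff, and `assert_schemas_match`, `_collect_mismatches`, and the exact wording of the parent-field note are hypothetical names chosen for illustration.

```python
class SchemaAssertError(Exception):
    """Local stand-in for the exception named in the diff."""


def _collect_mismatches(fields, other_fields, errors, parent=""):
    """Append a message for every mismatch instead of raising on the first."""
    parent_field_note = f" in {parent}" if parent else ""
    if len(fields) != len(other_fields):
        errors.append(
            f"Different number of fields{parent_field_note} "
            f"({len(fields)} vs {len(other_fields)})."
        )
    for field_number, (field, other_field) in enumerate(zip(fields, other_fields), start=1):
        if field["name"] != other_field["name"]:
            errors.append(
                f"Field #{field_number}{parent_field_note} has different names "
                f"({field['name']} vs {other_field['name']})."
            )
            continue
        if field["type"] != other_field["type"]:
            errors.append(
                f"Field #{field_number}{parent_field_note} has different types "
                f"({field['type']} vs {other_field['type']})."
            )
            continue
        if field.get("fields") or other_field.get("fields"):
            # Recurse into nested struct fields, tracking a dotted path.
            child = f"{parent}.{field['name']}" if parent else field["name"]
            _collect_mismatches(
                field.get("fields", []), other_field.get("fields", []), errors, child
            )


def assert_schemas_match(fields, other_fields):
    """Raise a single SchemaAssertError listing every mismatch found."""
    errors = []
    _collect_mismatches(fields, other_fields, errors)
    if errors:
        raise SchemaAssertError("\n".join(errors))
```

With two struct subfields swapped between the schemas, the single raised error would list both "a vs b" and "b vs a", which is the duplication mentioned above.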

Co-authored-by: Ben Wu <12437227+BenWu@users.noreply.github.com>