-
Notifications
You must be signed in to change notification settings - Fork 105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(dryrun): Add --validate-union-schemas
option (DENG-7746)
#6916
base: main
Are you sure you want to change the base?
Conversation
67c90e9
to
9c11c35
Compare
9c11c35
to
43930ec
Compare
err=True, | ||
) | ||
success = False | ||
return success, sqlfile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mimicked the behavior of the existing --validate-schemas
option above, where if the option is set the normal dryrun validation doesn't get run. I'm not sure if that was the intended behavior for the --validate-schemas
option, but I figured I wouldn't change the status quo for that in this PR.
This also means the --validate-schemas
and --validate-union-schemas
options are mutually exclusive (with the --validate-schemas
option taking precedence).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this check would fit better in bqetl query validate
since it's validation that doesn't (AFAICT) need to dry run the whole query. The validate command can optionally invoke dry run. The validate_schema()
method does seem to do a dry run of the whole query
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had assumed bqetl query
and its subcommands only worked on ETLs, and with this we also want to check views, but looking at the implementation of bqetl query validate
it seems like it would also process views by default, so I guess that's workable. @scholtzan do you have an opinion on this?
(on a semi-related note, I don't love that bqetl query validate
defaults to automatically running bqetl format
, which can mess with your working copy files when running locally unless you remember to specify --skip-format-sql
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, slight preference to move it to bqetl query validate
.
"DATE": "2019-01-01", | ||
"DATETIME": "2019-01-01 00:00:00", | ||
"TIMESTAMP": "2019-01-01 00:00:00", | ||
TMP_DATASET = "bigquery-etl-integration-test.tmp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've described the results of an initial test run of this code in DENG-7746 (tldr: it found three incorrect unions). |
err=True, | ||
) | ||
success = False | ||
return success, sqlfile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this check would fit better in bqetl query validate
since it's validation that doesn't (AFAICT) need to dry run the whole query. The validate command can optionally invoke dry run. The validate_schema()
method does seem to do a dry run of the whole query
raise SchemaAssertError( | ||
f"Field #{field_number}{parent_field_note} has different names ({field['name']} vs {other_field['name']})." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be changed to something like
raise SchemaAssertError( | |
f"Field #{field_number}{parent_field_note} has different names ({field['name']} vs {other_field['name']})." | |
) | |
errors.append( | |
f"Field #{field_number}{parent_field_note} has different names ({field['name']} vs {other_field['name']})." | |
) | |
continue |
with an exception being raised at the end so the error contains all the conflicting columns. That would result in dupes like a != b
b != a
, but it would be convenient to see everything in one go.
e.g. In #6878 there were two of these errors in the struct
Co-authored-by: Ben Wu <12437227+BenWu@users.noreply.github.com>
Description
This PR adds a
--validate-union-schemas
option to thebqetl dryrun
command, which attempts to dryrun subqueries that are unioned together and check whether there are any discrepancies between their schemas which might indicate the union result is incorrect.Testing this manually it seems to work pretty well for the majority of ETLs/views, but there are some known cases where it doesn't work properly:
NULL
without an explicit cast will default to being treated as an integer instead of the type imposed by the union). IMO it'd be good to update all such queries to always use explicit column names and types.Related Tickets & Documents
Reviewer, please follow this checklist
┆Issue is synchronized with this Jira Task