Lineage, BigQuery - ingestion processes exported audit logs with corrupted SQL statements #4691

Closed
vgaidass opened this issue Apr 19, 2022 · 4 comments

@vgaidass
Contributor

vgaidass commented Apr 19, 2022

Describe the bug
CLI version: 0.8.33

During ingestion, lineage parsing fails with the following errors:

* Error: Sub-process exception: An Identifier is expected, got Operation
* Error: Sub-process exception: An Identifier is expected, got Function
* Error: Sub-process exception: An Identifier or IdentifierList is expected, got Parenthesis

It is hard to tell exactly which audit log entries break the sqllineage SQL parsing, even with datahub --debug (something to improve as well?). However, the presence of "failed" SQL queries in the BigQuery audit logs has been confirmed.

Such audit log entries have $.jobChange.job.jobStatus.jobState = "DONE" and $.jobChange.job.jobStatus.errorResult present.

To Reproduce

  1. Execute some broken SQL statements so that they are saved to the audit logs.
  2. Perform BigQuery metadata+lineage ingestion with use_exported_bigquery_audit_metadata: true

In principle, the $.jobChange.job.jobConfig.queryConfig.query of such entries is still picked up, because the current version of DataHub extracts all audit logs with jobState = "DONE".

Expected behavior
Audit logs with corrupted and/or broken SQL queries must not be processed by DataHub ingestion from BigQuery with use_exported_bigquery_audit_metadata: true

Screenshots
image

Additional context
A. An example of an audit log entry with a corrupted SQL statement:

 {"@type":"type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata","jobChange":{"after":"DONE","job":{"jobConfig":{"queryConfig":{"createDisposition":"CREATE_IF_NEEDED","priority":"QUERY_INTERACTIVE","query":"8513 * 4","writeDisposition":"WRITE_EMPTY"},"type":"QUERY"},"jobName":"projects/GCP_PROJECT_NAME/jobs/bquxjob_25490c81_17fcfd28e24","jobStats":{"createTime":"2022-03-28T09:20:35.582Z","endTime":"2022-03-28T09:20:35.588Z","queryStats":{},"startTime":"2022-03-28T09:20:35.588Z"},"jobStatus":{"errorResult":{"code":3,"message":"Syntax error: Expected end of input but got integer literal \"8513\" at [1:1]"},"errors":[{"code":3,"message":"Syntax error: Expected end of input but got integer literal \"8513\" at [1:1]"}],"jobState":"DONE"}}}}
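For illustration, here is a minimal sketch of how such an entry could be recognised on the client side. It assumes the metadata JSON has already been parsed into a dict; the helper name failed_job_error is hypothetical and not part of DataHub, and the entry is a trimmed copy of the one above.

```python
import json
from typing import Any, Dict, Optional


def failed_job_error(metadata: Dict[str, Any]) -> Optional[str]:
    """Return the error message if the audit entry describes a failed job, else None.

    Hypothetical helper: it only inspects $.jobChange.job.jobStatus.errorResult.
    """
    status = metadata.get("jobChange", {}).get("job", {}).get("jobStatus", {})
    error = status.get("errorResult")
    if error:
        return error.get("message", "unknown error")
    return None


# Trimmed version of the audit log entry shown above.
entry = json.loads(
    '{"jobChange": {"job": {'
    '"jobConfig": {"queryConfig": {"query": "8513 * 4"}},'
    '"jobStatus": {"errorResult": {"code": 3, "message": "Syntax error"},'
    '"jobState": "DONE"}}}}'
)

# A job that finished without an error has no errorResult key.
ok_entry = {"jobChange": {"job": {"jobStatus": {"jobState": "DONE"}}}}

print(failed_job_error(entry))
print(failed_job_error(ok_entry))
```

Note that jobState is "DONE" in both cases, so the state alone cannot distinguish a failed job from a successful one.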

B. Possible fix:

  • Update lines 211-216 of bigquery_audit_metadata_query_template() in datahub/ingestion/source/sql/bigquery.py by adding AND JSON_EXTRACT(protopayload_auditlog.metadataJson, "$.jobChange.job.jobStatus.errorResult") IS NULL
  • Note: extracting lineage from GCP logging seems to address the presence of failed SQL jobs
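
A rough sketch of what the first bullet could look like, written as a Python string helper. The existing condition below is an assumed stand-in for the jobState filter, not the actual text of the DataHub template, and add_error_filter is a hypothetical name used only for illustration.

```python
# Extra condition proposed above: keep only jobs without an errorResult.
ERROR_FILTER = (
    "JSON_EXTRACT(protopayload_auditlog.metadataJson, "
    '"$.jobChange.job.jobStatus.errorResult") IS NULL'
)


def add_error_filter(where_clause: str) -> str:
    """Append the failed-job exclusion to an existing WHERE clause body."""
    return f"{where_clause.rstrip()}\n    AND {ERROR_FILTER}"


# Assumed shape of the current filter; the real template may differ.
existing = (
    "JSON_EXTRACT_SCALAR(protopayload_auditlog.metadataJson, "
    '"$.jobChange.job.jobStatus.jobState") = "DONE"'
)

print(add_error_filter(existing))
```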
@vgaidass vgaidass added the bug Bug report label Apr 19, 2022
@vgaidass
Contributor Author

@anshbansal @rslanka Raising awareness of this issue :)

@vgaidass
Contributor Author

Extra: it might help to expose the entire QueryEvent, or at least the query being processed, when parsing fails, instead of just "Error: Sub-process exception". There also seems to be another issue, more related to having a function and/or CTE as the query inside an audit log entry.
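
Something along these lines is what I have in mind, as a sketch only. The parse callable stands in for whatever DataHub uses to call sqllineage; parse_with_context and fake_parse are hypothetical names, not existing DataHub or sqllineage functions.

```python
import logging
from typing import Callable, List, Optional

logger = logging.getLogger(__name__)


def parse_with_context(
    query: str,
    parse: Callable[[str], List[str]],
    max_len: int = 200,
) -> Optional[List[str]]:
    """Run the parser and, on failure, log the offending query instead of hiding it."""
    try:
        return parse(query)
    except Exception as exc:
        # Truncate long queries so the log line stays readable.
        snippet = query if len(query) <= max_len else query[:max_len] + "..."
        logger.warning("Lineage parsing failed for query %r: %s", snippet, exc)
        return None


def fake_parse(query: str) -> List[str]:
    """Stand-in parser for the demo: rejects anything that is not a SELECT."""
    if not query.lstrip().lower().startswith("select"):
        raise ValueError("An Identifier is expected, got Operation")
    return ["some_table"]


print(parse_with_context("8513 * 4", fake_parse))
print(parse_with_context("SELECT * FROM some_table", fake_parse))
```

Returning None on failure lets the caller skip the entry and continue with the rest of the audit log.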

@anshbansal
Collaborator

@vgaidass Given that you have a very good understanding of this issue, would you like to try your hand at sending a PR for this bugfix? This doc https://datahubproject.io/docs/metadata-ingestion/developing/ contains details on how to make changes on your local machine. I am here (as well as on the DataHub Slack) to help out.

@vgaidass
Contributor Author

@anshbansal I'll see what I can do
