During the ingestion, the lineage parsing mechanism is failing with the following outputs:
* Error: Sub-process exception: An Identifier is expected, got Operation
* Error: Sub-process exception: An Identifier is expected, got Function
* Error: Sub-process exception: An Identifier or IdentifierList is expected, got Parenthesis
It is hard to tell exactly which audit log entries break the SQL parsing done by sqllineage, even when running datahub --debug (something to improve as well?). However, the presence of "failed" SQL queries has been confirmed in the BigQuery audit logs.
Such audit logs have $.jobChange.job.jobStatus.jobState = "DONE" and $.jobChange.job.jobStatus.errorResult present.
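The condition above can be sketched as a small Python check. This is only an illustration of the two JSON paths mentioned in this issue: the function name is_failed_job is hypothetical and is not part of DataHub, and the payloads are trimmed-down stand-ins for real audit log entries.

```python
import json
from typing import Any, Dict


def is_failed_job(metadata: Dict[str, Any]) -> bool:
    """Return True if a BigQueryAuditMetadata payload describes a job
    that reached jobState "DONE" but carries an errorResult."""
    status = metadata.get("jobChange", {}).get("job", {}).get("jobStatus", {})
    return status.get("jobState") == "DONE" and bool(status.get("errorResult"))


# Trimmed-down payloads with the same nesting as the audit log in this issue.
failed = json.loads(
    '{"jobChange": {"job": {"jobStatus": {"jobState": "DONE", '
    '"errorResult": {"code": 3, "message": "Syntax error"}}}}}'
)
succeeded = json.loads('{"jobChange": {"job": {"jobStatus": {"jobState": "DONE"}}}}')

print(is_failed_job(failed))     # True
print(is_failed_job(succeeded))  # False
```

Both payloads have jobState "DONE", so filtering on jobState alone cannot tell them apart; only the errorResult check does.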
To Reproduce
Have some broken SQL statements executed and saved to audit logs.
Perform BigQuery metadata+lineage ingestion with use_exported_bigquery_audit_metadata: true
In theory, $.jobChange.job.jobConfig.queryConfig.query of such failed queries is still picked up, because the current version of DataHub extracts all audit logs with jobState = "DONE", regardless of whether the job succeeded.
Expected behavior
Audit logs with corrupted and/or broken SQL queries must not be processed by DataHub ingestion from BigQuery with use_exported_bigquery_audit_metadata: true
Screenshots
Additional context
A. An example of an audit log entry with a corrupted SQL statement:
{"@type":"type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata","jobChange":{"after":"DONE","job":{"jobConfig":{"queryConfig":{"createDisposition":"CREATE_IF_NEEDED","priority":"QUERY_INTERACTIVE","query":"8513 * 4","writeDisposition":"WRITE_EMPTY"},"type":"QUERY"},"jobName":"projects/GCP_PROJECT_NAME/jobs/bquxjob_25490c81_17fcfd28e24","jobStats":{"createTime":"2022-03-28T09:20:35.582Z","endTime":"2022-03-28T09:20:35.588Z","queryStats":{},"startTime":"2022-03-28T09:20:35.588Z"},"jobStatus":{"errorResult":{"code":3,"message":"Syntax error: Expected end of input but got integer literal \"8513\" at [1:1]"},"errors":[{"code":3,"message":"Syntax error: Expected end of input but got integer literal \"8513\" at [1:1]"}],"jobState":"DONE"}}}}
B. Possible fix:
Update lines 211-216 of bigquery_audit_metadata_query_template() in datahub/ingestion/source/sql/bigquery.py by adding AND JSON_EXTRACT(protopayload_auditlog.metadataJson, "$.jobChange.job.jobStatus.errorResult") IS NULL
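A rough sketch of how the extra condition would combine with the existing jobState filter is shown below. It is not the actual template from bigquery.py: the constant names, the build_where_clause helper, and the exact form of the jobState condition are assumptions made for illustration. Only the errorResult condition is taken from the proposal above.

```python
# Hypothetical, simplified stand-in for the filter section of the audit log query.
# The real template in bigquery.py has more conditions and a different layout.
JOB_DONE_FILTER = (
    "JSON_EXTRACT_SCALAR(protopayload_auditlog.metadataJson, "
    '"$.jobChange.job.jobStatus.jobState") = "DONE"'
)

# The condition proposed in this issue: keep only jobs without an errorResult.
NO_ERROR_FILTER = (
    "JSON_EXTRACT(protopayload_auditlog.metadataJson, "
    '"$.jobChange.job.jobStatus.errorResult") IS NULL'
)


def build_where_clause(skip_failed_jobs: bool = True) -> str:
    """Join the filter conditions into a single WHERE clause."""
    conditions = [JOB_DONE_FILTER]
    if skip_failed_jobs:
        conditions.append(NO_ERROR_FILTER)
    return "WHERE " + "\n  AND ".join(conditions)


print(build_where_clause())
```

Filtering in the query itself means failed jobs never reach the SQL parser, which is cheaper than fetching them and discarding them afterwards.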
Note: extracting lineage from GCP Logging (instead of the exported audit metadata) seems to already handle failed SQL jobs.
Extra: it might also help to expose the entire QueryEvent, or at least the query being processed, when parsing fails, instead of only "Error: Sub-process exception". It seems there might be another issue as well, more related to having a function and/or CTE as the query inside an audit log entry.
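As a minimal sketch of that idea, a parser call could be wrapped so that the failing query text ends up in the error message. Everything here is hypothetical: LineageParseError, parse_with_context, and toy_parser are illustration-only names, and toy_parser merely stands in for the real SQL parser.

```python
from typing import Callable, List


class LineageParseError(Exception):
    """Parsing failure that carries the offending SQL text."""


def parse_with_context(
    query: str, parser: Callable[[str], List[str]], max_len: int = 200
) -> List[str]:
    """Run the parser and, on failure, re-raise with a (truncated) copy of the query."""
    try:
        return parser(query)
    except Exception as exc:
        snippet = query if len(query) <= max_len else query[:max_len] + "..."
        raise LineageParseError(f"Failed to parse query {snippet!r}: {exc}") from exc


def toy_parser(query: str) -> List[str]:
    # Hypothetical stand-in for the real parser: rejects anything that is not a SELECT.
    if not query.lstrip().upper().startswith("SELECT"):
        raise ValueError("An Identifier is expected")
    return ["some_table"]


try:
    parse_with_context("8513 * 4", toy_parser)
except LineageParseError as err:
    print(err)
```

Truncating the query keeps log lines readable when very long statements fail to parse.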
@vgaidass Given that you have a very good understanding of this issue, would you like to try your hand at sending a PR for this bugfix? This doc https://datahubproject.io/docs/metadata-ingestion/developing/ contains details on how to make changes on your local machine. I am here (as well as on the DataHub Slack) to help out.
Describe the bug
CLI version: 0.8.33
Screenshots
![image](https://user-images.githubusercontent.com/14344344/163988834-0798312f-0360-4c9b-91fa-016a13f0ce26.png)