Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(tableau): ability to force extraction of table/column level linage from SQL queries #9838

Merged
merged 52 commits into from
Mar 21, 2024

Conversation

alexs-101
Copy link
Contributor

@alexs-101 alexs-101 commented Feb 14, 2024

Related issues

Everything was fixed in backward compatible way, should not break anything

Closes #9841
Closes #9842
Closes #9843
Closes #9844
Closes #9845
Closes #9846
Closes #9847
Closes #9880
Closes #9881

Features

force_extraction_of_lineage_from_custom_sql_queries: true | false
disable_schema_awarenes_during_parsing_of_sql_queries: true | false

Details

force_extraction_of_lineage_from_custom_sql_queries: true | false
If enabled, the tableau source plugin will not rely on Tableau Custom SQL metadata, instead, it will always parse out table and column level linage from SQL queries (does not matter if it's supported and unsupported by Tableau). The feature was implemented, because in our case the quality of Custom SQL Tableau metadata (even for supported SQL queries) was worse than what sqlglot based parser does. It allowed us to finally ingest our huge Tableau projects and be happy enough with quality of the resulting linage.

disable_schema_awarenes_during_parsing_of_sql_queries: true | false
If enabled, the SQL parser will not lookup pre ingested schemas of tables from DataHub during parsing, it's helpful if the Tableau projects are old, some of projects are abandoned and Custom SQL queries are out of sync with real table schemas, in this case SQL parser will be failing silently on attempt to resolve columns referenced by SQL queries. The feature will allow to ingest such a Tableau swamp and have at least table level linage and mostly column level linage (if columns are prefixed with aliases and * is not used).

How to use

source:
    type: tableau
    config:
        ...
        force_extraction_of_lineage_from_custom_sql_queries: true
        disable_schema_awarenes_during_parsing_of_sql_queries: true
        ingest_tables_external: true
        ...

Also, the PR contains some bug fixes for issues that were found by us during ingestion of metadata from Tableau

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Feb 14, 2024
@hsheth2
Copy link
Collaborator

hsheth2 commented Feb 14, 2024

@alexs-101 thanks for the PR! Before diving into the code, I wanted to make sure I understood the motivation behind each feature. Btw - it makes it a lot easier to review if things are split up into multiple PRs

Things that make sense to me

  • force_extraction_of_lineage_from_custom_sql_queries - agree that tableau's lineage is pretty bad, and we should prefer our own to theirs. Likely will want to flip the default on this in the future
  • The salesforce dialect hack - seems reasonable

Things I'd like to discuss

  • The TSQLv2 dialect - ideally we contribute this back to sqlglot, or if they're not willing to accept it, we can merge it into the acryl-sqlglot fork
  • disable_schema_awarenes_during_parsing_of_sql_queries and the schema_aware flag - I didn't realize that having custom SQL queries digress from the underlying table schemas was a common thing.
    • My goal in building the initial parser was that it would be permissive enough that it could handle missing columns gracefully. Do you have some specific examples where a certain query + schema info would fail, but just the query with no schema info generates results that are more correct? If so, maybe we can fix those instances, and avoid needing disabling schema awareness across the board
    • Also, you can also disable schema-awareness by not passing a graph instance to the SchemaResolver - that feels like a cleaner approach
  • Adding in_tables_schemas - it seems like this can be fully derived from the rest of the SqlParsingResult? The logic might be better suited to live in a helper method, and not part of the core SqlParsingResult

@hsheth2
Copy link
Collaborator

hsheth2 commented Feb 14, 2024

One other thing - it seems like the t-sql changes only make sense when QUOTED_IDENTIFIER is OFF (as per https://stackoverflow.com/questions/62917351/what-is-the-syntax-of-a-string-literal-in-t-sql) - so I'm not sure it makes sense to change the default. It might make sense to make this configurable in some way

@alexs-101
Copy link
Contributor Author

@hsheth2 big thanx for quick feedback! I will answer (or fix) everything tomorrow morning with a fresh mind.

@alexs-101
Copy link
Contributor Author

alexs-101 commented Mar 5, 2024

tbh the fact that we have ingest_tables_external at all feels pretty strange to me - I guess it makes sense for CSVs and things, but for snowflake/other warehouse tables, we really should just rely on those other sources. But I guess it's alright for now, especially as it's disabled by default

it's a very helpful feature for those who has ability to ingest Tableau only and don't have access to other metadata sources. cold we keep it?

especially as it's disabled by default

yep

@alexs-101
Copy link
Contributor Author

@hsheth2 finished, could you start tests again, please?

@alexs-101
Copy link
Contributor Author

@hsheth2

Seems likt this test failed

test_create_list_get_ingestion_execution_request

E         - urn:li:dataHubExecutionRequest:219530d5-775d-4a38-9c93-d393cc8239a4
E         + urn:li:dataHubExecutionRequest:af738e1d-4870-4519-af1a-fc88c1d57d6b

I checked, it can not be due to my changes

@alexs-101
Copy link
Contributor Author

alexs-101 commented Mar 7, 2024

@hsheth2 i pulled in latest changes from master, could you please start tests again?

@hsheth2 hsheth2 added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Mar 18, 2024
@alexs-101
Copy link
Contributor Author

alexs-101 commented Mar 19, 2024

@hsheth2 I checked failed tests and all of them look like failed NOT due to changes brought by the PR

FAILED tests/integration/s3/test_s3.py::test_data_lake_s3_ingest[multiple_files.json] - Failed: Metadata files differ (use `pytest --update-golden-files` to update):

Urn changed, urn:li:dataset:(urn:li:dataPlatform:s3,test-platform-instance.my-test-bucket/folder_a/folder_aa/folder_aaa/NPS.7.1.package_data_NPS.6.1_ARCN_Lakes_ChemistryData_v1_csv.csv,DEV):
<schemaMetadata> changed:
	Item aspect['fields'][20]['type']['type']['com.linkedin.schema.DateType'] added to dictionary.
	Item aspect['fields'][20]['type']['type']['com.linkedin.schema.StringType'] removed from dictionary.
	Value of aspect['fields'][20]['nativeDataType'] changed from "string" to "date".
FAILED tests/integration/s3/test_s3.py::test_data_lake_local_ingest[multiple_files.json] - Failed: Metadata files differ (use `pytest --update-golden-files` to update):

Urn changed, urn:li:dataset:(urn:li:dataPlatform:file,test-platform-instance.tests/integration/s3/test_data/local_system/folder_a/folder_aa/folder_aaa/NPS.7.1.package_data_NPS.6.1_ARCN_Lakes_ChemistryData_v1_csv.csv,DEV):
<schemaMetadata> changed:
	Item aspect['fields'][20]['type']['type']['com.linkedin.schema.DateType'] added to dictionary.
	Item aspect['fields'][20]['type']['type']['com.linkedin.schema.StringType'] removed from dictionary.
	Value of aspect['fields'][20]['nativeDataType'] changed from "string" to "date".

I will fix it, if you say - fix it. :)

@egemenberk
Copy link
Contributor

egemenberk commented Mar 20, 2024

@hsheth2 I checked failed tests and all of them look like failed NOT due to changes brought by the PR

........

I will fix it, if you say - fix it. :)

+1 They also fail in my PR #9874 as well

@hsheth2
Copy link
Collaborator

hsheth2 commented Mar 21, 2024

@alexs-101 @egemenberk I agree those failures look unrelated, and I believe have been fixed on the master branch. I just synced this branch with master, and hopefully that unblocks CI

@treff7es treff7es merged commit e6e5c09 into datahub-project:master Mar 21, 2024
53 checks passed
@usmanovbf
Copy link

found bug #10357 when the force_extraction_of_lineage_from_custom_sql_queries property is not leveraged and the error is thrown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment