
feat(ingestion/glue): delta schemas #10299

Merged

Conversation

@sgomezvillamor (Contributor) commented Apr 16, 2024

When a Delta table is synced to Hive/Glue from Spark, its schema is wrongly reported as a single col (array<string>) column. Experience shows the issue happens only when using Scala and only for tables (not views), which is actually a very common case. Since the correct schema is serialized in the spark.sql.sources.schema.part.{i} table parameters, this PR parses and processes the schema from those properties so that such valuable metadata is not lost.
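For illustration, a minimal Python sketch (not the PR's exact implementation) of how the schema can be reassembled from those Glue table parameters; Spark splits the serialized JSON schema across numbered spark.sql.sources.schema.part.{i} keys, with the count in spark.sql.sources.schema.numParts:

```python
import json
from typing import Any, Dict, Optional


def reassemble_delta_schema(parameters: Dict[str, str]) -> Optional[Dict[str, Any]]:
    """Rebuild the Spark-serialized schema from Glue table parameters (sketch)."""
    num_parts = int(parameters.get("spark.sql.sources.schema.numParts", "0"))
    if parameters.get("spark.sql.sources.provider") != "delta" or num_parts == 0:
        return None
    # Concatenating the parts in order yields the full JSON schema document,
    # typically {"type": "struct", "fields": [{"name": ..., "type": ...}, ...]}.
    schema_json = "".join(
        parameters[f"spark.sql.sources.schema.part.{i}"] for i in range(num_parts)
    )
    return json.loads(schema_json)
```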

This is solving the following feature request: https://feature-requests.datahubproject.io/p/glue-crawler-support-for-correct-delta-schema-stored-in-properties

This bug in Spark has been reported many times but never fixed. Some references:

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions bot added the ingestion, community-contribution, and datahub-community-champion labels Apr 16, 2024
"nullable": true,
"type": {
"type": {
"com.linkedin.pegasus2avro.schema.NullType": {}
"com.linkedin.pegasus2avro.schema.NumberType": {}
@sgomezvillamor (Contributor Author):

Note for reviewers:
This is for the new mapping I added here
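(For context, a rough sketch of the kind of type mapping meant here; the class names come from DataHub's schema_classes module, but the Spark type keys and the mapping shape are assumptions, not the merged code:)

```python
from datahub.metadata.schema_classes import (
    BooleanTypeClass,
    NumberTypeClass,
    StringTypeClass,
)

# Spark SQL type name -> DataHub field type. Per the diff above, numeric
# fields now resolve to NumberType instead of falling through to NullType.
SPARK_SQL_TYPE_TO_DATAHUB_TYPE = {
    "boolean": BooleanTypeClass,
    "string": StringTypeClass,
    "integer": NumberTypeClass,
    "long": NumberTypeClass,
    "float": NumberTypeClass,
    "double": NumberTypeClass,
}
```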

Comment on lines 209 to 210
num_dataset_schema_invalid: int = 0
num_dataset_buggy_delta_schema: int = 0
@sgomezvillamor (Contributor Author):

Note for reviewers:
I added these two mainly for testing. I'm OK to rename or even remove them.

@mayurinehate (Collaborator):

This is nice. Let's please rename as

num_dataset_schema_invalid -> num_dataset_invalid_delta_schema
num_dataset_buggy_delta_schema -> num_dataset_valid_delta_schema

@sgomezvillamor (Contributor Author):

addressed in d634b10
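(A toy illustration of how the renamed counters could be bumped; the real report class and validation logic in the source differ:)

```python
import json
from dataclasses import dataclass


@dataclass
class GlueSourceReport:
    # Counter names follow the rename agreed above; the rest is illustrative.
    num_dataset_invalid_delta_schema: int = 0
    num_dataset_valid_delta_schema: int = 0


def count_delta_schema(report: GlueSourceReport, schema_json: str) -> None:
    try:
        json.loads(schema_json)
        report.num_dataset_valid_delta_schema += 1
    except json.JSONDecodeError:
        report.num_dataset_invalid_delta_schema += 1
```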

@@ -1796,397 +1796,6 @@
"lastRunId": "no-run-id-provided"
}
},
{
@sgomezvillamor (Contributor Author):

Note for the reviewers:

Removed aspects are not lost; they were duplicated, and updating the golden file resulted in removing the duplicated aspects.

@sgomezvillamor (Contributor Author):

@hsheth2 or @treff7es, as usual contributors to the ingestion codebase, could you have a look at this?

@mayurinehate mayurinehate self-requested a review May 8, 2024 06:03
@mayurinehate (Collaborator) left a comment:

Can we add a config extract_delta_schema_from_parameters to enable this behavior? It can default to True. If disabled, none of this delta customization would be used.

Can you confirm whether this issue is with a particular spark/delta version or all of them?

Also, if I understand correctly, this workaround can be entirely removed once the issue is fixed in a new spark/delta version - for example, if the delta.io PR that you have linked is merged, right? If so, please add a comment in the codebase to mention this.


@@ -1148,9 +1152,35 @@ def get_s3_tags() -> Optional[GlobalTagsClass]:
        return new_tags

    def get_schema_metadata() -> Optional[SchemaMetadata]:
        if not table.get("StorageDescriptor"):
        def is_delta_schema(columns: Optional[List[Mapping[str, Any]]]) -> bool:
@mayurinehate (Collaborator):

Defining functions within other functions is discouraged as per coding style. Can you please move this function outside?

@mayurinehate (Collaborator):

Can this be refactored to accept both tableParameters and tableStorageDescriptor and then return a boolean? This function can subsume this check as well -> (provider == "delta") and (num_parts > 0)
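(A sketch of the suggested refactor; the signature and the Glue column key names are assumptions, not the merged code:)

```python
from typing import Any, Dict, List, Mapping, Optional


def is_delta_schema(
    parameters: Dict[str, str],
    columns: Optional[List[Mapping[str, Any]]],
) -> bool:
    # Subsumes the (provider == "delta") and (num_parts > 0) check...
    provider = parameters.get("spark.sql.sources.provider")
    num_parts = int(parameters.get("spark.sql.sources.schema.numParts", "0"))
    # ...and the detection of the buggy Hive-synced placeholder schema,
    # which reports the whole table as a single col: array<string> column.
    placeholder = (
        columns is not None
        and len(columns) == 1
        and columns[0].get("Name") == "col"
        and columns[0].get("Type") == "array<string>"
    )
    return provider == "delta" and num_parts > 0 and placeholder
```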

@sgomezvillamor (Contributor Author):

> Defining functions within other functions is discouraged as per coding style. Can you please move this function outside?

While I do agree, I just followed the existing pattern in the code. Note that get_owner, get_dataset_properties, and get_s3_tags are all defined within the _extract_record method.

So, are you suggesting moving is_delta_schema to the _extract_record level too, or to the GlueSource level?

@sgomezvillamor (Contributor Author):

I have just moved is_delta_schema to the _extract_record level, to keep it aligned with the other existing methods in the codebase.

    return None

def _get_glue_schema_metadata() -> Optional[SchemaMetadata]:
    assert table.get("StorageDescriptor")
@mayurinehate (Collaborator):

We can remove this assert, as this is already checked earlier.

@sgomezvillamor (Contributor Author):

Thanks for the review, @mayurinehate. I will address the comments soon.

> Can we add a config extract_delta_schema_from_parameters to enable this behavior? It can default to True. If disabled, none of this delta customization would be used.

Sure, good point!

> Can you confirm whether this issue is with a particular spark/delta version or all of them?

Not a particular version but a long-standing, general issue. As you can see in the links listed in the PR description, the community has been recurrently complaining about it.

> Also, if I understand correctly, this workaround can be entirely removed once the issue is fixed in a new spark/delta version - for example, if the delta.io PR that you have linked is merged, right? If so, please add a comment in the codebase to mention this.

True. As soon as the Hive integration with Spark correctly provides the schema as expected in the StorageProperties, this could be removed. Whether that will happen in the PR I linked or in some other one is hard to predict 😄
I will provide more context in the comment here:

# https://github.com/delta-io/delta/pull/2310
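(One possible wording for that codebase comment; this is assumed, not quoted from the merged code:)

```python
# Workaround: Spark's Hive sync reports delta tables to Glue with a
# placeholder schema (a single `col: array<string>` column), so the real
# schema is reconstructed from the spark.sql.sources.schema.part.{i}
# table parameters instead. Once this is fixed upstream, e.g. if
# https://github.com/delta-io/delta/pull/2310 is merged, this extraction
# can be removed.
```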

@sgomezvillamor (Contributor Author):

Added the extract_delta_schema_from_parameters config option with a False default value in d634b10.

@sgomezvillamor (Contributor Author):

@mayurinehate Could you please review again? Thanks

@mayurinehate (Collaborator) left a comment:

Minor edits suggested. Everything else looks good to me.

extract_delta_schema_from_parameters: Optional[bool] = Field(
    default=False,
    description="If enabled, delta schemas can be alternatively fetched from table parameters "
    "(https://github.com/delta-io/delta/pull/2310)",
@mayurinehate (Collaborator):

Let's please move this PR's link to a code comment on this config, rather than keeping it in the config's description.
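For reference, a hypothetical snippet showing how a user could enable the new flag in a programmatic ingestion pipeline; every recipe value apart from the new option is illustrative:

```python
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "glue",
            "config": {
                "aws_region": "us-east-1",  # illustrative
                "extract_delta_schema_from_parameters": True,
            },
        },
        "sink": {"type": "console"},
    }
)
pipeline.run()
pipeline.raise_from_status()
```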

sgomezvillamor and others added 3 commits May 16, 2024 13:52
Co-authored-by: Mayuri Nehate <33225191+mayurinehate@users.noreply.github.com>
Co-authored-by: Mayuri Nehate <33225191+mayurinehate@users.noreply.github.com>
@sgomezvillamor (Contributor Author):

@mayurinehate Thanks for the feedback and approval.
I have added your latest suggestions and resolved the conflicts.

The build is all green except for a nifi test. Surprisingly, the test complains about not matching the golden file on 3.10 while it works fine on 3.8. Do you think that is related to my updates, or is it just some random error?

@mayurinehate (Collaborator):

CI Failures are unrelated to the changes in the PR.

@treff7es treff7es merged commit 0059960 into datahub-project:master May 17, 2024
57 of 58 checks passed
@sgomezvillamor sgomezvillamor deleted the feat-ingestion-glue-delta-schema branch May 22, 2024 02:31
sleeperdeep pushed a commit to sleeperdeep/datahub that referenced this pull request Jun 25, 2024
Co-authored-by: Mayuri Nehate <33225191+mayurinehate@users.noreply.github.com>