
Fix platform instance support on Druid ingestion #12716

Merged (4 commits) on Feb 26, 2025

Conversation

Contributor

@Rasnar Rasnar commented Feb 24, 2025

If a platform instance was specified for a Druid ingestion, it was always applied twice, because it was already prepended in `get_identifier` and then added again further down the line (see #11639).
This may break existing users who already rely on the broken platform-instance implementation, but they could still manually set it to the old pattern in their DataHub recipe file.
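A minimal sketch of how the duplication described above arises when both the source-level identifier and the URN builder prepend the platform instance. The function names and URN format here are illustrative simplifications, not the actual DataHub internals:

```python
# Hypothetical sketch of the bug; names are illustrative, not DataHub's API.

def get_identifier(platform_instance, schema, table):
    # Broken behaviour: the Druid source prepended the platform instance
    # to the table identifier itself.
    return f"{platform_instance}.{table}" if platform_instance else table

def make_dataset_urn(platform, name, platform_instance=None):
    # Downstream, the shared SQL code adds the platform instance again
    # when building the dataset URN.
    prefix = f"{platform_instance}." if platform_instance else ""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{prefix}{name},PROD)"

name = get_identifier("prod_druid", "druid", "my_table")
urn = make_dataset_urn("druid", name, platform_instance="prod_druid")
# The instance ends up twice in the URN:
# urn:li:dataset:(urn:li:dataPlatform:druid,prod_druid.prod_druid.my_table,PROD)
```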


@github-actions bot added the ingestion, docs, and community-contribution labels on Feb 24, 2025
@datahub-cyborg bot added the needs-review label on Feb 24, 2025

codecov bot commented Feb 24, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Files with missing lines | Coverage Δ
...ngestion/src/datahub/ingestion/source/sql/druid.py | 100.00% <100.00%> (ø)

... and 10 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f14c42d...16fdb6a.

The diff under discussion removes the platform-instance branch from `get_identifier` (surrounding context reconstructed from the linked original implementation):

```python
# Removed:
#     return (
#         f"{self.platform_instance}.{table}"
#         if self.platform_instance
#         else f"{table}"
#     )
# Replaced with:
return f"{table}"
```
Contributor
This restores the original implementation from 2021 (https://github.com/datahub-project/datahub/pull/2284/files),
which was updated in 2022 (https://github.com/datahub-project/datahub/pull/3996/files#diff-d2b27025056d73e6bf1508f17c065c63dd4249caacf26de6199aaeaf446eec60).

Have we been producing "duplicated" platform instance since then? 🤔

Contributor Author
Yes, I'm pretty sure that is the case, or something changed along the way.
But you can see that the default SQLAlchemy source has no mention of a platform_instance:
https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/sql_common.py#L576

If you check any other SQL ingestion with platform_instance support, e.g. mssql, you can see that there is also no need to specify the platform_instance at this level:
https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/mssql/source.py#L731

The Druid ingestion is probably not used much, because you will find other big inconsistencies in it. For example, the Druid table name is always hardcoded to druid/v2/sql, because the SQLAlchemy implementation by default uses the URI to compute the table name. That should probably be fixed at some point.
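For comparison, a sketch of the behaviour this PR restores: the identifier stays instance-free, and the platform instance is applied exactly once when the URN is built (again with illustrative names, not the real DataHub API):

```python
# Hypothetical sketch of the corrected behaviour; simplified, not DataHub's API.

def get_identifier(schema, table):
    # Fixed behaviour: the identifier no longer embeds the platform instance.
    return table

def make_dataset_urn(platform, name, platform_instance=None):
    # The platform instance is applied in one place only: URN construction.
    prefix = f"{platform_instance}." if platform_instance else ""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{prefix}{name},PROD)"

urn = make_dataset_urn("druid", get_identifier("druid", "my_table"),
                       platform_instance="prod_druid")
# The instance now appears exactly once in the URN.
```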

Contributor
Integration tests for the Druid source were missing; I just added them here: #12717.
And indeed, the golden file shows the platform instance appearing twice in the generated URNs for datasets.

Let's merge the PR with the integration tests first, and then yours. Thanks for raising the issue and the fix.

Contributor Author
I will wait for #12717 to get merged and rebase this PR on your changes.
Thanks a lot for the support!

Contributor
PR #12717 merged

@datahub-cyborg bot added the pending-submitter-response label and removed needs-review on Feb 24, 2025
@datahub-cyborg bot added the needs-review label and removed pending-submitter-response on Feb 25, 2025
Contributor

@sgomezvillamor sgomezvillamor left a comment
👍 Thanks for the contrib!

@datahub-cyborg bot added the merge-pending-ci label and removed needs-review on Feb 26, 2025
Contributor Author

@Rasnar Rasnar commented Feb 26, 2025

One of the builds seems to have failed on an unrelated test:

=========================== short test summary info ============================
ERROR tests/integration/kafka-connect/test_kafka_connect.py::test_kafka_connect_ingest - Exception: Timeout reached while waiting on service!
ERROR tests/integration/kafka-connect/test_kafka_connect.py::test_kafka_connect_mongosourceconnect_ingest - Exception: Timeout reached while waiting on service!
ERROR tests/integration/kafka-connect/test_kafka_connect.py::test_kafka_connect_s3sink_ingest - Exception: Timeout reached while waiting on service!
ERROR tests/integration/kafka-connect/test_kafka_connect.py::test_kafka_connect_ingest_stateful - Exception: Timeout reached while waiting on service!
ERROR tests/integration/kafka-connect/test_kafka_connect.py::test_kafka_connect_bigquery_sink_ingest - Exception: Timeout reached while waiting on service!
==== 21 passed, 2008 deselected, 20 warnings, 5 errors in 712.74s (0:11:52) ====

Not sure if it's possible to restart this single job instead of the whole pipeline.

Contributor

@sgomezvillamor sgomezvillamor commented:

No worries. Those Kafka errors in testIntegrationBatch1 are being addressed in a separate PR.
I'm merging and closing the related issue. Thanks!

@sgomezvillamor sgomezvillamor merged commit 5f5e395 into datahub-project:master Feb 26, 2025
186 of 189 checks passed
shirshanka pushed a commit to shirshanka/datahub that referenced this pull request Mar 3, 2025
Co-authored-by: rasnar <11248833+Rasnar@users.noreply.github.com>
Co-authored-by: Sergio Gómez Villamor <sgomezvillamor@gmail.com>
Labels: community-contribution, docs, ingestion, merge-pending-ci
2 participants