
Fix platform instance support on Druid ingestion #12716

Merged (4 commits) on Feb 26, 2025

Conversation

Contributor

@Rasnar Rasnar commented Feb 24, 2025

If a platform instance was specified for a Druid ingestion, it was always applied twice, because it was already prepended in `get_identifier` and then added again further down the line (see #11639).
This may break existing users who already rely on the broken platform-instance implementation, but they could still manually set it to the old pattern in their DataHub recipe file.
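A minimal sketch of how the duplication described above arises when both the source-level identifier and the URN builder prepend the platform instance. The function names and URN format here are illustrative simplifications, not the actual DataHub internals:

```python
# Hypothetical sketch of the bug; names are illustrative, not DataHub's API.

def get_identifier(platform_instance, schema, table):
    # Broken behaviour: the Druid source prepended the platform instance
    # to the table identifier itself.
    return f"{platform_instance}.{table}" if platform_instance else table

def make_dataset_urn(platform, name, platform_instance=None):
    # Downstream, the shared SQL code adds the platform instance again
    # when building the dataset URN.
    prefix = f"{platform_instance}." if platform_instance else ""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{prefix}{name},PROD)"

name = get_identifier("prod_druid", "druid", "my_table")
urn = make_dataset_urn("druid", name, platform_instance="prod_druid")
# The instance ends up twice in the URN:
# urn:li:dataset:(urn:li:dataPlatform:druid,prod_druid.prod_druid.my_table,PROD)
```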


@github-actions bot added the ingestion, docs, and community-contribution labels on Feb 24, 2025
@datahub-cyborg bot added the needs-review label on Feb 24, 2025

codecov bot commented Feb 24, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Files with missing lines | Coverage Δ
...ngestion/src/datahub/ingestion/source/sql/druid.py | 100.00% <100.00%> (ø)

... and 10 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f14c42d...16fdb6a.

The diff under discussion removes the platform-instance branch from `get_identifier` (surrounding context reconstructed from the linked original implementation):

```python
# Removed:
#     return (
#         f"{self.platform_instance}.{table}"
#         if self.platform_instance
#         else f"{table}"
#     )
# Replaced with:
return f"{table}"
```
Contributor
This restores the original implementation from 2021 (https://github.com/datahub-project/datahub/pull/2284/files),
which was updated in 2022 (https://github.com/datahub-project/datahub/pull/3996/files#diff-d2b27025056d73e6bf1508f17c065c63dd4249caacf26de6199aaeaf446eec60).

Have we been producing "duplicated" platform instance since then? 🤔

Contributor Author
Yes, I'm pretty sure that is the case, or something changed along the way.
But you can see that the default SQLAlchemy source has no mention of a platform_instance:
https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/sql_common.py#L576

If you check any other SQL ingestion with platform_instance support, e.g. mssql, you can see that there is also no need to specify the platform_instance at this level:
https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/mssql/source.py#L731

The Druid ingestion is probably not used much, because you will find other big inconsistencies in it. For example, the Druid table name is always hardcoded to druid/v2/sql, because the SQLAlchemy implementation by default uses the URI to compute the table name. That should probably be fixed at some point.
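For comparison, a sketch of the behaviour this PR restores: the identifier stays instance-free, and the platform instance is applied exactly once when the URN is built (again with illustrative names, not the real DataHub API):

```python
# Hypothetical sketch of the corrected behaviour; simplified, not DataHub's API.

def get_identifier(schema, table):
    # Fixed behaviour: the identifier no longer embeds the platform instance.
    return table

def make_dataset_urn(platform, name, platform_instance=None):
    # The platform instance is applied in one place only: URN construction.
    prefix = f"{platform_instance}." if platform_instance else ""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{prefix}{name},PROD)"

urn = make_dataset_urn("druid", get_identifier("druid", "my_table"),
                       platform_instance="prod_druid")
# The instance now appears exactly once in the URN.
```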

Contributor
Integration tests for the Druid source were missing; I just added them here: #12717.
And indeed, the golden file shows the platform instance appearing twice in the generated URNs for datasets.

Let's merge the PR with the integration tests first, and then yours. Thanks for raising the issue and the fix.

Contributor Author
I will wait for #12717 to get merged and rebase this PR on your changes.
Thanks a lot for the support!

Contributor
PR #12717 merged

@datahub-cyborg bot added the pending-submitter-response label and removed needs-review on Feb 24, 2025
@datahub-cyborg bot added the needs-review label and removed pending-submitter-response on Feb 25, 2025
Contributor

@sgomezvillamor sgomezvillamor left a comment
👍 Thanks for the contrib!

@datahub-cyborg bot added the merge-pending-ci label and removed needs-review on Feb 26, 2025
Contributor Author

@Rasnar Rasnar commented Feb 26, 2025

One of the builds seems to have failed on an unrelated test:

=========================== short test summary info ============================
ERROR tests/integration/kafka-connect/test_kafka_connect.py::test_kafka_connect_ingest - Exception: Timeout reached while waiting on service!
ERROR tests/integration/kafka-connect/test_kafka_connect.py::test_kafka_connect_mongosourceconnect_ingest - Exception: Timeout reached while waiting on service!
ERROR tests/integration/kafka-connect/test_kafka_connect.py::test_kafka_connect_s3sink_ingest - Exception: Timeout reached while waiting on service!
ERROR tests/integration/kafka-connect/test_kafka_connect.py::test_kafka_connect_ingest_stateful - Exception: Timeout reached while waiting on service!
ERROR tests/integration/kafka-connect/test_kafka_connect.py::test_kafka_connect_bigquery_sink_ingest - Exception: Timeout reached while waiting on service!
==== 21 passed, 2008 deselected, 20 warnings, 5 errors in 712.74s (0:11:52) ====

Not sure if it's possible to restart this single job instead of the whole pipeline.

Contributor

@sgomezvillamor sgomezvillamor commented:

No worries. Those Kafka errors in testIntegrationBatch1 are being addressed in a separate PR.
I'm merging and closing the related issue. Thanks!

@sgomezvillamor sgomezvillamor merged commit 5f5e395 into datahub-project:master Feb 26, 2025
186 of 189 checks passed
shirshanka pushed a commit to shirshanka/datahub that referenced this pull request Mar 3, 2025
Co-authored-by: rasnar <11248833+Rasnar@users.noreply.github.com>
Co-authored-by: Sergio Gómez Villamor <sgomezvillamor@gmail.com>
Labels: community-contribution, docs, ingestion, merge-pending-ci
2 participants