Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingest/looker): browse path followups #10217

Merged
merged 13 commits into from
Apr 12, 2024

Conversation

mayurinehate
Copy link
Collaborator

@mayurinehate mayurinehate commented Apr 5, 2024

Follow up on #10147

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@mayurinehate mayurinehate marked this pull request as ready for review April 10, 2024 13:42
due to multithreading, occassionally browse path v2 was not generated
correctly for some folders.
@@ -325,7 +339,7 @@ def auto_browse_path_v2(
yield MetadataChangeProposalWrapper(
entityUrn=urn,
aspect=BrowsePathsV2Class(
path=_prepend_platform_instance([], platform, platform_instance)
path=prepend_platform_instance([], platform, platform_instance)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when using a platform instance with manually generated browse path v2, is the source responsible for adding the platform instance or does this helper do it?

I want to make sure we don't have cases where it's prepended twice or more for folders deep in the hierarchy

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intended to keep it as responsibility of source to have platform instance prepended in case its manually emitting browse path v2 but I see a corner case now, where the platform instance may get prepended twice - from within helper - to children of the container that emits such browse path v2.

I can

  1. add a check in prepend_platform_instance not to prepend if browse path already has a platform instance entry at its start
  2. or leave the responsibility of prepending platform instance only to source helper - across all sources - i.e. no source should ever prepend platform instance in browse path v2.

I'm inclined to do 2nd for looker but I also think, it wouldn't harm to have a safety net (as mentioned in point 1). WDYT ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let’s do the 2nd one

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

cls.get_platform_id = lambda: id or platform_name.lower().replace(" ", "-")
cls.get_platform_id = lambda *args: id or platform_name.lower().replace(
" ", "-"
)
Copy link
Collaborator Author

@mayurinehate mayurinehate Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

browse path source helper relies on platform attribute of source or platform attribute of source's config to generate platform instance however, the platform is set to lookml in lookml source and is used in stale metadata removal job id generation. Earlier, this caused platform instance to be emitted with lookml platform.

This PR changes browse path source helper to use get_platform_id as preferred method to get platform name. get_platform_id added as part of @platform_name wrapper (used for grouping sources together in documentation) can be used as a standard mechanism for identifying platform of a source. It accepts separate platform name and id and can be flexibly used to separate display name from platform's id.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is not working for s3 where platform is picked from config.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted this logic and added logic to prefer "platform_name" from config if its present.

paths: Dict[str, List[BrowsePathEntryClass]] = {}

emitted_urns: Set[str] = set()
containers_used_as_parent: Set[str] = set()
for urn, batch in _batch_workunits_by_urn(stream):
container_path: Optional[List[BrowsePathEntryClass]] = None
legacy_path: Optional[List[BrowsePathEntryClass]] = None
has_browse_path_v2 = False
browse_path_v2: Optional[List[BrowsePathEntryClass]] = None
Copy link
Collaborator Author

@mayurinehate mayurinehate Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usage of has_browse_path_v2 replaced by browse_path_v2 is not None

@hsheth2 hsheth2 merged commit 8b79461 into datahub-project:master Apr 12, 2024
57 checks passed
sleeperdeep pushed a commit to sleeperdeep/datahub that referenced this pull request Jun 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ingestion PR or Issue related to the ingestion of metadata release-0.13.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants