Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingest/sigma): Sigma connector integration #10037

Merged

Conversation

shubhamjagtap639
Copy link
Contributor

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata product PR or Issue related to the DataHub UI/UX devops PR or Issue related to DataHub backend & deployment community-contribution PR or Issue raised by member(s) of DataHub Community labels Mar 13, 2024
Copy link
Collaborator

@hsheth2 hsheth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functionality:

  • seems like none of the workspaces have any entities nested within them - this suggests that the container paths are not set up correctly
  • can we generate more details about sigma datasets? e.g. schema, properties, documentation, ownership?
  • datasets are part of workspaces/folders right? they should have a container aspect
  • what happens to folders within a workspace?
  • can we get metadata from "badges" in sigma - those would probably be pretty valuable
  • for charts/dashboards, can we generate externalUrl info so we get the "View in Sigma" button in the UI

for workspace_dict in response.json():
workspaces.append(Workspace.parse_obj(workspace_dict))
except Exception as e:
self._log_http_error(message=f"Unable to fetch workspaces. Exception: {e}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this happens, it should appear as a "failure" in the source report

)

def _get_sigma_dataset_identifier(self, dataset: SigmaDataset) -> str:
return f"{dataset.datasetId}".lower()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we making these lowercased?

@shubhamjagtap639
Copy link
Contributor Author

shubhamjagtap639 commented Mar 20, 2024

Functionality:

  • seems like none of the workspaces have any entities nested within them - this suggests that the container paths are not set up correctly
  • can we generate more details about sigma datasets? e.g. schema, properties, documentation, ownership?
  • datasets are part of workspaces/folders right? they should have a container aspect
  • what happens to folders within a workspace?
  • can we get metadata from "badges" in sigma - those would probably be pretty valuable
  • for charts/dashboards, can we generate externalUrl info so we get the "View in Sigma" button in the UI
  • Get Dataset API don't have schema metadata.
  • Should I map folder with container entity or use browsepath aspect for this?
  • Yes, we can get badges metadata but where to map it?

@hsheth2
Copy link
Collaborator

hsheth2 commented Mar 20, 2024

Should I map folder with container entity or use browsepath aspect for this?

you should emit a browsePathsV2 aspect, where the first parts are dataPlatformInstance (if set) and workspace container urns, and the remaining parts are strings with the folder names

Yes, we can get badges metadata but where to map it?

They should become tags in datahub, prefixed with "sigma:"

name: str
description: str
createdBy: str
createdAt: datetime
updatedAt: datetime
url: str
path: str
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be better to keep this as a List[str] - that way if someone has a / in their folder name, we still handle it correctly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are getting path attribute details from API in string format. If we later convert it to list of string with any one folder containing / in name, still that folder name will get split.

and "*" in self.config.chart_sources_platform_mapping
):
data_source_platform_details = self.config.chart_sources_platform_mapping[
"*"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the * is basically a "fallback" platform detail if nothing else matches?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If user wants to provide platform details to all sources used by all charts which got ingested, * as key can be used.

inputs[
self._gen_sigma_dataset_urn(source_id)
] = in_table_urn
sql_parser_in_tables.remove(in_table_urn)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we'll probably need to come back to this to generate CLL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To generate CLL, we don't have schema metadata for datasets. We do have columns details used by elements/charts, but can't map that columns with there tables.

self.users: Dict = {}
self.workspaces: Dict[str, Workspace] = {}
self.users: Dict[str, str] = {}
self.datasets: Dict[str, str] = {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we saving these? I would've expected these to be saved in sigma.py, not in sigma_api.py

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed self.datasets but others can't be moved or removed. We even can't use functools.lru_cache as it gives problem with function fetching response from GET API.

```

#### Example - For all ingested charts
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow up - the order of these examples should be reversed, and they should be rephrased as below

Suggested change
#### Example - For all ingested charts
#### Example - All workbooks use the same connection

@@ -22,7 +22,6 @@ def __init__(self, config: SigmaSourceConfig) -> None:
self.config = config
self.workspaces: Dict[str, Workspace] = {}
self.users: Dict[str, str] = {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my comment was about users and workspaces too - those should probably not be getting saved here?

@hsheth2 hsheth2 added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Apr 16, 2024
@hsheth2 hsheth2 merged commit 90c1249 into datahub-project:master Apr 16, 2024
60 of 61 checks passed
sleeperdeep pushed a commit to sleeperdeep/datahub that referenced this pull request Jun 25, 2024
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community devops PR or Issue related to DataHub backend & deployment ingestion PR or Issue related to the ingestion of metadata merge-pending-ci A PR that has passed review and should be merged once CI is green. product PR or Issue related to the DataHub UI/UX
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants