feat(ingest/sigma): Sigma connector integration #10037

Merged
4 changes: 4 additions & 0 deletions datahub-web-react/src/app/ingest/source/builder/constants.ts
@@ -33,6 +33,7 @@ import dynamodbLogo from '../../../../images/dynamodblogo.png';
import fivetranLogo from '../../../../images/fivetranlogo.png';
import csvLogo from '../../../../images/csv-logo.png';
import qlikLogo from '../../../../images/qliklogo.png';
import sigmaLogo from '../../../../images/sigmalogo.png';

export const ATHENA = 'athena';
export const ATHENA_URN = `urn:li:dataPlatform:${ATHENA}`;
@@ -119,6 +120,8 @@ export const CSV = 'csv-enricher';
export const CSV_URN = `urn:li:dataPlatform:${CSV}`;
export const QLIK_SENSE = 'qlik-sense';
export const QLIK_SENSE_URN = `urn:li:dataPlatform:${QLIK_SENSE}`;
export const SIGMA = 'sigma';
export const SIGMA_URN = `urn:li:dataPlatform:${SIGMA}`;

export const PLATFORM_URN_TO_LOGO = {
[ATHENA_URN]: athenaLogo,
@@ -157,6 +160,7 @@ export const PLATFORM_URN_TO_LOGO = {
[FIVETRAN_URN]: fivetranLogo,
[CSV_URN]: csvLogo,
[QLIK_SENSE_URN]: qlikLogo,
[SIGMA_URN]: sigmaLogo,
};

export const SOURCE_TO_PLATFORM_URN = {
7 changes: 7 additions & 0 deletions datahub-web-react/src/app/ingest/source/builder/sources.json
@@ -244,6 +244,13 @@
"docsUrl": "https://datahubproject.io/docs/generated/ingestion/sources/qlik-sense/",
"recipe": "source:\n type: qlik-sense\n config:\n # Coordinates\n tenant_hostname: https://xyz12xz.us.qlikcloud.com\n # Coordinates\n api_key: QLIK_API_KEY\n\n # Optional - filter for certain space names instead of ingesting everything.\n # space_pattern:\n\n # allow:\n # - space_name\n ingest_owner: true"
},
{
"urn": "urn:li:dataPlatform:sigma",
"name": "sigma",
"displayName": "Sigma",
"docsUrl": "https://datahubproject.io/docs/generated/ingestion/sources/sigma/",
"recipe": "source:\n type: sigma\n config:\n # Coordinates\n api_url: https://aws-api.sigmacomputing.com/v2\n # Coordinates\n client_id: CLIENT_ID\n client_secret: CLIENT_SECRET\n\n # Optional - filter for certain workspace names instead of ingesting everything.\n # workspace_pattern:\n\n # allow:\n # - workspace_name\n ingest_owner: true"
},
{
"urn": "urn:li:dataPlatform:cockroachdb",
"name": "cockroachdb",
Binary file added datahub-web-react/src/images/sigmalogo.png
74 changes: 74 additions & 0 deletions metadata-ingestion/docs/sources/sigma/sigma_pre.md
@@ -0,0 +1,74 @@
## Integration Details

This source extracts the following:

- Workspaces, and the workbooks within those workspaces, as Containers.
- Sigma Datasets as DataHub Datasets.
- Pages as DataHub Dashboards, and the elements within pages as Charts.

## Configuration Notes

1. Refer to the [doc](https://help.sigmacomputing.com/docs/generate-api-client-credentials) to generate API client credentials.
2. Provide the generated Client ID and Secret in the recipe (see the sketch below).
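
For reference, a minimal sketch of exchanging these credentials for an access token, assuming an OAuth2 client-credentials token endpoint at `{api_url}/auth/token` (the endpoint path and response shape are assumptions, not taken from this PR):

```python
# Hypothetical sketch: exchange client credentials for a bearer token.
# Assumes an OAuth2 client-credentials endpoint at {api_url}/auth/token.
import requests

api_url = "https://aws-api.sigmacomputing.com/v2"  # default from the recipe

resp = requests.post(
    f"{api_url}/auth/token",
    data={
        "grant_type": "client_credentials",
        "client_id": "CLIENT_ID",  # generated in the Sigma admin portal
        "client_secret": "CLIENT_SECRET",
    },
)
resp.raise_for_status()
access_token = resp.json()["access_token"]  # sent as a Bearer token on API calls
```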

## Concept mapping

| Sigma | DataHub | Notes |
|------------------------|---------------------------------------------------------------|----------------------------------|
| `Workspace` | [Container](../../metamodel/entities/container.md) | SubType `"Sigma Workspace"` |
| `Workbook` | [Container](../../metamodel/entities/container.md) | SubType `"Sigma Workbook"` |
| `Page` | [Dashboard](../../metamodel/entities/dashboard.md) | |
| `Element` | [Chart](../../metamodel/entities/chart.md) | |
| `Dataset` | [Dataset](../../metamodel/entities/dataset.md) | SubType `"Sigma Dataset"` |
| `User` | [User (a.k.a CorpUser)](../../metamodel/entities/corpuser.md) | Optionally Extracted |

## Advanced Configurations

### Chart source platform mapping
If you want to provide platform details (platform name, platform instance, and env) for all of a chart's external upstream data sources, use `chart_sources_platform_mapping` as shown in the examples below; a hypothetical lookup sketch follows the examples.

#### Example - For one specific chart's external upstream data sources
```yml
chart_sources_platform_mapping:
'workspace_name/workbook_name/chart_name_1':
data_source_platform: snowflake
platform_instance: new_instance
env: PROD

'workspace_name/folder_name/workbook_name/chart_name_2':
data_source_platform: postgres
platform_instance: cloud_instance
env: DEV
```

#### Example - For all charts within one specific workbook
```yml
chart_sources_platform_mapping:
'workspace_name/workbook_name_1':
data_source_platform: snowflake
platform_instance: new_instance
env: PROD

'workspace_name/folder_name/workbook_name_2':
data_source_platform: snowflake
platform_instance: new_instance
env: PROD
```

#### Example - For all workbooks' charts within one specific workspace
```yml
chart_sources_platform_mapping:
'workspace_name':
data_source_platform: snowflake
platform_instance: new_instance
env: PROD
```

#### Example - All workbooks use the same connection
```yml
chart_sources_platform_mapping:
'*':
data_source_platform: snowflake
platform_instance: new_instance
env: PROD
```
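
How a chart's folder path could be resolved against these keys is sketched below; the helper and its longest-prefix precedence are illustrative assumptions, not the connector's documented matching behavior:

```python
from typing import Dict, Optional

def resolve_platform_detail(
    chart_path: str, mapping: Dict[str, dict]
) -> Optional[dict]:
    """Illustrative lookup: prefer the longest configured prefix of the chart's
    'workspace/folder/workbook/chart' path, falling back to the '*' entry."""
    matches = [key for key in mapping if key != "*" and chart_path.startswith(key)]
    if matches:
        return mapping[max(matches, key=len)]
    return mapping.get("*")

mapping = {
    "workspace_name/workbook_name_1": {"data_source_platform": "snowflake"},
    "*": {"data_source_platform": "postgres"},
}
print(resolve_platform_detail("workspace_name/workbook_name_1/chart_a", mapping))
# -> {'data_source_platform': 'snowflake'}
```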
25 changes: 25 additions & 0 deletions metadata-ingestion/docs/sources/sigma/sigma_recipe.yml
@@ -0,0 +1,25 @@
source:
type: sigma
config:
# Coordinates
api_url: "https://aws-api.sigmacomputing.com/v2"
# Credentials
client_id: "CLIENTID"
client_secret: "CLIENT_SECRET"

# Optional - filter for certain workspace names instead of ingesting everything.
# workspace_pattern:
# allow:
# - workspace_name

ingest_owner: true

# Optional - map a Sigma workspace/workbook/chart folder path to the platform details of all chart data sources under that path.
# chart_sources_platform_mapping:
# folder_path:
# data_source_platform: postgres
# platform_instance: cloud_instance
# env: DEV

sink:
# sink configs
3 changes: 3 additions & 0 deletions metadata-ingestion/setup.py
@@ -417,6 +417,7 @@
"databricks": databricks | sql_common | sqllineage_lib,
"fivetran": snowflake_common | bigquery_common,
"qlik-sense": sqlglot_lib | {"requests", "websocket-client"},
"sigma": {"requests"},
}

# This is mainly used to exclude plugins from the Docker image.
@@ -553,6 +554,7 @@
"fivetran",
"kafka-connect",
"qlik-sense",
"sigma",
]
if plugin
for dependency in plugins[plugin]
@@ -660,6 +662,7 @@
"sql-queries = datahub.ingestion.source.sql_queries:SqlQueriesSource",
"fivetran = datahub.ingestion.source.fivetran.fivetran:FivetranSource",
"qlik-sense = datahub.ingestion.source.qlik_sense.qlik_sense:QlikSenseSource",
"sigma = datahub.ingestion.source.sigma.sigma:SigmaSource",
],
"datahub.ingestion.transformer.plugins": [
"pattern_cleanup_ownership = datahub.ingestion.transformer.pattern_cleanup_ownership:PatternCleanUpOwnership",
@@ -17,6 +17,7 @@ class DatasetSubTypes(str, Enum):
POWERBI_DATASET_TABLE = "PowerBI Dataset Table"
QLIK_DATASET = "Qlik Dataset"
BIGQUERY_TABLE_SNAPSHOT = "Bigquery Table Snapshot"
SIGMA_DATASET = "Sigma Dataset"

# TODO: Create separate entity...
NOTEBOOK = "Notebook"
@@ -45,6 +46,8 @@ class BIContainerSubTypes(str, Enum):
POWERBI_DATASET = "PowerBI Dataset"
QLIK_SPACE = "Qlik Space"
QLIK_APP = "Qlik App"
SIGMA_WORKSPACE = "Sigma Workspace"
SIGMA_WORKBOOK = "Sigma Workbook"


class JobContainerSubTypes(str, Enum):
Empty file.
86 changes: 86 additions & 0 deletions metadata-ingestion/src/datahub/ingestion/source/sigma/config.py
@@ -0,0 +1,86 @@
import logging
from dataclasses import dataclass
from typing import Dict, Optional

import pydantic

from datahub.configuration.common import AllowDenyPattern
from datahub.configuration.source_common import (
EnvConfigMixin,
PlatformInstanceConfigMixin,
)
from datahub.ingestion.source.state.stale_entity_removal_handler import (
StaleEntityRemovalSourceReport,
)
from datahub.ingestion.source.state.stateful_ingestion_base import (
StatefulIngestionConfigBase,
)

logger = logging.getLogger(__name__)


class Constant:
"""
Keys used in the Sigma plugin.
"""

# Rest API response key constants
ENTRIES = "entries"
FIRSTNAME = "firstName"
LASTNAME = "lastName"
EDGES = "edges"
DEPENDENCIES = "dependencies"
SOURCE = "source"
WORKSPACEID = "workspaceId"
PATH = "path"
NAME = "name"
URL = "url"
ELEMENTID = "elementId"
ID = "id"
PARENTID = "parentId"
TYPE = "type"
DATASET = "dataset"
WORKBOOK = "workbook"
BADGE = "badge"
NEXTPAGE = "nextPage"

# Source Config constants
DEFAULT_API_URL = "https://aws-api.sigmacomputing.com/v2"


@dataclass
class SigmaSourceReport(StaleEntityRemovalSourceReport):
number_of_workspaces: int = 0

def report_number_of_workspaces(self, number_of_workspaces: int) -> None:
self.number_of_workspaces = number_of_workspaces


class PlatformDetail(PlatformInstanceConfigMixin, EnvConfigMixin):
data_source_platform: str = pydantic.Field(
description="A chart's data sources platform name.",
)


class SigmaSourceConfig(
StatefulIngestionConfigBase, PlatformInstanceConfigMixin, EnvConfigMixin
):
api_url: str = pydantic.Field(
default=Constant.DEFAULT_API_URL, description="Sigma API hosted URL."
)
client_id: str = pydantic.Field(description="Sigma Client ID")
client_secret: str = pydantic.Field(description="Sigma Client Secret")
# Sigma workspace identifier
workspace_pattern: AllowDenyPattern = pydantic.Field(
default=AllowDenyPattern.allow_all(),
description="Regex patterns to filter Sigma workspaces in ingestion."
"Mention 'User Folder' if entities of 'My documents' need to ingest.",
)
ingest_owner: Optional[bool] = pydantic.Field(
default=True,
description="Ingest Owner from source. This will override Owner info entered from UI",
)
chart_sources_platform_mapping: Dict[str, PlatformDetail] = pydantic.Field(
default={},
description="A mapping of the sigma workspace/workbook/chart folder path to all chart's data sources platform details present inside that folder path.",
)
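
For reference, the recipe's `config` block maps onto this model through standard pydantic parsing; a minimal sketch (all values are placeholders):

```python
config = SigmaSourceConfig.parse_obj(
    {
        "api_url": "https://aws-api.sigmacomputing.com/v2",
        "client_id": "CLIENT_ID",
        "client_secret": "CLIENT_SECRET",
        "ingest_owner": True,
        "chart_sources_platform_mapping": {
            "workspace_name": {
                "data_source_platform": "snowflake",
                "platform_instance": "new_instance",
                "env": "PROD",
            }
        },
    }
)
detail = config.chart_sources_platform_mapping["workspace_name"]
assert detail.data_source_platform == "snowflake"  # parsed into a PlatformDetail
```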
@@ -0,0 +1,73 @@
from datetime import datetime
from typing import Dict, List, Optional

from pydantic import BaseModel, root_validator

from datahub.emitter.mcp_builder import ContainerKey


class WorkspaceKey(ContainerKey):
workspaceId: str


class WorkbookKey(ContainerKey):
workbookId: str


class Workspace(BaseModel):
workspaceId: str
name: str
createdBy: str
createdAt: datetime
updatedAt: datetime


class SigmaDataset(BaseModel):
datasetId: str
workspaceId: str
name: str
description: str
createdBy: str
createdAt: datetime
updatedAt: datetime
url: str
path: str
[Review comment - Collaborator]: would be better to keep this as a `List[str]` - that way if someone has a `/` in their folder name, we still handle it correctly

[Reply - Contributor Author]: We get the `path` attribute from the API as a string. If we later converted it to a list of strings, any folder name containing a `/` would still be split incorrectly.

badge: Optional[str] = None

@root_validator(pre=True)
def update_values(cls, values: Dict) -> Dict:
# The element lineage API provides this id as the source dataset id
values["datasetId"] = values["url"].split("/")[-1]
return values


class Element(BaseModel):
elementId: str
type: str
name: str
url: str
vizualizationType: Optional[str] = None
query: Optional[str] = None
columns: List[str] = []
upstream_sources: Dict[str, str] = {}


class Page(BaseModel):
pageId: str
name: str
elements: List[Element] = []


class Workbook(BaseModel):
workbookId: str
workspaceId: str
name: str
createdBy: str
updatedBy: str
createdAt: datetime
updatedAt: datetime
url: str
path: str
latestVersion: int
pages: List[Page] = []
badge: Optional[str] = None
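
To illustrate the `update_values` validator on `SigmaDataset` above, a minimal sketch of parsing an API-style payload (all field values are made up):

```python
raw = {
    "workspaceId": "ws-1",
    "name": "Sales Dataset",
    "description": "",
    "createdBy": "user-1",
    "createdAt": "2024-03-01T00:00:00Z",
    "updatedAt": "2024-03-01T00:00:00Z",
    "url": "https://app.sigmacomputing.com/org/dataset/abc123",
    "path": "workspace_name/folder_name",
}
dataset = SigmaDataset.parse_obj(raw)
# The pre=True root_validator derives datasetId from the last URL segment.
assert dataset.datasetId == "abc123"
```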