Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Docs] - Clean and centralise YAML docs #13769

Merged
merged 6 commits into from
Oct 30, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -111,6 +111,11 @@ Again, this information will be added on top of the inferred schema from the dat
}
```

{% /codeBlock %}

{% /codePreview %}


### Global Manifest

You can also manage a **single** manifest file to centralize the ingestion process for any container. In that case,
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
#### Source Configuration - Source Config

{% codeInfo srNumber=100 %}

The `sourceConfig` is defined [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/dashboardServiceMetadataPipeline.json):

- **dbServiceNames**: Database Service Names for ingesting lineage if the source supports it.
- **dashboardFilterPattern**, **chartFilterPattern**, **dataModelFilterPattern**: Note that all of them support regex as include or exclude. E.g., "My dashboard, My dash.*, .*Dashboard".
- **projectFilterPattern**: Filter the dashboards, charts and data sources by projects. Note that all of them support regex as include or exclude. E.g., "My project, My proj.*, .*Project".
- **includeOwners**: Set the 'Include Owners' toggle to control whether to include owners to the ingested entity if the owner email matches with a user stored in the OM server as part of metadata ingestion. If the ingested entity already exists and has an owner, the owner will not be overwritten.
- **includeTags**: Set the 'Include Tags' toggle to control whether to include tags in metadata ingestion.
- **includeDataModels**: Set the 'Include Data Models' toggle to control whether to include tags as part of metadata ingestion.
- **markDeletedDashboards**: Set the 'Mark Deleted Dashboards' toggle to flag dashboards as soft-deleted if they are not present anymore in the source system.

{% /codeInfo %}
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
```yaml {% srNumber=100 %}
sourceConfig:
config:
type: DashboardMetadata
overrideOwner: True
# dbServiceNames:
# - service1
# - service2
# dashboardFilterPattern:
# includes:
# - dashboard1
# - dashboard2
# excludes:
# - dashboard3
# - dashboard4
# chartFilterPattern:
# includes:
# - chart1
# - chart2
# excludes:
# - chart3
# - chart4
# projectFilterPattern:
# includes:
# - project1
# - project2
# excludes:
# - project3
# - project4
```
Original file line number Diff line number Diff line change
@@ -0,0 +1,224 @@
## Data Profiler

The Data Profiler workflow will be using the `orm-profiler` processor.

After running a Metadata Ingestion workflow, we can run Data Profiler workflow.
While the `serviceName` will be the same to that was used in Metadata Ingestion, so the ingestion bot can get the `serviceConnection` details from the server.


### 1. Define the YAML Config

This is a sample config for the profiler:

{% codePreview %}

{% codeInfoContainer %}

{% codeInfo srNumber=13 %}
#### Source Configuration - Source Config

You can find all the definitions and types for the `sourceConfig` [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/databaseServiceProfilerPipeline.json).

**generateSampleData**: Option to turn on/off generating sample data.

{% /codeInfo %}

{% codeInfo srNumber=14 %}

**profileSample**: Percentage of data or no. of rows we want to execute the profiler and tests on.

{% /codeInfo %}

{% codeInfo srNumber=15 %}

**threadCount**: Number of threads to use during metric computations.

{% /codeInfo %}

{% codeInfo srNumber=16 %}

**processPiiSensitive**: Optional configuration to automatically tag columns that might contain sensitive information.

{% /codeInfo %}

{% codeInfo srNumber=17 %}

**confidence**: Set the Confidence value for which you want the column to be marked

{% /codeInfo %}


{% codeInfo srNumber=18 %}

**timeoutSeconds**: Profiler Timeout in Seconds

{% /codeInfo %}

{% codeInfo srNumber=19 %}

**databaseFilterPattern**: Regex to only fetch databases that matches the pattern.

{% /codeInfo %}

{% codeInfo srNumber=20 %}

**schemaFilterPattern**: Regex to only fetch tables or databases that matches the pattern.

{% /codeInfo %}

{% codeInfo srNumber=21 %}

**tableFilterPattern**: Regex to only fetch tables or databases that matches the pattern.

{% /codeInfo %}

{% codeInfo srNumber=22 %}

#### Processor Configuration

Choose the `orm-profiler`. Its config can also be updated to define tests from the YAML itself instead of the UI:

**tableConfig**: `tableConfig` allows you to set up some configuration at the table level.
{% /codeInfo %}


{% codeInfo srNumber=23 %}

#### Sink Configuration

To send the metadata to OpenMetadata, it needs to be specified as `type: metadata-rest`.
{% /codeInfo %}


{% partial file="/v1.2/connectors/yaml/workflow-config-def.md" /%}

{% /codeInfoContainer %}

{% codeBlock fileName="filename.yaml" %}


```yaml
source:
type: {% $connector %}
serviceName: local_athena
sourceConfig:
config:
type: Profiler
```

```yaml {% srNumber=13 %}
generateSampleData: true
```
```yaml {% srNumber=14 %}
# profileSample: 85
```
```yaml {% srNumber=15 %}
# threadCount: 5
```
```yaml {% srNumber=16 %}
processPiiSensitive: false
```
```yaml {% srNumber=17 %}
# confidence: 80
```
```yaml {% srNumber=18 %}
# timeoutSeconds: 43200
```
```yaml {% srNumber=19 %}
# databaseFilterPattern:
# includes:
# - database1
# - database2
# excludes:
# - database3
# - database4
```
```yaml {% srNumber=20 %}
# schemaFilterPattern:
# includes:
# - schema1
# - schema2
# excludes:
# - schema3
# - schema4
```
```yaml {% srNumber=21 %}
# tableFilterPattern:
# includes:
# - table1
# - table2
# excludes:
# - table3
# - table4
```

```yaml {% srNumber=22 %}
processor:
type: orm-profiler
config: {} # Remove braces if adding properties
# tableConfig:
# - fullyQualifiedName: <table fqn>
# profileSample: <number between 0 and 99> # default

# profileSample: <number between 0 and 99> # default will be 100 if omitted
# profileQuery: <query to use for sampling data for the profiler>
# columnConfig:
# excludeColumns:
# - <column name>
# includeColumns:
# - columnName: <column name>
# - metrics:
# - MEAN
# - MEDIAN
# - ...
# partitionConfig:
# enablePartitioning: <set to true to use partitioning>
# partitionColumnName: <partition column name>
# partitionIntervalType: <TIME-UNIT, INTEGER-RANGE, INGESTION-TIME, COLUMN-VALUE>
# Pick one of the variation shown below
# ----'TIME-UNIT' or 'INGESTION-TIME'-------
# partitionInterval: <partition interval>
# partitionIntervalUnit: <YEAR, MONTH, DAY, HOUR>
# ------------'INTEGER-RANGE'---------------
# partitionIntegerRangeStart: <integer>
# partitionIntegerRangeEnd: <integer>
# -----------'COLUMN-VALUE'----------------
# partitionValues:
# - <value>
# - <value>

```

```yaml {% srNumber=23 %}
sink:
type: metadata-rest
config: {}
```

{% partial file="/v1.2/connectors/yaml/workflow-config.md" /%}

{% /codeBlock %}

{% /codePreview %}

- You can learn more about how to configure and run the Profiler Workflow to extract Profiler data and execute the Data Quality from [here](/connectors/ingestion/workflows/profiler)

### 2. Run with the CLI

After saving the YAML config, we will run the command the same way we did for the metadata ingestion:

```bash
metadata profile -c <path-to-yaml>
```

Note now instead of running `ingest`, we are using the `profile` command to select the Profiler workflow.

{% tilesContainer %}

{% tile
title="Data Profiler"
description="Find more information about the Data Profiler here"
link="/connectors/ingestion/workflows/profiler"
/ %}

{% /tilesContainer %}
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
#### Source Configuration - Source Config

{% codeInfo srNumber=100 %}

The `sourceConfig` is defined [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/databaseServiceMetadataPipeline.json):

**markDeletedTables**: To flag tables as soft-deleted if they are not present anymore in the source system.

**includeTables**: true or false, to ingest table data. Default is true.

**includeViews**: true or false, to ingest views definitions.

**databaseFilterPattern**, **schemaFilterPattern**, **tableFilterPattern**: Note that the filter supports regex as include or exclude. You can find examples [here](/connectors/ingestion/workflows/metadata/filter-patterns/database)

{% /codeInfo %}
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
```yaml {% srNumber=100 %}
sourceConfig:
config:
type: DatabaseMetadata
markDeletedTables: true
includeTables: true
includeViews: true
# includeTags: true
# databaseFilterPattern:
# includes:
# - database1
# - database2
# excludes:
# - database3
# - database4
# schemaFilterPattern:
# includes:
# - schema1
# - schema2
# excludes:
# - schema3
# - schema4
# tableFilterPattern:
# includes:
# - users
# - type_test
# excludes:
# - table3
# - table4
```
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
### 2. Run with the CLI

First, we will need to save the YAML file. Afterward, and with all requirements installed, we can run:

```bash
metadata ingest -c <path-to-yaml>
```

Note that from connector to connector, this recipe will always be the same. By updating the YAML configuration,
you will be able to extract metadata from different sources.
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
#### Sink Configuration

{% codeInfo srNumber=200 %}

To send the metadata to OpenMetadata, it needs to be specified as `type: metadata-rest`.

{% /codeInfo %}
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
```yaml {% srNumber=200 %}
sink:
type: metadata-rest
config: {}
```
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
#### Source Configuration - Source Config

{% codeInfo srNumber=100 %}

The sourceConfig is defined [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/messagingServiceMetadataPipeline.json):

**generateSampleData:** Option to turn on/off generating sample data during metadata extraction.

**topicFilterPattern:** Note that the `topicFilterPattern` supports regex as include or exclude.

{% /codeInfo %}
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
```yaml {% srNumber=100 %}
sourceConfig:
config:
type: MessagingMetadata
topicFilterPattern:
excludes:
- _confluent.*
# includes:
# - topic1
# generateSampleData: true
```
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
#### Source Configuration - Source Config

{% codeInfo srNumber=100 %}

The sourceConfig is defined [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/messagingServiceMetadataPipeline.json):

**markDeletedMlModels**: Set the Mark Deleted Ml Models toggle to flag ml models as soft-deleted if they are not present anymore in the source system.

{% /codeInfo %}
Loading
Loading