# MINOR: Added external workflow missing docs for usage/lineage #15223

Merged 4 commits on Feb 16, 2024.

115 changes: 115 additions & 0 deletions openmetadata-docs/content/partials/v1.3/connectors/yaml/data-quality.md

## Data Quality

### Adding Data Quality Test Cases from YAML Config

When creating a YAML config for a test workflow, the source configuration is very simple.

```yaml
source:
type: TestSuite
serviceName: <your_service_name>
sourceConfig:
config:
type: TestSuite
entityFullyQualifiedName: <entityFqn>
```
The only keys you need to modify here are `serviceName` (this name needs to be unique) and `entityFullyQualifiedName` (the entity we'll be executing tests against, given as a fully qualified name such as `service.database.schema.table`).

Once you have defined your source configuration, you'll need to define the processor configuration.

```yaml
processor:
type: "orm-test-runner"
config:
forceUpdate: <false|true>
testCases:
- name: <testCaseName>
testDefinitionName: columnValueLengthsToBeBetween
columnName: <columnName>
parameterValues:
- name: minLength
value: 10
- name: maxLength
value: 25
- name: <testCaseName>
testDefinitionName: tableRowCountToEqual
parameterValues:
- name: value
value: 10
```

The processor type should be set to `"orm-test-runner"`. For accepted test definition names and parameter value names, refer to the [tests page](/connectors/ingestion/workflows/data-quality/tests).
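
As an illustration, a test definition that takes no parameters can omit the `parameterValues` block entirely. This is a minimal sketch, assuming `columnValuesToBeNotNull` is among the accepted definitions on the tests page, and using a hypothetical test case name:

```yaml
processor:
  type: "orm-test-runner"
  config:
    testCases:
      - name: tagFQN_not_null              # hypothetical test case name
        testDefinitionName: columnValuesToBeNotNull
        columnName: tagFQN                 # column tests still require a columnName
```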

{% note %}

Note that while you can define tests directly in this YAML configuration, running the
workflow will execute ALL the tests present on the table, regardless of what you define in the YAML.

This makes it easy for any user to contribute tests via the UI, while keeping the test execution external.

{% /note %}

You can keep your YAML config as simple as follows if the table already has tests.

```yaml
processor:
type: "orm-test-runner"
config: {}
```

### Key reference

- `forceUpdate`: if a test case already exists for the entity (based on the test case name), this sets the strategy to follow when running the test (i.e. whether or not to update its parameters)
- `testCases`: list of test cases to add to the referenced entity. Note that we will execute all the tests present on the table
  - `name`: test case name
  - `testDefinitionName`: test definition
  - `columnName`: only applies to column tests; the name of the column to run the test against
  - `parameterValues`: parameter values of the test


The `sink` and `workflowConfig` sections will have the same settings as in the ingestion and profiler workflows.
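
For reference, this is a minimal sketch of those two sections for a simple local deployment; the `hostPort`, `authProvider`, and token values are placeholders you would replace with your own installation's details:

```yaml
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api      # default for a local Docker setup
    authProvider: openmetadata
    securityConfig:
      jwtToken: <ingestion-bot-jwt-token>    # placeholder: your ingestion bot token
```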

### Full `yaml` config example

```yaml
source:
type: TestSuite
serviceName: MyAwesomeTestSuite
sourceConfig:
config:
type: TestSuite
entityFullyQualifiedName: MySQL.default.openmetadata_db.tag_usage

processor:
type: "orm-test-runner"
config:
forceUpdate: false
testCases:
- name: column_value_length_tagFQN
testDefinitionName: columnValueLengthsToBeBetween
columnName: tagFQN
parameterValues:
- name: minLength
value: 10
- name: maxLength
value: 25
- name: table_row_count_test
testDefinitionName: tableRowCountToEqual
parameterValues:
- name: value
value: 10

sink:
type: metadata-rest
config: {}
workflowConfig:
openMetadataServerConfig:
hostPort: <OpenMetadata host and port>
authProvider: <OpenMetadata auth provider>
```

### How to Run Tests

To run the tests from the CLI, execute the following command:

```bash
metadata test -c /path/to/my/config.yaml
```

167 changes: 167 additions & 0 deletions openmetadata-docs/content/partials/v1.3/connectors/yaml/lineage.md

## Lineage

After running a Metadata Ingestion workflow, we can run a Lineage workflow. The `serviceName` should be the same as the one used for Metadata Ingestion, so the ingestion bot can get the `serviceConnection` details from the server.


### 1. Define the YAML Config

This is a sample config for the Lineage workflow (the `{% $connector %}` variable is filled in by each connector page):

{% codePreview %}

{% codeInfoContainer %}

{% codeInfo srNumber=40 %}
#### Source Configuration - Source Config

You can find all the definitions and types for the `sourceConfig` [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/databaseServiceQueryLineagePipeline.json).

{% /codeInfo %}

{% codeInfo srNumber=41 %}

**queryLogDuration**: Configuration to tune how many days back we want to look in the query logs to process lineage data.

{% /codeInfo %}

{% codeInfo srNumber=42 %}

**parsingTimeoutLimit**: Configuration to set the timeout (in seconds) for parsing queries.
{% /codeInfo %}

{% codeInfo srNumber=43 %}

**filterCondition**: Condition to filter the query history.

{% /codeInfo %}

{% codeInfo srNumber=44 %}

**resultLimit**: Configuration to set the limit for query logs.

{% /codeInfo %}

{% codeInfo srNumber=45 %}

**queryLogFilePath**: Configuration to set the file path for query logs.

{% /codeInfo %}

{% codeInfo srNumber=46 %}

**databaseFilterPattern**: Regex to only fetch databases that match the pattern.

{% /codeInfo %}

{% codeInfo srNumber=47 %}

**schemaFilterPattern**: Regex to only fetch schemas that match the pattern.

{% /codeInfo %}

{% codeInfo srNumber=48 %}

**tableFilterPattern**: Regex to only fetch tables that match the pattern.

{% /codeInfo %}


{% codeInfo srNumber=49 %}

#### Sink Configuration

To send the metadata to OpenMetadata, it needs to be specified as `type: metadata-rest`.
{% /codeInfo %}


{% codeInfo srNumber=50 %}

#### Workflow Configuration

The main property here is the `openMetadataServerConfig`, where you can define the host and security provider of your OpenMetadata installation.

For a simple, local installation using our docker containers, this looks like:

{% /codeInfo %}

{% /codeInfoContainer %}

{% codeBlock fileName="filename.yaml" %}


```yaml {% srNumber=40 %}
source:
type: {% $connector %}-lineage
serviceName: <serviceName (same as metadata ingestion service name)>
sourceConfig:
config:
type: DatabaseLineage
```

```yaml {% srNumber=41 %}
# Number of days to look back
queryLogDuration: 1
```
```yaml {% srNumber=42 %}
parsingTimeoutLimit: 300
```
```yaml {% srNumber=43 %}
# filterCondition: query_text not ilike '--- metabase query %'
```
```yaml {% srNumber=44 %}
resultLimit: 1000
```
```yaml {% srNumber=45 %}
# If instead of getting the query logs from the database we want to pass a file with the queries
# queryLogFilePath: /tmp/query_log/file_path
```
```yaml {% srNumber=46 %}
# databaseFilterPattern:
# includes:
# - database1
# - database2
# excludes:
# - database3
# - database4
```
```yaml {% srNumber=47 %}
# schemaFilterPattern:
# includes:
# - schema1
# - schema2
# excludes:
# - schema3
# - schema4
```
```yaml {% srNumber=48 %}
# tableFilterPattern:
# includes:
# - table1
# - table2
# excludes:
# - table3
# - table4
```

```yaml {% srNumber=49 %}
sink:
type: metadata-rest
config: {}
```

{% partial file="/v1.2/connectors/yaml/workflow-config.md" /%}

{% /codeBlock %}

{% /codePreview %}
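
As a concrete illustration, when a connector page renders this partial with `variables={connector: "athena"}` (as in the pages updated below), the `source` section of the generated config would resolve to something like the following sketch; the service name here is a hypothetical placeholder:

```yaml
source:
  type: athena-lineage                  # {% $connector %}-lineage with connector = athena
  serviceName: local_athena             # hypothetical: reuse your metadata ingestion service name
  sourceConfig:
    config:
      type: DatabaseLineage
```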

- You can learn more about how to configure and run the Lineage Workflow to extract lineage data [here](/connectors/ingestion/workflows/lineage).

### 2. Run with the CLI

After saving the YAML config, we will run the command the same way we did for the metadata ingestion:

```bash
metadata ingest -c <path-to-yaml>
```

The four connector YAML deployment guides (Athena, AzureSQL, BigQuery, Clickhouse) are updated to reference the new partials:

```diff
@@ -34,8 +34,9 @@ Configure and schedule Athena metadata and profiler workflows from the OpenMetadata UI
 - [Requirements](#requirements)
 - [Metadata Ingestion](#metadata-ingestion)
 - [Query Usage](#query-usage)
-- [Data Profiler](#data-profiler)
 - [Lineage](#lineage)
+- [Data Profiler](#data-profiler)
+- [Data Quality](#data-quality)
 - [dbt Integration](#dbt-integration)

 {% partial file="/v1.3/connectors/external-ingestion-deployment.md" /%}
@@ -359,11 +360,11 @@

 {% partial file="/v1.3/connectors/yaml/query-usage.md" variables={connector: "athena"} /%}

-{% partial file="/v1.3/connectors/yaml/data-profiler.md" variables={connector: "athena"} /%}
+{% partial file="/v1.3/connectors/yaml/lineage.md" variables={connector: "athena"} /%}

-## Lineage
+{% partial file="/v1.3/connectors/yaml/data-profiler.md" variables={connector: "athena"} /%}

-You can learn more about how to ingest lineage [here](/connectors/ingestion/workflows/lineage).
+{% partial file="/v1.3/connectors/yaml/data-quality.md" /%}

 ## dbt Integration
```

```diff
@@ -35,6 +35,7 @@ Configure and schedule AzureSQL metadata and profiler workflows from the OpenMetadata UI
 - [Requirements](#requirements)
 - [Metadata Ingestion](#metadata-ingestion)
 - [Data Profiler](#data-profiler)
+- [Data Quality](#data-quality)
 - [dbt Integration](#dbt-integration)

 {% partial file="/v1.3/connectors/external-ingestion-deployment.md" /%}
@@ -192,6 +193,8 @@

 {% partial file="/v1.3/connectors/yaml/data-profiler.md" variables={connector: "azuresql"} /%}

+{% partial file="/v1.3/connectors/yaml/data-quality.md" /%}
+
 ## dbt Integration

 {% tilesContainer %}
```

```diff
@@ -35,8 +35,9 @@ Configure and schedule BigQuery metadata and profiler workflows from the OpenMetadata UI
 - [Requirements](#requirements)
 - [Metadata Ingestion](#metadata-ingestion)
 - [Query Usage](#query-usage)
-- [Data Profiler](#data-profiler)
 - [Lineage](#lineage)
+- [Data Profiler](#data-profiler)
+- [Data Quality](#data-quality)
 - [dbt Integration](#dbt-integration)

 {% partial file="/v1.3/connectors/external-ingestion-deployment.md" /%}
@@ -255,11 +256,11 @@

 {% partial file="/v1.3/connectors/yaml/query-usage.md" variables={connector: "bigquery"} /%}

-{% partial file="/v1.3/connectors/yaml/data-profiler.md" variables={connector: "bigquery"} /%}
+{% partial file="/v1.3/connectors/yaml/lineage.md" variables={connector: "bigquery"} /%}

-## Lineage
+{% partial file="/v1.3/connectors/yaml/data-profiler.md" variables={connector: "bigquery"} /%}

-You can learn more about how to ingest lineage [here](/connectors/ingestion/workflows/lineage).
+{% partial file="/v1.3/connectors/yaml/data-quality.md" /%}

 ## dbt Integration
```

```diff
@@ -34,8 +34,9 @@ Configure and schedule Clickhouse metadata and profiler workflows from the OpenMetadata UI
 - [Requirements](#requirements)
 - [Metadata Ingestion](#metadata-ingestion)
 - [Query Usage](#query-usage)
-- [Data Profiler](#data-profiler)
 - [Lineage](#lineage)
+- [Data Profiler](#data-profiler)
+- [Data Quality](#data-quality)
 - [dbt Integration](#dbt-integration)

 {% partial file="/v1.3/connectors/external-ingestion-deployment.md" /%}
@@ -255,11 +256,11 @@

 {% partial file="/v1.3/connectors/yaml/query-usage.md" variables={connector: "clickhouse"} /%}

-{% partial file="/v1.3/connectors/yaml/data-profiler.md" variables={connector: "clickhouse"} /%}
+{% partial file="/v1.3/connectors/yaml/lineage.md" variables={connector: "clickhouse"} /%}

-## Lineage
+{% partial file="/v1.3/connectors/yaml/data-profiler.md" variables={connector: "clickhouse"} /%}

-You can learn more about how to ingest lineage [here](/connectors/ingestion/workflows/lineage).
+{% partial file="/v1.3/connectors/yaml/data-quality.md" /%}

 ## dbt Integration
```