Commit

MINOR: Added external workflow missing docs for usage/lineage (#15223)
* Added external ingestion workflow docs

* Added data quality external run docs

* Nit

* Nit

---------

Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>
OnkarVO7 and pmbrull authored Feb 16, 2024
1 parent 44fd621 commit 50dd742
Showing 31 changed files with 394 additions and 190 deletions.
openmetadata-docs/content/partials/v1.3/connectors/yaml/data-quality.md (115 additions, 0 deletions)
@@ -0,0 +1,115 @@
## Data Quality

### Adding Data Quality Test Cases from yaml config

When creating a YAML config for a test workflow, the source configuration is very simple.
```yaml
source:
  type: TestSuite
  serviceName: <your_service_name>
  sourceConfig:
    config:
      type: TestSuite
      entityFullyQualifiedName: <entityFqn>
```
The only keys you need to modify here are `serviceName` (this name needs to be unique) and `entityFullyQualifiedName` (the entity against which the tests will be executed).
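For instance, a filled-in source block for a hypothetical MySQL table might look like this (both names below are placeholders, not values from this repository):

```yaml
source:
  type: TestSuite
  serviceName: my_test_suite  # hypothetical; must be unique across services
  sourceConfig:
    config:
      type: TestSuite
      entityFullyQualifiedName: MySQL.default.shop_db.orders  # hypothetical table FQN
```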

Once you have defined your source configuration, you'll need to define the processor configuration.

```yaml
processor:
  type: "orm-test-runner"
  config:
    forceUpdate: <false|true>
    testCases:
      - name: <testCaseName>
        testDefinitionName: columnValueLengthsToBeBetween
        columnName: <columnName>
        parameterValues:
          - name: minLength
            value: 10
          - name: maxLength
            value: 25
      - name: <testCaseName>
        testDefinitionName: tableRowCountToEqual
        parameterValues:
          - name: value
            value: 10
```

The processor type should be set to `"orm-test-runner"`. For accepted test definition names and parameter value names, refer to the [tests page](/connectors/ingestion/workflows/data-quality/tests).

{% note %}

Note that while you can define tests directly in this YAML configuration, running the
workflow will execute ALL the tests present in the table, regardless of what you define in the YAML.

This makes it easy for any user to contribute tests via the UI, while keeping the test execution external.

{% /note %}

If the table already has tests, you can keep your YAML config as simple as this:

```yaml
processor:
  type: "orm-test-runner"
  config: {}
```

### Key reference

- `forceUpdate`: if a test case with the same name already exists for the entity, this flag sets the strategy to follow when running the test, i.e. whether or not to update its parameters (see the sketch below)
- `testCases`: list of test cases to add to the referenced entity. Note that all the tests present in the table will be executed.
  - `name`: test case name
  - `testDefinitionName`: test definition
  - `columnName`: only applies to column tests. The name of the column to run the test against
  - `parameterValues`: parameter values of the test
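For example, here is a minimal sketch (with a hypothetical test case name) of re-running an existing test with new parameters by setting `forceUpdate`:

```yaml
processor:
  type: "orm-test-runner"
  config:
    forceUpdate: true  # update the parameters of matching existing test cases
    testCases:
      - name: table_row_count_test  # hypothetical; must match the existing test case name
        testDefinitionName: tableRowCountToEqual
        parameterValues:
          - name: value
            value: 20
```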


The `sink` and `workflowConfig` sections use the same settings as the ingestion and profiler workflows.

### Full `yaml` config example

```yaml
source:
  type: TestSuite
  serviceName: MyAwesomeTestSuite
  sourceConfig:
    config:
      type: TestSuite
      entityFullyQualifiedName: MySQL.default.openmetadata_db.tag_usage

processor:
  type: "orm-test-runner"
  config:
    forceUpdate: false
    testCases:
      - name: column_value_length_tagFQN
        testDefinitionName: columnValueLengthsToBeBetween
        columnName: tagFQN
        parameterValues:
          - name: minLength
            value: 10
          - name: maxLength
            value: 25
      - name: table_row_count_test
        testDefinitionName: tableRowCountToEqual
        parameterValues:
          - name: value
            value: 10

sink:
  type: metadata-rest
  config: {}

workflowConfig:
  openMetadataServerConfig:
    hostPort: <OpenMetadata host and port>
    authProvider: <OpenMetadata auth provider>
```

### How to Run Tests

To run the tests from the CLI, execute the following command:

```bash
metadata test -c /path/to/my/config.yaml
```
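If you'd rather trigger the same run from Python (for example, inside an Airflow task), here is a minimal sketch. It assumes the `TestSuiteWorkflow` runner exposed by the OpenMetadata Python SDK (`openmetadata-ingestion` package); double-check the import path against the SDK version you have installed.

```python
import yaml

# Assumption: TestSuiteWorkflow lives at this path in the
# openmetadata-ingestion package; verify for your installed version.
from metadata.workflow.data_quality import TestSuiteWorkflow

# Reuse the same YAML config passed to `metadata test -c`
with open("/path/to/my/config.yaml") as f:
    workflow_config = yaml.safe_load(f)

workflow = TestSuiteWorkflow.create(workflow_config)
workflow.execute()            # runs all tests registered for the table
workflow.raise_from_status()  # raise if any step failed
workflow.print_status()
workflow.stop()
```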
openmetadata-docs/content/partials/v1.3/connectors/yaml/lineage.md (167 additions, 0 deletions)
@@ -0,0 +1,167 @@
## Lineage

After running a Metadata Ingestion workflow, we can run a Lineage workflow.
The `serviceName` should be the same as the one used for Metadata Ingestion, so that the ingestion bot can pick up the `serviceConnection` details from the server.


### 1. Define the YAML Config

This is a sample config for the Lineage workflow:

{% codePreview %}

{% codeInfoContainer %}

{% codeInfo srNumber=40 %}
#### Source Configuration - Source Config

You can find all the definitions and types for the `sourceConfig` [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/databaseServiceQueryLineagePipeline.json).

{% /codeInfo %}

{% codeInfo srNumber=41 %}

**queryLogDuration**: Configuration to tune how far back, in days, we want to look in the query logs to process lineage data.

{% /codeInfo %}

{% codeInfo srNumber=42 %}

**parsingTimeoutLimit**: Configuration to set the timeout, in seconds, for parsing queries.
{% /codeInfo %}

{% codeInfo srNumber=43 %}

**filterCondition**: Condition to filter the query history.

{% /codeInfo %}

{% codeInfo srNumber=44 %}

**resultLimit**: Configuration to set the limit for query logs.

{% /codeInfo %}

{% codeInfo srNumber=45 %}

**queryLogFilePath**: Configuration to set the file path for query logs.

{% /codeInfo %}

{% codeInfo srNumber=46 %}

**databaseFilterPattern**: Regex to only fetch databases that match the pattern.

{% /codeInfo %}

{% codeInfo srNumber=47 %}

**schemaFilterPattern**: Regex to only fetch schemas that match the pattern.

{% /codeInfo %}

{% codeInfo srNumber=48 %}

**tableFilterPattern**: Regex to only fetch tables that match the pattern.

{% /codeInfo %}


{% codeInfo srNumber=49 %}

#### Sink Configuration

To send the metadata to OpenMetadata, it needs to be specified as `type: metadata-rest`.
{% /codeInfo %}


{% codeInfo srNumber=50 %}

#### Workflow Configuration

The main property here is the `openMetadataServerConfig`, where you can define the host and security provider of your OpenMetadata installation.

For a simple, local installation using our Docker containers, this looks like:

{% /codeInfo %}

{% /codeInfoContainer %}

{% codeBlock fileName="filename.yaml" %}


```yaml {% srNumber=40 %}
source:
  type: {% $connector %}-lineage
  serviceName: <serviceName (same as metadata ingestion service name)>
  sourceConfig:
    config:
      type: DatabaseLineage
```
```yaml {% srNumber=41 %}
      # Number of days to look back
      queryLogDuration: 1
```
```yaml {% srNumber=42 %}
      parsingTimeoutLimit: 300
```
```yaml {% srNumber=43 %}
      # filterCondition: query_text not ilike '--- metabase query %'
```
```yaml {% srNumber=44 %}
      resultLimit: 1000
```
```yaml {% srNumber=45 %}
      # If instead of getting the query logs from the database we want to pass a file with the queries
      # queryLogFilePath: /tmp/query_log/file_path
```
```yaml {% srNumber=46 %}
      # databaseFilterPattern:
      #   includes:
      #     - database1
      #     - database2
      #   excludes:
      #     - database3
      #     - database4
```
```yaml {% srNumber=47 %}
      # schemaFilterPattern:
      #   includes:
      #     - schema1
      #     - schema2
      #   excludes:
      #     - schema3
      #     - schema4
```
```yaml {% srNumber=48 %}
      # tableFilterPattern:
      #   includes:
      #     - table1
      #     - table2
      #   excludes:
      #     - table3
      #     - table4
```

```yaml {% srNumber=49 %}
sink:
  type: metadata-rest
  config: {}
```
{% partial file="/v1.2/connectors/yaml/workflow-config.md" /%}
{% /codeBlock %}
{% /codePreview %}

You can learn more about how to configure and run the Lineage Workflow to extract lineage data [here](/connectors/ingestion/workflows/lineage).

### 2. Run with the CLI

After saving the YAML config, we will run the command the same way we did for the metadata ingestion:
```bash
metadata ingest -c <path-to-yaml>
```
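Alternatively, you can trigger the lineage run from Python (for example, inside an Airflow task). The sketch below assumes the generic `MetadataWorkflow` runner from the OpenMetadata Python SDK also drives lineage pipelines; verify the import path against your installed `openmetadata-ingestion` version.

```python
import yaml

# Assumption: lineage pipelines are executed by the generic
# MetadataWorkflow runner; verify for your installed SDK version.
from metadata.workflow.metadata import MetadataWorkflow

# Reuse the same lineage YAML config passed to `metadata ingest -c`
with open("/path/to/lineage.yaml") as f:
    workflow_config = yaml.safe_load(f)

workflow = MetadataWorkflow.create(workflow_config)
workflow.execute()            # parses query logs and publishes lineage
workflow.raise_from_status()  # raise if any step failed
workflow.print_status()
workflow.stop()
```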
Athena connector YAML docs:

```diff
@@ -34,8 +34,9 @@ Configure and schedule Athena metadata and profiler workflows from the OpenMetad
 - [Requirements](#requirements)
 - [Metadata Ingestion](#metadata-ingestion)
 - [Query Usage](#query-usage)
-- [Data Profiler](#data-profiler)
 - [Lineage](#lineage)
+- [Data Profiler](#data-profiler)
+- [Data Quality](#data-quality)
 - [dbt Integration](#dbt-integration)
 
 {% partial file="/v1.3/connectors/external-ingestion-deployment.md" /%}
```
```diff
@@ -359,11 +360,11 @@ source:
 
 {% partial file="/v1.3/connectors/yaml/query-usage.md" variables={connector: "athena"} /%}
 
-{% partial file="/v1.3/connectors/yaml/data-profiler.md" variables={connector: "athena"} /%}
+{% partial file="/v1.3/connectors/yaml/lineage.md" variables={connector: "athena"} /%}
 
-## Lineage
+{% partial file="/v1.3/connectors/yaml/data-profiler.md" variables={connector: "athena"} /%}
 
-You can learn more about how to ingest lineage [here](/connectors/ingestion/workflows/lineage).
+{% partial file="/v1.3/connectors/yaml/data-quality.md" /%}
 
 ## dbt Integration
```
AzureSQL connector YAML docs:

```diff
@@ -35,6 +35,7 @@ Configure and schedule AzureSQL metadata and profiler workflows from the OpenMet
 - [Requirements](#requirements)
 - [Metadata Ingestion](#metadata-ingestion)
 - [Data Profiler](#data-profiler)
+- [Data Quality](#data-quality)
 - [dbt Integration](#dbt-integration)
 
 {% partial file="/v1.3/connectors/external-ingestion-deployment.md" /%}
```
```diff
@@ -192,6 +193,8 @@ source:
 
 {% partial file="/v1.3/connectors/yaml/data-profiler.md" variables={connector: "azuresql"} /%}
 
+{% partial file="/v1.3/connectors/yaml/data-quality.md" /%}
+
 ## dbt Integration
 
 {% tilesContainer %}
```
BigQuery connector YAML docs:

```diff
@@ -35,8 +35,9 @@ Configure and schedule BigQuery metadata and profiler workflows from the OpenMet
 - [Requirements](#requirements)
 - [Metadata Ingestion](#metadata-ingestion)
 - [Query Usage](#query-usage)
-- [Data Profiler](#data-profiler)
 - [Lineage](#lineage)
+- [Data Profiler](#data-profiler)
+- [Data Quality](#data-quality)
 - [dbt Integration](#dbt-integration)
 
 {% partial file="/v1.3/connectors/external-ingestion-deployment.md" /%}
```
```diff
@@ -255,11 +256,11 @@ source:
 
 {% partial file="/v1.3/connectors/yaml/query-usage.md" variables={connector: "bigquery"} /%}
 
-{% partial file="/v1.3/connectors/yaml/data-profiler.md" variables={connector: "bigquery"} /%}
+{% partial file="/v1.3/connectors/yaml/lineage.md" variables={connector: "bigquery"} /%}
 
-## Lineage
+{% partial file="/v1.3/connectors/yaml/data-profiler.md" variables={connector: "bigquery"} /%}
 
-You can learn more about how to ingest lineage [here](/connectors/ingestion/workflows/lineage).
+{% partial file="/v1.3/connectors/yaml/data-quality.md" /%}
 
 ## dbt Integration
```
Clickhouse connector YAML docs:

```diff
@@ -34,8 +34,9 @@ Configure and schedule Clickhouse metadata and profiler workflows from the OpenM
 - [Requirements](#requirements)
 - [Metadata Ingestion](#metadata-ingestion)
 - [Query Usage](#query-usage)
-- [Data Profiler](#data-profiler)
 - [Lineage](#lineage)
+- [Data Profiler](#data-profiler)
+- [Data Quality](#data-quality)
 - [dbt Integration](#dbt-integration)
 
 {% partial file="/v1.3/connectors/external-ingestion-deployment.md" /%}
```
```diff
@@ -255,11 +256,11 @@ source:
 
 {% partial file="/v1.3/connectors/yaml/query-usage.md" variables={connector: "clickhouse"} /%}
 
-{% partial file="/v1.3/connectors/yaml/data-profiler.md" variables={connector: "clickhouse"} /%}
+{% partial file="/v1.3/connectors/yaml/lineage.md" variables={connector: "clickhouse"} /%}
 
-## Lineage
+{% partial file="/v1.3/connectors/yaml/data-profiler.md" variables={connector: "clickhouse"} /%}
 
-You can learn more about how to ingest lineage [here](/connectors/ingestion/workflows/lineage).
+{% partial file="/v1.3/connectors/yaml/data-quality.md" /%}
 
 ## dbt Integration
```