MINOR - Clean up configs & add auto classification docs #18907

Merged 3 commits on Dec 4, 2024
@@ -1745,7 +1745,7 @@ WHERE JSON_EXTRACT(json, '$.pipelineType') = 'metadata';

-- classification and sampling configs from the profiler pipelines
UPDATE ingestion_pipeline_entity
- SET json = JSON_REMOVE(json, '$.sourceConfig.config.processPiiSensitive', '$.sourceConfig.config.confidence', '$.sourceConfig.config.generateSampleData')
+ SET json = JSON_REMOVE(json, '$.sourceConfig.config.processPiiSensitive', '$.sourceConfig.config.confidence', '$.sourceConfig.config.generateSampleData', '$.sourceConfig.config.sampleDataCount')
> **Review comment (author):** this is also not needed for the profiler. It was only used for sample data.

WHERE JSON_EXTRACT(json, '$.pipelineType') = 'profiler';

-- Rename 'jobId' to 'jobIds', set 'jobId' as type array in 'jobIds' , add 'projectIds' for dbt cloud
@@ -1732,7 +1732,7 @@ WHERE json #>> '{pipelineType}' = 'metadata';

-- classification and sampling configs from the profiler pipelines
UPDATE ingestion_pipeline_entity
- SET json = json::jsonb #- '{sourceConfig,config,processPiiSensitive}' #- '{sourceConfig,config,confidence}' #- '{sourceConfig,config,generateSampleData}'
+ SET json = json::jsonb #- '{sourceConfig,config,processPiiSensitive}' #- '{sourceConfig,config,confidence}' #- '{sourceConfig,config,generateSampleData}' #- '{sourceConfig,config,sampleDataCount}'
WHERE json #>> '{pipelineType}' = 'profiler';

-- set value of 'jobId' as an array into 'jobIds' for dbt cloud
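For illustration only, here is a small Python sketch (not part of the migration) of what the two `UPDATE` statements above do to each profiler pipeline's stored JSON: the classification and sampling keys are dropped from `sourceConfig.config` and everything else is left untouched. The payload and its values below are made up.

```python
import json

# Hypothetical stored payload for a profiler ingestion pipeline (values are examples only).
pipeline_json = json.loads("""
{
  "pipelineType": "profiler",
  "sourceConfig": {
    "config": {
      "type": "Profiler",
      "profileSample": 85,
      "processPiiSensitive": true,
      "confidence": 80,
      "generateSampleData": true,
      "sampleDataCount": 50
    }
  }
}
""")

# Keys removed by the migration; they now belong to the Auto Classification workflow.
removed_keys = ("processPiiSensitive", "confidence", "generateSampleData", "sampleDataCount")

if pipeline_json.get("pipelineType") == "profiler":
    config = pipeline_json["sourceConfig"]["config"]
    for key in removed_keys:
        config.pop(key, None)

# Only profiler-specific settings (type, profileSample) remain.
print(json.dumps(pipeline_json, indent=2))
```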
4 changes: 2 additions & 2 deletions ingestion/src/metadata/clients/aws_client.py
@@ -144,15 +144,15 @@ def create_session(self) -> Session:
def get_client(self, service_name: str) -> Any:
# initialize the client depending on the AWSCredentials passed
if self.config is not None:
logger.info(f"Getting AWS client for service [{service_name}]")
logger.debug(f"Getting AWS client for service [{service_name}]")
session = self.create_session()
if self.config.endPointURL is not None:
return session.client(
service_name=service_name, endpoint_url=str(self.config.endPointURL)
)
return session.client(service_name=service_name)

logger.info(f"Getting AWS default client for service [{service_name}]")
logger.debug(f"Getting AWS default client for service [{service_name}]")

> **Review comment (author):** since we're now passing the client to the source externally, this log was too chatty.

# initialized with the credentials loaded from running machine
return boto3.client(service_name=service_name)

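As a minimal usage sketch of the method touched above: the wrapper class name (`AWSClient`) and the credentials model import path are assumptions inferred from the file path and the inline comments, not verified against the rest of the file.

```python
# Sketch only: class and import paths are assumptions inferred from this diff.
from metadata.clients.aws_client import AWSClient
from metadata.generated.schema.security.credentials.awsCredentials import AWSCredentials

# Build a credentials config; endPointURL stays unset so the default endpoint is used.
creds = AWSCredentials(awsRegion="us-east-1")

# With this PR, creating the client logs at DEBUG instead of INFO,
# so sources that are handed a client externally no longer flood the logs.
s3_client = AWSClient(creds).get_client("s3")
print(s3_client.list_buckets())
```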
@@ -154,7 +154,6 @@ def create_profiler_interface(
profile_sample_type=self.source_config.profileSampleType,
sampling_method_type=self.source_config.samplingMethodType,
),
- default_sample_data_count=self.source_config.sampleDataCount,
)

profiler_interface: ProfilerInterface = profiler_class.create(
6 changes: 1 addition & 5 deletions ingestion/src/metadata/sampler/processor.py
@@ -88,11 +88,7 @@ def _run(self, record: ProfilerSourceAndEntity) -> Either[SamplerResponse]:
schema_entity=schema_entity,
database_entity=database_entity,
table_config=get_config_for_table(entity, self.profiler_config),
- default_sample_config=SampleConfig(
-     profile_sample=self.source_config.profileSample,
-     profile_sample_type=self.source_config.profileSampleType,
-     sampling_method_type=self.source_config.samplingMethodType,
- ),
+ default_sample_config=SampleConfig(),
> **Review comment (author):** keeping a simpler pipeline for the auto classification. If there's anything configured for the table, we'll pick it from there directly.

default_sample_data_count=self.source_config.sampleDataCount,
)
sample_data = SampleData(
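The comment above describes a precedence rule: the auto classification path no longer seeds workflow-level sampling defaults, so a table-level configuration is used when present and an empty `SampleConfig()` otherwise. A rough, self-contained sketch of that rule follows (hypothetical helper, not the actual OpenMetadata implementation):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleConfig:
    """Stand-in for the real SampleConfig; None fields mean 'use engine defaults'."""
    profile_sample: Optional[float] = None
    profile_sample_type: Optional[str] = None
    sampling_method_type: Optional[str] = None


def resolve_sample_config(table_config: Optional[SampleConfig]) -> SampleConfig:
    # A table with its own sampling configuration keeps it; everything else
    # falls back to the empty default, mirroring `default_sample_config=SampleConfig()`.
    return table_config if table_config is not None else SampleConfig()


# Usage: explicit table-level sampling wins, otherwise defaults apply.
print(resolve_sample_config(SampleConfig(profile_sample=20.0, profile_sample_type="PERCENTAGE")))
print(resolve_sample_config(None))
```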
@@ -0,0 +1,154 @@
## Auto Classification

The Auto Classification workflow uses the `orm-profiler` processor.

After running a Metadata Ingestion workflow, we can run the Auto Classification workflow.
The `serviceName` should be the same as the one used for the Metadata Ingestion workflow, so the ingestion bot can get the `serviceConnection` details from the server.


### 1. Define the YAML Config

This is a sample config for the Auto Classification Workflow:

{% codePreview %}

{% codeInfoContainer %}

#### Source Configuration - Source Config

You can find all the definitions and types for the `sourceConfig` [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/databaseServiceAutoClassificationPipeline.json).

{% codeInfo srNumber=14 %}

**storeSampleData**: Option to turn on/off storing sample data. If enabled, we will ingest sample data for each table.

{% /codeInfo %}

{% codeInfo srNumber=15 %}

**enableAutoClassification**: Optional configuration to automatically tag columns that might contain sensitive information.

{% /codeInfo %}

{% codeInfo srNumber=18 %}

**confidence**: Set the confidence value required for a column to be tagged as PII. Confidence values range from 0 to 100. A higher number yields fewer false positives but more false negatives; a lower number yields more false positives but fewer false negatives (see the small illustration after this block).

{% /codeInfo %}
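As a rough illustration of the trade-off (not the actual classifier code, and assuming the model score is reported on the same 0 to 100 scale as this setting), a column is only tagged when its score clears the configured threshold:

```python
def should_tag_as_pii(model_score: float, confidence: float = 80.0) -> bool:
    """Tag a column as PII only when the classifier score (0-100) reaches the threshold."""
    return model_score >= confidence


# A stricter threshold tags fewer columns: fewer false positives, more false negatives.
print(should_tag_as_pii(model_score=72.0, confidence=80.0))  # False
print(should_tag_as_pii(model_score=72.0, confidence=60.0))  # True
```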

{% codeInfo srNumber=19 %}

**databaseFilterPattern**: Regex to only fetch databases that match the pattern.

{% /codeInfo %}

{% codeInfo srNumber=20 %}

**schemaFilterPattern**: Regex to only fetch schemas that match the pattern.

{% /codeInfo %}

{% codeInfo srNumber=21 %}

**tableFilterPattern**: Regex to only fetch tables that match the pattern.

{% /codeInfo %}

{% codeInfo srNumber=22 %}

#### Processor Configuration

Choose the `orm-profiler`. Its config can also be updated to define table-level settings from the YAML itself instead of the UI:

**tableConfig**: `tableConfig` allows you to set up some configuration at the table level.
{% /codeInfo %}


{% codeInfo srNumber=23 %}

#### Sink Configuration

To send the metadata to OpenMetadata, it needs to be specified as `type: metadata-rest`.
{% /codeInfo %}


{% partial file="/v1.5/connectors/yaml/workflow-config-def.md" /%}

{% /codeInfoContainer %}

{% codeBlock fileName="filename.yaml" %}


```yaml {% isCodeBlock=true %}
source:
type: {% $connector %}
serviceName: {% $connector %}
sourceConfig:
config:
type: AutoClassification
```
```yaml {% srNumber=14 %}
# storeSampleData: true
```
```yaml {% srNumber=15 %}
# enableAutoClassification: true
```
```yaml {% srNumber=18 %}
# confidence: 80
```
```yaml {% srNumber=19 %}
# databaseFilterPattern:
# includes:
# - database1
# - database2
# excludes:
# - database3
# - database4
```
```yaml {% srNumber=20 %}
# schemaFilterPattern:
# includes:
# - schema1
# - schema2
# excludes:
# - schema3
# - schema4
```
```yaml {% srNumber=21 %}
# tableFilterPattern:
# includes:
# - table1
# - table2
# excludes:
# - table3
# - table4
```

```yaml {% srNumber=22 %}
processor:
type: orm-profiler
config: {}
```

```yaml {% srNumber=23 %}
sink:
type: metadata-rest
config: {}
```

{% partial file="/v1.5/connectors/yaml/workflow-config.md" /%}

{% /codeBlock %}

{% /codePreview %}


### 2. Run with the CLI

After saving the YAML config, we will run the command the same way we did for the metadata ingestion:

```bash
metadata classify -c <path-to-yaml>
```

Note that instead of running `ingest`, we now use the `classify` command to select the Auto Classification workflow.
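If you prefer to trigger it from Python instead of the CLI, the same YAML can be loaded and executed with the usual workflow lifecycle. This is only a sketch: the module and class name used below (`metadata.workflow.classification.AutoClassificationWorkflow`) are an assumption and should be checked against your installed `openmetadata-ingestion` version.

```python
import yaml

# Assumed import path for the Auto Classification workflow class; verify it for your version.
from metadata.workflow.classification import AutoClassificationWorkflow


def run_auto_classification(config_path: str) -> None:
    with open(config_path, encoding="utf-8") as config_file:
        workflow_config = yaml.safe_load(config_file)

    # Standard OpenMetadata workflow lifecycle: create, execute, surface failures, stop.
    workflow = AutoClassificationWorkflow.create(workflow_config)
    workflow.execute()
    workflow.raise_from_results()
    workflow.stop()


if __name__ == "__main__":
    run_auto_classification("auto_classification.yaml")
```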
@@ -2,7 +2,7 @@

The Data Profiler workflow will be using the `orm-profiler` processor.

- After running a Metadata Ingestion workflow, we can run Data Profiler workflow.
+ After running a Metadata Ingestion workflow, we can run the Data Profiler workflow.
The `serviceName` should be the same as the one used for the Metadata Ingestion workflow, so the ingestion bot can get the `serviceConnection` details from the server.


@@ -14,15 +14,10 @@ This is a sample config for the profiler:

{% codeInfoContainer %}

{% codeInfo srNumber=13 %}
#### Source Configuration - Source Config

You can find all the definitions and types for the `sourceConfig` [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/databaseServiceProfilerPipeline.json).

**generateSampleData**: Option to turn on/off generating sample data.

{% /codeInfo %}

{% codeInfo srNumber=14 %}

**profileSample**: Percentage of data or no. of rows we want to execute the profiler and tests on.
@@ -35,19 +30,6 @@ You can find all the definitions and types for the `sourceConfig` [here](https:

{% /codeInfo %}

{% codeInfo srNumber=16 %}

**processPiiSensitive**: Optional configuration to automatically tag columns that might contain sensitive information.

{% /codeInfo %}

{% codeInfo srNumber=17 %}

**confidence**: Set the Confidence value for which you want the column to be marked

{% /codeInfo %}


{% codeInfo srNumber=18 %}

**timeoutSeconds**: Profiler Timeout in Seconds
@@ -100,27 +82,17 @@ To send the metadata to OpenMetadata, it needs to be specified as `type: metadat
```yaml {% isCodeBlock=true %}
source:
type: {% $connector %}
- serviceName: local_athena
+ serviceName: {% $connector %}
sourceConfig:
config:
type: Profiler
```

```yaml {% srNumber=13 %}
generateSampleData: true
```
```yaml {% srNumber=14 %}
# profileSample: 85
```
```yaml {% srNumber=15 %}
# threadCount: 5
```
```yaml {% srNumber=16 %}
processPiiSensitive: false
```
```yaml {% srNumber=17 %}
# confidence: 80
```
```yaml {% srNumber=18 %}
# timeoutSeconds: 43200
```
@@ -158,8 +130,6 @@ processor:
config: {} # Remove braces if adding properties
# tableConfig:
# - fullyQualifiedName: <table fqn>
- # profileSample: <number between 0 and 99> # default
+ # profileSample: <number between 0 and 99> # default will be 100 if omitted
# profileQuery: <query to use for sampling data for the profiler>
# columnConfig:
@@ -93,7 +93,7 @@ For a simple, local installation using our docker containers, this looks like:
```yaml {% srNumber=40 %}
source:
type: {% $connector %}-lineage
- serviceName: <serviceName (same as metadata ingestion service name)>
+ serviceName: {% $connector %}
sourceConfig:
config:
type: DatabaseLineage
@@ -62,7 +62,7 @@ Note that the location is a directory that will be cleaned at the end of the ing
```yaml {% isCodeBlock=true %}
source:
type: {% $connector %}-usage
- serviceName: <service name>
+ serviceName: {% $connector %}
sourceConfig:
config:
type: DatabaseUsage
@@ -84,33 +84,6 @@ during the migration after bumping this value, you can increase them further.

After the migration is finished, you can revert these changes.

# New Versioning System for Ingestion Docker Image

We are excited to announce a recent change in our version tagging system for our Ingestion Docker images. This update aims to improve consistency and clarity in our versioning, aligning our Docker image tags with our Python PyPi package versions.

### Ingestion Docker Image Tags

To maintain consistency, our Docker images will now follow the same 4-digit versioning system as of Python Package versions. For example, a Docker image version might look like `1.0.0.0`.

Additionally, we will continue to provide a 3-digit version tag (e.g., `1.0.0`) that will always point to the latest corresponding 4-digit image tag. This ensures ease of use for those who prefer a simpler version tag while still having access to the most recent updates.

### Benefits

**Consistency**: Both Python applications and Docker images will have the same versioning format, making it easier to track and manage versions.
**Clarity**: The 4-digit system provides a clear and detailed versioning structure, helping users understand the nature and scope of changes.
**Non-Breaking Change**: This update is designed to be non-disruptive. Existing Ingestions and dependencies will remain unaffected.

#### Example

Here’s an example of how the new versioning works:

**Python Application Version**: `1.5.0.0`
**Docker Image Tags**:
- `1.5.0.0` (specific version)
- `1.5.0` (latest version in the 1.5.0.x series)

We believe this update will bring greater consistency and clarity to our versioning system. As always, we value your feedback and welcome any questions or comments you may have.

# Backward Incompatible Changes

## 1.6.0
@@ -145,6 +118,13 @@ removing these properties as well.
- If you still want to use the Auto PII Classification and sampling features, you can create the new workflow
from the UI.

### Collate - Metadata Actions for ML Tagging - Deprecation Notice

Since we are introducing the `Auto Classification` workflow, **we are going to remove the `ML Tagging` action from the
Metadata Actions in 1.7**. That feature is already covered by the `Auto Classification` workflow, which brings even
more flexibility, allowing on-the-fly usage of the sample data for classification purposes without having to store
it in the database.

### Service Spec for the Ingestion Framework

This impacts users who maintain their own connectors for the ingestion framework that are **NOT** part of the
@@ -363,6 +363,8 @@ source:

{% partial file="/v1.5/connectors/yaml/data-profiler.md" variables={connector: "athena"} /%}

{% partial file="/v1.5/connectors/yaml/auto-classification.md" variables={connector: "athena"} /%}

{% partial file="/v1.5/connectors/yaml/data-quality.md" /%}

## dbt Integration
@@ -204,6 +204,8 @@ source:

{% partial file="/v1.5/connectors/yaml/data-profiler.md" variables={connector: "azuresql"} /%}

{% partial file="/v1.5/connectors/yaml/auto-classification.md" variables={connector: "azuresql"} /%}

{% partial file="/v1.5/connectors/yaml/data-quality.md" /%}

## dbt Integration
@@ -235,6 +235,8 @@ source:

{% partial file="/v1.5/connectors/yaml/data-profiler.md" variables={connector: "bigquery"} /%}

{% partial file="/v1.5/connectors/yaml/auto-classification.md" variables={connector: "bigquery"} /%}

{% partial file="/v1.5/connectors/yaml/data-quality.md" /%}

## dbt Integration
@@ -240,6 +240,8 @@ source:

{% partial file="/v1.5/connectors/yaml/data-profiler.md" variables={connector: "clickhouse"} /%}

{% partial file="/v1.5/connectors/yaml/auto-classification.md" variables={connector: "clickhouse"} /%}

{% partial file="/v1.5/connectors/yaml/data-quality.md" /%}

## dbt Integration