Docs: Auto Classification Doc Addition #19148

Open
wants to merge 2 commits into
base: main
8 changes: 6 additions & 2 deletions openmetadata-docs/content/v1.6.x/collate-menu.md
@@ -686,8 +686,6 @@ site_menu:
url: /how-to-guides/data-quality-observability/profiler/external-sample-data
- category: How-to Guides / Data Quality and Observability / Data Profiler / External Workflow
url: /how-to-guides/data-quality-observability/profiler/external-workflow
- category: How-to Guides / Data Quality and Observability / Data Profiler / Auto PII Tagging
url: /how-to-guides/data-quality-observability/profiler/auto-pii-tagging
- category: How-to Guides / Data Quality and Observability / Data Observability
url: /how-to-guides/data-quality-observability/observability
- category: How-to Guides / Data Quality and Observability / Data Observability / Observability Alerts
@@ -773,6 +771,12 @@ site_menu:
url: /how-to-guides/data-governance/classification/request-tags
- category: How-to Guides / Data Governance / Classification / Auto-Classification in OpenMetadata
url: /how-to-guides/data-governance/classification/auto
- category: How-to Guides / Data Governance / Classification / Auto-Classification in OpenMetadata / Workflow
url: /how-to-guides/data-governance/classification/auto/workflow
- category: How-to Guides / Data Governance / Classification / Auto-Classification in OpenMetadata / External Workflow
url: /how-to-guides/data-governance/classification/auto/external-workflow
- category: How-to Guides / Data Governance / Classification / Auto-Classification in OpenMetadata / Auto PII Tagging
url: /how-to-guides/data-governance/classification/auto/auto-pii-tagging
- category: How-to Guides / Data Governance / Classification / What are Tiers
url: /how-to-guides/data-governance/classification/tiers
- category: How-to Guides / Data Governance / Classification / Best Practices for Classification
@@ -1,6 +1,6 @@
---
title: Auto PII Tagging
slug: /how-to-guides/data-quality-observability/profiler/auto-pii-tagging
slug: /how-to-guides/data-governance/classification/auto/auto-pii-tagging
---

# Auto PII Tagging
@@ -0,0 +1,129 @@
---
title: External Auto Classification Workflow
slug: /how-to-guides/data-governance/classification/auto/external-workflow
---

# Auto Classification Workflow Configuration

The Auto Classification Workflow enables automatic tagging of sensitive information within databases. Below are the configuration parameters available in the **Service Classification Pipeline JSON**.

## Pipeline Configuration Parameters

| **Parameter** | **Description** | **Type** | **Default Value** |
|-------------------------------|---------------------------------------------------------------------------------|-----------|-------------------------|
| `type` | Specifies the pipeline type. | String | `AutoClassification` |
| `classificationFilterPattern`| Regex to compute metrics for tables matching specific tags, tiers, or glossary patterns. | Object | N/A |
| `schemaFilterPattern` | Regex to fetch schemas matching the specified pattern. | Object | N/A |
| `tableFilterPattern`         | Regex to include or exclude tables matching the specified pattern.             | Object    | N/A                     |
| `databaseFilterPattern` | Regex to fetch databases matching the specified pattern. | Object | N/A |
| `includeViews` | Option to include or exclude views during metadata ingestion. | Boolean | `true` |
| `useFqnForFiltering` | Determines whether filtering is applied to the Fully Qualified Name (FQN) instead of raw names. | Boolean | `false` |
| `storeSampleData` | Option to enable or disable storing sample data for each table. | Boolean | `true` |
| `enableAutoClassification` | Enables automatic tagging of columns that might contain sensitive information. | Boolean | `false` |
| `confidence` | Confidence level for tagging columns as sensitive. Value ranges from 0 to 100. | Number | `80` |
| `sampleDataCount`            | Number of sample rows to ingest when `storeSampleData` is enabled.             | Integer   | `50`                    |
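
As a rough illustration of how the defaults in the table combine with user-supplied values, the following Python sketch merges overrides onto the documented defaults and checks the `confidence` range. The helper `resolve_source_config` and the `DEFAULTS` dictionary are hypothetical and not part of OpenMetadata; only the parameter names and default values are taken from the table.

```python
from typing import Any, Dict

# Defaults copied from the parameter table; the filter patterns have no default.
DEFAULTS: Dict[str, Any] = {
    "type": "AutoClassification",
    "includeViews": True,
    "useFqnForFiltering": False,
    "storeSampleData": True,
    "enableAutoClassification": False,
    "confidence": 80,
    "sampleDataCount": 50,
}


def resolve_source_config(overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Merge user overrides onto the documented defaults (hypothetical helper)."""
    config = {**DEFAULTS, **overrides}
    # The table documents confidence as a value between 0 and 100.
    if not 0 <= config["confidence"] <= 100:
        raise ValueError("confidence must be between 0 and 100")
    if config["sampleDataCount"] < 0:
        raise ValueError("sampleDataCount must not be negative")
    return config


config = resolve_source_config({"enableAutoClassification": True, "confidence": 90})
print(config["confidence"], config["sampleDataCount"])  # 90 50
```

Keys that are not overridden keep the default listed in the table, so a minimal configuration only needs to set the values that differ.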

## Key Parameters Explained

### `enableAutoClassification`
- Set this to `true` to enable automatic detection of sensitive columns (e.g., PII).
- Applies pattern recognition and tagging based on predefined criteria.

### `confidence`
- Confidence level for tagging sensitive columns:
  - A higher confidence value (e.g., `90`) reduces false positives but may miss some sensitive data.
  - A lower confidence value (e.g., `70`) identifies more sensitive columns but may result in false positives.
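
The trade-off can be sketched in a few lines of Python. This is an illustration only: the function `columns_to_tag`, the example scores, and the assumption that the recognizer reports scores between 0.0 and 1.0 that are compared against `confidence / 100` are all hypothetical, not OpenMetadata's actual implementation.

```python
from typing import Dict, List


def columns_to_tag(scores: Dict[str, float], confidence: int = 80) -> List[str]:
    """Return the columns whose score meets the threshold (hypothetical helper).

    Assumes scores are in the range 0.0 to 1.0 and confidence is 0 to 100.
    """
    threshold = confidence / 100
    return sorted(col for col, score in scores.items() if score >= threshold)


# Made-up scores for three columns.
scores = {"email": 0.95, "nickname": 0.75, "notes": 0.40}

print(columns_to_tag(scores, confidence=90))  # ['email']
print(columns_to_tag(scores, confidence=70))  # ['email', 'nickname']
```

Raising the threshold drops the borderline `nickname` column, while lowering it tags more columns at the risk of false positives.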

### `storeSampleData`
- Controls whether sample rows are stored during ingestion.
- If enabled, the specified number of rows (`sampleDataCount`) will be fetched for each table.

### `useFqnForFiltering`
- When set to `true`, filtering patterns will be applied to the Fully Qualified Name of a table (e.g., `service_name.db_name.schema_name.table_name`).
- When set to `false`, filtering applies only to raw table names.
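
To make the difference concrete, here is a minimal Python sketch of include/exclude filtering against either the raw table name or the FQN. The function `is_included` is hypothetical, and the use of `re.search` is an assumption; the exact matching semantics used by OpenMetadata may differ.

```python
import re
from typing import List, Optional


def is_included(
    table_name: str,
    fqn: str,
    includes: Optional[List[str]] = None,
    excludes: Optional[List[str]] = None,
    use_fqn: bool = False,
) -> bool:
    """Decide whether a table passes the filter (hypothetical helper)."""
    # useFqnForFiltering switches the string the patterns are tested against.
    target = fqn if use_fqn else table_name
    if excludes and any(re.search(p, target) for p in excludes):
        return False
    if includes:
        return any(re.search(p, target) for p in includes)
    return True


fqn = "local_bigquery.hello-world-1234.super_schema.abc"

# An anchored raw-name pattern matches the table name but not the FQN.
print(is_included("abc", fqn, includes=["^abc$"]))                # True
print(is_included("abc", fqn, includes=["^abc$"], use_fqn=True))  # False

# With FQN filtering, the pattern has to account for the full path.
print(is_included("abc", fqn, includes=[r"\.super_schema\.abc$"], use_fqn=True))  # True
```

Patterns written for raw names usually need to be revisited when switching `useFqnForFiltering` to `true`, because anchors such as `^` then apply to the service name rather than the table name.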

## Sample Auto Classification Workflow YAML

```yaml
source:
  type: bigquery
  serviceName: local_bigquery
  serviceConnection:
    config:
      type: BigQuery
      credentials:
        gcpConfig:
          type: service_account
          projectId: my-project-id-1234
          privateKeyId: privateKeyID
          privateKey: "-----BEGIN PRIVATE KEY-----\nmySuperSecurePrivateKey==\n-----END PRIVATE KEY-----\n"
          clientEmail: client@email.secure
          clientId: "1234567890"
          authUri: https://accounts.google.com/o/oauth2/auth
          tokenUri: https://oauth2.googleapis.com/token
          authProviderX509CertUrl: https://www.googleapis.com/oauth2/v1/certs
          clientX509CertUrl: https://www.googleapis.com/oauth2/v1/certs
  sourceConfig:
    config:
      type: AutoClassification
      storeSampleData: true
      enableAutoClassification: true
      databaseFilterPattern:
        includes:
          - hello-world-1234
      schemaFilterPattern:
        includes:
          - super_schema
      tableFilterPattern:
        includes:
          - abc

processor:
  type: "orm-profiler"
  config:
    tableConfig:
      - fullyQualifiedName: local_bigquery.hello-world-1234.super_schema.abc
        profileSample: 85
        partitionConfig:
          partitionQueryDuration: 180
        columnConfig:
          excludeColumns:
            - a
            - b

sink:
  type: metadata-rest
  config: {}
workflowConfig:
  # loggerLevel: INFO # DEBUG, INFO, WARN or ERROR
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata
    securityConfig:
      jwtToken: "eyJraWQiOiJHYjM4OWEtOWY3Ni1nZGpzLWE5MmotMDI0MmJrOTQzNTYiLCJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJzdWIiOiJhZG1pbiIsImlzQm90IjpmYWxzZSwiaXNzIjoib3Blbi1tZXRhZGF0YS5vcmciLCJpYXQiOjE2NjM5Mzg0NjIsImVtYWlsIjoiYWRtaW5Ab3Blbm1ldGFkYXRhLm9yZyJ9.tS8um_5DKu7HgzGBzS1VTA5uUjKWOCU0B_j08WXBiEC0mr0zNREkqVfwFDD-d24HlNEbrqioLsBuFRiwIWKc1m_ZlVQbG7P36RUxhuv2vbSp80FKyNM-Tj93FDzq91jsyNmsQhyNv_fNr3TXfzzSPjHt8Go0FMMP66weoKMgW2PbXlhVKwEuXUHyakLLzewm9UMeQaEiRzhiTMU3UkLXcKbYEJJvfNFcLwSl9W8JCO_l0Yj3ud-qt_nQYEZwqW6u5nfdQllN133iikV4fM5QZsMCnm8Rq1mvLR0y9bmJiD7fwM1tmJ791TUWqmKaTnP49U493VanKpUAfzIiOiIbhg"
```

## Workflow Execution

### To Execute the Auto Classification Workflow:

1. **Create a Pipeline**
   - Configure the Auto Classification workflow YAML as demonstrated in the sample above.

2. **Run the Ingestion Pipeline**
   - Use OpenMetadata or an external scheduler like Argo to trigger the pipeline execution.

3. **Validate Results**
   - Verify the metadata and tags applied to sensitive columns in the OpenMetadata UI.

### Expected Outcomes

- **Automatic Tagging:**
  Columns containing sensitive information (e.g., names, emails, SSNs) are automatically tagged when they meet the configured `confidence` threshold.

- **Enhanced Visibility:**
  Gain improved visibility and classification of sensitive data within your databases.

- **Sample Data Integration:**
  Store sample data to provide better insights during profiling and testing workflows.

@@ -31,7 +31,7 @@ alt="Column Data provides information"
caption="Column Data provides information"
/%}

You can read more about [Auto PII Tagging](/how-to-guides/data-quality-observability/profiler/auto-pii-tagging) here.
You can read more about [Auto PII Tagging](/how-to-guides/data-governance/classification/auto/auto-pii-tagging) here.

## Tag Mapping

@@ -0,0 +1,70 @@
---
title: Adding Auto Classification Workflow through UI
slug: /how-to-guides/data-governance/classification/auto/workflow
---

# Adding Auto Classification Ingestion through the UI

Follow these steps to configure Auto Classification ingestion via the OpenMetadata UI:

## 1. Navigate to the Database Service
- Go to **Settings > Services > Databases** in the OpenMetadata UI.
- Select the database service for which you want to configure Auto Classification ingestion.

{% image
src="/images/v1.6/how-to-guides/governance/ac-1.png"
alt="Settings"
caption="Settings"
/%}

{% image
src="/images/v1.6/how-to-guides/governance/ac-1.1.png"
alt="Services"
caption="Services"
/%}

{% image
src="/images/v1.6/how-to-guides/governance/ac-2.png"
alt="Databases"
caption="Databases"
/%}

## 2. Access the Ingestion Tab
- In the selected database service, navigate to the **Ingestion** tab.
- Click on the option to **Add Auto Classification Ingestion**, as shown in the example image.

{% image
src="/images/v1.6/how-to-guides/governance/ac-3.png"
alt="Access the Ingestion Tab"
caption="Access the Ingestion Tab"
/%}

## 3. Configure Auto Classification Details
- Fill in the details for your Auto Classification ingestion workflow.
- Each field's purpose is explained directly in the UI, allowing you to customize the configuration based on your requirements.

{% image
src="/images/v1.6/how-to-guides/governance/ac-4.png"
alt="Configure Auto Classification Details"
caption="Configure Auto Classification Details"
/%}

## 4. Set the Schedule
- Specify the time interval at which the Auto Classification ingestion should run.

{% image
src="/images/v1.6/how-to-guides/governance/ac-5.png"
alt="Set the Schedule"
caption="Set the Schedule"
/%}

## 5. Add the Ingestion Workflow
- Once all details are configured, click **Add Auto Classification Ingestion** to save and activate the workflow.

{% image
src="/images/v1.6/how-to-guides/governance/ac-6.png"
alt="Add the Ingestion Workflow"
caption="Add the Ingestion Workflow"
/%}

By following these steps, you can set up an Auto Classification ingestion workflow to automatically identify and tag sensitive data in your databases.
@@ -43,10 +43,4 @@ Watch the video to understand OpenMetadata’s native Data Profiler and Data Qua
href="/how-to-guides/data-quality-observability/profiler/external-workflow"%}
Run a single workflow profiler for the entire source externally.
{%/inlineCallout%}
{%inlineCallout
icon="MdOutlinePersonPin"
bold="Auto PII Tagging"
href="/how-to-guides/data-quality-observability/profiler/auto-pii-tagging"%}
Auto tag data as PII Sensitive/NonSensitive at the column level.
{%/inlineCallout%}
{%/inlineCalloutContainer%}
@@ -66,7 +66,7 @@ alt="Column Data provides information"
caption="Column Data provides information"
/%}

You can read more about [Auto PII Tagging](/how-to-guides/data-quality-observability/profiler/auto-pii-tagging) here.
You can read more about [Auto PII Tagging](/how-to-guides/data-governance/classification/auto/auto-pii-tagging) here.

{%inlineCallout
color="violet-70"
8 changes: 6 additions & 2 deletions openmetadata-docs/content/v1.6.x/menu.md
@@ -849,8 +849,6 @@ site_menu:
url: /how-to-guides/data-quality-observability/profiler/external-sample-data
- category: How-to Guides / Data Quality and Observability / Data Profiler / External Workflow
url: /how-to-guides/data-quality-observability/profiler/external-workflow
- category: How-to Guides / Data Quality and Observability / Data Profiler / Auto PII Tagging
url: /how-to-guides/data-quality-observability/profiler/auto-pii-tagging
- category: How-to Guides / Data Quality and Observability / Data Observability
url: /how-to-guides/data-quality-observability/observability
- category: How-to Guides / Data Quality and Observability / Data Observability / Observability Alerts
@@ -922,6 +920,12 @@ site_menu:
url: /how-to-guides/data-governance/classification/request-tags
- category: How-to Guides / Data Governance / Classification / Auto-Classification in OpenMetadata
url: /how-to-guides/data-governance/classification/auto
- category: How-to Guides / Data Governance / Classification / Auto-Classification in OpenMetadata / Workflow
url: /how-to-guides/data-governance/classification/auto/workflow
- category: How-to Guides / Data Governance / Classification / Auto-Classification in OpenMetadata / External Workflow
url: /how-to-guides/data-governance/classification/auto/external-workflow
- category: How-to Guides / Data Governance / Classification / Auto-Classification in OpenMetadata / Auto PII Tagging
url: /how-to-guides/data-governance/classification/auto/auto-pii-tagging
- category: How-to Guides / Data Governance / Classification / What are Tiers
url: /how-to-guides/data-governance/classification/tiers
- category: How-to Guides / Data Governance / Classification / Best Practices for Classification
8 changes: 6 additions & 2 deletions openmetadata-docs/content/v1.7.x-SNAPSHOT/collate-menu.md
@@ -689,8 +689,6 @@ site_menu:
- category: How-to Guides / Data Quality and Observability / Data Profiler / Sample Data
url: /how-to-guides/data-quality-observability/profiler/external-sample-data
- category: How-to Guides / Data Quality and Observability / Data Profiler / External Workflow
url: /how-to-guides/data-quality-observability/profiler/external-workflow
- category: How-to Guides / Data Quality and Observability / Data Profiler / Auto PII Tagging
url: /how-to-guides/data-quality-observability/profiler/auto-pii-tagging
- category: How-to Guides / Data Quality and Observability / Data Observability
url: /how-to-guides/data-quality-observability/observability
@@ -777,6 +775,12 @@ site_menu:
url: /how-to-guides/data-governance/classification/request-tags
- category: How-to Guides / Data Governance / Classification / Auto-Classification in OpenMetadata
url: /how-to-guides/data-governance/classification/auto
- category: How-to Guides / Data Governance / Classification / Auto-Classification in OpenMetadata / Workflow
url: /how-to-guides/data-governance/classification/auto/workflow
- category: How-to Guides / Data Governance / Classification / Auto-Classification in OpenMetadata / External Workflow
url: /how-to-guides/data-governance/classification/auto/external-workflow
- category: How-to Guides / Data Governance / Classification / Auto-Classification in OpenMetadata / Auto PII Tagging
url: /how-to-guides/data-governance/classification/auto/auto-pii-tagging
- category: How-to Guides / Data Governance / Classification / What are Tiers
url: /how-to-guides/data-governance/classification/tiers
- category: How-to Guides / Data Governance / Classification / Best Practices for Classification
@@ -0,0 +1,47 @@
---
title: Auto PII Tagging
slug: /how-to-guides/data-governance/classification/auto/auto-pii-tagging
---

# Auto PII Tagging

Auto PII tagging marks columns as Sensitive or NonSensitive using the two approaches described below.

{% note %}
PII Tagging is only available during `Profiler Ingestion`.
{% /note %}


## Tagging logic

1. **Column Name Scanner**: We validate the table's column names against a set of regex rules that match
   common English naming patterns for email addresses, SSNs, bank accounts, etc.
2. **Entity Recognition**: If sample data ingestion is enabled, we validate the sample rows against an Entity
   Recognition engine that flags any sensitive information from a list of [supported entities](https://microsoft.github.io/presidio/supported_entities/).
   In that case, the `confidence` parameter lets you tune the minimum score required to tag a column as `PII.Sensitive`.

Note that if a column is already tagged as `PII`, we skip it.
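
The column name scanner idea can be sketched as follows. The patterns in `NAME_RULES`, the labels, and the function `scan_column_name` are illustrative assumptions; they are not the rules that OpenMetadata ships.

```python
import re
from typing import Iterable, Optional

# Illustrative patterns only; not the actual OpenMetadata rule set.
NAME_RULES = {
    "EMAIL": re.compile(r"e[-_ ]?mail", re.IGNORECASE),
    "SSN": re.compile(r"(^|_)ssn($|_)|social[-_ ]?security", re.IGNORECASE),
    "BANK_ACCOUNT": re.compile(r"bank[-_ ]?acc(oun)?t|iban", re.IGNORECASE),
}


def scan_column_name(name: str, existing_tags: Iterable[str] = ()) -> Optional[str]:
    """Return the first matching label, or None (hypothetical helper)."""
    # Columns that already carry a PII tag are skipped.
    if any(tag.startswith("PII") for tag in existing_tags):
        return None
    for label, pattern in NAME_RULES.items():
        if pattern.search(name):
            return label
    return None


print(scan_column_name("customer_email"))            # EMAIL
print(scan_column_name("user_ssn"))                  # SSN
print(scan_column_name("session_id"))                # None
print(scan_column_name("email", ["PII.Sensitive"]))  # None
```

Name-based matching is cheap because it never touches row data, which is why it can run even when sample data ingestion is disabled.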

## Troubleshooting

### SSL: CERTIFICATE_VERIFY_FAILED

If you see an error similar to:

```
Unexpected error while processing sample data for auto pii tagging - HTTPSConnectionPool(host='raw.githubusercontent.com', port=443):
Max retries exceeded with url: /explosion/spacy-models/master/compatibility.json
(Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to
get local issuer certificate (_ssl.c:1129)')))
```

We identified this scenario on some corporate Windows laptops. The profiler is trying to download the
Entity Recognition model, but the request fails certificate verification.

A solution here is to manually download the model on the ingestion container / Airflow host by running:

```
pip --trusted-host github.com --trusted-host objects.githubusercontent.com install https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.5.0/en_core_web_md-3.5.0.tar.gz
```

If using Docker, you might want to customize the `openmetadata-ingestion` image to have this command run there by default.
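
To confirm that the manual install worked, one option is to check whether the model package is importable on the ingestion host. spaCy models are installed as regular Python packages, so the standard library is enough for this check. The function name `model_is_installed` is hypothetical; the package name `en_core_web_md` is taken from the install command above.

```python
import importlib.util


def model_is_installed(package: str = "en_core_web_md") -> bool:
    """Return True if the given package can be found by the import system."""
    return importlib.util.find_spec(package) is not None


if model_is_installed():
    print("Entity Recognition model found; no download should be needed.")
else:
    print("Model not found; run the pip install command on this host.")
```

Run this with the same Python interpreter that executes the ingestion workflow, since a model installed into a different environment will not be visible.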