open-metadata · pmbrull · Oct 30, 2023
diff --git a/openmetadata-docs/content/partials/v1.2/connectors/storage/manifest.md b/openmetadata-docs/content/partials/v1.2/connectors/storage/manifest.md
@@ -111,6 +111,11 @@ Again, this information will be added on top of the inferred schema from the dat
 }
 ```
 
+{% /codeBlock %}
+
+{% /codePreview %}
+
+
 ### Global Manifest
 
 You can also manage a **single** manifest file to centralize the ingestion process for any container. In that case,

diff --git a/...adata-docs/content/partials/v1.2/connectors/yaml/dashboard/source-config-def.md b/...adata-docs/content/partials/v1.2/connectors/yaml/dashboard/source-config-def.md
@@ -0,0 +1,15 @@
+#### Source Configuration - Source Config
+
+{% codeInfo srNumber=100 %}
+
+The `sourceConfig` is defined [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/dashboardServiceMetadataPipeline.json):
+
+- **dbServiceNames**: Database Service Names for ingesting lineage if the source supports it.
+- **dashboardFilterPattern**, **chartFilterPattern**, **dataModelFilterPattern**: Note that all of them support regex as include or exclude. E.g., "My dashboard, My dash.*, .*Dashboard".
+- **projectFilterPattern**: Filter the dashboards, charts and data sources by projects. Note that all of them support regex as include or exclude. E.g., "My project, My proj.*, .*Project".
+- **includeOwners**: Set the 'Include Owners' toggle to control whether owners are added to the ingested entity during metadata ingestion when the owner's email matches a user stored in the OM server. If the ingested entity already exists and has an owner, the owner will not be overwritten.
+- **includeTags**: Set the 'Include Tags' toggle to control whether to include tags in metadata ingestion.
+- **includeDataModels**: Set the 'Include Data Models' toggle to control whether to include data models as part of metadata ingestion.
+- **markDeletedDashboards**: Set the 'Mark Deleted Dashboards' toggle to flag dashboards as soft-deleted if they are not present anymore in the source system.
+
+{% /codeInfo %}
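
As a rough illustration of the options listed above, a `sourceConfig` using regex filters might look like the sketch below. The patterns reuse the examples from the list; the boolean values are illustrative assumptions, not recommended defaults, and how each regex is matched against names should be confirmed in the connector documentation.

```yaml
sourceConfig:
  config:
    type: DashboardMetadata
    dashboardFilterPattern:
      includes:
        - "My dash.*"     # intended to select names such as "My dashboard"
        - ".*Dashboard"   # intended to select names ending in "Dashboard"
    projectFilterPattern:
      excludes:
        - ".*Project"     # intended to skip projects ending in "Project"
    # Toggle values below are assumptions for illustration only
    includeOwners: true
    includeTags: true
    includeDataModels: true
    markDeletedDashboards: true
```

Quoting the patterns keeps characters such as `*` from being misread by the YAML parser.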
diff --git a/openmetadata-docs/content/partials/v1.2/connectors/yaml/dashboard/source-config.md b/openmetadata-docs/content/partials/v1.2/connectors/yaml/dashboard/source-config.md
@@ -0,0 +1,30 @@
+```yaml {% srNumber=100 %}
+  sourceConfig:
+    config:
+      type: DashboardMetadata
+      overrideOwner: True
+      # dbServiceNames:
+      #   - service1
+      #   - service2
+      # dashboardFilterPattern:
+      #   includes:
+      #     - dashboard1
+      #     - dashboard2
+      #   excludes:
+      #     - dashboard3
+      #     - dashboard4
+      # chartFilterPattern:
+      #   includes:
+      #     - chart1
+      #     - chart2
+      #   excludes:
+      #     - chart3
+      #     - chart4
+      # projectFilterPattern:
+      #   includes:
+      #     - project1
+      #     - project2
+      #   excludes:
+      #     - project3
+      #     - project4
+```
diff --git a/openmetadata-docs/content/partials/v1.2/connectors/yaml/data-profiler.md b/openmetadata-docs/content/partials/v1.2/connectors/yaml/data-profiler.md
@@ -0,0 +1,224 @@
+## Data Profiler
+
+The Data Profiler workflow uses the `orm-profiler` processor.
+
+After running a Metadata Ingestion workflow, we can run the Data Profiler workflow.
+The `serviceName` must be the same one used in the Metadata Ingestion workflow, so that the ingestion bot can get the `serviceConnection` details from the server.
+
+
+### 1. Define the YAML Config
+
+This is a sample config for the profiler:
+
+{% codePreview %}
+
+{% codeInfoContainer %}
+
+{% codeInfo srNumber=13 %}
+#### Source Configuration - Source Config
+
+You can find all the definitions and types for the `sourceConfig` [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/databaseServiceProfilerPipeline.json).
+
+**generateSampleData**: Option to turn on/off generating sample data.
+
+{% /codeInfo %}
+
+{% codeInfo srNumber=14 %}
+
+**profileSample**: Percentage of data or number of rows on which to run the profiler and tests.
+
+{% /codeInfo %}
+
+{% codeInfo srNumber=15 %}
+
+**threadCount**: Number of threads to use during metric computations.
+
+{% /codeInfo %}
+
+{% codeInfo srNumber=16 %}
+
+**processPiiSensitive**: Optional configuration to automatically tag columns that might contain sensitive information.
+
+{% /codeInfo %}
+
+{% codeInfo srNumber=17 %}
+
+**confidence**: Confidence threshold a column must reach to be marked as sensitive when `processPiiSensitive` is enabled.
+
+{% /codeInfo %}
+
+
+{% codeInfo srNumber=18 %}
+
+**timeoutSeconds**: Profiler Timeout in Seconds
+
+{% /codeInfo %}
+
+{% codeInfo srNumber=19 %}
+
+**databaseFilterPattern**: Regex to only fetch databases that match the pattern.
+
+{% /codeInfo %}
+
+{% codeInfo srNumber=20 %}
+
+**schemaFilterPattern**: Regex to only fetch schemas that match the pattern.
+
+{% /codeInfo %}
+
+{% codeInfo srNumber=21 %}
+
+**tableFilterPattern**: Regex to only fetch tables that match the pattern.
+
+{% /codeInfo %}
+
+{% codeInfo srNumber=22 %}
+
+#### Processor Configuration
+
+Choose the `orm-profiler`. Its config can also be updated to define tests from the YAML itself instead of the UI:
+
+**tableConfig**: `tableConfig` allows you to set up some configuration at the table level.
+{% /codeInfo %}
+
+
+{% codeInfo srNumber=23 %}
+
+#### Sink Configuration
+
+To send the metadata to OpenMetadata, it needs to be specified as `type: metadata-rest`.
+{% /codeInfo %}
+
+
+{% partial file="/v1.2/connectors/yaml/workflow-config-def.md" /%}
+
+{% /codeInfoContainer %}
+
+{% codeBlock fileName="filename.yaml" %}
+
+
+```yaml
+source:
+  type: {% $connector %}
+  serviceName: local_athena
+  sourceConfig:
+    config:
+      type: Profiler
+```
+
+```yaml {% srNumber=13 %}
+      generateSampleData: true
+```
+```yaml {% srNumber=14 %}
+      # profileSample: 85
+```
+```yaml {% srNumber=15 %}
+      # threadCount: 5
+```
+```yaml {% srNumber=16 %}
+      processPiiSensitive: false
+```
+```yaml {% srNumber=17 %}
+      # confidence: 80
+```
+```yaml {% srNumber=18 %}
+      # timeoutSeconds: 43200
+```
+```yaml {% srNumber=19 %}
+      # databaseFilterPattern:
+      #   includes:
+      #     - database1
+      #     - database2
+      #   excludes:
+      #     - database3
+      #     - database4
+```
+```yaml {% srNumber=20 %}
+      # schemaFilterPattern:
+      #   includes:
+      #     - schema1
+      #     - schema2
+      #   excludes:
+      #     - schema3
+      #     - schema4
+```
+```yaml {% srNumber=21 %}
+      # tableFilterPattern:
+      #   includes:
+      #     - table1
+      #     - table2
+      #   excludes:
+      #     - table3
+      #     - table4
+```
+
+```yaml {% srNumber=22 %}
+processor:
+  type: orm-profiler
+  config: {}  # Remove braces if adding properties
+    # tableConfig:
+    #   - fullyQualifiedName: <table fqn>
+    #     profileSample: <number between 0 and 99> # default will be 100 if omitted
+    #     profileQuery: <query to use for sampling data for the profiler>
+    #     columnConfig:
+    #       excludeColumns:
+    #         - <column name>
+    #       includeColumns:
+    #         - columnName: <column name>
+    #           metrics:
+    #             - MEAN
+    #             - MEDIAN
+    #             - ...
+    #     partitionConfig:
+    #       enablePartitioning: <set to true to use partitioning>
+    #       partitionColumnName: <partition column name>
+    #       partitionIntervalType: <TIME-UNIT, INTEGER-RANGE, INGESTION-TIME, COLUMN-VALUE>
+    #       Pick one of the variations shown below
+    #       ----'TIME-UNIT' or 'INGESTION-TIME'-------
+    #       partitionInterval: <partition interval>
+    #       partitionIntervalUnit: <YEAR, MONTH, DAY, HOUR>
+    #       ------------'INTEGER-RANGE'---------------
+    #       partitionIntegerRangeStart: <integer>
+    #       partitionIntegerRangeEnd: <integer>
+    #       -----------'COLUMN-VALUE'----------------
+    #       partitionValues:
+    #         - <value>
+    #         - <value>
+
+```
+
+```yaml {% srNumber=23 %}
+sink:
+  type: metadata-rest
+  config: {}
+```
+
+{% partial file="/v1.2/connectors/yaml/workflow-config.md" /%}
+
+{% /codeBlock %}
+
+{% /codePreview %}
+
+- You can learn more about how to configure and run the Profiler Workflow to extract profiler data and run Data Quality tests [here](/connectors/ingestion/workflows/profiler).
+
+### 2. Run with the CLI
+
+After saving the YAML config, we will run the command the same way we did for the metadata ingestion:
+
+```bash
+metadata profile -c <path-to-yaml>
+```
+
+Note that instead of running `ingest`, we are now using the `profile` command to select the Profiler workflow.
+
+{% tilesContainer %}
+
+{% tile
+title="Data Profiler"
+description="Find more information about the Data Profiler here"
+link="/connectors/ingestion/workflows/profiler"
+/ %}
+
+{% /tilesContainer %}
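
The commented `tableConfig` template in the processor block of this file could be filled in roughly as follows. This is a sketch only: the table name, column names, and sample values are hypothetical, and it assumes each `includeColumns` entry pairs a `columnName` with its own `metrics` list and that the `TIME-UNIT` partition variation is the one in use.

```yaml
processor:
  type: orm-profiler
  config:
    tableConfig:
      # Hypothetical fully qualified table name
      - fullyQualifiedName: my_service.my_database.my_schema.orders
        profileSample: 50
        columnConfig:
          excludeColumns:
            - internal_notes        # hypothetical column to skip
          includeColumns:
            - columnName: amount    # hypothetical column to profile
              metrics:
                - MEAN
                - MEDIAN
        partitionConfig:
          enablePartitioning: true
          partitionColumnName: created_at   # hypothetical partition column
          partitionIntervalType: TIME-UNIT
          partitionInterval: 1
          partitionIntervalUnit: DAY
```

Note that once properties are added, the empty braces after `config:` are removed, as the inline comment in the template indicates.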
diff --git a/...tadata-docs/content/partials/v1.2/connectors/yaml/database/source-config-def.md b/...tadata-docs/content/partials/v1.2/connectors/yaml/database/source-config-def.md
@@ -0,0 +1,15 @@
+#### Source Configuration - Source Config
+
+{% codeInfo srNumber=100 %}
+
+The `sourceConfig` is defined [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/databaseServiceMetadataPipeline.json):
+
+**markDeletedTables**: To flag tables as soft-deleted if they are not present anymore in the source system.
+
+**includeTables**: true or false, to ingest table data. Default is true.
+
+**includeViews**: true or false, to ingest view definitions.
+
+**databaseFilterPattern**, **schemaFilterPattern**, **tableFilterPattern**: Note that the filter supports regex as include or exclude. You can find examples [here](/connectors/ingestion/workflows/metadata/filter-patterns/database)
+
+{% /codeInfo %}
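
Because the filter patterns accept regular expressions, they can describe naming conventions rather than exact names. The sketch below assumes a hypothetical convention where relevant schemas start with `sales_` and scratch tables end with `_tmp`; the flag values are illustrative only.

```yaml
sourceConfig:
  config:
    type: DatabaseMetadata
    markDeletedTables: true
    includeTables: true
    includeViews: false
    schemaFilterPattern:
      includes:
        - "^sales_.*"   # hypothetical schema naming convention
    tableFilterPattern:
      excludes:
        - ".*_tmp$"     # hypothetical suffix for scratch tables
```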
diff --git a/openmetadata-docs/content/partials/v1.2/connectors/yaml/database/source-config.md b/openmetadata-docs/content/partials/v1.2/connectors/yaml/database/source-config.md
@@ -0,0 +1,30 @@
+```yaml {% srNumber=100 %}
+  sourceConfig:
+    config:
+      type: DatabaseMetadata
+      markDeletedTables: true
+      includeTables: true
+      includeViews: true
+      # includeTags: true
+      # databaseFilterPattern:
+      #   includes:
+      #     - database1
+      #     - database2
+      #   excludes:
+      #     - database3
+      #     - database4
+      # schemaFilterPattern:
+      #   includes:
+      #     - schema1
+      #     - schema2
+      #   excludes:
+      #     - schema3
+      #     - schema4
+      # tableFilterPattern:
+      #   includes:
+      #     - users
+      #     - type_test
+      #   excludes:
+      #     - table3
+      #     - table4
+```
diff --git a/openmetadata-docs/content/partials/v1.2/connectors/yaml/ingestion-cli.md b/openmetadata-docs/content/partials/v1.2/connectors/yaml/ingestion-cli.md
@@ -0,0 +1,10 @@
+### 2. Run with the CLI
+
+First, we will need to save the YAML file. Afterward, and with all requirements installed, we can run:
+
+```bash
+metadata ingest -c <path-to-yaml>
+```
+
+Note that from connector to connector, this recipe will always be the same. By updating the YAML configuration,
+you will be able to extract metadata from different sources.
diff --git a/openmetadata-docs/content/partials/v1.2/connectors/yaml/ingestion-sink-def.md b/openmetadata-docs/content/partials/v1.2/connectors/yaml/ingestion-sink-def.md
@@ -0,0 +1,7 @@
+#### Sink Configuration
+
+{% codeInfo srNumber=200 %}
+
+To send the metadata to OpenMetadata, it needs to be specified as `type: metadata-rest`.
+
+{% /codeInfo %}
diff --git a/openmetadata-docs/content/partials/v1.2/connectors/yaml/ingestion-sink.md b/openmetadata-docs/content/partials/v1.2/connectors/yaml/ingestion-sink.md
@@ -0,0 +1,5 @@
+```yaml {% srNumber=200 %}
+sink:
+  type: metadata-rest
+  config: {}
+```
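
Taken together, the source and sink fragments from these partials are meant to sit in one workflow file. The skeleton below is a sketch of that layout, with placeholders where connector-specific values go; the `serviceConnection` and `workflowConfig` sections are deliberately left out because their contents depend on the connector and on the workflow-config partial.

```yaml
source:
  type: <connector type>
  serviceName: <service name>
  # serviceConnection: connector-specific, omitted in this sketch
  sourceConfig:
    config:
      type: DatabaseMetadata
sink:
  type: metadata-rest
  config: {}
# workflowConfig: omitted in this sketch; see the workflow-config partial
```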
diff --git a/...adata-docs/content/partials/v1.2/connectors/yaml/messaging/source-config-def.md b/...adata-docs/content/partials/v1.2/connectors/yaml/messaging/source-config-def.md
@@ -0,0 +1,11 @@
+#### Source Configuration - Source Config
+
+{% codeInfo srNumber=100 %}
+
+The sourceConfig is defined [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/messagingServiceMetadataPipeline.json):
+
+**generateSampleData:** Option to turn on/off generating sample data during metadata extraction.
+
+**topicFilterPattern:** Note that the `topicFilterPattern` supports regex as include or exclude.
+
+{% /codeInfo %}
diff --git a/openmetadata-docs/content/partials/v1.2/connectors/yaml/messaging/source-config.md b/openmetadata-docs/content/partials/v1.2/connectors/yaml/messaging/source-config.md
@@ -0,0 +1,11 @@
+```yaml {% srNumber=100 %}
+  sourceConfig:
+    config:
+      type: MessagingMetadata
+      topicFilterPattern:
+        excludes:
+          - _confluent.*
+        # includes:
+        #   - topic1
+      # generateSampleData: true
+```
diff --git a/...tadata-docs/content/partials/v1.2/connectors/yaml/ml-model/source-config-def.md b/...tadata-docs/content/partials/v1.2/connectors/yaml/ml-model/source-config-def.md
@@ -0,0 +1,9 @@
+#### Source Configuration - Source Config
+
+{% codeInfo srNumber=100 %}
+
+The sourceConfig is defined [here](https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/metadataIngestion/mlmodelServiceMetadataPipeline.json):
+
+**markDeletedMlModels**: Set the 'Mark Deleted Ml Models' toggle to flag ML models as soft-deleted if they are no longer present in the source system.
+
+{% /codeInfo %}