diff --git a/docs-website/sidebars.js b/docs-website/sidebars.js index 1f2208bc17978..30d58d9948e8e 100644 --- a/docs-website/sidebars.js +++ b/docs-website/sidebars.js @@ -79,6 +79,11 @@ module.exports = { id: "docs/managed-datahub/observe/volume-assertions", className: "saasOnly", }, + { + label: "Open Assertion Specification", + type: "doc", + id: "docs/assertions/open-assertions-spec", + }, ], }, { diff --git a/docs/observability/open-assertions-spec.md b/docs/assertions/open-assertions-spec.md similarity index 75% rename from docs/observability/open-assertions-spec.md rename to docs/assertions/open-assertions-spec.md index 8f2e1d3ee0856..3418fed440efb 100644 --- a/docs/observability/open-assertions-spec.md +++ b/docs/assertions/open-assertions-spec.md @@ -11,7 +11,7 @@ assertion engine without service disruption for end consumers of the results of Currently, the DataHub Open Assertions Specification supports the following integrations: -- Snowflake via [Snowflake DMFs](https://docs.snowflake.com/en/user-guide/data-quality-intro) +- [Snowflake DMF Assertions](snowflake/snowflake_dmfs.md) And is looking for contributions to build out support for the following integrations: @@ -92,7 +92,7 @@ assertions: ``` This assertion checks that the `purchase_events` table in the `test_db.public` schema has between 1000 and 10000 records. -Using the `operator` field, you can specify the type of comparison to be made, and the `min` and `max` fields to specify the range of values to compare against. +Using the `condition` field, you can specify the type of comparison to be made, and the `min` and `max` fields to specify the range of values to compare against. Using the `filters` field, you can optionally specify a SQL WHERE clause to filter the records being counted. Using the `schedule` field you can specify when the assertion should be run, either on a fixed schedule or when new data is added to the table. The only metric currently supported is `row_count`. 
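For quick reference, a volume assertion combining the fields described above (`condition`, `min`, `max`, `filters`, `schedule`, and the `row_count` metric) might be sketched as follows. The exact layout is illustrative and should be checked against the full examples in the spec; the entity URN and the filter clause are hypothetical.

```yaml
# Hypothetical volume assertion sketch: purchase_events must have
# between 1000 and 10000 records. Field layout is illustrative.
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: volume
    metric: row_count              # the only metric currently supported
    condition:
      type: between
      min: 1000
      max: 10000
    filters:
      sql: "purchase_amount > 0"   # optional SQL WHERE clause (hypothetical)
    schedule:
      type: on_table_change        # or a fixed cron schedule
```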
@@ -130,9 +130,9 @@ assertions: ``` -#### Supported Operators +#### Supported Conditions -The full set of supported volume assertion operators include: +The supported volume assertion conditions include: - `equal_to` - `not_equal_to` @@ -177,7 +177,7 @@ assertions: ``` This assertion checks that all values for the `amount` column in the `purchase_events` table in the `test_db.public` schema have values between 0 and 10. -Using the `field` field, you can specify the column to be asserted on, and using the `operator` field, you can specify the type of comparison to be made, +Using the `field` field, you can specify the column to be asserted on, and using the `condition` field, you can specify the type of comparison to be made, and the `min` and `max` fields to specify the range of values to compare against. Using the `schedule` field you can specify when the assertion should be run, either on a fixed schedule or when new data is added to the table. Using the `filters` field, you can optionally specify a SQL WHERE clause to filter the records being counted. 
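Analogously, a field values assertion using the `field`, `condition`, `filters`, and `schedule` fields described above might look like the following sketch. As before, the layout is illustrative and the URN and filter are hypothetical.

```yaml
# Hypothetical field values assertion sketch: every amount value
# must fall between 0 and 10. Field layout is illustrative.
version: 1
assertions:
  - entity: urn:li:dataset:(urn:li:dataPlatform:snowflake,test_db.public.purchase_events,PROD)
    type: field
    field: amount                   # the column being asserted on
    condition:
      type: between
      min: 0
      max: 10
    filters:
      sql: "amount IS NOT NULL"     # optional SQL WHERE clause (hypothetical)
    schedule:
      type: on_table_change
```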
@@ -232,9 +232,9 @@ assertions: type: on_table_change ``` -#### Field Values Assertion: Supported Operators +#### Field Values Assertion: Supported Conditions -The full set of supported field value operators include: +The supported field value conditions include: - `in` - `not_in` @@ -314,7 +314,7 @@ This assertion ensures that the `name` column in the `purchase_events` table in #### Field Metric Assertion: Supported Metrics -The full set of supported field metric operators include: +The supported field metrics include: - `null_count` - `null_percentage` @@ -332,9 +332,9 @@ The full set of supported field metric operators include: - `zero_count` - `zero_percentage` -### Field Metric Assertion: Supported Operators +#### Field Metric Assertion: Supported Conditions -The full set of supported field metric operators include: +The supported field metric conditions include: - `equal_to` - `not_equal_to` @@ -397,9 +397,9 @@ assertions: This assertion checks that the number of rows in the `purchase_events` exactly matches the number of rows in an upstream `purchase_events_raw` table by subtracting the row count of the raw table from the row count of the processed table. -#### Supported Operators +#### Supported Conditions -The full set of supported custom SQL assertion operators include: +The supported custom SQL assertion conditions include: - `equal_to` - `not_equal_to` @@ -484,112 +484,3 @@ The following high-level data types are currently supported by the Schema Assert - union - bytes - enum - - - -## Snowflake - -The DataHub Open Assertion Compiler allows you to define your Data Quality assertions in a simple YAML format, and then compile them to be executed by Snowflake Data Metric Functions. -Once compiled, you'll be able to register the compiled DMFs in your Snowflake environment, and extract their results them as part of your normal ingestion process for DataHub. 
-Results of Snowflake DMF assertions will be reported as normal Assertion Results, viewable on a historical timeline in the context -of the table with which they are associated. - -### Prerequisites - -- You must have a Snowflake Enterprise account, where the DMFs feature is enabled. -- You must have the necessary permissions to create and run DMFs in your Snowflake environment. -- You must have the necessary permissions to query the DMF results in your Snowflake environment. - -According to the latest Snowflake docs, here are the permissions the service account performing the -DMF registration and ingestion must have: - -| Privilege | Object | Notes | -|--------------------------------|------------------|------------------------------------------------------------------------------------------------------------| -| EXECUTE DATA METRIC FUNCTION | Account | This privilege enables you to control which roles have access to server-agnostic compute resources to call the system DMF. | -| USAGE | Database, schema | These objects are the database and schema that contain the referenced table in the query. | - - -To learn more about Snowflake DMFs, see the [Snowflake documentation](https://docs.snowflake.com/en/user-guide/data-quality-intro). - -### Supported Assertion Types - -The following assertion types are currently supported by the DataHub Snowflake DMF Assertion Compiler: - -- [Freshness](/docs/managed-datahub/observe/freshness-assertions.md) -- [Volume](/docs/managed-datahub/observe/volume-assertions.md) -- [Column](/docs/managed-datahub/observe/column-assertions.md) -- [Custom SQL](/docs/managed-datahub/observe/custom-sql-assertions.md) - -Note that Schema Assertions are not currently supported. - -### Creating Snowflake DMF Assertions - -The process for declaring and running assertions backend by Snowflake DMFs consists of a few steps, which will be outlined -in the following sections. - - -#### Step 1. 
Define your Data Quality assertions using Assertion YAML files - -See the section **Declaring Assertions in YAML** below for examples of how to define assertions in YAML. - - -#### Step 2. Register your assertions with DataHub - -Use the DataHub CLI to register your assertions with DataHub, so they become visible in the DataHub UI: - -```bash -datahub assertions upsert -f examples/library/assertions_configuration.yml -``` - - -#### Step 3. Compile the assertions into Snowflake DMFs using the DataHub CLI - -Next, we'll use the `assertions compile` command to generate the SQL code for the Snowflake DMFs, -which can then be registered in Snowflake. - -```bash -datahub assertions compile -f examples/library/assertions_configuration.yml -p snowflake -``` - -This will generate . - - -#### Step 4. Register the compiled DMFs in your Snowflake environment - - - - -#### Step 5. Run ingestion to report the results back into DataHub - -Once you've registered the DMFs, they will be automatically executed, either when the target table is updated or on a fixed -schedule. - -To report the results of the generated Data Quality assertions back into DataHub, you'll need to run the DataHub ingestion process with a special configuration -flag: `include_assertion_results: true`: - -```yaml -# Your DataHub Snowflake Recipe -source: - type: snowflake - config: - # ... - include_assertion_results: True - # ... -``` - -This will query the DMF results store in Snowflake, convert them into DataHub Assertion Results, and report the results back into DataHub during your ingestion process -either via CLI or the UI. - -`datahub ingest -c snowflake.yml` - -## dbt test - -Seeking contributions! - -## Great Expectations - -Seeking contributions! - -## Soda SQL - -Seeking contributions! 
\ No newline at end of file diff --git a/docs/assertions/snowflake/snowflake_dmfs.md b/docs/assertions/snowflake/snowflake_dmfs.md new file mode 100644 index 0000000000000..bf62537728073 --- /dev/null +++ b/docs/assertions/snowflake/snowflake_dmfs.md @@ -0,0 +1,167 @@ +## Snowflake DMF Assertions [BETA] + +The DataHub Open Assertion Compiler allows you to define your Data Quality assertions in a simple YAML format, and then compile them to be executed by Snowflake Data Metric Functions. +Once compiled, you'll be able to register the compiled DMFs in your Snowflake environment, and extract their results as part of your normal ingestion process for DataHub. +Results of Snowflake DMF assertions will be reported as normal Assertion Results, viewable on a historical timeline in the context +of the table with which they are associated. + +### Prerequisites + +- You must have a Snowflake Enterprise account, where the DMFs feature is enabled. +- You must have the necessary permissions to provision DMFs in your Snowflake environment (see below). +- You must have the necessary permissions to query the DMF results in your Snowflake environment (see below). + +#### Provisioning DMFs + +According to the latest Snowflake docs, here are the permissions the service account performing the +DMF registration and ingestion must have: + +| Privilege | Object | Notes | +|------------------------------|------------------|----------------------------------------------------------------------------------------------------------------------------| +| EXECUTE DATA METRIC FUNCTION | Account | This privilege enables you to control which roles have access to server-agnostic compute resources to call the system DMF. | +| USAGE | Database, schema | These objects are the database and schema that contain the referenced table in the query. | +| USAGE | Database, schema | Database and schema where Snowflake DMFs will be created. This is configured via the `compile` command described below. 
| +| USAGE | DMF | This privilege enables you to use the registered DMF. | +| OWNERSHIP | Table | This privilege enables you to associate a DMF with a referenced table. | +| CREATE FUNCTION | Schema | This privilege enables creating new DMFs in the schema. | + + +#### Querying DMF Results + +In addition, the service account that will be executing DataHub ingestion and querying the DMF results must be granted the following system application role: + +| Role | Notes | +|--------------------------------|-----------------------------| +| DATA_QUALITY_MONITORING_VIEWER | Query the DMF results table | + +To learn more about Snowflake DMFs and the privileges required to provision and query them, see the [Snowflake documentation](https://docs.snowflake.com/en/user-guide/data-quality-intro). + +### Supported Assertion Types + +The following assertion types are currently supported by the DataHub Snowflake DMF Assertion Compiler: + +- [Freshness](/docs/managed-datahub/observe/freshness-assertions.md) +- [Volume](/docs/managed-datahub/observe/volume-assertions.md) +- [Column](/docs/managed-datahub/observe/column-assertions.md) +- [Custom SQL](/docs/managed-datahub/observe/custom-sql-assertions.md) + +Note that Schema Assertions are not currently supported. + +### Creating Snowflake DMF Assertions + +The process for declaring and running assertions backed by Snowflake DMFs consists of a few steps, which will be outlined +in the following sections. + + +#### Step 1. Define your Data Quality assertions using Assertion YAML files + +See the section **Declaring Assertions in YAML** below for examples of how to define assertions in YAML. + + +#### Step 2. Register your assertions with DataHub + +Use the DataHub CLI to register your assertions with DataHub, so they become visible in the DataHub UI: + +```bash +datahub assertions upsert -f examples/library/assertions_configuration.yml +``` + + +#### Step 3. 
Compile the assertions into Snowflake DMFs using the DataHub CLI + +Next, we'll use the `assertions compile` command to generate the SQL code for the Snowflake DMFs, +which can then be registered in Snowflake. + +```bash +datahub assertions compile -f examples/library/assertions_configuration.yml -p snowflake -x DMF_SCHEMA=<database>.<schema> +``` + +Two files will be generated as output of running this command: + +- `dmf_definitions.sql`: This file contains the SQL code for the DMFs that will be registered in Snowflake. +- `dmf_associations.sql`: This file contains the SQL code for associating the DMFs with the target tables in Snowflake. + +By default, these files are written to a folder called `target`. You can use the `-o <output-folder>` option of the `compile` command to write these artifacts to another folder. + +Each of these artifacts will be important for the next steps in the process. + +_dmf_definitions.sql_ + +This file stores the SQL code for the DMFs that will be registered in Snowflake, generated +from your YAML assertion definitions during the compile step. + +```sql +-- Example dmf_definitions.sql + +-- Start of Assertion 5c32eef47bd763fece7d21c7cbf6c659 + + CREATE or REPLACE DATA METRIC FUNCTION + test_db.datahub_dmfs.datahub__5c32eef47bd763fece7d21c7cbf6c659 (ARGT TABLE(col_date DATE)) + RETURNS NUMBER + COMMENT = 'Created via DataHub for assertion urn:li:assertion:5c32eef47bd763fece7d21c7cbf6c659 of type volume' + AS + $$ + select case when metric <= 1000 then 1 else 0 end from (select count(*) as metric from TEST_DB.PUBLIC.TEST_ASSERTIONS_ALL_TIMES ) + $$; + +-- End of Assertion 5c32eef47bd763fece7d21c7cbf6c659 +.... +``` + +_dmf_associations.sql_ + +This file stores the SQL code for associating the generated DMFs with the target table, +along with scheduling them to run at particular times. 
```sql +-- Example dmf_associations.sql + +-- Start of Assertion 5c32eef47bd763fece7d21c7cbf6c659 + + ALTER TABLE TEST_DB.PUBLIC.TEST_ASSERTIONS_ALL_TIMES SET DATA_METRIC_SCHEDULE = 'TRIGGER_ON_CHANGES'; + ALTER TABLE TEST_DB.PUBLIC.TEST_ASSERTIONS_ALL_TIMES ADD DATA METRIC FUNCTION test_db.datahub_dmfs.datahub__5c32eef47bd763fece7d21c7cbf6c659 ON (col_date); + +-- End of Assertion 5c32eef47bd763fece7d21c7cbf6c659 +.... +``` + + +#### Step 4. Register the compiled DMFs in your Snowflake environment + +Next, you'll need to run the SQL in the files generated in Step 3 against Snowflake. + +You can achieve this either by running the SQL files directly in the Snowflake UI, or by using the SnowSQL CLI tool: + +```bash +snowsql -f dmf_definitions.sql +snowsql -f dmf_associations.sql +``` + + +#### Step 5. Run ingestion to report the results back into DataHub + +Once you've registered the DMFs, they will be automatically executed, either when the target table is updated or on a fixed +schedule. + +To report the results of the generated Data Quality assertions back into DataHub, you'll need to run the DataHub ingestion process with a special configuration +flag: `include_assertion_results: true`: + +```yaml +# Your DataHub Snowflake Recipe +source: + type: snowflake + config: + # ... + include_assertion_results: True + # ... +``` + +During ingestion, we will query the latest DMF results stored in Snowflake, convert them into DataHub Assertion Results, and report them back into DataHub, +where they will be visible as normal assertion results. Ingestion can be run either via the CLI or the UI. + +`datahub ingest -c snowflake.yml` + + +### FAQ + +Coming soon! \ No newline at end of file
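Putting the five steps together, the end-to-end Snowflake DMF workflow can be summarized with the commands shown above. The `DMF_SCHEMA` value and the `target/` output paths are placeholders; substitute your own values.

```shell
# 1-2. Define assertions in YAML, then register them with DataHub
datahub assertions upsert -f examples/library/assertions_configuration.yml

# 3. Compile the assertions into Snowflake DMF SQL artifacts
#    (DMF_SCHEMA is the database.schema where the DMFs will be created)
datahub assertions compile -f examples/library/assertions_configuration.yml \
  -p snowflake -x DMF_SCHEMA=<database>.<schema>

# 4. Register the compiled DMFs in Snowflake via the SnowSQL CLI
snowsql -f target/dmf_definitions.sql
snowsql -f target/dmf_associations.sql

# 5. Run Snowflake ingestion to report DMF results back to DataHub
#    (the recipe must set include_assertion_results: true)
datahub ingest -c snowflake.yml
```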