Skip to content

Latest commit

 

History

History
342 lines (228 loc) · 11.9 KB

data_checks.md

File metadata and controls

342 lines (228 loc) · 11.9 KB

bqetl Data Checks

Instructions on how to add data checks can be found in the Adding data checks section below.

Background

To create more confidence and trust in our data is crucial to provide some form of data checks. These checks should uncover problems as soon as possible, ideally as part of the data process creating the data. This includes checking that the data produced follows certain assumptions determined by the dataset owner. These assumptions need to be easy to define, but at the same time flexible enough to encode more complex business logic. For example, checks for null columns, for range/size properties, duplicates, table grain etc.

bqetl Data Checks to the Rescue

bqetl data checks aim to provide this ability by providing a simple interface for specifying our "assumptions" about the data the query should produce and checking them against the actual result.

This easy interface is achieved by providing a number of jinja templates providing "out-of-the-box" logic for performing a number of common checks without having to rewrite the logic. For example, checking if any nulls are present in a specific column. These templates can be found here and are available as jinja macros inside the checks.sql files. This allows to "configure" the logic by passing some details relevant to our specific dataset. Check templates will get rendered as raw SQL expressions. Take a look at the examples below for practical examples.

It is also possible to write checks using raw SQL by using assertions. This is, for example, useful when writing checks for custom business logic.

Two categories of checks

Each check needs to be categorised with a marker, currently following markers are available:

  • #fail indicates that the ETL pipeline should stop if this check fails (circuit-breaker pattern) and a notification is sent out. This marker should be used for checks that indicate a serious data issue.

  • #warn indicates that the ETL pipeline should continue even if this check fails. These type of checks can be used to indicate potential issues that might require more manual investigation.

Checks can be marked by including one of the markers on the line preceeding the check definition, see Example checks.sql section for an example.

Adding Data Checks

Create checks.sql

Inside the query directory, which usually contains query.sql or query.py, metadata.yaml and schema.yaml, create a new file called checks.sql (unless already exists).

Please make sure each check you add contains a marker (see: the Two categories of checks section above).

Once checks have been added, we need to regenerate the DAG responsible for scheduling the query.

Update checks.sql

If checks.sql already exists for the query, you can always add additional checks to the file by appending it to the list of already defined checks.

When adding additional checks there should be no need to have to regenerate the DAG responsible for scheduling the query as all checks are executed using a single Airflow task.

Removing checks.sql

All checks can be removed by deleting the checks.sql file and regenerating the DAG responsible for scheduling the query.

Alternatively, specific checks can be removed by deleting them from the checks.sql file.

Example checks.sql

Checks can either be written as raw SQL, or by referencing existing Jinja macros defined in tests/checks which may take different parameters used to generate the SQL check expression.

Example of what a checks.sql may look like:

-- raw SQL checks
#fail
ASSERT (
  SELECT
    COUNTIF(ISNULL(country)) / COUNT(*)
    FROM telemetry.table_v1
    WHERE submission_date = @submission_date
  ) > 0.2
) AS "More than 20% of clients have country set to NULL";

-- macro checks
#fail
{{ not_null(["submission_date", "os"], "submission_date = @submission_date") }}

#warn
{{ min_row_count(1, "submission_date = @submission_date") }}

#fail
{{ is_unique(["submission_date", "os", "country"], "submission_date = @submission_date")}}

#warn
{{ in_range(["non_ssl_loads", "ssl_loads", "reporting_ratio"], 0, none, "submission_date = @submission_date") }}

Data Checks Available with Examples

accepted_values (source)

Usage:

Arguments:

column: str - name of the column to check
values: List[str] - list of accepted values
where: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, "submission_date = @submission_date" so that the check is only executed against a specific partition.

Example:

#warn
{{ accepted_values("column_1", ["value_1", "value_2"],"submission_date = @submission_date") }}

in_range (source)

Usage:

Arguments:

columns: List[str] - A list of columns which we want to check the values of.
min: Optional[int] - Minimum value we should observe in the specified columns.
max: Optional[int] - Maximum value we should observe in the specified columns.
where: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, "submission_date = @submission_date" so that the check is only executed against a specific partition.

Example:

#warn
{{ in_range(["non_ssl_loads", "ssl_loads", "reporting_ratio"], 0, none, "submission_date = @submission_date") }}

is_unique (source)

Usage:

Arguments:

columns: List[str] - A list of columns which should produce a unique record.
where: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, "submission_date = @submission_date" so that the check is only executed against a specific partition.

Example:

#warn
{{ is_unique(["submission_date", "os", "country"], "submission_date = @submission_date")}}

min_row_count(source)

Usage:

Arguments:

threshold: Optional[int] - What is the minimum number of rows we expect (default: 1)
where: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, "submission_date = @submission_date" so that the check is only executed against a specific partition.

Example:

#fail
{{ min_row_count(1, "submission_date = @submission_date") }}

not_null (source)

Usage:

Arguments:

columns: List[str] - A list of columns which should not contain a null value.
where: Optional[str] - A condition that will be injected into the `WHERE` clause of the check. For example, "submission_date = @submission_date" so that the check is only executed against a specific partition.

Example:

#fail
{{ not_null(["submission_date", "os"], "submission_date = @submission_date") }}

Please keep in mind the below checks can be combined and specified in the same checks.sql file. For example:

#fail
{{ not_null(["submission_date", "os"], "submission_date = @submission_date") }}
 #fail
 {{ min_row_count(1, "submission_date = @submission_date") }}
 #fail
 {{ is_unique(["submission_date", "os", "country"], "submission_date = @submission_date")}}
 #warn
 {{ in_range(["non_ssl_loads", "ssl_loads", "reporting_ratio"], 0, none, "submission_date = @submission_date") }}

row_count_within_past_partitions_avg(source)

Compares the row count of the current partition to the average of number_of_days past partitions and checks if the row count is within the average +- threshold_percentage %

Usage:

Arguments:

number_of_days: int - Number of days we are comparing the row count to
threshold_percentage: int - How many percent above or below the average row count is ok.
partition_field: Optional[str] - What column is the partition_field (default = "submission_date")

Example:

#fail
{{ row_count_within_past_partitions_avg(7, 5, "submission_date") }}

value_length(source)

Checks that the column has values of specific character length.

Usage:

Arguments:

column: str - Column which will be checked against the `expected_length`.
expected_length: int - Describes the expected character length of the value inside the specified columns.
where: Optional[str]: Any additional filtering rules that should be applied when retrieving the data to run the check against.

Example:

#warn
{{ value_length(column="country", expected_length=2, where="submission_date = @submission_date") }}

matches_pattern(source)

Checks that the column values adhere to a pattern based on a regex expression.

Usage:

Arguments:

column: str - Column which values will be checked against the regex.
pattern: str - Regex pattern specifying the expected shape / pattern of the values inside the column.
where: Optional[str]: Any additional filtering rules that should be applied when retrieving the data to run the check against.
threshold_fail_percentage: Optional[int] - Percentage of how many rows can fail the check before causing it to fail.
message: Optional[str]: Custom error message.

Example:

#warn
{{ matches_pattern(column="country", pattern="^[A-Z]{2}$", where="submission_date = @submission_date", threshold_fail_percentage=10, message="Oops") }}

Running checks locally / Commands

To list all available commands in the bqetl data checks CLI:

$ ./bqetl check

Usage: bqetl check [OPTIONS] COMMAND [ARGS]...

  Commands for managing and running bqetl data checks.

  ––––––––––––––––––––––––––––––––––––––––––––––

  IN ACTIVE DEVELOPMENT

  The current progress can be found under:

          https://mozilla-hub.atlassian.net/browse/DENG-919

  ––––––––––––––––––––––––––––––––––––––––––––––

Options:
  --help  Show this message and exit.

Commands:
  render  Renders data check query using parameters provided (OPTIONAL).
  run     Runs data checks defined for the dataset (checks.sql).

To see see how to use a specific command use:

$ ./bqetl check [command] --help

render

Usage

$ ./bqetl check render [OPTIONS] DATASET [ARGS]

Renders data check query using parameters provided (OPTIONAL). The result
is what would be used to run a check to ensure that the specified dataset
adheres to the assumptions defined in the corresponding checks.sql file

Options:
  --project-id, --project_id TEXT
                                  GCP project ID
  --sql_dir, --sql-dir DIRECTORY  Path to directory which contains queries.
  --help                          Show this message and exit.

Example

./bqetl check render --project_id=moz-fx-data-marketing-prod ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01

run

Usage

$ ./bqetl check run [OPTIONS] DATASET

Runs data checks defined for the dataset (checks.sql).

Checks can be validated using the `--dry_run` flag without executing them:

Options:
  --project-id, --project_id TEXT
                                  GCP project ID
  --sql_dir, --sql-dir DIRECTORY  Path to directory which contains queries.
  --dry_run, --dry-run            To dry run the query to make sure it is
                                  valid
  --marker TEXT                   Marker to filter checks.
  --help                          Show this message and exit.

Examples

# to run checks for a specific dataset
$ ./bqetl check run ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01 --marker=fail --marker=warn

# to only dry_run the checks
$ ./bqetl check run --dry_run ga_derived.downloads_with_attribution_v2 --parameter=download_date:DATE:2023-05-01 --marker=fail