Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move tests from scheduled queries / business queries to DBT #83

Merged

Conversation

amishas157
Copy link
Contributor

@amishas157 amishas157 commented Sep 3, 2024

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with the jira ticket associated with the PR.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated the docs and README with the added features, breaking changes, new instructions on how to use the repository.

Release planning

  • I've decided if this PR requires a new major/minor/patch version accordingly to
    semver, and I've changed the name of the BRANCH to release/* , feature/* or patch/* .

What

This PR:

  • changes severity of existing tests to error instead of warning to make it compatible with tests from scheduled queries
  • Adds few more tests for column checks based on list of scheduled queries
  • Adds tests from scheduled queries as generic tests

Why

To centralize the data quality tests

Related PR: https://github.com/stellar/stellar-dbt/pull/227

Known limitations

None

@amishas157 amishas157 force-pushed the patch/hubble-520/move-scheduled-queries-to-dbt-tests branch from eed3dec to c62a947 Compare September 11, 2024 09:13
@amishas157 amishas157 changed the title Move tests from scheduled queries Move tests from scheduled queries / business queries to DBT Sep 11, 2024
@amishas157 amishas157 marked this pull request as ready for review September 11, 2024 17:52
@amishas157 amishas157 requested a review from a team as a code owner September 11, 2024 17:52
meta:
description: "Monitors the freshness of your table over time, as the expected time between data updates."
- incremental_unique_combination_of_columns:
combination_of_columns:
- batch_run_date
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure batch_run_date should be included. My assumption was that the history_assets table was unique on asset_code, asset_issuer, and asset_type otherwise we would have "duplicate assets" based where each asset would have multiple batch_run_dates

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 6bd6abf

@@ -14,7 +14,6 @@ sources:
- incremental_unique_combination_of_columns:
combination_of_columns:
- account_id
- sequence_number
Copy link
Contributor

@chowbao chowbao Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think sequence_number should be included. The src/staging accounts table should record every instance of an account on every sequence_number if the account had a change for that given ledger sequence iirc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I will keep it then, It was not present in the scheduled query, so I treated that as source of truth

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 6bd6abf

@@ -0,0 +1,17 @@
{{ config(
severity="error"
, tags=["singular_test"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI this tag runs every 30 mins in airflow. This is a higher frequency compared to what the cloud function and scheduled query tests used to run at. Which is good.

Just mentioning this in case we get noisy alerts where we might want to adjust the query and/or the frequency the tests are run (possibly with a separate dbt tag).

select sequence,
closed_at,
total_byte_size_of_bucket_list / 1000000000 as bl_db_gb
from {{ source('crypto_stellar', 'history_ledgers') }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be a ref('stg_history_ledgers') instead of a source. Otherwise these tests would be hardcoded to just prod right?

Copy link
Contributor Author

@amishas157 amishas157 Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wont test still use crypto_stellar.history_ledgers instead of test_crypto_stellar.history_ledgers ?

Example:

with
raw_table as (
select *
from {{ source('crypto_stellar', 'history_ledgers') }}
)
, history_ledgers as (

Copy link
Contributor

@chowbao chowbao Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will use the test project/dataset because the source for staging tables will be overwritten by the dbt_project.yml in the private dbt repo
https://github.com/stellar/stellar-dbt/blob/master/dbt_project.yml#L63-L68

+project: "{% if target.name == 'prod' %} crypto-stellar {% else %} {{ target.project }} {% endif %}"

I don't think generic tests has such an override defined. So technically you can add a generic test source override to dbt_project.yml. But my preference would be to just change the generic test source to ref because I feel like it is cleaner because you only need to define a single override for staging table instead of two overrides (staging + generic tests)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Yes, agree in that case we should just use ref

Copy link
Contributor Author

@amishas157 amishas157 Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 74dad5a

Comment on lines 15 to 16
FROM {{ source('crypto_stellar', 'history_operations') }} op
LEFT OUTER JOIN {{ ref('enriched_history_operations') }} eho
Copy link
Contributor

@chowbao chowbao Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here

I think this should be a ref('stg_history_operations') instead of a source. Otherwise these tests would be hardcoded to just prod right?

Edit: also in this case there would be a miss match between data if run in test because there is a ref('enriched_history_operations')

Copy link
Contributor Author

@amishas157 amishas157 Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 74dad5a

@@ -11,7 +11,7 @@ with
, batch_id
, closed_at
, max(sequence) as max_sequence
from {{ source('crypto_stellar', 'history_ledgers') }}
from {{ ref('stg_history_ledgers') }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙏

@amishas157 amishas157 merged commit 0425e26 into master Sep 12, 2024
3 checks passed
@sydneynotthecity sydneynotthecity deleted the patch/hubble-520/move-scheduled-queries-to-dbt-tests branch November 14, 2024 16:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants