
[ETL-616] Implement Great Expectations to run on parquet data #139

Merged · 26 commits · Sep 13, 2024

Conversation

@rxu17 (Contributor) commented Sep 6, 2024

Purpose:

This draft PR adds the Great Expectations (GX) on Parquet Glue jobs to the Recover ETL workflow. When the JSON to Parquet workflow finishes, it triggers a GX job for each data type.

Currently this only supports running GX with expectations for the fitbitdailydata and healthkitv2workouts datasets; the jobs for all other data types will error out.
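For context on how each job knows which data type to validate, here is a minimal, hypothetical sketch of reading the Glue job arguments. The argument names (data_type, parquet_bucket, namespace) are assumptions for illustration and may not match what the real job uses:

```python
# Hypothetical sketch: the argument names below are assumptions, not
# necessarily what the real GX on Parquet Glue job uses.
import sys

from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["data_type", "parquet_bucket", "namespace"])

data_type = args["data_type"]  # e.g. "fitbitdailydata" or "healthkitv2workouts"
parquet_path = (
    f"s3://{args['parquet_bucket']}/{args['namespace']}/parquet/dataset_{data_type}/"
)
```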

Changes:

Highlights of the big changes:

New Code:

  • run_great_expectations_on_parquet.py: This is the script run by the GX on Parquet jobs. There are some workarounds in the script, notably in the add_data_docs_sites and add_validation_results_to_store functions. These let us add validation results to the validation store and have the data docs (our GX report) render them while using an EphemeralDataContext context object, without having to create checkpoints, a GX config file (which would likely force us to conform to a specific GX repo structure), etc. If we prefer to switch to a more persistent data context object like FileDataContext, that could be explored further in this ticket. A rough sketch of the workaround is included after this list.
  • data values validation suite: This is where we will manually add our expectations in the future. I find this the easiest approach, because we have to look through the expectations gallery to find the expectation we want to add anyway, and it lets us validate our outputted set of expectations against that list. A short illustration of adding an expectation also follows below.
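For readers unfamiliar with the EphemeralDataContext workaround described above, here is a rough, self-contained sketch of the overall flow. The suite name, S3 bucket, and dataset path are illustrative assumptions; the actual logic lives in run_great_expectations_on_parquet.py and differs in its details:

```python
# Illustrative sketch only: the names below (suite, bucket, paths) are assumptions,
# and the real script wraps these steps in its own helper functions
# (add_data_docs_sites, add_validation_results_to_store).
import great_expectations as gx
from great_expectations.core.run_identifier import RunIdentifier
from great_expectations.data_context.types.resource_identifiers import (
    ExpectationSuiteIdentifier,
    ValidationResultIdentifier,
)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
parquet_df = spark.read.parquet("s3://example-parquet-bucket/dataset_fitbitdailydata/")

# In-memory context: no great_expectations.yml, no checkpoints, no GX repo structure.
context = gx.get_context()
context.add_expectation_suite(expectation_suite_name="fitbitdailydata_expectations")

# Register the Spark dataframe as a batch and validate it against the suite.
datasource = context.sources.add_spark("parquet_datasource")
asset = datasource.add_dataframe_asset(name="fitbitdailydata")
batch_request = asset.build_batch_request(dataframe=parquet_df)
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="fitbitdailydata_expectations",
)
results = validator.validate()

# Workaround 1: put the results into the validation store so data docs can render them.
result_id = ValidationResultIdentifier(
    expectation_suite_identifier=ExpectationSuiteIdentifier(
        expectation_suite_name="fitbitdailydata_expectations"
    ),
    run_id=RunIdentifier(run_name="manual_run"),
    batch_identifier="fitbitdailydata",
)
context.validations_store.set(result_id, results)

# Workaround 2: attach an S3-backed data docs site to the ephemeral context's config,
# then build the static report without persisting a GX project to disk.
context.variables.data_docs_sites = {
    "s3_site": {
        "class_name": "SiteBuilder",
        "store_backend": {
            "class_name": "TupleS3StoreBackend",
            "bucket": "example-shareable-artifacts-bucket",
            "prefix": "great_expectation_reports/parquet/",
        },
        "site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
    }
}
context.build_data_docs()
```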
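In this PR the expectations themselves live in the suite JSON, but as an illustration, adding a couple of gallery expectations through the validator from the sketch above would look roughly like this (column names are made up, not the real schema):

```python
# Hypothetical expectations; column names are illustrative only.
validator.expect_column_values_to_not_be_null(column="ParticipantIdentifier")
validator.expect_column_values_to_be_between(column="Steps", min_value=0, max_value=200_000)
validator.save_expectation_suite(discard_failed_expectations=False)
```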

Changes to old code:

Tests:

  • Integration testing in AWS (currently running)
  • Unit tests (a small illustrative sketch of the style follows below)
  • Tests that the script can produce reports from a sample validation suite in the shareable artifacts bucket
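For flavor, here is a tiny hypothetical unit test in the same style (the fixture and test names are assumptions, not the actual test module added in this PR), checking that an expectation suite can be registered on an in-memory context:

```python
# Hypothetical test sketch; not the actual tests added in this PR.
import great_expectations as gx
import pytest


@pytest.fixture
def ephemeral_context():
    """In-memory GX context, mirroring what the Glue script builds at runtime."""
    return gx.get_context()


def test_can_register_expectation_suite(ephemeral_context):
    suite = ephemeral_context.add_expectation_suite(
        expectation_suite_name="fitbitdailydata_expectations"
    )
    assert suite.expectation_suite_name == "fitbitdailydata_expectations"
```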

Viewable reports at (with AWS VPN turned on):

Sample screenshots of a report:

EDIT: Also added these tests to our CI/CD, since previously not all of them were being run automatically on each change.
