
validate output data and add tests #66

Merged

Conversation

mspierenburg (Contributor) commented May 16, 2024

Description

Also validate output data

Development notes

  • Enable validation of output datasets
  • Added tests

Checklist

  • Read the contributing guidelines
  • Open this PR as a 'Draft Pull Request' if it is work-in-progress
  • Update the documentation to reflect the code changes
  • Add a description of this change and add your name to the list of supporting contributions in the CHANGELOG.md file. Please respect Keep a Changelog guidelines.
  • Add tests to cover your changes

Notice

  • I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":

  • I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.

  • I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorised to submit this contribution on behalf of the original creator(s) or their licensees.

  • I certify that the use of this contribution as authorised by the Apache 2.0 license does not violate the intellectual property rights of anyone else.


codecov bot commented May 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (56d9cd1) to head (81b67fe).
Report is 9 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main       #66      +/-   ##
===========================================
+ Coverage   94.05%   100.00%   +5.94%     
===========================================
  Files           5         5              
  Lines         101       112      +11     
===========================================
+ Hits           95       112      +17     
+ Misses          6         0       -6     


@mspierenburg mentioned this pull request May 16, 2024

@hook_impl
def after_node_run(self, node: Node, catalog: DataCatalog, outputs: Dict[str, Any]):
return self._validate_datasets(node, catalog, outputs)
Owner:

Two remarks:

  • I am not sure we should "return" anything.
  • I think we should keep track of all previously validated datasets in a private attribute (something like self._already_validated = {}) to avoid validating the same dataset multiple times; see the sketch below. Otherwise any intermediary dataset would be validated on both save and load, and I don't think that should be the default behaviour, for performance reasons. We could make it configurable later, but let's keep it simple for this PR.
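
A minimal sketch of that suggestion (the class name is illustrative; _validate_datasets is the PR's existing helper):

    from typing import Any, Dict

    from kedro.framework.hooks import hook_impl
    from kedro.io import DataCatalog
    from kedro.pipeline.node import Node


    class PanderaValidationHooks:
        def __init__(self):
            # suggested private cache of dataset names validated in this run
            self._already_validated = {}

        @hook_impl
        def after_node_run(self, node: Node, catalog: DataCatalog, outputs: Dict[str, Any]):
            # only validate outputs not seen before, and do not return anything
            to_validate = {
                name: data
                for name, data in outputs.items()
                if name not in self._already_validated
            }
            self._validate_datasets(node, catalog, to_validate)
            self._already_validated.update(dict.fromkeys(to_validate, True))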

Contributor Author:

  • Indeed, we should not return anything; that will only be needed once we also add the converting feature.
  • I added your suggestion, although I think it would be nice to make it configurable, for example in the catalog.yaml:

    "dataset":
        type: pandas.CSVDataset
        filepath: data/01_raw/data.csv
        metadata:
            pandera:
                schema: ${pa.yaml:_data_schema}
                before_only: True  # validates the dataset only before the node run
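
A hypothetical sketch of how the hook could honour such a flag (before_only is speculative and not part of this PR; catalog._get_dataset is a private Kedro method, used here purely for illustration):

    from kedro.io import DataCatalog


    def _should_validate_output(catalog: DataCatalog, name: str) -> bool:
        # speculative: honour metadata -> pandera -> before_only from the catalog entry
        dataset = catalog._get_dataset(name)  # private Kedro API, illustration only
        pandera_meta = (getattr(dataset, "metadata", None) or {}).get("pandera", {})
        return not pandera_meta.get("before_only", False)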

Contributor Author:

By the way, would it be an idea to use the hooks after_dataset_loaded for input validation and before_dataset_saved for output validation?
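
For reference, a sketch of what those hook specs look like (signatures may differ slightly across Kedro versions):

    from typing import Any

    from kedro.framework.hooks import hook_impl


    class DatasetValidationHooks:
        @hook_impl
        def after_dataset_loaded(self, dataset_name: str, data: Any) -> None:
            # validate inputs right after they are loaded from the catalog
            ...

        @hook_impl
        def before_dataset_saved(self, dataset_name: str, data: Any) -> None:
            # validate outputs just before they are persisted
            ...

Note that the performance concern above would still apply: an intermediary dataset would then be validated once on save and once more on load.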

metadata={"pandera": {"schema": test_schema}},
)
filepath=csv_file,
metadata={"pandera": {"schema": test_schema, "convert": convert}},
Owner:

We don't need the convert key in this PR; it has been moved to another one.

Contributor Author:

That was indeed from the other PR. Removed.

_run_hook(
csv_file="tests/data/iris.csv",
schema_file="tests/data/iris_schema.yml",
convert=False,
Owner:

No need to "convert" here.

Contributor Author:

Removed


def test_hook():
_run_hook(
csv_file="tests/data/iris.csv",
Owner:

We should probably create a pytest.fixture for these files we read several times, but let's leave it as is for now.

Contributor Author:

Indeed. If I have time I will convert it to a fixture.
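
A sketch of what that fixture could look like, reusing the paths and the _run_hook helper from the tests above:

    import pytest


    @pytest.fixture
    def iris_csv_file() -> str:
        return "tests/data/iris.csv"


    @pytest.fixture
    def iris_schema_file() -> str:
        return "tests/data/iris_schema.yml"


    def test_hook(iris_csv_file, iris_schema_file):
        _run_hook(csv_file=iris_csv_file, schema_file=iris_schema_file)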

_run_hook(
csv_file="tests/data/iris.csv",
schema_file="tests/data/iris_schema_fail.yml",
convert=False,
Owner:

Idem, no convert needed.

Contributor Author:

Removed

@Galileo-Galilei (Owner):

Thank you very much for the PR. The only required change is to avoid validating the same dataset multiple times; all the others are nitpicks you can ignore.

@@ -1,85 +1,84 @@
_example_iris_data_schema:
Collaborator:

Is it intentional to remove the name of the datasets from the schema?

Contributor Author:

Yes, this is intentional. In the current test the schema is loaded with the pandera function from_yaml, which does not expect a name; with the name, the YAML can't be loaded correctly.
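
For context, a sketch of that loading path (file names taken from the tests above; may require pandera's yaml/io extras):

    import pandas as pd
    from pandera.io import from_yaml

    # from_yaml expects the serialized schema fields themselves,
    # without a surrounding dataset-name key
    schema = from_yaml("tests/data/iris_schema.yml")
    df = pd.read_csv("tests/data/iris.csv")
    schema.validate(df)  # raises a SchemaError if the data does not conform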

@Galileo-Galilei (Owner):

Thank you for the update. Only the double-validation check is left to address and it's good to go!

@Galileo-Galilei Galileo-Galilei merged commit dc12425 into Galileo-Galilei:main May 27, 2024
9 checks passed
@Galileo-Galilei (Owner):

Thank you for everything, it's merged!
