
validate output data and add tests #66

Merged

Conversation

mspierenburg (Contributor) commented May 16, 2024

Description

Also validate output data

Development notes

  • Enable validation of output datasets
  • Added tests

Checklist

  • Read the contributing guidelines
  • Open this PR as a 'Draft Pull Request' if it is work-in-progress
  • Update the documentation to reflect the code changes
  • Add a description of this change and add your name to the list of supporting contributions in the CHANGELOG.md file. Please respect Keep a Changelog guidelines.
  • Add tests to cover your changes

Notice

  • I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":

  • I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.

  • I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorised to submit this contribution on behalf of the original creator(s) or their licensees.

  • I certify that the use of this contribution as authorised by the Apache 2.0 license does not violate the intellectual property rights of anyone else.


codecov bot commented May 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (56d9cd1) to head (81b67fe).
Report is 9 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main       #66      +/-   ##
===========================================
+ Coverage   94.05%   100.00%   +5.94%     
===========================================
  Files           5         5              
  Lines         101       112      +11     
===========================================
+ Hits           95       112      +17     
+ Misses          6         0       -6     


@mspierenburg mentioned this pull request May 16, 2024

@hook_impl
def after_node_run(self, node: Node, catalog: DataCatalog, outputs: Dict[str, Any]):
return self._validate_datasets(node, catalog, outputs)
Owner:

Two remarks:

  • I am not sure we should "return" anything.
  • I think we should keep track of all previously validated datasets in a private attribute (something like self._already_validated = {}) to avoid validating the same dataset multiple times; see the sketch below. Otherwise any intermediary dataset would be validated on both save and load, and I don't think that should be the default behaviour, for performance reasons. We could make it configurable later, but let's keep it simple for this PR.
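
A minimal sketch of that suggestion (the class name is illustrative; _validate_datasets is the PR's existing helper):

    from typing import Any, Dict

    from kedro.framework.hooks import hook_impl
    from kedro.io import DataCatalog
    from kedro.pipeline.node import Node


    class PanderaValidationHooks:
        def __init__(self):
            # suggested private cache of dataset names validated in this run
            self._already_validated = {}

        @hook_impl
        def after_node_run(self, node: Node, catalog: DataCatalog, outputs: Dict[str, Any]):
            # only validate outputs not seen before, and do not return anything
            to_validate = {
                name: data
                for name, data in outputs.items()
                if name not in self._already_validated
            }
            self._validate_datasets(node, catalog, to_validate)
            self._already_validated.update(dict.fromkeys(to_validate, True))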

Contributor Author:

  • Indeed, we should not return anything; that will only be needed once we also add the converting feature.
  • I added your suggestion, although I think it would be nice to make it configurable, for example in the catalog.yaml:

    "dataset":
        type: pandas.CSVDataset
        filepath: data/01_raw/data.csv
        metadata:
            pandera:
                schema: ${pa.yaml:_data_schema}
                before_only: True  # validates the dataset only before the node run
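
A hypothetical sketch of how the hook could honour such a flag (before_only is speculative and not part of this PR; catalog._get_dataset is a private Kedro method, used here purely for illustration):

    from kedro.io import DataCatalog


    def _should_validate_output(catalog: DataCatalog, name: str) -> bool:
        # speculative: honour metadata -> pandera -> before_only from the catalog entry
        dataset = catalog._get_dataset(name)  # private Kedro API, illustration only
        pandera_meta = (getattr(dataset, "metadata", None) or {}).get("pandera", {})
        return not pandera_meta.get("before_only", False)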

Contributor Author:

By the way, would it be an idea to use the hooks after_dataset_loaded for input validation and before_dataset_saved for output validation?
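
For reference, a sketch of what those hook specs look like (signatures may differ slightly across Kedro versions):

    from typing import Any

    from kedro.framework.hooks import hook_impl


    class DatasetValidationHooks:
        @hook_impl
        def after_dataset_loaded(self, dataset_name: str, data: Any) -> None:
            # validate inputs right after they are loaded from the catalog
            ...

        @hook_impl
        def before_dataset_saved(self, dataset_name: str, data: Any) -> None:
            # validate outputs just before they are persisted
            ...

Note that the performance concern above would still apply: an intermediary dataset would then be validated once on save and once more on load.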

metadata={"pandera": {"schema": test_schema}},
)
filepath=csv_file,
metadata={"pandera": {"schema": test_schema, "convert": convert}},
Owner:

We don't need the convert key in this PR; it has been moved to another one.

Contributor Author:

That was indeed from the other PR. Removed.

_run_hook(
csv_file="tests/data/iris.csv",
schema_file="tests/data/iris_schema.yml",
convert=False,
Owner:

No need to "convert" here.

Contributor Author:

Removed


def test_hook():
_run_hook(
csv_file="tests/data/iris.csv",
Owner:

We should probably create a pytest.fixture for these files we read several times, but let's leave it as is for now.

Contributor Author:

Indeed. If I have time I will convert it to a fixture.
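
A sketch of what that fixture could look like, reusing the paths and the _run_hook helper from the tests above:

    import pytest


    @pytest.fixture
    def iris_csv_file() -> str:
        return "tests/data/iris.csv"


    @pytest.fixture
    def iris_schema_file() -> str:
        return "tests/data/iris_schema.yml"


    def test_hook(iris_csv_file, iris_schema_file):
        _run_hook(csv_file=iris_csv_file, schema_file=iris_schema_file)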

_run_hook(
csv_file="tests/data/iris.csv",
schema_file="tests/data/iris_schema_fail.yml",
convert=False,
Owner:

Idem, no convert needed.

Contributor Author:

Removed

@Galileo-Galilei (Owner):

Thank you very much for the PR. The only required change is to avoid validating the same dataset multiple times; all the others are nitpicks you can ignore.

@@ -1,85 +1,84 @@
_example_iris_data_schema:
Collaborator:

Is it intentional to remove the name of the datasets from the schema?

Contributor Author:

Yes, this is intentional. In the current test the schema is loaded with the pandera function from_yaml, which does not expect a name; with the name, the YAML can't be loaded correctly.
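
For context, a sketch of that loading path (file names taken from the tests above; may require pandera's yaml/io extras):

    import pandas as pd
    from pandera.io import from_yaml

    # from_yaml expects the serialized schema fields themselves,
    # without a surrounding dataset-name key
    schema = from_yaml("tests/data/iris_schema.yml")
    df = pd.read_csv("tests/data/iris.csv")
    schema.validate(df)  # raises a SchemaError if the data does not conform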

@Galileo-Galilei (Owner):

Thank you for the update. Only the double-validation check is left to address and it's good to go!

@Galileo-Galilei Galileo-Galilei merged commit dc12425 into Galileo-Galilei:main May 27, 2024
9 checks passed
@Galileo-Galilei (Owner):

Thank you for everything, it's merged!
