validate output data and add tests #66
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:

@@ Coverage Diff @@
##              main      #66      +/-   ##
===========================================
+ Coverage   94.05%  100.00%   +5.94%
===========================================
  Files           5        5
  Lines         101      112      +11
===========================================
+ Hits           95      112      +17
+ Misses          6        0       -6

☔ View full report in Codecov by Sentry.
@hook_impl
def after_node_run(self, node: Node, catalog: DataCatalog, outputs: Dict[str, Any]):
    return self._validate_datasets(node, catalog, outputs)
Two remarks:
- I am not sure we should "return" anything.
- I think we should keep track of all previously validated datasets in a private attribute (something like `self._already_validated = {}`) to avoid validating the same dataset multiple times; see the sketch below. Any intermediary dataset would otherwise be validated both on save and on load, and I don't think that should be the default behaviour, for performance reasons. We could eventually make it configurable later, but let's keep it simple for this PR.
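A minimal sketch of what that deduplication could look like, assuming the `_validate_datasets` helper from the diff above; the class name and the use of a plain dict are illustrative, not the plugin's actual code:

```python
from typing import Any, Dict

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline.node import Node


class PanderaHook:
    """Illustrative sketch; the real hook class in this repo may differ."""

    def __init__(self) -> None:
        # Dataset names that were already validated during this run.
        self._already_validated: Dict[str, bool] = {}

    @hook_impl
    def after_node_run(self, node: Node, catalog: DataCatalog, outputs: Dict[str, Any]) -> None:
        # Only validate outputs that have not been validated yet.
        new_outputs = {
            name: data
            for name, data in outputs.items()
            if name not in self._already_validated
        }
        self._validate_datasets(node, catalog, new_outputs)
        self._already_validated.update(dict.fromkeys(new_outputs, True))

    def _validate_datasets(self, node: Node, catalog: DataCatalog, datasets: Dict[str, Any]) -> None:
        ...  # validation logic from this PR, omitted in this sketch
```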
- Indeed, we should not return anything; that only becomes relevant once we also add the converting feature.
- I added your suggestion, although I think it would be nice to make it configurable, for example in the catalog.yaml.
example:
"dataset":
  type: pandas.CSVDataset
  filepath: data/01_raw/data.csv
  metadata:
    pandera:
      schema: ${pa.yaml:_data_schema}
      before_only: True  # validates the dataset only before the node run
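For illustration, a hook could read such a flag from the dataset's metadata roughly as below; `_get_pandera_config` is a hypothetical helper, the default value is an assumption, and `catalog._get_dataset` is a private Kedro API that may change between versions:

```python
from typing import Any, Optional, Tuple

from kedro.io import DataCatalog


def _get_pandera_config(catalog: DataCatalog, dataset_name: str) -> Tuple[Optional[Any], bool]:
    """Hypothetical helper: fetch the pandera config for one catalog entry."""
    dataset = catalog._get_dataset(dataset_name)  # private Kedro API
    metadata = getattr(dataset, "metadata", None) or {}
    pandera_cfg = metadata.get("pandera", {})
    # before_only is the flag proposed above; default to today's behaviour.
    return pandera_cfg.get("schema"), pandera_cfg.get("before_only", False)
```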
Btw, would it be an idea to use the hooks `after_dataset_loaded` for input validation and `before_dataset_saved` for output validation? A sketch of that alternative follows.
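A rough sketch, not this PR's implementation: Kedro's dataset-level hook specs would fire on every load and save (pluggy lets an implementation accept a subset of the spec's arguments, so omitting the `node` argument that newer Kedro versions pass should be safe); `_validate` is a hypothetical helper.

```python
from typing import Any

from kedro.framework.hooks import hook_impl


class PanderaDatasetHooks:
    @hook_impl
    def after_dataset_loaded(self, dataset_name: str, data: Any) -> None:
        # Validate inputs right after they are loaded from storage.
        self._validate(dataset_name, data)

    @hook_impl
    def before_dataset_saved(self, dataset_name: str, data: Any) -> None:
        # Validate outputs just before they are persisted.
        self._validate(dataset_name, data)

    def _validate(self, dataset_name: str, data: Any) -> None:
        ...  # look up the pandera schema for dataset_name, then schema.validate(data)
```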
tests/framework/hooks/test_hook.py
Outdated
metadata={"pandera": {"schema": test_schema}}, | ||
) | ||
filepath=csv_file, | ||
metadata={"pandera": {"schema": test_schema, "convert": convert}}, |
We don't need the `convert` key in this PR; it has been moved to another one.
That was indeed from the other PR. Removed.
tests/framework/hooks/test_hook.py
Outdated
_run_hook(
    csv_file="tests/data/iris.csv",
    schema_file="tests/data/iris_schema.yml",
    convert=False,
No need to "convert" here.
Removed
def test_hook():
    _run_hook(
        csv_file="tests/data/iris.csv",
We should likely create a `pytest.fixture` for these files we read several times, but let's leave it as is for now.
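If someone picks this up later, a fixture-based version might look like the sketch below; the fixture names are made up, and `_run_hook` is the helper already defined in this test module:

```python
import pytest


@pytest.fixture
def iris_csv_file() -> str:
    # Path to the shared test data, read by several tests.
    return "tests/data/iris.csv"


@pytest.fixture
def iris_schema_file() -> str:
    return "tests/data/iris_schema.yml"


def test_hook(iris_csv_file: str, iris_schema_file: str) -> None:
    # _run_hook is the existing helper in this test module.
    _run_hook(csv_file=iris_csv_file, schema_file=iris_schema_file)
```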
Indeed. If I have time I will convert it to a fixture.
tests/framework/hooks/test_hook.py
Outdated
_run_hook(
    csv_file="tests/data/iris.csv",
    schema_file="tests/data/iris_schema_fail.yml",
    convert=False,
Idem, no `convert` needed.
Removed
Thank you very much for the PR. The only required change is to avoid validating the same dataset multiple times; all the others are nitpicks you can ignore.
@@ -1,85 +1,84 @@
_example_iris_data_schema:
Is it intentional to remove the name of the dataset from the schema?
Yes, this is intentional. In the current test the schema is loaded with the pandera function `from_yaml`, which does not expect a name; with the name present, the YAML can't be loaded correctly.
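For reference, this is roughly how such a file is loaded; the path is the test fixture from this PR, and the comment reflects my understanding of `from_yaml`'s expectations:

```python
import pandera.io

# from_yaml expects the top-level keys of a DataFrameSchema (columns, checks, ...)
# rather than a wrapping dataset name, which is why the name was removed.
schema = pandera.io.from_yaml("tests/data/iris_schema.yml")
print(schema)
```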
Thank you for the update. There is only the double-validation check left to perform and it's good to go!
Thank you for everything, it's merged!
Description
Also validate output data
Development notes
Checklist
- Updated the CHANGELOG.md file. Please respect Keep a Changelog guidelines.

Notice
I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":
I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.
I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorised to submit this contribution on behalf of the original creator(s) or their licensees.
I certify that the use of this contribution as authorised by the Apache 2.0 license does not violate the intellectual property rights of anyone else.