Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More Checks on input data #149

Closed
wants to merge 6 commits into from
Closed

More Checks on input data #149

wants to merge 6 commits into from

Conversation

Eric-Sommer
Copy link
Collaborator

@Eric-Sommer Eric-Sommer commented Feb 24, 2020

What problem do you want to solve?

When trying to run 'self-made' households through gettsim, a couple of errors occured due to incorrect variable specifications and wrong data structure. I therefore added a couple of checks to be done on the input data.

Todo

  • Check whether there is exactly one head per household.
  • check for strictly positive values for wohnfl, mietstufe, miete.
  • check for nan values
  • think about other possible checks.

@codecov
Copy link

codecov bot commented Feb 24, 2020

Codecov Report

Merging #149 into master will decrease coverage by 0.72%.
The diff coverage is 65.38%.

Impacted file tree graph

@@           Coverage Diff            @@
##           master   #149      +/-   ##
========================================
- Coverage   98.73%    98%   -0.73%     
========================================
  Files          33     33              
  Lines        1183   1202      +19     
========================================
+ Hits         1168   1178      +10     
- Misses         15     24       +9
Impacted Files Coverage Δ
gettsim/tests/test_calculation_tax_transfer.py 100% <100%> (ø) ⬆️
gettsim/tax_transfer.py 98.94% <100%> (-0.02%) ⬇️
gettsim/checks.py 60.86% <60.86%> (-39.14%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 66d6406...731b695. Read the comment docs.

@Eric-Sommer Eric-Sommer changed the title WIP: More Checks on input data More Checks on input data Feb 25, 2020
@hmgaudecker
Copy link
Collaborator

@MaxBlesch: How much of this is still relevant and were you already including some of this work somewhere else?

@tobiasraabe
Copy link
Member

tobiasraabe commented Jul 1, 2020

I think we wanted to infer dtypes of internal variables from type hints. This is the most beautiful and fruitful solution.

I do not know about the changes to the test data. Are they useful?

@MaxBlesch
Copy link
Collaborator

MaxBlesch commented Jul 1, 2020

The change of the test data was just the introduction of a bool variable indicating gender. I think this not necessary!

@MaxBlesch
Copy link
Collaborator

But I think we can not completely abstain from variable specific tests. I.e. testing if mietstufe is in 1 to 7, 1 to 6. But this may also depend on years and I think is still far down the road!

@hmgaudecker
Copy link
Collaborator

The change of the test data was just the introduction of a bool variable indicating gender. I think this not necessary!

Boolean is not future proof and not even present-proof. Not kidding, should be category or convertible into a category. m / f until 2019 (?), m / f / d since then.

@hmgaudecker
Copy link
Collaborator

But I think we can not completely abstain from variable specific tests. I.e. testing if mietstufe is in 1 to 7, 1 to 6. But this may also depend on years and I think is still far down the road!

mietstufe is a clear case where it should depend on the params. I.e., set(data['mietstufe_hh'].unique()).issubset(params['wohngeld']['mietstufe'].keys()). I did not bother to check exact syntax or GETTSIM names, but you get the idea.

@hmgaudecker
Copy link
Collaborator

I think we wanted to infer dtypes of internal variables from type hints. This is the most beautiful and fruitful solution.

Can we have type hints for categorical-style variables? Thinking of things like m / f / d here.

@tobiasraabe
Copy link
Member

tobiasraabe commented Jul 1, 2020

I think we can easily check for the general dtype like float, int, categorical, bool using the type hints from this API: pandas-dev/pandas#26766.

I think we can also create custom type hints and provide more information like all possible values of a categorical. Here is an example pandas-dev/pandas#14468 (comment).

I am not sure what needs to be done so that it plays nicely with mypy, but this is not our real use-case anyway. Mainly, we want to check the dtype of input variables. And we can get the information by examing the signature of functions. During the test session, we can also check whether our internal function provide the expected dtype.

@hmgaudecker
Copy link
Collaborator

Hmm, I do not fully see through the latter case, I am afraid. One issue would be that the possible types of a categorical would be dynamic -- e.g., they would need to correspond to the params object in a particular year. But this will be for someone else to worry about and it should not be a problem to have a two-step process (first check whether something is a categorical; then check possible values).

@MaxBlesch
Copy link
Collaborator

I think we can close this PR. With #230 type hints will come in place and from this infrastructure, it will be easy to implement all other checks. For cases like mietstufe, I could imagine a function select_param_by_data which then raises a custom error if the data does not fit to our params. So my suggestion is to close this PR and create two issues: One for moving away from bools and one for creating select_param_by_data?! @hmgaudecker @ChristianZimpelmann

@MaxBlesch MaxBlesch closed this Oct 30, 2020
@MaxBlesch MaxBlesch deleted the data_warning branch October 30, 2020 17:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants