-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add tolerance to row count match #73
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #73 +/- ##
==========================================
- Coverage 99.15% 99.03% -0.12%
==========================================
Files 15 16 +1
Lines 3184 3206 +22
==========================================
+ Hits 3157 3175 +18
- Misses 27 31 +4 ☔ View full report in Codecov by Sentry. |
pointblank/validate.py
Outdated
@@ -4551,6 +4551,7 @@ def col_schema_match( | |||
def row_count_match( | |||
self, | |||
count: int | FrameT | Any, | |||
tol: float = 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're adding a tol=
arg (which is totally reasonable) could you add a description of it in the docstring? Also, it might be good to have the option to input as a tuple/list for setting lower/upper bounds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I totally missed the updating the docstring - my mistake! As for lower/upper bounds, I can use the suggestion you had: if 0-1 represent tolerance as a percentage of the actual count, if tuple then treat as raw limits and if int above 1 treat as literal limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great idea, but I think we should work a bit more on getting the implementation down. Left a few comments, we can discuss a bit more in the Conversation view.
One more idea that could be interesting would be to accept/process values similar to the |
New pr has the below:
|
Thanks! Will review this soon :) |
@@ -7,7 +7,7 @@ test-update: | |||
pytest --snapshot-update | |||
|
|||
test-coverage: | |||
pytest --cov=pointblank --cov-report=term | |||
pytest --cov=pointblank --cov-report=term-missing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't know about this! But it's a great improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I accidentally left that in the pr 😂
I feel like we're very close here to enhancing this method. One last thing: I think the rules for relative/absolute limits should apply to both the scalar case and the 2-item tuple case. The basic case (scalar) would be used most often and just means that the range is equal on both sides (so using If this is okay, I'd recommend (in the code) upgrading any scalar value to a tuple ( |
tbl_type: str = "local" | ||
|
||
def __post_init__(self): | ||
|
||
from pointblank.validate import get_row_count | ||
|
||
if not self.inverse: | ||
res = get_row_count(data=self.data_tbl) == self.count | ||
row_count :int = get_row_count(data=self.data_tbl) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it common/better to declare types for variables? I'm honestly not sure what the norm is on this one. I only ask because if it is a preferred approach, we should probably be doing this everywhere (which is what I'm leaning toward).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"best practice" says to type hint everything but it's not necessary. Implementing type checking on a codebase of this size is a giant task, worth it in the long run but not trivial. I'd be happy to do so if you're on board though.
I will say because of how dynamic this code base is, it would probably be even hard than usual, but not at all impossible.
The most bang-for-buck would be to type check function signatures, it's 10% of the work for 90% gain. In my experience, so take it with a grain of salt, most things are caught at the signature level and not the individual variable level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for weighing in on this. I think we ought to add types to variable declarations! We can both chip away at this, and I think it'll be worth it.
So I dedicated a type to the bounds which should solve the validation piece for people who use type checkers. I abstracted the validation and derivation of the left and right bounds into a separate function. I think this should cover all cases and the approach will allow extension in the future and multiple uses of the new type. |
@@ -122,3 +122,8 @@ testpaths = [ | |||
|
|||
[tool.black] | |||
line-length = 100 | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this coverage exclusion. Hope it's alright!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
I added a tolerance to the row count match validation. It is useful for source data that is updating constantly or very large, ie. big time streaming data. I want the ability to allow for some tolerance.