Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tolerance to row count match #73

Merged
merged 8 commits into from
Mar 3, 2025

Conversation

tylerriccio33
Copy link
Contributor

I added a tolerance to the row count match validation. It is useful for source data that is updating constantly or very large, ie. big time streaming data. I want the ability to allow for some tolerance.

@codecov-commenter
Copy link

codecov-commenter commented Mar 1, 2025

Codecov Report

Attention: Patch coverage is 85.71429% with 4 lines in your changes missing coverage. Please review.

Project coverage is 99.03%. Comparing base (cf006a8) to head (53fc018).
Report is 11 commits behind head on main.

Files with missing lines Patch % Lines
pointblank/_typing.py 0.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #73      +/-   ##
==========================================
- Coverage   99.15%   99.03%   -0.12%     
==========================================
  Files          15       16       +1     
  Lines        3184     3206      +22     
==========================================
+ Hits         3157     3175      +18     
- Misses         27       31       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@rich-iannone rich-iannone self-requested a review March 1, 2025 22:22
@@ -4551,6 +4551,7 @@ def col_schema_match(
def row_count_match(
self,
count: int | FrameT | Any,
tol: float = 0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're adding a tol= arg (which is totally reasonable) could you add a description of it in the docstring? Also, it might be good to have the option to input as a tuple/list for setting lower/upper bounds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally missed the updating the docstring - my mistake! As for lower/upper bounds, I can use the suggestion you had: if 0-1 represent tolerance as a percentage of the actual count, if tuple then treat as raw limits and if int above 1 treat as literal limit.

Copy link
Member

@rich-iannone rich-iannone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great idea, but I think we should work a bit more on getting the implementation down. Left a few comments, we can discuss a bit more in the Conversation view.

@rich-iannone
Copy link
Member

One more idea that could be interesting would be to accept/process values similar to the Threshold class: any value between between 0-1 is relative, 1 and above values would be absolute rows.

@tylerriccio33
Copy link
Contributor Author

New pr has the below:

  • removed numpy as the rng and went back to good ole' random. I'm opening a separate pr on using hypothesis for parametric testing instead, details there.
  • added ability to pass literal limits as a tuple, a literal limit as int and a relative limit as in the original pr. This takes a bit of validation.
  • Added new examples to demonstrate plus a test of those examples.

@rich-iannone
Copy link
Member

Thanks! Will review this soon :)

@@ -7,7 +7,7 @@ test-update:
pytest --snapshot-update

test-coverage:
pytest --cov=pointblank --cov-report=term
pytest --cov=pointblank --cov-report=term-missing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't know about this! But it's a great improvement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I accidentally left that in the pr 😂

@rich-iannone
Copy link
Member

I feel like we're very close here to enhancing this method. One last thing: I think the rules for relative/absolute limits should apply to both the scalar case and the 2-item tuple case.

The basic case (scalar) would be used most often and just means that the range is equal on both sides (so using tol=0.1 could expand to, for the internal logic, tol=(0.1, 0.1)). And I can see using a mix of absolute and relative bounds for some cases (e.g., there should be at least x rows, but perhaps up to 10% were added; so: tol=(0, 0.1)).

If this is okay, I'd recommend (in the code) upgrading any scalar value to a tuple (5 -> (5, 5)) and always handling the lower/upper bounds. Sorry to keep adding work, but this will be so useful once completed!

tbl_type: str = "local"

def __post_init__(self):

from pointblank.validate import get_row_count

if not self.inverse:
res = get_row_count(data=self.data_tbl) == self.count
row_count :int = get_row_count(data=self.data_tbl)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it common/better to declare types for variables? I'm honestly not sure what the norm is on this one. I only ask because if it is a preferred approach, we should probably be doing this everywhere (which is what I'm leaning toward).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"best practice" says to type hint everything but it's not necessary. Implementing type checking on a codebase of this size is a giant task, worth it in the long run but not trivial. I'd be happy to do so if you're on board though.

I will say because of how dynamic this code base is, it would probably be even hard than usual, but not at all impossible.

The most bang-for-buck would be to type check function signatures, it's 10% of the work for 90% gain. In my experience, so take it with a grain of salt, most things are caught at the signature level and not the individual variable level.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for weighing in on this. I think we ought to add types to variable declarations! We can both chip away at this, and I think it'll be worth it.

@tylerriccio33
Copy link
Contributor Author

So I dedicated a type to the bounds which should solve the validation piece for people who use type checkers. I abstracted the validation and derivation of the left and right bounds into a separate function. I think this should cover all cases and the approach will allow extension in the future and multiple uses of the new type.

@rich-iannone rich-iannone self-requested a review March 3, 2025 04:43
@@ -122,3 +122,8 @@ testpaths = [

[tool.black]
line-length = 100

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this coverage exclusion. Hope it's alright!

Copy link
Member

@rich-iannone rich-iannone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@rich-iannone rich-iannone merged commit 0bc5566 into posit-dev:main Mar 3, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants