Add tolerance to row count match #73

tylerriccio33 · 2025-03-01T15:09:47Z

I added a tolerance to the row count match validation. It is useful for source data that is updating constantly or very large, ie. big time streaming data. I want the ability to allow for some tolerance.

codecov-commenter · 2025-03-01T15:11:30Z

Codecov Report

Attention: Patch coverage is 85.71429% with 4 lines in your changes missing coverage. Please review.

Project coverage is 99.03%. Comparing base (cf006a8) to head (53fc018).
Report is 11 commits behind head on main.

Files with missing lines	Patch %	Lines
pointblank/_typing.py	0.00%	4 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #73      +/-   ##
==========================================
- Coverage   99.15%   99.03%   -0.12%     
==========================================
  Files          15       16       +1     
  Lines        3184     3206      +22     
==========================================
+ Hits         3157     3175      +18     
- Misses         27       31       +4

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

rich-iannone · 2025-03-01T22:27:35Z

pointblank/validate.py

@@ -4551,6 +4551,7 @@ def col_schema_match(
    def row_count_match(
        self,
        count: int | FrameT | Any,
+        tol: float = 0,


If we're adding a tol= arg (which is totally reasonable) could you add a description of it in the docstring? Also, it might be good to have the option to input as a tuple/list for setting lower/upper bounds.

I totally missed the updating the docstring - my mistake! As for lower/upper bounds, I can use the suggestion you had: if 0-1 represent tolerance as a percentage of the actual count, if tuple then treat as raw limits and if int above 1 treat as literal limit.

tests/test_validate.py

rich-iannone

This is a great idea, but I think we should work a bit more on getting the implementation down. Left a few comments, we can discuss a bit more in the Conversation view.

rich-iannone · 2025-03-01T22:33:07Z

One more idea that could be interesting would be to accept/process values similar to the Threshold class: any value between between 0-1 is relative, 1 and above values would be absolute rows.

tylerriccio33 · 2025-03-02T14:48:41Z

New pr has the below:

removed numpy as the rng and went back to good ole' random. I'm opening a separate pr on using hypothesis for parametric testing instead, details there.
added ability to pass literal limits as a tuple, a literal limit as int and a relative limit as in the original pr. This takes a bit of validation.
Added new examples to demonstrate plus a test of those examples.

rich-iannone · 2025-03-02T19:31:07Z

Thanks! Will review this soon :)

rich-iannone · 2025-03-02T20:00:05Z

Makefile

@@ -7,7 +7,7 @@ test-update:
 	pytest --snapshot-update

 test-coverage:
-	pytest --cov=pointblank --cov-report=term
+	pytest --cov=pointblank --cov-report=term-missing


Didn't know about this! But it's a great improvement.

I accidentally left that in the pr 😂

rich-iannone · 2025-03-02T20:14:55Z

I feel like we're very close here to enhancing this method. One last thing: I think the rules for relative/absolute limits should apply to both the scalar case and the 2-item tuple case.

The basic case (scalar) would be used most often and just means that the range is equal on both sides (so using tol=0.1 could expand to, for the internal logic, tol=(0.1, 0.1)). And I can see using a mix of absolute and relative bounds for some cases (e.g., there should be at least x rows, but perhaps up to 10% were added; so: tol=(0, 0.1)).

If this is okay, I'd recommend (in the code) upgrading any scalar value to a tuple (5 -> (5, 5)) and always handling the lower/upper bounds. Sorry to keep adding work, but this will be so useful once completed!

rich-iannone · 2025-03-02T20:18:18Z

pointblank/_interrogation.py

    tbl_type: str = "local"

    def __post_init__(self):

        from pointblank.validate import get_row_count

-        if not self.inverse:
-            res = get_row_count(data=self.data_tbl) == self.count
+        row_count :int = get_row_count(data=self.data_tbl)


Is it common/better to declare types for variables? I'm honestly not sure what the norm is on this one. I only ask because if it is a preferred approach, we should probably be doing this everywhere (which is what I'm leaning toward).

"best practice" says to type hint everything but it's not necessary. Implementing type checking on a codebase of this size is a giant task, worth it in the long run but not trivial. I'd be happy to do so if you're on board though.

I will say because of how dynamic this code base is, it would probably be even hard than usual, but not at all impossible.

The most bang-for-buck would be to type check function signatures, it's 10% of the work for 90% gain. In my experience, so take it with a grain of salt, most things are caught at the signature level and not the individual variable level.

Thanks for weighing in on this. I think we ought to add types to variable declarations! We can both chip away at this, and I think it'll be worth it.

tylerriccio33 · 2025-03-03T02:33:15Z

So I dedicated a type to the bounds which should solve the validation piece for people who use type checkers. I abstracted the validation and derivation of the left and right bounds into a separate function. I think this should cover all cases and the approach will allow extension in the future and multiple uses of the new type.

rich-iannone · 2025-03-03T04:46:10Z

pyproject.toml

@@ -122,3 +122,8 @@ testpaths = [

 [tool.black]
 line-length = 100
+


I added this coverage exclusion. Hope it's alright!

rich-iannone

LGTM!

Implement row count tolerance

e1332c8

rich-iannone self-requested a review March 1, 2025 22:22

rich-iannone reviewed Mar 1, 2025

View reviewed changes

tests/test_validate.py Outdated Show resolved Hide resolved

rich-iannone requested changes Mar 1, 2025

View reviewed changes

tylerriccio33 added 4 commits March 1, 2025 18:42

Remove numpy from row tol testing

1cf08dd

Add alternative row count tol inputs

d620478

Expand allowable row count tol inputs

a7affae

Add documentation for new tolerance endpoint

74ac39c

rich-iannone reviewed Mar 2, 2025

View reviewed changes

tylerriccio33 closed this Mar 2, 2025

tylerriccio33 reopened this Mar 2, 2025

tylerriccio33 added 2 commits March 2, 2025 18:51

Document tolerance as distinct type

b69d26d

Consolidate bounds checking in dedicated function

c8e5727

Exclude type-checking import lines from coverage

53fc018

rich-iannone self-requested a review March 3, 2025 04:43

rich-iannone reviewed Mar 3, 2025

View reviewed changes

rich-iannone approved these changes Mar 3, 2025

View reviewed changes

rich-iannone merged commit 0bc5566 into posit-dev:main Mar 3, 2025
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add tolerance to row count match #73

Add tolerance to row count match #73

tylerriccio33 commented Mar 1, 2025

codecov-commenter commented Mar 1, 2025 •

edited

Loading

rich-iannone Mar 1, 2025

tylerriccio33 Mar 1, 2025

rich-iannone left a comment

rich-iannone commented Mar 1, 2025

tylerriccio33 commented Mar 2, 2025

rich-iannone commented Mar 2, 2025

rich-iannone Mar 2, 2025

tylerriccio33 Mar 2, 2025

rich-iannone commented Mar 2, 2025

rich-iannone Mar 2, 2025

tylerriccio33 Mar 2, 2025

rich-iannone Mar 3, 2025

tylerriccio33 commented Mar 3, 2025

rich-iannone Mar 3, 2025

rich-iannone left a comment

		@@ -122,3 +122,8 @@ testpaths = [

		[tool.black]
		line-length = 100

Add tolerance to row count match #73

Add tolerance to row count match #73

Conversation

tylerriccio33 commented Mar 1, 2025

codecov-commenter commented Mar 1, 2025 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rich-iannone left a comment

Choose a reason for hiding this comment

rich-iannone commented Mar 1, 2025

tylerriccio33 commented Mar 2, 2025

rich-iannone commented Mar 2, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rich-iannone commented Mar 2, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tylerriccio33 commented Mar 3, 2025

Choose a reason for hiding this comment

rich-iannone left a comment

Choose a reason for hiding this comment

codecov-commenter commented Mar 1, 2025 •

edited

Loading