Extract predicate filtering from data model #1079

NickCrews · 2022-08-11T21:12:03Z

Part of the quest to move the implementation details
of predicates out of DataModel and into the things that
actually care about them.

This slightly changes the behavior in the test: because we don't
do any filtering either way, we use ALL predicates from the
variable definitions.

Inspired by #1065 (comment)

@fgregg this is ready for review

NickCrews · 2022-08-11T21:20:12Z

dedupe/labeler.py

@@ -225,6 +225,20 @@ def _sample_indices(self, sample_size: int) -> Iterable[RecordIDPair]:
        return sample_ids


+def _filter_canopy_predicates(
+    predicates: Iterable[Predicate], canopies: bool


@fgregg this preserves the logic that was here before, but I wanted to check that this is actually the logic we want. Was this intended to filter out canopies? That's not what it currently does: if you pass canopies=True, it includes only canopies and filters out non-canopies.

eg perhaps this is intended to be:

    def _filter_canopy_predicates(
        predicates: Iterable[Predicate], canopies: bool
    ) -> set[Predicate]:
        if canopies:
            return set(predicates)
        result = set()
        for predicate in predicates:
            if hasattr(predicate, "index"):
                is_canopy = hasattr(predicate, "canopy")
                if is_canopy:
                    result.add(predicate)
            else:
                result.add(predicate)
        return result
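
For comparison, here is a minimal sketch of the existing behavior as described in the comment above (canopies=True keeps only canopies among the index predicates). It is not the dedupe source: the three classes and the name `filter_as_described` are hypothetical stand-ins, and it assumes that the flag only affects predicates with an `index` attribute while all other predicates always pass through, which this thread does not confirm.

```python
from typing import Iterable


class SimplePredicate:
    """Hypothetical stand-in for a predicate that uses no index."""

    def __init__(self, name: str) -> None:
        self.name = name


class IndexPredicate(SimplePredicate):
    """Hypothetical stand-in for an index predicate (has an `index` attribute)."""

    index = None


class CanopyPredicate(IndexPredicate):
    """Hypothetical stand-in for a canopy predicate (also has `canopy`)."""

    canopy = None


def filter_as_described(
    predicates: Iterable[SimplePredicate], canopies: bool
) -> set:
    """Keep every non-index predicate; keep an index predicate only when
    its canopy-ness matches the `canopies` flag (assumed behavior)."""
    result = set()
    for predicate in predicates:
        if hasattr(predicate, "index"):
            if hasattr(predicate, "canopy") == canopies:
                result.add(predicate)
        else:
            result.add(predicate)
    return result


preds = [
    SimplePredicate("simple"),
    IndexPredicate("index"),
    CanopyPredicate("canopy"),
]
kept_true = sorted(p.name for p in filter_as_described(preds, canopies=True))
kept_false = sorted(p.name for p in filter_as_described(preds, canopies=False))
print(kept_true)   # ['canopy', 'simple']
print(kept_false)  # ['index', 'simple']
```

Under these assumptions, canopies=True drops the plain index predicate rather than the canopy one, which is the asymmetry the question is about.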

NickCrews · 2022-08-11T21:20:36Z

@benchmark

github-actions · 2022-08-11T21:38:31Z

All benchmarks (diff):

before	after	ratio	benchmark
528M	529M	1.00	canonical.Canonical.peakmem_run
18.0±0.4s	16.9±0.01s	0.94	canonical.Canonical.time_run
0.904	0.944	1.04	canonical.Canonical.track_precision
0.911	0.92	1.01	canonical.Canonical.track_recall
228M	228M	1.00	canonical_gazetteer.Gazetteer.peakmem_run(None)
15.8±0.05s	15.5±0.2s	0.98	canonical_gazetteer.Gazetteer.time_run(None)
0.982	0.973	0.99	canonical_gazetteer.Gazetteer.track_precision(None)
0.982	0.982	1.00	canonical_gazetteer.Gazetteer.track_recall(None)
228M	228M	1.00	canonical_matching.Matching.peakmem_run({'threshold': 0.5, 'constraint': 'many-to-one'})
228M	228M	1.00	canonical_matching.Matching.peakmem_run({'threshold': 0.5})
14.1±0.03s	14.1±0.01s	1.00	canonical_matching.Matching.time_run({'threshold': 0.5, 'constraint': 'many-to-one'})
14.2±0.07s	14.2±0.04s	1.00	canonical_matching.Matching.time_run({'threshold': 0.5})
0.99	0.99	1.00	canonical_matching.Matching.track_precision({'threshold': 0.5, 'constraint': 'many-to-one'})
0.99	0.99	1.00	canonical_matching.Matching.track_precision({'threshold': 0.5})
0.911	0.911	1.00	canonical_matching.Matching.track_recall({'threshold': 0.5, 'constraint': 'many-to-one'})
0.911	0.911	1.00	canonical_matching.Matching.track_recall({'threshold': 0.5})

(logs)

NickCrews · 2022-08-11T22:13:25Z

I have no idea why the tests are failing, since they weren't failing just a few commits ago and I don't see what could have changed. Looking at the logs, I think all the dependencies stayed the same, but maybe I missed something. Then again, dependencies shouldn't be related to this? Reinstalling main locally and running it, I get these errors too, but I don't know why...

NickCrews · 2022-08-11T22:32:51Z

I think editable install is failing due to pypa/setuptools#3497?

fgregg · 2022-08-12T00:37:19Z

i'm not sure that datamodel isn't the right owner of predicates, since it knows about the fields?

fgregg · 2022-08-12T00:38:06Z

but i suppose the narrower question is whether we want datamodel to care about whether we want certain types of predicates.

fgregg · 2022-08-17T16:47:04Z

can you rebase this so we can get tests going, @NickCrews ?

NickCrews · 2022-08-17T18:48:13Z

> i'm not sure that datamodel isn't the right owner of predicates, since it knows about the fields?
> but i suppose the narrower question is whether we want datamodel to care about whether we want certain types of predicates.

I think I see what you're saying, and I think I agree. Yes, this PR is just slightly improving the situation.

coveralls · 2022-08-17T18:56:34Z

Coverage decreased (-0.3%) to 64.108% when pulling 7e26aaf on NickCrews:extract-predicate-filtering-from-datamodel into bf028e9 on dedupeio:main.

NickCrews · 2022-08-30T17:35:31Z

Can we get this merged since it's an improvement, and then come back later to fully extract predicates?

fgregg · 2022-08-30T18:16:58Z

thanks for the ping! yes, let's do it.

NickCrews force-pushed the extract-predicate-filtering-from-datamodel branch from 37c0e36 to 74874ac Compare August 11, 2022 21:23

NickCrews added 2 commits August 17, 2022 10:48

Simplify canopy predicate filtering logic

8474469

NickCrews force-pushed the extract-predicate-filtering-from-datamodel branch from 74874ac to 7e26aaf Compare August 17, 2022 18:48

fgregg merged commit c595052 into dedupeio:main Aug 30, 2022

NickCrews mentioned this pull request Sep 1, 2022

Split up Datamodel into predicates, rename to Featurizer #1088

Open
