Extract predicate filtering from data model #1079

Merged

Conversation

@NickCrews NickCrews commented Aug 11, 2022

Part of the quest to move the implementation details
of predicates out of DataModel and into the things that
actually care about them.

This slightly changes the behavior in the test: because we don't
do any filtering either way, we use ALL predicates from the
variable definitions.

Inspired by #1065 (comment)

@fgregg this is ready for review

@@ -225,6 +225,20 @@ def _sample_indices(self, sample_size: int) -> Iterable[RecordIDPair]:
return sample_ids


def _filter_canopy_predicates(
    predicates: Iterable[Predicate], canopies: bool
Contributor Author

@fgregg this preserves the logic that was here before, but I wanted to check that this actually is the logic that we want. I was wondering if this was actually intended to filter out canopies? Because that's not what it currently does: if you pass canopies=True, it includes only canopies and filters out non-canopies.

e.g. perhaps this is intended to be:

def _filter_canopy_predicates(
    predicates: Iterable[Predicate], canopies: bool
) -> set[Predicate]:
    if canopies:
        return set(predicates)
    result = set()
    for predicate in predicates:
        if hasattr(predicate, "index"):
            is_canopy = hasattr(predicate, "canopy")
            # canopies=False: drop canopy predicates, keep other index predicates
            if not is_canopy:
                result.add(predicate)
        else:
            result.add(predicate)
    return result
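
To make the two readings discussed in this review comment concrete, here is a minimal, self-contained sketch. It is not the code from this PR. The stub classes and both function names are hypothetical, and the "matching" variant is only a reconstruction of the behavior described in the comment (index predicates are kept only when their canopy status matches the flag). The second function is the alternative reading, in which canopies=False drops canopy predicates and canopies=True keeps everything.

```python
from typing import Iterable


class PlainStub:
    """Hypothetical stand-in for a predicate with no index."""


class IndexStub:
    """Hypothetical stand-in for an index predicate that is not a canopy."""

    index = None


class CanopyStub(IndexStub):
    """Hypothetical stand-in for an index predicate that is also a canopy."""

    canopy = None


def keep_matching_flag(predicates: Iterable[object], canopies: bool) -> set:
    """Reading 1 (as described in the comment): an index predicate is kept
    only if its canopy status equals the flag; other predicates always pass."""
    result = set()
    for predicate in predicates:
        if hasattr(predicate, "index"):
            if hasattr(predicate, "canopy") == canopies:
                result.add(predicate)
        else:
            result.add(predicate)
    return result


def drop_canopies_unless_flag(predicates: Iterable[object], canopies: bool) -> set:
    """Reading 2 (the alternative): canopies=True keeps everything,
    canopies=False removes only the canopy predicates."""
    if canopies:
        return set(predicates)
    return {
        p
        for p in predicates
        if not (hasattr(p, "index") and hasattr(p, "canopy"))
    }


plain, index_only, canopy = PlainStub(), IndexStub(), CanopyStub()
everything = [plain, index_only, canopy]

# The two readings agree when canopies=False and differ when canopies=True:
# reading 1 drops the non-canopy index predicate, reading 2 keeps it.
print(keep_matching_flag(everything, canopies=True) == {plain, canopy})
print(drop_canopies_unless_flag(everything, canopies=True) == set(everything))
```

Under these assumptions the only observable difference is what happens to non-canopy index predicates when canopies=True, which is the case the question to @fgregg is really about.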

@NickCrews
Contributor Author

@benchmark

@NickCrews NickCrews force-pushed the extract-predicate-filtering-from-datamodel branch from 37c0e36 to 74874ac Compare August 11, 2022 21:23
@github-actions

All benchmarks (diff):

before      after       ratio  benchmark
528M        529M        1.00   canonical.Canonical.peakmem_run
18.0±0.4s   16.9±0.01s  0.94   canonical.Canonical.time_run
0.904       0.944       1.04   canonical.Canonical.track_precision
0.911       0.92        1.01   canonical.Canonical.track_recall
228M        228M        1.00   canonical_gazetteer.Gazetteer.peakmem_run(None)
15.8±0.05s  15.5±0.2s   0.98   canonical_gazetteer.Gazetteer.time_run(None)
0.982       0.973       0.99   canonical_gazetteer.Gazetteer.track_precision(None)
0.982       0.982       1.00   canonical_gazetteer.Gazetteer.track_recall(None)
228M        228M        1.00   canonical_matching.Matching.peakmem_run({'threshold': 0.5, 'constraint': 'many-to-one'})
228M        228M        1.00   canonical_matching.Matching.peakmem_run({'threshold': 0.5})
14.1±0.03s  14.1±0.01s  1.00   canonical_matching.Matching.time_run({'threshold': 0.5, 'constraint': 'many-to-one'})
14.2±0.07s  14.2±0.04s  1.00   canonical_matching.Matching.time_run({'threshold': 0.5})
0.99        0.99        1.00   canonical_matching.Matching.track_precision({'threshold': 0.5, 'constraint': 'many-to-one'})
0.99        0.99        1.00   canonical_matching.Matching.track_precision({'threshold': 0.5})
0.911       0.911       1.00   canonical_matching.Matching.track_recall({'threshold': 0.5, 'constraint': 'many-to-one'})
0.911       0.911       1.00   canonical_matching.Matching.track_recall({'threshold': 0.5})

(logs)

@NickCrews
Contributor Author

NickCrews commented Aug 11, 2022

I have no idea why the tests are failing, since they weren't failing just a few commits ago and I don't see what could have changed. Looking at the logs, I think all the dependencies stayed the same, but maybe I missed something. But dependencies shouldn't be related to this? Reinstalling main locally and running the tests, I also get these errors, but I don't know why...

@NickCrews
Contributor Author

I think editable install is failing due to pypa/setuptools#3497?

@fgregg
Contributor

fgregg commented Aug 12, 2022

i'm not sure that datamodel isn't the right owner of predicates, since it knows about the fields?

@fgregg
Contributor

fgregg commented Aug 12, 2022

but i suppose the narrower question is whether we want datamodel to care about whether we want certain types of predicates.

@fgregg
Contributor

fgregg commented Aug 17, 2022

can you rebase this so we can get tests going, @NickCrews ?

@NickCrews
Contributor Author

> i'm not sure that datamodel isn't the right owner of predicates, since it knows about the fields?
> but i suppose the narrower question is whether we want datamodel to care about whether we want certain types of predicates.

I think I see what you're saying, and I think I agree. Yes, this PR is just slightly improving the situation.

@NickCrews NickCrews force-pushed the extract-predicate-filtering-from-datamodel branch from 74874ac to 7e26aaf Compare August 17, 2022 18:48
@coveralls

Coverage decreased (-0.3%) to 64.108% when pulling 7e26aaf on NickCrews:extract-predicate-filtering-from-datamodel into bf028e9 on dedupeio:main.

@NickCrews
Contributor Author

Can we get this merged since it's an improvement, and then come back later to fully extract predicates?

@fgregg
Contributor

fgregg commented Aug 30, 2022

thanks for the ping! yes, let's do it.
