Code "cleanup" #1149

lmores · 2023-02-17T10:33:04Z

Hi @fgregg, as I am reading the code of this library (with the aim of building a system able to incrementally digest one million records a day...) I could take the chance to apply some "code cleanup" and open a few PRs.

All changes in these PRs would be about coding style and possibly some minor performance improvements, no implementation or behavioural changes at all!

For example, here are some specimens of what I may change in variables/base.py.
Let me know if you are interested in these kinds of changes.

Direct import of entities:

from dedupe import predicates

would become

from dedupe.predicates import (
    ExistsPredicate,
    IndexPredicate,
    Predicate,
    SimplePredicate,
)

Use new f-string syntax:

self.name = "(%s: %s)" % (str(definition["name"]), str(definition["type"]))

would become

self.name = f"({definition['name']}: {definition['type']})"
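As a quick sanity check that the two spellings produce the same string, here is a minimal sketch. The `definition` dict and its values are made up for illustration and are not taken from dedupe.

```python
# Hypothetical field definition; the keys mirror the snippet above,
# the values are invented for this example.
definition = {"name": "address", "type": "String"}

# Old style: %-formatting with explicit str() calls.
old_name = "(%s: %s)" % (str(definition["name"]), str(definition["type"]))

# New style: the f-string applies str() formatting implicitly.
new_name = f"({definition['name']}: {definition['type']})"

assert old_name == new_name
print(new_name)
```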

Remove useless blank lines:

def __init__(self, name: str):

    self.name = "(%s: Not Missing)" % name

    self.has_missing = False

would become

def __init__(self, name: str):
    self.name = f"({name}: Not Missing)"
    self.has_missing = False

Always take a reference to objects before looping (to avoid repeated attribute lookup):

self.predicates = [
    self._Predicate(pred, self.field) for pred in self._predicate_functions
]

would become

_class = self._Predicate
field = self.field
self.predicates = [_class(fn, field) for fn in self._predicate_functions]
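Whether hoisting the attribute lookups pays off depends on how many iterations the loop runs, so it is worth measuring rather than assuming. The sketch below is one rough way to do that with `timeit`. `FakePredicate` and `FakeField` are hypothetical stand-ins written for this example, not dedupe classes, and the timings will vary by machine and Python version.

```python
import timeit


class FakePredicate:
    """Hypothetical stand-in for a predicate class (not dedupe's)."""

    def __init__(self, func, field):
        self.func = func
        self.field = field


class FakeField:
    """Hypothetical stand-in for a variable that builds predicates."""

    _Predicate = FakePredicate
    _predicate_functions = [len, str.lower, str.upper] * 100

    def __init__(self, field):
        self.field = field

    def build_with_lookup(self):
        # Looks up self._Predicate and self.field on every iteration.
        return [self._Predicate(fn, self.field) for fn in self._predicate_functions]

    def build_with_locals(self):
        # Binds both attributes to local names once, before the loop.
        _class = self._Predicate
        field = self.field
        return [_class(fn, field) for fn in self._predicate_functions]


var = FakeField("address")
a = var.build_with_lookup()
b = var.build_with_locals()

# Both variants must build equivalent predicate lists.
same = [(p.func, p.field) for p in a] == [(p.func, p.field) for p in b]

t_lookup = timeit.timeit(var.build_with_lookup, number=200)
t_locals = timeit.timeit(var.build_with_locals, number=200)
print(f"same result: {same}; lookup {t_lookup:.4f}s, locals {t_locals:.4f}s")
```

For a list that is built once per object with a handful of entries, any difference is likely to be negligible next to the rest of the work.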

Use Python3 super() syntax:

super(FieldType, self).__init__(definition)

would become

super().__init__(definition)
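In Python 3 the zero-argument form resolves to the same parent method as the explicit two-argument form. A minimal sketch, using hypothetical `Variable`, `OldStyle`, and `NewStyle` classes that only imitate the shape of the snippet above:

```python
class Variable:
    """Hypothetical base class used only for this example."""

    def __init__(self, definition):
        self.definition = definition


class OldStyle(Variable):
    def __init__(self, definition):
        # Explicit Python 2 compatible spelling.
        super(OldStyle, self).__init__(definition)


class NewStyle(Variable):
    def __init__(self, definition):
        # Python 3 zero-argument spelling; same method resolution.
        super().__init__(definition)


definition = {"name": "address"}
old = OldStyle(definition)
new = NewStyle(definition)
assert old.definition == new.definition
```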

Use comprehension when it results in simpler code:

index_predicates = []
for predicate in predicates:
    for threshold in thresholds:
        index_predicates.append(predicate(threshold, field))

return index_predicates

would become

return [pred(t, field) for pred in predicates for t in thresholds]
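The comprehension keeps the same iteration order as the nested loops: the outer `for` comes first, the inner `for` second. A minimal sketch that checks this, where `pred_a` and `pred_b` are hypothetical callables standing in for predicate classes:

```python
def build_nested(predicates, thresholds, field):
    # Original form: explicit nested loops with append.
    index_predicates = []
    for predicate in predicates:
        for threshold in thresholds:
            index_predicates.append(predicate(threshold, field))
    return index_predicates


def build_comprehension(predicates, thresholds, field):
    # Proposed form: one comprehension, same loop order.
    return [pred(t, field) for pred in predicates for t in thresholds]


# Hypothetical stand-ins that just record their arguments.
def pred_a(threshold, field):
    return ("a", threshold, field)


def pred_b(threshold, field):
    return ("b", threshold, field)


preds = [pred_a, pred_b]
thresholds = [0.2, 0.4]

nested = build_nested(preds, thresholds, "address")
flat = build_comprehension(preds, thresholds, "address")
assert nested == flat
```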

fgregg · 2023-02-17T12:45:19Z

Maintainer

i don't see most of these as clear improvements, so no.

dedupe can already handle much more than a million records a day, what is your bottleneck?

1 reply

lmores Feb 17, 2023
Author

I tried to dedupe a list of 2.2 million street addresses (each having 9 string fields) and hit an out-of-memory error on a machine with 128GB of RAM after a few minutes.
I do not have the log files right now, which info should I provide you to get some insights?

fgregg · 2023-02-17T13:28:17Z

Maintainer

did you use one of the bigger data recipes from dedupe-examples?

5 replies

lmores Feb 17, 2023
Author

Yes, I adapted the pgsql_big_dedupe_example (writing in a postgres database).

fgregg Feb 17, 2023
Maintainer

if you can point out where you ran into the memory problems, i can probably get you past that point.

lmores Feb 17, 2023
Author

I'll run the simulation again and let you know, I'll open an issue with the result.

fgregg Feb 17, 2023
Maintainer

a discussion would be better, but happy to take a look.

lmores Feb 17, 2023
Author

I re-ran the test that failed on 2.2 million records using the latest dedupe and it completed in about 35 minutes using a bit less than 20GB at peak. I also tried to reproduce the past behavior using older dedupe versions from the last few months, but I could not reproduce it. Thank you anyway!

fgregg · 2023-02-17T17:44:41Z

Maintainer

fantastic!

0 replies
