start to use sklearn for ml algorithms #992
Conversation
@fgregg I also see some nested … There are also graph operations in clustering.py that could in principle use sparse graphs from scipy, but it would be a full rewrite, and I'm not sure whether performance or memory usage would improve.
I don't think I'm familiar enough with the code to be that confident, but from a cursory scan it seems like you got the obvious low-hanging fruit.
If we implemented #967, then perhaps we could do a whole lot less guessing about how performance would change.
Nice, so @NickCrews's benchmarking setup lets us see that sklearn's random forest classifier leads us to use about 90 MB of peak memory, as opposed to 50 MB.
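For reference, peak-memory comparisons like the one above can be made with the standard library's `tracemalloc`. This is only an illustrative sketch of the measurement technique, not the actual benchmark harness used in this PR; `peak_memory_mb` and `allocate` are hypothetical names.

```python
import tracemalloc

def peak_memory_mb(fn, *args, **kwargs):
    """Run fn and report the peak memory it allocated, in MiB."""
    tracemalloc.start()
    try:
        fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / 2**20

# Example: a function that briefly holds about 8 MiB of data.
def allocate():
    buf = bytearray(8 * 2**20)
    return len(buf)

print(round(peak_memory_mb(allocate)))  # roughly 8
```

Running the classifier fit under such a wrapper on both branches is what makes the 50 MB vs. 90 MB comparison concrete.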
- rlrlearner -> rflearner
- update reqs
- use sklearn clustering
- update tests
This reverts commit 3c34d99. Using sklearn to calculate cosine is significantly slower than the simplecosine package, because the sklearn methods were not designed to be called field-pair by field-pair.
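The per-pair slowdown mentioned above is a general pattern: Python-level call overhead dominates when a similarity is computed one record pair at a time, and vanishes when all pairs are computed in one vectorized expression. A hedged sketch with plain NumPy (not dedupe's actual field comparators) showing the two styles produce the same values:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
A = rng.random((1000, 16))
B = rng.random((1000, 16))

# Field-pair by field-pair: one Python-level call (and all its
# overhead) per record pair -- the slow pattern described above.
per_pair = np.array([cosine(a, b) for a, b in zip(A, B)])

# Batched: one vectorized expression over every pair at once.
batched = (A * B).sum(axis=1) / (
    np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
)

assert np.allclose(per_pair, batched)
```

This is why replacing simplecosine with sklearn only pays off if the calls can be batched.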
@NickCrews, there seems to be something going on with the benchmarks. The precision and recall scores are exactly the same between this PR and the comparison branch. That does not seem possible.
I adjusted the benchmarker, and we are getting a much clearer picture.

I think the issue might be …
If we increase the threshold a bit for canonical, we can get better recall and precision than with logistic regression. I think I'll keep this change. The only thing we have left to do is make sure that existing settings files that use rlr can still be loaded, and add a deprecation notice if we find such a settings file.
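The backward-compatibility step described above could look something like the following. This is a hypothetical sketch, not dedupe's actual settings format: the `"classifier"` key and `load_classifier` name are invented for illustration; only the warn-then-fall-back pattern is the point.

```python
import warnings

def load_classifier(settings: dict) -> str:
    """Map a stored classifier name to the one we will use,
    warning when the deprecated 'rlr' (regularized logistic
    regression) value is found in an old settings file."""
    kind = settings.get("classifier", "rf")
    if kind == "rlr":
        warnings.warn(
            "Settings files trained with 'rlr' are deprecated; "
            "please retrain to use the random forest classifier.",
            DeprecationWarning,
            stacklevel=2,
        )
        kind = "rf"
    return kind

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    assert load_classifier({"classifier": "rlr"}) == "rf"
    assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```

Keeping the old files loadable while warning gives users one release cycle to retrain.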
All benchmarks (diff):
relates to #991 and #990
Todo
- replace `haversine` dependency with https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.haversine_distances.html#sklearn.metrics.pairwise.haversine_distances
- replace `simple-cosine` dependency with https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html#sklearn.metrics.pairwise.cosine_similarity

From trying to replace the cosine distance, it's pretty clearly not worth doing that unless we can batch the calls.
Any other likely places where we can use sklearn or scipy code instead of an additional dependency, or replace dedupe code, @fjsj, @NickCrews?
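On the `haversine` todo item above: `sklearn.metrics.pairwise.haversine_distances` expects (latitude, longitude) in radians and returns distances on the unit sphere, which you scale by the Earth's radius to get kilometers. A self-contained sketch of the same formula (using only the standard library here, so it runs without sklearn installed):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points
    given in degrees -- the same quantity
    sklearn.metrics.pairwise.haversine_distances yields on the
    unit sphere, scaled by the Earth's radius."""
    phi1, lam1, phi2, lam2 = map(radians, (lat1, lon1, lat2, lon2))
    h = (sin((phi2 - phi1) / 2) ** 2
         + cos(phi1) * cos(phi2) * sin((lam2 - lam1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

# Paris -> New York is roughly 5,800 km.
print(round(haversine_km(48.8566, 2.3522, 40.7128, -74.0060)))
```

If dedupe's callers pass degrees, the conversion to radians (and back-scaling by the radius) would need to live in the comparator wrapping the sklearn call.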