start to use sklearn for ml algorithms #992
Conversation
@fgregg I also see some nested … There are also graph operations in clustering.py that could in principle use sparse graphs from scipy, but it would be a full rewrite, and I'm not sure whether performance or memory usage would improve.
I don't think I'm familiar enough with the code to be that confident, but from a cursory scan it seems like you got the obvious low-hanging fruit.
If we implemented #967, then perhaps we could do a whole lot less guessing about how performance would change.
Nice, so @NickCrews's benchmarking setup lets us see that sklearn's random forest classifier leads us to use about 90 MB of peak memory, as opposed to 50 MB.
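For reference, peak-memory comparisons like the one above can be made with the standard library's `tracemalloc`. This is only an illustrative sketch of the measurement technique, not the actual benchmark harness used in this PR; `peak_memory_mb` and `allocate` are hypothetical names.

```python
import tracemalloc

def peak_memory_mb(fn, *args, **kwargs):
    """Run fn and report the peak memory it allocated, in MiB."""
    tracemalloc.start()
    try:
        fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / 2**20

# Example: a function that briefly holds about 8 MiB of data.
def allocate():
    buf = bytearray(8 * 2**20)
    return len(buf)

print(round(peak_memory_mb(allocate)))  # roughly 8
```

Running the classifier fit under such a wrapper on both branches is what makes the 50 MB vs. 90 MB comparison concrete.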
- rlrlearner -> rflearner
- update reqs
- use sklearn clustering
- update tests
This reverts commit 3c34d99. Using sklearn to calculate cosine is significantly slower than the simplecosine package, because the sklearn methods were not designed to be called field-pair by field-pair.
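The per-pair slowdown mentioned above is a general pattern: Python-level call overhead dominates when a similarity is computed one record pair at a time, and vanishes when all pairs are computed in one vectorized expression. A hedged sketch with plain NumPy (not dedupe's actual field comparators) showing the two styles produce the same values:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
A = rng.random((1000, 16))
B = rng.random((1000, 16))

# Field-pair by field-pair: one Python-level call (and all its
# overhead) per record pair -- the slow pattern described above.
per_pair = np.array([cosine(a, b) for a, b in zip(A, B)])

# Batched: one vectorized expression over every pair at once.
batched = (A * B).sum(axis=1) / (
    np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
)

assert np.allclose(per_pair, batched)
```

This is why replacing simplecosine with sklearn only pays off if the calls can be batched.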
@NickCrews, there seems to be something going on with the benchmarks. The precision and recall scores are exactly the same between this PR and the comparison branch. That does not seem possible.
I adjusted the benchmarker, and we are getting a much clearer picture.

I think the issue might be …
If we increase the threshold a bit for canonical, we can get better recall and precision than with logistic regression. I think I'll keep this change. The only thing we have left to do is make sure that existing settings files that use rlr can still be loaded, and add a deprecation notice if we find such a settings file.
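The backward-compatibility step described above could look something like the following. This is a hypothetical sketch, not dedupe's actual settings format: the `"classifier"` key and `load_classifier` name are invented for illustration; only the warn-then-fall-back pattern is the point.

```python
import warnings

def load_classifier(settings: dict) -> str:
    """Map a stored classifier name to the one we will use,
    warning when the deprecated 'rlr' (regularized logistic
    regression) value is found in an old settings file."""
    kind = settings.get("classifier", "rf")
    if kind == "rlr":
        warnings.warn(
            "Settings files trained with 'rlr' are deprecated; "
            "please retrain to use the random forest classifier.",
            DeprecationWarning,
            stacklevel=2,
        )
        kind = "rf"
    return kind

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    assert load_classifier({"classifier": "rlr"}) == "rf"
    assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```

Keeping the old files loadable while warning gives users one release cycle to retrain.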
All benchmarks (diff):
relates to #991 and #990
Todo
- replace `haversine` dependency with https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.haversine_distances.html#sklearn.metrics.pairwise.haversine_distances
- replace `simple-cosine` dependency with https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html#sklearn.metrics.pairwise.cosine_similarity

From trying to replace the cosine distance, it's pretty clearly not worth doing that unless we can batch the calls.
Any other likely places where we can use sklearn or scipy code instead of an additional dependency, or replace dedupe code, @fjsj, @NickCrews?
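On the `haversine` todo item above: `sklearn.metrics.pairwise.haversine_distances` expects (latitude, longitude) in radians and returns distances on the unit sphere, which you scale by the Earth's radius to get kilometers. A self-contained sketch of the same formula (using only the standard library here, so it runs without sklearn installed):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points
    given in degrees -- the same quantity
    sklearn.metrics.pairwise.haversine_distances yields on the
    unit sphere, scaled by the Earth's radius."""
    phi1, lam1, phi2, lam2 = map(radians, (lat1, lon1, lat2, lon2))
    h = (sin((phi2 - phi1) / 2) ** 2
         + cos(phi1) * cos(phi2) * sin((lam2 - lam1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

# Paris -> New York is roughly 5,800 km.
print(round(haversine_km(48.8566, 2.3522, 40.7128, -74.0060)))
```

If dedupe's callers pass degrees, the conversion to radians (and back-scaling by the radius) would need to live in the comparator wrapping the sklearn call.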