added ability to cache matrix in queries across which master is constant #83
base: master
Conversation
Awesome, thanks for taking some time to implement this, @ParticularMiner. Can't wait to test it out and report back what I see. I am especially curious about the size issue. I guess I hadn't considered the fact that this could have a performance hit, since the 'other way' creates the entire embedding anyway. I need to wrap my head around that after taking another look at your code changes. I'm sure there are performance enhancements built into the way the code was originally structured for when size > RAM, but I hadn't thought to investigate that. Thanks for pointing it out.
You're very welcome, @asnorthrup! Indeed, we'd be very grateful if you would report your benchmarks related to this code. That would help us decide if these changes are worth merging or not.
The 'other way' does not, in general, embed the entire `master` dataset at once. The partitions are processed one after the other, which means that each partition's matrix replaces the former matrix in RAM (and CPU cache), keeping overall memory usage small. As a result, any time an end user submits a new query (even with the same `master`), all the partition matrices have to be rebuilt from scratch.

However, with the 'new way' (provided in this PR), we'd be storing each partition's matrix in a Python `dict`, so that repeated queries against the same `master` can skip the rebuild. Since all the partition matrices would then be held in memory at once, this is likely to be worthwhile only when `master` is small enough.
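To make the contrast concrete, here is a minimal sketch of the two approaches; the helper names and the char 3-gram vectorizer settings are illustrative only, not the library's actual code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def build_matrix(partition):
    # Stands in for the expensive per-partition tf-idf embedding of master.
    return TfidfVectorizer(analyzer='char', ngram_range=(3, 3)).fit_transform(partition)

def query_without_cache(master_partitions, process):
    # 'Other way': every query rebuilds each partition's matrix, and each new
    # matrix replaces the previous one, so at most one lives in RAM at a time.
    for partition in master_partitions:
        process(partition, build_matrix(partition))

matrix_cache = {}

def query_with_cache(master_partitions, process):
    # 'New way' (this PR's idea): each partition's matrix is kept in a dict,
    # so repeated queries skip the rebuild, but all matrices stay in memory.
    for i, partition in enumerate(master_partitions):
        if i not in matrix_cache:
            matrix_cache[i] = build_matrix(partition)
        process(partition, matrix_cache[i])
```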
To the best of my understanding, I tried to see what kind of difference 'rebuilding a matrix' vs. 'dict lookup' had. See if you agree with how I used the Grouper in the new scenario. But the outcome seems fairly consistent: with my small duplicates size (5 records in my test) vs. large master size (varying size), I couldn't get an appreciable difference between the two versions. The cached version generally ran faster, but only on the order of 0.1 seconds, even on a master with sizes near ~1M records. Unless there is something about the tests that I'm not accounting for, I would say that the update really doesn't have enough of a performance increase to warrant adding more base functions. It doesn't hurt, but for simplicity, I'm not sure I'd worry about it for my use cases.
Thank you very much @asnorthrup for your code! It's very good to test against different `master` sizes. I'll try to explain further below:

The caching is only expected to pay off from the second query onward. The very first query against a given `master` must still build its matrix from scratch, so a test that times just a single query (or that rebuilds `master` before every query) will show the two versions performing almost identically. It is only when the same `master` is queried repeatedly with different `duplicates` datasets that the cached matrix saves the rebuild time.

I hope this explanation helps.
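A rough sketch of that distinction (the cache key and helper names below are illustrative, not the PR's actual code): the first call pays for building the `master` matrix, while later calls against the same `master` reduce to a dictionary lookup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

_master_cache = {}  # keyed on the identity of the master dataset (illustrative)

def get_master_matrix(master):
    key = id(master)
    if key not in _master_cache:
        # First query against this master: the matrix must be built.
        vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(3, 3))
        _master_cache[key] = (vectorizer, vectorizer.fit_transform(master))
    # Later queries against the same master: a cheap dict lookup.
    return _master_cache[key]

def match(master, duplicates):
    vectorizer, master_matrix = get_master_matrix(master)
    duplicates_matrix = vectorizer.transform(duplicates)  # always recomputed
    # Dot product of L2-normalized tf-idf vectors gives cosine similarities.
    return duplicates_matrix.dot(master_matrix.T)
```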
Good explanation. I had in my mind that building master was the 'first query', but that was just a quick assumption, not due to a 'confused end user', more just something I could have read into better. Anyway, I've updated the tests, and they show exactly what you intended: a dramatic speed-up. Even when the original tf-idf took 3+ seconds to build with 700k+ names, the speed-up is dramatic. With my new understanding and test results, I think this is worth considering for inclusion, since the performance increase would be significant in some cases. Let's say I wanted to find duplicates, then 'stem' the duplicates, then apply some other transformation: the speed increase against 'master' would be significant in the 'stem' and 'transform' queries. 5-6 seconds vs. 0.1 seconds for the larger customer lists.
That's exciting! Thank you very much for taking the trouble to benchmark this. Hopefully these changes will soon appear in a new version.
Hi all! Thanks again for your work @ParticularMiner! I will try to add it to a new version soon once I have some time.
Hi @Bergvca,
This PR addresses @asnorthrup's issue: see here.
Within this PR, the end-user is now able to:

- cache the matrix corresponding to the `master` dataset while the `duplicates` dataset changes;
- thereby speed up repeated queries against the same `master` (provided `master` is small enough).

Caution: Caching is likely to provide a performance benefit only when the input `master` dataset is "small enough". The challenge would be in determining exactly how small `master` should be for caching to become beneficial.
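The exact option added by this PR isn't quoted in this thread, so the sketch below only illustrates the repeated-query pattern the cache is meant to speed up, using string_grouper's existing match_strings function:

```python
import pandas as pd
from string_grouper import match_strings

master = pd.Series(['McDonalds Inc.', 'Burger King Corp.', 'Wendys Co.'])

# First query: the master matrix has to be built in any case.
duplicates = pd.Series(['MacDonalds', 'Burger Kng'])
matches = match_strings(master, duplicates)

# Later queries against the same, unchanged master (e.g. after stemming or
# otherwise transforming the duplicates) are where a cached master matrix
# would save the rebuild time reported in this thread.
transformed = duplicates.str.lower()
matches_after_transform = match_strings(master, transformed)
```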