added ability to cache matrix in queries across which master is constant #83

Open · wants to merge 2 commits into base: master

Conversation

ParticularMiner
Contributor

Hi @Bergvca,

This PR addresses @asnorthrup's issue: see here.

Within this PR, the end-user is now able to:

  1. execute multiple separate queries on the same master dataset while the duplicates dataset changes;
  2. take advantage of caching to boost the performance of such queries (if master is small enough).
from string_grouper import StringGrouper
... # set pandas.Series master, df1, and df2
sg = StringGrouper(master, enable_cache=True)  # build the corpus (with caching enabled)
matches1 = sg.match_strings(None, duplicates=df1)  # first query: compares df1 against master while caching master's matrix
matches2 = sg.match_strings(None, duplicates=df2)  # subsequent queries: compare df2 against master using the cache

Caution: Caching is likely to provide a performance benefit only when the input master dataset is "small enough". The challenge would be in determining exactly how small master should be for caching to become beneficial.

@asnorthrup

Awesome, thanks for taking some time to implement this, @ParticularMiner. Can't wait to test it out and report back what I see. I am especially curious about the size issue. I guess I hadn't considered the fact that this could have a performance hit, since the 'other way' creates the entire embedding anyway. I need to wrap my head around that after taking another look at your code changes. I'm sure there are performance enhancements built into the way the code was originally structured for when size > RAM, but I hadn't thought to investigate that. Thanks for pointing it out.

@ParticularMiner
Contributor Author

You're very welcome, @asnorthrup!

Indeed, we'd be very grateful if you would report your benchmarks related to this code. That would help us decide if these changes are worth merging or not.

@ParticularMiner
Contributor Author

@asnorthrup

The 'other way' does not, in general, embed the entire master-matrix in RAM. Only the entire corpus is stored. To boost performance, the master series is partitioned in such a way that each partition can be transformed into a matrix which is small enough to fit into the CPU cache where the processing is performed.

The partitions are processed one after the other, implying that each partition's matrix replaces the former matrix in RAM (and CPU cache), keeping overall memory usage small. As a result, any time an end user submits a new query (even with the same master series) each partition's matrix would always need to be re-built from the corpus.

However, with the 'new way' (provided in this PR), we store each partition's matrix, as soon as it is computed, in a Python dict, which I call the cache (not to be confused with the CPU cache). So with each new query involving the same master series, the partitions' matrices do not need to be recomputed. Instead they are looked up from the Python dict (the cache). We are, of course, assuming that a Python dict lookup is faster than rebuilding a matrix from the corpus.

Since the cache takes up significant space in RAM, it is possible for it to degrade performance. But the extent to which the cache enhances or degrades performance needs to be investigated experimentally.
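To illustrate the idea only (this is not the code in this PR), here is a minimal sketch of memoizing partition matrices in a plain Python dict; the helper name, vectorizer settings, and toy corpus are all hypothetical:

from sklearn.feature_extraction.text import TfidfVectorizer

def get_partition_matrix(cache, vectorizer, master, start, stop):
    """Return the tf-idf matrix of master[start:stop], computing it only on a cache miss."""
    key = (start, stop)
    if key not in cache:
        cache[key] = vectorizer.transform(master[start:stop])  # built from the corpus once
    return cache[key]

corpus = ["foo bar", "foo baz", "bar baz", "qux quux"]
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 3)).fit(corpus)
cache = {}
m1 = get_partition_matrix(cache, vectorizer, corpus, 0, 2)  # first query: transform runs
m2 = get_partition_matrix(cache, vectorizer, corpus, 0, 2)  # repeat query: dict lookup only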

@asnorthrup

To the best of my understanding, I tried to see what kind of difference 'rebuilding a matrix' vs 'dict lookup' makes. See if you agree with how I used the Grouper in the new scenario.

But the outcome seems fairly consistent: with my small duplicates size (5 records in my test) versus a large master of varying size, I couldn't get an appreciable difference between the two versions. The cached version generally ran faster, but only on the order of 0.1 seconds, even on a master of nearly ~1M records.

Unless there is something about the tests that I'm not accounting for, I would say that the update really doesn't offer enough of a performance increase to warrant adding more base functions. It doesn't hurt, but for simplicity, I'm not sure I'd worry about it for my use cases.
test_cache code.txt
Test_Cache_results.txt

@ParticularMiner
Contributor Author

Thank you very much @asnorthrup for your code!

It's very good to test against different master series. But I'm beginning to wonder if the changes I introduced in this PR improve functionality at the cost of confusing the end-user.

I'll try to explain further below:

  • In the inner loop of your code, you

    1. build the corpus (using sg = StringGrouper(master, ...)) then
    2. submit a single query on that corpus (using sg.match_strings(None, duplicates, ...)).
    3. repeat the above two steps with enable_cache=True

    This procedure is unlikely to demonstrate the impact of caching because you are submitting only one query in each case (where Case 1 means "no caching", and Case 2 means "caching").

  • You need to submit at least two queries in each case. The impact on performance can only be seen from the 2nd, 3rd, ... queries onwards.

  • This is because building a corpus alone does not create any tf-idf matrices, but submitting a query on the corpus does. So in both cases above, the first query computes fresh tf-idf matrices from the corpus. But in Case 2, the first query also saves the tf-idf matrix of master in the "cache".

  • So for any subsequent queries on the same corpus,

    • in Case 1: the same tf-idf matrices are computed again from the corpus even though master does not change.
    • in Case 2: the tf-idf matrix for master is not computed again, but is instead retrieved from cache in order to save time.

I hope this explanation helps.
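
For concreteness, a rough sketch of such a two-query benchmark might look like the following. It reuses the API shown in the PR description and assumes this PR's branch is installed; the stand-in data and the assumption that enable_cache=False means "no caching" are mine, not part of the PR.

import time
import pandas as pd
from string_grouper import StringGrouper

master = pd.Series([f"company {i}" for i in range(100_000)])
dupes1 = pd.Series(["company 1", "company 22"])
dupes2 = pd.Series(["company 333", "company 4444"])

for enable_cache in (False, True):
    sg = StringGrouper(master, enable_cache=enable_cache)  # Case 1 / Case 2: build the corpus
    sg.match_strings(None, duplicates=dupes1)              # 1st query: builds tf-idf matrices in both cases
    start = time.perf_counter()
    sg.match_strings(None, duplicates=dupes2)              # 2nd query: only Case 2 reuses the cached master matrix
    print(f"enable_cache={enable_cache}: 2nd query took {time.perf_counter() - start:.3f}s")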

@asnorthrup

Good explanation. I had it in my mind that building master was the 'first query', but that was just a quick assumption on my part, not a case of a 'confused end user', more just something I could have read into better.

Anyway, I've updated the tests. They show exactly what you intended: a dramatic speed-up. Even when the original tf-idf matrix took 3+ seconds to build with 700k+ names, the subsequent queries were dramatically faster.

With my new understanding and these test results, I think this is worth considering for inclusion, since the performance increase would be significant in some cases.

Let's say I wanted to find duplicates, then maybe 'stem' the duplicates, then apply some other transformation; the speed increase against 'master' would be significant in the 'stem' and 'transform' queries: 5-6 seconds vs 0.1 seconds for the larger customer lists.

Test_Cache_results.txt
test_cache code.txt
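
A hedged sketch of the multi-pass workflow described above (find duplicates, then re-query with a 'stemmed' or otherwise transformed duplicates series against the same master), again using the API from the PR description; the data and the placeholder "stemming" step are only illustrative:

import pandas as pd
from string_grouper import StringGrouper

master = pd.Series(["Acme Holdings Ltd", "Acme Holding Limited", "Foo Bar Inc"])
names = pd.Series(["acme holdings ltd", "foo bars inc"])

sg = StringGrouper(master, enable_cache=True)         # corpus built once, caching enabled

pass1 = sg.match_strings(None, duplicates=names)      # first pass: raw names (fills the cache)
stemmed = names.str.replace(r"s\b", "", regex=True)   # placeholder "stem": drop trailing 's'
pass2 = sg.match_strings(None, duplicates=stemmed)    # second pass: reuses the cached master matrix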

@ParticularMiner
Contributor Author

@asnorthrup

That's exciting! Thank you very much for taking the trouble to benchmark this. Hopefully these changes will soon appear in a new version.

@Bergvca
Owner

Bergvca commented Jun 6, 2022

Hi all! Thanks again for your work @ParticularMiner! I will try to add it to a new version soon once I have some time.
