
Fix speed problem with top_k>1 on CPU in edit tree lemmatizer #12017

Merged · 10 commits · Jan 20, 2023

Conversation

@richardpaulhudson (Contributor) commented Dec 22, 2022

Description

This PR is one of a series of three (#11583; #11959; #12017) that increase the accuracy of the edit-tree lemmatizer: the cumulative accuracy improvement for each language is shown in the final section below. The mean morphologizer accuracy increased from 95.7% to 96.2% (+0.5%); the mean lemmatizer accuracy increased from 94.3% to 96.0% (+1.7%). The figures are particularly encouraging for the weaker models: while the pre-existing code yields models with accuracies in the mid-80% range for a few languages, with the cumulative changes all languages achieve at least 91%. Unlike the other two PRs in the series, this PR does not introduce new functionality, but rather improves speed to make it feasible to use existing functionality.

The edit-tree lemmatizer has a hyperparameter top_k that specifies the number of alternative predicted trees to consider for each token: if the first predicted tree is not applicable to the raw token text, the second predicted tree is considered and so on up to the value of top_k.
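The fallback behaviour can be sketched as follows (a minimal sketch with hypothetical helper names, not spaCy's actual internals; `trees.apply` is an assumed interface that returns `None` when a tree cannot be applied to the form):

```python
import numpy as np

def lemmatize_token(form: str, tree_scores: np.ndarray, trees, top_k: int) -> str:
    """Try the top_k highest-scoring edit trees in order and return the first
    lemma that can actually be derived from the raw token text."""
    ranked = np.argsort(-tree_scores)[:top_k]  # best candidates first
    for tree_id in ranked:
        lemma = trees.apply(int(tree_id), form)  # assumed: None if not applicable
        if lemma is not None:
            return lemma
    return form  # no applicable tree among the top_k candidates
```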

Although accuracies are normally higher if alternative predictions can be considered, the current standard published models all specify top_k = 1. This is because the pre-existing code is several times slower when executed on a CPU for any higher value of top_k owing to an expensive NumPy sort that is used to order the predictions for each token (the sort is avoided in the pre-existing code if top_k == 1).

This PR retains the existing approach if top_k==1, but introduces a procedural approach if top_k>1. The procedural approach avoids the NumPy sort and also has the important advantage that time is only spent processing alternative predictions if earlier predictions were not applicable. Because with a well-trained model the first predictions are typically applicable to most tokens, this largely removes the performance hit where top_k > 1 and should allow us to choose values for standard models in the future based purely on accuracy requirements.
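As a rough illustration of the difference (simplified sketch of the idea, not the code in this PR): the old path sorts every token's full score vector up front, while the procedural path lazily picks the next-best remaining candidate, so trees beyond the first applicable one are never even looked at.

```python
import numpy as np

def candidates_sorted(scores: np.ndarray, top_k: int) -> np.ndarray:
    # Old-style: one full sort over all trees per token, even when the
    # top-ranked tree is already applicable.
    return np.argsort(-scores)[:top_k]

def candidates_procedural(scores: np.ndarray, top_k: int):
    # New-style for small top_k: yield the next-best tree id on demand.
    # The caller stops iterating as soon as an applicable tree is found,
    # so later candidates cost nothing.
    scores = scores.astype(float)  # copy, so masking below is safe
    for _ in range(top_k):
        best = int(np.argmax(scores))
        yield best
        scores[best] = -np.inf  # mask the candidate we have already tried
```

With a well-trained model the first yielded candidate is applicable for most tokens, which is why the per-token cost stays close to the top_k == 1 case.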

When using the edit-tree lemmatizer for its original intended purpose, it is unlikely that values of top_k above about 10 would ever be useful. However, it is conceivable that the code might be used for some other purpose where much higher values of top_k are required. For values of top_k above about 20, the pre-existing approach is more efficient than the new one. To take the various scenarios into account, the new code checks the value of top_k and selects one of three strategies depending on whether it is 1, in the range 1 < top_k <= 20, or greater than 20.
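A sketch of how such a per-batch dispatch might look (function names are illustrative, not the identifiers introduced by this PR; the two helpers from the previous sketch are reused):

```python
import numpy as np

def candidates_argmax(scores: np.ndarray, top_k: int):
    # top_k == 1: a single argmax, as in the pre-existing fast path.
    yield int(np.argmax(scores))

def select_candidate_strategy(top_k: int):
    """Choose the candidate-enumeration strategy once per batch of Docs
    rather than re-checking top_k for every individual document."""
    if top_k == 1:
        return candidates_argmax
    elif top_k <= 20:
        return candidates_procedural  # lazy next-best scan, no sort
    else:
        return candidates_sorted      # full sort bounds the quadratic case
```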

The following speed figures were measured when training pl_core_news_lg. The relevant improvement is for 1 < top_k <= 20, especially on CPU, although selecting the optimal strategy once for each batch of documents, rather than testing whether top_k is greater than 1 each time an individual document is processed, also has a small but consistent positive impact on speed in the other scenarios:

| top_k | Speed CPU old | Speed CPU new | Speed GPU old | Speed GPU new |
|-------|---------------|---------------|---------------|---------------|
| 1     | 37040         | 37483         | 184859        | 192005        |
| 5     | 9585          | 36050         | 113997        | 123473        |
| 25    | 9271          | 9569          | 111335        | 112726        |

The cumulative impact of improvements to the edit-tree lemmatizer

Changes

Approaches that were investigated and abandoned

  • Modelling regular morphological alternations: many languages have morphological alternations that are applied to a number of different letters or groups of letters in similar grammatical situations. Examples include replacing single letters with double letters or vice versa in Dutch and replacing voiceless consonants with voiced consonants in Croatian. The hope was that supplying the edit tree lemmatizer with the details of any regular alternations in a given language would allow it to apply them as "abstract substitutions" and lead to more parsimonious tree structures and more efficient and accurate learning. However, this did not work as expected.
  • Assessing predictions using n-gram fitness: the n-gram frequencies observed in the training set can be used to determine the acceptability of a lemma prediction (a minimal sketch of such a fitness score follows this list). The hope was that, in conjunction with setting top_k = 5, this information could be used to filter out incorrect predictions; however, the positive effect on precision was accompanied by a more or less equal negative effect on recall.
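One way such a character n-gram fitness score could be computed (a minimal sketch under the assumption of simple fixed-length character n-grams; not necessarily the exact metric tried in these experiments):

```python
from collections import Counter

def build_ngram_counts(lemmas, n=3):
    """Character n-gram frequencies observed over training-set lemmas."""
    counts = Counter()
    for lemma in lemmas:
        padded = f"^{lemma}$"  # mark word boundaries
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts

def ngram_fitness(candidate, counts, n=3):
    """Fraction of the candidate's character n-grams seen in training;
    low values flag implausible lemma predictions."""
    padded = f"^{candidate}$"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in counts) / len(grams)
```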

Accuracy

The mean morphologizer accuracy increased from 95.7% to 96.2% (+0.5%); the mean lemmatizer accuracy increased from 94.3% to 96.0% (+1.7%):

| Model | Morph. acc. old (%) | Morph. acc. new (%) | Morph. acc. diff (%) | Lemm. acc. old (%) | Lemm. acc. new (%) | Lemm. acc. diff (%) |
|---|---|---|---|---|---|---|
| ca_core_news_lg | 98.2 | 98.2 | 0.0 | 98.7 | 99.1 | 0.4 |
| da_core_news_lg | 95.3 | 96.0 | 0.7 | 95.2 | 96.5 | 1.3 |
| de_core_news_lg | 92.2 | 92.6 | 0.4 | 97.9 | 98.6 | 0.7 |
| el_core_news_lg | 91.0 | 92.0 | 1.0 | 89.9 | 92.9 | 3.0 |
| es_core_news_lg | 98.2 | 98.3 | 0.1 | 98.4 | 99.1 | 0.7 |
| fi_core_news_lg | 92.2 | 93.6 | 1.4 | 86.4 | 91.0 | 4.6 |
| fr_core_news_lg | 96.8 | 97.1 | 0.3 | 94.9 | 96.9 | 2.0 |
| hr_core_news_lg | 92.8 | 92.6 | -0.2 | 92.9 | 94.7 | 1.8 |
| it_core_news_lg | 97.4 | 97.8 | 0.4 | 97.6 | 98.2 | 0.6 |
| ko_core_news_lg | - | - | - | 90.2 | 91.8 | 1.6 |
| lt_core_news_lg | 89.0 | 90.9 | 1.9 | 85.8 | 91.2 | 5.4 |
| nb_core_news_lg | 96.3 | 96.8 | 0.5 | 97.2 | 98.0 | 0.8 |
| nl_core_news_lg | 96.4 | 96.7 | 0.3 | 95.7 | 96.5 | 0.8 |
| pl_core_news_lg | 90.8 | 91.2 | 0.4 | 94.3 | 96.4 | 2.1 |
| pt_core_news_lg | 95.8 | 96.1 | 0.3 | 97.3 | 97.8 | 0.5 |
| ro_core_news_lg | 95.1 | 97.1 | 2.0 | 95.8 | 97.0 | 1.2 |
| sv_core_news_lg | 95.7 | 96.2 | 0.5 | 95.4 | 97.2 | 1.8 |

There is one transformer-based model, de_dep_news_trf, that uses the edit-tree lemmatizer. Three of the five changes listed above are relevant to transformer models; applying them increased the accuracy from 98.7% to 98.9% (+0.2%).

Speed

With the CNN models, lemmatizer and morphologizer inference with pl_core_news_lg was measured as being 12.2% slower on CPU and 42.4% slower on GPU with the five changes listed above than without them.

With de_dep_news_trf, which was only run on GPU, the speed penalty was 7.3%.

Types of change

Speed enhancement

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@richardpaulhudson (Contributor, Author) commented:

@explosion-bot please test_gpu

@explosion-bot (Collaborator) commented Dec 22, 2022

🪁 Successfully triggered build on Buildkite

URL: https://buildkite.com/explosion-ai/spacy-gpu-test-suite/builds/120

@svlandeg added labels: enhancement, feat / lemmatizer, perf / speed (Dec 22, 2022)
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
@danieldk (Contributor) commented:

> As well as being slightly faster than the pre-existing code in its own right (i.e. where top_k = 1)

I don't completely understand this. For top_k=1, the new and old code do the same? (A single argmax)

Getting rid of the sort makes sense for small values of k. Though, it does lead to the degenerate O(n^2) case when someone says 'let's try all edit trees'. So, I am very interested in the empirical results. Maybe it's worth, in the end, using this approach for small k's and the sort for large k's.

@richardpaulhudson (Contributor, Author) commented Dec 22, 2022

> Though, it does lead to the degenerate O(n^2) case when someone says 'let's try all edit trees'.

An important point is that even if you say "let's try all edit trees", it only iterates through them until it comes across a tree that can be applied to the raw token text. And in any normal model a good proportion of trees are e.g. 'do nothing' or 'add an a to the end', which can be applied to any text. This will be shown in the results I'm collecting: increasing the value of top_k beyond about 5 has no significant effect, and increasing it beyond about 10 has no effect at all, because it's so unlikely that there is no applicable tree in the first handful tried.

@richardpaulhudson richardpaulhudson changed the title Fix speed problem with top_k>1 in edit tree lemmatizer Fix speed problem with top_k>1 on CPU in edit tree lemmatizer Dec 22, 2022
@danieldk (Contributor) commented:

> Though, it does lead to the degenerate O(n^2) case when someone says 'let's try all edit trees'.
>
> An important point is that even if you say "let's try all edit trees", it only iterates through them until it comes across a tree that can be applied to the raw token text. And in any normal model a good proportion of trees are e.g. 'do nothing' or 'add an a to the end', which can be applied to any text.

I think it’s still good to have bounds on the complexity if it is easy to do so. These things don’t happen until they do. (E.g. we had a degenerate case in parser feature extraction that wasn’t noticed until a user used it in a way that triggered quadratic complexity.)

It’s only a simple if statement: use the sort when k is higher than a certain value. Two lines of code for avoiding quadratic complexity seems like an easy trade-off.

@richardpaulhudson (Contributor, Author) commented Dec 23, 2022

Here are the figures for pl_core_news_lg, testing the "new" procedural solution against the "old" pre-existing code:

| top_k | Accuracy | Speed CPU old | Speed CPU new | Speed GPU old | Speed GPU new |
|-------|----------|---------------|---------------|---------------|---------------|
| 1     | 94.80    | 36939         | 36249         | 186701        | 122782        |
| 2     | 95.01    | 9431          | 36435         | 114735        | 124300        |
| 3     | 95.08    | 9623          | 36285         | 115295        | 121976        |
| 5     | 95.11    | 9657          | 36085         | 113974        | 124915        |
| 10    | 95.12    | 9635          | 36095         | 113873        | 121860        |
| 20    | 95.12    | 9605          | 36160         | 110383        | 123045        |

To check the patterns were reproducible for a language with very different morphology, I performed a couple of the experiments with de_core_news_lg:

| top_k | Accuracy | Speed CPU old | Speed CPU new | Speed GPU old | Speed GPU new |
|-------|----------|---------------|---------------|---------------|---------------|
| 1     | 98.14    | 43174         | 41707         | 198944        | 146672        |
| 20    | 98.17    | 16896         | 41985         | 136063        | 144506        |

This shows it makes sense to retain the pre-existing solution for top_k = 1 and to use the new code for 1 < top_k <= 20. As @danieldk and I discussed in personal communication, we should also retain the pre-existing sort-based solution for top_k > 20: in normal use of the edit-tree lemmatizer there will always be a tree in the first handful tried that is applicable to the form, so the procedural solution never iterates through more than a few predictions; however, the edit-tree lemmatizer could conceivably be used to predict something other than lemma_ where this does not hold, and performance for high values of top_k would then degrade heavily with the new code.

@richardpaulhudson richardpaulhudson marked this pull request as ready for review January 10, 2023 10:57
@richardpaulhudson richardpaulhudson marked this pull request as draft January 12, 2023 18:55
@richardpaulhudson richardpaulhudson marked this pull request as ready for review January 12, 2023 18:56
@danieldk (Contributor) left a review comment:


Looks good to me. I also did some experiments this morning, and it works very nicely.

One small naming nitpick.

Note: merge after 3.5.0 is tagged.

richardpaulhudson and others added 2 commits January 17, 2023 15:11
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
@danieldk danieldk merged commit f9e020d into explosion:master Jan 20, 2023