
Fix speed problem with top_k>1 on CPU in edit tree lemmatizer #12017

Merged · 10 commits · Jan 20, 2023

Conversation

@richardpaulhudson (Contributor) commented Dec 22, 2022

Description

This PR is one of a series of three (#11583; #11959; #12017) that increase the accuracy of the edit-tree lemmatizer: the cumulative accuracy improvement for each language is shown in the final section below. The mean morphologizer accuracy increased from 95.7% to 96.2% (+0.5%); the mean lemmatizer accuracy increased from 94.3% to 96.0% (+1.7%). The figures are particularly encouraging for the weaker models: while the pre-existing code yields models with accuracies in the mid-80% range for a few languages, with the cumulative changes all languages achieve at least 91%. Unlike the other two PRs in the series, this PR does not introduce new functionality, but rather improves speed to make it feasible to use existing functionality.

The edit-tree lemmatizer has a hyperparameter top_k that specifies the number of alternative predicted trees to consider for each token: if the first predicted tree is not applicable to the raw token text, the second predicted tree is considered and so on up to the value of top_k.
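The fallback behaviour can be sketched as follows (a minimal sketch with hypothetical helper names, not spaCy's actual internals; `trees.apply` is an assumed interface that returns `None` when a tree cannot be applied to the form):

```python
import numpy as np

def lemmatize_token(form: str, tree_scores: np.ndarray, trees, top_k: int) -> str:
    """Try the top_k highest-scoring edit trees in order and return the first
    lemma that can actually be derived from the raw token text."""
    ranked = np.argsort(-tree_scores)[:top_k]  # best candidates first
    for tree_id in ranked:
        lemma = trees.apply(int(tree_id), form)  # assumed: None if not applicable
        if lemma is not None:
            return lemma
    return form  # no applicable tree among the top_k candidates
```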

Although accuracies are normally higher if alternative predictions can be considered, the current standard published models all specify top_k = 1. This is because the pre-existing code is several times slower when executed on a CPU for any higher value of top_k owing to an expensive NumPy sort that is used to order the predictions for each token (the sort is avoided in the pre-existing code if top_k == 1).

This PR retains the existing approach if top_k==1, but introduces a procedural approach if top_k>1. The procedural approach avoids the NumPy sort and also has the important advantage that time is only spent processing alternative predictions if earlier predictions were not applicable. Because with a well-trained model the first predictions are typically applicable to most tokens, this largely removes the performance hit where top_k > 1 and should allow us to choose values for standard models in the future based purely on accuracy requirements.
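As a rough illustration of the difference (simplified sketch of the idea, not the code in this PR): the old path sorts every token's full score vector up front, while the procedural path lazily picks the next-best remaining candidate, so trees beyond the first applicable one are never even looked at.

```python
import numpy as np

def candidates_sorted(scores: np.ndarray, top_k: int) -> np.ndarray:
    # Old-style: one full sort over all trees per token, even when the
    # top-ranked tree is already applicable.
    return np.argsort(-scores)[:top_k]

def candidates_procedural(scores: np.ndarray, top_k: int):
    # New-style for small top_k: yield the next-best tree id on demand.
    # The caller stops iterating as soon as an applicable tree is found,
    # so later candidates cost nothing.
    scores = scores.astype(float)  # copy, so masking below is safe
    for _ in range(top_k):
        best = int(np.argmax(scores))
        yield best
        scores[best] = -np.inf  # mask the candidate we have already tried
```

With a well-trained model the first yielded candidate is applicable for most tokens, which is why the per-token cost stays close to the top_k == 1 case.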

When using the edit-tree lemmatizer for its original intended purpose, it is unlikely that values of top_k above about 10 would ever be useful. However, it is conceivable that the code might be used for some other purpose where much higher values of top_k are required. For values of top_k above about 20, the pre-existing approach is more efficient than the new one. To take the various scenarios into account, the new code checks the value of top_k and selects one of three strategies depending on whether it is 1, in the range 1 < top_k <= 20, or greater than 20.
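A sketch of how such a per-batch dispatch might look (function names are illustrative, not the identifiers introduced by this PR; the two helpers from the previous sketch are reused):

```python
import numpy as np

def candidates_argmax(scores: np.ndarray, top_k: int):
    # top_k == 1: a single argmax, as in the pre-existing fast path.
    yield int(np.argmax(scores))

def select_candidate_strategy(top_k: int):
    """Choose the candidate-enumeration strategy once per batch of Docs
    rather than re-checking top_k for every individual document."""
    if top_k == 1:
        return candidates_argmax
    elif top_k <= 20:
        return candidates_procedural  # lazy next-best scan, no sort
    else:
        return candidates_sorted      # full sort bounds the quadratic case
```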

The following speed figures were measured when training pl_core_news_lg. The relevant improvement is for 1 < top_k <= 20, especially on CPU, although selecting the optimal strategy once for each batch of documents, rather than testing whether top_k is greater than 1 each time an individual document is processed, also has a small but consistent positive impact on speed in the other scenarios:

| top_k | Speed CPU old | Speed CPU new | Speed GPU old | Speed GPU new |
|-------|---------------|---------------|---------------|---------------|
| 1     | 37040         | 37483         | 184859        | 192005        |
| 5     | 9585          | 36050         | 113997        | 123473        |
| 25    | 9271          | 9569          | 111335        | 112726        |

The cumulative impact of improvements to the edit-tree lemmatizer

Changes

Approaches that were investigated and abandoned

  • Modelling regular morphological alternations: many languages have morphological alternations that are applied to a number of different letters or groups of letters in similar grammatical situations. Examples include replacing single letters with double letters or vice versa in Dutch and replacing voiceless consonants with voiced consonants in Croatian. The hope was that supplying the edit tree lemmatizer with the details of any regular alternations in a given language would allow it to apply them as "abstract substitutions" and lead to more parsimonious tree structures and more efficient and accurate learning. However, this did not work as expected.
  • Assessing predictions using n-gram fitness: the n-gram frequencies observed in the training set can be used to determine the acceptability of a lemma prediction (a minimal sketch of such a fitness score follows this list). The hope was that, in conjunction with setting top_k = 5, this information could be used to filter out incorrect predictions; however, the positive effect on precision was accompanied by a more or less equal negative effect on recall.
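One way such a character n-gram fitness score could be computed (a minimal sketch under the assumption of simple fixed-length character n-grams; not necessarily the exact metric tried in these experiments):

```python
from collections import Counter

def build_ngram_counts(lemmas, n=3):
    """Character n-gram frequencies observed over training-set lemmas."""
    counts = Counter()
    for lemma in lemmas:
        padded = f"^{lemma}$"  # mark word boundaries
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts

def ngram_fitness(candidate, counts, n=3):
    """Fraction of the candidate's character n-grams seen in training;
    low values flag implausible lemma predictions."""
    padded = f"^{candidate}$"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in counts) / len(grams)
```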

Accuracy

The mean morphologizer accuracy increased from 95.7% to 96.2% (+0.5%); the mean lemmatizer accuracy increased from 94.3% to 96.0% (+1.7%):

| Model | Morph. acc. old (%) | Morph. acc. new (%) | Morph. acc. diff (%) | Lemm. acc. old (%) | Lemm. acc. new (%) | Lemm. acc. diff (%) |
|---|---|---|---|---|---|---|
| ca_core_news_lg | 98.2 | 98.2 | 0.0 | 98.7 | 99.1 | 0.4 |
| da_core_news_lg | 95.3 | 96.0 | 0.7 | 95.2 | 96.5 | 1.3 |
| de_core_news_lg | 92.2 | 92.6 | 0.4 | 97.9 | 98.6 | 0.7 |
| el_core_news_lg | 91.0 | 92.0 | 1.0 | 89.9 | 92.9 | 3.0 |
| es_core_news_lg | 98.2 | 98.3 | 0.1 | 98.4 | 99.1 | 0.7 |
| fi_core_news_lg | 92.2 | 93.6 | 1.4 | 86.4 | 91.0 | 4.6 |
| fr_core_news_lg | 96.8 | 97.1 | 0.3 | 94.9 | 96.9 | 2.0 |
| hr_core_news_lg | 92.8 | 92.6 | -0.2 | 92.9 | 94.7 | 1.8 |
| it_core_news_lg | 97.4 | 97.8 | 0.4 | 97.6 | 98.2 | 0.6 |
| ko_core_news_lg | - | - | - | 90.2 | 91.8 | 1.6 |
| lt_core_news_lg | 89.0 | 90.9 | 1.9 | 85.8 | 91.2 | 5.4 |
| nb_core_news_lg | 96.3 | 96.8 | 0.5 | 97.2 | 98.0 | 0.8 |
| nl_core_news_lg | 96.4 | 96.7 | 0.3 | 95.7 | 96.5 | 0.8 |
| pl_core_news_lg | 90.8 | 91.2 | 0.4 | 94.3 | 96.4 | 2.1 |
| pt_core_news_lg | 95.8 | 96.1 | 0.3 | 97.3 | 97.8 | 0.5 |
| ro_core_news_lg | 95.1 | 97.1 | 2.0 | 95.8 | 97.0 | 1.2 |
| sv_core_news_lg | 95.7 | 96.2 | 0.5 | 95.4 | 97.2 | 1.8 |

There is one transformer-based model, de_dep_news_trf, that uses the edit-tree lemmatizer. Three of the five changes listed above are relevant to transformer models; applying them increased the accuracy from 98.7% to 98.9% (+0.2%).

Speed

With the CNN models, lemmatizer and morphologizer inference with pl_core_news_lg was measured as being 12.2% slower on CPU and 42.4% slower on GPU with the five changes listed above than without them.

With de_dep_news_trf, which was only run on GPU, the speed penalty was 7.3%.

Types of change

Speed enhancement

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@richardpaulhudson (Contributor, Author) commented:

@explosion-bot please test_gpu

@explosion-bot (Collaborator) commented Dec 22, 2022

🪁 Successfully triggered build on Buildkite

URL: https://buildkite.com/explosion-ai/spacy-gpu-test-suite/builds/120

@svlandeg added labels: enhancement, feat / lemmatizer, perf / speed (Dec 22, 2022)
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
@danieldk (Contributor) commented:

> As well as being slightly faster than the pre-existing code in its own right (i.e. where top_k = 1)

I don't completely understand this. For top_k=1, the new and old code do the same? (A single argmax)

Getting rid of the sort makes sense for small values of k. Though, it does lead to the degenerate O(n^2) case when someone says 'let's try all edit trees'. So, I am very interested in the empirical results. Maybe it's worth, in the end, using this approach for small k's and the sort for large k's.

@richardpaulhudson (Contributor, Author) commented Dec 22, 2022

> Though, it does lead to the degenerate O(n^2) case when someone says 'let's try all edit trees'.

An important point is that even if you say "let's try all edit trees", it only iterates through them until it comes across a tree that can be applied to the raw token text. And in any normal model a good proportion of trees are e.g. 'do nothing' or 'add an a to the end', which can be applied to any text. This will be shown in the results I'm collecting: increasing the value of top_k beyond about 5 has no significant effect, and increasing it beyond about 10 has no effect at all, because it's so unlikely that there is no applicable tree in the first handful tried.

@richardpaulhudson richardpaulhudson changed the title Fix speed problem with top_k>1 in edit tree lemmatizer Fix speed problem with top_k>1 on CPU in edit tree lemmatizer Dec 22, 2022
@danieldk (Contributor) commented:

> Though, it does lead to the degenerate O(n^2) case when someone says 'let's try all edit trees'.
>
> An important point is that even if you say "let's try all edit trees", it only iterates through them until it comes across a tree that can be applied to the raw token text. And in any normal model a good proportion of trees are e.g. 'do nothing' or 'add an a to the end', which can be applied to any text.

I think it’s still good to have bounds on the complexity if it is easy to do so. These things don’t happen until they do. (E.g. we had a degenerate case in parser feature extraction that wasn’t noticed until a user used it in a way that triggered quadratic complexity.)

It’s only a simple if statement: use the sort when k is higher than a certain value. Two lines of code for avoiding quadratic complexity seems like an easy trade-off.

@richardpaulhudson (Contributor, Author) commented Dec 23, 2022

Here are the figures for pl_core_news_lg, testing the "new" procedural solution against the "old" pre-existing code:

| top_k | Accuracy | Speed CPU old | Speed CPU new | Speed GPU old | Speed GPU new |
|-------|----------|---------------|---------------|---------------|---------------|
| 1     | 94.80    | 36939         | 36249         | 186701        | 122782        |
| 2     | 95.01    | 9431          | 36435         | 114735        | 124300        |
| 3     | 95.08    | 9623          | 36285         | 115295        | 121976        |
| 5     | 95.11    | 9657          | 36085         | 113974        | 124915        |
| 10    | 95.12    | 9635          | 36095         | 113873        | 121860        |
| 20    | 95.12    | 9605          | 36160         | 110383        | 123045        |

To check the patterns were reproducible for a language with very different morphology, I performed a couple of the experiments with de_core_news_lg:

| top_k | Accuracy | Speed CPU old | Speed CPU new | Speed GPU old | Speed GPU new |
|-------|----------|---------------|---------------|---------------|---------------|
| 1     | 98.14    | 43174         | 41707         | 198944        | 146672        |
| 20    | 98.17    | 16896         | 41985         | 136063        | 144506        |

This shows it makes sense to retain the pre-existing solution for top_k = 1 and to use the new code for 1 < top_k <= 20. As @danieldk and I discussed in personal communication, we should also retain the pre-existing sort-based solution for top_k > 20: in normal use of the edit-tree lemmatizer there will always be a tree in the first handful tried that is applicable to the form, so the procedural solution never iterates through more than a few predictions; however, the edit-tree lemmatizer could conceivably be used to predict something other than lemma_ where this does not hold, and performance for high values of top_k would then degrade heavily with the new code.

@richardpaulhudson richardpaulhudson marked this pull request as ready for review January 10, 2023 10:57
@richardpaulhudson richardpaulhudson marked this pull request as draft January 12, 2023 18:55
@richardpaulhudson richardpaulhudson marked this pull request as ready for review January 12, 2023 18:56
@danieldk (Contributor) left a review comment:


Looks good to me. I also did some experiments this morning, and it works very nicely.

One small naming nitpick.

Note: merge after 3.5.0 is tagged.

richardpaulhudson and others added 2 commits January 17, 2023 15:11
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
@danieldk danieldk merged commit f9e020d into explosion:master Jan 20, 2023