perf: speed up g2p processing by caching the results #464
PR Goal?
g2p processing is pretty slow: g2p-ing text is inherently expensive, and the library was not implemented to be super fast.
However, while processing a corpus, the number of types should be much smaller than the number of tokens (at least for languages that are not heavily agglutinative or morphologically rich) and therefore we can speed up processing by caching the results on a token by token basis.
For English, the results are great: the 13k sentences in the LJ corpus were processed at a rate of about 700 items per second before, and about 5k items per second with this PR.
For agglutinative and morphologically rich languages, benefits will be smaller, but it's really only for large and very large corpora that this optimization matters. And sometimes, we'll just have to be patient anyway.
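The token-level caching idea above can be sketched as follows. This is a minimal illustration only: `slow_g2p`, `g2p_cached`, and `process_text` are hypothetical names, and the stand-in converter is a placeholder for the real (expensive) g2p call, not the project's actual API.

```python
from functools import lru_cache

def slow_g2p(token: str) -> str:
    # Stand-in for an expensive grapheme-to-phoneme conversion.
    return token.upper()

@lru_cache(maxsize=None)
def g2p_cached(token: str) -> str:
    # Memoize per token (i.e., per type): each distinct spelling
    # is converted at most once for the whole corpus.
    return slow_g2p(token)

def process_text(text: str) -> list[str]:
    # Repeated tokens hit the cache instead of re-running g2p,
    # so cost scales with the number of types, not tokens.
    return [g2p_cached(tok) for tok in text.split()]
```

Because the type/token ratio is low for languages like English, most lookups hit the cache, which is where the ~7x speedup comes from.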
Fixes?
Fixes: #446
Feedback sought?
I tested on English only. Please test on data from some other languages to make sure I didn't break anything. The filelist output of the new-project wizard should be identical before and after, with only the speed of the "Processing your characters:" step changing.
I didn't explicitly test generating phonological features, which depends on the same code, so if someone could do that it would be great.
Priority?
normal
Tests added?
Was already well covered by unit testing, for which I am very grateful.
How to test?
Run the wizard and pick "characters" as the representation.
Also, trigger the use of calculate_phonological_features(), which I don't know how to do off the top of my head.
Confidence?
Medium-high. My first implementation had a bug and tripped three different unit test cases, but once that was fixed, the test-filelist.psv output from LJ through the wizard was bit-for-bit identical before and after, so overall confidence is high.
Version change?
no
Related PRs?
no