perf: speed up g2p processing by caching the results #464
PR Goal?
g2p processing is pretty slow: g2p-ing text is inherently expensive, and the library was not implemented to be super fast.
However, while processing a corpus, the number of types should be much smaller than the number of tokens (at least for languages that are not heavily agglutinative or morphologically rich) and therefore we can speed up processing by caching the results on a token by token basis.
For English, the results are great: the 13k sentences in the LJ corpus were processed at a rate of about 700 items per second before, and about 5k items per second with this PR.
For agglutinative and morphologically rich languages, benefits will be smaller, but it's really only for large and very large corpora that this optimization matters. And sometimes, we'll just have to be patient anyway.
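The token-level caching idea above can be sketched as follows. This is a minimal illustration only: `slow_g2p`, `g2p_cached`, and `process_text` are hypothetical names, and the stand-in converter is a placeholder for the real (expensive) g2p call, not the project's actual API.

```python
from functools import lru_cache

def slow_g2p(token: str) -> str:
    # Stand-in for an expensive grapheme-to-phoneme conversion.
    return token.upper()

@lru_cache(maxsize=None)
def g2p_cached(token: str) -> str:
    # Memoize per token (i.e., per type): each distinct spelling
    # is converted at most once for the whole corpus.
    return slow_g2p(token)

def process_text(text: str) -> list[str]:
    # Repeated tokens hit the cache instead of re-running g2p,
    # so cost scales with the number of types, not tokens.
    return [g2p_cached(tok) for tok in text.split()]
```

Because the type/token ratio is low for languages like English, most lookups hit the cache, which is where the ~7x speedup comes from.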
Fixes?
Fixes: #446
Feedback sought?
I tested on English only. Please test on data from some other languages to make sure I didn't break anything. The filelist output of the new-project wizard should be identical before and after, with only the speed of the "Processing your characters:" step changing.
I didn't explicitly test generating phonological features, which depends on the same code, so if someone could do that it would be great.
Priority?
normal
Tests added?
Was already well covered by unit testing, for which I am very grateful.
How to test?
Run the wizard and pick "characters" as the representation.
Also, trigger the use of calculate_phonological_features(), which I don't know how to do off the top of my head.
Confidence?
Medium-high. My first implementation had a bug and tripped three different unit test cases, but once that was fixed, the test-filelist.psv output from LJ through the wizard was bit-for-bit identical before and after, so overall confidence is high.
Version change?
no
Related PRs?
no