Improve handling of inputs #348

roedoejet · 2024-03-20T22:46:12Z

Fixes #216
Fixes: #361

This PR implements the following:

We require the user to indicate whether a filelist represents characters, phones, or arpabet. The text column header is no longer used by the preprocessor. The Wizard converts the text column in a filelist to either characters, phones, or arpabet.
We always handle arpabet filelists by converting the data into phones.
The wizard guesses the symbol set by applying Unicode Grapheme Clustering rules instead of just iterating through each character in the filelist. This is done after the wizard applies the defined cleaners to the text.
Added a g2p module that specifies all supported g2p engines and ISO codes. TODO: Add documentation for adding a custom g2p engine: add documentation for adding a custom g2p engine #354
If the user provides a filelist representing characters and indicates that the language of their dataset is one that is supported by a supported g2p engine, then phones will be automatically calculated, and phone tokens will be guessed using IPA tokenization library ipatok.
I removed all the symbol questions in the wizard
Text tensors are not saved to disk anymore. Instead, they are saved to filelists where the tokenization can be inspected (like h/e/l/l/o h/o/w a/r/e y/o/u). The DataLoader then encodes them on-the-fly. Multi-hot phonological features are still saved to disk.
Models can specify whether they would like to train using characters, phones, or phonological features.
All punctuation is mapped to specific categories. TODO: allow different encoding of word-final punctuation: punctuation config should handle word-final punctuation #355
When synthesizing, you also now need to specify which representation of text you are synthesizing from (i.e. everyvoice synthesize from-text model.ckpt --text 'hello' --text-representation 'characters')

Minor changes:

The pad token is changed from _ to \x80 (Unicode Pad token)
whitespace is handled internally (not by the text config)
consolidated the code we use to load filelists. clarify in the wizard whether we are loading a list of lists or a list of dicts.
update features.py to use numpy everywhere instead of just adding lists of ints together.
refactored the TextProcessor to consolidate methods and added doctests for improved clarity

Related PRs:

Remaining issues:

To illustrate the major changes related to preprocessing, previously for LJ text we would provide a filelist with an ambiguous text column and receive the following:

basename	text	language	speaker
LJ001-0001	printing, in the only sense	eng	default

Whereas now, still providing same information, we get the following:

basename	characters	phones	language	speaker	character_tokens	phone_tokens
LJ001-0001	printing, in the only sense	pɹɪntɪŋ, ɪn ðʌ oʊnli sɛns	eng	default	p/r/i/n/t/i/n/g/<SB>/ /i/n/ /t/h/e/ /o/n/l/y/ /s/e/n/s/e/	p/ɹ/ɪ/n/t/ɪ/ŋ/<SB>/ /ɪ/n/ /ð/ʌ/ /o/ʊ/n/l/i/ /s/ɛ/n/s

codecov · 2024-03-26T19:53:08Z

Codecov Report

Attention: Patch coverage is 83.69352% with 83 lines in your changes are missing coverage. Please review.

Project coverage is 73.15%. Comparing base (0a99b5d) to head (887a633).
Report is 1 commits behind head on main.

Files	Patch %	Lines
everyvoice/text/text_processor.py	81.96%	15 Missing and 7 partials ⚠️
everyvoice/preprocessor/preprocessor.py	74.02%	11 Missing and 9 partials ⚠️
everyvoice/text/features.py	78.94%	6 Missing and 6 partials ⚠️
everyvoice/text/utils.py	50.00%	8 Missing ⚠️
everyvoice/wizard/dataset.py	86.66%	3 Missing and 5 partials ⚠️
everyvoice/utils/__init__.py	87.17%	3 Missing and 2 partials ⚠️
everyvoice/wizard/utils.py	90.38%	1 Missing and 4 partials ⚠️
everyvoice/model/e2e/dataset.py	40.00%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #348      +/-   ##
==========================================
+ Coverage   71.49%   73.15%   +1.66%     
==========================================
  Files          39       43       +4     
  Lines        2543     2786     +243     
  Branches      390      459      +69     
==========================================
+ Hits         1818     2038     +220     
- Misses        653      664      +11     
- Partials       72       84      +12

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

github-actions · 2024-03-26T20:32:23Z

CLI load time: 0:00.28
Pull Request HEAD: 887a63329946b53270506625a636ffe4bc0db8a9
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package
import time:       424 |     100437 |       everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.cli.synthesize
import time:       313 |     116449 |     everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.cli.cli
import time:       259 |     117300 |   everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.cli
import time:        21 |     117320 | everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.cli.preprocess

marctessier · 2024-03-27T15:19:34Z

I ran a few test with success but getting lots of warnings. ( I did not really look at the results , just ran the full pipeline all the way to synth with success using CRK data.

BUT Noticing this bug when you use a filelist that includes a headerline during the wizard.

EX: I give a file that included a "header line" like below:
basename|text|speaker|language
When I tell the wizard to use the info from that headerline, the TEST-filelist.pst produced is missing the |phones field . Afterwards when you run PREPROCESS , it will crash.

pwd
/home/tes001/u/TxT2SPEECH/MOH/FILELIST
(EveryVoice_pr348_dev.ap_inputs) [U20-GPSC7]:$ head -1 TEST-filelist.psv 
basename|characters|speaker|language

VS using the same file but ignoring the header lines, no crash on proprocess.

pwd
/home/tes001/u/TxT2SPEECH/MOH/FILELIST_manual
(EveryVoice_pr348_dev.ap_inputs) [U20-GPSC7]:$ head -1 TEST-filelist.psv 
basename|characters|speaker|language|phones

roedoejet · 2024-03-27T17:18:35Z

I ran a few test with success but getting lots of warnings. ( I did not really look at the results , just ran the full pipeline all the way to synth with success using CRK data.

BUT Noticing this bug when you use a filelist that includes a headerline during the wizard.

EX: I give a file that included a "header line" like below: basename|text|speaker|language When I tell the wizard to use the info from that headerline, the TEST-filelist.pst produced is missing the |phones field . Afterwards when you run PREPROCESS , it will crash.
pwd
/home/tes001/u/TxT2SPEECH/MOH/FILELIST
(EveryVoice_pr348_dev.ap_inputs) [U20-GPSC7]:$ head -1 TEST-filelist.psv 
basename|characters|speaker|language
VS using the same file but ignoring the header lines, no crash on proprocess.
pwd
/home/tes001/u/TxT2SPEECH/MOH/FILELIST_manual
(EveryVoice_pr348_dev.ap_inputs) [U20-GPSC7]:$ head -1 TEST-filelist.psv 
basename|characters|speaker|language|phones

I turned this into an issue: #363

MENGZHEGENG · 2024-04-03T20:08:17Z

Thanks Adian for your great effort!

It seems that we need to include grapheme in the requirements.txt.

joanise · 2024-04-03T20:44:53Z

It seems that we need to include grapheme in the requirements.txt.

Yeah, I had the same reaction when I started testing, but it's because I had not reinstalled. It's already declared, but you need to rerun pip install -e . after updating your sandbox.

joanise · 2024-04-03T21:04:13Z

Some disorganized feedback...

I love how the character set questions are gone, those were just annoying.
- But I wonder, what do with do now with languages like moh or west coast ones that might use a command or an apostrophe as a letter?
found a bug running preprocess, got a crash and lots of warnings - sent that log by e-mail

Some new mypy warnings that should be fixable:

everyvoice/utils/__init__.py:375: error: Incompatible types in assignment (expression has type "islice[str]", variable has type "TextIOWrapper")  [assignment]
everyvoice/text/text_processor.py:355: error: Argument "lang_id" to "apply_g2p_and_tokenization" of "TextProcessor" has incompatible type "str | None"; expected "str"  [arg-type]
everyvoice/preprocessor/preprocessor.py:687: error: Incompatible types in assignment (expression has type "list[int] | ndarray[Any, dtype[signedinteger[_64Bit]]]", variable has type "list[int] | None")  [assignment]
everyvoice/preprocessor/preprocessor.py:696: error: Incompatible types in assignment (expression has type "list[int] | ndarray[Any, dtype[signedinteger[_64Bit]]]", variable has type "list[int] | None")  [assignment]
everyvoice/preprocessor/preprocessor.py:707: error: Incompatible types in assignment (expression has type "list[int] | ndarray[Any, dtype[signedinteger[_64Bit]]]", variable has type "list[int] | None")  [assignment]
everyvoice/preprocessor/preprocessor.py:716: error: Incompatible types in assignment (expression has type "list[int] | ndarray[Any, dtype[signedinteger[_64Bit]]]", variable has type "list[int] | None")  [assignment]
everyvoice/model/feature_prediction/FastSpeech2_lightning/fs2/cli/check_data.py:35: error: Incompatible types in assignment (expression has type "list[dict[Any, Any]]", variable has type "Path")  [assignment]

roedoejet · 2024-04-03T21:50:55Z

Thanks for the feedback @joanise !

But I wonder, what do with do now with languages like moh or west coast ones that might use a command or an apostrophe as a letter?

Yes, I also worry about this. I think we either:

Prompt the user in the wizard about punctuation
Prompt the user in the wizard to review their text_config

Related, but orthogonal to that, perhaps we should remove comma ,, apostrophe ', and colon : from the default punctuation. Or maybe just prompt about those? I'm not really sure. Open to suggestions.

MENGZHEGENG · 2024-04-03T22:12:58Z

Now we may have a symbol defined twice in everyvoice-shared-text.yaml （once in lower_ascii and the other in cluster detected symbols):

... | WARNING | everyvoice.text.text_processor:__init__:96 - Symbol 'a' has already been declared at position 18 so we will use that index instead of the current index 45. Please remove duplicates from your configuration.

Not sure if this is a desired operation. Perhaps we can modify the lower_ascii set once automatically when we get the cluster detected symbols.

Again thanks Aidan for the great effort in making the input so much better!

roedoejet · 2024-04-03T23:27:16Z

Now we may have a symbol defined twice in everyvoice-shared-text.yaml （once in lower_ascii and the other in cluster detected symbols):

Thanks Tim! I just added 4a0dc56 and 3fdc561 which I think fixes this. It removes lower_ascii, removes all duplicates automatically, and sorts the symbols first based on length, and then based on standard Python sorting rules. I think this provides better stability and doesn't annoy the user with duplicate-defined symbol warnings because it doesn't really matter if the user defines duplicates, it just matters that there is only one place in the embedding table that the symbol is stored. Thanks for this feedback!

SamuelLarkin

Wow this is a big one, may be too big. It was hard to review.

everyvoice/config/shared_types.py

everyvoice/config/text_config.py

everyvoice/text/text_processor.py

everyvoice/tests/data/metadata.csv

everyvoice/tests/data/metadata_arpabet.csv

everyvoice/tests/test_preprocessing.py

everyvoice/tests/test_text.py

everyvoice/exceptions.py

SamuelLarkin

Humm starting yet another review.
I forgot a suggestion.

everyvoice/text/features.py

everyvoice/text/text_processor.py

fixes #361

fixes #370

the only thing that is required is that the filelist has the appropriate column headers

Co-authored-by: Samuel Larkin <samuel.larkin@nrc-cnrc.gc.ca>

@marctessier

fixes #370 according to @marctessier

refactor to process cleaners on filelist-lists before parsing into a list of dicts

roedoejet · 2024-04-11T00:55:33Z

Wow, what a monster! Lots of comments below. I would say we want to merge this sooner than later, as soon as it's stable enough to open the repo, so these comments can be turned into issues for future work whenever they're too much to do just now.

Thanks so much for the super helpful review! I think this is good to go now, modulo the bug that you reported when training. I'm happy to merge it now and sort that bug out later though, if all looks good otherwise?

joanise

Let's merge this now, and do further improvements as new PRs.

It's in good enough shape to merge, and we know this is where we want to work from.

roedoejet force-pushed the dev.ap/inputs branch 2 times, most recently from 2ce7fbc to 15e3226 Compare March 22, 2024 00:10

roedoejet changed the title ~~[DRAFT] Improve handling of inputs~~ Improve handling of inputs Mar 26, 2024

roedoejet force-pushed the dev.ap/inputs branch from 271778a to 7325149 Compare March 26, 2024 20:28

roedoejet requested review from joanise and SamuelLarkin March 26, 2024 20:50

roedoejet force-pushed the dev.ap/inputs branch from def5f34 to 469b515 Compare March 26, 2024 23:04

roedoejet requested review from MENGZHEGENG and marctessier March 26, 2024 23:05

roedoejet force-pushed the dev.ap/inputs branch from 469b515 to 3bff09b Compare March 26, 2024 23:11

roedoejet force-pushed the dev.ap/inputs branch 3 times, most recently from 03df507 to 4966efd Compare April 2, 2024 16:51

SamuelLarkin requested changes Apr 5, 2024

View reviewed changes

everyvoice/text/features.py Outdated Show resolved Hide resolved

roedoejet force-pushed the dev.ap/inputs branch 2 times, most recently from 1483d56 to 8178cfc Compare April 5, 2024 18:01

roedoejet mentioned this pull request Apr 5, 2024

Consider making text an alias of characters to support backwards compatibility #371

Closed

roedoejet force-pushed the dev.ap/inputs branch from a3ecf06 to e420858 Compare April 5, 2024 21:40

roedoejet commented Apr 9, 2024

View reviewed changes

everyvoice/text/text_processor.py Outdated Show resolved Hide resolved

joanise and others added 22 commits April 10, 2024 17:54

chore: update submodules for import refactoring

a0adcfc

feat: remove duplicates in symbols by default

9a3b6ae

fix: remove lowercase ascii from default symbols

5262b9f

fixes #361

fix: missing ascii characters in preprocessing test

14636e4

fix: only encode character or phone strings when they exist

3ecd69d

fix: pin typer to less than 0.12.0

3b09521

fixes #370

refactor: defining the text representation is unnecessary

76a869b

the only thing that is required is that the filelist has the appropriate column headers

refactor: change csv to psv

0840d63

refactor: add changes suggested by Sam

b4ede2a

Co-authored-by: Samuel Larkin <samuel.larkin@nrc-cnrc.gc.ca>

fix: pin typer to 0.9.0

b0836be

fixes #370 according to @marctessier

fix: properly process g2p for multi-lingual filelists

8eb96be

refactor to process cleaners on filelist-lists before parsing into a list of dicts

feat: remove punctuation from automatically guessed characters

a26453a

docs: alternate method

8ed5f21

feat: message the user about punctuation in character set

76df2f2

feat: simpler iteration over fields and skipping punctuation

19b8112

feat: helper functions

9446382

fix: doctests

1933283

fix(pfs): phonological features should apply punctuation transformation

c43da88

refactor: save pfs to pfs folder, not text

36fcaab

refactor: use dict mappings for symbol and id

6f4b078

refactor: set punctuation rule application everywhere

17821d0

refactor: apply fixes and refactors suggested by @joanise

f656bf2

roedoejet force-pushed the dev.ap/inputs branch from 8e69328 to f656bf2 Compare April 11, 2024 00:54

joanise approved these changes Apr 11, 2024

View reviewed changes

fix: filter text data based on target training representation

887a633

roedoejet requested a review from SamuelLarkin April 11, 2024 18:28

roedoejet merged commit 4bc5861 into main Apr 11, 2024
4 checks passed

roedoejet deleted the dev.ap/inputs branch April 11, 2024 18:35

roedoejet mentioned this pull request Apr 12, 2024

The interface of selecting symbols/punctuations sometimes stucks #338

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve handling of inputs #348

Improve handling of inputs #348

roedoejet commented Mar 20, 2024 •

edited

Loading

codecov bot commented Mar 26, 2024 •

edited

Loading

github-actions bot commented Mar 26, 2024 •

edited

Loading

marctessier commented Mar 27, 2024

roedoejet commented Mar 27, 2024

MENGZHEGENG commented Apr 3, 2024

joanise commented Apr 3, 2024

joanise commented Apr 3, 2024

roedoejet commented Apr 3, 2024

MENGZHEGENG commented Apr 3, 2024 •

edited

Loading

roedoejet commented Apr 3, 2024 •

edited

Loading

SamuelLarkin left a comment

SamuelLarkin left a comment

roedoejet commented Apr 11, 2024

joanise left a comment

Improve handling of inputs #348

Improve handling of inputs #348

Conversation

roedoejet commented Mar 20, 2024 • edited Loading

codecov bot commented Mar 26, 2024 • edited Loading

Codecov Report

github-actions bot commented Mar 26, 2024 • edited Loading

marctessier commented Mar 27, 2024

roedoejet commented Mar 27, 2024

MENGZHEGENG commented Apr 3, 2024

joanise commented Apr 3, 2024

joanise commented Apr 3, 2024

roedoejet commented Apr 3, 2024

MENGZHEGENG commented Apr 3, 2024 • edited Loading

roedoejet commented Apr 3, 2024 • edited Loading

SamuelLarkin left a comment

Choose a reason for hiding this comment

SamuelLarkin left a comment

Choose a reason for hiding this comment

roedoejet commented Apr 11, 2024

joanise left a comment

Choose a reason for hiding this comment

roedoejet commented Mar 20, 2024 •

edited

Loading

codecov bot commented Mar 26, 2024 •

edited

Loading

github-actions bot commented Mar 26, 2024 •

edited

Loading

MENGZHEGENG commented Apr 3, 2024 •

edited

Loading

roedoejet commented Apr 3, 2024 •

edited

Loading