
[WIP] Multilingual punctuation restoration, true casing, and sentence boundary detection #4637

Closed · wants to merge 36 commits

Conversation

@1-800-BAD-CODE (Contributor) commented Jul 31, 2022

What does this PR do?

This is a work in progress for a model and data set that perform multilingual punctuation restoration, true casing, and sentence boundary detection. See Usage below for a demo of what this PR does.

Some key features:

  • Resolves most of the issues mentioned in On punctuation and capitalization #3819
  • Emphasis on multilingual support; the data set handles the nuances of each language.
  • Language-agnostic inference. Inputs do not need language labels, and batches can contain multiple languages. Text is processed in its native script (e.g., Chinese is processed without spaces, and for Spanish we can predict inverted punctuation).
  • Users can train with plain-text files as input; all of the "art work" is done by the pre-processor at training time.

Collection: NLP

Changelog

No changes to existing NeMo code, only additions.

Usage

See the current example model on the HF hub.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
  • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

lgtm-com bot commented Jul 31, 2022

This pull request introduces 7 alerts when merging 5db367d into 2a387bb - view on LGTM.com

new alerts:

  • 7 for Unused import

@okuchaiev requested a review from ekmb August 1, 2022 16:46
@okuchaiev (Member) commented:

@1-800-BAD-CODE is this still draft, or ready for review?

@1-800-BAD-CODE (Contributor, Author) commented:

@okuchaiev It's probably worth waiting another week. I got rid of the hacky char tokenizer and cleaned some things up; I'll make sure it still works and check it all in this weekend.

@1-800-BAD-CODE (Contributor, Author) commented:

@okuchaiev It's probably in a reasonable place for a review. I presume my decision to use a character-level language model will be controversial, but it works. Some things aren't done yet, but those are the parts that would be pointless to finish if people disagree with the big ideas.

@1-800-BAD-CODE marked this pull request as ready for review August 15, 2022 22:48
@1-800-BAD-CODE (Contributor, Author) commented:

If the character-level language model is too constraining (I think it is), I have an alternative branch that uses an arbitrary subword tokenizer and LM but generates character-level predictions in the heads. That way, pre-training can be fully utilized, and the only tricks are in the heads (which predict N*num_classes logits for each subword, where N is the subword length).

It produces the same results, but trains and infers faster and has fewer constraints.

It has no problems with acronyms, even if they are lumped into one subword, which was a driving factor in the character-level decision:

Input 0: george w bush was the president of the us for 8 years he left office in january 2009 and was succeeded by barack obama prior to his presidency he was the governor of texas
Output:
    George W. Bush was the president of the U.S. for 8 years.
    He left office in January 2009, and was succeeded by Barack Obama.
    Prior to his presidency, he was the governor of Texas.

Input 1: then oj simpson attempted to flee in his white bronco it created a major spectacle but he was eventually apprehended
Output:
    Then, O.J. Simpson attempted to flee in his white bronco.
    It created a major spectacle, but he was eventually apprehended.
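
For concreteness, a minimal sketch of the kind of head described above: a plain subword encoder whose head emits one prediction per character position of each subword. All names and shapes here are assumptions for illustration, not the PR's actual implementation:

```python
import torch
import torch.nn as nn


class CharLevelSubwordHead(nn.Module):
    """Sketch: for each subword token, predict max_subword_len * num_classes
    logits, then reshape so every character slot gets its own distribution.
    Names/shapes are assumptions, not code from this PR."""

    def __init__(self, hidden_size: int, num_classes: int, max_subword_len: int):
        super().__init__()
        self.num_classes = num_classes
        self.max_subword_len = max_subword_len
        self.proj = nn.Linear(hidden_size, max_subword_len * num_classes)

    def forward(self, encoded: torch.Tensor) -> torch.Tensor:
        # encoded: [batch, seq_len, hidden_size] from any subword LM encoder
        logits = self.proj(encoded)  # [batch, seq_len, max_subword_len * num_classes]
        batch, seq_len, _ = logits.shape
        # One prediction per character slot; slots beyond a subword's true
        # length would presumably be masked out of the loss.
        return logits.view(batch, seq_len, self.max_subword_len, self.num_classes)
```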

@ekmb (Collaborator) commented Aug 27, 2022

@1-800-BAD-CODE thank you for working on this! I really like the pre-processing step setup that provides so much flexibility when adding new languages!

A few questions about the segmentation head:

  1. Why do you suggest using the segmentation head instead of the predicted EOS punctuation mark for paragraph segmentation?
  2. How does the capitalization task benefit from punctuated+segmented text and not directly punctuated text? How about introducing an argument in the infer method so that users can select whether to use punctuated or punctuated+segmented output as input to the capitalization head?

On Character-based LM:
Although the char-based model gets common acronyms right, like "U.S.", it might struggle with unique cases that are not present in the training data. As a result, a separate module would still be needed to correct those, e.g., an inverse text normalization lookup based on WFSTs, which is easy to implement and fast at inference time. If we exclude cases like "U.S.", where punctuation marks are inserted within the word, then the rest of the cases should be covered by "all lower", "all upper", "start with upper", "start with XxX", "start with XxxX", and maybe a few additional "start with" classes. And these should work with subword models.
You mentioned you have an alternative solution with a subword model that generates character-level predictions. Could you please point to this branch?
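
To illustrate the class-based casing scheme suggested above, here is a small hypothetical helper that buckets a cased word into such a label set (the label names are illustrative only, not from this PR or NeMo):

```python
def casing_class(word: str) -> str:
    """Assign an illustrative casing label to a single cased word.
    Mirrors the class set suggested above; not code from this PR."""
    letters = [c for c in word if c.isalpha()]
    if not letters:
        return "no_letters"
    if all(c.islower() for c in letters):
        return "all_lower"    # e.g. "texas" stays "texas" if predicted lower
    if all(c.isupper() for c in letters):
        return "all_upper"    # e.g. "nasa" -> "NASA"
    if letters[0].isupper() and all(c.islower() for c in letters[1:]):
        return "start_upper"  # e.g. "january" -> "January"
    return "mixed"            # e.g. "iPhone" -> needs finer "start with" classes


assert casing_class("NASA") == "all_upper"
assert casing_class("January") == "start_upper"
assert casing_class("iPhone") == "mixed"
```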

@1-800-BAD-CODE (Contributor, Author) commented:

@ekmb thanks for the feedback.

The token-based model that makes character-level predictions is in the branch pcs2. A better description can be found in this model card: https://huggingface.co/1-800-BAD-CODE/pcs_multilang_bert_base. I now think that's a better branch than this one.

> Why do you suggest using the segmentation head instead of the predicted EOS punctuation mark for paragraph segmentation?

I just didn't think of that alternative. It could reduce punctuation + segmentation to a single pass, but if true-casing requires a second pass (with encoded punctuated texts) then the current implementation doesn't add a penalty.

I believe that would be equivalent to running the punctuation and segmentation head in parallel, which could be an easy change if there is a reason to do so.

> How does the capitalization task benefit from punctuated+segmented text and not directly punctuated text? How about introducing an argument in the infer method so that users can select whether to use punctuated or punctuated+segmented output as input to the capitalization head?

The true-casing task benefits from sentence boundary information to more easily differentiate between breaking and non-breaking punctuation preceding a token.

But there is likely enough information in a punctuated text to true-case correctly. The true-case head is actually trained on concatenated sentences anyway, so I'll add an option to run inference in two passes instead of three.

@ekmb (Collaborator) commented Sep 28, 2022

Hi @1-800-BAD-CODE, are there any updates on this PR?

@1-800-BAD-CODE (Contributor, Author) commented:

> Hi @1-800-BAD-CODE, are there any updates on this PR?

I have:

  • Matured the branch that uses regular subwords, and moved on from the character-based LM constraints
  • Got rid of the "three pass" training scheme (running the encoder three times). Models can now be trained with one or two passes.
    • In one-pass mode, all analytics are predicted in parallel on raw, unpunctuated texts.
    • In two-pass mode, punctuation is added first, then sentence boundary detection and true-casing are run on punctuated text (to model conditional probabilities).
    • At inference time, any model can run in two- or three-pass mode to fully condition the probabilities, if desired. Models trained in one-pass mode can run inference in one-pass mode or higher.

I have a model that demonstrates the capabilities with a diverse set of 22 languages; I will try to clean up the code and put a model on the HF hub this weekend.
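
For clarity, a rough sketch of the one-pass and two-pass inference flows described above. The method names (`encode`, `punct_head`, `apply_predictions`, etc.) are assumptions for illustration, not the PR's actual API:

```python
def infer_one_pass(model, texts):
    """One-pass mode (sketch): punctuation, sentence boundaries, and
    true-casing are all predicted in parallel from the raw, unpunctuated input."""
    encoded = model.encode(texts)
    punct = model.punct_head(encoded)
    seg = model.seg_head(encoded)
    case = model.case_head(encoded)
    return model.apply_predictions(texts, punct=punct, seg=seg, case=case)


def infer_two_pass(model, texts):
    """Two-pass mode (sketch): punctuate first, then re-encode the punctuated
    text so segmentation and true-casing condition on predicted punctuation."""
    encoded = model.encode(texts)
    punct = model.punct_head(encoded)
    punctuated = model.apply_predictions(texts, punct=punct)

    re_encoded = model.encode(punctuated)
    seg = model.seg_head(re_encoded)
    case = model.case_head(re_encoded)
    return model.apply_predictions(punctuated, seg=seg, case=case)
```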

lgtm-com bot commented Oct 12, 2022

This pull request introduces 3 alerts when merging 57bc4b9 into c259ae1 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 1 for Unused local variable

lgtm-com bot commented Oct 23, 2022

This pull request introduces 1 alert when merging bdcfcce into 2574ddf - view on LGTM.com

new alerts:

  • 1 for Unused local variable

@1-800-BAD-CODE (Contributor, Author) commented:

@ekmb This is probably as far as I should take it on my own.

Recent updates focus primarily on single-pass training and inference, as well as reducing the amount of code. There is a decent 22-language, single-pass model on the HF hub with some description of how all this works.

If people disagree with the fundamental ideas, now is a good time to say so. Otherwise, the next steps would be to clean it up a little more.

@Kipok added ASR and removed ASR labels Nov 18, 2022
github-actions bot commented Dec 3, 2022

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

github-actions bot added the stale label Dec 3, 2022
@1-800-BAD-CODE (Contributor, Author) commented:

I'm ok with letting this one die. The code turned out more complicated than I prefer.
