
[WIP] Multilingual punctuation restoration, true casing, and sentence boundary detection #4637

Closed · wants to merge 36 commits

Conversation

@1-800-BAD-CODE (Contributor) commented Jul 31, 2022

What does this PR do?

This is a work in progress for a model and data set that perform multilingual punctuation restoration, true casing, and sentence boundary detection. See Usage below for a demo of what this PR does.

Some key features:

  • Resolves most of the issues mentioned in On punctuation and capitalization #3819
  • Emphasis on multilingual support; the data set handles the nuances of each language.
  • Language-agnostic inference. Inputs do not need language labels, and batches can contain multiple languages. Text is processed in its native script (e.g., Chinese is processed without spaces, and for Spanish we can predict inverted punctuation).
  • Users can train with plain-text files as input; all of the "art work" is done by the pre-processor at training time.

Collection: NLP

Changelog

No changes to existing NeMo code, only additions.

Usage

See the current example model on the HF hub.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
  • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

lgtm-com bot commented Jul 31, 2022

This pull request introduces 7 alerts when merging 5db367d into 2a387bb - view on LGTM.com

new alerts:

  • 7 for Unused import

@okuchaiev requested a review from ekmb August 1, 2022 16:46
@okuchaiev (Member) commented:

@1-800-BAD-CODE is this still draft, or ready for review?

@1-800-BAD-CODE (Contributor, Author) commented:

@okuchaiev It's probably worth waiting another week. I got rid of the hacky char tokenizer and cleaned some things up; I'll make sure it still works and check it all in this weekend.

@1-800-BAD-CODE (Contributor, Author) commented:

@okuchaiev It's probably in a reasonable place for a review. I presume my decision to use a character-level language model will be controversial, but it works. Some things aren't done yet, but those are the parts that would be pointless to finish if people disagree with the big ideas.

@1-800-BAD-CODE marked this pull request as ready for review August 15, 2022 22:48
@1-800-BAD-CODE (Contributor, Author) commented:

If the character-level language model is too constraining (I think it is), I have an alternative branch that uses an arbitrary subword tokenizer and LM but generates character-level predictions in the heads. That way, pre-training can be fully utilized, and the only tricks are in the heads (which predict N*num_classes logits for each subword, where N is the subword length).

It produces the same results, but trains and infers faster and has fewer constraints.

It has no problems with acronyms, even if they are lumped into one subword, which was a driving factor in the character-level decision:

Input 0: george w bush was the president of the us for 8 years he left office in january 2009 and was succeeded by barack obama prior to his presidency he was the governor of texas
Output:
    George W. Bush was the president of the U.S. for 8 years.
    He left office in January 2009, and was succeeded by Barack Obama.
    Prior to his presidency, he was the governor of Texas.

Input 1: then oj simpson attempted to flee in his white bronco it created a major spectacle but he was eventually apprehended
Output:
    Then, O.J. Simpson attempted to flee in his white bronco.
    It created a major spectacle, but he was eventually apprehended.
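
For concreteness, a minimal sketch of the kind of head described above: a plain subword encoder whose head emits one prediction per character position of each subword. All names and shapes here are assumptions for illustration, not the PR's actual implementation:

```python
import torch
import torch.nn as nn


class CharLevelSubwordHead(nn.Module):
    """Sketch: for each subword token, predict max_subword_len * num_classes
    logits, then reshape so every character slot gets its own distribution.
    Names/shapes are assumptions, not code from this PR."""

    def __init__(self, hidden_size: int, num_classes: int, max_subword_len: int):
        super().__init__()
        self.num_classes = num_classes
        self.max_subword_len = max_subword_len
        self.proj = nn.Linear(hidden_size, max_subword_len * num_classes)

    def forward(self, encoded: torch.Tensor) -> torch.Tensor:
        # encoded: [batch, seq_len, hidden_size] from any subword LM encoder
        logits = self.proj(encoded)  # [batch, seq_len, max_subword_len * num_classes]
        batch, seq_len, _ = logits.shape
        # One prediction per character slot; slots beyond a subword's true
        # length would presumably be masked out of the loss.
        return logits.view(batch, seq_len, self.max_subword_len, self.num_classes)
```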

@ekmb (Collaborator) commented Aug 27, 2022

@1-800-BAD-CODE thank you for working on this! I really like the pre-processing step setup that provides so much flexibility when adding new languages!

A few questions about the segmentation head:

  1. Why do you suggest using the segmentation head instead of the predicted EOS punctuation mark for paragraph segmentation?
  2. How does the capitalization task benefit from punctuated+segmented text and not directly punctuated text? How about introducing an argument in the infer method so that users can select whether to use punctuated or punctuated+segmented output as input to the capitalization head?

On Character-based LM:
Although the char-based model gets common acronyms right, like "U.S.", it might struggle with unique cases that are not present in the training data. As a result, a separate module would still be needed to correct those, e.g., an inverse text normalization lookup based on WFSTs, which is easy to implement and fast at inference time. If we exclude cases like "U.S.", where punctuation marks are inserted within the word, then the rest of the cases should be covered by "all lower", "all upper", "start with upper", "start with XxX", "start with XxxX", and maybe a few additional "start with" classes. And these should work with subword models.
You mentioned you have an alternative solution with a subword model that generates character-level predictions. Could you please point to this branch?
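
To illustrate the class-based casing scheme suggested above, here is a small hypothetical helper that buckets a cased word into such a label set (the label names are illustrative only, not from this PR or NeMo):

```python
def casing_class(word: str) -> str:
    """Assign an illustrative casing label to a single cased word.
    Mirrors the class set suggested above; not code from this PR."""
    letters = [c for c in word if c.isalpha()]
    if not letters:
        return "no_letters"
    if all(c.islower() for c in letters):
        return "all_lower"    # e.g. "texas" stays "texas" if predicted lower
    if all(c.isupper() for c in letters):
        return "all_upper"    # e.g. "nasa" -> "NASA"
    if letters[0].isupper() and all(c.islower() for c in letters[1:]):
        return "start_upper"  # e.g. "january" -> "January"
    return "mixed"            # e.g. "iPhone" -> needs finer "start with" classes


assert casing_class("NASA") == "all_upper"
assert casing_class("January") == "start_upper"
assert casing_class("iPhone") == "mixed"
```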

@1-800-BAD-CODE (Contributor, Author) commented:

@ekmb thanks for the feedback.

The token-based model that makes character-level predictions is in the branch pcs2. A better description can be found in this model card: https://huggingface.co/1-800-BAD-CODE/pcs_multilang_bert_base. I now think that's a better branch than this one.

> Why do you suggest using the segmentation head instead of the predicted EOS punctuation mark for paragraph segmentation?

I just didn't think of that alternative. It could reduce punctuation + segmentation to a single pass, but if true-casing requires a second pass (with encoded punctuated texts) then the current implementation doesn't add a penalty.

I believe that would be equivalent to running the punctuation and segmentation head in parallel, which could be an easy change if there is a reason to do so.

> How does the capitalization task benefit from punctuated+segmented text and not directly punctuated text? How about introducing an argument in the infer method so that users can select whether to use punctuated or punctuated+segmented output as input to the capitalization head?

The true-casing task benefits from sentence boundary information to more easily differentiate between breaking and non-breaking punctuation preceding a token.

But there is likely enough information in a punctuated text to true-case correctly. The true-case head is actually trained on concatenated sentences anyway, so I'll add an option to run inference in two passes instead of three.

@ekmb (Collaborator) commented Sep 28, 2022

Hi @1-800-BAD-CODE, are there any updates on this PR?

@1-800-BAD-CODE (Contributor, Author) commented:

> Hi @1-800-BAD-CODE, are there any updates on this PR?

I have:

  • Matured the branch that uses regular subwords, and moved on from the character-based LM constraints
  • Got rid of the "three pass" training scheme (running the encoder three times). Models can now be trained with one or two passes.
    • In one-pass mode, all analytics are predicted in parallel on raw, unpunctuated texts.
    • In two-pass mode, punctuation is added first, then sentence boundary detection and true-casing are run on punctuated text (to model conditional probabilities).
    • At inference time, any model can run in two- or three-pass mode to fully condition the probabilities, if desired. Models trained in one-pass mode can run inference in one-pass mode or higher.

I have a model that demonstrates the capabilities with a diverse set of 22 languages; I will try to clean up the code and put a model on the HF hub this weekend.
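
For clarity, a rough sketch of the one-pass and two-pass inference flows described above. The method names (`encode`, `punct_head`, `apply_predictions`, etc.) are assumptions for illustration, not the PR's actual API:

```python
def infer_one_pass(model, texts):
    """One-pass mode (sketch): punctuation, sentence boundaries, and
    true-casing are all predicted in parallel from the raw, unpunctuated input."""
    encoded = model.encode(texts)
    punct = model.punct_head(encoded)
    seg = model.seg_head(encoded)
    case = model.case_head(encoded)
    return model.apply_predictions(texts, punct=punct, seg=seg, case=case)


def infer_two_pass(model, texts):
    """Two-pass mode (sketch): punctuate first, then re-encode the punctuated
    text so segmentation and true-casing condition on predicted punctuation."""
    encoded = model.encode(texts)
    punct = model.punct_head(encoded)
    punctuated = model.apply_predictions(texts, punct=punct)

    re_encoded = model.encode(punctuated)
    seg = model.seg_head(re_encoded)
    case = model.case_head(re_encoded)
    return model.apply_predictions(punctuated, seg=seg, case=case)
```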

lgtm-com bot commented Oct 12, 2022

This pull request introduces 3 alerts when merging 57bc4b9 into c259ae1 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 1 for Unused local variable

lgtm-com bot commented Oct 23, 2022

This pull request introduces 1 alert when merging bdcfcce into 2574ddf - view on LGTM.com

new alerts:

  • 1 for Unused local variable

@1-800-BAD-CODE (Contributor, Author) commented:

@ekmb This is probably as far as I should take it on my own.

Recent updates focus primarily on single-pass training and inference, as well as reducing the amount of code. There is a decent 22-language, single-pass model on the HF hub with some description of how all this works.

If people disagree with the fundamental ideas, now is a good time to say so. Otherwise, the next steps would be to clean it up a little more.

@Kipok added ASR and removed ASR labels Nov 18, 2022
github-actions bot commented Dec 3, 2022

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

github-actions bot added the stale label Dec 3, 2022
@1-800-BAD-CODE (Contributor, Author) commented:

I'm ok with letting this one die. The code turned out more complicated than I prefer.
