[WIP] Multilingual punctuation restoration, true casing, and sentence boundary detection #4637
Conversation
This pull request introduces 7 alerts when merging 5db367d into 2a387bb - view on LGTM.com.
@1-800-BAD-CODE is this still a draft, or ready for review?
@okuchaiev It's probably worth waiting another week. I got rid of the hacky char tokenizer and cleaned some things up; I'll make sure it still works and check it all in this weekend.
@okuchaiev It's probably in a reasonable place for a review. I presume my decision to use a character-level language model will be controversial, but it works. Some things aren't done yet, but those are the parts that would be pointless to finish if people disagree with the big ideas.
If the character-level language model is too constraining (I think it is), I have an alternative branch that uses arbitrary subword tokenization and LM, but generates character-level predictions in the heads. So pre-training can be fully utilized and the only tricks are in the heads (which predict N*num_classes for each subword, where N is the subword length). It produces the same results, but trains and infers faster and has fewer constraints. It has no problems with acronyms, even if they are lumped into one subword, which was a driving factor in the character-level decision:
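The subword-head idea described above could be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code: the class name, the choice of a fixed maximum subword length with padding, and all dimensions are assumptions. Each subword encoding is projected to N * num_classes logits, so the model keeps its subword tokenizer while the head still emits one prediction per character.

```python
import torch
import torch.nn as nn

class CharLevelHead(nn.Module):
    """Maps each subword encoding to per-character class logits.

    Hypothetical sketch: assumes a fixed maximum subword length with
    padding; positions past a subword's true length would be masked
    out of the loss during training.
    """

    def __init__(self, hidden_dim: int, num_classes: int, max_subword_len: int):
        super().__init__()
        self.num_classes = num_classes
        self.max_subword_len = max_subword_len
        # A single linear projection emits max_subword_len * num_classes
        # logits per subword, i.e. N * num_classes as described above.
        self.proj = nn.Linear(hidden_dim, max_subword_len * num_classes)

    def forward(self, encoded: torch.Tensor) -> torch.Tensor:
        # encoded: (batch, seq_len, hidden_dim) subword encodings from the LM
        batch, seq_len, _ = encoded.shape
        logits = self.proj(encoded)
        # Reshape so each character slot within a subword gets its own
        # class distribution: (batch, seq_len, max_subword_len, num_classes)
        return logits.view(batch, seq_len, self.max_subword_len, self.num_classes)

head = CharLevelHead(hidden_dim=256, num_classes=5, max_subword_len=8)
out = head(torch.randn(2, 16, 256))
print(out.shape)  # torch.Size([2, 16, 8, 5])
```

Because the projection is per-subword, an acronym lumped into one subword still gets an independent casing/punctuation decision for each of its characters.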
@1-800-BAD-CODE thank you for working on this! I really like the pre-processing step setup that provides so much flexibility when adding new languages! A few questions about the segmentation head:
On Character-based LM:
@ekmb thanks for the feedback. The token-based model that makes character-level predictions is in the branch
I just didn't think of that alternative. It could reduce punctuation + segmentation to a single pass, but if true-casing requires a second pass (with encoded punctuated texts) then the current implementation doesn't add a penalty. I believe that would be equivalent to running the punctuation and segmentation head in parallel, which could be an easy change if there is a reason to do so.
The true-casing task benefits from sentence boundary information to more easily differentiate between breaking and non-breaking punctuation preceding a token. But there is likely enough information in a punctuated text to true-case correctly. The true-case head is actually trained on concatenated sentences anyway, so I'll add an option to run inference in two passes instead of three.
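As a toy illustration of why per-character casing predictions handle acronyms even inside a single subword, applying a per-character uppercase probability to the raw text is enough to recase them. The helper below is hypothetical, not from this PR:

```python
def apply_true_case(text: str, upper_probs: list[float], threshold: float = 0.5) -> str:
    """Recase text using one uppercase probability per character.

    Hypothetical helper: upper_probs[i] is the model's predicted
    probability that character i should be uppercase.
    """
    return "".join(
        c.upper() if p > threshold else c.lower()
        for c, p in zip(text, upper_probs)
    )

# An acronym gets four independent per-character decisions,
# so it can be fully uppercased even if it is one subword.
print(apply_true_case("nato agreed.", [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]))
# NATO agreed.
```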
Hi @1-800-BAD-CODE, are there any updates on this PR?
I have a model that demonstrates the capabilities with a diverse set of 22 languages; I will try to clean up the code and put a model on the HF hub this weekend. |
for more information, see https://pre-commit.ci
This pull request introduces 3 alerts when merging 57bc4b9 into c259ae1 - view on LGTM.com.
This pull request introduces 1 alert when merging bdcfcce into 2574ddf - view on LGTM.com.
@ekmb This is probably as far as I should take it on my own. Recent updates focus primarily on single-pass training and inference, as well as reducing the amount of code. There is a decent 22-language, single-pass model on the HF hub with some description of how all this works. If people disagree with the fundamental ideas, now is a good time to do so. Otherwise, next steps would be to clean it up a little more.
This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.
I'm ok with letting this one die. The code turned out more complicated than I prefer. |
What does this PR do?
This is a work-in-progress for a model and data set that perform multilingual punctuation restoration, true casing, and sentence boundary detection. See "Usage" below for a demo of what this PR does. Some key features:
Collection: NLP
Changelog
No changes to existing NeMo code, only additions.
Usage
See current example model on the HF hub.
Before your PR is "Ready for review"
Pre checks:
PR Type:
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs in various areas.
Additional Information