Populate unused vocabulary entries of our mBERT-based models #42

Open
jowagner opened this issue Nov 27, 2020 · 1 comment
Labels
enhancement New feature or request idea Future work idea

Comments

@jowagner
Collaborator

Issue #33 points out that the mBERT vocabulary contains 99 unused entries, intended for users to add task-specific vocabulary entries when fine-tuning. We could use some of these entries to improve the vocabulary's coverage of Irish without having to train from scratch. However, so as not to get in the way of users of our models who want to use unused entries for their own tasks, we should not use all 99 entries.
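A minimal sketch of the replacement step, assuming the vocabulary is available as a list of strings (one per line of `vocab.txt`) and that the unused entries are named `[unused1]`, `[unused2]`, and so on. The function name `fill_unused_entries` and the `keep_free` parameter are hypothetical, chosen only for illustration:

```python
import re

# Assumption: unused slots look like "[unused1]" ... "[unused99]".
UNUSED_PATTERN = re.compile(r"^\[unused\d+\]$")


def fill_unused_entries(vocab, new_tokens, keep_free=50):
    """Return a copy of `vocab` with `new_tokens` written into unused slots.

    At least `keep_free` unused slots are left untouched so that users of
    the model can still claim them for their own tasks.
    """
    existing = set(vocab)
    # Indices of all unused slots, in vocabulary order.
    slots = [i for i, tok in enumerate(vocab) if UNUSED_PATTERN.match(tok)]
    # Only the first len(slots) - keep_free slots may be overwritten.
    usable = slots[:max(0, len(slots) - keep_free)]

    # Drop tokens that are already in the vocabulary, and duplicates.
    candidates = []
    for tok in new_tokens:
        if tok not in existing and tok not in candidates:
            candidates.append(tok)

    if len(candidates) > len(usable):
        raise ValueError(
            f"{len(candidates)} new tokens but only {len(usable)} usable slots"
        )

    updated = list(vocab)
    for slot, tok in zip(usable, candidates):
        updated[slot] = tok  # the index stays the same, only the string changes
    return updated
```

Because each new token takes over the index of an unused slot, the vocabulary length stays the same and so does the shape of the model's embedding matrix. The embeddings at those indices would still need to be learned during further training.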

One way to choose the entries to add would be to induce new vocabularies from a clean Irish corpus, reducing the target vocabulary size until the number of new entries, i.e. entries that are not in the mBERT vocabulary, is less than or equal to the number of entries we want to add, say 49.
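The search described above could be sketched as follows. This is only an illustration under stated assumptions: `induce_vocab` is a hypothetical callable that trains a vocabulary of the requested size on the Irish corpus and returns its tokens (for example a thin wrapper around a WordPiece trainer), and the start size, step and minimum size are placeholder values, not settings from this project:

```python
def choose_vocab_size(induce_vocab, existing_vocab, max_new=49,
                      start_size=30000, step=1000, min_size=1000):
    """Shrink the induced vocabulary until at most `max_new` entries are new.

    `induce_vocab(size)` is a hypothetical callable returning the tokens of
    a vocabulary of (up to) `size` entries induced from the Irish corpus.
    Returns the chosen size and the list of entries missing from
    `existing_vocab`.
    """
    existing = set(existing_vocab)
    size = start_size
    while size >= min_size:
        candidate = induce_vocab(size)
        # New entries are those not already in the mBERT vocabulary.
        new_entries = [tok for tok in candidate if tok not in existing]
        if len(new_entries) <= max_new:
            return size, new_entries
        size -= step
    raise ValueError("no vocabulary size in the searched range fits the budget")


# Toy usage: a fixed ranked list stands in for real vocabulary induction.
ranked = ["a", "x", "b", "y", "z", "w"]
size, new_entries = choose_vocab_size(
    lambda n: ranked[:n], existing_vocab=["a", "b"],
    max_new=2, start_size=6, step=1, min_size=1,
)
```

The loop simply walks down from the largest size rather than using a binary search, since the number of new entries is not guaranteed to shrink monotonically with the vocabulary size.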

@jowagner jowagner changed the title Populate unused vocabulary entries of mBERT model Populate unused vocabulary entries of our mBERT-based models Nov 27, 2020
@jowagner jowagner added enhancement New feature or request idea Future work idea labels Nov 27, 2020
@jowagner
Collaborator Author

Shared the idea publicly in a comment on huggingface/tokenizers#627.
