Populate unused vocabulary entries of our mBERT-based models #42

jowagner · 2020-11-27T11:06:13Z

Issue #33 points out that the mBERT vocabulary contains 99 unused entries, intended for users who want to add task-specific vocabulary for fine-tuning. We could use some of these entries to improve the vocabulary's coverage of Irish without having to train from scratch. However, so as not to get in the way of users of our models who want the unused entries for their own tasks, we should not use all 99.
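A minimal sketch of what populating the entries could look like, assuming the placeholders appear in the vocabulary list as `[unused1]`, `[unused2]`, and so on. The function name `populate_unused` and the parameter `keep_free` are hypothetical and only serve this illustration:

```python
import re

# Assumption: placeholder slots are spelled "[unused<N>]" in the vocabulary.
UNUSED_PATTERN = re.compile(r"^\[unused\d+\]$")


def populate_unused(vocab, new_tokens, keep_free):
    """Return a copy of `vocab` in which placeholder slots are overwritten
    with `new_tokens`, leaving at least `keep_free` placeholders untouched
    for downstream users."""
    slots = [i for i, tok in enumerate(vocab) if UNUSED_PATTERN.match(tok)]
    usable = slots[: max(0, len(slots) - keep_free)]
    if len(new_tokens) > len(usable):
        raise ValueError("more new tokens than placeholder slots we may use")
    existing = set(vocab)
    result = list(vocab)
    for idx, tok in zip(usable, new_tokens):
        if tok in existing:
            raise ValueError(f"token already in vocabulary: {tok}")
        result[idx] = tok  # the index is reused, so the vocabulary size is unchanged
    return result


# Toy vocabulary, not the real mBERT file.
toy_vocab = ["[PAD]", "[unused1]", "[unused2]", "[unused3]", "[UNK]", "the"]
updated = populate_unused(toy_vocab, ["agus"], keep_free=2)
```

Because each new token takes over an existing index, the embedding matrix keeps its shape; the embedding stored at that index would still have to be learned during further training.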

One way to choose the entries to add would be to induce new vocabularies from a clean Irish corpus, reducing the vocabulary size until the number of new entries, i.e. entries that are not already in the mBERT vocabulary, is less than or equal to the number of entries we want to add, say 49.
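The selection loop could be sketched as follows. This is only an illustration: `induce_vocab` is a hypothetical stand-in that ranks whitespace-separated words by frequency, whereas a real run would train a subword vocabulary on the corpus at each size. The names `select_new_entries`, `budget` and `start_size` are likewise assumptions.

```python
from collections import Counter


def induce_vocab(corpus_lines, size):
    """Stand-in vocabulary inducer (assumption): the `size` most frequent
    whitespace-separated words. Replace with real subword training."""
    counts = Counter(word for line in corpus_lines for word in line.split())
    # Sort by descending frequency, then alphabetically for determinism.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:size]


def select_new_entries(corpus_lines, base_vocab, budget, start_size):
    """Shrink the induced vocabulary until at most `budget` of its entries
    are missing from `base_vocab`; return the size used and those entries."""
    base = set(base_vocab)
    for size in range(start_size, 0, -1):
        candidate = induce_vocab(corpus_lines, size)
        new = [tok for tok in candidate if tok not in base]
        if len(new) <= budget:
            return size, new
    return 0, []


# Toy data; a real run would use the Irish corpus, the mBERT vocabulary
# and a budget such as 49.
corpus = ["an madra agus an cat", "an cat beag", "an madra"]
size, new_entries = select_new_entries(corpus, ["an"], budget=2, start_size=5)
```

Decreasing the size one step at a time is fine for a toy inducer; with real vocabulary training a coarser step would probably be needed, since every size requires a separate training run.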

jowagner · 2021-02-24T10:43:54Z

Shared the idea publicly in a comment on huggingface/tokenizers#627.

jowagner changed the title ~~Populate unused vocabulary entries of mBERT model~~ Populate unused vocabulary entries of our mBERT-based models Nov 27, 2020

jowagner added the enhancement (New feature or request) and idea (Future work idea) labels Nov 27, 2020
