Merge subcorpus-specific wordpiece vocabularies #33
Good idea. The relevant script is https://github.com/jbrry/wiki-bert-pipeline/blob/858d323e1fa3a63368441d68309d5afb9389d3fe/external_scripts/gather_external_data.py#L17. I haven't read the above paper yet, but I wonder how easy it is to merge vocabularies. Is it as simple as just merging the three vocab files?
A way to find out would be to remove all other intermediate output files generated when building the vocabulary and see whether BERT still trains as usual. If it does, that means it is only using the vocab file. The example suggests the regular entries do not need to be in any particular order, but I'd guess that the first 4 special entries must be at the start. This could be done manually with a short script, e.g. along the lines of the sketch below.
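For illustration, a minimal sketch of such a manual merge. It assumes hypothetical per-subcorpus vocab files with one wordpiece per line, and that the first `NUM_RESERVED` lines of the base file are the special/reserved entries that must keep their positions at the top; the file names and the reserved count are assumptions, not values from the pipeline.

```python
# Hypothetical sketch: merge per-subcorpus wordpiece vocab files into one.
# Assumes one wordpiece per line and that the first NUM_RESERVED lines of the
# base vocab are the reserved entries ([PAD], [UNK], [CLS], [SEP], [MASK],
# [unusedN], ...) that must stay at the start and keep their positions.

NUM_RESERVED = 104  # assumption; adjust to the actual number of reserved lines


def read_vocab(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]


def merge_vocabs(base_path, other_paths, out_path):
    base = read_vocab(base_path)
    reserved, regular = base[:NUM_RESERVED], base[NUM_RESERVED:]
    merged = reserved + regular
    seen = set(merged)
    for path in other_paths:
        # Skip the other files' reserved block and append only unseen wordpieces.
        for token in read_vocab(path)[NUM_RESERVED:]:
            if token not in seen:
                seen.add(token)
                merged.append(token)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(merged) + "\n")


# Hypothetical file names for the three subcorpus vocabularies.
merge_vocabs("vocab-english.txt", ["vocab-irish.txt", "vocab-other.txt"], "vocab-merged.txt")
```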
The BERT vocabulary (for bert-base-uncased) is laid out as follows:
The vocabulary has 30,522 tokens: the first 999 lines are reserved entries (e.g. [unused993]), including the special tokens at lines 1 and 101-105. As I understand it, the vocabulary is used as a dictionary mapping wordpiece strings to integer IDs.
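To make the dictionary point concrete, a small sketch of how such a vocab file can be read into a wordpiece-to-ID lookup (one token per line, the line number becomes the ID); the file name is just a placeholder:

```python
from collections import OrderedDict


def load_vocab(path):
    """Read a one-wordpiece-per-line vocab file into an ordered token -> ID dict."""
    vocab = OrderedDict()
    with open(path, encoding="utf-8") as f:
        for index, line in enumerate(f):
            vocab[line.rstrip("\n")] = index
    return vocab


vocab = load_vocab("vocab.txt")        # assumed path
print(vocab["[CLS]"], vocab["[SEP]"])  # e.g. 101 102 for bert-base-uncased
```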
Thanks Alan. Also FYI Joachim, the example I posted skips lines 1-101 in a vocab file, which as Alan pointed out are [PAD] - [unused99] (though the example Alan uses ranges from 0:999). I think the vocab file for bert-base-uncased must be different from the multilingual BERT one, as mBERT only keeps 99 places for unseen tokens; Footnote 5 in Chau et al. (2020) mentions that these unused entries are there so that new tokens can be added during fine-tuning.
The vocab file I am using for mBERT also keeps only 99 unused places, and it is in the same format of just one wordpiece token per line.
Perhaps they changed how they write vocab files between bert-base-uncased and multilingual-bert. In any case, I imagine the word keys are hard-coded to a token ID value in the model itself, even if that's not how they write it for multilingual-bert. So in one model 'apple' may be mapped to ID 102, while in a model for another language, e.g. French, 'rouge' may be mapped to ID 102, which would mean the word-to-ID lookup dictionary has to be changed to accommodate the different key-value pairs.
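One quick way to see that the same ID points to different wordpieces in different models is to compare two published tokenizers directly; a sketch using the transformers library (the IDs chosen here are arbitrary):

```python
from transformers import BertTokenizer

# Load two independently trained vocabularies.
en = BertTokenizer.from_pretrained("bert-base-uncased")
multi = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

# The same integer ID generally points to different wordpieces in each model,
# so the word-to-ID dictionary is model-specific and cannot simply be swapped.
for token_id in [2000, 5000, 10000]:
    print(token_id, en.convert_ids_to_tokens(token_id), multi.convert_ids_to_tokens(token_id))
```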
Thanks. Yes, I also think that when you work with an existing model you must not append entries to the vocab files or change the order of existing entries. Vocab files with such changes are only useful for training from scratch. As the footnote quoted above says, the unused entries are there to help people add some entries in fine-tuning.
Chung et al. (2020), "Improving Multilingual Models with Language-Clustered Vocabularies", suggest creating wordpiece vocabularies for clusters of related languages and then using the union of these vocabularies as the final vocabulary during BERT training and prediction. For us, training on Irish, English and possibly other languages, this could mean splitting the data into (1) clearly English-only text, (2) clearly Irish-only text and (3) all other text, training 3 vocabularies and merging them.
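A rough sketch of what that could look like with the HuggingFace tokenizers library, training one wordpiece vocabulary per cluster and taking their union; the file names, vocab size and the three-way split are assumptions for illustration only:

```python
from tokenizers import BertWordPieceTokenizer

SPECIAL = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def train_vocab(files, vocab_size=20000):
    """Train one wordpiece vocabulary on one cluster of the data."""
    tokenizer = BertWordPieceTokenizer(lowercase=False)
    tokenizer.train(files=files, vocab_size=vocab_size, special_tokens=SPECIAL)
    vocab = tokenizer.get_vocab()  # token -> id
    return [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]


# Hypothetical split of the corpus into three clusters.
clusters = {
    "english": ["clearly_english.txt"],
    "irish": ["clearly_irish.txt"],
    "other": ["all_other_text.txt"],
}

# Union of the three cluster vocabularies, special tokens first.
merged, seen = list(SPECIAL), set(SPECIAL)
for name, files in clusters.items():
    for token in train_vocab(files):
        if token not in seen:
            seen.add(token)
            merged.append(token)

with open("vocab-merged.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(merged) + "\n")
```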