a question about self.vocab #3
Comments
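I'd like to ask: were the words you trained on limited to your word2id.obj file? If I build my own word2id, can I still use your model? Mainly, I looked and there seem to be only 57777 words, which feels a bit small.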
Definitely! A different word2id would map the same word to a different id, so you should use my word2id.obj. BTW, 57777 words is not that small: we use the sentencepiece tokenizer, so OOV is not a problem.
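A minimal sketch of why the vocabulary has to match the trained model. The two vocabularies and the embedding values below are made up for illustration and are not taken from word2id.obj or from the repository's code; the point is only that an embedding table is indexed by id, so a different word-to-id mapping fetches rows that were learned for other words.

```python
# Toy illustration only: these vocabularies and vectors are hypothetical,
# not the contents of the repository's word2id.obj.
trained_word2id = {"<unk>": 0, "reading": 1, "model": 2}
custom_word2id = {"<unk>": 0, "model": 1, "reading": 2}

# One embedding row per id, as it would be learned with trained_word2id.
embeddings = [
    [0.0, 0.0],  # id 0: <unk>
    [1.0, 0.0],  # id 1: learned for "reading"
    [0.0, 1.0],  # id 2: learned for "model"
]


def embed(word, word2id):
    """Look up the embedding row for a word under the given vocabulary."""
    return embeddings[word2id.get(word, word2id["<unk>"])]


right = embed("reading", trained_word2id)  # row learned for "reading"
wrong = embed("reading", custom_word2id)   # row learned for "model"
print(right, wrong)
```

With the custom mapping, "reading" silently receives the vector that was trained for "model", which is why a pretrained checkpoint is only usable with the vocabulary it was trained on.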
But I need more than 450000 words, and 57777 is far too few against 450000. That is quite upsetting. So I think that, for most companies, BERT is still difficult to train, or even to fine-tune.
Oh, that's too bad. If you have your own vocab, this application may not be suitable for you. Nevertheless, you can use my code to train your own BERT.
Haha, but my company doesn't have that much hardware. My boss told me they will bring in cloud servers next year, but by then I will have finished my internship. Actually, I have built a machine reading comprehension system with QANet, which is transformer-based, as the model, so I would like to try using BERT to train on the DuReader dataset. However, it seems the cost is too high to train it.
Could you explain why you add self.vocab_size between the question ids and the answer ids?