
a question about self.vocab #3

Open
tomtang110 opened this issue Jan 17, 2019 · 6 comments

Comments

@tomtang110

[screenshot of the relevant code]
Could you explain why you add self.vocab_size between the question ids and the answer ids?

@benywon
Owner

benywon commented Jan 17, 2019

> Could you explain why you add self.vocab_size between the question ids and the answer ids?

The self.vocab_size is just a separator symbol used to mark the boundary between the question and the answer.
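A minimal sketch of the idea (not the repo's actual code; the function name and inputs are hypothetical): since real token ids occupy the range [0, vocab_size), the id vocab_size itself can never collide with a real token, so it is safe to use as a separator.

```python
def build_input(question_ids, answer_ids, vocab_size):
    """Concatenate question and answer ids with a separator id.

    The separator id is vocab_size, one past the last real token id,
    so it cannot be confused with any word in the vocabulary.
    """
    sep_id = vocab_size
    return question_ids + [sep_id] + answer_ids

# Example with a 57777-token vocabulary: the separator id is 57777.
print(build_input([12, 7, 903], [44, 5], 57777))
# -> [12, 7, 903, 57777, 44, 5]
```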

@tomtang110
Author

tomtang110 commented Jan 18, 2019

May I ask: is your trained vocabulary tied to your word2id.obj file? If I build my own word2id, can I still use your model? Mainly, I noticed there are only 57,777 words, which seems a bit few.

@benywon
Owner

benywon commented Jan 18, 2019

Definitely tied to it! A different word2id would map the same word to a different id, so you should use my word2id.obj. BTW, 57,777 words is not that small, since we use the SentencePiece subword tokenizer, so OOV is not a problem.
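To illustrate why a subword vocabulary avoids OOV: any word absent from the vocabulary can still be segmented into smaller known pieces. The sketch below uses a simple greedy longest-match segmentation; SentencePiece actually trains BPE or unigram-LM models, so this is only a toy illustration, and the vocabulary here is made up.

```python
def subword_tokenize(word, vocab):
    """Greedily split a word into the longest known subword pieces."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first, shrinking by one char.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                pieces.append(piece)
                i = j
                break
        else:
            # No piece matched; emit an unknown marker for this character.
            pieces.append("<unk>")
            i += 1
    return pieces

vocab = {"play", "ing", "er", "p", "l", "a", "y"}
print(subword_tokenize("playing", vocab))  # -> ['play', 'ing']
print(subword_tokenize("player", vocab))   # -> ['play', 'er']
```

Because even single characters can serve as fallback pieces, a modest subword vocabulary covers essentially any input text.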

@tomtang110
Author

But I need more than 450,000 words, and 57,777 out of 450,000 is too few. It is upsetting. So for most companies, I think BERT is still difficult to train, or even to fine-tune.

@benywon
Owner

benywon commented Jan 18, 2019

Oh, that's too bad. If you need your own vocab, this application may not be suitable for you. Nevertheless, you can use my code to train your own BERT.

@tomtang110
Author

Haha, but my company doesn't have such ample hardware. My boss told me they would introduce a cloud server next year, but by then I will have finished my internship. Actually, I have built a machine reading comprehension system using QANet, a Transformer-based model, so I would like to try training BERT on the DuReader dataset. However, the cost of training it seems too high.
