How to expand English vocabulary in llama tokenizer? #25

Open
ClinuxMDL opened this issue Dec 13, 2023 · 0 comments

Comments

@ClinuxMDL

Thanks for this helpful work. I have a few questions about the pre-training stage:

1. How can biology/medical terms be added to the existing llama vocabulary?
2. Would rebuilding the tokenizer's tokens from the current new corpus yield a better loss?
3. I tried tokenizing some biology domain terms with the current tokenizer, and they appear to be split into many scattered pieces; I wonder whether you have noticed this.
4. I found that the loss only drops a limited amount over the first epoch, and it takes multiple epochs for the loss to come down, which adds a lot of training time.
5. Roughly how low should the pre-training loss be before it is reasonable to move on to SFT?

Looking forward to your reply, thanks!
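For questions 1 and 3, a minimal sketch of why unseen domain terms fragment, and how adding them to the vocabulary helps. This uses a toy greedy longest-match tokenizer over a tiny hand-made vocab, not the actual llama SentencePiece model; with HuggingFace `transformers` the real workflow would be along the lines of `tokenizer.add_tokens([...])` followed by `model.resize_token_embeddings(len(tokenizer))` so the newly added embeddings can be trained.

```python
def tokenize(word, vocab):
    """Greedy longest-match segmentation, falling back to single characters."""
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # no piece matched: emit the raw character
            i += 1
    return pieces

# A small general-purpose vocab with no biomedical whole words.
base_vocab = {"gly", "co", "pro", "te", "in", "sis"}
word = "glycoprotein"

print(tokenize(word, base_vocab))
# -> ['gly', 'co', 'pro', 'te', 'in']  (the domain term is scattered)

# After expanding the vocabulary with the whole domain term:
extended_vocab = base_vocab | {"glycoprotein"}
print(tokenize(word, extended_vocab))
# -> ['glycoprotein']  (one token)
```

The vocab contents and the word are made-up illustrations; the point is only that a domain term absent from the vocabulary costs several tokens per occurrence, while one added as a whole piece costs a single token, which is why expansion can matter for domain corpora.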
