During pre-training, how can I preserve the newline characters in data such as books, and how are books split into blocks? #1891
Labels
solved
This problem has been already solved
Comments
The JSON format allows newline characters; escaping them (as `\n`) solves the problem.
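To illustrate the escaping mentioned above, here is a minimal sketch using Python's standard `json` module. The `"text"` key and the helper name `book_to_jsonl_line` are assumptions for illustration only; use whatever column name your dataset configuration expects.

```python
import json

def book_to_jsonl_line(text: str) -> str:
    """Serialize one document as a single-line JSON record (hypothetical helper)."""
    # json.dumps escapes real newlines as the two characters "\" and "n",
    # so the whole book fits on one physical line of a .jsonl file.
    # ensure_ascii=False keeps Chinese characters readable instead of \uXXXX.
    return json.dumps({"text": text}, ensure_ascii=False)

book = "第一章\n\n这是第一段。\n这是第二段。"
line = book_to_jsonl_line(book)

# Loading the record restores the original newlines exactly.
restored = json.loads(line)["text"]
```

Because the escaping is lossless, paragraph and chapter breaks survive the round trip without any manual preprocessing of the text.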
A quick question, please. Thanks in advance!
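Does streaming loading shuffle as well? My dataset for continued pre-training is too large to load all at once, and it seems to me that streaming cannot shuffle it, so the only option is to shuffle the dataset myself beforehand.
@Zombiessss Streaming mode shuffles part of the data.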
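One way to read "shuffles part of the data" is buffer-based shuffling, which is how streaming datasets are commonly shuffled (for example, `IterableDataset.shuffle` in Hugging Face `datasets` takes a `buffer_size`). The sketch below is a plain-Python illustration of that idea, not the project's actual code; the function name `buffered_shuffle` is hypothetical.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def buffered_shuffle(stream: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Approximately shuffle a stream using a fixed-size buffer (illustrative sketch)."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered item and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The stream is exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(100), buffer_size=10))
```

An item can only move ahead of its original position by less than `buffer_size` places, so the mixing is local. If the source files are ordered (for example, one book after another), shuffling them globally before streaming, as suggested in the question, still helps.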
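I would appreciate some guidance:
(1) My data for the pt (pre-training) stage consists of a large number of books and documents containing many newline characters, so I cannot structure it like wiki_demo.txt or as JSON. How should I handle this?
(2) How does the group texts procedure work? Are the blocks shuffled before pre-training?
(3) If there are multiple txt files, are they mixed and shuffled together after being split into blocks?
Thanks.
@hiyouga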
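For question (2), the thread does not spell out the procedure. As a rough sketch, the `group_texts` pattern commonly used in Hugging Face causal-LM example scripts concatenates the tokenized documents of a batch and cuts the result into fixed-size blocks, dropping the leftover tail. Whether this project does exactly the same is an assumption; the token IDs and `block_size` below are made up for illustration.

```python
from itertools import chain
from typing import Dict, List

def group_texts(examples: Dict[str, List[List[int]]], block_size: int) -> Dict[str, List[List[int]]]:
    """Concatenate tokenized documents and split them into equal blocks (sketch)."""
    # Join all documents in the batch into one long token sequence per key.
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    return {
        k: [seq[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, seq in concatenated.items()
    }

# Three "documents" with 3, 4, and 2 tokens (9 tokens in total).
batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9]]}
blocks = group_texts(batch, block_size=4)
```

In this pattern a block can span a document boundary, and the grouping step itself does not shuffle anything; any shuffling of the resulting blocks would be a separate step.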