About the source of the pre-training data #11
Comments
The CSL repository has been updated with the data used for pre-training; it comes from the same source as CSL.
…________________________________
From: Willard Sheen ***@***.***>
Sent: Tuesday, August 29, 2023 10:45:28 AM
To: ydli-ai/CSL ***@***.***>
Cc: Subscribed ***@***.***>
Subject: [ydli-ai/CSL] About the source of the pre-training data (Issue #11)
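The pre-training dataset appears to contain far more data than the published paper metadata dataset.
To deduplicate while training a model, I ran a quick check of the two datasets against each other, and they do not seem to overlap?
Could you briefly describe the source and contents of the pre-training data?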
* Pre-training dataset
  * csl.jsonl
  * 2310165 lines
* Paper metadata
  * csl_camera_readly.tsv
  * 396209 lines
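
For anyone who wants to repeat the overlap check described above, here is a minimal Python sketch. It is not the check that was actually run for this issue. The helper names (`normalize`, `fingerprint`, `read_jsonl_field`, `read_tsv_column`, `count_overlap`) are hypothetical, and the JSON key and TSV column index passed to the readers are assumptions, since the layout of `csl.jsonl` and `csl_camera_readly.tsv` is not described in this thread. The comparison is exact-match after light normalization, so it will miss near-duplicates.

```python
import csv
import hashlib
import json
import unicodedata


def normalize(text: str) -> str:
    """NFKC-normalize, lowercase, and drop all whitespace so that trivial
    formatting differences do not hide an otherwise identical text."""
    text = unicodedata.normalize("NFKC", text).lower()
    return "".join(text.split())


def fingerprint(text: str) -> str:
    """Fixed-size digest of the normalized text (keeps the lookup set small)."""
    return hashlib.md5(normalize(text).encode("utf-8")).hexdigest()


def read_jsonl_field(path, key):
    """Yield record[key] for each JSON object line; `key` is an assumption
    about the file layout and must be adapted to the real data."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if isinstance(record, dict) and key in record:
                yield str(record[key])


def read_tsv_column(path, column):
    """Yield one column (0-based index, also an assumption) from a TSV file."""
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) > column:
                yield row[column]


def count_overlap(reference_texts, candidate_texts):
    """Return (hits, total): how many candidate texts also occur in the
    reference texts after normalization."""
    seen = {fingerprint(t) for t in reference_texts}
    hits = total = 0
    for t in candidate_texts:
        total += 1
        if fingerprint(t) in seen:
            hits += 1
    return hits, total
```

A call could look like `count_overlap(read_tsv_column("csl_camera_readly.tsv", 1), read_jsonl_field("csl.jsonl", "abstract"))`, where the column index `1` and the key `"abstract"` are placeholders to be replaced after inspecting the files. Building the lookup set from the smaller file (the 396209-line TSV) and streaming the larger one keeps memory use modest.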