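First, the download link: the link I requested with my own account would not download, so I had to use the scidb link, which requires no login. Downloading with curl kept failing (the MD5 of the finished file did not match, and the archive could not be extracted), so I switched to wget, which finally worked. This is the download command I used (without a loop):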
wget -v -c 'https://download.scidb.cn/download?fileId=63a30383fed6a8a9e8454302&dataSetType=organization&fileName=WuDaoCorporaText-2.0-open.rar' -O data/WuDaoCorpus2.0_base_200G.rar
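Also, the extraction command does not specify an output path, so running the sh file from the project root extracts the archive into the root directory (Open-LLama/WuDaoCorpus2.0_base_200G/). The extracted directory then has to be moved into the data folder, or the path in data/preprocess_wudao.py has to be changed. Separately, the Pile is really hard to download (it also needs a proxy to get around the firewall)...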
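One way to avoid the stray directory is to give unrar an explicit destination. This is a sketch only, not necessarily the change made in the repository; `extract_to` is a hypothetical helper name. unrar treats a final argument ending in `/` as the directory to extract into.

```shell
# extract_to DEST_DIR ARCHIVE
# Unpacks ARCHIVE into DEST_DIR rather than the current working directory.
# Hypothetical helper; assumes the unrar command-line tool is installed.
extract_to() {
    # Create the destination first, then pass it with a trailing slash so
    # unrar reads it as the extraction path, not as a file to extract.
    mkdir -p "$1" && unrar x "$2" "$1/"
}

# Example use with the paths mentioned above:
#   extract_to data data/WuDaoCorpus2.0_base_200G.rar
```

If the archive has already been extracted into the project root, moving the directory with `mv WuDaoCorpus2.0_base_200G data/` gives the same layout without extracting again.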
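Thanks for the suggestions on the dataset download. This method looks good; I have added it to the README and mentioned you there. I used a loop because the WuDao link is unstable: the download gets interrupted after roughly every 1 GB, so a loop is needed to keep resuming it.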
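The resume loop described here can be combined with the wget command from the issue. The sketch below is an illustration rather than the script in the repository; `download_with_resume`, the attempt limit, and the `RETRY_DELAY` variable are all assumptions introduced for the example.

```shell
# download_with_resume URL OUTPUT [MAX_ATTEMPTS]
# Re-runs "wget -c" until it exits successfully, so an interrupted transfer
# continues from the bytes already on disk. Gives up after MAX_ATTEMPTS
# failures (default 50). RETRY_DELAY sets the pause in seconds (default 5).
download_with_resume() {
    attempts=0
    max=${3:-50}
    until wget -c "$1" -O "$2"; do
        attempts=$((attempts + 1))
        if [ "$attempts" -ge "$max" ]; then
            echo "giving up after $attempts failed attempts" >&2
            return 1
        fi
        sleep "${RETRY_DELAY:-5}"
    done
}

# Example use with the link from the issue (quote the URL: it contains "&"):
#   download_with_resume \
#       'https://download.scidb.cn/download?fileId=63a30383fed6a8a9e8454302&dataSetType=organization&fileName=WuDaoCorporaText-2.0-open.rar' \
#       data/WuDaoCorpus2.0_base_200G.rar
```

Capping the number of attempts keeps the loop from running forever if the link goes down entirely.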
curl and wget may differ in how they handle redirects; a few of the instruct datasets also could not be downloaded with curl.
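If redirects are indeed the cause (this thread does not confirm it), the relevant difference is that wget follows HTTP redirects by default, while curl only does so when given `-L`/`--location`. A minimal sketch of a curl invocation that behaves more like the wget command, with `curl_download` as a hypothetical helper name:

```shell
# curl_download URL OUTPUT
# Hypothetical helper that makes curl behave closer to "wget -c":
#   --location       follow HTTP redirects (curl does not by default)
#   --fail           exit non-zero on an HTTP error instead of saving the error page
#   --continue-at -  resume from the size of the existing output file
curl_download() {
    curl --location --fail --continue-at - --output "$2" "$1"
}
```

Without `--fail`, an HTTP error response can be saved as if it were the file, which would also explain an archive that fails its MD5 check.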
The problem of unrar not specifying an output path has just been fixed.