Other language data #93
Comments
Hi @Dzg0309 -- currently we don't have plans to release data in other languages. However, if you want to create such a dataset (e.g. in Chinese), you can use the CCNet pipeline and the scripts in this repo to compute quality signals and deduplicate the corpus. Note that in other languages you will likely have to adapt the quality signals.
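For anyone attempting this, a minimal sketch of the two steps mentioned above (computing per-document quality signals, then deduplicating) might look like the following. The function names and the specific signals here are illustrative only, not the actual scripts in this repo; real pipelines use richer signals and near-duplicate detection (e.g. MinHash) rather than exact hashing:

```python
# Illustrative sketch, NOT this repo's API: toy quality signals plus
# exact deduplication by content hash, in the spirit of CCNet/RedPajama.
import hashlib


def quality_signals(doc: str) -> dict:
    """Compute a few toy quality signals for one document."""
    words = doc.split()
    n = max(len(words), 1)
    return {
        "num_words": len(words),
        "mean_word_length": sum(len(w) for w in words) / n,
        "frac_unique_words": len(set(words)) / n,
    }


def deduplicate(docs):
    """Keep the first occurrence of each exact document (by SHA-256)."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.strip().encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept


docs = ["你好 世界", "你好 世界", "另一 个 文档"]
unique_docs = deduplicate(docs)
print(len(unique_docs))  # 2
```

For a non-English corpus, the thresholds applied on top of such signals (e.g. acceptable word lengths) would need to be re-tuned per language, which is the adaptation the comment above refers to.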
Thank you very much for your reply. It is very difficult for us to filter Chinese data out of the full CommonCrawl ourselves, because we cannot handle such a large CC dump. Is there a channel for obtaining data already split by language -- i.e. raw Chinese data? That way, we could process it and generate the Chinese dataset using CCNet and the library you provided.
@mauriceweber I am a faculty member at King Abdullah University of Science and Technology (KAUST) in Saudi Arabia. I am about to kick off a project applying these workflows to prepare an Arabic language subset, with the goal of contributing it to the next version of this dataset. Would there be interest in collaborating on this project? We have the technical skills and plenty of compute, so what we really need is general guidance if we get stuck. @Dzg0309 depending on how many resources preparing the Arabic data requires, we may also be able to prepare the data for other languages.
Hi @davidrpugh, awesome to hear that! I'm happy to provide any guidance you need and am open to collaboration on this! :)
(Original issue) Thank you very much for your work in providing such rich data to the open-source community. I was wondering if there are any plans for a release in other languages, such as Chinese? I think Chinese data is also something most people need.