Other language data #93
Comments
Hi @Dzg0309 -- currently we don't have plans to release data in other languages. However, if you want to create such a dataset (e.g. in Chinese), you can use the CCNet pipeline and the scripts in this repo to compute quality signals and deduplicate the corpus. Note that in other languages you will likely have to adapt the quality signals.
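For anyone attempting this, a minimal sketch of the two steps mentioned above (computing per-document quality signals, then deduplicating) might look like the following. The function names and the specific signals here are illustrative only, not the actual scripts in this repo; real pipelines use richer signals and near-duplicate detection (e.g. MinHash) rather than exact hashing:

```python
# Illustrative sketch, NOT this repo's API: toy quality signals plus
# exact deduplication by content hash, in the spirit of CCNet/RedPajama.
import hashlib


def quality_signals(doc: str) -> dict:
    """Compute a few toy quality signals for one document."""
    words = doc.split()
    n = max(len(words), 1)
    return {
        "num_words": len(words),
        "mean_word_length": sum(len(w) for w in words) / n,
        "frac_unique_words": len(set(words)) / n,
    }


def deduplicate(docs):
    """Keep the first occurrence of each exact document (by SHA-256)."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.strip().encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept


docs = ["你好 世界", "你好 世界", "另一 个 文档"]
unique_docs = deduplicate(docs)
print(len(unique_docs))  # 2
```

For a non-English corpus, the thresholds applied on top of such signals (e.g. acceptable word lengths) would need to be re-tuned per language, which is the adaptation the comment above refers to.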
Thank you very much for your reply. It is very difficult for us to filter Chinese data out of the full CommonCrawl ourselves, because we cannot handle such a large CC dump. Is there a channel for obtaining data already split by language -- i.e. raw Chinese data? That way, we could process it and generate the Chinese dataset using CCNet and the library you provided.
@mauriceweber I am a faculty member at King Abdullah University of Science and Technology (KAUST) in Saudi Arabia. I am about to kick off a project applying these workflows to prepare an Arabic language subset, with the goal of contributing it to the next version of this dataset. Would there be interest in collaborating on this project? We have the technical skills and plenty of compute, so what we really need is general guidance if we get stuck. @Dzg0309 depending on how many resources preparing the Arabic data requires, we may also be able to prepare the data for other languages.
Hi @davidrpugh, awesome to hear that! I'm happy to provide any guidance you need and am open to collaboration on this! :)
(Original issue) Thank you very much for your work in providing such rich data to the open-source community. I was wondering if there are any plans for a release in other languages, such as Chinese? I think Chinese data is also something most people need.