BurmeseCorpus

Burmese Language Corpus

Directory structure

Each file contains at least 150,000 sentences.

corpus
    |--- bm_corpus_1.txt
    |--- bm_corpus_2.txt
    ...
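As a quick sanity check on the layout above, the following minimal sketch counts the non-empty lines in each `bm_corpus_*.txt` file. It assumes one sentence per line, which this README does not state, and the function name `count_lines_per_file` is purely illustrative.

```python
from pathlib import Path


def count_lines_per_file(corpus_dir):
    """Return {file name: number of non-empty lines} for bm_corpus_*.txt files.

    Assumption: each non-empty line holds one sentence. The README does not
    confirm this, so treat the counts as approximate.
    """
    counts = {}
    for path in sorted(Path(corpus_dir).glob("bm_corpus_*.txt")):
        with path.open(encoding="utf-8") as handle:
            counts[path.name] = sum(1 for line in handle if line.strip())
    return counts


if __name__ == "__main__":
    # "corpus" is the directory shown in the tree above.
    for name, count in count_lines_per_file("corpus").items():
        print(f"{name}: {count} non-empty lines")
```

If the one-sentence-per-line assumption holds, each reported count should be at least 150,000.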

Data Source

CC100 -> burmese -> https://huggingface.co/datasets/cc100

Dataset Description

The CC100 corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository by processing the January-December 2018 Common Crawl snapshots. Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open-source CC-Net repository. No claims of intellectual property are made on the work of preparation of the corpus.
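The document and paragraph layout described above can be read with a small generator. This is a minimal sketch, assuming the files in this repository keep the CC100 convention (a blank line between documents, one paragraph per line); the name `iter_documents` is hypothetical and not part of any library.

```python
def iter_documents(lines):
    """Yield each document as a list of paragraph strings.

    Assumption: documents are separated by a blank line (a double newline)
    and paragraphs within a document by a single newline, as in CC100.
    """
    paragraphs = []
    for line in lines:
        text = line.rstrip("\n")
        if text.strip():
            # A non-empty line is one paragraph of the current document.
            paragraphs.append(text)
        elif paragraphs:
            # A blank line closes the current document.
            yield paragraphs
            paragraphs = []
    if paragraphs:
        # Emit the last document if the input does not end with a blank line.
        yield paragraphs


if __name__ == "__main__":
    # The path follows the directory tree in this README.
    with open("corpus/bm_corpus_1.txt", encoding="utf-8") as handle:
        first_document = next(iter_documents(handle), None)
        print(first_document)
```

Because the generator reads line by line, it does not need to load a whole corpus file into memory.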

1chimaruGin/BurmeseCorpus
