1chimaruGin/BurmeseCorpus

BurmeseCorpus

Burmese Language Corpus

Directory structure

Each file contains at least 150,000 sentences.

corpus
    |--- bm_corpus_1.txt
    |--- bm_corpus_2.txt
    ...
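
To check the per-file sizes in a local checkout, the sketch below counts the non-empty lines in each corpus file. It assumes one sentence per line, which this README does not state, and that the files are UTF-8 encoded; the function names `count_nonempty_lines` and `summarize_corpus` are hypothetical helpers, not part of this repository.

```python
from pathlib import Path


def count_nonempty_lines(path, encoding="utf-8"):
    """Count non-empty lines in a text file, streaming line by line."""
    with open(path, encoding=encoding) as f:
        return sum(1 for line in f if line.strip())


def summarize_corpus(corpus_dir="corpus", pattern="bm_corpus_*.txt"):
    """Return {filename: non-empty line count} for every matching file.

    Assumption: each non-empty line holds one sentence.
    """
    files = sorted(Path(corpus_dir).glob(pattern))
    return {p.name: count_nonempty_lines(p) for p in files}


if __name__ == "__main__":
    for name, count in summarize_corpus().items():
        print(f"{name}: {count} lines")
```

Streaming the file line by line keeps memory use flat regardless of file size.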

Data Source

CC100 -> burmese -> https://huggingface.co/datasets/cc100

Dataset Description

CC100 is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository by processing the January-December 2018 Common Crawl snapshots. Each file comprises documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open-source CC-Net repository. No claims of intellectual property are made on the work of preparing the corpus.
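
The document layout described above (documents separated by blank lines, paragraphs separated by single newlines) can be parsed with a few lines of Python. This is a minimal sketch for text in the CC100 layout; whether the files in this repository preserve that layout is not stated here, and `parse_cc100` is a hypothetical helper name.

```python
import re


def parse_cc100(text):
    """Split CC100-style text into documents, each a list of paragraphs.

    Documents are separated by one or more blank lines; paragraphs within
    a document are separated by a single newline.
    """
    documents = []
    for block in re.split(r"\n\s*\n", text.strip()):
        paragraphs = [line.strip() for line in block.split("\n") if line.strip()]
        if paragraphs:  # skip empty blocks, e.g. when the input is empty
            documents.append(paragraphs)
    return documents


if __name__ == "__main__":
    sample = "doc1 para1\ndoc1 para2\n\ndoc2 para1\n"
    print(parse_cc100(sample))
```

Returning a list of paragraph lists keeps document boundaries available for later steps such as per-document deduplication or sentence splitting.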
