News Categorization Dataset

The dataset is originally hosted at https://github.com/AI4Bharat/indicnlp_corpus. We curated this version from the work "Bangla Text Classification using Transformers".

Dataset

The dataset contains six class labels for the news categorization task. It is split into training, development, and test sets of 11,284, 1,411, and 1,411 news articles, respectively.

Directory Structure:

  • train.tsv
  • dev.tsv
  • test.tsv
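
As a minimal sketch, the three splits can be read with Python's standard `csv` module. The column layout of the TSV files is not documented here, so the code returns raw rows and does not assume any column names or a header row. The names `read_tsv`, `load_splits`, and `EXPECTED_SIZES` are hypothetical helpers introduced only for this illustration.

```python
import csv
from pathlib import Path

# Split sizes as stated in this README (number of news articles).
EXPECTED_SIZES = {"train": 11284, "dev": 1411, "test": 1411}


def read_tsv(path):
    """Return every row of a tab-separated file as a list of strings."""
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quote characters inside article text untouched.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)


def load_splits(data_dir="."):
    """Load train.tsv, dev.tsv, and test.tsv from data_dir (assumed location)."""
    data_dir = Path(data_dir)
    return {name: read_tsv(data_dir / f"{name}.tsv") for name in EXPECTED_SIZES}


if __name__ == "__main__":
    for name, rows in load_splits().items():
        # Row counts may differ from the documented sizes by one
        # if the files turn out to contain a header row.
        print(f"{name}: {len(rows)} rows (documented: {EXPECTED_SIZES[name]} articles)")
```

Comparing the printed row counts with the documented sizes is a quick way to check that the files were downloaded completely.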

Licensing

The dataset is licensed under CC BY-NC-SA 4.0.

Citation

Please cite the following papers if you are using the data:

@article{alam2021review,
  title={A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models},
  author={Alam, Firoj and Hasan, Md Arid and Alam, Tanvir and Khan, Akib and Tajrin, Janntatul and Khan, Naira and Chowdhury, Shammur Absar},
  journal={arXiv preprint arXiv:2107.03844},
  year={2021}
}

@article{alam2020bangla,
  title={Bangla Text Classification using Transformers},
  author={Alam, Tanvirul and Khan, Akib and Alam, Firoj},
  journal={arXiv preprint arXiv:2011.04446},
  year={2020}
}

@article{kunchukuttan2020ai4bharat,
  title={AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages},
  author={Anoop Kunchukuttan and Divyanshu Kakwani and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
  journal={arXiv preprint arXiv:2005.00085},
  year={2020}
}