The dataset was originally hosted at https://github.com/AI4Bharat/indicnlp_corpus. We curated it from the work "Bangla Text Classification using Transformers" (Alam et al., 2020).
The dataset contains six class labels for the news categorization task. It comes with training, development, and test splits of 11,284, 1,411, and 1,411 news articles, respectively, provided as the following files:
- train.tsv
- dev.tsv
- test.tsv
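
The split sizes stated above can be sanity-checked after download. The following is a minimal sketch, not part of the original release: it assumes the three files sit in one directory, are tab-separated, and each start with a header row. The function names `read_tsv` and `check_splits` and the `has_header` parameter are hypothetical. Pass `has_header=False` if the files turn out to have no header.

```python
import csv
from pathlib import Path

# Split sizes as stated in this README.
EXPECTED_SIZES = {"train": 11284, "dev": 1411, "test": 1411}


def read_tsv(path, has_header=True):
    """Read a tab-separated file into a list of rows (lists of strings).

    ASSUMPTION: the presence of a header row is not documented here,
    so it is controlled by the has_header flag.
    """
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks inside article text literal.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        rows = [row for row in reader if row]  # skip blank lines
    return rows[1:] if has_header else rows


def check_splits(data_dir, has_header=True):
    """Return {split: (actual_rows, expected_rows)} for the three files."""
    report = {}
    for split, expected in EXPECTED_SIZES.items():
        rows = read_tsv(Path(data_dir) / f"{split}.tsv", has_header)
        report[split] = (len(rows), expected)
    return report
```

Calling `check_splits("path/to/data")` returns the actual and expected row counts per split, so any mismatch (for example, from a wrong header assumption) is visible at a glance.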
The dataset is licensed under CC BY-NC-SA 4.0.
Please cite the following papers if you are using the data:
@article{alam2021review,
title={A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models},
author={Alam, Firoj and Hasan, Md Arid and Alam, Tanvir and Khan, Akib and Tajrin, Janntatul and Khan, Naira and Chowdhury, Shammur Absar},
journal={arXiv preprint arXiv:2107.03844},
year={2021}
}
@article{alam2020bangla,
title={Bangla Text Classification using Transformers},
author={Alam, Tanvirul and Khan, Akib and Alam, Firoj},
journal={arXiv preprint arXiv:2011.04446},
year={2020}
}
@article{kunchukuttan2020ai4bharat,
title={AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages},
author={Anoop Kunchukuttan and Divyanshu Kakwani and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
journal={arXiv preprint arXiv:2005.00085},
year={2020}
}