The dataset was curated from https://www.isical.ac.in/~utpal/resources.php. The raw text comes from Rabindranath Tagore’s short stories and from news articles in various domains.
Each of the following files contains words paired with their lemma forms:
- train.txt
- dev.txt
- test.txt
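
A minimal sketch for loading these files is shown below. It assumes that each non-empty line holds a word followed by its lemma, separated by whitespace, and that the files are UTF-8 encoded. Neither detail is stated by the original source, so inspect a file before relying on it. The helper names parse_pairs and load_pairs are hypothetical.

```python
from typing import Iterable, List, Tuple


def parse_pairs(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Turn lines of text into (word, lemma) tuples.

    Assumption: the word is the first whitespace-separated field and the
    lemma is the second. Lines with fewer than two fields are skipped.
    """
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank line or a line without a lemma field
        pairs.append((fields[0], fields[1]))
    return pairs


def load_pairs(path: str, encoding: str = "utf-8") -> List[Tuple[str, str]]:
    """Read one split (for example train.txt) into (word, lemma) tuples."""
    with open(path, encoding=encoding) as handle:
        return parse_pairs(handle)
```

With the files in the working directory, load_pairs("train.txt") would return the training pairs. If the files turn out to use a different layout (for example one sentence per line), only parse_pairs needs to change.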
The original dataset does not provide any license information.
Please cite the following papers if you use the data:
@article{alam2021review,
title={A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models},
author={Alam, Firoj and Hasan, Md Arid and Alam, Tanvir and Khan, Akib and Tajrin, Janntatul and Khan, Naira and Chowdhury, Shammur Absar},
journal={arXiv preprint arXiv:2107.03844},
year={2021}
}
@inproceedings{chakrabarty-etal-2017-context,
address = {Vancouver, Canada},
author = {Chakrabarty, Abhisek and Pandit, Onkar Arun and Garain, Utpal},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics},
doi = {10.18653/v1/P17-1136},
pages = {1481--1491},
publisher = {Association for Computational Linguistics},
title = {Context Sensitive Lemmatization Using Two Successive Bidirectional Gated Recurrent Networks},
url = {https://www.aclweb.org/anthology/P17-1136},
year = {2017}
}