Despite being the third most popular language in India, Marathi lacks useful NLP resources. With L3Cube-MahaNLP, we aim to build resources and a library for Marathi natural language processing. We have contributed unsupervised and supervised datasets, as well as Transformer models, for Marathi. The supervised datasets cover Marathi sentiment analysis, named entity recognition, and hate speech detection. With this, we at L3Cube-Pune aim to bring Marathi to the forefront of IndicNLP. Our vision is to make Marathi a resource-rich language and promote AI for Maharashtra!
[Update] The library is now available as a Python package:
pip install mahaNLP
Usage examples are provided in this demo Colab.
[Update] We have released a new code-mixed Marathi-English unsupervised dataset, MeCorpus, and the supervised datasets MeSent, MeHate, and MeLID.
[Update] We have released a new multi-domain sentiment analysis dataset, MahaSent-MD, with 60k samples across four diverse domains. A new sentiment analysis model has also been released on Hugging Face.
L3Cube-MahaCorpus is a Marathi monolingual dataset scraped from different internet sources. It expands the existing Marathi monolingual corpus with 24.8M sentences and 289M tokens. We also present MahaBERT, MahaAlBERT, and MahaRoBERTa, all BERT-based masked language models, and MahaFT, a set of fastText word embeddings. Both the models and the embeddings are trained on the full Marathi corpus with 752M tokens. The evaluation details are given in our paper (link).
L3Cube-MahaCorpus(full) = L3Cube-MahaCorpus(news) + L3Cube-MahaCorpus(non-news)
The full Marathi corpus incorporates all existing sources.
Dataset | #tokens(M) | #sentences(M) | Link |
---|---|---|---|
L3Cube-MahaCorpus (news) | 212 | 17.6 | link |
L3Cube-MahaCorpus (non-news) | 76.4 | 7.2 | link |
L3Cube-MahaCorpus (full) | 289 | 24.8 | link |
Full Marathi Corpus (all sources) | 752 | 57.2 | link |
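The relation full = news + non-news can be checked against the rounded figures in the table. The following is a minimal sketch: the numbers are copied from the table, while the helper names and the rounding tolerance of 1M are assumptions made for illustration.

```python
# Figures copied from the table above, in millions.
corpus = {
    "news": {"tokens": 212.0, "sentences": 17.6},
    "non-news": {"tokens": 76.4, "sentences": 7.2},
    "full": {"tokens": 289.0, "sentences": 24.8},
}


def combined(parts, field):
    """Sum one field (tokens or sentences) over the named sub-corpora."""
    return sum(corpus[part][field] for part in parts)


def consistent(field, tolerance=1.0):
    """True if news + non-news matches the full figure within the tolerance.

    The tolerance (in millions) is an assumption that absorbs rounding
    in the published numbers.
    """
    total = combined(["news", "non-news"], field)
    return abs(total - corpus["full"][field]) <= tolerance


tokens_ok = consistent("tokens")
sentences_ok = consistent("sentences")
print(tokens_ok, sentences_ok)
```

The token counts add up to 288.4M, which the table reports as 289M, so a small tolerance is needed for the token check.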
L3Cube-MeCorpus is a first-of-its-kind large code-mixed Marathi-English (Mr-En) corpus with 10 million social media sentences, released in this paper.
Dataset | #tokens(M) | #sentences(M) | Link |
---|---|---|---|
L3Cube-MeCorpus (Roman) | 70.9 | 5 | link |
L3Cube-MeCorpus (Devanagari) | 68.6 | 5 | link |
L3Cube-MeCorpus (Roman + Devanagari) | 139.5 | 10 | link |
The full Marathi corpus is used to train BERT language models, which are available on the Hugging Face model hub.
Model | Description | Link |
---|---|---|
MahaGemma-7B | Gemma-7B | v1 |
MahaGemma-2B | Gemma-2B | v1 |
MahaBERT | Base-BERT | v1 , v2 , paper |
MahaRoBERTa | RoBERTa | link |
MahaAlBERT | AlBERT | v1 , v2 |
MahaGPT | GPT2 | link |
MahaFT | Fast Text | bin , vec |
MahaTweetBERT | MahaBERT + Tweets | model , paper |
MahaSBERT | Sentence-BERT | MahaSBERT-STS , MahaSBERT , paper |
IndicSBERT | Sentence-BERT (for cross-language) | IndicSBERT-STS , IndicSBERT , paper |
MeBERT | Codemixed Marathi-English BERT (Roman) | me-bert , paper |
MeRoBERTa | Codemixed Marathi-English RoBERTa (Roman) | me-roberta , paper |
MeBERT-Mixed | Codemixed Marathi-English BERT (Roman + Devanagari) | me-bert-mixed , me-bert-mixed-v2 , paper |
MeRoBERTa-Mixed | Codemixed Marathi-English RoBERTa (Roman + Devanagari) | me-roberta-mixed , paper |
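Since the models are hosted on the Hugging Face model hub, they can typically be loaded with the `transformers` library. The sketch below is not taken from this repository: the hub ID `l3cube-pune/marathi-bert-v2` is an assumption that should be confirmed against the links in the table, and `predict_masked_word` is a hypothetical helper name.

```python
from typing import List

# Assumed hub ID for MahaBERT v2; confirm it against the link in the table.
MODEL_ID = "l3cube-pune/marathi-bert-v2"


def predict_masked_word(template: str, model_id: str = MODEL_ID, top_k: int = 3) -> List[str]:
    """Return the model's top guesses for the single '{mask}' slot in `template`."""
    # Imported lazily so that defining this helper does not require the library.
    # Running it needs `pip install transformers torch` and network access.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    # Use the tokenizer's own mask token instead of hard-coding "[MASK]".
    text = template.format(mask=fill.tokenizer.mask_token)
    return [prediction["token_str"] for prediction in fill(text, top_k=top_k)]
```

Calling `predict_masked_word("मी {mask} आहे.")` downloads the model on first use and returns the top three candidate tokens. Swapping in another masked language model from the table should only require changing `model_id`.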
Dataset | Description | Samples(train, valid, test) | link | model | paper |
---|---|---|---|---|---|
MahaSQuAD | Marathi Question Answering Dataset | 142k (118516, 11873, 11803) | data | MahaSQuAD-BERT | link |
MahaNews | Marathi long, medium, and short document classification dataset with 12 target classes | 53k (42k, 5k, 5k) | data | MahaNews-All-BERT | link |
MahaNER | Marathi Named Entity Recognition dataset with 8 entity classes | 25k (21.5k, 1.5k, 2k) | data | MahaNER-BERT | link |
MahaSocialNER | Social media based Marathi Named Entity Recognition dataset with 8 entity classes | 18k (12k, 1.5k, 2.2k) | data | MahaSocialNER-BERT | link |
MahaHate | Marathi Hate Speech Detection dataset with 4-class (hate, offensive, profane, and not) and 2-class (hate and not) labels | 4-class: 25k (21.5k, 1.5k, 2k), 2-class: 37500 | data | 4-class , 2-class | link |
MahaSent | Marathi Sentiment Analysis dataset with three classes - Positive(1), Negative(-1) and Neutral(0) | 18,378 (12114, 1500, 2250); extra(2,514=2355(+1) + 159(-1)) | data | MarathiSentiment | link |
HateEval-Mr | Another dataset for evaluation of Hate Speech models with two classes - Hate(1) and None(0) | 2k samples | data | link | |
MahaSent-MD | A Multi-domain Marathi Sentiment Analysis dataset (4 domains - Marathi Movie Reviews, TV Subtitles, Generic Tweets, and Political Tweets) with three classes - Positive(1), Negative(-1) and Neutral(0) | 60k samples | data | MahaSent-MD | link |
MeSent | A code-mixed Marathi-English Sentiment Analysis dataset with three classes - Positive(1), Negative(-1) and Neutral(0) | 12k samples | data | me-sent-roberta | link |
MeHate | A code-mixed Marathi-English Hate speech identification dataset with two classes - Hate(1) and None(0) | 2768 samples | data | me-hate-bert | link |
MeLID | A code-mixed Marathi-English language identification (LID) dataset with three classes - Marathi, English, and Undefined | 12k samples | data | me-lid-bert | link |
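The sentiment datasets in the table share the label convention Positive(1), Negative(-1), and Neutral(0). A minimal sketch of decoding those labels and checking the MahaSent sizes is shown below. The numbers come from the table; the constant and function names are hypothetical and do not reflect the column names in the released files.

```python
# Label convention stated in the table: Positive(1), Negative(-1), Neutral(0).
SENTIMENT_LABELS = {1: "positive", -1: "negative", 0: "neutral"}

# MahaSent sizes as listed in the table.
MAHASENT_SPLITS = {"train": 12114, "valid": 1500, "test": 2250}
MAHASENT_EXTRA = {1: 2355, -1: 159}  # extra positive and negative samples
MAHASENT_TOTAL = 18378


def label_name(label: int) -> str:
    """Map a numeric sentiment label to a readable name."""
    return SENTIMENT_LABELS[label]


def total_samples() -> int:
    """Add the train/valid/test splits and the extra samples."""
    return sum(MAHASENT_SPLITS.values()) + sum(MAHASENT_EXTRA.values())


print(label_name(-1))
print(total_samples() == MAHASENT_TOTAL)
```

The three splits contain 15,864 samples, and the 2,514 extra samples bring the total to the 18,378 reported in the table.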
L3Cube-MahaCorpus, L3Cube-MahaNER, L3Cube-MahaHate, L3Cube-HateEval-Mr, L3Cube-MahaSent-MD, L3CubeMahaSent, L3Cube-MeCorpus, L3Cube-MeSent, L3Cube-MeHate, and L3Cube-MeLID are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The datasets are released to the community for research purposes only, and the group is not responsible for any misuse of these datasets.
@article{joshi2022l3cube,
title={L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models, and Library},
author={Joshi, Raviraj},
journal={arXiv preprint arXiv:2205.14728},
year={2022}
}
@inproceedings{joshi-2022-l3cube,
title = "L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources",
author = "Joshi, Raviraj",
booktitle = "Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference",
month = jun,
year = "2022",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2022.wildre-1.17",
pages = "97--101",
}
Joshi, Raviraj. "L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources." Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference. 2022.
Mittal, Saloni, et al. "L3Cube-MahaNews: News-Based Short Text and Long Document Classification Datasets in Marathi." International Conference on Speech and Language Technologies for Low-resource Languages. Cham: Springer Nature Switzerland, 2023.
Chavan, Tanmay, et al. "My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks." arXiv preprint arXiv:2306.14030 (2023).
Pingle, Aabha, et al. "L3Cube-MahaSent-MD: A Multi-domain Marathi Sentiment Analysis Dataset and Transformer Models." arXiv preprint arXiv:2306.13888 (2023).
Pingle, Aabha, et al. "Robust Sentiment Analysis for Low Resource languages Using Data Augmentation Approaches: A Case Study in Marathi." arXiv preprint arXiv:2310.00734 (2023).
Deode, Samruddhi, et al. "L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT." arXiv preprint arXiv:2304.11434 (2023).
Joshi, Ananya, et al. "L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi." arXiv preprint arXiv:2211.11187 (2022).
Gokhale, Omkar Bhushan, et al. "Spread Love Not Hate: Undermining the Importance of Hateful Pre-training for Hate Speech Detection." I Can't Believe It's Not Better Workshop: Understanding Deep Learning Through Empirical Falsification.
Sabane, Maithili, et al. "Enhancing Low Resource NER using Assisting Language and Transfer Learning." 2023 2nd International Conference on Applied Artificial Intelligence and Computing (ICAAIC). IEEE, 2023.
Litake, Onkar, et al. "L3Cube-MahaNER: A Marathi Named Entity Recognition Dataset and BERT models." Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference. 2022.
Litake, Onkar, et al. "Mono vs Multilingual BERT: A Case Study in Hindi and Marathi Named Entity Recognition." arXiv preprint arXiv:2203.12907 (2022).
Velankar, Abhishek, Hrushikesh Patil, and Raviraj Joshi. "Mono vs Multilingual BERT for Hate Speech Detection and Text Classification: A Case Study in Marathi." IAPR Workshop on Artificial Neural Networks in Pattern Recognition. Springer, Cham, 2023.
Patil, Hrushikesh, Abhishek Velankar, and Raviraj Joshi. "L3Cube-MahaHate: A Tweet-based Marathi Hate Speech Detection Dataset and BERT Models." Proceedings of the Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022). 2022.
Velankar, Abhishek, et al. "Hate and offensive speech detection in Hindi and Marathi." arXiv preprint arXiv:2110.12200 (2021).
Kulkarni, Atharva, et al. "L3CubeMahaSent: A Marathi Tweet-based Sentiment Analysis Dataset." Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. 2021.
Kulkarni, Atharva, et al. "Experimental Evaluation of Deep Learning Models for Marathi Text Classification." Proceedings of the 2nd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications. Springer, Singapore, 2022.
This project is led by Raviraj Joshi under L3Cube Labs, Pune. For any queries, contact ravirajoshi@gmail.com.