This repository contains state-of-the-art language models and a classifier for Gujarati, a language native to the Indian state of Gujarat.
The models trained here are used in the Natural Language Toolkit for Indic Languages (iNLTK).
- iNLTK Headlines Corpus - Gujarati: uses the Gujarati News Dataset prepared above.

Architecture/Dataset | Gujarati Wikipedia Articles |
---|---|
ULMFiT | 34.12 |
TransformerXL | 28.12 |

Dataset | Accuracy | MCC | Notebook to Reproduce results |
---|---|---|---|
iNLTK Headlines Corpus - Gujarati | 91.05 | 86.09 | Link |

Architecture | Visualization |
---|---|
ULMFiT | Embeddings projection |
TransformerXL | Embeddings projection |

Dataset | Dataset size (train, valid, test) | Accuracy | MCC | Notebook to Reproduce results |
---|---|---|---|---|
iNLTK Headlines Corpus - Gujarati | (5269, 659, 659) | 91.05 | 86.09 | Link |

Dataset | Dataset size (train, valid, test) | Accuracy | MCC | Notebook to Reproduce results |
---|---|---|---|---|
iNLTK Headlines Corpus - Gujarati | (526, 659, 659) | 80.88 | 70.18 | Link |

Dataset | Dataset size (train, valid, test) | Accuracy | MCC | Notebook to Reproduce results |
---|---|---|---|---|
iNLTK Headlines Corpus - Gujarati | (526, 659, 659) | 81.03 | 70.44 | Link |
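
The Accuracy and MCC (Matthews correlation coefficient) columns in the classification tables above can be computed from true and predicted labels roughly as sketched below. This is a minimal, dependency-free illustration and not the code from the linked notebooks: the function names `accuracy` and `multiclass_mcc` and the labels `"a"`, `"b"`, `"c"` are hypothetical. The functions return fractions; the tables appear to report these values multiplied by 100, which is an assumption.

```python
from collections import Counter
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient, in the range [-1, 1].

    Uses the standard multiclass generalization:
    (c*n - sum_k t_k*p_k) / sqrt((n^2 - sum_k p_k^2) * (n^2 - sum_k t_k^2)),
    where c is the number of correct predictions, n the number of samples,
    t_k how often class k truly occurs, and p_k how often it is predicted.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    numerator = correct * n - sum(true_counts[k] * pred_counts[k] for k in labels)
    denominator = sqrt(
        (n * n - sum(v * v for v in pred_counts.values()))
        * (n * n - sum(v * v for v in true_counts.values()))
    )
    # By convention, return 0.0 when the coefficient is undefined
    # (for example, when every sample receives the same predicted label).
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    y_true = ["a", "b", "c", "b", "a", "c"]
    y_pred = ["a", "b", "c", "c", "a", "c"]
    print(f"accuracy: {100 * accuracy(y_true, y_pred):.2f}")
    print(f"MCC:      {100 * multiclass_mcc(y_true, y_pred):.2f}")
```

MCC is reported alongside accuracy because it accounts for how predictions are distributed across classes, so a classifier that always predicts a single class scores 0 on MCC even when its accuracy looks reasonable.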

Download the pretrained language models from here.

The tokenizer was trained using Google's SentencePiece.

Download the trained model and vocabulary from here.