goru001/nlp-for-hinglish

NLP for Hinglish (Code mixed Hindi+English)

This repository contains a language model for code-mixed Hinglish (Hindi and English), as spoken in the Indian subcontinent.

The methodology followed in this repo is detailed in this paper, accepted at Dravidian-Codemix-HASOC2020@FIRE2020.

Dataset

  1. Synthetically Generated Hinglish Dataset from Wikipedia Articles

Results

Language Model Perplexity (on validation set)

| Architecture/Dataset | Synthetically Generated Wikipedia Articles Dataset |
| --- | --- |
| ULMFiT | 86.48 |
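
For readers unfamiliar with the metric, perplexity is the exponential of the average per-token negative log-likelihood (natural log), so lower is better. The sketch below is illustrative only and is not taken from this repo's training code; the function names `perplexity` and `loss_from_perplexity` are hypothetical helpers, and the toy probabilities are made up.

```python
import math


def perplexity(token_log_probs):
    """Return exp(-mean log-probability) over a sequence of tokens.

    token_log_probs: natural-log probabilities the model assigned
    to each ground-truth token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def loss_from_perplexity(ppl):
    """Invert the relation: cross-entropy loss (in nats) = ln(perplexity)."""
    return math.log(ppl)


# Toy example (made-up numbers): a model that gives each of three
# tokens probability 1/4 has perplexity of about 4.
toy_log_probs = [math.log(0.25)] * 3
print(perplexity(toy_log_probs))

# Cross-entropy loss implied by the reported validation perplexity.
print(loss_from_perplexity(86.48))
```

Read this way, the reported perplexity of 86.48 corresponds to a validation cross-entropy of ln(86.48) nats per token.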

Visualizations

Word Embeddings
| Architecture | Visualization |
| --- | --- |
| ULMFiT | Embeddings projection |

Pretrained Models

Language Models

Download the pretrained ULMFiT language model from here

Tokenizer

The tokenizer was trained using Google's SentencePiece.

Download the trained model and vocabulary from here
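
Once downloaded, a SentencePiece model can typically be loaded with the `sentencepiece` Python package. The following is a minimal sketch under assumptions: the file name `hinglish_tokenizer.model` is a hypothetical placeholder (use the actual name of the downloaded file), the helpers `load_tokenizer` and `tokenize` are illustrative, and the sample sentence is made up.

```python
import os

# Hypothetical local path; replace with the location of the downloaded model.
MODEL_PATH = "hinglish_tokenizer.model"


def load_tokenizer(model_path):
    """Load a trained SentencePiece model from disk."""
    import sentencepiece as spm  # requires: pip install sentencepiece

    return spm.SentencePieceProcessor(model_file=model_path)


def tokenize(processor, text):
    """Split text into subword pieces, returned as strings."""
    return processor.encode(text, out_type=str)


# Run the demo only if the (assumed) model file is actually present.
if os.path.exists(MODEL_PATH):
    sp = load_tokenizer(MODEL_PATH)
    print(tokenize(sp, "mujhe yeh movie bahut pasand aayi"))
```

Passing `out_type=int` instead of `out_type=str` returns vocabulary ids rather than pieces, which is the form a language model usually consumes.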
