DataGov-SamagraX/Hinglish_transformer

Hinglish_transformer

Requirement

We work in education in Indian states, and much of the feedback we receive on our products and during field visits is written in Hinglish. Currently we use AI4Bharat's Fairseq transformers (combining Hindi <-> English translation and transliteration) to convert Hinglish sentences to English, which we then use for aggregation, keyword analysis and sentiment analysis.

An example of a Hinglish sentence: 'Aapne achha flash card ke madyam se concept ko clear kiya hai. Aapko bachho ko aur adhik involve karna chiye'

The general structure is that the sentence is in Hindi (written in Roman script), but some words, typically nouns (like flash card), are in English.
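
To make this structure concrete, the following minimal sketch labels each token of the example sentence as English or romanized Hindi. The `ENGLISH_WORDS` set and the `tag_tokens` helper are hypothetical names introduced only for illustration; they are not part of this repository, and a real system would need a proper lexicon or a language-identification model rather than a hand-written word list.

```python
# Hypothetical, tiny English vocabulary used only for this illustration.
ENGLISH_WORDS = {"flash", "card", "concept", "clear", "involve"}


def tag_tokens(sentence, english_words=ENGLISH_WORDS):
    """Label each whitespace-separated token as 'en' or 'hi' (romanized Hindi)."""
    tagged = []
    for token in sentence.split():
        # Ignore surrounding punctuation and case when looking the word up.
        word = token.strip(".,!?'\"").lower()
        tagged.append((token, "en" if word in english_words else "hi"))
    return tagged


sentence = "Aapne achha flash card ke madyam se concept ko clear kiya hai."
for token, lang in tag_tokens(sentence):
    print(f"{token}\t{lang}")
```

In this sketch, any token missing from the word list is treated as romanized Hindi, so the quality of the labels depends entirely on the vocabulary supplied.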

We don't have a transformer trained on Hinglish sentences; instead, we leverage Hindi <-> English transformers to carry out the same task. This approach is currently not very effective, and we hope to improve performance by building a transformer trained on Hinglish sentences.

We have not yet settled on an architecture for this task, and a high BLEU score is not the goal of our first attempt.

We want to first build a pipeline that allows training on Hinglish data and fine-tuning the pretrained model on new datasets. Once the pipeline is set up, we will continue modifying the architecture to improve accuracy.

So far, we haven't found a Hinglish dataset large enough to build an effective model. The current thinking is to leverage existing Hindi <-> English transformers to generate that data, using the Hindi/English corpus datasets we already have.
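
One possible shape for this data generation step is sketched below, assuming a Hindi/English parallel corpus given as (Hindi, English) pairs and some function that romanizes Hindi text. The names `build_hinglish_pairs`, `toy_romanize` and `TOY_ROMANIZATION` are hypothetical, and the lookup table is only a stand-in for a real transliteration model; this is not the repository's implementation.

```python
from typing import Callable, Iterable, List, Tuple

# Toy word-level romanization table, for illustration only. In practice this
# would be replaced by a real Hindi -> Roman script transliteration model.
TOY_ROMANIZATION = {
    "आपने": "aapne",
    "अच्छा": "achha",
    "किया": "kiya",
    "है": "hai",
}


def toy_romanize(hindi_sentence: str) -> str:
    """Romanize word by word; unknown words (e.g. English terms) pass through."""
    return " ".join(TOY_ROMANIZATION.get(w, w) for w in hindi_sentence.split())


def build_hinglish_pairs(
    parallel: Iterable[Tuple[str, str]],
    romanize: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Turn (Hindi, English) pairs into (romanized Hindi, English) pairs."""
    return [(romanize(hindi), english) for hindi, english in parallel]


corpus = [("आपने अच्छा किया है", "You have done well")]
pairs = build_hinglish_pairs(corpus, toy_romanize)
print(pairs)
```

Passing the romanizer in as a parameter keeps the pairing logic independent of whichever transliteration model is eventually chosen.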
