Skip to content

A Language Classifier powered by Recurrent Neural Network implemented in Python without AI libraries. AI from scratch.

License

Notifications You must be signed in to change notification settings

JasonFengGit/RNN-Language-Classifier

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RNN-Language-Classifier

A Language Classifier powered by Recurrent Neural Network(RNN) implemented in Python without AI libraries.

Features

The classifier classifies a word in English, Spanish, Finnish, Dutch, or Polish. The classifier outputs correctly at a rate of approximately 85%. It is purely implemented with numpy and built-in libraries.

Model Architecture

  • Input Layer: 47 nodes representing 47 different characters
  • Output Layer: 5 nodes representing 5 languages

The technique used in this project is called Recurrent Neural Network(RNN):



Here, an RNN is used to encode the word “c-a-t” into a fixed-size vector h3.

Sample Run

Training until validation accuracy achieve a certain level:

epoch 1 iteration 24 validation-accuracy 43.0%
  shaking    English ( 22.4%) Pred: Dutch   |en 22%|es 20%|fi 18%|nl 26%|pl 14%
  relaxing   English ( 23.7%) Pred: Dutch   |en 24%|es 20%|fi 18%|nl 25%|pl 13%
  prophecy   English ( 17.6%) Pred: Spanish |en 18%|es 24%|fi 24%|nl 16%|pl 19%
  tiroteo    Spanish ( 25.8%)               |en 21%|es 26%|fi 18%|nl 18%|pl 17%
  vientre    Spanish ( 24.2%)               |en 17%|es 24%|fi 21%|nl 21%|pl 17%
  estupenda  Spanish ( 31.4%)               |en 16%|es 31%|fi 18%|nl 19%|pl 16%
  osti       Finnish ( 21.2%) Pred: Polish  |en 15%|es 19%|fi 21%|nl 20%|pl 25%
  veljensä   Finnish ( 19.8%) Pred: Spanish |en 21%|es 22%|fi 20%|nl 20%|pl 18%
  aikoinaan  Finnish ( 22.3%)               |en 15%|es 21%|fi 22%|nl 21%|pl 21%
  betwijfel  Dutch   ( 22.8%) Pred: English |en 24%|es 23%|fi 15%|nl 23%|pl 15%
  merkte     Dutch   ( 17.1%) Pred: Spanish |en 17%|es 22%|fi 22%|nl 17%|pl 21%
  beseffen   Dutch   ( 24.5%)               |en 21%|es 19%|fi 21%|nl 25%|pl 15%
  kończę     Polish  ( 21.5%) Pred: Spanish |en 17%|es 23%|fi 20%|nl 18%|pl 21%
  firmy      Polish  ( 20.7%) Pred: Finnish |en 15%|es 22%|fi 23%|nl 19%|pl 21%
  decyzje    Polish  ( 16.2%) Pred: Dutch   |en 19%|es 22%|fi 20%|nl 23%|pl 16%

.
.
.

epoch 6 iteration 153 validation-accuracy 84.2%
  shaking    English ( 86.4%)               |en 86%|es  0%|fi  1%|nl 12%|pl  1%
  relaxing   English ( 84.6%)               |en 85%|es  0%|fi  0%|nl 15%|pl  0%
  prophecy   English ( 54.2%)               |en 54%|es  0%|fi  0%|nl  4%|pl 41%
  tiroteo    Spanish ( 38.9%)               |en 12%|es 39%|fi 36%|nl  6%|pl  8%
  vientre    Spanish ( 43.4%)               |en 19%|es 43%|fi  2%|nl 29%|pl  7%
  estupenda  Spanish ( 75.2%)               |en  1%|es 75%|fi 15%|nl  2%|pl  7%
  osti       Finnish ( 75.7%)               |en  1%|es  1%|fi 76%|nl  3%|pl 20%
  veljensä   Finnish ( 81.7%)               |en  0%|es  1%|fi 82%|nl 17%|pl  0%
  aikoinaan  Finnish ( 99.9%)               |en  0%|es  0%|fi100%|nl  0%|pl  0%
  betwijfel  Dutch   ( 98.7%)               |en  1%|es  0%|fi  0%|nl 99%|pl  1%
  merkte     Dutch   ( 71.9%)               |en 10%|es  1%|fi  6%|nl 72%|pl 10%
  beseffen   Dutch   ( 96.6%)               |en  2%|es  0%|fi  0%|nl 97%|pl  0%
  kończę     Polish  (100.0%)               |en  0%|es  0%|fi  0%|nl  0%|pl100%
  firmy      Polish  ( 29.4%) Pred: English |en 59%|es  5%|fi  2%|nl  5%|pl 29%
  decyzje    Polish  ( 87.7%)               |en  1%|es  1%|fi  0%|nl 10%|pl 88%

Test Results:

test set accuracy is: 83.800000%

User Input:

word: tervetuloa # welcome
predicted language is: Finnish, with a confidence of 80.011147%

word: ciudades # cities
predicted language is: Spanish, with a confidence of 88.442353%

word: właź # hatch
predicted language is: Polish, with a confidence of 99.979566%

word: algorithm
predicted language is: English, with a confidence of 79.893499%

word: resolution
predicted language is: English, with a confidence of 94.786443%

word: ademt # breathe
predicted language is: Dutch, with a confidence of 47.399565%

word: invitar # invite
predicted language is: Spanish, with a confidence of 93.986880%

Dependencies

You will need numpy for this project

pip install numpy

How To Use

clone this project or download the zip file

py run.py

Improvements To Make

  • support save & load models
  • classify more languages
  • improve accuracy
  • classify a sentence or paragraph instead of words
  • ...

Reference

The dataset lang_id.npz, image demonstrating RNN, and project skeleton are from cs188.ml.

About

A Language Classifier powered by Recurrent Neural Network implemented in Python without AI libraries. AI from scratch.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages