jamessr2/Language-Detection

Language Detection

CS 478 Machine Learning Project

Possible features

  • Average number of diacritical marks per sentence
  • Average spacing between diacritical marks (how many other characters appear between consecutive diacritical marks)
  • Types of diacritical marks (acute accent, grave accent, umlaut, etc.)
  • Average vowel cluster size (number of consecutive vowels in a word)
  • Average consonant cluster size (number of consecutive consonants in a word)
  • Contains non-ASCII characters? true or false
  • Uses non-Latin characters? true or false
  • Average word length
  • Average number of words per sentence
  • Length of the text sample in tokens (words and punctuation symbols); not useful in and of itself, but may be helpful in higher-order combinations with other features
  • Percentage of the writing sample in each script we end up detecting (Latin, Greek, Cyrillic, Hebrew, Asian scripts, etc.)
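
The features listed above could be computed along the following lines. This is only a minimal sketch, not the project's actual code. The function and key names (`extract_features`, `mark_gap`, `script_share`, and so on) are hypothetical. It assumes that sentences end at `.`, `!` or `?`, that vowels are the Latin letters a, e, i, o, u after diacritics are stripped, and that the first word of a character's Unicode name is a good enough stand-in for its script. Letters such as `ø` that do not decompose under NFD are not counted as carrying a diacritical mark.

```python
import re
import unicodedata
from collections import Counter

VOWELS = set("aeiou")  # assumption: Latin vowels only, compared after stripping marks


def combining_marks(ch):
    """Names of the combining marks found when ch is decomposed (NFD)."""
    return [unicodedata.name(c, "UNKNOWN MARK")
            for c in unicodedata.normalize("NFD", ch) if unicodedata.combining(c)]


def base_letter(ch):
    """Lowercased character with its combining marks removed."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def script_of(ch):
    """Rough script label: first word of the Unicode name (LATIN, CYRILLIC, ...)."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def is_vowel(ch):
    return base_letter(ch) in VOWELS


def is_consonant(ch):
    # Only Latin letters are classified; other scripts count as neither.
    return script_of(ch) == "LATIN" and not is_vowel(ch)


def avg_run(words, predicate):
    """Mean length of maximal runs of characters satisfying predicate."""
    runs = []
    for word in words:
        run = 0
        for ch in word:
            if predicate(ch):
                run += 1
            elif run:
                runs.append(run)
                run = 0
        if run:
            runs.append(run)
    return sum(runs) / len(runs) if runs else 0.0


def extract_features(text):
    text = unicodedata.normalize("NFC", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]  # naive split
    words = re.findall(r"[^\W\d_]+", text)      # runs of letters
    tokens = re.findall(r"\w+|[^\w\s]", text)    # words and punctuation symbols
    letters = [ch for word in words for ch in word]
    mark_types = Counter(m for ch in letters for m in combining_marks(ch))
    # Positions of marked characters; gaps count every character, spaces included.
    marked_at = [i for i, ch in enumerate(text) if combining_marks(ch)]
    gaps = [b - a - 1 for a, b in zip(marked_at, marked_at[1:])]
    scripts = Counter(script_of(ch) for ch in letters)
    n_sent = max(len(sentences), 1)
    n_words = max(len(words), 1)
    n_letters = max(len(letters), 1)
    return {
        "marks_per_sentence": sum(mark_types.values()) / n_sent,
        "mark_gap": sum(gaps) / len(gaps) if gaps else 0.0,
        "mark_types": dict(mark_types),
        "vowel_cluster": avg_run(words, is_vowel),
        "consonant_cluster": avg_run(words, is_consonant),
        "non_ascii": any(ord(ch) > 127 for ch in text),
        "non_latin": any(script != "LATIN" for script in scripts),
        "avg_word_length": len(letters) / n_words,
        "words_per_sentence": len(words) / n_sent,
        "n_tokens": len(tokens),
        "script_share": {s: n / n_letters for s, n in scripts.items()},
    }
```

Calling `extract_features("Où est le café? Il est là.")` returns a dictionary with one entry per feature, which can then be flattened into a numeric vector for whichever classifier is used. The `mark_types` and `script_share` entries are themselves dictionaries, so they would need one column per mark type or script.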

Possible languages

  • English //Check
  • Spanish //Check
  • French //Check
  • Italian //Check
  • German //Check
  • Portuguese //Check
  • Finnish //Check
  • Norwegian //Check
  • Dutch //Check
  • Danish //Check
  • Swedish //Check
  • Russian //Check
  • Ukrainian //Check
  • Afrikaans //Check
  • Vietnamese //Check
  • Bosnian //Check
  • Czech //Check
  • Esperanto //Check
  • Gaelic //Check
  • Polish //Check
  • Serbian //Check
  • Swahili //Check
  • Welsh //Check
  • Tagalog //Check
  • Greek //Check
  • Coptic //No longer spoken as an everyday language (survives in liturgical use)
  • Arabic //Check
  • Kurdish //Check

etc.
