jamessr2/Language-Detection

Language Detection

CS 478 Machine Learning Project

Possible features

  • Average number of diacritical marks per sentence
  • Average spacing between diacritical marks (how many other characters appear between consecutive marks)
  • Types of diacritical marks (acute accent, grave accent, umlaut, etc.)
  • Average vowel cluster size (number of consecutive vowels in a word)
  • Average consonant cluster size (number of consecutive consonants in a word)
  • Contains non-ASCII characters? True or false
  • Uses non-Latin characters? True or false
  • Average word length
  • Average number of words per sentence
  • Length of the text sample in tokens (words and punctuation symbols). Not useful in and of itself, but may be helpful in higher-order combinations with other features
  • Percentage of the writing sample in each script we end up detecting (Latin, Greek, Cyrillic, Hebrew, Asian scripts, etc.)
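
Several of the features above could be computed roughly as in the sketch below. This is only an illustration, not the project's actual code: the function names (`extract_features`, `avg_cluster_size`), the sentence splitter (a split on `.`, `!`, `?`), the five-letter Latin vowel set, and the use of the first word of a character's Unicode name as its script are all assumptions made for the example.

```python
import re
import unicodedata
from collections import Counter

# Assumption: only the five basic Latin vowels count as vowels.
VOWELS = set("aeiou")


def base_char(ch):
    """Return the character with any diacritics stripped (via NFD)."""
    return unicodedata.normalize("NFD", ch)[0]


def avg_cluster_size(words, want_vowels):
    """Average length of runs of consecutive vowels (or consonants)."""
    sizes = []
    for word in words:
        run = 0
        for ch in word.lower():
            is_vowel = base_char(ch) in VOWELS
            if ch.isalpha() and is_vowel == want_vowels:
                run += 1
            else:
                if run:
                    sizes.append(run)
                run = 0
        if run:
            sizes.append(run)
    return sum(sizes) / len(sizes) if sizes else 0.0


def extract_features(text):
    """Compute a hypothetical subset of the features listed above."""
    text = unicodedata.normalize("NFC", text)
    # Naive sentence split; real text would need something more careful.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)        # runs of letters only
    tokens = re.findall(r"\w+|[^\w\s]", text)     # words and punctuation
    # Combining marks appear once the text is decomposed with NFD.
    marks = sum(1 for ch in unicodedata.normalize("NFD", text)
                if unicodedata.combining(ch))
    # Assumption: the first word of the Unicode name identifies the script,
    # e.g. "LATIN SMALL LETTER A" or "CYRILLIC CAPITAL LETTER PE".
    scripts = Counter(unicodedata.name(ch, "UNKNOWN").split()[0]
                      for ch in text if ch.isalpha())
    letters = sum(scripts.values())
    n_sent = len(sentences) or 1
    n_words = len(words) or 1
    return {
        "diacritics_per_sentence": marks / n_sent,
        "avg_vowel_cluster": avg_cluster_size(words, True),
        "avg_consonant_cluster": avg_cluster_size(words, False),
        "has_non_ascii": any(ord(ch) > 127 for ch in text),
        "has_non_latin": any(name != "LATIN" for name in scripts),
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "words_per_sentence": len(words) / n_sent,
        "token_count": len(tokens),
        "script_share": {name: count / letters
                         for name, count in scripts.items()},
    }
```

Calling `extract_features("Привет мир")` would, under these assumptions, report `has_non_latin` as true and a `script_share` made up entirely of `CYRILLIC`. Note that the vowel set only covers Latin letters, so for other scripts every letter would be counted as a consonant; a per-script vowel table would be needed to make the cluster features meaningful there.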

Possible languages

  • English //Check
  • Spanish //Check
  • French //Check
  • Italian //Check
  • German //Check
  • Portuguese //Check
  • Finnish //Check
  • Norwegian //Check
  • Dutch //Check
  • Danish //Check
  • Swedish //Check
  • Russian //Check
  • Ukrainian //Check
  • Afrikaans //Check
  • Vietnamese //Check
  • Bosnian //Check
  • Czech //Check
  • Esperanto //Check
  • Gaelic //Check
  • Polish //Check
  • Serbian //Check
  • Swahili //Check
  • Welsh //Check
  • Tagalog //Check
  • Greek //Check
  • Coptic //No longer spoken as an everyday language (survives mainly in liturgical use)
  • Arabic //Check
  • Kurdish //Check

etc.
