CS 478 Machine Learning Project
- Average number of diacritical marks per sentence
- Average spacing between diacritical marks (how many other characters appear between consecutive diacritical marks)
- Types of diacritical marks (acute accent, grave accent, umlaut, etc.)
- Average vowel cluster size (number of consecutive vowels in a word)
- Average consonant cluster size (number of consecutive consonants in a word)
- Contains non-ASCII characters? true or false
- Uses non-Latin characters? true or false
- Average word length
- Average number of words per sentence
- Length of the text sample in tokens (words and punctuation symbols); not useful in and of itself, but may be helpful in higher-order combinations with other features
- Percentage of the writing sample in each script we end up detecting (Latin, Greek, Cyrillic, Hebrew, Asian scripts, etc.)
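
The features above could be computed roughly as follows. This is only a sketch under stated assumptions, not the project's actual extractor. The function name `extract_features`, the helper names, and the dictionary keys are all hypothetical. It assumes sentences end at `.`, `!` or `?`, that only the base Latin letters a, e, i, o, u count as vowels, that a "diacritical mark" is any Unicode combining mark found after NFD decomposition (so letters such as Cyrillic й also count), and that the first word of a character's Unicode name is a good enough script label.

```python
import re
import unicodedata
from collections import Counter

VOWELS = set("aeiou")  # assumption: only base Latin vowels count as vowels


def mean(values):
    """Arithmetic mean, or 0.0 for an empty sequence."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def mark_names(ch):
    """Unicode names of the combining marks on a character (empty if none)."""
    nfd = unicodedata.normalize("NFD", ch)
    return [unicodedata.name(c, "UNKNOWN") for c in nfd if unicodedata.combining(c)]


def base_letter(ch):
    """Lowercased character with any combining marks stripped."""
    nfd = unicodedata.normalize("NFD", ch.lower())
    return "".join(c for c in nfd if not unicodedata.combining(c))


def script_of(ch):
    """Rough script label: the first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def run_lengths(word, predicate):
    """Lengths of maximal runs of consecutive characters satisfying predicate."""
    runs, current = [], 0
    for ch in word:
        if predicate(ch):
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def extract_features(text):
    """Compute the listed features for one text sample (hypothetical key names)."""
    text = unicodedata.normalize("NFC", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)      # runs of letters
    tokens = re.findall(r"\w+|[^\w\s]", text)   # words and punctuation symbols
    letters = [ch for ch in text if ch.isalpha()]

    # Positions of characters that carry at least one diacritical mark.
    mark_positions = [i for i, ch in enumerate(text) if mark_names(ch)]
    gaps = [b - a - 1 for a, b in zip(mark_positions, mark_positions[1:])]
    scripts = Counter(script_of(ch) for ch in letters)

    def is_vowel(ch):
        return base_letter(ch) in VOWELS

    def is_consonant(ch):
        return ch.isalpha() and not is_vowel(ch)

    return {
        "marks_per_sentence": len(mark_positions) / len(sentences) if sentences else 0.0,
        "avg_chars_between_marks": mean(gaps),
        "mark_types": sorted({n for ch in text for n in mark_names(ch)}),
        "avg_vowel_cluster": mean(r for w in words for r in run_lengths(w, is_vowel)),
        "avg_consonant_cluster": mean(r for w in words for r in run_lengths(w, is_consonant)),
        "contains_non_ascii": any(ord(ch) > 127 for ch in text),
        "uses_non_latin": any(s != "LATIN" for s in scripts),
        "avg_word_length": mean(len(w) for w in words),
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "length_in_tokens": len(tokens),
        "script_share": {s: n / len(letters) for s, n in scripts.items()},
    }
```

Calling `extract_features("Café crème. Où est-il?")` returns one dictionary per sample, which could then be flattened into a feature vector for whichever learner is used. Letters that are not formed with combining marks (for example ø or ß) are not counted as carrying diacritics under these assumptions, so a hand-built table may be needed for them.
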
- English //Check
- Spanish //Check
- French //Check
- Italian //Check
- German //Check
- Portuguese //Check
- Finnish //Check
- Norwegian //Check
- Dutch //Check
- Danish //Check
- Swedish //Check
- Russian //Check
- Ukrainian //Check
- Afrikaans //Check
- Vietnamese //Check
- Bosnian //Check
- Czech //Check
- Esperanto //Check
- Gaelic //Check
- Polish //Check
- Serbian //Check
- Swahili //Check
- Welsh //Check
- Tagalog //Check
- Greek //Check
- Coptic //No longer a living spoken language (survives in liturgical use)
- Arabic //Check
- Kurdish //Check
etc.