Language Detection

CS 478 Machine Learning Project

Possible features

Average number of diacritical marks per sentence
Average spacing between diacritical marks (how many other characters appear between consecutive diacritical marks)
Types of diacritical marks present (acute accent, grave accent, umlaut, etc.)
Average vowel cluster size (number of consecutive vowels in a word)
Average consonant cluster size (number of consecutive consonants in a word)
Contains non-ASCII characters? true or false
Uses non-Latin characters? true or false
Average word length
Average number of words per sentence
Length of the text sample in tokens (words and punctuation symbols); not useful in and of itself, but possibly helpful in higher-order combinations with other features
Percentage of the writing sample written in each script we end up detecting (Latin, Greek, Cyrillic, Hebrew, Asian scripts, etc.)

etc.
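
As a rough illustration of how several of the features listed above could be computed, here is a minimal sketch using only the Python standard library. It is not the code in feature-extract.py; the function names (`extract_features`, `marks_of`, `script_of`, and so on) are hypothetical. It assumes that a diacritical mark is any combining mark in a character's canonical (NFD) decomposition, that vowels are the Latin letters a, e, i, o, u after diacritics are stripped, that sentences end at `.`, `!` or `?`, and that the first word of a character's Unicode name is a good enough label for its script.

```python
import re
import unicodedata
from collections import Counter
from itertools import groupby

VOWELS = set("aeiou")  # assumption: Latin vowels only


def marks_of(ch):
    """Names of the combining marks in ch's canonical (NFD) decomposition."""
    return [unicodedata.name(c, "UNKNOWN")
            for c in unicodedata.normalize("NFD", ch)
            if unicodedata.combining(c)]


def is_vowel(ch):
    """True if the base letter of ch (diacritics removed) is a Latin vowel."""
    return unicodedata.normalize("NFD", ch)[0].lower() in VOWELS


def script_of(ch):
    """Rough script label: the first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def mean(values):
    """Arithmetic mean, or 0.0 for an empty sequence."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def cluster_sizes(words, vowels):
    """Lengths of every run of consecutive vowels (or consonants) in each word."""
    return [len(list(run))
            for word in words
            for flag, run in groupby(word, key=is_vowel)
            if flag == vowels]


def extract_features(text):
    """Compute a dictionary of simple orthographic features for one text sample."""
    text = unicodedata.normalize("NFC", text)  # compose base letter + mark pairs
    words = re.findall(r"[^\W\d_]+", text)     # runs of Unicode letters
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    letters = [c for c in text if c.isalpha()]
    marks = [m for c in text for m in marks_of(c)]
    marked = [i for i, c in enumerate(text) if marks_of(c)]  # positions of marked chars
    scripts = Counter(script_of(c) for c in letters)
    return {
        "diacritics_per_sentence": len(marks) / max(len(sentences), 1),
        "avg_gap_between_diacritics": mean(b - a - 1 for a, b in zip(marked, marked[1:])),
        "diacritic_types": Counter(marks),
        "avg_vowel_cluster": mean(cluster_sizes(words, True)),
        "avg_consonant_cluster": mean(cluster_sizes(words, False)),
        "has_non_ascii": any(ord(c) > 127 for c in text),
        "has_non_latin": any(script_of(c) != "LATIN" for c in letters),
        "avg_word_length": mean(len(w) for w in words),
        "avg_words_per_sentence": len(words) / max(len(sentences), 1),
        "script_share": {s: n / len(letters) for s, n in scripts.items()},
    }


if __name__ == "__main__":
    for name, value in extract_features("Él está aquí. ¿Dónde vives tú?").items():
        print(name, value)
```

Because the vowel set is Latin-only, every letter in a non-Latin script is counted as a consonant here, so the cluster features are only meaningful for Latin-script samples unless the vowel set is extended per script.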
