Language identification

This project contains code for a language identification/document classification task as described in the e-mail.

It is self-contained, including training data and evaluation scripts.

For calling instructions and sample output, see sampleOutput.md in this folder.

Data

To test and evaluate the implemented code, I chose an established data set: the data from the Discriminating between Similar Languages (DSL) shared task of the VarDial Workshop @ COLING 2014.

Details on the task and the data can be found at the website or the Bitbucket repo.

For an overview paper of the evaluation, see [1].

The data files DSLCC.zip and DSLCC-eval.zip are already included in the data subdirectory of this project.

Implemented Methods

Because of its popularity, simplicity, and suitability for the task, I chose to implement a Naive Bayes classifier. Laplace smoothing is used, and it is also applied to out-of-vocabulary words. Normally I would use library code rather than re-implement a classifier, but here only the Java 8 standard library is used.
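As a rough illustration of the approach described above, here is a minimal sketch of a word-unigram Naive Bayes classifier with Laplace (add-one) smoothing, using only the Java 8 standard library. It is not the code from this repository: the class name `NaiveBayesSketch`, its method names, the pre-tokenized input, and the choice to reserve a single extra pseudo-count slot for out-of-vocabulary words are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: word-unigram Naive Bayes with add-one smoothing. */
public class NaiveBayesSketch {
    // Word counts per language, token totals per language, sentence counts per language.
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> tokenTotals = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Adds one labeled, already tokenized training sentence. */
    public void train(String label, List<String> tokens) {
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
            tokenTotals.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
    }

    /** Returns log P(label) + sum of log P(token | label) for a label seen in training. */
    public double logScore(String label, List<String> tokens) {
        Map<String, Integer> counts = wordCounts.get(label);
        // Assumption: the extra +1 reserves one slot for any out-of-vocabulary word.
        double denominator = tokenTotals.getOrDefault(label, 0) + vocabulary.size() + 1;
        double score = Math.log(docCounts.get(label) / (double) totalDocs);
        for (String token : tokens) {
            int count = counts.getOrDefault(token, 0); // 0 for unseen or OOV words
            score += Math.log((count + 1) / denominator);
        }
        return score;
    }

    /** Returns the label with the highest log-score, or null if nothing was trained. */
    public String classify(List<String> tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : wordCounts.keySet()) {
            double score = logScore(label, tokens);
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("en", Arrays.asList("the", "cat", "sat"));
        nb.train("de", Arrays.asList("die", "katze", "sass"));
        System.out.println(nb.classify(Arrays.asList("the", "cat"))); // prints "en"
    }
}
```

Scores are summed in log space so that long sentences do not underflow to zero. Because of the smoothing, a word never seen for a language still contributes a small non-zero probability instead of ruling that language out.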

Comparison with published results

Compared with the results on the workshop website, the implemented approach (which is really quite straightforward) would theoretically rank third among the eight submissions.

Among the approaches covered by the overview paper of the task [1] and the individual submissions, the basic yet relatively successful one described in [2] corresponds to this implementation. It shows that, as a simple baseline, a Naive Bayes classifier (cf. Table 3) based on word unigram probabilities (cf. Table 2) easily picks the low-hanging fruit.

References

[1] Zampieri, M., Tan, L., Ljubešić, N., & Tiedemann, J. (2014). A report on the DSL shared task 2014. COLING 2014, 58.

[2] King, B., Radev, D., & Abney, S. (2014). Experiments in sentence language identification with groups of similar languages. COLING 2014, 473, 146.
