Polyortho

A lightweight language detection algorithm

Polyortho (POLYnomial regression + ORTHOgraphy) is a lightweight language detection algorithm that applies concepts from natural language processing, statistical analysis, regression analysis and integral calculus to detect which of 21 languages an input text is written in.
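As a rough illustration of how polynomial regression and integration can be combined to compare two frequency curves, here is a minimal sketch. It is not taken from the Polyortho source: the function names (polyfit, evaluate, area_between), the normal-equation fit, and the trapezoidal rule are all assumptions chosen for the example, and the real project may model and compare its data quite differently.

```python
# Hypothetical sketch, not Polyortho's actual code.

def polyfit(xs, ys, degree):
    """Least-squares polynomial fit. Returns coefficients, lowest power first."""
    n = degree + 1
    # Normal equations A c = b with A[i][j] = sum(x^(i+j)), b[i] = sum(y * x^i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate a polynomial given coefficients ordered lowest power first."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


def area_between(p, q, lo=0.0, hi=1.0, steps=1000):
    """Trapezoidal estimate of the integral of |p(x) - q(x)| over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * abs(evaluate(p, x) - evaluate(q, x))
    return total * h
```

In this sketch, a smaller area between the curve fitted to the input text and the curve fitted to a language's training data would indicate a closer match. The fit needs more distinct x values than the chosen degree, otherwise the normal equations are singular.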

The supported languages are Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene, and Swedish.

The data used to train and test the model are from the European Parliament Proceedings Parallel Corpus 1996-2011.

The program has three analysis modes: grapheme frequency, grapheme combination frequency, and word-final grapheme frequency. Because frequency estimates become more reliable as the sample grows, Polyortho works best with at least 10 KB of sample text.
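The three analysis modes can be pictured with the following sketch. It is an approximation written for this README, not the project's own feature extraction: it treats each alphabetic Unicode character as one grapheme (real graphemes can span several code points), assumes a "combination" means an adjacent pair within a word, and uses hypothetical function names.

```python
from collections import Counter


def normalise(counts):
    """Turn raw counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()} if total else {}


def grapheme_freq(text):
    """Relative frequency of each letter (assumed: one character = one grapheme)."""
    return normalise(Counter(ch for ch in text.lower() if ch.isalpha()))


def combination_freq(text, n=2):
    """Relative frequency of adjacent letter groups of length n within words."""
    counts = Counter()
    for word in text.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        counts.update(letters[i:i + n] for i in range(len(letters) - n + 1))
    return normalise(counts)


def word_final_freq(text):
    """Relative frequency of the last letter of each word."""
    counts = Counter()
    for word in text.lower().split():
        letters = [ch for ch in word if ch.isalpha()]
        if letters:
            counts[letters[-1]] += 1
    return normalise(counts)
```

Each function returns a dictionary mapping a grapheme (or grapheme pair) to its share of the total, so profiles built from texts of different lengths can be compared directly.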

Here's how to use Polyortho:

Clone the repository:

git clone git@github.com:cbeimers113/polyortho.git

Download the Europarl corpus (1.5 GB) and extract the 'txt' folder to the data/ directory of the project.

Place the text to be analyzed in input.txt (at least 10 KB of text).

Run main.py with Python 3 (3.8 or higher recommended):

python main.py

or

python3 main.py

The output should look something like this:


Analyzing.....................
Modelling......................
Integrating.....................
Detected: <target language>
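If a run does not produce output like this, the two setup steps (corpus extraction and input size) are worth checking first. The script below is a hedged sketch for that purpose and is not part of the repository. It assumes the extracted corpus sits at data/txt/ with one subfolder per language code, and that 10 KB means 10 * 1024 bytes; both are assumptions, and the function names are made up for the example.

```python
from pathlib import Path

# Assumed folder names: two-letter codes for the 21 supported languages.
EUROPARL_CODES = [
    "bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "hu",
    "it", "lt", "lv", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
]

MIN_BYTES = 10 * 1024  # assumption: "10 KB" read as 10 * 1024 bytes


def missing_language_dirs(data_dir="data"):
    """Return the language codes with no folder under <data_dir>/txt/."""
    txt = Path(data_dir) / "txt"
    return [code for code in EUROPARL_CODES if not (txt / code).is_dir()]


def has_enough_text(path="input.txt", min_bytes=MIN_BYTES):
    """Return True if the input file is at least min_bytes long."""
    return Path(path).stat().st_size >= min_bytes
```

Run from the project root, missing_language_dirs() returning an empty list and has_enough_text() returning True would suggest the setup matches what the steps describe.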

If the input language is not supported, Polyortho should report the supported language whose profile is most similar to the input under the chosen analysis mode.
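The reason an unsupported language still yields an answer is that this kind of detector picks the best match from a fixed set rather than rejecting the input. The sketch below shows the idea using plain Euclidean distance between frequency dictionaries. That distance is a stand-in chosen for brevity, not Polyortho's own similarity measure, and the names distance and closest_language are hypothetical.

```python
import math


def distance(p, q):
    """Euclidean distance between two frequency dictionaries."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def closest_language(sample, profiles):
    """Return the key of the profile nearest to the sample.

    Always returns one of the known languages, even when none is a good match.
    """
    return min(profiles, key=lambda lang: distance(sample, profiles[lang]))
```

Because closest_language always returns some key from profiles, a text in an unsupported language is simply assigned to whichever supported language it resembles most.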
