lingua - language detection done right

lingua is a language detection library for Kotlin and Java, suitable for long and short text alike.

What does this library do?

Its task is simple: it tells you which language a given piece of text is written in. This is useful as a preprocessing step for linguistic data in natural language processing applications such as text classification and spell checking. Another use case is routing e-mails to the customer service department for the right region, based on the language each e-mail is written in.

Why does this library exist?

Language detection is often done as part of large machine learning frameworks or natural language processing applications. If you don't need the full functionality of those systems, or don't want to spend time learning them, a small, flexible library comes in handy.

So far, on the JVM the only other comprehensive open source library for this task is language-detector. Unfortunately, it has two major drawbacks:

Detection only works with quite lengthy text fragments. For very short text snippets such as Twitter messages, it doesn't provide adequate results.
Configuration of the library is quite cumbersome and requires some knowledge about the statistical methods that are used internally.

lingua aims to eliminate these problems. It needs almost no configuration and yields fairly accurate results on both long and short text, even on single words and phrases. It draws on both rule-based and statistical methods but does not use any dictionaries.

Which languages are supported?

Currently, the following seven languages are supported:

Language	ISO 639-1 code
English	en
French	fr
German	de
Italian	it
Latin	la
Portuguese	pt
Spanish	es

How good is it?

lingua is able to report accuracy statistics for some bundled test data available for each supported language. The test data for each language is split into three parts:

a list of single words with a minimum length of 5 characters
a list of word pairs with a minimum length of 10 characters
a list of complete grammatical sentences of various lengths

Both the language models and the test data have been created from the Wortschatz corpora offered by Leipzig University, Germany.

Running mvn test -P accuracy-reports creates a report file for each language under target/surefire-reports. As an example, here is the current output of the German report:

com.github.pemistahl.lingua.detector.report.GermanDetectionAccuracyReport afterAll 

##### GERMAN #####

>>> Accuracy on average: 95,28%

>> Detection of 11748 single words (average length: 10 chars)
Accuracy: 89,31%
Erroneously classified as LATIN: 3,10%, ENGLISH: 2,47%, FRENCH: 2,00%, ITALIAN: 1,54%, SPANISH: 0,87%, PORTUGUESE: 0,72%

>> Detection of 9347 word pairs (average length: 17 chars)
Accuracy: 97,86%
Erroneously classified as ENGLISH: 0,76%, LATIN: 0,70%, FRENCH: 0,27%, ITALIAN: 0,22%, SPANISH: 0,13%, PORTUGUESE: 0,06%

>> Detection of 10000 sentences (average length: 47 chars)
Accuracy: 98,67%
Erroneously classified as ENGLISH: 0,96%, LATIN: 0,12%, PORTUGUESE: 0,11%, FRENCH: 0,05%, ITALIAN: 0,05%, SPANISH: 0,04%
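
The average at the top of the report is consistent with a plain, unweighted mean of the three category accuracies, which can be checked from the figures above. The following Java sketch redoes that arithmetic. The class and method names are made up for illustration and are not part of lingua, and the assumption that lingua computes the average this way is inferred from the numbers only.

```java
import java.util.Locale;

class AccuracyAverage {

    // Unweighted arithmetic mean of the given accuracy percentages.
    static double mean(double... accuracies) {
        double sum = 0.0;
        for (double accuracy : accuracies) {
            sum += accuracy;
        }
        return sum / accuracies.length;
    }

    public static void main(String[] args) {
        // Figures copied from the German report: single words, word pairs, sentences.
        double average = mean(89.31, 97.86, 98.67);
        System.out.println(String.format(Locale.ROOT, "%.2f%%", average)); // prints 95.28%
    }
}
```

Because the mean is unweighted, each of the three categories contributes equally to the average, regardless of how many test items it contains.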

Here is a summary of all accuracy reports of the current lingua version 0.2.0. All supported languages have been taken into account during the classification process. Accuracy values are stated as rounded percentages.

Language	Average	Single Words	Word Pairs	Sentences
English	87	72	89	98
French	90	78	93	97
German	95	89	98	99
Italian	87	73	92	96
Latin	88	80	94	89
Portuguese	86	69	90	99
Spanish	83	64	87	98
Overall	88	75	92	97

How to build?

git clone https://github.com/pemistahl/lingua.git
cd lingua
mvn install

Maven's package phase can generate two jar files in the target directory:

mvn package creates lingua-0.2.0.jar that contains the compiled sources only.
mvn package -P with-dependencies creates lingua-0.2.0-with-dependencies.jar that additionally contains all dependencies needed to use the library. This jar file can be included in projects without dependency management systems. It can also be used to run lingua in standalone mode (see below).

How to use?

lingua can be used programmatically in your own code or in standalone mode.

Programmatic use

The API is pretty straightforward and can be used in both Kotlin and Java code.

/* Kotlin */

import com.github.pemistahl.lingua.detector.LanguageDetector
import com.github.pemistahl.lingua.model.Language

println(LanguageDetector.supportedLanguages())
// [ENGLISH, FRENCH, GERMAN, ITALIAN, LATIN, PORTUGUESE, SPANISH]

val detector = LanguageDetector.fromAllBuiltInLanguages()
val detectedLanguage: Language = detector.detectLanguageOf(text = "languages are awesome")

// returns Language.ENGLISH

If a string's language cannot be detected reliably because of missing linguistic information, Language.UNKNOWN is returned. The public API of lingua never returns null, so it is safe to use from Java code as well.

/* Java */

import com.github.pemistahl.lingua.detector.LanguageDetector;
import com.github.pemistahl.lingua.model.Language;

final LanguageDetector detector = LanguageDetector.fromAllBuiltInLanguages();
final Language detectedLanguage = detector.detectLanguageOf("languages are awesome");

// returns Language.ENGLISH
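
Since Language.UNKNOWN is a regular return value, calling code should decide what to do with it. As a rough sketch of the e-mail routing use case mentioned earlier, the Java example below maps detected languages to departments and sends everything else, including UNKNOWN, to a fallback. To stay compilable without lingua on the classpath it uses a stand-in Language enum and takes the detector as a plain function. MailRouter, its methods, and the department names are hypothetical and not part of lingua.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.function.Function;

class MailRouter {

    // Stand-in for lingua's Language enum (an assumption, so the sketch compiles alone).
    enum Language { ENGLISH, FRENCH, GERMAN, ITALIAN, LATIN, PORTUGUESE, SPANISH, UNKNOWN }

    private final Function<String, Language> detect;
    private final String fallbackDepartment;
    private final Map<Language, String> departments = new EnumMap<>(Language.class);

    MailRouter(Function<String, Language> detect, String fallbackDepartment) {
        this.detect = detect;
        this.fallbackDepartment = fallbackDepartment;
    }

    // Registers the department responsible for a language; returns this for chaining.
    MailRouter route(Language language, String department) {
        departments.put(language, department);
        return this;
    }

    // Any language without a registered department (UNKNOWN in this demo) falls back.
    String departmentFor(String mailBody) {
        return departments.getOrDefault(detect.apply(mailBody), fallbackDepartment);
    }

    public static void main(String[] args) {
        // Hard-coded stub instead of a real detector, for demonstration only.
        Function<String, Language> stub =
            text -> text.startsWith("Guten") ? Language.GERMAN : Language.UNKNOWN;

        MailRouter router = new MailRouter(stub, "international-desk")
            .route(Language.GERMAN, "support-berlin");

        System.out.println(router.departmentFor("Guten Tag")); // prints support-berlin
        System.out.println(router.departmentFor("???"));       // prints international-desk
    }
}
```

With lingua available, the stub would presumably be replaced by a reference to the detector's detectLanguageOf method, and the stand-in enum by com.github.pemistahl.lingua.model.Language.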

There might be classification tasks where you know beforehand that your language data is definitely not written in Latin, for instance (what a surprise :-). Detection accuracy can improve in such cases if you exclude certain languages from the decision process or explicitly include only the relevant ones:

// include only languages that are not yet extinct (= currently excludes Latin)
LanguageDetector.fromAllBuiltInSpokenLanguages()

// exclude the Spanish language from the decision algorithm
LanguageDetector.fromAllBuiltInLanguagesWithout(Language.SPANISH)

// only decide between English and German
LanguageDetector.fromLanguages(Language.ENGLISH, Language.GERMAN)

Standalone mode

If you want to try out lingua before you decide whether to use it or not, you can run it in a REPL and immediately see its detection results.

With Maven: mvn exec:java
Without Maven: java -jar lingua-0.2.0-with-dependencies.jar

Then just play around:

This is Lingua.
Loading language models...
Done. 7 language models loaded.

Type some text and press <Enter> to detect its language.
Type :quit to exit.

> Good day
ENGLISH
> Guten Tag
GERMAN
> Bonjour
FRENCH
> Buon giorno
ITALIAN
> Buenos dias
SPANISH
> Bom dia
PORTUGUESE
> :quit
Bye! Ciao! Tschüss! Salut!

What's next for upcoming versions?

languages, languages, even more languages :-)
accuracy improvements
more unit tests
public API stability until version 1.0.0
