Luga

Blazing-fast language detection using fastText's language models.

Luga is a Swahili word for language. fastText provides a blazing-fast language detection tool. Lamentably, fastText's API is beauty-less, and the documentation is a bit fuzzy. It is also funky that we have to manually download and load models.

Here is where luga comes in. We abstract unnecessary steps and allow you to do precisely one thing: detecting text language.

cover image

Stand Still. Stay Silent - The relationships between Indo-European and Uralic languages by Minna Sundberg.

Show, don't tell

Installation

python -m pip install -U luga

Usage:

⚠️ Note: The first usage downloads the model for you. It will take a bit longer to import depending on internet speed. It is done only once.

from luga import language

print(language("the world ended yesterday"))

# Language(name='en', score=0.98)
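The one-time download mentioned in the note can be pictured as a "fetch only if missing" check. The sketch below is an illustration of that idea, not Luga's actual implementation: `ensure_model`, `_download`, and the cache location are hypothetical names chosen for this example, and the URL is the one used in the "Without Luga" section.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlretrieve

# URL as listed in the "Without Luga" section; the cache path is a hypothetical choice.
MODEL_URL = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin"
DEFAULT_CACHE = Path.home() / ".cache" / "lid-example" / "lid.176.bin"


def _download(url: str, destination: Path) -> None:
    """Fetch `url` into `destination` using only the standard library."""
    urlretrieve(url, str(destination))


def ensure_model(
    path: Path = DEFAULT_CACHE,
    url: str = MODEL_URL,
    fetch: Callable[[str, Path], None] = _download,
) -> Path:
    """Return the local model path, downloading the file only if it is missing."""
    if not path.exists():
        # First use: create the cache folder and fetch the model once.
        path.parent.mkdir(parents=True, exist_ok=True)
        fetch(url, path)
    # Later uses: the file already exists, so nothing is downloaded.
    return path
```

Passing the download step in as `fetch` keeps the check easy to try out without touching the network.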

Given a list of texts, we can create a boolean mask for a filtering pipeline, which can be used, for example, with DataFrames:

from luga import languages
import pandas as pd

examples = ["Jeg har ikke en rød reje", "Det blæser en halv pelican", "We are not robots yet"]
languages(texts=examples, only_language=True, to_array=True) == "en"
# output
# array([False, False, True])

dataf = pd.DataFrame({"text": examples})
# select the "text" column so the result is a Series, as in the output below
dataf.loc[lambda d: languages(texts=d["text"].to_list(), only_language=True, to_array=True) == "en", "text"]
# output
# 2    We are not robots yet
# Name: text, dtype: object

Without Luga:

Download the model

wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -O /tmp/lid.176.bin

Load and use

import fasttext

PATH_TO_MODEL = '/tmp/lid.176.bin'
fmodel = fasttext.load_model(PATH_TO_MODEL)
fmodel.predict(["the world has ended yesterday"])

# ([['__label__en']], [array([0.98046654], dtype=float32)])
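To compare the raw result with the tidy `Language(name='en', score=0.98)` output shown under Usage, here is a minimal sketch of the clean-up involved. It is not Luga's code: `Detected` and `tidy_predictions` are hypothetical names, and the NumPy array from the raw output is written as a plain list so the example runs with the standard library only.

```python
from typing import List, NamedTuple

LABEL_PREFIX = "__label__"  # prefix seen on the labels in the raw output above


class Detected(NamedTuple):
    """Hypothetical result type, shaped like the Language(...) output under Usage."""

    name: str
    score: float


def tidy_predictions(labels, scores) -> List[Detected]:
    """Keep the top label of each text, drop the prefix, and round the score."""
    results = []
    for label_row, score_row in zip(labels, scores):
        top_label = label_row[0]
        if top_label.startswith(LABEL_PREFIX):
            top_label = top_label[len(LABEL_PREFIX):]
        results.append(Detected(name=top_label, score=round(float(score_row[0]), 2)))
    return results


# Values copied from the raw output above, with the array written as a list.
raw_labels = [["__label__en"]]
raw_scores = [[0.98046654]]
print(tidy_predictions(raw_labels, raw_scores))
# [Detected(name='en', score=0.98)]
```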

Dev:

poetry run pre-commit install

Release Flow

# assumes git push is completed
git tag -l #  lists tags
git tag v*.*.* # Major.Minor.Fix
git push origin tag v*.*.*

# to delete tag:
git tag -d v*.*.* && git push origin tag -d v*.*.*

# update the version in pyproject.toml and __init__.py to match the new tag
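Since the tag and the package version are updated by hand, a small check can catch a mismatch before pushing. The helper below is a hypothetical sketch, not part of the repository; it only assumes the `vMajor.Minor.Fix` tag shape described above, and the version numbers in the usage lines are made up.

```python
import re

# Tag shape from the release flow: v<Major>.<Minor>.<Fix>
TAG_PATTERN = re.compile(r"v(\d+)\.(\d+)\.(\d+)")


def tag_matches_version(tag: str, version: str) -> bool:
    """Return True if `tag` looks like vMAJOR.MINOR.FIX and equals 'v' + `version`."""
    return TAG_PATTERN.fullmatch(tag) is not None and tag[1:] == version


# Illustrative version numbers only.
print(tag_matches_version("v1.2.3", "1.2.3"))  # True
print(tag_matches_version("v1.2.3", "1.2.4"))  # False
print(tag_matches_version("v1.2", "1.2"))      # False
```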

TODO:

refactor artifacts.py
auto checkers with pre-commit | invoke
write more tests
write github actions
create an intelligent data checker (a fast List[str] check; what to do with non-string values)
make it faster with Cython
get NDArray typing correctly
fix artifacts.py line 111 cast to List[str] that causes issues
remove nptyping when more packages move to numpy > 1.21
