Commit 17b159a

🎨 chore(ft_detect): refactor detect_langs to detect_language
Rename the `detect_langs` function to `detect_language` for clarity, keeping `detect_langs` as a deprecated alias that logs a warning. Remove redundancy and improve code readability.
1 parent 00aa55e commit 17b159a

7 files changed: +130 -53 lines changed

README.md

Lines changed: 53 additions & 31 deletions
@@ -2,67 +2,89 @@

 [![PyPI version](https://badge.fury.io/py/fast-langdetect.svg)](https://badge.fury.io/py/fast-langdetect)
 [![Downloads](https://pepy.tech/badge/fast-langdetect)](https://pepy.tech/project/fast-langdetect)
-[![Downloads](https://pepy.tech/badge/fast-langdetect/month)](https://pepy.tech/project/fast-langdetect/month)
+[![Downloads](https://pepy.tech/badge/fast-langdetect/month)](https://pepy.tech/project/fast-langdetect/)

-Python 3.9-3.12 support only. 🐍
+## Overview

-80x faster and 95% accurate language identification with Fasttext 🏎️
+**fast-langdetect** provides ultra-fast and highly accurate language detection based on FastText, a library developed by
+Facebook. This package is 80x faster than traditional methods and offers 95% accuracy.

-This library is a wrapper for the language detection model trained on fasttext by Facebook. For more information, please
-visit: https://fasttext.cc/docs/en/language-identification.html 📘
+It supports Python versions 3.9 to 3.12.

-This repository is patched
-from [zafercavdar/fasttext-langdetect](https://github.com/zafercavdar/fasttext-langdetect#benchmark), adding
-multi-language segmentation and better packaging
-support. 🌐
+This project builds upon [zafercavdar/fasttext-langdetect](https://github.com/zafercavdar/fasttext-langdetect#benchmark)
+with enhancements in packaging.

-Facilitates more accurate TTS implementation. 🗣️
+For more information on the underlying FastText model, refer to the official
+documentation: [FastText Language Identification](https://fasttext.cc/docs/en/language-identification.html).

-**Need 200M+ memory to use low_memory mode** 💾
+> [!NOTE]
+> This library requires over 200MB of memory to use in low memory mode.

 ## Installation 💻

+To install fast-langdetect, you can use either `pip` or `pdm`:
+
+### Using pip
+
 ```bash
 pip install fast-langdetect
 ```

-## Usage 🖥️
+### Using pdm
+
+```bash
+pdm add fast-langdetect
+```

-**For more accurate language detection, please use `detect(text,low_memory=False)` to load the big model.**
+## Usage 🖥️

-**Model will be downloaded in `/tmp/fasttext-langdetect` directory when you first use it.**
+For optimal performance and accuracy in language detection, use `detect(text, low_memory=False)` to load the larger
+model.

-```python
-from fast_langdetect import detect_langs
+> The model will be downloaded to the `/tmp/fasttext-langdetect` directory upon first use.

-print(detect_langs("Hello, world!"))
-# EN
+### Native API (Recommended)

-print(detect_langs("Привет, мир!"))
-# RU
+```python
+from fast_langdetect import detect, detect_multilingual

+# Single language detection
+print(detect("Hello, world!"))
+# Output: {'lang': 'en', 'score': 0.1520957201719284}

-print(detect_langs("你好,世界!"))
-# ZH
+print(detect("Привет, мир!")["lang"])
+# Output: ru

+# Multi-language detection
+print(detect_multilingual("Hello, world!你好世界!Привет, мир!"))
+# Output: [
+# {'lang': 'ru', 'score': 0.39008623361587524},
+# {'lang': 'zh', 'score': 0.18235979974269867},
+# ]
 ```

-## Advanced usage 🚀
+### Convenient `detect_language` Function

 ```python
-from fast_langdetect import detect, detect_multilingual
+from fast_langdetect import detect_language

-print(detect("Hello, world!"))
-# {'lang': 'en', 'score': 0.1520957201719284}
+# Single language detection
+print(detect_language("Hello, world!"))
+# Output: EN

-print(detect_multilingual("Hello, world!你好世界!Привет, мир!"))
-# [{'lang': 'ru', 'score': 0.39008623361587524}, {'lang': 'zh', 'score': 0.18235979974269867}, {'lang': 'ja', 'score': 0.08473210036754608}, {'lang': 'sr', 'score': 0.057975586503744125}, {'lang': 'en', 'score': 0.05422825738787651}]
+print(detect_language("Привет, мир!"))
+# Output: RU
+
+print(detect_language("你好,世界!"))
+# Output: ZH
 ```

-### Splitting text by language 🌐
+### Splitting Text by Language 🌐

-check out the [split-lang](https://github.com/DoodleBears/split-lang).
+For text splitting based on language, please refer to the [split-lang](https://github.com/DoodleBears/split-lang)
+repository.

 ## Accuracy 🎯

-References to the [benchmark](https://github.com/zafercavdar/fasttext-langdetect#benchmark)
+For detailed benchmark results, refer
+to [zafercavdar/fasttext-langdetect#benchmark](https://github.com/zafercavdar/fasttext-langdetect#benchmark).
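
The README shows `detect_multilingual` returning a list of `{'lang', 'score'}` dictionaries, some with very low scores. As a minimal sketch of how a caller might post-process such a list, the helper below keeps only candidates above a threshold. The name `filter_candidates` and the `min_score` default are assumptions for illustration and are not part of fast-langdetect; the sample data is copied from the README output above, so the sketch runs without the library or its model.

```python
from typing import Dict, List, Union

Candidate = Dict[str, Union[str, float]]


def filter_candidates(candidates: List[Candidate], min_score: float = 0.1) -> List[Candidate]:
    """Keep candidates whose score is at least min_score, highest score first.

    Hypothetical helper: it only assumes the {'lang', 'score'} shape shown in the README.
    """
    kept = [c for c in candidates if c["score"] >= min_score]
    return sorted(kept, key=lambda c: c["score"], reverse=True)


# Sample values taken from the README output for a mixed English/Chinese/Russian string.
sample = [
    {"lang": "ru", "score": 0.39008623361587524},
    {"lang": "zh", "score": 0.18235979974269867},
    {"lang": "ja", "score": 0.08473210036754608},
]

print([c["lang"] for c in filter_candidates(sample)])
# → ['ru', 'zh']
```

A threshold like this is a judgment call: the scores in the README are low even for the correct languages, so any cut-off would need tuning on real inputs.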

feature_test/__init__.py

Lines changed: 15 additions & 8 deletions
@@ -1,14 +1,21 @@
 # -*- coding: utf-8 -*-
 # @Time : 2024/1/18 11:41 AM
 # @Author : sudoskys
-# @File : __init__.py.py
-# @Software: PyCharm
-from fast_langdetect import detect, detect_multilingual, detect_langs
-from fast_langdetect import parse_sentence

-print(parse_sentence("你好世界"))
-print(parse_sentence("你好世界!Hello, world!Привет, мир!"))
-print(detect_multilingual("Hello, world!你好世界!Привет, мир!"))

+from fast_langdetect import detect, detect_multilingual, detect_language
+
+# Test Traditional Chinese, Simplified Chinese, Japanese, English, Korean, French, German, Spanish
+
+print(detect_multilingual("Hello, world!你好世界!Привет, мир!"))
+# [{'lang': 'ja', 'score': 0.32009604573249817}, {'lang': 'uk', 'score': 0.27781224250793457}, {'lang': 'zh', 'score': 0.17542070150375366}, {'lang': 'sr', 'score': 0.08751443773508072}, {'lang': 'bg', 'score': 0.05222449079155922}]
 print(detect("hello world"))
-print(detect_langs("Привет, мир!"))
+
+print(detect_language("Привет, мир!"))
+print(detect_language("你好世界"))
+print(detect_language("こんにちは世界"))
+print(detect_language("안녕하세요 세계"))
+print(detect_language("Bonjour le monde"))
+print(detect_language("Hallo Welt"))
+print(detect_language("Hola mundo"))
+print(detect_language("這些機構主辦的課程,多以基本電腦使用為主,例如文書處理、中文輸入、互聯網應用等"))

pdm.lock

Lines changed: 27 additions & 4 deletions
Some generated files are not rendered by default.

pyproject.toml

Lines changed: 1 addition & 2 deletions
@@ -7,9 +7,8 @@ authors = [
 ]
 dependencies = [
     "fasttext-wheel>=0.9.2",
-    "requests>=2.31.0",
     "robust-downloader>=0.0.2",
-    "numpy>=1.26.4,<2.0.0",
+    "langdetect>=1.0.9",
 ]
 requires-python = ">=3.9,<3.13"
 readme = "README.md"

src/fast_langdetect/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 # -*- coding: utf-8 -*-

-from .ft_detect import detect, detect_langs, detect_multilingual # noqa: F401
+from .ft_detect import detect, detect_language, detect_langs, detect_multilingual # noqa: F401

src/fast_langdetect/ft_detect/__init__.py

Lines changed: 14 additions & 3 deletions
@@ -1,7 +1,7 @@
 # -*- coding: utf-8 -*-
 # @Time : 2024/1/17 4:00 PM
-# @Author : sudoskys
-# @File : __init__.py
+import logging
+
 from .infer import detect
 from .infer import detect_multilingual # noqa: F401

@@ -13,7 +13,7 @@ def is_japanese(string):
     return False


-def detect_langs(sentence, *, low_memory: bool = True):
+def detect_language(sentence, *, low_memory: bool = True):
     """
     Detect language
     :param sentence: str sentence
@@ -24,3 +24,14 @@ def detect_langs(sentence, *, low_memory: bool = True):
     if lang_code == "JA" and not is_japanese(sentence):
         lang_code = "ZH"
     return lang_code
+
+
+def detect_langs(sentence, *, low_memory: bool = True):
+    """
+    Detect language
+    :param sentence: str sentence
+    :param low_memory: bool (default: True) whether to use low memory mode
+    :return: ZH, EN, JA, KO, FR, DE, ES, .... (two uppercase letters)
+    """
+    logging.warning("detect_langs is deprecated, use detect_language instead")
+    return detect_language(sentence, low_memory=low_memory)
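
The commit keeps `detect_langs` as a thin alias and signals the deprecation with `logging.warning`. A common alternative, shown here only as a sketch and not as what this commit does, is the standard library's `warnings.warn` with `DeprecationWarning`, which lets callers filter or escalate the message. The `detect_language` below is a stand-in stub that always returns `"EN"` so the example runs without the model; it is not the library's implementation.

```python
import warnings


def detect_language(sentence: str, *, low_memory: bool = True) -> str:
    """Stand-in stub for the real detector (assumption: always returns "EN")."""
    return "EN"


def detect_langs(sentence: str, *, low_memory: bool = True) -> str:
    """Deprecated alias that forwards to detect_language."""
    warnings.warn(
        "detect_langs is deprecated, use detect_language instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not to this wrapper
    )
    return detect_language(sentence, low_memory=low_memory)


# Record warnings so the deprecation can be inspected instead of printed to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = detect_langs("Hello, world!")

print(result, caught[0].category.__name__)
# → EN DeprecationWarning
```

The trade-off is visibility: `DeprecationWarning` is hidden by default outside of `__main__` and test runners, whereas `logging.warning` is always emitted unless logging is configured otherwise.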

tests/test_detect.py

Lines changed: 19 additions & 4 deletions
@@ -11,8 +11,23 @@ def test_muti_detect():
     assert result[0].get("lang") == "en", "ft_detect error"


+def test_detect():
+    from fast_langdetect import detect
+    assert detect("hello world")["lang"] == "en", "ft_detect error"
+    assert detect("你好世界")["lang"] == "zh", "ft_detect error"
+    assert detect("こんにちは世界")["lang"] == "ja", "ft_detect error"
+    assert detect("안녕하세요 세계")["lang"] == "ko", "ft_detect error"
+    assert detect("Bonjour le monde")["lang"] == "fr", "ft_detect error"
+
+
 def test_detect_totally():
-    from fast_langdetect import detect_langs
-    assert detect_langs("hello world") == "EN", "ft_detect error"
-    assert detect_langs("你好世界") == "ZH", "ft_detect error"
-    assert detect_langs("こんにちは世界") == "JA", "ft_detect error"
+    from fast_langdetect import detect_language
+    assert detect_language("hello world") == "EN", "ft_detect error"
+    assert detect_language("你好世界") == "ZH", "ft_detect error"
+    assert detect_language("こんにちは世界") == "JA", "ft_detect error"
+    assert detect_language("안녕하세요 세계") == "KO", "ft_detect error"
+    assert detect_language("Bonjour le monde") == "FR", "ft_detect error"
+    assert detect_language("Hallo Welt") == "DE", "ft_detect error"
+    assert detect_language(
+        "這些機構主辦的課程,多以基本電腦使用為主,例如文書處理、中文輸入、互聯網應用等"
+    ) == "ZH", "ft_detect error"
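
The `ZH` and `JA` assertions in `test_detect_totally` depend on the fallback visible in `detect_language`, which relabels a `JA` result as `ZH` when `is_japanese(sentence)` is false. The body of `is_japanese` is not part of this diff, so the following is only a sketch of one plausible check: it treats text as Japanese when it contains hiragana or katakana (Unicode U+3040 to U+30FF). The names `has_kana` and `resolve_cjk_code` are hypothetical and do not exist in the package.

```python
def has_kana(text: str) -> bool:
    """Return True if text contains any hiragana or katakana character.

    Assumption: kana presence is used as the signal for Japanese; the real
    is_japanese in the package may use a different rule.
    """
    return any("\u3040" <= ch <= "\u30ff" for ch in text)


def resolve_cjk_code(lang_code: str, sentence: str) -> str:
    """Mirror the fallback in the diff: JA without Japanese script becomes ZH."""
    if lang_code == "JA" and not has_kana(sentence):
        return "ZH"
    return lang_code


print(resolve_cjk_code("JA", "你好世界"))
# → ZH
print(resolve_cjk_code("JA", "こんにちは世界"))
# → JA
```

Han characters are shared between Chinese and Japanese, which is why a kana test is a cheap way to separate the two; Japanese text written entirely in kanji would still be relabeled as `ZH` under this rule.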
