Commit

Merge branch 'main' into fix-vietnamese-normalizer
ManyTheFish authored Feb 13, 2024
2 parents aee3e43 + b140e49 commit ed34dd6
Showing 2 changed files with 2 additions and 2 deletions.
2 changes: 1 addition & 1 deletion charabia/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "charabia"
-version = "0.8.6"
+version = "0.8.7"
license = "MIT"
authors = ["Many <many@meilisearch.com>"]
edition = "2021"
2 changes: 1 addition & 1 deletion charabia/README.md
@@ -16,7 +16,7 @@ Charabia provides a simple API to segment, normalize, or tokenize (segment + nor

| Script / Language | specialized segmentation | specialized normalization | Segmentation Performance level | Tokenization Performance level |
|---------------------|-------------------------------------------------------------------------------|---------------------------|-------------------|---|
-| **Latin** | ✅ CamelCase segmentation |[compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal | 🟩 ~23MiB/sec | 🟨 ~9MiB/sec |
+| **Latin** | ✅ CamelCase segmentation |[compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase + [nonspacing-marks](https://www.compart.com/en/unicode/category/Mn) removal + `Ð vs Đ` spoofing normalization | 🟩 ~23MiB/sec | 🟨 ~9MiB/sec |
| **Greek** ||[compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase + final sigma normalization | 🟩 ~27MiB/sec | 🟨 ~8MiB/sec |
| **Cyrillic** - **Georgian** ||[compatibility decomposition](https://unicode.org/reports/tr15/) + lowercase | 🟩 ~27MiB/sec | 🟨 ~9MiB/sec |
| **Chinese** **CMN** 🇨🇳 |[jieba](https://github.com/messense/jieba-rs) |[compatibility decomposition](https://unicode.org/reports/tr15/) + pinyin conversion | 🟨 ~10MiB/sec | 🟧 ~5MiB/sec |
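
The `Ð vs Đ` spoofing normalization added to the Latin row can be illustrated with a minimal, standard-library-only sketch. It is not charabia's implementation: the helper name `fold_d_lookalikes` is hypothetical, and the choice to fold `Ð` (U+00D0, Latin capital Eth), `Đ` (U+0110, Latin capital D with stroke), and `đ` (U+0111) to a plain `d` is an assumption made for illustration. The point is that the two capitals render almost identically but are distinct code points, so text typed with one would not match text typed with the other unless both are folded to a common form.

```rust
/// Hypothetical helper, NOT part of charabia's API.
/// Folds visually confusable "D with stroke" letters to a plain `d`.
/// The exact set of folded characters is an assumption for this sketch.
fn fold_d_lookalikes(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            // U+00D0 (capital Eth), U+0110 (capital D with stroke),
            // U+0111 (small d with stroke) all become ASCII 'd'.
            '\u{00D0}' | '\u{0110}' | '\u{0111}' => 'd',
            // Every other character is passed through unchanged.
            other => other,
        })
        .collect()
}
```

Under these assumptions, `fold_d_lookalikes("\u{00D0}\u{1EA1}i")` and `fold_d_lookalikes("\u{0110}\u{1EA1}i")` produce the same string, although the two inputs compare as different before folding.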