Python Fast Unidecode

This repo is a fork of the rust-unidecode repository that ports the original Rust implementation for use from Python. It also changes the source code to speed up the handling of ASCII characters, bringing this implementation on par with the Python unidecode implementation for that character set.

As a result, this package should produce the same output as the Python unidecode implementation while being much faster on non-ASCII characters (more than 4x) and slightly faster on ASCII characters (by a few percent), on average, according to the test_speed.py benchmark. Results depend on caching and other factors; for non-ASCII input the speedup is sometimes as high as 100x.
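To reproduce a comparison of this kind on your own inputs, a timing helper along the following lines may be useful. This is only a sketch using the standard timeit module; the helper names `time_transliterator` and `speedup` are hypothetical, and the repository's own benchmark script may measure things differently.

```python
import timeit


def time_transliterator(func, text, repeat=5, number=1000):
    """Return the best observed per-call time in seconds for func(text)."""
    timings = timeit.repeat(lambda: func(text), repeat=repeat, number=number)
    # The minimum is the measurement least disturbed by other processes.
    return min(timings) / number


def speedup(baseline, candidate, text, repeat=5, number=1000):
    """Return how many times faster candidate is than baseline on text."""
    slow = time_transliterator(baseline, text, repeat, number)
    fast = time_transliterator(candidate, text, repeat, number)
    return slow / fast
```

If both packages are installed, `speedup(unidecode.unidecode, fast_unidecode.unidecode, "北亰" * 100)` would give a rough ratio for non-ASCII input; expect the number to vary between runs and machines.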

Installation

pip install fast_unidecode

Installation from source

First build the package with maturin, then install the resulting wheel with pip.

maturin build --release
pip install target/wheels/fast_unidecode...

Usage

>>> from fast_unidecode import unidecode

>>> unidecode("Æneid")
'AEneid'

>>> unidecode("北亰")
'Bei Jing'

rust-unidecode (Original README.md)

Documentation

The rust-unidecode library is a Rust port of Sean M. Burke's famous Text::Unidecode module for Perl. It transliterates Unicode strings such as "Æneid" into pure ASCII ones such as "AEneid". For a detailed explanation of the rationale behind such a library, refer to the documentation of the original module and to this article written by Burke in 2001.

The data set used to translate the Unicode was ported directly from the Text::Unidecode module using a Perl script, so rust-unidecode should produce identical output.

Examples

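extern crate unidecode;
use unidecode::unidecode;

fn main() {
    // Each input is transliterated to an ASCII approximation.
    assert_eq!(unidecode("Æneid"), "AEneid");
    assert_eq!(unidecode("étude"), "etude");
    assert_eq!(unidecode("北亰"), "Bei Jing");
    assert_eq!(unidecode("ᔕᓇᓇ"), "shanana");
    // Note the trailing space produced by the final character.
    assert_eq!(unidecode("げんまい茶"), "genmaiCha ");
}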

Guarantees and Warnings

Here are some guarantees you have when calling unidecode():

The String returned will be valid ASCII; the decimal representation of every char in the string will be between 0 and 127, inclusive.
Every ASCII character (0x0000 - 0x007F) is mapped to itself.
All Unicode characters will translate to a string containing newlines ("\n") or ASCII characters in the range 0x0020 - 0x007E. So for example, no Unicode character will translate to \u{01}. The exception is if the ASCII character itself is passed in, in which case it will be mapped to itself. (So '\u{01}' will be mapped to "\u{01}".)
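The guarantees above can be spot-checked from Python. The following is a minimal sketch, assuming the Python binding inherits them from the Rust crate (this README does not state that explicitly). `find_violations` is a hypothetical helper name; it accepts any callable, so it does not depend on a particular package being installed.

```python
def find_violations(transliterate, samples):
    """Return messages describing broken guarantees (empty list if none)."""
    problems = []

    # Guarantee: every ASCII character (0x00-0x7F) maps to itself.
    for code in range(0x80):
        ch = chr(code)
        if transliterate(ch) != ch:
            problems.append(f"ASCII U+{code:04X} is not mapped to itself")

    # Guarantee: a non-ASCII character yields only "\n" or characters in
    # the range 0x20-0x7E, which also implies the output is valid ASCII.
    for text in samples:
        for ch in text:
            if ord(ch) < 0x80:
                continue
            out = transliterate(ch)
            if any(c != "\n" and not 0x20 <= ord(c) <= 0x7E for c in out):
                problems.append(f"U+{ord(ch):04X} produced {out!r}")
    return problems
```

Passing the `unidecode` function imported from fast_unidecode together with a few representative strings should return an empty list if the guarantees hold for those inputs. The check only covers the characters present in `samples`, so it is a spot check rather than a proof.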

There are, however, some things you should keep in mind:

As stated, some transliterations do produce \n characters.
Some Unicode characters transliterate to an empty string, either on purpose or because rust-unidecode does not know about the character.
Some Unicode characters are unknown and transliterate to "[?]".
Many Unicode characters transliterate to multi-character strings. For example, 北 is transliterated as "Bei ".

This information was paraphrased from the original Text::Unidecode documentation.
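Because some transliterations carry trailing spaces or newlines, as noted in the warnings above, callers may want to normalize whitespace in the result. A minimal sketch, assuming that collapsing every whitespace run into a single space is acceptable for your use case; `tidy` is a hypothetical helper, not part of the package:

```python
def tidy(transliterated):
    """Collapse whitespace runs (including newlines) and trim both ends."""
    # str.split() with no arguments splits on any whitespace run and
    # discards leading and trailing whitespace.
    return " ".join(transliterated.split())
```

For example, `tidy("genmaiCha ")` returns `"genmaiCha"`. Skip this step if newlines in the output are meaningful to you.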
