yurenizer

This is a Japanese text normalizer that resolves spelling inconsistencies.

A Japanese version of this README is available here:
https://github.com/sea-turt1e/yurenizer/blob/main/README_ja.md

Overview

yurenizer is a tool for detecting and unifying spelling variants in Japanese text.
For example, it unifies variants such as "パソコン" (pasokon), "パーソナル・コンピュータ", and "パーソナルコンピュータ" into a single form, "パーソナルコンピューター" (personal computer).
The unification rules follow the Sudachi Synonym Dictionary.

Web-based Demo

You can try the web-based demo here: yurenizer Web-demo

Installation

pip install yurenizer

Download Synonym Dictionary

curl -L -o /path/to/synonyms.txt https://raw.githubusercontent.com/WorksApplications/SudachiDict/refs/heads/develop/src/main/text/synonyms.txt
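
If curl is not available, the same file can be fetched from Python. The following is a minimal sketch using only the standard library. The helper name download_synonyms and the destination path are illustrative and are not part of yurenizer.

```python
import shutil
import urllib.request
from pathlib import Path

# URL taken from the curl command above.
SYNONYMS_URL = (
    "https://raw.githubusercontent.com/WorksApplications/SudachiDict/"
    "refs/heads/develop/src/main/text/synonyms.txt"
)


def download_synonyms(dest, url=SYNONYMS_URL):
    """Download the synonym dictionary to `dest` and return the path.

    Hypothetical helper for illustration; not provided by yurenizer.
    """
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk instead of loading it all into memory.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Example (requires network access):
# path = download_synonyms("data/synonyms.txt")
```

The returned path can then be passed as synonym_file_path when creating the SynonymNormalizer.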

Usage

Quick Start

from yurenizer import SynonymNormalizer
# synonym_file_path is the location of the synonyms.txt downloaded above
normalizer = SynonymNormalizer(synonym_file_path="path/to/synonym_file_path")
text = "「パソコン」は「パーソナルコンピュータ」の「synonym」で、「パーソナル・コンピュータ」と表記することもあります。"
print(normalizer.normalize(text))
# Output: 「パーソナルコンピューター」は「パーソナルコンピューター」の「シノニム」で、「パーソナルコンピューター」と表記することもあります。

Customizing Settings

You can control normalization by passing a NormalizerConfig as the second argument to the normalize method.

Example with Custom Settings

from yurenizer import SynonymNormalizer, NormalizerConfig
normalizer = SynonymNormalizer(synonym_file_path="path/to/synonym_file_path")
text = "「東日本旅客鉄道」は「JR東」や「JR-East」とも呼ばれます"
config = NormalizerConfig(
    unify_level="lexeme",
    taigen=True,
    yougen=False,
    expansion="from_another",
    other_language=False,
    alias=False,
    old_name=False,
    misuse=False,
    alphabetic_abbreviation=True,  # Normalize only alphabetic abbreviations
    non_alphabetic_abbreviation=False,
    alphabet=False,
    orthographic_variation=False,
    misspelling=False,
)
print(f"Output: {normalizer.normalize(text, config)}")
# Output: 「東日本旅客鉄道」は「JR東」や「東日本旅客鉄道」とも呼ばれます

Configuration Details

unify_level (default="lexeme"): Unification level. "lexeme" (the default) unifies by lexeme number, "word_form" unifies by word form number, and "abbreviation" unifies by abbreviation number.
taigen (default=True): Whether to include nouns (taigen) in unification. Set to False to exclude them.
yougen (default=False): Whether to include conjugated words (yougen) in unification. Set to True to include them. Note that conjugated words are unified to their headword form.
expansion (default="from_another"): Synonym expansion control. The default expands only entries whose expansion control flag is 0. Specify "ANY" to always expand.
other_language (default=True): Whether to normalize non-Japanese terms to Japanese. Set to False to disable.
alias (default=True): Whether to normalize aliases. Set to False to disable.
old_name (default=True): Whether to normalize old names. Set to False to disable.
misuse (default=True): Whether to normalize misused terms. Set to False to disable.
alphabetic_abbreviation (default=True): Whether to normalize alphabetic abbreviations. Set to False to disable.
non_alphabetic_abbreviation (default=True): Whether to normalize Japanese (non-alphabetic) abbreviations. Set to False to disable.
alphabet (default=True): Whether to normalize alphabet variations. Set to False to disable.
orthographic_variation (default=True): Whether to normalize orthographic variations. Set to False to disable.
misspelling (default=True): Whether to normalize misspellings. Set to False to disable.
custom_synonym (default=True): Whether to use user-defined custom synonyms. Set to False to disable.

Specifying SudachiDict

How text is segmented (that is, the length of the resulting units) depends on which SudachiDict type is used. The default is "full", but you can specify "small" or "core" instead.
To use "small" or "core", install the corresponding package and pass its name as the sudachi_dict argument of SynonymNormalizer():

pip install sudachidict_small
# or
pip install sudachidict_core

normalizer = SynonymNormalizer(sudachi_dict="small")
# or
normalizer = SynonymNormalizer(sudachi_dict="core")

Note: Please refer to the SudachiDict documentation for details.
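
Before choosing a sudachi_dict value, it can help to check which dictionary packages are actually installed. The sketch below uses only the standard library. It assumes the packages are importable as sudachidict_small, sudachidict_core, and sudachidict_full (the last name is an assumption, since only the first two appear in this README), and the helper installed_sudachi_dicts is hypothetical, not part of yurenizer.

```python
import importlib.util

# Assumed module names for each dictionary type. "sudachidict_full" is an
# assumption; this README only names the small and core packages.
DICT_MODULES = {
    "small": "sudachidict_small",
    "core": "sudachidict_core",
    "full": "sudachidict_full",
}


def installed_sudachi_dicts():
    """Return the dictionary types whose module can be found (hypothetical helper)."""
    return [
        name
        for name, module in DICT_MODULES.items()
        if importlib.util.find_spec(module) is not None
    ]


print(installed_sudachi_dicts())
```

If the type you want is missing from the printed list, install it with pip as shown above before passing it to SynonymNormalizer().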

Custom Dictionary Specification

You can supply your own custom dictionary.
If the same word exists in both the custom dictionary and the Sudachi Synonym Dictionary, the custom dictionary takes precedence.
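
The precedence rule can be pictured with two plain mappings from a variant to its representative form. This is only an illustration of the rule using toy data. It is not how yurenizer is implemented, and merge_synonym_maps is a hypothetical name.

```python
def merge_synonym_maps(sudachi_map, custom_map):
    """Merge two variant -> representative maps; custom entries win on conflict."""
    merged = dict(sudachi_map)
    merged.update(custom_map)  # entries applied later override earlier ones
    return merged


# Toy data for illustration, not taken from the real dictionaries.
sudachi_map = {"パソコン": "パーソナルコンピューター", "幽白": "幽☆遊☆白書"}
custom_map = {"幽白": "幽遊白書"}

merged = merge_synonym_maps(sudachi_map, custom_map)
print(merged["幽白"])  # 幽遊白書 (the custom entry wins)
print(merged["パソコン"])  # パーソナルコンピューター (unchanged)
```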

Custom Dictionary Format

Create a JSON file with the following format for your custom dictionary:

{
    "representative_word1": ["synonym1_1", "synonym1_2", ...],
    "representative_word2": ["synonym2_1", "synonym2_2", ...],
    ...
}

Example

If you create a file like this, "幽白", "ゆうはく", and "幽☆遊☆白書" will be normalized to "幽遊白書":

{
    "幽遊白書": ["幽白", "ゆうはく", "幽☆遊☆白書"]
}
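
A file in this format can also be generated and checked from Python. The following is a minimal sketch using the standard json module. The helper load_custom_synonyms and the file name are illustrative assumptions and are not part of yurenizer's API.

```python
import json
import tempfile
from pathlib import Path


def load_custom_synonyms(path):
    """Read a custom dictionary file and return a variant -> representative map."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    lookup = {}
    for representative, variants in data.items():
        # Each representative word must map to a list of strings.
        if not isinstance(variants, list) or not all(
            isinstance(v, str) for v in variants
        ):
            raise ValueError(f"{representative!r} must map to a list of strings")
        for variant in variants:
            lookup[variant] = representative
    return lookup


custom = {"幽遊白書": ["幽白", "ゆうはく", "幽☆遊☆白書"]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "custom_dict.json"
    # ensure_ascii=False keeps the Japanese text readable in the file.
    path.write_text(
        json.dumps(custom, ensure_ascii=False, indent=4), encoding="utf-8"
    )
    lookup = load_custom_synonyms(path)

print(lookup["ゆうはく"])  # 幽遊白書
```

Validating the file this way catches format mistakes, such as a string where a list is expected, before the file is handed to the normalizer.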

How to Specify

normalizer = SynonymNormalizer(custom_synonyms_file="path/to/custom_dict.json")

License

This project is licensed under the Apache License 2.0.

Open Source Software Used

Sudachi Synonym Dictionary: Apache License 2.0
SudachiPy: Apache License 2.0
SudachiDict: Apache License 2.0

This library uses SudachiPy and its dictionary SudachiDict for morphological analysis. These are also distributed under the Apache License 2.0.

For detailed license information, please check the LICENSE files of each project:

Sudachi Synonym Dictionary LICENSE (provided under the same license as the Sudachi dictionary)
SudachiPy LICENSE
SudachiDict LICENSE
