This plugin provides phonetic analysis for the Russian language by exposing the russian_phonetic token filter,
which transforms Russian words into their phonetic representation, the so-called phonetic code. These codes are used
to match words and names that sound similar. The transformation is also known as phonetic encoding.
The plugin can encode millions of Russian words per second and has the lowest impact on GC among all encoders
compared in the encoding throughput benchmarks.
📎 Results for matching misspellings and typos, distribution and encoding throughput benchmarks.
The encoding algorithm makes extensive use of phonetic and orthographic rules to bridge the gap between spelling and pronunciation in the Russian language.
вдры[зг] ⟷ вдры[ск]
слове[тск]ий ⟷ славе[цк]ий
ла[ндш]афт ⟷ ла[нш]афт
п[я]так ⟷ п[и]так
бу[хг]алтер ⟷ бу[г]алтер
бю[стг]алтер ⟷ бю[зд]галтер
ле[стн]ица ⟷ ле[сн]ица
кислово[дск] ⟷ кислово[цк]
You can find more information about the encoding process in the encoding rules and unit tests.
To install the plugin, choose a version and run:
$ bin/elasticsearch-plugin install URL
where URL points to the zip file of the release that corresponds to your Elasticsearch version.
❗ The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
For example, the command for Elasticsearch 7.6.2 is:
# install plugin on Elasticsearch 7.6.2
$ bin/elasticsearch-plugin install https://github.com/papahigh/elasticsearch-russian-phonetics/raw/7.6.2/esplugin/plugin-distributions/analysis-russian-phonetic-7.6.2.zip
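Because the plugin has to be installed on every node, it can be convenient to generate the install command from the Elasticsearch version. The sketch below assumes that release archives for other versions follow the same URL pattern as the 7.6.2 example, which should be checked against the actual release list. `plugin_url` and `install_command` are hypothetical helper names, not part of the plugin.

```python
# Assumption: every release is published under the same URL pattern
# as the 7.6.2 example above. Verify before relying on it.
BASE = "https://github.com/papahigh/elasticsearch-russian-phonetics/raw"


def plugin_url(es_version: str) -> str:
    """Build the download URL of the plugin zip for one Elasticsearch version."""
    return (
        f"{BASE}/{es_version}/esplugin/plugin-distributions/"
        f"analysis-russian-phonetic-{es_version}.zip"
    )


def install_command(es_version: str) -> str:
    """Return the shell command to run on each node of the cluster."""
    return f"bin/elasticsearch-plugin install {plugin_url(es_version)}"


print(install_command("7.6.2"))
```

The printed command still has to be run on each node, followed by a restart of that node.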
After installation, the plugin exposes a new token filter named russian_phonetic.
You can start using the russian_phonetic token filter by providing an analysis configuration:
PUT /russian_phonetic_sample
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "russian_phonetic"
          ]
        }
      },
      "filter": {
        "russian_phonetic": {
          "type": "russian_phonetic",
          "replace": false
        }
      }
    }
  }
}
You can then test the analyzer with the russian_phonetic token filter using the analyze API:
POST /russian_phonetic_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "студентка комсомолка спортсменка"
}
Returns: стднк, студентка, кмсмлк, комсомолка, спрцмнк, спортсменка
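The same analyze request can also be issued from a script. The following is a minimal sketch using only the Python standard library. It assumes an unsecured node at `http://localhost:9200` and that the `russian_phonetic_sample` index from the configuration above already exists. The helper names are hypothetical.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node without authentication
INDEX = "russian_phonetic_sample"


def analyze_body(text: str) -> dict:
    """Request body for the analyze API, using the analyzer defined above."""
    return {"analyzer": "my_analyzer", "text": text}


def token_strings(analyze_response: dict) -> list:
    """Extract the plain token strings from an analyze API response."""
    return [t["token"] for t in analyze_response["tokens"]]


def analyze(text: str) -> list:
    """Send the request to the node and return the emitted tokens."""
    request = urllib.request.Request(
        f"{ES_URL}/{INDEX}/_analyze",
        data=json.dumps(analyze_body(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return token_strings(json.load(response))
```

Calling `analyze("студентка комсомолка спортсменка")` against a running node should produce the token list shown above. With "replace": false, both the phonetic codes and the original tokens are included.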
The russian_phonetic token filter provides several configuration options to meet your particular needs:
- replace
  Whether or not the original token should be replaced by the phonetic code. Accepts true (default) or false.
- vowels
  Defines the encoding mode for vowels. Accepts encode_first (default) or encode_all.
  encode_first: only the first vowel in the supplied word will be encoded
    упячка → упчк
    голландский → глнскй
    абсурд → апсрт
  encode_all: all vowels will be encoded according to the encoding rules
    упячка → уп2чк1
    голландский → г1л1нск2й
    абсурд → апс3рт
- max_code_len
  The maximum length of the phonetic code. Defaults to 8.
- enable_stemmer
  Whether or not stemming should be applied. Accepts true or false (default). When this option is enabled, only the base (or root) form of the supplied word will be encoded.
    аннотируешь → антрш
    аннотируешься → антрш
    аннотируешь → ан1т2р32ш
    аннотируешься → ан1т2р32ш
    ящурным → ящрн
    ящурные → ящрн
    ящурным → ящ3рн
    ящурные → ящ3рн
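The options above can be collected into a filter definition programmatically. The sketch below builds the settings fragment and rejects values outside the documented set. `russian_phonetic_filter` and the filter name `my_phonetic` are hypothetical, and the lower bound on `max_code_len` is an assumption rather than documented plugin behaviour.

```python
import json

# Accepted vowel modes as listed in the options above.
VOWEL_MODES = ("encode_first", "encode_all")


def russian_phonetic_filter(replace=True, vowels="encode_first",
                            max_code_len=8, enable_stemmer=False) -> dict:
    """Build a russian_phonetic filter definition using the documented defaults."""
    if vowels not in VOWEL_MODES:
        raise ValueError(f"vowels must be one of {VOWEL_MODES}, got {vowels!r}")
    # Assumption: a code length below 1 is not meaningful. The plugin's own
    # validation rules are not described in this document.
    if max_code_len < 1:
        raise ValueError("max_code_len must be a positive integer")
    return {
        "type": "russian_phonetic",
        "replace": replace,
        "vowels": vowels,
        "max_code_len": max_code_len,
        "enable_stemmer": enable_stemmer,
    }


# Example: keep original tokens, encode all vowels, stem before encoding.
custom = russian_phonetic_filter(replace=False, vowels="encode_all",
                                 enable_stemmer=True)
print(json.dumps({"filter": {"my_phonetic": custom}}, indent=2))
```

The printed fragment can be placed under "analysis" in the index settings and referenced by name from an analyzer's filter list.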
💡 Take a look at the throughput and distribution benchmarks to understand the encoder's behaviour and performance under particular option values.
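To use the phonetic codes for matching, the analyzer has to be attached to a text field. The sketch below builds an index body with such a mapping and a match query against it. The field name `name` is hypothetical, and the typeless mapping format assumes Elasticsearch 7 or later. Whether a particular misspelling actually matches depends on the encoder and the options chosen, and is not verified here.

```python
import json


def index_body() -> dict:
    """Index settings plus a mapping that analyzes `name` with my_analyzer."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_analyzer": {
                        "tokenizer": "standard",
                        "filter": ["russian_phonetic"],
                    }
                },
                "filter": {
                    "russian_phonetic": {
                        "type": "russian_phonetic",
                        "replace": False,
                    }
                },
            }
        },
        # Assumption: typeless mappings (Elasticsearch 7+).
        "mappings": {
            "properties": {
                "name": {"type": "text", "analyzer": "my_analyzer"}
            }
        },
    }


def match_query(text: str) -> dict:
    """Match query; the query text is analyzed with the field's analyzer."""
    return {"query": {"match": {"name": {"query": text}}}}


print(json.dumps(match_query("лесница"), ensure_ascii=False))
```

Since the query text goes through the same analyzer as the indexed field, a query and a document can match through their shared phonetic code even when the spellings differ.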
- Blog post "Phonetic algorithms" by Nikita Smetanin
- Apache Lucene full-featured text search engine library
- Elasticsearch distributed search and analytics engine
Use the issue tracker and/or open pull requests.
Both the encoder and esplugin projects are released under version 2.0 of the Apache License.