Commit

[DOCS] Reformat CJK bigram and CJK width token filter docs (#48210)

jrodewig authored Oct 21, 2019
1 parent b84b7e1 commit bb635e5

Showing 2 changed files with 249 additions and 20 deletions.

184 changes: 171 additions & 13 deletions docs/reference/analysis/tokenfilters/cjk-bigram-tokenfilter.asciidoc

[[analysis-cjk-bigram-tokenfilter]]
[[analysis-cjk-bigram-tokenfilter]]
=== CJK bigram token filter
++++
<titleabbrev>CJK bigram</titleabbrev>
++++

Forms https://en.wikipedia.org/wiki/Bigram[bigrams] out of CJK (Chinese,
Japanese, and Korean) tokens.

This filter is included in {es}'s built-in <<cjk-analyzer,CJK language
analyzer>>. It uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html[CJKBigramFilter].


[[analysis-cjk-bigram-tokenfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request demonstrates how the
CJK bigram token filter works.

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["cjk_bigram"],
  "text" : "東京都は、日本の首都であり"
}
--------------------------------------------------

The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ 東京, 京都, 都は, 日本, 本の, の首, 首都, 都で, であ, あり ]
--------------------------------------------------
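The windowing behavior above can be sketched in a few lines of Python. This is a simplified illustration, not Lucene's implementation: it detects CJK runs with a character-class regex (han, hiragana, katakana, and hangul ranges) instead of per-token script attributes, and it shows only the default unigram fallback.

```python
import re

# Matches runs of CJK characters: hiragana/katakana, CJK unified
# ideographs (han), and hangul syllables. A simplification of the
# script detection Lucene performs per token.
CJK_RUN = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]+")

def cjk_bigrams(text):
    """Sketch of bigram formation: slide a two-character window over
    each CJK run; a run of length 1 falls back to a unigram."""
    tokens = []
    for run in CJK_RUN.findall(text):
        if len(run) == 1:
            tokens.append(run)  # no neighbor to pair with: emit unigram
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens

print(cjk_bigrams("東京都は、日本の首都であり"))
# ['東京', '京都', '都は', '日本', '本の', 'の首', '首都', '都で', 'であ', 'あり']
```

Note that the ideographic comma (、) splits the text into two runs, so no bigram spans it, which is why `都は` and `日本` are not joined.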

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens" : [
    {
      "token" : "東京",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<DOUBLE>",
      "position" : 0
    },
    {
      "token" : "京都",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "<DOUBLE>",
      "position" : 1
    },
    {
      "token" : "都は",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "<DOUBLE>",
      "position" : 2
    },
    {
      "token" : "日本",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<DOUBLE>",
      "position" : 3
    },
    {
      "token" : "本の",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "<DOUBLE>",
      "position" : 4
    },
    {
      "token" : "の首",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "<DOUBLE>",
      "position" : 5
    },
    {
      "token" : "首都",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "<DOUBLE>",
      "position" : 6
    },
    {
      "token" : "都で",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "<DOUBLE>",
      "position" : 7
    },
    {
      "token" : "であ",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "<DOUBLE>",
      "position" : 8
    },
    {
      "token" : "あり",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "<DOUBLE>",
      "position" : 9
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-cjk-bigram-tokenfilter-analyzer-ex]]
==== Add to an analyzer

The following <<indices-create-index,create index API>> request uses the
CJK bigram token filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>.

[source,console]
--------------------------------------------------
PUT /cjk_bigram_example
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "standard_cjk_bigram" : {
          "tokenizer" : "standard",
          "filter" : ["cjk_bigram"]
        }
      }
    }
  }
}
--------------------------------------------------


[[analysis-cjk-bigram-tokenfilter-configure-parms]]
==== Configurable parameters

`ignored_scripts`::
+
--
(Optional, array of character scripts)
Array of character scripts for which to disable bigrams.
Possible values:

* `han`
* `hangul`
* `hiragana`
* `katakana`

All non-CJK input is passed through unmodified.
--

`output_unigrams`::
(Optional, boolean)
If `true`, emit tokens in both bigram and
https://en.wikipedia.org/wiki/N-gram[unigram] form. If `false`, a CJK character
is output in unigram form when it has no adjacent characters. Defaults to
`false`.
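The combined output can be pictured with a short sketch. This is an illustration of the shape of the token stream, not Lucene's code: each unigram is followed here by the bigram starting at the same offset, approximating how Lucene overlaps them (in Lucene both carry matching positions, so the exact emission order is an assumption).

```python
def unigrams_and_bigrams(run):
    """Sketch of `output_unigrams: true` for a single CJK run: emit
    every unigram, each followed by the bigram that starts at the same
    offset. Simplified; Lucene overlaps the pair via position
    increments rather than guaranteeing this flat ordering."""
    tokens = []
    for i, ch in enumerate(run):
        tokens.append(ch)                # unigram at offset i
        if i + 1 < len(run):
            tokens.append(run[i:i + 2])  # bigram covering offsets i, i+1
    return tokens

print(unigrams_and_bigrams("東京都"))
# ['東', '東京', '京', '京都', '都']
```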

[[analysis-cjk-bigram-tokenfilter-customize]]
==== Customize

To customize the CJK bigram token filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.

[source,console]
--------------------------------------------------
PUT /cjk_bigram_example
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "han_bigrams" : {
          "tokenizer" : "standard",
          "filter" : ["han_bigrams_filter"]
        }
      },
      "filter" : {
        "han_bigrams_filter" : {
          "type" : "cjk_bigram",
          "ignored_scripts": [
            "hiragana",
            "katakana",
            "hangul"
          ],
          "output_unigrams" : true
        }
      }
    }
  }
}
--------------------------------------------------
docs/reference/analysis/tokenfilters/cjk-width-tokenfilter.asciidoc
[[analysis-cjk-width-tokenfilter]]
=== CJK width token filter
++++
<titleabbrev>CJK width</titleabbrev>
++++

Normalizes width differences in CJK (Chinese, Japanese, and Korean) characters
as follows:

* Folds full-width ASCII character variants into the equivalent basic Latin
characters
* Folds half-width Katakana character variants into the equivalent Kana
characters

This filter is included in {es}'s built-in <<cjk-analyzer,CJK language
analyzer>>. It uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html[CJKWidthFilter].

NOTE: This token filter can be viewed as a subset of NFKC/NFKD Unicode
normalization. See the
{plugins}/analysis-icu-normalization-charfilter.html[`analysis-icu` plugin] for
full normalization support.

[[analysis-cjk-width-tokenfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request demonstrates how the
CJK width token filter works.

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["cjk_width"],
  "text" : "ｼｰｻｲﾄﾞﾗｲﾅｰ"
}
--------------------------------------------------

The filter produces the following token:

[source,text]
--------------------------------------------------
シーサイドライナー
--------------------------------------------------
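The two foldings can be sketched with a small Python helper. This is a simplified, per-character illustration rather than Lucene's implementation: fullwidth ASCII variants are shifted back into Basic Latin, and halfwidth katakana are normalized to their fullwidth forms; Lucene additionally merges a halfwidth dakuten or handakuten mark into the preceding kana, which this pass does not.

```python
import unicodedata

def cjk_width_fold(token):
    """Sketch of CJK width folding: fullwidth ASCII -> Basic Latin,
    halfwidth katakana -> fullwidth kana. Simplified; combining voiced
    sound marks are not merged into the preceding character."""
    out = []
    for ch in token:
        cp = ord(ch)
        if 0xFF01 <= cp <= 0xFF5E:        # fullwidth ASCII variants
            out.append(chr(cp - 0xFEE0))  # shift down to Basic Latin
        elif 0xFF61 <= cp <= 0xFF9F:      # halfwidth katakana block
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)                # width already normalized
    return "".join(out)

print(cjk_width_fold("ＥＳ ｶﾀｶﾅ"))
# 'ES カタカナ'
```

The fullwidth-to-Latin branch works because the fullwidth forms block (U+FF01–U+FF5E) mirrors printable ASCII at a fixed offset of 0xFEE0.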

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens" : [
    {
      "token" : "シーサイドライナー",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<KATAKANA>",
      "position" : 0
    }
  ]
}
/////////////////////

[[analysis-cjk-width-tokenfilter-analyzer-ex]]
==== Add to an analyzer

The following <<indices-create-index,create index API>> request uses the
CJK width token filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>.

[source,console]
--------------------------------------------------
PUT /cjk_width_example
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "standard_cjk_width" : {
          "tokenizer" : "standard",
          "filter" : ["cjk_width"]
        }
      }
    }
  }
}
--------------------------------------------------
