[DOCS] Reformat CJK bigram and CJK width token filter docs #48210

Merged
merged 3 commits on Oct 21, 2019
186 changes: 173 additions & 13 deletions docs/reference/analysis/tokenfilters/cjk-bigram-tokenfilter.asciidoc
@@ -1,18 +1,178 @@
[[analysis-cjk-bigram-tokenfilter]]
=== CJK Bigram Token Filter
=== CJK bigram token filter
++++
<titleabbrev>CJK bigram</titleabbrev>
++++

The `cjk_bigram` token filter forms bigrams out of the CJK
terms that are generated by the <<analysis-standard-tokenizer,`standard` tokenizer>>
or the `icu_tokenizer` (see {plugins}/analysis-icu-tokenizer.html[`analysis-icu` plugin]).
Forms https://en.wikipedia.org/wiki/Bigram[bigrams] out of the CJK (Chinese,
Japanese, and Korean) terms generated by the
<<analysis-standard-tokenizer,standard tokenizer>> or the
{plugins}/analysis-icu-tokenizer.html[ICU tokenizer].
Contributor
Strictly speaking, it will form bigrams from the CJK tokens produced by any tokenizer, so I'm not sure we need to refer to standard and icu here?

Contributor Author

Thanks @romseygeek. I removed the standard and ICU reference with cecd9bc.


By default, when a CJK character has no adjacent characters to form a bigram,
it is output in unigram form. If you always want to output both unigrams and
bigrams, set the `output_unigrams` flag to `true`. This can be used for a
combined unigram+bigram approach.
This filter is included in {es}'s built-in <<cjk-analyzer,CJK language
analyzer>>. It uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilter.html[CJKBigramFilter].

Bigrams are generated for characters in `han`, `hiragana`, `katakana` and
`hangul`, but bigrams can be disabled for particular scripts with the
`ignored_scripts` parameter. All non-CJK input is passed through unmodified.

[[analysis-cjk-bigram-tokenfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request demonstrates how the
CJK bigram token filter works.

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["cjk_bigram"],
  "text" : "東京都は、日本の首都であり"
}
--------------------------------------------------

The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ 東京, 京都, 都は, 日本, 本の, の首, 首都, 都で, であ, あり ]
--------------------------------------------------

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens" : [
    {
      "token" : "東京",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<DOUBLE>",
      "position" : 0
    },
    {
      "token" : "京都",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "<DOUBLE>",
      "position" : 1
    },
    {
      "token" : "都は",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "<DOUBLE>",
      "position" : 2
    },
    {
      "token" : "日本",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<DOUBLE>",
      "position" : 3
    },
    {
      "token" : "本の",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "<DOUBLE>",
      "position" : 4
    },
    {
      "token" : "の首",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "<DOUBLE>",
      "position" : 5
    },
    {
      "token" : "首都",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "<DOUBLE>",
      "position" : 6
    },
    {
      "token" : "都で",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "<DOUBLE>",
      "position" : 7
    },
    {
      "token" : "であ",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "<DOUBLE>",
      "position" : 8
    },
    {
      "token" : "あり",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "<DOUBLE>",
      "position" : 9
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-cjk-bigram-tokenfilter-analyzer-ex]]
==== Add to an analyzer

The following <<indices-create-index,create index API>> request uses the
CJK bigram token filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>.

[source,console]
--------------------------------------------------
PUT /cjk_bigram_example
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "standard_cjk_bigram" : {
          "tokenizer" : "standard",
          "filter" : ["cjk_bigram"]
        }
      }
    }
  }
}
--------------------------------------------------


[[analysis-cjk-bigram-tokenfilter-configure-parms]]
==== Configurable parameters

`ignored_scripts`::
+
--
(Optional, array of character scripts)
Array of character scripts for which to disable bigrams.
Possible values:

* `han`
* `hangul`
* `hiragana`
* `katakana`

All non-CJK input is passed through unmodified.
--

`output_unigrams`::
(Optional, boolean)
If `true`, emit tokens in both bigram and
https://en.wikipedia.org/wiki/N-gram[unigram] form. If `false`, a CJK character
is output in unigram form when it has no adjacent characters. Defaults to
`false`.
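
As a sketch of how these two parameters combine, the <<indices-analyze,analyze
API>> also accepts an inline filter definition, so both settings can be tried
without creating an index. The request below is illustrative and is not part of
this PR's examples:

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : [
    {
      "type" : "cjk_bigram",
      "ignored_scripts" : [ "katakana" ],
      "output_unigrams" : true
    }
  ],
  "text" : "東京都"
}
--------------------------------------------------

With `output_unigrams` enabled, each CJK character is emitted in unigram form
alongside the bigrams, while characters in the ignored scripts pass through
unmodified.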

[[analysis-cjk-bigram-tokenfilter-customize]]
==== Customize

To customize the CJK bigram token filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.

[source,console]
--------------------------------------------------
@@ -30,9 +190,9 @@ PUT /cjk_bigram_example
"han_bigrams_filter" : {
"type" : "cjk_bigram",
"ignored_scripts": [
"hangul",
"hiragana",
"katakana",
"hangul"
"katakana"
],
"output_unigrams" : true
}
@@ -1,12 +1,83 @@
[[analysis-cjk-width-tokenfilter]]
=== CJK Width Token Filter
=== CJK width token filter
++++
<titleabbrev>CJK width</titleabbrev>
++++

The `cjk_width` token filter normalizes CJK width differences:
Normalizes width differences in CJK (Chinese, Japanese, and Korean) characters
as follows:

* Folds fullwidth ASCII variants into the equivalent basic Latin
* Folds halfwidth Katakana variants into the equivalent Kana
* Folds full-width ASCII character variants into the equivalent basic Latin
characters
* Folds half-width Katakana character variants into the equivalent Kana
characters

NOTE: This token filter can be viewed as a subset of NFKC/NFKD
Unicode normalization. See the {plugins}/analysis-icu-normalization-charfilter.html[`analysis-icu` plugin]
for full normalization support.
This filter is included in {es}'s built-in <<cjk-analyzer,CJK language
analyzer>>. It uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/cjk/CJKWidthFilter.html[CJKWidthFilter].

NOTE: This token filter can be viewed as a subset of NFKC/NFKD Unicode
normalization. See the
{plugins}/analysis-icu-normalization-charfilter.html[`analysis-icu` plugin] for
full normalization support.

[[analysis-cjk-width-tokenfilter-analyze-ex]]
==== Example

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["cjk_width"],
  "text" : "シーサイドライナー"
}
--------------------------------------------------

The filter produces the following token:

[source,text]
--------------------------------------------------
シーサイドライナー
--------------------------------------------------

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens" : [
    {
      "token" : "シーサイドライナー",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<KATAKANA>",
      "position" : 0
    }
  ]
}
--------------------------------------------------
/////////////////////
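
The filter also folds full-width ASCII variants, per the list above. A second
illustrative request (not part of this PR's examples) passes full-width Latin
letters through the filter:

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "standard",
  "filter" : ["cjk_width"],
  "text" : "ｅｌａｓｔｉｃｓｅａｒｃｈ"
}
--------------------------------------------------

Given the folding behavior described above, this should produce the basic
Latin token `elasticsearch`.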

[[analysis-cjk-width-tokenfilter-analyzer-ex]]
==== Add to an analyzer

The following <<indices-create-index,create index API>> request uses the
CJK width token filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>.

[source,console]
--------------------------------------------------
PUT /cjk_width_example
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "standard_cjk_width" : {
          "tokenizer" : "standard",
          "filter" : ["cjk_width"]
        }
      }
    }
  }
}
--------------------------------------------------