Commit 10859be

Documentation
1 parent 187711f commit 10859be

7 files changed, +102 -45 lines changed

CHANGELOG

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
+mkdocs-material-8.3.2+insiders-4.17.2 (2022-06-05)
+
+* Added support for custom jieba dictionaries (Chinese search)
+
 mkdocs-material-8.3.2+insiders-4.17.1 (2022-06-05)

 * Added support for cookie consent reject button

docs/blog/2021/search-better-faster-smaller.md

Lines changed: 12 additions & 12 deletions
@@ -197,8 +197,8 @@ the following steps are taken:
 remain. Linking is necessary, as search results are grouped by page.

 2. __Tokenization__: The `title` and `text` values of each section are split
-into tokens by using the [separator] as configured in `mkdocs.yml`.
-Tokenization itself is carried out by
+into tokens by using the [`separator`][separator] as configured in
+`mkdocs.yml`. Tokenization itself is carried out by
 [lunr's default tokenizer][default tokenizer], which doesn't allow for
 lookahead or separators spanning multiple characters.

@@ -216,7 +216,7 @@ more magic involved, e.g., search results are [post-processed] and [rescored] to
 account for some shortcomings of [lunr], but in general, this is how data gets
 into and out of the index.

-[separator]: ../../setup/setting-up-site-search.md#separator
+[separator]: ../../setup/setting-up-site-search.md#search-separator
 [default tokenizer]: https://github.com/olivernn/lunr.js/blob/aa5a878f62a6bba1e8e5b95714899e17e8150b38/lunr.js#L413-L456
 [post-processed]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/_/index.ts#L249-L272
 [rescored]: https://github.com/squidfunk/mkdocs-material/blob/ec7ccd2b2d15dd033740f388912f7be7738feec2/src/assets/javascripts/integrations/search/_/index.ts#L274-L275
@@ -421,9 +421,9 @@ On to the next step in the process: __tokenization__.
 ### Tokenizer lookahead

 The [default tokenizer] of [lunr] uses a regular expression to split a given
-string by matching each character against the [separator] as defined in
-`mkdocs.yml`. This doesn't allow for more complex separators based on
-lookahead or multiple characters.
+string by matching each character against the [`separator`][separator] as
+defined in `mkdocs.yml`. This doesn't allow for more complex separators based
+on lookahead or multiple characters.

 Fortunately, __our new search implementation provides an advanced tokenizer__
 that doesn't have these shortcomings and supports more complex regular
@@ -439,14 +439,14 @@ characters at which the string should be split, the following three sections
 explain the remainder of the regular expression.[^4]

 [^4]:
-As a fun fact: the [separator default value] of the search plugin being
-`[\s\-]+` always has been kind of irritating, as it suggests that multiple
-characters can be considered being a separator. However, the `+` is
-completely irrelevant, as regular expression groups involving multiple
-characters were never supported by
+As a fun fact: the [`separator`][separator] [default value] of the search
+plugin being `[\s\-]+` always has been kind of irritating, as it suggests
+that multiple characters can be considered being a separator. However, the
+`+` is completely irrelevant, as regular expression groups involving
+multiple characters were never supported by
 [lunr's default tokenizer][default tokenizer].

-[separator default value]: https://www.mkdocs.org/user-guide/configuration/#separator
+[default value]: https://www.mkdocs.org/user-guide/configuration/#separator

 #### Case changes
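
Note on the footnote changed in this file: it states that the `+` in the default separator `[\s\-]+` has no effect, because lunr's default tokenizer matches each character against the separator individually. The following Python sketch illustrates that behaviour. It is an approximation under stated assumptions, not lunr's actual code: the function name `tokenize_per_character` is hypothetical, and Python's `re` module stands in for JavaScript regular expressions.

```python
import re

# Default separator value quoted in the footnote.
DEFAULT_SEPARATOR = re.compile(r"[\s\-]+")


def tokenize_per_character(text, separator=DEFAULT_SEPARATOR):
    """Split text by testing one character at a time against the separator.

    Hypothetical helper that mimics a per-character tokenizer: the
    separator never sees more than a single character, so quantifiers
    and multi-character patterns cannot take effect.
    """
    tokens = []
    start = 0
    for end in range(len(text) + 1):
        at_end = end == len(text)
        # Only a single character is ever handed to the separator.
        if at_end or separator.search(text[end]):
            if end > start:
                tokens.append(text[start:end])
            start = end + 1
    return tokens


with_plus = tokenize_per_character("foo - bar", re.compile(r"[\s\-]+"))
without_plus = tokenize_per_character("foo - bar", re.compile(r"[\s\-]"))
two_chars = tokenize_per_character("a--b", re.compile(r"--"))
```

In this sketch `with_plus` and `without_plus` hold the same tokens, and the two-character separator `--` never matches, so `a--b` stays a single token.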

docs/blog/2022/chinese-search-support.md

Lines changed: 10 additions & 9 deletions
@@ -32,10 +32,10 @@ number of Chinese users.__
 ---

 After the United States and Germany, the third-largest country of origin of
-Material for MkDocs users is China. For a long time, the built-in search plugin
+Material for MkDocs users is China. For a long time, the [built-in search plugin]
 didn't allow for proper segmentation of Chinese characters, mainly due to
-missing support in [lunr-languages] which is used for search tokenization and
-stemming. The latest Insiders release adds long-awaited Chinese language support
+missing support in [lunr-languages] which is used for search tokenization and
+stemming. The latest Insiders release adds long-awaited Chinese language support
 for the built-in search plugin, something that has been requested by many users.

 _Material for MkDocs終於​支持​中文​了!文本​被​正確​分割​並且​更​容易​找到。_
@@ -50,18 +50,19 @@ search plugin in a few minutes._
 ## Configuration

 Chinese language support for Material for MkDocs is provided by [jieba], an
-excellent Chinese text segmentation library. If [jieba] is installed, the
-built-in search plugin automatically detects Chinese characters and runs them
+excellent Chinese text segmentation library. If [jieba] is installed, the
+built-in search plugin automatically detects Chinese characters and runs them
 through the segmenter. You can install [jieba] with:

 ```
 pip install jieba
 ```

-The next step is only required if you specified the [separator] configuration
-in `mkdocs.yml`. Text is segmented with [zero-width whitespace] characters, so
-it renders exactly the same in the search modal. Adjust `mkdocs.yml` so that
-the [separator] includes the `\u200b` character:
+The next step is only required if you specified the [`separator`][separator]
+configuration in `mkdocs.yml`. Text is segmented with [zero-width whitespace]
+characters, so it renders exactly the same in the search modal. Adjust
+`mkdocs.yml` so that the [`separator`][separator] includes the `\u200b`
+character:

 ``` yaml
 plugins:

docs/blog/index.md

Lines changed: 5 additions & 4 deletions
@@ -33,11 +33,12 @@ number of Chinese users.__
 ---

 After the United States and Germany, the third-largest country of origin of
-Material for MkDocs users is China. For a long time, the built-in search plugin
+Material for MkDocs users is China. For a long time, the [built-in search plugin]
 didn't allow for proper segmentation of Chinese characters, mainly due to
-missing support in [lunr-languages] which is used for search tokenization and
-stemming. The latest Insiders release adds long-awaited Chinese language support
-for the built-in search plugin, something that has been requested by many users.
+missing support in [`lunr-languages`][lunr-languages] which is used for search
+tokenization and stemming. The latest Insiders release adds long-awaited Chinese
+language support for the built-in search plugin, something that has been
+requested by many users.

 [:octicons-arrow-right-24: Continue reading][Chinese search support – 中文搜索​支持]

docs/insiders/changelog.md

Lines changed: 4 additions & 0 deletions
@@ -6,6 +6,10 @@ template: overrides/main.html

 ## Material for MkDocs Insiders

+### 4.17.2 <small>_ June 5, 2022</small> { id="4.17.2" }
+
+- Added support for custom jieba dictionaries (Chinese search)
+
 ### 4.17.1 <small>_ June 5, 2022</small> { id="4.17.1" }

 - Added support for cookie consent reject button

docs/setup/ensuring-data-privacy.md

Lines changed: 3 additions & 3 deletions
@@ -104,15 +104,15 @@ The following properties are available:

 : [:octicons-tag-24: insiders-4.17.1][Insiders] · :octicons-milestone-24:
 Default: `[accept, manage]` – This property defines which buttons are shown
-and in which order, e.g. to allow the user to manage settings and accept
-the cookie:
+and in which order, e.g. to allow the user to accept cookies and manage
+settings:

 ``` yaml
 extra:
   consent:
     actions:
-      - manage
       - accept
+      - manage
 ```

 The cookie consent form includes three types of buttons:

docs/setup/setting-up-site-search.md

Lines changed: 64 additions & 17 deletions
@@ -92,12 +92,6 @@ The following configuration options are supported:
 part of this list by automatically falling back to the stemmer yielding the
 best result.

-!!! tip "Chinese search support – 中文搜索​支持"
-
-Material for MkDocs recently added __experimental language support for
-Chinese__ as part of [Insiders]. [Read the blog article][chinese search]
-to learn how to set up search for Chinese in a matter of minutes.
-
 `separator`{ #search-separator }

 : :octicons-milestone-24: Default: _automatically set_ – The separator for
@@ -112,10 +106,9 @@ The following configuration options are supported:
 ```

 1. Tokenization itself is carried out by [lunr's default tokenizer], which
-doesn't allow for lookahead or separators spanning multiple characters.
-
-For more finegrained control over the tokenization process, see the
-section on [tokenizer lookahead].
+doesn't allow for lookahead or multi-character separators. For more
+finegrained control over the tokenization process, see the section on
+[tokenizer lookahead].

 <div class="mdx-deprecated" markdown>

@@ -142,28 +135,82 @@ The following configuration options are supported:

 </div>

-The other configuration options of this plugin are not officially supported
-by Material for MkDocs, which is why they may yield unexpected results. Use
-them at your own risk.
-
 [search support]: https://github.com/squidfunk/mkdocs-material/releases/tag/0.1.0
 [lunr]: https://lunrjs.com
 [lunr-languages]: https://github.com/MihaiValentin/lunr-languages
-[chinese search]: ../blog/2022/chinese-search-support.md
 [lunr's default tokenizer]: https://github.com/olivernn/lunr.js/blob/aa5a878f62a6bba1e8e5b95714899e17e8150b38/lunr.js#L413-L456
 [site language]: changing-the-language.md#site-language
 [tokenizer lookahead]: #tokenizer-lookahead
 [prebuilt index support]: https://github.com/squidfunk/mkdocs-material/releases/tag/5.0.0
 [prebuilt index]: https://www.mkdocs.org/user-guide/configuration/#prebuild_index
 [50% smaller]: ../blog/2021/search-better-faster-smaller.md#benchmarks

+#### Chinese language support
+
+[:octicons-heart-fill-24:{ .mdx-heart } Sponsors only][Insiders]{ .mdx-insiders } ·
+[:octicons-tag-24: insiders-4.14.0][Insiders] ·
+:octicons-beaker-24: Experimental
+
+[Insiders] adds search support for the Chinese language (see our [blog article]
+[chinese search] from May 2022) by integrating with the text segmentation
+library [jieba], which can be installed with `pip`.
+
+``` sh
+pip install jieba
+```
+
+If [jieba] is installed, the [built-in search plugin] automatically detects
+Chinese characters and runs them through the segmenter. The following
+configuration options are available:
+
+`jieba_dict`{ #jieba-dict }
+
+: [:octicons-tag-24: insiders-4.17.2][Insiders] · :octicons-milestone-24:
+Default: _none_ – This option allows for specifying a [custom dictionary]
+to be used by [jieba] for segmenting text, replacing the default dictionary:
+
+``` yaml
+plugins:
+  - search:
+      jieba_dict: dict.txt # (1)!
+```
+
+1. The following alternative dictionaries are provided by [jieba]:
+
+- [dict.txt.small] – 占用内存较小的词典文件
+- [dict.txt.big] – 支持繁体分词更好的词典文件
+
+`jieba_dict_user`{ #jieba-dict-user }
+
+: [:octicons-tag-24: insiders-4.17.2][Insiders] · :octicons-milestone-24:
+Default: _none_ – This option allows for specifying an additional
+[user dictionary] to be used by [jieba] for segmenting text, augmenting the
+default dictionary:
+
+``` yaml
+plugins:
+  - search:
+      jieba_dict_user: user_dict.txt
+```
+
+User dictionaries can be used for tuning the segmenter to preserve
+technical terms.
+
+[chinese search]: ../blog/2022/chinese-search-support.md
+[jieba]: https://pypi.org/project/jieba/
+[built-in search plugin]: #built-in-search-plugin
+[custom dictionary]: https://github.com/fxsjy/jieba#%E5%85%B6%E4%BB%96%E8%AF%8D%E5%85%B8
+[dict.txt.small]: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.small
+[dict.txt.big]: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big
+[user dictionary]: https://github.com/fxsjy/jieba#%E8%BD%BD%E5%85%A5%E8%AF%8D%E5%85%B8
+
 ### Rich search previews

 [:octicons-heart-fill-24:{ .mdx-heart } Sponsors only][Insiders]{ .mdx-insiders } ·
 [:octicons-tag-24: insiders-3.0.0][Insiders] ·
 :octicons-beaker-24: Experimental

-Insiders ships rich search previews as part of the [new search plugin], which
+[Insiders] ships rich search previews as part of the [new search plugin], which
 will render code blocks directly in the search result, and highlight all
 occurrences inside those blocks:

@@ -186,7 +233,7 @@ occurrences inside those blocks:
 [:octicons-tag-24: insiders-3.0.0][Insiders] ·
 :octicons-beaker-24: Experimental

-Insiders allows for more complex configurations of the [`separator`][separator]
+[Insiders] allows for more complex configurations of the [`separator`][separator]
 setting as part of the [new search plugin], yielding more influence on the way
 documents are tokenized:
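
Note on the `jieba_dict` and `jieba_dict_user` options added in this file: the Python sketch below shows one plausible way such options could be handed to jieba, and how segmented text joined with zero-width whitespace is then split by a separator that includes `\u200b`. It is an assumption-laden illustration, not the plugin's implementation. The helper names `configure_segmenter`, `segment` and `split_tokens` are hypothetical, the separator value is assumed, and the mapping of `jieba_dict` to `set_dictionary` and of `jieba_dict_user` to `load_userdict` (both functions that jieba exposes) is a guess at how the options are forwarded.

```python
import re

ZERO_WIDTH_SPACE = "\u200b"

# Assumed separator: whitespace, hyphen and the zero-width space.
# "\s" alone does not match U+200B, so it has to be listed explicitly.
SEPARATOR = re.compile(r"[\s\u200b\-]+")


def configure_segmenter(segmenter, jieba_dict=None, jieba_dict_user=None):
    """Apply both dictionary options to a jieba-like module or object.

    Hypothetical helper: `segmenter` is expected to offer
    `set_dictionary` and `load_userdict`, as the jieba module does.
    """
    if jieba_dict is not None:
        # Assumption: replaces the default dictionary.
        segmenter.set_dictionary(jieba_dict)
    if jieba_dict_user is not None:
        # Assumption: augments the dictionary with user-defined terms.
        segmenter.load_userdict(jieba_dict_user)
    return segmenter


def segment(segmenter, text):
    """Join the segments with zero-width spaces so the text renders unchanged."""
    return ZERO_WIDTH_SPACE.join(segmenter.cut(text))


def split_tokens(segmented):
    """Split segmented text on the assumed separator, dropping empty tokens."""
    return [token for token in SEPARATOR.split(segmented) if token]
```

With jieba installed, the module itself can be passed in, for example `configure_segmenter(jieba, jieba_dict_user="user_dict.txt")` followed by `split_tokens(segment(jieba, text))`. Without `\u200b` in the separator, the segments produced this way would not be split into separate tokens.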
