
Search does not support non-English languages #1081

Open
taills opened this issue Oct 29, 2019 · 13 comments
Labels
A-Localization Area: Localization, language support, etc. A-Search Area: Search

Comments

@taills

taills commented Oct 29, 2019

Unable to search for Chinese Keywords

@weihanglo
Member

mdBook uses elasticlunr.js for offline search under the hood, and according to weixsong/elasticlunr.js#53 there is no plan to support searching in other languages.

@hysencn

hysencn commented Mar 27, 2020

It is a good tool. I love it.

We, more than a million people, have the same issue. Could you help with it? We need Chinese search support.

Alternatively, could we search the Chinese content with Google? How would we do that?

@ehuss ehuss added A-Localization Area: Localization, language support, etc. A-Search Area: Search labels Apr 21, 2020
@futurist

For languages like Chinese, local search may not be a good fit; instead, a hosted service such as Algolia could be used, which is also what VuePress uses.
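
A minimal sketch of that idea, assuming Algolia's DocSearch v2 (a crawler would index the published book) and mdBook's default #searchbar input; the credentials are placeholders, and the snippet could be loaded through mdBook's additional-js option:

// Assumes docsearch.min.js has already been loaded from Algolia's CDN.
docsearch({
    apiKey: 'YOUR_SEARCH_ONLY_API_KEY', // placeholder
    indexName: 'YOUR_BOOK_INDEX',       // placeholder
    inputSelector: '#searchbar',        // mdBook's built-in search input
});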

@cxumol

cxumol commented Aug 16, 2020

Yes, it is quite possible to add search over Chinese characters.

  1. elasticlunr is the search engine used by mdBook.
  2. The Other Languages section of the elasticlunr official documentation shows that, with just 3 more lines of code, elasticlunr can be used with other languages (see the sketch after this list).
  3. Chinese language support for lunr-languages has been submitted as a PR but not yet merged.
  4. Alternatively, as suggested by comments on MihaiValentin/lunr-languages#32 (Need Chinese support), the Japanese language pack can be used as a workaround, because this line covers 一-龠, a range that includes most Chinese characters in the Unicode table.
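
For reference, a minimal sketch of that documented setup, assuming lunr.stemmer.support.js and a lunr-languages pack (German here, purely as an example) are already loaded on the page:

var index = elasticlunr(function () {
    // The "3 more lines": register the language pack,
    // then build the index as usual.
    this.use(lunr.de);
    this.addField('title');
    this.addField('body');
});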

@ehuss ehuss changed the title Unable to search for Chinese Keywords Search does not support non-English languages Jul 27, 2021
@bigdogs

bigdogs commented Apr 9, 2022

Any new progress on this issue?

@Sciroccogti

> Any new progress on this issue?

There is a PR #1496 working on it, but it needs help.

@wc7086

wc7086 commented May 14, 2023

Replacing elasticlunr.js with https://github.com/ajitid/fzf-for-js might allow this issue to be resolved.
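
A minimal sketch of what that could look like, assuming the npm package fzf (fzf-for-js); the document list is a hypothetical stand-in for mdBook's real search index:

import { Fzf } from 'fzf';

// Hypothetical stand-ins for entries from mdBook's search index.
const docs = ['介绍 mdBook', '安装指南', 'Searching the book'];

// fzf matches on characters rather than stemmed English words,
// so CJK substrings match without needing a segmenter.
const fzf = new Fzf(docs);
for (const entry of fzf.find('介绍')) {
    console.log(entry.item, [...entry.positions]); // matched doc + offsets
}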

@silence-coding

Looking forward to new progress on this PR.

@switefaster

switefaster commented Jul 5, 2023

@ehuss Would you please tell me whether this feature would be accepted? I don't think I can uncover any more issues by prototyping.

In case you're too busy to read the whole comment, I will highlight some key issues for you. All modifications are feature-gated.

  • I intend to import lunr-languages's Chinese extension and a WebAssembly module as a segmenter, which leads to:
    • extra static file dependencies, not only several .js files but also a .wasm
    • usage of ES6 modules and async
  • I have created a custom Language trait implementation, which is somewhat irrelevant to mdBook itself
  • I want to give users the ability to include a custom dictionary. My plan is to add a subsection to book.toml such as [output.html.zh] with a field additional-dict, just like additional-js (a hypothetical sketch follows this list). I don't know if you will be comfortable with this.
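
A hypothetical book.toml sketch of that proposal; neither the [output.html.zh] table nor the additional-dict field exists in mdBook today, and the dictionary path is a placeholder:

[output.html.zh]
# Proposed option: extra dictionary entries for the Chinese segmenter,
# analogous to how additional-js supplies extra scripts.
additional-dict = ["theme/userdict.txt"]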

All the modifications mentioned above, except the custom dictionary, are available to check in my fork. I would appreciate it a lot if you could tell me your thoughts on this feature and/or the issues I listed.


Importing lunr.zh.js (with slight modifications to make it compatible with elasticlunr) and the relevant extensions never works, because mdBook uses a pre-generated search index from elasticlunr-rs when building the book. The correct way to solve this is PR #1496. However, besides the flaws pointed out by the core maintainers, that PR does not use a preferable solution; for example, it does not use an appropriate segmenter. The PR also appears to have stalled.

I've figured out a possibly more elegant solution, namely to use either Intl.Segmenter or jieba-wasm as a Chinese segmenter. I'd love to work on this issue, but per the Contribution Guide, I'm not sure whether this issue has any attention from the maintainers, and I won't bother making a pull request if it does not.

Apart from all these, there are still some details we need to discuss:

  • Whether to use Intl.Segmenter or jieba-wasm. I prefer the latter, since Intl.Segmenter is not supported by Firefox while WebAssembly is supported by almost all browsers, and using jieba ensures consistency between the pre-generated index and the segments produced in the browser. However, jieba-wasm requires at least two extra file dependencies, jieba_rs_wasm.js and jieba_rs_wasm_bg.wasm; I'm not sure the maintainers will be happy with that, even if we can gate them behind feature flags. (A sketch of the segmenter option appears after this list.)
  • elasticlunr-rs's Chinese support is incomplete: its stop-word filter is inconsistent with the one from lunr-languages, and there is no sign of this being fixed. We could implement the Language trait ourselves in mdBook, but again I'm not sure that is appropriate, as it is somewhat irrelevant to mdBook itself. If the maintainers prefer not to, we would have to make a PR to elasticlunr-rs first and wait for it to be merged.
  • The search results can be odd (showing apparently poor matches), because the segmenter splits some idiom-like phrases into individual words, e.g. "换而言之" -> ["换", "而言", "之"]; this also happens with uncommon terms. This could be solved either by letting users add a custom dictionary or by not using a segmenter at all, since I realize most users would just be searching for keywords, in which case matching the whole term is more reasonable. If we accepted the latter, we wouldn't need to consider anything listed above at all :P However, we shouldn't go without a segmenter, because elasticlunr depends heavily on a tokenizer to work; otherwise we would have to build our own searcher or switch to another one, which I don't consider a good trade-off.
  • Search results in the result list are not highlighted. To be exact, it seems that only matches preceded by 'space' characters (i.e. space, tab, \n, etc.) are highlighted. I guessed we would have to modify something in searcher.js. Solved by making changes to searcher.js, particularly in the makeTeaser() function.
  • ...and more, if any
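
As a minimal sketch of the standard-API option, here is Intl.Segmenter applied to the "换而言之" example above; whether it comes back as one word or three depends on the engine's dictionary, which is exactly the consistency problem raised in the first bullet:

const segmenter = new Intl.Segmenter('zh', { granularity: 'word' });
const words = [...segmenter.segment('换而言之')]
    .filter(part => part.isWordLike)
    .map(part => part.segment);
console.log(words); // ["换而言之"] or ["换", "而言", "之"], engine-dependent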

I'll be keeping an eye on this comment to see if anyone is interested.

Update

I created a fork as a proof of concept, and I found some problems that I hadn't noticed when I posted this comment. jieba-wasm requires async loading to work. As a consequence, I was forced to change how searcher.js is loaded, from a regular script to an import() call in index.hbs, and the same for lunr.zh.js. Eventually it worked well, and ECMAScript modules are well supported, so I don't think that's a big issue, but I note it here since mdBook has not done this anywhere before.
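
In other words, something along these lines in the template (a sketch; the path and error handling are placeholders):

// Load the search code as an ES module so it can await the
// async jieba-wasm initialization internally.
import('./searcher.js').catch(err => {
    console.error('failed to load search module', err);
});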

Anyone willing to test it may build the forked repository with the zh feature and use the resulting binary as usual. It produces the expected results on my machine.

Update 2

Listed some more problems.

@chens08

chens08 commented Jan 13, 2024

How do I use it? Can you give an example book.toml? I forked your project and built it, but it doesn't work.

@miaomiao1992

Any new progress on this issue?

@aneteanetes

> just 3 more lines of code

It was a lie.

And it was kind of a long journey...

But anyway, here I am, with my instructions for adding non-English search with minimal pain:

  1. First of all, you need to add lunr.stemmer.support.js and lunr.YOURLANG.js. You can do this in several ways:
    1.1 Create head.hbs in the theme folder and add a <script> tag for each file
    1.2 Add the scripts via the additional-js key in book.toml (see the sketch after the code below)
    1.3 Or just append these scripts to the overridden file from step 3
  2. Additionally, I advise adding lunr.multi.js too.
  3. Then, you need to override searcher.js by putting a copy of the original file into the src folder.
  4. The best part. Find the searchindex = elasticlunr.Index.load(config.index); line, and replace it with:
searchindex = elasticlunr(function () {
    // Register the language(s); multiLanguage combines several packs.
    this.use(elasticlunr.multiLanguage('en', 'ru'));

    // Fields to index.
    this.addField('title');
    this.addField('body');
    this.addField('breadcrumbs');

    // Identify documents by this field.
    this.setRef('id');

    // Re-add all documents stored in the prebuilt index.
    for (let key in config.index.documentStore.docs) {
        this.addDoc(config.index.documentStore.docs[key]);
    }
});

And search will be fine.
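
For step 1.2, the book.toml fragment could look like this (file names are placeholders, assuming the lunr-languages scripts were copied into the book root):

[output.html]
additional-js = [
    "lunr.stemmer.support.js",
    "lunr.ru.js",
    "lunr.multi.js",
]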

But one more word: when mdBook builds the search index, it completely ignores the language settings in book.toml. It uses only English tokenization when creating the index, even though elasticlunr-rs supports other languages. Because of this behaviour, any attempt to add an additional language on the indexing side will fail. I don't write Rust code and can't create a PR, but I hope this information helps someone.

@luizbgomide

This issue is 5 years old and there is no actual progress in the code base yet? Search fails even with diacritics like á, à, ã, etc., which in theory could be very easy to handle with normalization and canonical decomposition...
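
For reference, a minimal sketch of the normalization being suggested: canonical decomposition (NFD) followed by stripping the combining marks, applied to both the indexed text and the query:

// Decompose accented characters, then drop the combining marks,
// so "ação" and "acao" compare equal.
function stripDiacritics(text) {
    return text.normalize('NFD').replace(/\p{Mn}/gu, '');
}
console.log(stripDiacritics('ação')); // "acao"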
