Title search appears to be broken with Chinese characters (possibly all UTF-8 multibyte characters) #3587

Closed
Jaifroid opened this issue Dec 9, 2023 · 14 comments

Comments

@Jaifroid
Member

Jaifroid commented Dec 9, 2023

A user on Reddit has reported that searching Chinese text no longer works on Android v3.8.1, in both the Google Play and the APK versions.

Assuming this issue can be reproduced (should be easy with a Chinese ZIM, and using search for one of the titles given by the Random button, if that works), then I would suspect that UTF-8 3-byte (most Chinese characters) and UTF-8 4-byte character codes are somehow not being catered for when reading the search field.
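To make the byte widths mentioned above concrete, here is a minimal Python sketch that reports how many bytes each character occupies in UTF-8. The helper name `utf8_length_by_char` and the sample string are illustrative assumptions only; nothing here is taken from the Kiwix or libzim code.

```python
def utf8_length_by_char(text: str) -> list[tuple[str, int]]:
    """Return each character paired with the byte length of its UTF-8 encoding."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


# Sample: an ASCII letter, a common CJK character (U+6BDB),
# and a CJK Extension B character (U+20000) outside the Basic Multilingual Plane.
sample = "a毛\U00020000"

for ch, size in utf8_length_by_char(sample):
    print(f"U+{ord(ch):04X} -> {size} byte(s)")
```

Any code that assumes one byte per character, or that cuts a byte buffer at an arbitrary offset, could split such characters in the middle. Whether that is what happens here is only a suspicion at this point.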

@Jaifroid Jaifroid added the i18n label Dec 9, 2023
@kelson42
Collaborator

kelson42 commented Dec 9, 2023

Yes, this is a duplicate of openzim/libzim#794

@wdscxsj

wdscxsj commented Jan 8, 2024

As a Chinese user of kiwix-android v3.9.1, I'm afraid the title search is also broken, so this issue may not be an exact duplicate.

For example, when I look up "毛泽东" (Mao Zedong in English, i.e. Chairman Mao of the PRC) in the Chinese Wikipedia (all-maxi version, 2023-09), there is no match.

If I try again character by character, the first character "毛" will trigger a long list of matches. (I suppose "毛泽东" is listed there, but not among the top dozens.) On my phone the third match is "毛一公", so let's enter "一" after "毛". This time there is no match again.

I believe this is still related to character encoding and text tokenization, as pointed out by @xiaoyifang.
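As a generic illustration of why tokenization matters for CJK text (not a diagnosis of this bug, and not how libzim or Xapian is actually configured), the sketch below compares a whitespace tokenizer with a per-character tokenizer. All function names are hypothetical.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Split on whitespace only; Chinese text without spaces stays one token."""
    return text.split()


def cjk_unigram_tokens(text: str) -> list[str]:
    """Treat every non-space character as its own token."""
    return [ch for ch in text if not ch.isspace()]


def matches(query_tokens: list[str], doc_tokens: list[str]) -> bool:
    """True if every query token appears among the document tokens (order ignored)."""
    return all(tok in doc_tokens for tok in query_tokens)


title = "毛泽东思想"
query = "毛泽东"

# Whitespace tokenization: the whole title is a single token, so the shorter
# query token is never found as an exact term.
print(matches(whitespace_tokens(query), whitespace_tokens(title)))

# Per-character tokenization: each query character is found in the title.
print(matches(cjk_unigram_tokens(query), cjk_unigram_tokens(title)))
```

The per-character variant ignores character order, so a real index would need something stricter (for example phrase or n-gram matching); the point is only that the choice of tokenizer decides whether a Chinese query can match at all.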

@Jaifroid
Member Author

Jaifroid commented Jan 8, 2024

I corroborate this. I've done a comparison of the Android app, the PWA (on Android) and Kiwix Desktop (on Windows). I used the wikipedia_zh_history_maxi_2023-12.zim ZIM and searched for 第一次世界大戰 (First World War).

For Kiwix Desktop and the PWA on Android, a basic title search for this term works, and both return the same two results (first two images). For Kiwix Android, we get no results (third image). This confirms that title search is also broken for Chinese text on Android, so I'm re-opening.

Only Kiwix Desktop can do a full-text search for this term. Although it indicates at the top of the display that no results were found, in fact it displays various correct results that include snippets. The PWA uses libzim for full-text search, but is unable to do full-text search for Chinese. I suspect there is an issue with how the text is transferred to libzim, and I'll open a new issue for that in the appropriate Repo.

image

@xiaoyifang

xiaoyifang commented Jan 9, 2024

openzim/libzim#802

I think if the ZIMs (which contain CJK) were created using libzim before 8.2.1, they should all have this issue.

@wdscxsj

wdscxsj commented Jan 9, 2024

@xiaoyifang Thanks for the pointer! I wonder if this is related to the missing English characters issue in any way. It's still a big issue in the latest all-maxi Chinese Wikipedia zim.

@kelson42 kelson42 added this to the 3.10.0 milestone Jan 9, 2024
@Jaifroid
Member Author

Jaifroid commented Jan 9, 2024

The ZIM I tested was created in December, whereas that PR was merged back in June. Do we know which libzim is currently being used in MWoffliner?

@kelson42
Collaborator

openzim/libzim#802

I think if the ZIMs (which contain CJK) were created using libzim before 8.2.1, they should all have this issue.

This is fixed, but MWoffliner, the scraper for Wikipedia, still uses an old version of libzim. Everything works fine here. We just need to complete openzim/mwoffliner#1702

@Jaifroid
Member Author

But just to point out that this issue relates to title search not working on the Android app with UTF8 multibyte characters, rather than Xapian search, which is what was fixed by openzim/libzim#802. Or maybe the Android app doesn't have title search any more (which is a shame if so, and a problem for searching any ZIM that doesn't have a Xapian index -- surely that can't be the case)?

@Jaifroid Jaifroid changed the title from "Search appears to be broken with Chinese characters (possibly all UTF-8 multibyte characters)" to "Title search appears to be broken with Chinese characters (possibly all UTF-8 multibyte characters)" Jan 11, 2024
@Jaifroid
Member Author

I don't want to belabour the point, but I tested title search in wikipedia_zh_medicine-app_maxi_2023-12.zim. NB This ZIM does NOT have a Xapian index. I was unable to search for 心房顫動 (Atrial fibrillation) in the Android app, whereas this term is found in other apps. @kelson42 Is this a separate issue, or should we re-open this issue, or have I misunderstood something? openzim/libzim#806 only appears to fix Xapian-based search, from my reading of the code, but I may be wrong.

@kelson42
Collaborator

kelson42 commented Jan 13, 2024

But just to point out that this issue relates to title search not working on the Android app with UTF8 multibyte characters, rather than Xapian search, which is what was fixed by openzim/libzim#802. Or maybe the Android app doesn't have title search any more (which is a shame if so, and a problem for searching any ZIM that doesn't have a Xapian index -- surely that can't be the case)?

@Jaifroid Our ZIM files, at Kiwix, have two title indexes, see https://wiki.openzim.org/wiki/Search_indexes. If the Xapian one is there, then it ignores the native ZIM one, which is the right thing to do.

@kelson42
Collaborator

kelson42 commented Jan 13, 2024

I don't want to belabour the point, but I tested title search in wikipedia_zh_medicine-app_maxi_2023-12.zim.

No problem, but to move forward we need to be very precise about what we do. For example, here you used a non-public special ZIM file (only for apps) which is not one of those first reported. By doing so, you fundamentally change the scope of the bug report, and that does not really make things easier.

NB This ZIM does NOT have a Xapian index.

It does have "a Xapian index": one for title suggestions, but not a full-text Xapian index. This is done on purpose because:

  • The FT index is pretty heavy and does not bring that much additional value
  • Kiwix Android cannot deal with the Xapian FT index

I was unable to search for 心房顫動 (Atrial fibrillation) in the Android app, whereas this term is found in other apps. @kelson42 Is this a separate issue, or should we re-open this issue, or have I misunderstood something? openzim/libzim#806 only appears to fix Xapian-based search, from my reading of the code, but I may be wrong.

I believe I have already given the reason why it does not work. Which other apps have you tested with? Might it be that one of them works with the native ZIM title index?

@Jaifroid
Member Author

Thanks for the further explanations; they help pinpoint the potential scope of this issue. It's clearly a serious problem for Chinese users.

Which other apps have you tested with? Might it be that one of them works with the native ZIM title index?

I tested with Kiwix Android, Kiwix Desktop (Windows) 2.3.1-2, and Kiwix PWA. The last two can do title search on the Chinese medicine ZIM. The Android app can't. I chose that ZIM to test because I thought it would narrow down the issue.

However, since it does indeed contain a Xapian non-FT index (something I was unaware of), it seems likely it should work with the Android app once the fix is in production. If not, we can revisit after.

I'm not sure I agree that ignoring binary search of the title index is good behaviour for an app. It should be the last fallback IMHO. At least for KJS apps, searching Xapian indices is very slow, so will always be secondary to binary title search unless we can speed things up.
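For readers unfamiliar with the idea, binary title search amounts to a prefix lookup over a sorted list of titles. The following is a minimal Python sketch under stated assumptions: the function name `prefix_search`, the `limit` parameter, and the sample titles are hypothetical, and the sorted list merely stands in for a ZIM title index rather than reproducing its on-disk format.

```python
from bisect import bisect_left


def prefix_search(sorted_titles: list[str], prefix: str, limit: int = 10) -> list[str]:
    """Return up to `limit` titles starting with `prefix` from a sorted list."""
    # All titles sharing a prefix are contiguous in sorted order, and the
    # first of them sits at the insertion point of the prefix itself.
    start = bisect_left(sorted_titles, prefix)
    results: list[str] = []
    for title in sorted_titles[start:]:
        if not title.startswith(prefix):
            break
        results.append(title)
        if len(results) >= limit:
            break
    return results


# Python compares strings by code point, which gives the same order as
# comparing their UTF-8 bytes, so multibyte titles need no special handling.
titles = sorted(["心房顫動", "心臟", "第一次世界大戰", "毛一公", "毛泽东", "Atrial fibrillation"])

print(prefix_search(titles, "毛"))
print(prefix_search(titles, "毛一"))
print(prefix_search(titles, "心房"))
```

Such a lookup only needs the titles themselves, which is why it can serve as a fallback for a ZIM that lacks a usable Xapian index.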

@kelson42
Collaborator

tested with Kiwix Android, Kiwix Desktop (Windows) 2.3.1-2, and Kiwix PWA. The last two can do title search on the Chinese medicine ZIM.

I don't know about Kiwix Desktop 2.3.1-2, but I tested with the cutting-edge version of Kiwix Desktop (dev) and it does not work, which is expected (I have just checked because I was worried by your sentence). You should test with a ZIM made with a recent version of libzim, like https://library.kiwix.org/viewer#gutenberg_zh_all_2023-12/ ... and then things work as they should.

@Jaifroid
Member Author

OK, sorry to have worried you ☹️. I tried testing just now with kiwix-desktop_x86_64_2024-01-14.appimage, but it just gives me the warning "No stemming for language 'zh'" in the console (running from a terminal in my Ubuntu WSL) when I try to search. This is probably due to something missing in WSL, and not something to worry about. So, let's leave this until the next official release of Kiwix Android and re-test then.
