Fail to show many English characters in Chinese Wikipedia #1256

Open
ghost opened this issue Sep 13, 2020 · 9 comments
Labels
bug question wikimedia Direct impact on Wikimedia content scraping
Milestone

Comments

ghost commented Sep 13, 2020

@wdscxsj commented on Sep 13, 2020, 9:11 AM UTC:

  • Kiwix version: 2.0.4
  • Affected local file: wikipedia_zh_all_maxi_2020-07.zim
  • OS: Windows 10 Enterprise 2004 Simplified Chinese Edition

On my machine, Kiwix often fails to show English characters in the Chinese Wikipedia zim.

For example, this is the display of the online item "纽约" (New York) in Chrome:

ny-chrome

And this is the same item displayed in Kiwix:

ny-kiwix

Kiwix doesn't provide a "devtools" to examine the zim data, but it's unlikely that the English text is missing in the zim. Judging from the online page, text snippets with a lang attribute, like <span lang="en">City of New York</span> and <span lang="la">Novum Eboracum</span>, fail to render properly. "Plain" text like 41 °C is OK.

By the way, the English Wikipedia works perfectly on the same machine. It would also be nice if custom CSS were allowed, because the default choice of Chinese fonts is not very visually appealing. Thanks.

This issue was moved by kelson42 from kiwix/kiwix-desktop#520.

@ghost ghost added bug question labels Sep 13, 2020

ghost commented Sep 13, 2020

@kelson42 commented on Sep 13, 2020, 9:24 AM UTC:

@wdscxsj Thank you for your bug report. To get a better look at the HTML of an article, you can start the server mode and open the article in your usual Web browser.


ghost commented Sep 13, 2020

@wdscxsj commented on Sep 13, 2020, 9:46 AM UTC:

@kelson42 Thank you so much for the pointer! It's quite surprising to see the English characters are missing from the DOM:

1

and the page source:

2

Any way to directly check the zim data?
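Once the page source is reachable through server mode, the check described above can be done mechanically instead of by eye. The following is only a minimal sketch using Python's standard library: it collects every `<span lang="...">` in a saved page and reports the ones with no text. The names `LangSpanCollector`, `lang_spans` and `empty_lang_spans` are made up for illustration, and the `zim_html` sample is a hypothetical reconstruction of what the broken page might look like, not the actual ZIM content.

```python
from html.parser import HTMLParser


class LangSpanCollector(HTMLParser):
    """Collect (lang, text) pairs for every <span lang="..."> in a page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.spans = []   # finished (lang, text) pairs, in closing order
        self._stack = []  # one entry per open <span>: None or [lang, parts]

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            lang = dict(attrs).get("lang")
            # Spans without a lang attribute are tracked as None so that
            # their closing tags do not pop a lang-tagged span by mistake.
            self._stack.append([lang, []] if lang else None)

    def handle_data(self, data):
        # Text counts towards every lang-tagged span that is currently open.
        for entry in self._stack:
            if entry is not None:
                entry[1].append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            entry = self._stack.pop()
            if entry is not None:
                self.spans.append((entry[0], "".join(entry[1]).strip()))


def lang_spans(html):
    """Return a list of (lang, text) for all lang-tagged spans in html."""
    parser = LangSpanCollector()
    parser.feed(html)
    parser.close()
    return parser.spans


def empty_lang_spans(html):
    """Return the lang codes of lang-tagged spans that contain no text."""
    return [lang for lang, text in lang_spans(html) if not text]


# Snippet modelled on the online page quoted in the report.
online_html = ('<p>纽约(<span lang="en">City of New York</span>,'
               '<span lang="la">Novum Eboracum</span>)</p>')
# Hypothetical: what the same markup would look like with the text dropped.
zim_html = '<p>纽约(<span lang="en"></span>,<span lang="la"></span>)</p>'

print(lang_spans(online_html))
print(empty_lang_spans(zim_html))
```

If the spans are absent from the ZIM page altogether rather than empty, `lang_spans` simply returns fewer entries than it does for the online page, which is also a usable signal.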


stale bot commented Nov 13, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Nov 13, 2020
@kelson42 kelson42 added this to the 1.13 milestone Nov 13, 2020
@stale stale bot removed the stale label Nov 13, 2020
@kelson42 kelson42 added the wikimedia Direct impact on Wikimedia content scraping label Nov 13, 2020

stale bot commented Jan 14, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Jan 14, 2021
@JokerQyou

This is still an issue after almost two years.


JokerQyou commented Jul 26, 2022

Take the "Antarctica" page, for example. Wikipedia renders this:
image

while the same page in the latest ZIM file from Kiwix renders this:
image

You can see the <span lang="en">...</span> tag is missing.

The original wikitext is:

'''南极洲'''({{lang-en|Antarctica}})是[[地球]]最南端的[[大洲|]]

Here {{lang-en|Antarctica}} is a template that marks up the English-language form of a term. This leads me to wonder whether the parser failed to render this specific template.
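To test the template hypothesis on more than one article, it may help to list which terms the wikitext expects to appear in the rendered page. Below is a rough sketch, assuming only the simple `{{lang-xx|text}}` form shown above; the regular expression and the `lang_template_calls` helper are illustrative, not part of MediaWiki or mwOffliner, and nested templates or extra parameters are not handled.

```python
import re

# Matches simple {{lang-xx|text}} calls. The part after "lang-" is captured
# as-is, so {{Lang-en-short|...}} yields the suffix "en-short".
LANG_TEMPLATE = re.compile(
    r"\{\{\s*lang-([A-Za-z-]+)\s*\|\s*([^{}|]+?)\s*\}\}",
    re.IGNORECASE,
)


def lang_template_calls(wikitext):
    """Return (suffix, text) for each simple lang-xx template call."""
    return [(m.group(1).lower(), m.group(2))
            for m in LANG_TEMPLATE.finditer(wikitext)]


wikitext = "'''南极洲'''({{lang-en|Antarctica}})是[[地球]]最南端的[[大洲|]]"
print(lang_template_calls(wikitext))
```

Each extracted term can then be searched for in the article HTML taken from the ZIM to see whether the template output survived.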

However, confusingly, the Japanese Wikipedia does not have the same issue; this page (and other archived Japanese pages) renders fine:
image

This Japanese page uses another similar template called Lang-en-short.

@stale stale bot removed the stale label Jul 26, 2022
@JokerQyou

On further investigation, it looks like <span lang="en"></span> elements are not the only ones affected. Text is also missing from the reference list and the external links list of the same page:

image
image

These render normally on the original Wikipedia page:

image
image


stale bot commented May 26, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@Jaifroid
Collaborator

Jaifroid commented Jan 9, 2024

Issue still current as of December 2023 (I tested with wikipedia_zh_history_maxi_2023-12.zim). I think this should be a higher priority for fixing. Can we find out whether the mobile endpoint is omitting the English-language text, or whether mwOffliner is somehow filtering it out? Since the text is not in the ZIM (it's not a case of it being hidden), it might be the former.
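One way to narrow down where the text disappears is to run the same comparison at each stage: the live page against the endpoint response, and the endpoint response against the article stored in the ZIM. The sketch below is a crude, standard-library-only illustration of such a comparison. It does not fetch anything; the HTML for each stage has to be saved separately, and the helper names (`visible_text`, `missing_latin_runs`) are invented for this example. It reports runs of Latin letters that are visible in a reference page but absent from a candidate page, so it also catches text lost outside `<span lang="en">` elements, such as in reference lists.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Gather the text nodes of an HTML document, skipping script/style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_text(html):
    """Return the text nodes of html joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# A run is a word of two or more ASCII letters, optionally followed by
# further ASCII words separated by single spaces.
LATIN_RUN = re.compile(r"[A-Za-z]{2,}(?: [A-Za-z]+)*")


def missing_latin_runs(reference_html, candidate_html):
    """List Latin runs found in the reference text but not in the candidate.

    This is a plain substring check, so it can miss a loss when the same
    run happens to occur elsewhere in the candidate page.
    """
    candidate_text = visible_text(candidate_html)
    seen, missing = set(), []
    for run in LATIN_RUN.findall(visible_text(reference_html)):
        if run not in candidate_text and run not in seen:
            seen.add(run)
            missing.append(run)
    return missing


# Hypothetical snippets modelled on the "Antarctica" example in this thread.
reference = '<p><b>南极洲</b>(<span lang="en">Antarctica</span>)是地球最南端的大洲</p>'
candidate = '<p><b>南极洲</b>()是地球最南端的大洲</p>'
print(missing_latin_runs(reference, candidate))
```

If the endpoint response already lacks the runs, the problem sits upstream of mwOffliner; if the response has them and the ZIM article does not, the filtering happens during scraping.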

@stale stale bot removed the stale label Jan 9, 2024