
Wiktionary/Wikivoyage zim databases lag website by five months #1397

Closed
archenemies opened this issue Feb 6, 2021 · 51 comments
Assignees
Labels
upstream wikimedia Direct impact on Wikimedia content scraping
Milestone

Comments

@archenemies

I have a Wiktionary Zim file from December 2020, which I downloaded using the GUI kiwix-desktop interface (2020-12-10; "Pictures, Fulltext index"; 5.65 GB).

This works great for me but I'm not sure how to figure out which Wiktionary it is based on.

It lacks changes to Wiktionary made in August 2020, although it contains changes from May 2020.

Where can I find out which Wiktionary dump a Zim file is based on, and how do I find a Zim file which is based on a current version of Wiktionary?

(And where should I submit this issue?)

@kelson42 kelson42 self-assigned this Feb 7, 2021
@kelson42
Collaborator

kelson42 commented Feb 7, 2021

Are you talking about Wiktionary in English? Which content exactly is missing (two screenshots would be helpful)?

@archenemies
Author

Yes, English.

Here is an example of a diff from August which is missing from the December 2020 Kiwix Wiktionary Zim file. I just picked it at random; so far the December Zim file seems to be missing everything since around June.

https://en.wiktionary.org/w/index.php?title=rocker&diff=prev&oldid=60027083

Someone added a sense to "rocker", number 4 here:

[image: screenshot-2021-02-06_20 12 34]

Here's the Kiwix screenshot where you can see that it's missing:

[image: screenshot-2021-02-06_20 12 47]

I guess the answer to my other question is that there is no reason for the Zim file to be out of date then? Certainly as a software developer I would expect the Zim file to have embedded in it a date corresponding to when it was compiled, so that this kind of ad-hoc testing would not be necessary. Or does it get updated one word at a time, so different dictionary entries are out of date by different amounts? But in that case I would expect each entry to come with a timestamp...

@kelson42
Collaborator

kelson42 commented Feb 7, 2021

@archenemies I will have a look (and move the ticket), but it looks like a problem with a root cause in the Wikimedia infrastructure.

@kelson42 kelson42 transferred this issue from openzim/zim-requests Feb 7, 2021
@kelson42 kelson42 added bug question upstream wikimedia Direct impact on Wikimedia content scraping labels Feb 7, 2021
@kelson42 kelson42 added this to the 1.12 milestone Feb 7, 2021
@kelson42
Collaborator

kelson42 commented Feb 7, 2021

@archenemies BTW, the revision ID and the revision date are available in the upstream link in the footer of each article.

@archenemies
Author

That's interesting about the upstream link in the footer. Well, "rocker" has the wrong link:

https://en.wiktionary.org/wiki/?title=rocker&oldid=61038509

because it points to a revision from 4 November 2020 with the "breve below" sense #4 filled in, but the page that Kiwix serves me lacks that sense.
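For anyone who wants to compare the footer link against what they actually see, the revision ID can be pulled out of the link's `oldid` query parameter. This is only a sketch using the Python standard library; the function name `parse_upstream_link` is made up for illustration and is not part of Kiwix or mwoffliner.

```python
from urllib.parse import urlparse, parse_qs


def parse_upstream_link(url: str) -> tuple[str, int]:
    """Return (title, revision_id) from a link carrying title= and oldid=.

    Hypothetical helper: it only parses the URL, it does not contact the wiki.
    """
    query = parse_qs(urlparse(url).query)
    if "title" not in query or "oldid" not in query:
        raise ValueError(f"no title/oldid parameters in {url!r}")
    return query["title"][0], int(query["oldid"][0])


# The footer link quoted above for the "rocker" entry.
title, revision_id = parse_upstream_link(
    "https://en.wiktionary.org/wiki/?title=rocker&oldid=61038509"
)
print(title, revision_id)
```

The same function works on the diff link quoted earlier, since that link also carries `title` and `oldid`.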

@kelson42
Collaborator

It looks like a bug in the Wikimedia REST API, because it simply does not deliver the latest version (as you reported). See: https://en.wiktionary.org/api/rest_v1/page/mobile-sections/rocker. This is the root of the bug.

On the mwoffliner side, there is a weakness: we don't request a specific revision ID, but just take the latest. If we retrieved https://en.wiktionary.org/api/rest_v1/page/mobile-sections/rocker/61774146, we would get the proper content.

I will do what is necessary on both sides to improve the situation.
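The difference between the two request styles above comes down to whether the revision ID is appended to the endpoint path. A rough sketch of building both URL forms follows; this is not mwoffliner's code, and the helper name `mobile_sections_url` and its parameters are invented for illustration. It only constructs the URL and makes no request.

```python
from typing import Optional
from urllib.parse import quote


def mobile_sections_url(host: str, title: str, revision_id: Optional[int] = None) -> str:
    """Build a mobile-sections URL, optionally pinned to one revision.

    Hypothetical helper mirroring the two URL shapes quoted in this comment.
    """
    # MediaWiki titles use underscores instead of spaces; escape the rest.
    encoded_title = quote(title.replace(" ", "_"), safe="")
    base = f"https://{host}/api/rest_v1/page/mobile-sections/{encoded_title}"
    if revision_id is None:
        return base  # "latest" form, which is what was being served stale
    return f"{base}/{revision_id}"  # revision-pinned form


print(mobile_sections_url("en.wiktionary.org", "rocker"))
print(mobile_sections_url("en.wiktionary.org", "rocker", 61774146))
```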

@kelson42
Collaborator

A bug ticket has been opened upstream at https://phabricator.wikimedia.org/T274359

@kelson42
Collaborator

@MananJethwani Here again this is "complicated" to change due to the architecture.

@archenemies
Author

@kelson42 Thank you so much for tracking that down and re-reporting the bug

@stale

stale bot commented Jun 2, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Jun 2, 2021
@kelson42 kelson42 changed the title from "Wiktionary zim databases lag website by five months" to "Wiktionary/Wikivoyage zim databases lag website by five months" Aug 16, 2021
@stale stale bot removed the stale label Aug 16, 2021
@Jaifroid
Collaborator

Just to track and keep this issue fresh, it is still impossible to open the article "Cambridge" from the 2021-09 English Wikivoyage ostensibly due to this bug. (Cambridge is a major tourist destination pre- and post-pandemic, so it is a quite serious upstream bug!)

@kelson42
Collaborator

kelson42 commented Dec 5, 2021

See also https://phabricator.wikimedia.org/T226931. It seems there is momentum these days to fix it upstream...

@Jaifroid
Collaborator

"Cambridge" still inaccessible in the December Wikivoyage in English... The lag hasn't caught up yet...

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@kelson42
Collaborator

kelson42 commented Jan 1, 2023

@Jaifroid If we are lucky, implementing #1664 will help to fix the problem.

@Jaifroid
Collaborator

Jaifroid commented Jan 1, 2023

It looks like it would in some cases at least...

@Jaifroid
Collaborator

A slight advance: the article Cambridge (England) is now accessible in the January 2023 Wikivoyage!
However, the article Mompox is still showing an outdated version, despite the upstream fix. I guess clearing the caches upstream didn't work.

@Jaifroid
Collaborator

I know this is just adding more examples, but just to underline the gravity of this issue, now almost anywhere I look in the latest English Wikivoyage ZIM (01_2023), I find pages that claim to be scrapes of a recent version, but in fact are out of date by a year or more and contain increasingly useless information. Here's another example, the "Argentina" article. The link at the foot of the page takes us to a revision last made 2nd January 2023, but in fact the scraped text does not at all correspond to this version. As can be seen below (ZIM on left, 2-Jan-23 version on right), there is a seriously outdated info panel saying that entry to Argentina is heavily restricted due to COVID-19.

At this point, it's not possible to recommend travelling with the latest Wikivoyage ZIMs!

Do we have a timeline on switching to the new API? It is becoming quite urgent, unfortunately.

[image: Argentina article, ZIM on left, 2-Jan-23 version on right]

@Jaifroid
Collaborator

Jaifroid commented Feb 7, 2023

It looks like this may finally be fixed in principle: see https://phabricator.wikimedia.org/T226931. The caveat is that articles will only get updated once they are edited after today, in which case the mobile-sections endpoint for the article should update. If an article is not edited since the fix, then the cache won't be changed, and it will still continue to serve out-of-date content.
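To make the caveat concrete: under the behaviour described here, whether an article can still be served stale depends only on whether its last edit falls after the fix. A small sketch of that rule follows. The fix date is an assumption taken from the date of this comment, and `may_serve_stale_content` is a made-up name, not an existing tool.

```python
from datetime import date

# Assumption: the upstream fix landed on the day of this comment.
FIX_DATE = date(2023, 2, 7)


def may_serve_stale_content(last_edited: date, fix_date: date = FIX_DATE) -> bool:
    """Return True if the cached mobile-sections content may still be outdated.

    Per the caveat above, only an edit made after the fix refreshes the cache.
    """
    return last_edited <= fix_date


# Last edit before the fix: the endpoint may still return old content.
print(may_serve_stale_content(date(2023, 1, 2)))
# Edited after the fix: the cache should have been refreshed.
print(may_serve_stale_content(date(2023, 2, 22)))
```

Note that a True result does not prove an article is outdated, only that nothing has forced its cache entry to refresh.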

@kelson42
Collaborator

kelson42 commented Feb 7, 2023

@Jaifroid They should purge the full cache, otherwise our bug won't be fixed.

@Jaifroid
Collaborator

Jaifroid commented Feb 7, 2023

Well, I agree, but the maintainers are being cautious and want to watch the change for a couple of weeks to make sure they haven't introduced a regression. I think we can push for a full cache purge if everything seems OK.

@Jaifroid
Collaborator

I'm happy to report that the latest English Wikivoyage (February 2023) is now showing the latest revision for the article on Mompox (one of my test articles). I updated that page after the fix in https://phabricator.wikimedia.org/T226931, as a test, and it's pleasing to see that the fix has worked and made it into the Wikivoyage ZIM. Additionally, the dated COVID-19 warning no longer appears on the Argentina country page, though it still appears on some other Latin American country pages (e.g. Colombia). This is because some pages will have been edited since the fix, and others not yet. Even a null edit will, apparently, update the cache now.

It would be worth testing Wiktionary pages that have been updated since round about 8th February. After confirmation, I think this issue could be closed.

@kelson42
Collaborator

@Jaifroid Should I restart a specific scrape for Wiktionary?

@kelson42 kelson42 modified the milestones: 1.14.0, 1.13.0 Feb 19, 2023
@Jaifroid
Collaborator

@kelson42 The last English Wiktionary scrape appears to be 31st October 2022 (at least that's the last one on download.kiwix.org), so yes, it would be good to try to get a new scrape if possible, though we could test other languages if we can identify a page updated since 8th Feb (or update a page manually with a minor edit). It might be worth doing this in a controlled way: make a minor edit to a page we know to be problematic, then run the scrape?

@kelson42
Collaborator

kelson42 commented Feb 19, 2023

@Jaifroid You are in the lead, just let me know what to do. But the latest scrape of Wiktionary EN seems to suffer from a bug: #1789

@Jaifroid
Collaborator

OK, I'll edit one of the reported pages above and will let you know when done so you can initiate a scrape.

@kelson42
Collaborator

@Jaifroid any new recipe to relaunch?

@Jaifroid
Collaborator

Sorry, I realized I would have to download the latest available Wiktionary archive to find an article that is not updated... Nearly there.

@Jaifroid
Collaborator

Jaifroid commented Feb 22, 2023

@kelson42 OK, I've made a minor edit (adding a derived word) to the rocker article mentioned above, which appears to be still outdated in the latest Wikivoyage WIKTIONARY we have, and had been like that since first reported above.

So you could run a new scrape of English Wiktionary. We need a new one in any case, since the last one is a bit old now.

@Jaifroid
Collaborator

@kelson42 Please note I meant WIKTIONARY, not Wikivoyage!

@danielzgtg

Also something seems to be up these days with https://download.kiwix.org/zim/wiktionary/ . The last nopics were 2022-10 and 2022-09 and the last maxis were 2022-09 and 2022-07. The nopic used to be released every month, and the maxi used to be every 3 months. I use the maxi zims, but it's already 2023-02 now.
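The gap can be worked out from the date suffix of the file names. The sketch below assumes names ending in `_YYYY-MM.zim` (for example `wiktionary_en_all_maxi_2022-09.zim`); that pattern and the helper name `months_behind` are assumptions for illustration, not something confirmed in this thread.

```python
import re

# Assumed naming scheme: anything ending in _YYYY-MM.zim
ZIM_DATE_RE = re.compile(r"_(?P<year>\d{4})-(?P<month>\d{2})\.zim$")


def months_behind(filename: str, current_year: int, current_month: int) -> int:
    """Return how many calendar months separate a ZIM release from 'now'."""
    match = ZIM_DATE_RE.search(filename)
    if match is None:
        raise ValueError(f"no YYYY-MM date suffix in {filename!r}")
    year, month = int(match["year"]), int(match["month"])
    return (current_year - year) * 12 + (current_month - month)


# Hypothetical file names for the latest maxi and nopic releases, as of 2023-02.
print(months_behind("wiktionary_en_all_maxi_2022-09.zim", 2023, 2))
print(months_behind("wiktionary_en_all_nopic_2022-10.zim", 2023, 2))
```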

@Jaifroid
Collaborator

@danielzgtg There was an issue about this here: #1789. It's been fixed very recently (fingers crossed).

@kelson42
Collaborator

@Jaifroid Actually I have checked and your revision is delivered properly by the API, see https://en.wiktionary.org/api/rest_v1/page/mobile-sections/rocker. Closing the ticket.

@Jaifroid
Collaborator

Thanks, @kelson42 -- great to be able to close this issue finally!


5 participants