Failure of www.wikidoc.org due to missing CSS dependency #2091

Open
benoit74 opened this issue Oct 7, 2024 · 20 comments

Comments

@benoit74
Contributor

benoit74 commented Oct 7, 2024

I tried to create a ZIM of https://www.wikidoc.org/ with docker run --rm --name mwoffliner_test ghcr.io/openzim/mwoffliner:dev mwoffliner --adminEmail="contact@kiwix.org" --customZimDescription="Desc" --format="novid:maxi" --mwUrl="https://www.wikidoc.org/" --mwWikiPath "index.php" --mwActionApiPath "api.php" --mwRestApiPath "rest.php" --publisher="openZIM" --webp --customZimTitle="Custom title" --verbose

It fails with the following error:

[error] [2024-10-07T07:51:01.434Z] Unable to retrieve js/css dependencies for article 'CSA Trust': nosuchrevid
[log] [2024-10-07T07:51:01.434Z] Exiting with code [1]
[log] [2024-10-07T07:51:01.434Z] Deleting temporary directory [/tmp/mwoffliner-1728287394504]
file:///tmp/mwoffliner/lib/Downloader.js:570
            throw new Error(errorMessage);
                  ^

Error: Unable to retrieve js/css dependencies for article 'CSA Trust': nosuchrevid
    at Downloader.getModuleDependencies (file:///tmp/mwoffliner/lib/Downloader.js:570:19)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async file:///tmp/mwoffliner/lib/util/saveArticles.js:250:45

Does this mean we cannot create a ZIM of this wiki just because of one bad CSS dependency? Is there a way to ignore it (it is probably not used on the live website anyway if it does not exist)?

@audiodude
Member

I tried running this locally, and got the same error, except for a different article:

Unable to retrieve js/css dependencies for article 'MPP+': nosuchrevid

Looking at the source website, I can't find articles for MPP+ or CSA Trust (which is the article mentioned in OP). So there are two questions:

  1. Should we fail the entire ZIM if we can't load the module dependencies of one page?
  2. Are we making a mistake in how we get the list of articles to download?

For 1), my impression was that the general approach of mwoffliner is to fail if an article cannot be retrieved, except in the narrow case that it was deleted between the time the article list was built and when the data was requested. @kelson42 what are your thoughts?

For 2), I would need to dig more into the way the article list is built, because I'm not immediately familiar with it.

@benoit74
Contributor Author

benoit74 commented Oct 8, 2024

Thank you!

Regarding the fact that our attempts stop at different articles, this is not a surprise to me. From my experience, the order of the article list seems to be "random".

@audiodude
Member

They're not "random" really, just highly asynchronous as you pointed out in #2092

@audiodude
Member

So I tried to start a PR that would ignore the nosuchrevid error code. However, the scraping then fails with:

[error] [2024-10-09T00:41:01.659Z] Error downloading article MPP+

So I think the problem is definitely in the methodology for figuring out which articles to scrape.

@audiodude
Member

Okay, so mwoffliner fetches API responses from URLs like:

https://www.wikidoc.org/api.php?action=query&format=json&prop=redirects%7Crevisions%7Ccoordinates&rdlimit=max&rdnamespace=0&formatversion=2&colimit=max&rawcontinue=true&generator=allpages&gapfilterredir=nonredirects&gaplimit=max&gapnamespace=0&gapcontinue=MCM7

And uses the result to get a list of article titles to later download.
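As a rough illustration of how such a URL can be composed, here is a minimal TypeScript sketch. The helper name `buildArticleListUrl` is hypothetical and is not taken from the mwoffliner source; only the query parameters themselves come from the URL quoted above.

```typescript
// Hypothetical helper (not mwoffliner's actual code): composes the
// "allpages" generator query quoted in this comment.
function buildArticleListUrl(apiUrl: string, gapcontinue?: string): string {
  const params = new URLSearchParams({
    action: "query",
    format: "json",
    prop: "redirects|revisions|coordinates", // "|" is serialized as %7C
    rdlimit: "max",
    rdnamespace: "0",
    formatversion: "2",
    colimit: "max",
    rawcontinue: "true",
    generator: "allpages",
    gapfilterredir: "nonredirects",
    gaplimit: "max",
    gapnamespace: "0",
  });
  // The continuation token is only present when resuming a listing.
  if (gapcontinue !== undefined) {
    params.set("gapcontinue", gapcontinue);
  }
  return `${apiUrl}?${params.toString()}`;
}

const articleListUrl = buildArticleListUrl("https://www.wikidoc.org/api.php", "MCM7");
console.log(articleListUrl);
```

Calling the helper without a second argument gives the first page of the listing; passing the continuation value returned by the API resumes from that point.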

This endpoint is returning the following:

 {
    "pageid": 112260,
    "ns": 0,
    "title": "MS Bike Tour",
    "revisions": [
      {
        "revid": 718699,
        "parentid": 678940,
        "minor": true,
        "user": "WikiBot",
        "timestamp": "2012-09-04T19:21:10Z",
        "comment": "Robot: Automated text replacement (-{{WikiDoc Cardiology Network Infobox}} +, -<references /> +{{reflist|2}}, -{{reflist}} +{{reflist|2}})"
      }
    ]
  },
  {
    "pageid": 119777,
    "ns": 0,
    "title": "MEND-CABG II trial does not suggest improved outcomes with the novel drug MC1 in patients undergoing high risk coronary artery bypass surgery"
  },
  {
    "pageid": 120569,
    "ns": 0,
    "title": "MPP+"
  },
  {
    "pageid": 123474,
    "ns": 0,
    "title": "MDL Chime",
    "revisions": [
      {
        "revid": 678905,
        "parentid": 352108,
        "minor": true,
        "user": "WikiBot",
        "timestamp": "2012-08-09T17:05:45Z",
        "comment": "Robot: Automated text replacement (-{{SIB}} + & -{{EH}} + & -{{EJ}} + & -{{Editor Help}} + & -{{Editor Join}} +)"
      }
    ]
  },

So I hate to say it, but I think the wiki is misconfigured. It's returning an MPP+ page with no revisions, which leads to the errors later.

This page recommends running update.php, which I believe is part of the MediaWiki install. Can we reach out to the operators of the wiki to do that, and then try again?

@audiodude
Member

If we really need to, we can also filter pages with no revisions from the scrape.
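A minimal sketch of what such a filter could look like, assuming the `formatversion=2` response shape shown above. The `QueryPage` type and the `partitionByRevisions` function are hypothetical names for illustration, not existing mwoffliner code.

```typescript
// Assumed shape of one entry in the action=query response (formatversion=2).
interface QueryPage {
  pageid: number;
  ns: number;
  title: string;
  revisions?: { revid: number }[];
}

// Hypothetical helper: separates pages that carry at least one revision
// from pages that have none, so the latter can be logged and skipped.
function partitionByRevisions(pages: QueryPage[]): { usable: QueryPage[]; skipped: QueryPage[] } {
  const usable: QueryPage[] = [];
  const skipped: QueryPage[] = [];
  for (const page of pages) {
    if (Array.isArray(page.revisions) && page.revisions.length > 0) {
      usable.push(page);
    } else {
      skipped.push(page);
    }
  }
  return { usable, skipped };
}

// Sample data taken from the response quoted earlier in this thread.
const samplePages: QueryPage[] = [
  { pageid: 112260, ns: 0, title: "MS Bike Tour", revisions: [{ revid: 718699 }] },
  { pageid: 120569, ns: 0, title: "MPP+" },
  { pageid: 123474, ns: 0, title: "MDL Chime", revisions: [{ revid: 678905 }] },
];
const { usable, skipped } = partitionByRevisions(samplePages);
console.log(usable.map((p) => p.title), skipped.map((p) => p.title));
```

Returning the skipped pages instead of silently dropping them keeps it possible to warn about each one or to count them against a limit.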

@audiodude
Member

Now I'm getting this:

{
  error: {
  code: "parsoid-stash-rate-limit-error",
  info: "Stashing failed because rate limit was exceeded. Please try again later.",
  docref: "See https://www.wikidoc.org/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/&gt; for notice of API deprecations and breaking changes."
  }
}

for this URL:

https://www.wikidoc.org/api.php?action=visualeditor&mobileformat=html&format=json&paction=parse&formatversion=2&page=Lymphangiomyomatosis_surgery

@audiodude
Member

See wikimedia/restbase#1140

@benoit74
Contributor Author

So I hate to say it, but I think the wiki is misconfigured. It's returning an MPP+ page with no revisions, which leads to the errors later.

Is it a wiki misconfiguration or just a slightly broken database content?

I have already managed to break database content on multiple occasions due to MediaWiki bugs.

Anyway, I really think the scraper should be able to continue past such errors, and only stop if too many errors occur. From my PoV it is sad not to put content offline just because the website has some small issues and the website maintainer is no longer around or able to fix them. This happens way too often. And we cannot expect the scraper user to manually list, one by one, every page that turns out to have an issue. Or at least we should report all articles which have an issue at once and fail the scrape, letting the user decide whether it is OK to ignore these articles (by adding them to the ignore list).
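To make the two ideas in this comment concrete (stop only past a threshold, and report every failing article at once), here is a small sketch. The `FailureBudget` class and its threshold are assumptions for illustration; mwoffliner does not necessarily expose anything like this.

```typescript
interface ArticleFailure {
  title: string;
  reason: string;
}

// Hypothetical collector: remembers every failed article and tells the
// caller when the configured number of tolerated failures is exceeded.
class FailureBudget {
  private readonly maxFailures: number;
  private readonly failures: ArticleFailure[] = [];

  constructor(maxFailures: number) {
    this.maxFailures = maxFailures;
  }

  record(title: string, reason: string): void {
    this.failures.push({ title, reason });
  }

  get exceeded(): boolean {
    return this.failures.length > this.maxFailures;
  }

  // One line per failed article, suitable for a final report
  // or for building an ignore list.
  report(): string {
    return this.failures.map((f) => `${f.title}: ${f.reason}`).join("\n");
  }
}

// Usage with the two articles reported in this thread and a budget of 1.
const budget = new FailureBudget(1);
for (const title of ["CSA Trust", "MPP+"]) {
  budget.record(title, "nosuchrevid");
  if (budget.exceeded) {
    console.error(`Too many failed articles:\n${budget.report()}`);
    break;
  }
}
```

Because the report lists every recorded title, the user could paste it into an ignore list and rerun, rather than discovering broken pages one run at a time.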

@kelson42
Collaborator

kelson42 commented Oct 15, 2024

It is tolerated that a whole article is missing (i.e. HTTP 404). This scenario can happen at any time, because a user can delete an article at any time.

What is not tolerated is that the backend does not deliver as it should (for example with timeouts or HTTP 5xx errors).

But here the situation seems different and not that easy to assess.
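The rule stated above can be sketched as a small decision function. The type and function names are hypothetical, and the handling of statuses the comment does not mention (anything other than 2xx, 404, and 5xx) is an assumption: this sketch treats them as fatal.

```typescript
// Outcome of one article request: either an HTTP status or a timeout.
type FetchOutcome =
  | { kind: "http"; status: number }
  | { kind: "timeout" };

type Decision = "continue" | "skip-article" | "abort-scrape";

// Hypothetical policy function, not mwoffliner's actual implementation.
function decideOnOutcome(outcome: FetchOutcome): Decision {
  if (outcome.kind === "timeout") {
    return "abort-scrape"; // the backend is not delivering
  }
  if (outcome.status >= 200 && outcome.status < 300) {
    return "continue";
  }
  if (outcome.status === 404) {
    return "skip-article"; // article deleted since the list was built
  }
  // 5xx is named as not tolerated; treating every other status the
  // same way is an assumption of this sketch.
  return "abort-scrape";
}

console.log(decideOnOutcome({ kind: "http", status: 404 }));
```

The `nosuchrevid` case discussed in this issue is the awkward one: it arrives as an API-level error code rather than as an HTTP status, so it does not fit this table directly.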

@audiodude
Member

Here are some more examples from the action=query endpoint:

{
  pageid: 22273,
  ns: 0,
  title: "MLN64"
},
{
  pageid: 23910,
  ns: 0,
  title: "MSin3 interaction domain"
},
{
  pageid: 31260,
  ns: 0,
  title: "MHC restriction"
},

None of these entries have a revisions key, so I assume they would also error out. In fact, when I added code to skip pages with no revisions key and log a warning, I got hundreds and hundreds of warnings. So many that I thought my code was broken and that you can't count on data items from this request to have revisions. However, when I search on wikidoc.org for these article titles I get no results.

So even if we had a configurable limit of missing/broken articles, in this case we would likely exceed it anyways. I don't think mwoffliner can do much when the wiki in question is very broken.

@kelson42
Collaborator

kelson42 commented Oct 17, 2024

We should definitely not skip articles without a revision.

At this stage we should understand why there is no revid.

If this is a feature, then we will have to handle it. AFAIK, from a technical POV, we could make all requests without giving a revid, and the latest version would then be used. Again, this has to be confirmed.

If this is somehow a bug, we should stop the scraping process properly with a proper error.

@benoit74
Contributor Author

Looking at https://www.mediawiki.org/wiki/Manual:RevisionDelete, it seems totally possible to completely hide/remove all revisions of a page. To be tested on a mediawiki instance to confirm of course.

@audiodude
Member

we could make all requests without giving a revid... and then it will take latest version. Again, this has to be confirmed.

We do not currently use the revision ID in the request. We do not extract it from the article list scrape either. The fact that it's missing is a symptom of a more fundamental problem, probably the one that @benoit74 pointed out.

I assume the practical reason is that they have articles that they don't want to be public facing, but they don't want to delete either.

See this URL:

https://www.wikidoc.org/api.php?action=visualeditor&mobileformat=html&format=json&paction=parse&formatversion=2&page=The_Living_Guidelines%3A_UA%2FNSTEMI_Recomendations_for_CABG_View_the_Current_CLASS_IIa_Guidelines

The error is:

{
  error: {
    code: "nosuchrevid",
    info: "No current revision of title The Living Guidelines: UA/NSTEMI Recomendations for CABG View the Current CLASS IIa Guidelines.",
    docref: "See https://www.wikidoc.org/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/&gt; for notice of API deprecations and breaking changes."
  }
}
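For illustration, the two API error codes seen in this thread could be told apart as follows. This is only a sketch: the function names are hypothetical, and the mapping of `nosuchrevid` to "skip" and of the rate-limit error to "retry" is an assumption about one possible policy, not a decision made in this issue.

```typescript
type ApiErrorAction = "skip-article" | "retry-later" | "fail";

// Extracts error.code from a parsed Action API response body, if present.
function getApiErrorCode(body: unknown): string | undefined {
  if (typeof body !== "object" || body === null) {
    return undefined;
  }
  const error = (body as { error?: unknown }).error;
  if (typeof error !== "object" || error === null) {
    return undefined;
  }
  const code = (error as { code?: unknown }).code;
  return typeof code === "string" ? code : undefined;
}

// Hypothetical policy: returns undefined when the body carries no error.
function actionForApiError(body: unknown): ApiErrorAction | undefined {
  const code = getApiErrorCode(body);
  if (code === undefined) {
    return undefined;
  }
  switch (code) {
    case "nosuchrevid":
      return "skip-article"; // assumption: page has no current revision
    case "parsoid-stash-rate-limit-error":
      return "retry-later"; // assumption: the message says to try again later
    default:
      return "fail";
  }
}

console.log(actionForApiError({ error: { code: "nosuchrevid", info: "No current revision" } }));
```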

@audiodude
Member

So with that said, I believe we should skip articles with no revisions. They have no public facing pages and are not a tangible part of the wiki. They are essentially "hidden".

@benoit74
Contributor Author

So with that said, I believe we should skip articles with no revisions. They have no public facing pages and are not a tangible part of the wiki. They are essentially "hidden".

This makes sense to me, and it is not that different from a page returning a 404.

The more complex question is "what should we do when we encounter a link to a page with no revid?". But this could be tracked in a distinct issue, and it is maybe even already handled by the scraper.

@kelson42
Collaborator

I'm a bit surprised that revid is not used at all; that is hard for me to believe.

The reason why the revid should not be totally ignored is that ultimately I want to be able to deal with it, see #982 or #2072 for example.

Here we need to confirm whether this is the consequence of manually deleting a revision. I have some doubts about that.

My question would be: why, when we retrieve the whole list of article titles of the wiki, do we get these articles listed although they are not available? If we are facing some kind of feature here, we should probably fix the problem there. If at this stage MediaWiki is not able to deliver a revid for an article to scrape later, then we should maybe skip it... but we clearly need to understand why this can happen.

@audiodude
Member

So someone (probably me) has to spin up a MediaWiki instance, install the plugin for hiding revisions, and confirm that the JSON I posted above is what is returned in that case?

@audiodude
Member

So I did it, I installed a local MediaWiki instance and enabled revision deletion. I tried to delete all revisions of a page and got this:

Image

So, as we expected, the MediaWiki software requires every article to have at least one revision. These wikidoc pages have been altered in some other way.

@kelson42
Collaborator

@audiodude Thx for the effort, even if I'm not surprised about the conclusion. I should really have a look IMHO.

@kelson42 kelson42 added this to the 1.15.0 milestone Jan 6, 2025