Problems with Zimit ZIM file format and URLs #86

Closed
Jaifroid opened this issue Apr 29, 2022 · 19 comments
Labels
question Further information is requested

Comments

@Jaifroid

Jaifroid commented Apr 29, 2022

While working on experimental support for Zimit archives in Kiwix JS Windows/Linux, I have noticed a few issues with the Zimit ZIM file format that seem to break the OpenZIM specification. I list them here, but will add to this issue as I find any other issues:

  • Some URLs are stored percent-encoded in the ZIM. An example: in the ready.gov ZIM, a number of images are stored in the ZIM with a URL like 'A/www.ready.gov/sites/default/files/Week%202%20-%20plan.jpg'. The reader decodes such URLs before extracting the title, but this produces an incorrect result because the file title has been stored percent-encoded in the ZIM;
  • Furthermore, URLs are encoded using encodeURIComponent() instead of encodeURI(), with the exception of forward slashes (and any querystring). This means that commas in filenames are (incorrectly) encoded, whereas commas are an accepted character in the URL schema;
  • In most ZIMs the querystring is unencoded (as per spec) as part of the dirEntry.url, and in others (e.g. bouquineux.com_fr_all_2022-02.zim) it is URIencoded. The reader has to try first unencoded and then encoded;
  • These ZIMs store all assets misleadingly under the A namespace, or C/A/ in the case of Type 1 ZIMs (see Location of WARC assets in Type 1 (no-namespace) ZIMs #99);
  • Some URLs stored in dirEntries in the ZIM are extremely long, >4KB, which is larger than the de facto limit of ~2KB identified in https://stackoverflow.com/questions/417142/what-is-the-maximum-length-of-a-url-in-different-browsers. This may cause issues in some clients/server software/browsers. See Extremely long dirEntry URLs (>4KB) produced by warc2zim #102 for more info.
  • Against the ZIM spec, textual assets (css, js, html) are not guaranteed to be UTF-8 encoded. See Incorrect charset encoding indicated in meta tag of some warc2zim archives #88.
  • Many URLs are kept as absolute URLs ('https://'), from many different domains, some of which are represented in the ZIM (after removing the 'https://'), and some of which are not, leading the reader to attempt to load external content. The latter is usually blocked by CSP, causing cross-origin errors;
  • Some URLs begin '//some.domain.com' (two forward slashes) - this appears to be a workaround for the fact that these URLs are stored in the ZIM under a different domain from the main page's domain...;
  • Some images are stored in the ZIM under a url that includes the query string (URI component), e.g. A/media.cheatography.com/storage/thumb/davechild_regular-expressions.750.jpg?last=1584011681;
  • I've also noticed a case of a <meta http-equiv ....> redirect to an absolute domain stored in the ZIM: ideally the final redirected file should be stored, not the redirect page itself (example: "Make a plan" link on front page of ready.gov ZIM);
  • OpenZIM bans absolute URLs: they should all be relative;
  • MIME type of html pages is often text/html;raw=true instead of text/html - can be tested for, but any reader that tests /^text\/html$/ in the backend will fail;
  • Every html file has custom in-page JavaScript that redirects the browser under certain conditions and assumptions which are not generic (it is customized for the Android app) -- other readers have to work around this;
  • On some landing pages that are located at the top level of a domain (i.e. not in a directory), there are relative links that begin ../, which are technically incorrect because they would cause the browser to navigate outside of the domain under which links are stored - see Incorrect relative URLs on top-level landing pages zimit#155.
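
The "try first unencoded and then encoded" lookup described in the list can be sketched as follows. This is only an illustration, not code from any reader: `candidateZimUrls`, `findEntry` and the `lookup` callback are hypothetical names, the input is assumed to be an already-decoded path, and the exact form of an "encoded" query string (here `encodeURIComponent` applied to the whole `?...` part) is an assumption.

```javascript
// Hypothetical helper: build the dirEntry URLs a reader could try, in order,
// for a decoded request such as
// 'A/www.ready.gov/sites/default/files/Week 2 - plan.jpg'.
function candidateZimUrls(requested) {
  // Split off the query string (if any) so it can be handled separately.
  const queryIndex = requested.indexOf('?');
  const path = queryIndex === -1 ? requested : requested.slice(0, queryIndex);
  const query = queryIndex === -1 ? '' : requested.slice(queryIndex);

  // Encode each path segment with encodeURIComponent but keep the slashes,
  // which mirrors the behaviour described in the list above.
  const encodedPath = path.split('/').map(encodeURIComponent).join('/');

  const candidates = [
    requested,                               // 1. as-is (unencoded)
    encodedPath + query,                     // 2. encoded path, raw query string
    encodedPath + encodeURIComponent(query), // 3. encoded path, encoded query (assumed form)
  ];
  // Remove duplicates while preserving order.
  return [...new Set(candidates)];
}

// Try each candidate against a lookup function supplied by the reader.
// `lookup` is an assumption: it returns the entry, or null/undefined if absent.
function findEntry(requested, lookup) {
  for (const url of candidateZimUrls(requested)) {
    const entry = lookup(url);
    if (entry) return entry;
  }
  return null;
}

// The comma difference mentioned above: encodeURI leaves ',' alone,
// encodeURIComponent escapes it.
console.log(encodeURI('a,b.jpg'), encodeURIComponent('a,b.jpg'));
```

Trying the unencoded form first keeps spec-compliant ZIMs on the fast path; the extra candidates only cost additional lookups when the first one misses.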

It seems to me that the above are significant challenges for universal reader support for these ZIMs. I understand that these ZIMs have an in-built Service Worker that translates URLs, though it will never be possible to trap absolute URLs to external domains in an internal Service Worker, even if this can be made to run in all readers.

Please don't misunderstand: the above is not intended as a criticism. The Zimit format is a very good and interesting effort to make it possible for users to create their own ZIMs. This is in the spirit of improving the format, so that other readers can access content in these ZIMs.

@rgaudin rgaudin transferred this issue from openzim/zimit Apr 29, 2022
@rgaudin
Member

rgaudin commented Apr 29, 2022

Thank you @Jaifroid; I believe we are aware of each of those, but writing them all down together was much needed, especially as we are in the process of evaluating the strategy for the future of this “generic ZIM creator”. There are additional challenges besides the format-related ones that you mentioned.

The topic is on the table, but the table is very long, so don't expect a timely answer here.

@Jaifroid
Author

OK, thanks @rgaudin. I'll update the top post with any other issues I notice. I don't expect any quick answers, but it could be good to hear about any future changes that are envisaged, as I now have "passable" (experimental) support for this format.

@mossroy has suggested a teleconference call (maybe during the Hackathon, which I can't be at in person) to discuss this format, its strategic value within Kiwix, its place in the road map...

@rgaudin
Member

rgaudin commented Apr 29, 2022

> @mossroy has suggested a teleconference call (maybe during the Hackathon, which I can't be at in person) to discuss this format, its strategic value within Kiwix, its place in the road map...

Definitely

@Jaifroid
Author

Jaifroid commented Apr 30, 2022

Another very odd one (added to list above):

image

Here we have an image asset where the ZIM url includes the query string (URI component)! Searching for the asset without the query string returns a not-found error.

PS I can understand why this might be necessary for a web "recorder", in case changing the component would actually change the returned asset server-side, but it could be a problem for some ZIM readers if they do any processing on the asset url.
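
A reader that normally strips the query string could guard against this case along these lines. This is a minimal sketch under assumptions: `findAsset` is a hypothetical name, and `entries` is a `Map` standing in for the reader's real directory lookup.

```javascript
// Look an asset up with its query string first, then without it.
// `entries` is a stand-in (assumption) for the reader's directory lookup.
function findAsset(url, entries) {
  // 1. Exact match, query string included, as stored by the web recorder.
  if (entries.has(url)) return entries.get(url);

  // 2. Fall back to the URL with the query string removed.
  const queryIndex = url.indexOf('?');
  if (queryIndex !== -1) {
    const bareUrl = url.slice(0, queryIndex);
    if (entries.has(bareUrl)) return entries.get(bareUrl);
  }
  return null;
}
```

Checking the full URL first matters because, as noted above, a different query string may legitimately correspond to a different asset.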

@Jaifroid
Author

Added another gotcha: MIME type of HTML files is set to text/html;raw=true instead of text/html.
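
A MIME type test that tolerates parameters such as ;raw=true might look like the sketch below. `isHtmlMimeType` is a hypothetical helper name, not an existing reader function.

```javascript
// Compare only the part of the MIME type before any ';' parameters.
function isHtmlMimeType(mimeType) {
  const essence = String(mimeType).split(';')[0].trim().toLowerCase();
  return essence === 'text/html';
}

// The strict pattern fails on the Zimit value; the helper accepts it.
console.log(/^text\/html$/.test('text/html;raw=true')); // false
console.log(isHtmlMimeType('text/html;raw=true'));      // true
```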

@kelson42
Contributor

kelson42 commented Apr 30, 2022

@Jaifroid Thank you very much for these bug reports. I was (positively) surprised to see you working on this now and to see that you are getting encouraging results. Unfortunately, we are currently so busy with other problems that we cannot support you right now as you deserve. Sorry about that. But I expect that later this year, the bug reports you have made here so far will be really valuable.

I have also added an entry to the hackathon wiki page so that we can discuss this together in a month: https://wiki.kiwix.org/wiki/Hackathon_Spring_2022

Unfortunately, such a ticket with a bunch of different things is difficult to handle. We should really stick to the rule: one problem, one ticket. Maybe it would be useful to create a tag kiwix-js to gather them somehow? But would you please be able to do so, and for each one give the precise details: ZIM file (url), article, broken HTML, etc., so we can find and reproduce the bug without searching too much?

@Jaifroid
Author

@kelson42 No problem! This issue is really a way to list the problems I came across while "reverse engineering" the Zimit format. There are a number of irregularities that I think affect any reader software trying to support this format, not just Kiwix JS. I think it's useful to have a list in one place, as @rgaudin suggests, but I do agree that I / we should spin off separate issues as and when they can be discussed in more detail or worked on in the future.

It might be better to spin these off once we've had our discussion about what kind of support you want to see in readers?

@kelson42
Contributor

@Jaifroid No fundamental problem with having/keeping this ticket to bring them all together (even if I prefer a label). But then, let's create other tickets and put only titles/ticket numbers in a bullet list. I would recommend creating the tickets right now, while you have all the details in your head. I'm not sure this will still be the case in a month (but otherwise no problem doing that work in a month).

@ikreymer
Collaborator

@Jaifroid, sorry, I also don't have much time at the moment; I just wanted to add that these are all things that were done to support storing more generic web archives in ZIM. This means storing almost full URLs, with query arguments, etc.
The intent was to map what WARC files can support to ZIM, hence the slight irregularities. Perhaps it is worth revisiting at some point in the future.

@Jaifroid
Author

@ikreymer Thanks for the explanation. It's useful to know. I think client software will need to take these things into account. Maybe it is more a question of documentation, then, than "problems" with the format. To be discussed!

@kelson42
Contributor

@Jaifroid @benoit74 What is the status of this ticket? It now seems pretty outdated to me.

@benoit74
Collaborator

benoit74 commented Apr 18, 2024

  • Some URLs are stored percent-encoded in the ZIM. An example: in the ready.gov ZIM, a number of images are stored in the ZIM with a URL like 'A/www.ready.gov/sites/default/files/Week%202%20-%20plan.jpg'. The reader decodes such URLs before extracting the title, but this produces an incorrect result because the file title has been stored percent-encoded in the ZIM;
  • Furthermore, URLs are encoded using encodeURIComponent() instead of encodeURI(), with the exception of forward slashes (and any querystring). This means that commas in filenames are (incorrectly) encoded, whereas commas are an accepted character in the URL schema;
  • In most ZIMs the querystring is unencoded (as per spec) as part of the dirEntry.url, and in others (e.g. bouquineux.com_fr_all_2022-02.zim) it is URIencoded. The reader has to try first unencoded and then encoded;
  • These ZIMs store all assets misleadingly under the A namespace, or C/A/ in the case of Type 1 ZIMs (see Location of WARC assets in Type 1 (no-namespace) ZIMs #99);
    • ✅We do not use namespaces anymore
  • Some URLs stored in dirEntries in the ZIM are extremely long, >4KB, which is larger than the de facto limit of ~2KB identified in https://stackoverflow.com/questions/417142/what-is-the-maximum-length-of-a-url-in-different-browsers. This may cause issues in some clients/server software/browsers. See Extremely long dirEntry URLs (>4KB) produced by warc2zim #102 for more info.
    • 👷This is still an issue, nothing has been done so far
  • Against the ZIM spec, textual assets (css, js, html) are not guaranteed to be UTF-8 encoded. See Incorrect charset encoding indicated in meta tag of some warc2zim archives #88.
  • Many URLs are kept as absolute URLs ('https://'), from many different domains, some of which are represented in the ZIM (after removing the 'https://'), and some of which are not, leading the reader to attempt to load external content. The latter is usually blocked by CSP, causing cross-origin errors;
    • 🤷🏻‍♂️Internal URLs are all relative now; we still have many absolute URLs for everything that is external
  • Some URLs begin '//some.domain.com' (two forward slashes) - this appears to be a workaround for the fact that these URLs are stored in the ZIM under a different domain from the main page's domain...;
    • 🤷🏻‍♂️If the URL is internal, this won't be the case anymore; if it is external and was written as //some.domain.com in the original content, it will be kept as-is; if the original content didn't specify the scheme, it means that both http and https are supported by the upstream, so I do not expect any more work to be needed on this one. Is using http a problem for PWA / Kiwix JS?
  • Some images are stored in the ZIM under a url that includes the query string (URI component), e.g. A/media.cheatography.com/storage/thumb/davechild_regular-expressions.750.jpg?last=1584011681;
    • 🤷🏻‍♂️Still the case, but why is it an issue?
  • I've also noticed a case of a <meta http-equiv ....> redirect to an absolute domain stored in the ZIM: ideally the final redirected file should be stored, not the redirect page itself (example: "Make a plan" link on front page of ready.gov ZIM);
    OpenZIM bans absolute URLs, they should all be relative;
    • 🤷🏻‍♂️will have to be checked again
  • MIME type of html pages is often text/html;raw=true instead of text/html - can be tested for, but any reader that tests /^text\/html$/ in the backend will fail;
    • 🤷🏻‍♂️we use the MIME type as advertised by upstream webserver
  • Every html file has custom in-page JavaScript that redirects the browser under certain conditions and assumptions which are not generic (it is customized for the Android app) -- other readers have to work around this;
    • 🤷🏻‍♂️I don't get what we are speaking about here
  • On some landing pages that are located at the top level of a domain (i.e. not in a directory), there are relative links that begin ../, which are technically incorrect because they would cause the browser to navigate outside of the domain under which links are stored - see Incorrect relative URLs on top-level landing pages zimit#155.
    • ✅This is fixed in zimit2
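
The internal/external distinction above, including protocol-relative ('//host/...') links and '../' links on top-level pages, can be illustrated with the standard URL constructor. This is a sketch only: `resolveLink`, the `knownHosts` set and the 'host/path' entry layout are assumptions made for the example, not the actual zimit or warc2zim rewriting logic.

```javascript
// Map a link found in a page to a ZIM path, or report it as external.
// `pageUrl` is the original URL of the page; `knownHosts` is a Set of hosts
// assumed to be present in the ZIM. The 'host/path' layout is an assumption.
function resolveLink(href, pageUrl, knownHosts) {
  // The URL constructor resolves absolute, protocol-relative and relative
  // references; '../' segments that would climb above the root are dropped.
  const resolved = new URL(href, pageUrl);

  if (!knownHosts.has(resolved.host)) {
    // Not in the ZIM: loading it would be an external request.
    return { external: true, url: resolved.href };
  }
  // In the ZIM: drop the scheme, keep host, path and query string.
  return {
    external: false,
    zimPath: resolved.host + resolved.pathname + resolved.search,
  };
}

const hosts = new Set(['www.ready.gov', 'media.cheatography.com']);
console.log(resolveLink('../sites/default/files/a.css', 'https://www.ready.gov/', hosts));
console.log(resolveLink('//media.cheatography.com/storage/thumb/x.jpg?last=1', 'https://www.ready.gov/', hosts));
```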

@rgaudin
Member

rgaudin commented Apr 18, 2024

> MIME type of html pages is often text/html;raw=true instead of text/html - can be tested for, but any reader that tests /^text\/html$/ in the backend will fail;

> 🤷🏻‍♂️we use the MIME type as advertised by upstream webserver

This was a Zimit1 workaround for the SW. Gone.

> Every html file has custom in-page JavaScript that redirects the browser under certain conditions and assumptions which are not generic (it is customized for the Android app) -- other readers have to work around this;

> 🤷🏻‍♂️I don't get what we are speaking about here

Same as above.

@rgaudin
Member

rgaudin commented Apr 18, 2024

I believe this can be closed now.

@benoit74
Collaborator

From my PoV, the only thing left to be checked (but we can transfer this to another issue indeed) is the stuff about the <meta http-equiv ....> redirect to an absolute domain stored in the ZIM

@benoit74
Collaborator

In fact, I also don't get what exactly the point about meta http-equiv was; this is not supposed to contain a URL/redirect AFAIK: https://www.w3schools.com/tags/att_meta_http_equiv.asp

@Jaifroid
Author

Thanks for going over the details @benoit74. Just to clarify that this list was made with regard to Zimit1, a long time before Zimit2 was a thing. So some of the issues refer to the way Zimit1 introduced kludges and workarounds. That's the case with regard to the custom JavaScript on every landing page of Zimit1 ZIMs: this was a workaround for Android, a way of bootstrapping the loading of the Replay Service Worker. There was also on every HTML page a snippet of JS that checked if the page had been invoked directly, without loading the Replay Worker, and redirected to the bootstrap code if necessary.

meta-http-equiv redirect/refresh is definitely used for redirect -- it is/was a very common way of doing redirects without JS (see https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta#http-equiv). However, any URL used as the redirect target would surely be rewritten by the Python scripts in Zimit2.
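
For illustration, the redirect target sits in the content attribute of a tag such as <meta http-equiv="refresh" content="0; url=...">. A simplified way to extract and resolve it is sketched below; `parseRefreshTarget` and `resolveRefreshTarget` are hypothetical names, the pattern is a simplification of what browsers accept, and the example.org URLs are made up.

```javascript
// Extract the redirect target from the content attribute of a
// <meta http-equiv="refresh"> tag, e.g. '0; url=https://example.org/plan'.
// Simplified pattern (assumption): <delay> [;|,] [url=] <target>, with
// optional quotes around the target. Returns null if there is no target.
function parseRefreshTarget(content) {
  const pattern = /^\s*\d+(?:\.\d*)?\s*[;,]\s*(?:url\s*=\s*)?(['"]?)(.*?)\1\s*$/i;
  const match = pattern.exec(content);
  return match && match[2] ? match[2] : null;
}

// Resolve the target against the page's original URL so that it can then be
// rewritten like any other link. Returns null when there is nothing to do.
function resolveRefreshTarget(content, pageUrl) {
  const target = parseRefreshTarget(content);
  return target === null ? null : new URL(target, pageUrl).href;
}
```

A refresh with only a delay (for example content="30") reloads the same page and carries no URL, which is why both functions return null in that case.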

The issue with image URLs being stored with query strings was, again, that some backends explicitly stripped the query string when looking up assets. This behaviour has been amended in affected backends for Zimit2.


In sum, I think all the issues mentioned above have now been addressed, or as @benoit74 mentioned, are in current PRs, or are known issues.

The known issues are:

  • Extremely long URLs: these are only an issue if any backends have hard-coded limits. Some older browsers also have hard-coded limits. There was an issue in Kiwix JS, now fixed, whereby a hard limit was hit. If this is an issue in any other backends, it would manifest when playing YouTube videos, so it is probably something to monitor but not worry about too much.
  • Non-UTF-8 character sets: we're aware of these and have other issues open.
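
Both known issues can be monitored with standard Web APIs. The following is a rough sketch with hypothetical helper names; the 2048-byte default is just the approximate figure quoted earlier in this thread, not a limit defined by any specification.

```javascript
// Byte length of a dirEntry URL when encoded as UTF-8.
function urlByteLength(url) {
  return new TextEncoder().encode(url).length;
}

// Flag URLs above a chosen limit (default: the ~2KB de facto figure, an
// assumption that each backend would need to tune).
function isOverlongUrl(url, limitBytes = 2048) {
  return urlByteLength(url) > limitBytes;
}

// True if `bytes` (a Uint8Array) decodes cleanly as UTF-8. With fatal: true,
// TextDecoder throws on malformed input instead of inserting U+FFFD.
function isValidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (error) {
    return false;
  }
}
```

Note that a successful strict decode only shows the bytes are well-formed UTF-8; it does not prove which charset the author intended.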

@Jaifroid
Author

Closing as resolved. Many thanks to everyone for the huge effort!

@benoit74
Collaborator

I finally understood the point about http-equiv and opened #237 to track this (it is not rewritten at all as of today)

5 participants