Problems with Zimit ZIM file format and URLs #86

Jaifroid · 2022-04-29T07:06:26Z

While working on experimental support for Zimit archives in Kiwix JS Windows/Linux, I have noticed a few issues with the Zimit ZIM file format that seem to break the OpenZIM specification. I list them here, but will add to this issue as I find any other issues:

Some URLs are stored percent-encoded in the ZIM. An example: in the ready.gov ZIM, a number of images are stored in the ZIM with a URL like 'A/www.ready.gov/sites/default/files/Week%202%20-%20plan.jpg'. The reader decodes such URLs before extracting the title, but this produces an incorrect result because the file title has been stored percent-encoded in the ZIM;
Furthermore, URLs are encoded using encodeURIComponent() instead of encodeURI(), with the exception of forward slashes (and any querystring). This means that commas in filenames are (incorrectly) encoded, whereas commas are an accepted character in the URL schema;
In most ZIMs the querystring is unencoded (as per spec) as part of the dirEntry.url, and in others (e.g. bouquineux.com_fr_all_2022-02.zim) it is URIencoded. The reader has to try first unencoded and then encoded;
These ZIMs store all assets misleadingly under the A namespace, or C/A/ in the case of Type 1 ZIMs (see Location of WARC assets in Type 1 (no-namespace) ZIMs #99);
Some URLs stored in dirEntries in the ZIM are extremely long, >4KB, which is larger than the de facto limit of ~2KB identified in https://stackoverflow.com/questions/417142/what-is-the-maximum-length-of-a-url-in-different-browsers. This may cause issues in some clients/server software/browsers. See Extremely long dirEntry URLs (>4KB) produced by warc2zim #102 for more info.
Against the ZIM spec, textual assets (css, js, html) are not guaranteed to be UTF-8 encoded. See Incorrect charset encoding indicated in meta tag of some warc2zim archives #88.
Many URLs are kept as absolute URLs ('https://'), from many different domains, some of which are represented in the ZIM (after removing the 'https://'), and some of which are not, leading the reader to attempt to load external content. The latter is usually blocked by CSP, causing cross-origin errors;
Some URLs begin '//some.domain.com' (two forward slashes) - this appears to be a workaround for the fact that these URLs are stored in the ZIM under a different domain from the main page's domain...;
Some images are stored in the ZIM under a url that includes the URI component, e.g. A/media.cheatography.com/storage/thumb/davechild_regular-expressions.750.jpg?last=1584011681;
I've also noticed a case of a <meta http-equiv ....> redirect to an absolute domain stored in the ZIM: ideally the final redirected file should be stored, not the redirect page itself (example: "Make a plan" link on front page of ready.gov ZIM);
OpenZIM bans absolute URLs, they should all be relative;
MIME type of html pages is often text/html;raw=true instead of text/html - can be tested for, but any reader that tests /^text\/html$/ in the backend will fail;
Every html file has custom in-page JavaScript that redirects the browser under certain conditions and assumptions which are not generic (it is customized for the Android app) -- other readers have to work around this;
On some landing pages that are located at the top level of a domain (i.e. not in a directory), there are relative links that begin ../, which are technically incorrect because they would cause the browser to navigate outside of the domain under which links are stored - see Incorrect relative URLs on top-level landing pages zimit#155.
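
To make the encoding items at the top of this list concrete, here is a minimal Python sketch. It approximates JavaScript's encodeURIComponent() and encodeURI() with urllib, and shows the kind of fallback lookup a reader ends up needing. The helper names, the `entries` dictionary and the sample file name are assumptions for illustration only, not code from any reader or from warc2zim.

```python
from urllib.parse import quote, unquote

# Punctuation that JavaScript's encodeURIComponent() leaves unescaped.
COMPONENT_SAFE = "-_.!~*'()"
# encodeURI() additionally leaves reserved URL characters (including "," and "/") alone.
URI_SAFE = COMPONENT_SAFE + ";/?:@&=+$,#"


def encode_uri_component(text):
    """Approximate JavaScript's encodeURIComponent() with urllib."""
    return quote(text, safe=COMPONENT_SAFE)


def encode_uri(text):
    """Approximate JavaScript's encodeURI() with urllib."""
    return quote(text, safe=URI_SAFE)


def find_entry(entries, url):
    """Try the URL as given, then decoded, then re-encoded (hypothetical helper)."""
    decoded = unquote(url)
    for candidate in (url, decoded, encode_uri(decoded)):
        if candidate in entries:
            return entries[candidate]
    return None


# Hypothetical file name containing a space and a comma.
name = "files/Week 2 - plan, final.jpg"
print(encode_uri_component(name))  # the comma and the slash are both escaped
print(encode_uri(name))            # the comma and the slash are preserved
```

The fallback order in `find_entry` is only one possible choice; the point is that a reader cannot know in advance which form was stored.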

It seems to me that the above are significant challenges for universal reader support for these ZIMs. I understand that these ZIMs have an in-built Service Worker that translates URLs, though it will never be possible to trap absolute URLs to external domains in an internal Service Worker, even if this can be made to run in all readers.

Please don't misunderstand: the above is not intended as a criticism. The Zimit format is a very good and interesting effort to make it possible for users to create their own ZIMs. This is in the spirit of improving the format, so that other readers can access content in these ZIMs.


rgaudin · 2022-04-29T08:37:32Z

Thank you @Jaifroid; I believe we are aware of each of those, but writing them all down together was much needed, especially as we are in the process of evaluating the strategy for the future of this “generic ZIM creator”. There are additional challenges besides the format-related ones you mentioned.

Topic's on the table but the table is very long so don't expect a timely answer here.

Jaifroid · 2022-04-29T09:16:06Z

OK, thanks @rgaudin. I'll update the top post with any other issues I notice. I don't expect any quick answers, but it could be good to hear about any future changes that are envisaged, as I now have "passable" (experimental) support for this format.

@mossroy has suggested a teleconference call (maybe during the Hackathon, which I can't be at in person) to discuss this format, its strategic value within Kiwix, its place in the road map...

rgaudin · 2022-04-29T09:17:30Z

> @mossroy has suggested a teleconference call (maybe during the Hackathon, which I can't be at in person) to discuss this format, its strategic value within Kiwix, its place in the road map...

Definitely

Jaifroid · 2022-04-30T09:33:44Z

Another very odd one (added to list above):

Here we have an image asset where the ZIM url includes the URI component! Searching for the asset without the component returns a not-found error.

PS I can understand why this might be necessary for a web "recorder", in case changing the component would actually change the returned asset server-side, but it could be a problem for some ZIM readers if they do any processing on the asset url.
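
A minimal sketch of why this matters for a reader, assuming a plain dictionary stands in for the ZIM directory (the `ENTRIES` mapping and both helper names are hypothetical):

```python
# Hypothetical stand-in for the ZIM directory: the key includes the query string.
ASSET_URL = (
    "A/media.cheatography.com/storage/thumb/"
    "davechild_regular-expressions.750.jpg?last=1584011681"
)
ENTRIES = {ASSET_URL: b"<jpeg bytes>"}


def lookup_exact(entries, url):
    """Look the URL up exactly as requested, query string included."""
    return entries.get(url)


def lookup_stripping_query(entries, url):
    """Look the URL up after dropping everything from the first '?'."""
    path_only = url.partition("?")[0]
    return entries.get(path_only)


print(lookup_exact(ENTRIES, ASSET_URL) is not None)  # found
print(lookup_stripping_query(ENTRIES, ASSET_URL))    # not found
```

Any backend that normalises the asset URL before the lookup would behave like the second helper and miss the entry.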

Jaifroid · 2022-04-30T14:10:44Z

Added another gotcha: MIME type of HTML files is set to text/html;raw=true instead of text/html.
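
As a rough illustration in Python (the function names are assumptions, not taken from any reader), a strict anchored test fails on the parameterised type, while comparing only the part before the first semicolon accepts both:

```python
import re

# Strict test: matches only the bare media type.
STRICT_HTML = re.compile(r"^text/html$")


def is_html_strict(mime_type):
    """Return True only for exactly 'text/html'."""
    return bool(STRICT_HTML.match(mime_type))


def is_html_tolerant(mime_type):
    """Ignore parameters such as ';raw=true' and compare the bare type."""
    essence = mime_type.split(";", 1)[0].strip().lower()
    return essence == "text/html"


for value in ("text/html", "text/html;raw=true"):
    print(value, is_html_strict(value), is_html_tolerant(value))
```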

kelson42 · 2022-04-30T14:54:46Z

@Jaifroid Thank you very much for these bug reports. I was (positively) surprised to see you working on this now and to see that you are getting encouraging results. Unfortunately, we are so busy with other problems at the moment that we cannot support you right now as you deserve. Sorry about that. But I expect that later this year, the bug reports you have made here so far will be really valuable.

I have also added an entry to the hackathon wiki page to discuss this together in a month: https://wiki.kiwix.org/wiki/Hackathon_Spring_2022

Unfortunately, such a ticket with a bunch of different things is difficult to handle. We should really stick to the rule: one problem, one ticket. Maybe it would be useful to create a tag kiwix-js to gather them somehow? But would you please be able to do so, and for each give the precise details: ZIM file (URL), article, broken HTML, etc., so we can find and reproduce the bug without searching too much?

Jaifroid · 2022-04-30T15:08:53Z

@kelson42 No problem! This issue is really a way to list the problems I came across while "reverse engineering" the Zimit format. There are a number of irregularities that I think affect any reader software trying to support this format, not just Kiwix JS. I think it's useful to have a list in one place, as @rgaudin suggests, but I do agree that I / we should spin off separate issues as and when they can be discussed in more detail or worked on in the future.

It might be better to spin these off once we've had our discussion about what kind of support you want to see in readers?

kelson42 · 2022-04-30T15:16:54Z

@Jaifroid No fundamental problem with keeping this ticket to bring them all together (even if I prefer a label). But then, let's create other tickets and put only titles/ticket numbers in a bullet list. I would recommend creating the tickets right now, while you have all the details in your head. Not sure this will still be the case in a month (but otherwise no problem to do that work in a month).

ikreymer · 2022-04-30T15:43:03Z

@Jaifroid, sorry, I also don't have much time at the moment. I just wanted to add that these are all things that were done to support storing more generic web archives in ZIM. This means storing almost full URLs, with query arguments, etc.
The intent was to map what WARC files can support to ZIM, hence the slight irregularities. Perhaps it is worth revisiting at a point in the future.

Jaifroid · 2022-04-30T15:45:38Z

@ikreymer Thanks for the explanation. It's useful to know. I think client software will need to take these things into account. Maybe it is more a question of documentation, then, than "problems" with the format. To be discussed!

kelson42 · 2024-04-18T05:47:08Z

@Jaifroid @benoit74 What is the status of this ticket? It seems pretty outdated to me by now.

benoit74 · 2024-04-18T06:27:40Z

Some URLs are stored percent-encoded in the ZIM. An example: in the ready.gov ZIM, a number of images are stored in the ZIM with a URL like 'A/www.ready.gov/sites/default/files/Week%202%20-%20plan.jpg'. The reader decodes such URLs before extracting the title, but this produces an incorrect result because the file title has been stored percent-encoded in the ZIM;
- ✅This is fixed in zimit2 especially Revisit handling of special characters in ZIM / HTML URLs #218
Furthermore, URLs are encoded using encodeURIComponent() instead of encodeURI(), with the exception of forward slashes (and any querystring). This means that commas in filenames are (incorrectly) encoded, whereas commas are an accepted character in the URL schema;
- ✅This is fixed in zimit2 especially Revisit handling of special characters in ZIM / HTML URLs #218
In most ZIMs the querystring is unencoded (as per spec) as part of the dirEntry.url, and in others (e.g. bouquineux.com_fr_all_2022-02.zim) it is URIencoded. The reader has to try first unencoded and then encoded;
- ✅This is fixed in zimit2 especially Revisit handling of special characters in ZIM / HTML URLs #218
These ZIMs store all assets misleadingly under the A namespace, or C/A/ in the case of Type 1 ZIMs (see Location of WARC assets in Type 1 (no-namespace) ZIMs #99);
- ✅We do not use namespaces anymore
Some URLs stored in dirEntries in the ZIM are extremely long, >4KB, which is larger than the de facto limit of ~2KB identified in https://stackoverflow.com/questions/417142/what-is-the-maximum-length-of-a-url-in-different-browsers. This may cause issues in some clients/server software/browsers. See Extremely long dirEntry URLs (>4KB) produced by warc2zim #102 for more info.
- 👷This is still an issue, nothing has been done so far
Against the ZIM spec, textual assets (css, js, html) are not guaranteed to be UTF-8 encoded. See Incorrect charset encoding indicated in meta tag of some warc2zim archives #88.
- 👷More work needed, see Failure of iranwire.com: impossible to decode content #188 and probably Zimit2: another encoding problem with solidarité-numérique #221
Many URLs are kept as absolute URLs ('https://'), from many different domains, some of which are represented in the ZIM (after removing the 'https://'), and some of which are not, leading the reader to attempt to load external content. The latter is usually blocked by CSP, causing cross-origin errors;
- 🤷🏻‍♂️Internal URLs are all relative now ; we still have many absolute URLs for everything which is external
Some URLs begin '//some.domain.com' (two forward slashes) - this appears to be a workaround for the fact that these URLs are stored in the ZIM under a different domain from the main page's domain...;
- 🤷🏻‍♂️If the URL is internal, this won't be the case anymore; if it is external and was written as //some.domain.com in the original content, it will be kept as-is. If the original content didn't specify the scheme, it means that both http and https are supported by the upstream, so I do not expect any more work to be needed on this one. Is using http a problem for PWA / Kiwix JS?
Some images are stored in the ZIM under a url that includes the URI component, e.g. A/media.cheatography.com/storage/thumb/davechild_regular-expressions.750.jpg?last=1584011681;
- 🤷🏻‍♂️Still the case, but why is it an issue?
I've also noticed a case of a <meta http-equiv ....> redirect to an absolute domain stored in the ZIM: ideally the final redirected file should be stored, not the redirect page itself (example: "Make a plan" link on front page of ready.gov ZIM);
OpenZIM bans absolute URLs, they should all be relative;
- 🤷🏻‍♂️will have to be checked again
MIME type of html pages is often text/html;raw=true instead of text/html - can be tested for, but any reader that tests /^text\/html$/ in the backend will fail;
- 🤷🏻‍♂️we use the MIME type as advertised by upstream webserver
Every html file has custom in-page JavaScript that redirects the browser under certain conditions and assumptions which are not generic (it is customized for the Android app) -- other readers have to work around this;
- 🤷🏻‍♂️I don't get what we are speaking about here
On some landing pages that are located at the top level of a domain (i.e. not in a directory), there are relative links that begin ../, which are technically incorrect because they would cause the browser to navigate outside of the domain under which links are stored - see Incorrect relative URLs on top-level landing pages zimit#155.
- ✅This is fixed in zimit2
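
For readers unfamiliar with that last item, a small sketch of the Zimit1 behaviour it describes: resolving a ../ link from a page stored directly under its domain directory lands outside that directory. The page path, the link and the helper name are hypothetical examples, and posixpath is used here only as a stand-in for a reader's URL resolution.

```python
import posixpath


def resolve_in_zim(page_path, href):
    """Resolve a relative link against the ZIM path of the page holding it."""
    base_dir = posixpath.dirname(page_path)
    return posixpath.normpath(posixpath.join(base_dir, href))


# Hypothetical landing page stored at the top level of its domain directory.
page = "A/www.ready.gov/index.html"
resolved = resolve_in_zim(page, "../style.css")

print(resolved)
# The resolved path no longer sits under the domain's directory.
print(resolved.startswith("A/www.ready.gov/"))
```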

rgaudin · 2024-04-18T08:36:40Z

> MIME type of html pages is often text/html;raw=true instead of text/html - can be tested for, but any reader that tests /^text\/html$/ in the backend will fail;
>
> 🤷🏻‍♂️we use the MIME type as advertised by upstream webserver

This was a Zimit1 workaround for the SW. Gone.

> Every html file has custom in-page JavaScript that redirects the browser under certain conditions and assumptions which are not generic (it is customized for the Android app) -- other readers have to work around this;
>
> 🤷🏻‍♂️I don't get what we are speaking about here

Idem

rgaudin · 2024-04-18T08:37:09Z

I believe this can be closed now.

benoit74 · 2024-04-18T09:02:54Z

From my PoV, the only thing left to be checked (but we can transfer this to another issue indeed) is the stuff about the <meta http-equiv ....> redirect to an absolute domain stored in the ZIM

benoit74 · 2024-04-18T09:28:44Z

In fact, I also don't get what was exactly the point of meta http-equiv ; this is not supposed to contain a URL/redirect AFAIK: https://www.w3schools.com/tags/att_meta_http_equiv.asp

Jaifroid · 2024-04-18T13:34:56Z

Thanks for going over the details @benoit74. Just to clarify that this list was made with regard to Zimit1, a long time before Zimit2 was a thing. So some of the issues refer to the way Zimit1 introduced kludges and workarounds. That's the case with regard to the custom JavaScript on every landing page of Zimit1 ZIMs: this was a workaround for Android, a way of bootstrapping the loading of the Replay Service Worker. There was also on every HTML page a snippet of JS that checked if the page had been invoked directly, without loading the Replay Worker, and redirected to the bootstrap code if necessary.

meta-http-equiv redirect/refresh is definitely used for redirect -- it is/was a very common way of doing redirects without JS (see https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta#http-equiv). However, any URL used as the redirect target would surely be rewritten by the Python scripts in Zimit2.
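
For anyone wanting to check an archive for this, here is a minimal sketch using Python's standard html.parser to pull the target out of a meta refresh tag. The class and function names and the sample markup are assumptions for illustration; this is not how Zimit2 or any reader handles it.

```python
import re
from html.parser import HTMLParser

# Matches the "url=..." part of a refresh value such as "0; url=https://example.org/".
REFRESH_URL = re.compile(r"url\s*=\s*['\"]?([^'\"]+)", re.IGNORECASE)


class MetaRefreshFinder(HTMLParser):
    """Record the target of the first <meta http-equiv="refresh"> tag seen."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "refresh":
            return
        match = REFRESH_URL.search(attributes.get("content") or "")
        if match:
            self.target = match.group(1).strip()


def find_meta_refresh(html):
    """Return the redirect target of a meta refresh, or None if there is none."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.target


def is_absolute(url):
    """True for targets that point outside the archive's relative URL space."""
    return url.startswith(("http://", "https://", "//"))


sample = '<head><meta http-equiv="refresh" content="0; url=https://www.example.org/plan"></head>'
target = find_meta_refresh(sample)
print(target, is_absolute(target))
```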

The question about image URLs being stored with query strings was again that some backends explicitly stripped the query string when looking for assets. This behaviour has been amended in affected backends for Zimit2.

In sum, I think all the issues mentioned above have now been addressed, or as @benoit74 mentioned, are in current PRs, or are known issues.

The known issues are:

Extremely long URLs: these are only an issue if any backends have hard-coded limits. Some older browsers also have hard-coded limits. There was an issue in Kiwix JS, now fixed, whereby a hard limit was hit. If this is an issue in any other backends, it would manifest when playing YouTube videos, so it is probably something to monitor but not worry about too much.
Non-UTF-8 character sets: we're aware of these and have other issues open.
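
For monitoring the long-URL issue, a trivial sketch that flags entry URLs above a chosen byte limit. The 2048-byte default reflects the approximate de facto limit mentioned earlier in this issue and is an assumption, as is the helper name.

```python
# Assumed threshold: the approximate 2 KB de facto browser limit cited in this issue.
DEFAULT_LIMIT_BYTES = 2048


def urls_over_limit(urls, limit=DEFAULT_LIMIT_BYTES):
    """Return the URLs whose UTF-8 encoded length exceeds the limit."""
    return [url for url in urls if len(url.encode("utf-8")) > limit]


# Hypothetical entry URLs: one short, one longer than 4 KB.
sample_urls = ["A/example.org/index.html", "A/example.org/watch?" + "x" * 5000]
for url in urls_over_limit(sample_urls):
    print(len(url), url[:40] + "...")
```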

Jaifroid · 2024-04-18T13:37:36Z

Closing as resolved. Many thanks to everyone for the huge effort!

benoit74 · 2024-04-18T14:17:30Z

I finally got it about http-equiv and opened #237 to track this (it is not rewritten at all as of today)

rgaudin transferred this issue from openzim/zimit Apr 29, 2022

Jaifroid mentioned this issue Apr 30, 2022

Some images in ready.gov ZIM are showing in jQuery mode, but not in SW mode kiwix/kiwix-js-pwa#254

Closed

Jaifroid mentioned this issue Aug 1, 2022

Extremely long dirEntry URLs (>4KB) produced by warc2zim #102

Closed

Jaifroid mentioned this issue Aug 26, 2022

Detect Zimit archive type and warn user kiwix/kiwix-js#885

Merged

kelson42 added the question Further information is requested label Feb 1, 2023

Jaifroid mentioned this issue Feb 1, 2023

Incorrect charset encoding indicated in meta tag of some warc2zim archives #88

Closed

kelson42 added this to the 1.6.0 milestone Apr 24, 2023

Jaifroid mentioned this issue Mar 6, 2024

Zimit2: Fix URL encoding of ZIM items #206

Closed

kelson42 assigned benoit74 Apr 18, 2024

Jaifroid closed this as completed Apr 18, 2024
