Problems with Zimit ZIM file format and URLs #86
Thank you @Jaifroid; I believe we are aware of each of those, but writing them all down together was much needed, especially as we are in the process of evaluating the strategy for the future of this “generic ZIM creator”. There are additional challenges besides the format-related ones that you mentioned. The topic is on the table, but the table is very long, so don't expect a timely answer here.
OK, thanks @rgaudin. I'll update the top post with any other issues I notice. I don't expect any quick answers, but it could be good to hear about any future changes that are envisaged, as I now have "passable" (experimental) support for this format. @mossroy has suggested a teleconference call (maybe during the Hackathon, which I can't be at in person) to discuss this format, its strategic value within Kiwix, its place in the road map... |
Definitely |
Another very odd one (added to list above): Here we have an image asset where the ZIM url includes the URI component! Searching for the asset without the component returns a not-found error. PS I can understand why this might be necessary for a web "recorder", in case changing the component would actually change the returned asset server-side, but it could be a problem for some ZIM readers if they do any processing on the asset url. |
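A minimal sketch of how a reader might cope with this, assuming the "URI component" above is the query string: look the asset up with its query string first, and only then fall back to the bare path. `candidateAssetUrls`, `findAsset` and the `lookup` callback are hypothetical names for illustration, not part of any Kiwix or Zimit API.

```javascript
// Hypothetical helper: list the URLs a reader could try, most specific first.
function candidateAssetUrls(requestedUrl) {
  const candidates = [requestedUrl];
  const queryStart = requestedUrl.indexOf('?');
  if (queryStart !== -1) {
    // Fallback for archives that store the asset without its query string.
    candidates.push(requestedUrl.slice(0, queryStart));
  }
  return candidates;
}

// `lookup` stands in for whatever function the reader uses to find a
// directory entry by URL; it is assumed to return a falsy value on a miss.
function findAsset(requestedUrl, lookup) {
  for (const url of candidateAssetUrls(requestedUrl)) {
    const entry = lookup(url);
    if (entry) return entry;
  }
  return null;
}

// Usage with an in-memory Map standing in for the ZIM directory.
const fakeZim = new Map([
  ['A/media.cheatography.com/storage/thumb/davechild_regular-expressions.750.jpg?last=1584011681', 'image-entry'],
]);
const found = findAsset(
  'A/media.cheatography.com/storage/thumb/davechild_regular-expressions.750.jpg?last=1584011681',
  (url) => fakeZim.get(url)
);
console.log(found);
```

The important design point is the order: a reader that strips the query string before the first lookup would miss the entry entirely.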
Added another gotcha: MIME type of HTML files is set to |
@Jaifroid Thank you very much for these bug reports. I was (positively) surprised to see you working on this now and to see that you are getting encouraging results. Unfortunately, we are so busy with other problems at the moment that we cannot support you right now as you deserve. Sorry about that. But I expect that later this year, the bug reports you have made here so far will be really valuable. I have also added an entry to the hackathon wiki page to discuss this together in a month: https://wiki.kiwix.org/wiki/Hackathon_Spring_2022 Unfortunately, such a ticket with a bunch of different things is difficult to handle. We should really stick to the rule: one problem, one ticket. Maybe it would be useful to create a tag
@kelson42 No problem! This issue is really a way to list the problems I came across while "reverse engineering" the Zimit format. There are a number of irregularities that I think affect any reader software trying to support this format, not just Kiwix JS. I think it's useful to have a list in one place, as @rgaudin suggests, but I do agree that I / we should spin off separate issues as and when they can be discussed in more detail or worked on in the future. It might be better to spin these off once we've had our discussion about what kind of support you want to see in readers? |
@Jaifroid No fundamental problem with having/keeping this ticket to bring them all together (even if I prefer a label). But then, let's create other tickets and put only titles/ticket numbers in a bullet list. I would recommend creating the tickets right now, while you have all the details and everything in your head. Not sure this will still be the case in a month (but otherwise no problem to do that work in a month).
@Jaifroid, sorry I also don't have much time at the moment, just wanted to add that these are all things that were done to support storing more generic web archives in ZIM - this means storing almost full URLs, with query arguments, etc.
@ikreymer Thanks for the explanation. It's useful to know. I think client software will need to take these things into account. Maybe it is more a question of documentation, then, than "problems" with the format. To be discussed!
This was a Zimit1 workaround for the SW. Gone.
Same.
I believe this can be closed now. |
From my PoV, the only thing left to be checked (but we can transfer this to another issue indeed) is the stuff about the <meta http-equiv ....> redirect to an absolute domain stored in the ZIM |
In fact, I also don't get what was exactly the point of meta http-equiv ; this is not supposed to contain a URL/redirect AFAIK: https://www.w3schools.com/tags/att_meta_http_equiv.asp |
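For context: the `refresh` value of `http-equiv` does accept a target in its `content` attribute, in the form `content="0; url=..."`, and that is how such a tag acts as a redirect. Below is a rough sketch of how a reader or scraper could extract that target. `extractMetaRefreshTarget` is a hypothetical helper, and the regex approach is a simplification (a real implementation would more likely use an HTML parser); it is not how Zimit or any Kiwix reader handles this.

```javascript
// Hypothetical helper: return the URL of a <meta http-equiv="refresh"> tag,
// or null if the page has no such tag or the tag carries no url= part.
function extractMetaRefreshTarget(html) {
  // Find the first <meta ...> tag whose http-equiv is "refresh".
  const tag = /<meta\b[^>]*http-equiv\s*=\s*["']?refresh["']?[^>]*>/i.exec(html);
  if (!tag) return null;
  // Pull out the quoted content attribute of that tag.
  const content = /content\s*=\s*(["'])(.*?)\1/i.exec(tag[0]);
  if (!content) return null;
  // Expected shape: "<seconds>; url=<target>".
  const target = /^\s*\d+\s*[;,]\s*url\s*=\s*(.+?)\s*$/i.exec(content[2]);
  if (!target) return null;
  // Drop optional quotes around the target itself.
  return target[1].replace(/^['"]|['"]$/g, '');
}

// Usage with a made-up page; example.org is a placeholder domain.
const page = '<html><head><meta http-equiv="refresh" content="0; url=https://example.org/plan"></head></html>';
console.log(extractMetaRefreshTarget(page));
```

If the extracted target is an absolute URL on an external domain, the reader cannot resolve it inside the ZIM unless it is rewritten, which is the concern raised in the list.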
Thanks for going over the details @benoit74. Just to clarify that this list was made with regard to Zimit1, a long time before Zimit2 was a thing. So some of the issues refer to the way Zimit1 introduced kludges and workarounds. That's the case with regard to the custom JavaScript on every landing page of Zimit1 ZIMs: this was a workaround for Android, a way of bootstrapping the loading of the Replay Service Worker. There was also on every HTML page a snippet of JS that checked if the page had been invoked directly, without loading the Replay Worker, and redirected to the bootstrap code if necessary.
The question about image URLs being stored with query strings was again that some backends explicitly stripped the query string when looking for assets. This behaviour has been amended in affected backends for Zimit2. In sum, I think all the issues mentioned above have now been addressed, or as @benoit74 mentioned, are in current PRs, or are known issues. The known issues are:
Closing as resolved. Many thanks to everyone for the huge effort! |
I finally got it about http-equiv and opened #237 to track this (it is not rewritten at all as of today) |
While working on experimental support for Zimit archives in Kiwix JS Windows/Linux, I have noticed a few issues with the Zimit ZIM file format that seem to break the OpenZIM specification. I list them here, but will add to this issue as I find any other issues:
- URLs are encoded with `encodeURIComponent()` instead of `encodeURI()`, with the exception of forward slashes (and any querystring). This means that commas in filenames are (incorrectly) encoded, whereas commas are an accepted character in the URL schema;
- In some ZIMs the URL is stored unencoded in `dirEntry.url`, and in others (e.g. `bouquineux.com_fr_all_2022-02.zim`) it is URI-encoded. The reader has to try first unencoded and then encoded;
- Image assets are stored with a query string in the ZIM URL, e.g. `A/media.cheatography.com/storage/thumb/davechild_regular-expressions.750.jpg?last=1584011681`;
- `<meta http-equiv ....>` redirect to an absolute domain stored in the ZIM: ideally the final redirected file should be stored, not the redirect page itself (example: "Make a plan" link on front page of ready.gov ZIM);
- MIME type of HTML files is set to `text/html;raw=true` instead of `text/html` - can be tested for, but any reader that tests `/^text\/html$/` in the backend will fail;
- Top-level landing pages contain relative URLs beginning with `../`, which are technically incorrect because they would cause the browser to navigate outside of the domain under which links are stored - see "Incorrect relative URLs on top-level landing pages", zimit#155.

It seems to me that the above are significant challenges for universal reader support for these ZIMs. I understand that these ZIMs have an in-built Service Worker that translates URLs, though it will never be possible to trap absolute URLs to external domains in an internal Service Worker, even if this can be made to run in all readers.
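To illustrate the encoding items in the list above, here is a small sketch of the difference for a filename containing a comma, plus the "try unencoded, then encoded" lookup order. `componentStyleEncode` is only my reconstruction of the behaviour described (per-segment `encodeURIComponent()`, leaving slashes and the querystring alone), not code taken from Zimit; `candidateKeys` and the example path are likewise made up.

```javascript
// Assumed reconstruction of the encoding described above: each path segment
// goes through encodeURIComponent(), slashes and the querystring are kept.
function componentStyleEncode(url) {
  const queryStart = url.indexOf('?');
  const path = queryStart === -1 ? url : url.slice(0, queryStart);
  const query = queryStart === -1 ? '' : url.slice(queryStart);
  return path.split('/').map(encodeURIComponent).join('/') + query;
}

// Hypothetical helper: keys a reader could try, unencoded first, then encoded.
function candidateKeys(url) {
  const encoded = componentStyleEncode(url);
  return encoded === url ? [url] : [url, encoded];
}

// Made-up path with a comma in the filename.
const path = 'A/example.com/files/report,final.jpg';
console.log(encodeURI(path));            // A/example.com/files/report,final.jpg
console.log(componentStyleEncode(path)); // A/example.com/files/report%2Cfinal.jpg
console.log(candidateKeys(path));
```

`encodeURI()` leaves the comma alone because it is a reserved character that is valid in a URL, whereas `encodeURIComponent()` turns it into `%2C`, so a reader that computes keys with `encodeURI()` will not match an entry stored the other way.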
Please don't misunderstand: the above is not intended as a criticism. The Zimit format is a very good and interesting effort to make it possible for users to create their own ZIMs. This is in the spirit of improving the format, so that other readers can access content in these ZIMs.
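On the MIME type item in the list above, a reader can avoid the strict-match failure by comparing only the part of the type before any `;` parameters. This is a sketch of one possible check, with `isHtmlMimeType` being a hypothetical name rather than an existing function in any reader:

```javascript
// Hypothetical helper: treat "text/html" with or without parameters as HTML.
function isHtmlMimeType(mimeType) {
  // Keep only the type/subtype, dropping parameters such as ";raw=true".
  const essence = String(mimeType).split(';')[0].trim().toLowerCase();
  return essence === 'text/html';
}

// The strict regex from the list rejects the Zimit value...
console.log(/^text\/html$/.test('text/html;raw=true')); // false
// ...while the parameter-tolerant check accepts it.
console.log(isHtmlMimeType('text/html;raw=true'));      // true
```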