Location of WARC assets in Type 1 (no-namespace) ZIMs #99
Comments
From my point of view as the libzim developer: The goal of the "no-namespace" scheme is to remove the namespace for the user. (Removing the namespace from the zim format would be a very different thing, and it is not planned at all.) libzim >= 7 didn't remove the namespace, but takes it for its own use. In doing so, it provides features that were previously handled by libkiwix, such as storing metadata (in From the WARC point of view, nothing has changed. It was creating item in Using libzim, this is transparent to the user.
I'm curious about what assumptions you are making. Regarding this specific issue, an important point with libzim >= 7 is that an entry path doesn't need a namespace (where you were needing a But if we start to have readers trying to interpret WARC zim files (and I agree with that), we must standardize how we store WARC things.
The assumption is quite simple and, I hope, logical. Each ZIM type has a namespace where an article will be found. Consider these two scenarios:
Now consider the fact that WARC-style ZIMs very often have absolute URLs of type In the Kiwix JS code, we have a variable However, in a Type 1 ZIM, I now have to prefix the We've therefore ended up with a hybrid of Type 0 and Type 1 ZIMs, where the previous article namespace has now become a prefix. (An added problem in Kiwix JS code is the way the namespace is recognized with a regex, but that is very specific to us.)
I think you're mixing things. There are two levels:
The two levels are somewhat independent. I think you are trying to handle them at the same level, and it becomes complicated. I will start with the second level: The first point is lower level, when the zim reader (not the warc reader) is asked for an entry using a key:
I'm also curious about what you have to do to handle Headers. I still don't fully understand the purpose of headers. Do they only mark redirection? Could they be expressed with zim redirection?
@mgautierfr Thank you for the suggestions. I think our differences of perspective derive from a difference in the way our respective codebases handle Type 0 and Type 1 ZIMs. We depend entirely on the title pointer list and url pointer index (we can't use Xapian yet), and we must derive a ZIM URL (i.e. a URL including namespace, whether or not the namespace is In a Type 0 WARC ZIM the url pointer list looks like this: The hyperlinked URL from which we need to derive one of the above ZIM URLs often looks like this (it can also be relative to current article, or can begin
Please also note that in WARC ZIMs all assets are in the Compare this with a Type 1 WARC ZIM as currently coded: The hyperlinked URL (in an article or in a js or css file) from which we need to derive one of the above looks like this:
You suggest I always need to add an /A/ in front of the URL, but if I were to do that in Type 0 ZIMs, I'd end up with a (wrong) ZIM URL that would look like To summarize: Type 0: Type 1: It's logically inconsistent, as I see it... This is clearly a legacy situation. We would never have ended up with URLs looking like this if we were designing WARC Type 1 ZIMs from scratch (clearly the intermediate Now as it's currently just me supporting the reading of WARC ZIMs without the Replay system, I will of course adapt. I thought it was useful, however, to have this discussion, because I understand that you would like to do something similar in Kiwix Desktop, and possibly the same might be the case for the iOS app, and it is worth settling this before any more development work (or patching) is done. (I'll try to answer your other query separately.)
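The derivation described in this comment, going from an absolute hyperlink to a full ZIM URL, might be sketched roughly as below. This is only an illustration in Python, not the Kiwix JS code: the function name `zim_url_from_link` is hypothetical, the `A/` and `C/A/` prefixes are taken from the issue description, and keeping the query string as part of the path is an assumption.

```python
from urllib.parse import urlsplit


def zim_url_from_link(link: str, zim_type: int) -> str:
    """Derive a full ZIM URL for a WARC asset from an absolute hyperlink.

    Hypothetical helper for illustration only.
    Type 0: the asset lives in the A/ namespace.
    Type 1: the asset lives under C/, with A/ kept as a path prefix.
    """
    parts = urlsplit(link)
    if not parts.netloc:
        # Relative links need the current article's URL to resolve;
        # that case is outside this sketch.
        raise ValueError(f"expected an absolute URL, got {link!r}")
    path = parts.netloc + parts.path
    if parts.query:
        # Assumption: the query string is kept as part of the entry path.
        path += "?" + parts.query
    prefix = "A/" if zim_type == 0 else "C/A/"
    return prefix + path


print(zim_url_from_link("https://armypubs.army.mil/FAQ/FAQ1.aspx", 0))
print(zim_url_from_link("https://armypubs.army.mil/FAQ/FAQ1.aspx", 1))
```

The point of the sketch is that a reader which builds full ZIM URLs itself has to know the ZIM type before it can choose the prefix, which is the inconsistency this comment is describing.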
To be honest, I've only ever found Headers useful for: redirection, and for telling us that an asset is not found. Both of these uses can usually be extracted from the Response instead of the Header (the Response is almost always there too and it is a 404 or redirection response). However, there are some cases where there is a Header URL for an asset, but there is no Response in the ZIM at the same URL, only the redirected URL. In these cases (rare) we have to examine the headers. Headers look like this -- I coded https://pwa.kiwix.org to allow us (me) to examine the headers by searching for It's the custom headers we need to look at, particularly Because I only need to examine the headers rarely, I only search for a Header if a search for the corresponding asset in the content namespace returns null. This way, I save doing two ZIM URL lookups (one for https://github.com/kiwix/kiwix-js-windows/blob/master/www/js/lib/zimArchive.js#L496 but reading the various forms of redirection is handled here: https://github.com/kiwix/kiwix-js-windows/blob/master/www/js/lib/transformZimit.js#L56 (look for
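The lookup order described here (asset first, header only when the asset is missing) could look something like the following sketch. It is not the code in zimArchive.js: a plain dictionary stands in for the ZIM archive, the function name `find_asset_or_header` is hypothetical, and the `C/A/` and `C/H/` prefixes assume the Type 1 layout from the issue description.

```python
from typing import Mapping, Optional, Tuple


def find_asset_or_header(
    entries: Mapping[str, bytes], path: str
) -> Optional[Tuple[str, bytes]]:
    """Return ("asset", data) if the asset exists, else ("header", data).

    Hypothetical helper: `entries` is a stand-in for a real ZIM archive,
    keyed by full ZIM URL. The header is consulted only when the asset
    lookup fails, so the common case costs a single lookup.
    """
    asset = entries.get("C/A/" + path)
    if asset is not None:
        return ("asset", asset)
    header = entries.get("C/H/" + path)
    if header is not None:
        return ("header", header)
    return None


# Toy archive: one page with a Response, one with only a Header.
archive = {
    "C/A/example.org/index.html": b"<html></html>",
    "C/H/example.org/old.html": b"(header record)",
}
print(find_asset_or_header(archive, "example.org/index.html"))
print(find_asset_or_header(archive, "example.org/old.html"))
print(find_asset_or_header(archive, "example.org/missing.html"))
```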
PS Headers no doubt have other functions in highly dynamic HTML/JS websites, that allow the Replay system fully to emulate the headers sent and the responses received for every Fetch request. As I'm only offering basic support (I don't attempt to emulate fully the Replay system), I haven't gone further than checking headers when absolutely required. Evidently highly dynamic ZIMs really need to use Replay libraries to emulate fully the original environment. My approach is really only intended to enable relatively static sites to be read (like Low Tech Magazine, or Internet Encyclopaedia of Philosophy). However, it works on nearly all ZIMs we currently publish (or rather, it worked, because it's now broken with the new Type 1 ZIMs appearing
Indeed, headers are crucial for many JS-driven situations. Pareto principle at play here 😉
You don't have to use it. Xapian is only used for fulltext search. Locating an entry is done entirely without Xapian.
That is the problem. The backend must not add the namespace for Type 0 zim files. The nonamespace format is the scam. It is not about the format, it is about the API. What has been done is mainly two things: What is done is:
From the user point of view, nothing has changed : Always use
The only difference in treatment is points 2 and 4. Everything else is transparent. Point 2 is made here and Point 4 is here. And if you are surprised by the naming of the internal functions, it is because they have not changed between Type 0 and Type 1. The nonamespace thing is just about releasing the constraint. zimwriterfs and the Python scrapers have been adapted to the fact that they can remove this one-char subdirectory, and so don't use it. Because of that, Type 1 zim files appear not to contain a namespace. But that is because the creators have been adapted, not because of Type 1 (assuming points 2 and 4 are made). About your image about Type 1 search starting with
The problem is that no one knows (or wants to explain) what the 20% situation is 😉
It's not hard to understand that JS code can expect specific headers/value-in-headers to work properly. Now I know you want specific examples (I do too) but I don't have them. I can imagine they exist though. I think I remember Twitter feed was mentioned. I know YT videos use a complex scenario with POST requests and processing to get the actual video stream URL, but I don't know if it involves headers or not. This is probably not the right place to ask for this though; but in https://github.com/webrecorder/replayweb.page
Sorry for slow response (I'm travelling). As mentioned, there is clearly a difference in the treatment of namespaces in libzim and in the Kiwix JS emulation of it in the backend. You are hiding it from the main programme, whereas our code is peppered with lines that concatenate the title URL with the namespace to produce a full ZIM URL, usually looking like In any case, it shouldn't take me too long to handle Type 1 Zimit ZIMs, and once we adopt the Emscripten port of libzim in Kiwix JS, we'll have more conformant backend code. My preference would have been to maintain support for an So, to conclude this discussion: I think we've decided to keep the pseudo namespaces (inside |
I can confirm this, having looked into it a little. When it works, it's a case of redirection. You ask for a local resource (a "fuzzy URL" for the video), and the header provides you (or should provide you, because it was broken in many cases) with a long and complex
There is one thing in the conversation above that suggests further changes:
If this adaptation is likely to change the format again, is it possible to establish a guideline for that now? I want to fix my code in a way that is reasonably future proof.
From the discussion I had the feeling we should stick to the current behavior "/A" and "/H" prefixes. My initial thinking was to remove "/A" for articles but the benefit seems limited (and replay code depends on a prefix)
@rgaudin OK, thanks for confirming. I'll develop to this. I'm closing this issue for now.
I just wanted to add a note (without re-opening this issue) to say that I've implemented various fixes that now support reading Type 1 WARC-based ZIMs in Kiwix JS for Windows and Linux v2.1.0. The fix is also in the PWA at https://pwa.kiwix.org (be sure it has self-updated to v2.1.0). The fixes are based on what was discussed and agreed above. As a reminder, the approach to reading these archives is based on transforming HTML/CSS/JS and ZIM URLs, and does not currently use the Replay system. This provides some advantages (no need for a Service Worker, some ZIMs are accessible even in Internet Explorer) and some disadvantages (it is less robust). NB multimedia support is still missing, but this has nothing to do with the Type 1 format -- just lack of time to address it.
As briefly discussed in openzim/zimit#135, WARC assets and WARC Headers are all stored under the Content area (C/) in Type 1 ("no-namespace") ZIMs. In legacy Type 0 ZIMs, all WARC assets were stored under the Article namespace (A/) and all WARC Headers were stored under a custom Header namespace H/.
The new approach is a side effect of the fact that libzim7 has lost access to the H/ namespace (by design) for Type 1 ZIMs. It leads to ZIM URLs of the type C/A/armypubs.army.mil/FAQ/FAQ1.aspx, and a corresponding header C/H/armypubs.army.mil/FAQ/FAQ1.aspx.
This is problematic for readers that make assumptions based on the OpenZIM specification, because the old namespaces (dealt with at a low level) have now become prefixes (which have to be dealt with in a different way, i.e. with custom code).
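To make the change concrete, here is a minimal sketch of how a Type 0 WARC ZIM URL maps onto its Type 1 counterpart under the layout described above. The function names `type0_to_type1` and `header_url_for` are hypothetical and exist only for illustration; they are not part of libzim or any Kiwix reader.

```python
def type0_to_type1(zim_url: str) -> str:
    """Map a legacy Type 0 WARC ZIM URL to its Type 1 equivalent.

    A/<path> (asset)  -> C/A/<path>
    H/<path> (header) -> C/H/<path>

    Hypothetical helper for illustration only.
    """
    namespace, sep, path = zim_url.partition("/")
    if not sep or namespace not in ("A", "H"):
        raise ValueError(f"not a WARC asset or header URL: {zim_url!r}")
    # The old namespace survives as a plain path prefix inside C/.
    return f"C/{namespace}/{path}"


def header_url_for(asset_url: str) -> str:
    """Return the header URL matching a Type 1 asset URL (C/A/... -> C/H/...)."""
    prefix = "C/A/"
    if not asset_url.startswith(prefix):
        raise ValueError(f"not a Type 1 WARC asset URL: {asset_url!r}")
    return "C/H/" + asset_url[len(prefix):]


asset = type0_to_type1("A/armypubs.army.mil/FAQ/FAQ1.aspx")
print(asset)
print(header_url_for(asset))
```

The second function shows why the prefixes need custom code: the asset-to-header relationship is no longer expressed by the namespace character alone, but by a path segment inside the content namespace.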
This issue is to discuss whether it is possible to design a more rational way of storing WARC assets in Type 1 ZIMs, so that assets can be accessed by a standard ZIM reader based on the OpenZIM spec (NB, I say "assets can be accessed"; this does not necessarily mean that they can be displayed without the Replay system, currently based on installation of a Service Worker).
One possibility, suggested by @rgaudin, is to remove the A/ prefix for WARC assets, and to leave Headers with an H/ prefix (still under the C/ namespace). This would satisfy the goal of any ZIM reader being able, in principle, to access the assets.
The goal is to have a settled schema so that those apps that still need to develop a way of reading Zimit ZIMs -- Kiwix Desktop @mgautierfr, Kiwix JS @mossroy, Kiwix JS Windows/Linux (which had experimental support, now broken), the iOS app -- can make safe assumptions going forward.