Location of WARC assets in Type 1 (no-namespace) ZIMs #99

Closed
Jaifroid opened this issue Jul 8, 2022 · 16 comments

Comments

@Jaifroid

Jaifroid commented Jul 8, 2022

As briefly discussed in openzim/zimit#135, WARC assets and WARC Headers are all stored under the Content area (C/) in Type 1 ("no-namespace") ZIMs. In legacy Type 0 ZIMs, all WARC assets were stored under the Article namespace (A/) and all WARC Headers were stored under a custom Header namespace H/.

The new approach is a side effect of the fact that libzim7 has lost access to the H/ namespace (by design) for Type 1 ZIMs. It leads to ZIM URLs of the type C/A/armypubs.army.mil/FAQ/FAQ1.aspx, and a corresponding header C/H/armypubs.army.mil/FAQ/FAQ1.aspx.

This is problematic for readers that make assumptions based on the OpenZIM specification, because the old namespaces (dealt with at a low level) have now become prefixes (which have to be dealt with in a different way, i.e. with custom code).

This issue is to discuss whether it is possible to design a more rational way of storing WARC assets in Type 1 ZIMs, so that assets can be accessed by a standard ZIM reader based on the OpenZIM spec (NB, I say "assets can be accessed"; this does not necessarily mean that they can be displayed without the Replay system, currently based on installation of a Service Worker).

One possibility, suggested by @rgaudin, is to remove the A/ prefix for WARC assets, and to leave Headers with an H/ prefix (still under the C/ namespace). This would satisfy the goal of any ZIM reader being able, in principle, to access the assets.

The goal is to have a settled schema so that those apps that still need to develop a way of reading Zimit ZIMs -- Kiwix Desktop @mgautierfr, Kiwix JS @mossroy, Kiwix JS Windows/Linux (which had experimental support, now broken), the iOS app -- can make safe assumptions going forward.

@mgautierfr
Contributor

From my point of view as the libzim developer:

The goal of the "no-namespace" scheme is to remove the namespace for the user. (Removing the namespace from the zim format itself would be a very different thing, and it is not planned at all.)
With "Type-0" zims and the associated libzim (i.e. libzim < 7), the usage of the namespace is left to the user. kiwix (not zim) was using namespaces by putting articles in the A namespace, metadata in M, images and SVG in I/J, and so on.
With time, both libzim and libkiwix have moved from a pair ("namespace" and "url") to a simple key: "/url". It is simply a path starting with a single letter.

libzim >= 7 didn't remove the namespace, but takes it over for its own usage. In doing so, it provides features that were handled by libkiwix, such as storing metadata (in the M namespace, but it is libzim doing it and providing the corresponding API) or creating indexes (in the X namespace). The user just puts its content in the zim file, without caring about the namespace (the fact that it is stored in the C namespace is libzim's business).

From the WARC point of view, nothing has changed. It was creating an item at H/armypubs.army.mil/FAQ/FAQ1.aspx (and libzim<7 was putting the content in the H namespace).
Now it is creating an item at H/armypubs.army.mil/FAQ/FAQ1.aspx (and libzim>=7 puts the content in the C namespace).
When reading, in both cases it accesses the entry at H/armypubs.army.mil/FAQ/FAQ1.aspx, and libzim searches in the C or H namespace depending on the type of the zim file.

Using libzim, this is transparent to the user.
For other implementations (yours), the behavior should be the same: extract the first char of the url as a namespace and search in this namespace for the rest of the url.
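
To make the rule above concrete, here is a minimal sketch of how a non-libzim reader might map a user-facing path to the namespace and url it actually looks up. The function name `locate` and the numeric `zimType` argument are assumptions made for illustration; they are not part of libzim or Kiwix JS.

```javascript
// Hypothetical helper (not a libzim or Kiwix JS API): map a user-facing
// path to the (namespace, url) pair a reader would search for.
function locate(path, zimType) {
  if (zimType === 0) {
    // Type 0: the first character is the namespace, the rest is the url.
    if (path.length < 3 || path[1] !== '/') {
      throw new Error('Type 0 path must start with a one-char namespace: ' + path);
    }
    return { namespace: path[0], url: path.slice(2) };
  }
  // Type 1: everything is stored in C and the path is kept whole.
  return { namespace: 'C', url: path };
}

const headerPath = 'H/armypubs.army.mil/FAQ/FAQ1.aspx';
console.log(locate(headerPath, 0)); // namespace H, url without the "H/" prefix
console.log(locate(headerPath, 1)); // namespace C, url still starting with "H/"
```

The same key is passed in both cases; only the place where the reader searches differs.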

This is problematic for readers that make assumptions based on the OpenZIM specification, because the old namespaces (dealt with at a low level) have now become prefixes (which have to be dealt with in a different way, i.e. with custom code).

I'm curious about what assumptions you are making ?


Regarding this specific issue, an important point with libzim>=7 is that the entry path doesn't need a namespace (whereas you needed an A namespace with libzim<7). So yes, warc2zim can remove the A subdirectory. As long as readers make no assumptions about warc zim files, it should be ok (the reader in the zim file itself will take care of that).
A WARC zim file is another format, based upon the zim format (somewhat in the same way that a kiwix zim file is/was a format based upon the zim format); it can do whatever it wants (as far as libzim is concerned).

But if we start to have readers trying to interpret warc zim files (and I agree with that), we must standardize how we store warc things.
Removing the A namespace creates a different "format" of warc zim file, so we should be careful before removing it.

@Jaifroid
Author

Jaifroid commented Jul 11, 2022

I'm curious about what assumptions you are making ?

The assumption is quite simple and, I hope, logical. Each ZIM type has a namespace where an article will be found. Consider these two scenarios:

  • For Type 0 ZIMs the article namespace is A/. Other assets are in other namespaces, but we don't care about that because the assets are hyperlinked with relative URLs that include the namespace.
  • For Type 1 ZIMs the namespace is C/. Other assets are in the same namespace, but (for an OpenZIM-compliant ZIM) we don't care about that because, again, the assets are hyperlinked correctly with relative URLs to find them -- we just follow the hyperlink.

Now consider the fact that WARC-style ZIMs very often have absolute URLs of type https://armypubs.army.mil/Tools/PDF/AD_Process.pdf (an actual example from the home page of the armypubs ZIM (Type 1)).

In the Kiwix JS code, we have a variable cns = content namespace, which is set to A for Type 0 ZIMs and C for Type 1 ZIMs. For a Type 0 WARC ZIM, we don't need any special code to retrieve the ZIM URL of an article: we know we can find it at cns + '/' + articleURL, where articleURL is either a relative URL or an absolute URL with the https:// or // removed. We of course have to add special code to handle redirects in Headers, i.e. we only have to handle the H/ namespace.

However, in a Type 1 ZIM, I now have to prefix the articleURL with cns + '/' + 'A' + '/'. Nothing in the URL tells me I need to do this. Therefore, I have to write code specifically to handle this situation, i.e. I have to recognize that we are dealing with a Type 1 ZIM, and I have to make an assumption that we must add an intermediate A in order to derive the ZIM URL, whereas we did not need to do that in a Type 0 ZIM, because the namespace is handled transparently in the backend.

We've therefore ended up with a hybrid of Type 0 and Type 1 ZIMs, where the previous article namespace has now become a prefix. (An added problem in Kiwix JS code is the way the namespace is recognized with a regex, but that is very specific to us.)
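
As an illustration of the difference described in this comment, the following sketch derives a full ZIM URL from an absolute hyperlink. The function names `stripProtocol` and `deriveZimUrl` are hypothetical and are not the actual Kiwix JS code; relative URLs are deliberately not handled.

```javascript
// Hypothetical sketch, not the actual Kiwix JS implementation.
// Remove a leading "https://", "http://" or protocol-relative "//".
function stripProtocol(href) {
  return href.replace(/^(?:https?:)?\/\//i, '');
}

function deriveZimUrl(href, zimType) {
  const articleURL = stripProtocol(href);
  // cns = content namespace: A for Type 0 ZIMs, C for Type 1 ZIMs.
  const cns = zimType === 0 ? 'A' : 'C';
  // In a Type 1 Zimit ZIM the old A namespace survives as a prefix inside C/,
  // so it has to be added explicitly; nothing in the hyperlink asks for it.
  const prefix = zimType === 0 ? '' : 'A/';
  return cns + '/' + prefix + articleURL;
}

const href = 'https://armypubs.army.mil/Tools/PDF/AD_Process.pdf';
console.log(deriveZimUrl(href, 0)); // A/armypubs.army.mil/Tools/PDF/AD_Process.pdf
console.log(deriveZimUrl(href, 1)); // C/A/armypubs.army.mil/Tools/PDF/AD_Process.pdf
```

The `prefix` branch is the special case under discussion: it exists only because the reader has to know it is dealing with a Type 1 WARC ZIM.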

@mgautierfr
Contributor

I think you're mixing things. There are two levels:

  • A technical one which is how we store things in zim files (the zim level)
  • A functional one which is what key we use for entry in zim files (the warc level)

The two levels are somewhat independent. I think you are trying to handle them at the same level, and it becomes complicated.

I will start with the second level:
Always prepend A/. It is how warc2zim builds the file and how it "sees" the file. It has always been with an A/ prefix, so you should do the same when reading. (At least for your use case with absolute links.)

The first point is lower level: it applies when the zim reader (not the warc reader) is asked for an entry using a key:

  • On zim Type 0, it assumes that the namespace is in the key, and so splits the key and searches in the right namespace
  • On zim Type 1, it assumes that the key doesn't contain a namespace, and so searches in namespace C

This behavior should always be the same, warc or not.
If somehow you receive a path without a namespace for a zim Type 0, search in A (and I, J, -).
If somehow you receive a path with a namespace for a zim Type 1, replace the namespace with C.
This is what is done here: https://github.com/openzim/libzim/blob/master/src/archive.cpp#L195-L225
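
The lookup order described above, including the two "if somehow" fallbacks, could be sketched as follows. This is a paraphrase of the behavior as stated in this comment, not a port of the linked libzim code; the `entries` set is a hypothetical stand-in for a ZIM file's list of stored "namespace/url" keys.

```javascript
// Hypothetical sketch of the lookup order; `entries` is a Set of
// "namespace/url" strings standing in for what the file actually stores.
function findEntry(entries, key, zimType) {
  const candidates = [];
  if (zimType === 1) {
    // Normal case: the key has no namespace, so search in C.
    candidates.push('C/' + key);
    // Fallback: the key carried an old-style namespace; replace it with C.
    if (key.length > 2 && key[1] === '/') {
      candidates.push('C/' + key.slice(2));
    }
  } else {
    // Normal case: the key already contains the namespace.
    candidates.push(key);
    // Fallback: the key had no namespace; try A, I, J and - in turn.
    for (const ns of ['A', 'I', 'J', '-']) {
      candidates.push(ns + '/' + key);
    }
  }
  const found = candidates.find((c) => entries.has(c));
  return found === undefined ? null : found;
}

const warcType1 = new Set(['C/A/example.org/index.html', 'C/H/example.org/index.html']);
console.log(findEntry(warcType1, 'A/example.org/index.html', 1));
```

Because the normal case is tried first, a Type 1 WARC ZIM that really does store keys beginning with A/ is found before the fallback is ever considered.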

We of course have to add special code to handle redirects in Headers, i.e. we only have to handle the H/ namespace.

I'm also curious about what you have to do to handle Headers. I'm still not able to fully understand what is the full purpose of headers. Only mark redirection? Could they be expressed with zim redirection?
If you have a small explanation or a link to a bit of code, it would be nice.

@Jaifroid
Author

Jaifroid commented Jul 11, 2022

@mgautierfr Thank you for the suggestions. I think our differences of perspective derive from a difference in the way our respective codebases handle Type 0 and Type 1 ZIMs. We depend entirely on the title pointer list and url pointer index (we can't use Xapian yet), and we must derive a ZIM URL (i.e. a URL including namespace, whether or not the namespace is C or A) from all hyperlinked content.

In a Type 0 WARC ZIM the url pointer list looks like this:

[image: screenshot of the URL pointer list in a Type 0 WARC ZIM]

The hyperlinked URL from which we need to derive one of the above ZIM URLs often looks like this (it can also be relative to current article, or can begin // without any protocol):

https://www.lowtechmagazine.com/2009/09/water-powered-cable-trains.html

Please also note that in WARC ZIMs all assets are in the A/ namespace, including images, js, everything. Effectively the A/ namespace in such a ZIM acts just like the C/ namespace in a standards-compliant "no namespace" ZIM.

Compare this with a Type 1 WARC ZIM as currently coded:

[image: screenshot of the URL pointer list in a Type 1 WARC ZIM]

The hyperlinked URL (in an article or in a js or css file) from which we need to derive one of the above looks like this:

https://armypubs.army.mil/FAQ/pdf/PDF_Forms_Browser1.docx (no A/ anywhere in the URL).

You suggest I always need to add an /A/ in front of the URL, but if I were to do that in Type 0 ZIMs, I'd end up with a (wrong) ZIM URL that would look like A/A/www.lowtechmagazine.com/2009/09/water-powered-cable-trains.html, because the backend will add the namespace no matter what I do.

To summarize:

Type 0: https://www.lowtechmagazine.com/2009/09/water-powered-cable-trains.html -> A/www.lowtechmagazine.com/2009/09/water-powered-cable-trains.html

Type 1: https://armypubs.army.mil/FAQ/pdf/PDF_Forms_Browser1.docx -> C/A/armypubs.army.mil/FAQ/pdf/PDF_Forms_Browser1.docx

It's logically inconsistent, as I see it... This is clearly a legacy situation. We would never have ended up with URLs looking like this if we were designing WARC Type 1 ZIMs from scratch (clearly the intermediate /A/ in the URL pointer list is completely redundant information).


Now as it's currently just me supporting the reading of WARC ZIMs without the Replay system, I will of course adapt. I thought it was useful, however, to have this discussion, because I understand that you would like to do something similar in Kiwix Desktop, and possibly the same might be the case for the iOS app, and it is worth settling this before any more development work (or patching) is done. (I'll try to answer your other query separately.)

@Jaifroid
Author

Jaifroid commented Jul 11, 2022

I'm also curious about what you have to do to handle Headers. I'm still not able to fully understand what is the full purpose of headers. Only mark redirection? Could they be expressed with zim redirection?
If you have a small explanation or a link to a bit of code, it would be nice.

To be honest, I've only ever found Headers useful for redirection and for telling us that an asset is not found. Both of these uses can usually be extracted from the Response instead of the Header (the Response is almost always there too and it is a 404 or redirection response). However, there are some cases where there is a Header URL for an asset, but there is no Response in the ZIM at the same URL, only the redirected URL. In these (rare) cases we have to examine the headers. Headers look like this -- I coded https://pwa.kiwix.org to allow us (me) to examine the headers by searching for H/ (you can also search for H/.*something, with something being interpreted as a regex):

[image: screenshot of WARC headers found by searching for H/]

It's the custom headers we need to look at, particularly WARC-Target-URI, as it will tell us where to find the "lost" asset.

Because I only need to examine the headers rarely, I only search for a Header if a search for the corresponding asset in the content namespace returns null. This way, I save doing two ZIM URL lookups (one for H/ + URL and one for C/ + URL or A/ + URL) for every ZIM URL that is requested. The code starts here:

https://github.com/kiwix/kiwix-js-windows/blob/master/www/js/lib/zimArchive.js#L496

but reading the various forms of redirection is handled here:

https://github.com/kiwix/kiwix-js-windows/blob/master/www/js/lib/transformZimit.js#L56 (look for WARC-Target-URI).
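
A minimal sketch of the "look at the Header only when the asset lookup fails" strategy described above. It is not the code at the links above: the `zim` Map, the function names, and the assumption that a Header entry is plain text with one `Name: value` field per line are all illustrative.

```javascript
// Hypothetical sketch. `zim` is a Map from full ZIM URL to entry content,
// standing in for a real archive lookup.
// Assumption: a Header entry is text with one "Name: value" field per line.
function parseHeaders(text) {
  const headers = {};
  for (const line of text.split(/\r?\n/)) {
    const i = line.indexOf(':');
    if (i > 0) {
      headers[line.slice(0, i).trim().toLowerCase()] = line.slice(i + 1).trim();
    }
  }
  return headers;
}

function findAsset(zim, contentPrefix, headerPrefix, url) {
  // First (and usually only) lookup: the asset itself.
  const direct = zim.get(contentPrefix + url);
  if (direct !== undefined) {
    return { zimUrl: contentPrefix + url, content: direct };
  }
  // Only on a miss do we pay for a second lookup, in the Header area.
  const rawHeader = zim.get(headerPrefix + url);
  if (rawHeader === undefined) return null;
  const target = parseHeaders(rawHeader)['warc-target-uri'];
  if (!target) return null;
  // Turn the target URI into a ZIM URL and try once more.
  const redirected = contentPrefix + target.replace(/^(?:https?:)?\/\//i, '');
  const content = zim.get(redirected);
  return content === undefined ? null : { zimUrl: redirected, content: content };
}

const zim = new Map([
  ['C/A/example.org/new.html', '<html>new</html>'],
  ['C/H/example.org/old.html', 'WARC-Type: response\r\nWARC-Target-URI: https://example.org/new.html'],
]);
console.log(findAsset(zim, 'C/A/', 'C/H/', 'example.org/old.html'));
```

For a Type 0 ZIM the same function would be called with `'A/'` and `'H/'` as the two prefixes.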

@Jaifroid
Author

PS Headers no doubt have other functions in highly dynamic HTML/JS websites, that allow the Replay system fully to emulate the headers sent and the responses received for every Fetch request. As I'm only offering basic support (I don't attempt to emulate fully the Replay system), I haven't gone further than checking headers when absolutely required. Evidently highly dynamic ZIMs really need to use Replay libraries to emulate fully the original environment. My approach is really only intended to enable relatively static sites to be read (like Low Tech Magazine, or Internet Encyclopaedia of Philosophy). However, it works on nearly all ZIMs we currently publish (or rather, it worked, because it's now broken with the new Type 1 ZIMs appearing ☹️).

@rgaudin
Member

rgaudin commented Jul 12, 2022

Indeed, headers are crucial for many JS-driven situations. Pareto principle at play here 😉

@mgautierfr
Contributor

we can't use Xapian yet

You don't have to use it. Xapian is only used for fulltext search. Locating an entry is done entirely without Xapian.

You suggest I always need to add an /A/ in front of the URL, but if I were to do that in Type 0 ZIMs, I'd end up with a (wrong) ZIM URL that would look like A/A/www.lowtechmagazine.com/2009/09/water-powered-cable-trains.html, because the backend will add the namespace no matter what I do.

That is the problem. The backend must not add the namespace for a Type 0 zim file.

The "no-namespace" format is a bit of a scam. It is not about the format, it is about the API. What has been done is mainly two things:
A - Hide the namespace from the API. In libzim6, there were Article File::getArticle(char ns, const std::string& url) and Article File::getArticle(const std::string& url). We have removed the former and kept only the latter, as Entry Archive::getEntryByPath(const std::string& url). This changes nothing. It is just that now we don't have a namespace. We simply have a url (which always starts with a one-char subdirectory). In the same way, Article::getNamespace() is removed. Now Entry::getPath() contains a one-char subdirectory at the start.
B - As the client now never manages the namespace, libzim is free not to parse the url to extract a namespace, and simply to use the Content namespace. (That is Type1.) In the same way, Entry::getPath() now doesn't include the namespace (for Type1).

What is done is:

  1. The user asks for an entry for path A/foo.html.
  2. Depending on the zim Type: if it is Type0, we assume the path contains the namespace; for Type1, we don't care.
    • Type0: parse the url and extract the namespace => A / foo.html
    • Type1: select namespace C => C / A/foo.html
  3. Search for the path in the namespace.
  4. Return an entry. When calling entry.getPath(), it returns something depending on the zim Type:
    • Type0: prepend the namespace to the path => A + foo.html = A/foo.html
    • Type1: return the path => A/foo.html
  5. The user uses the entry and doesn't care about the zim Type.

From the user's point of view, nothing has changed: always use A/foo.html, don't care about the zim Type, and never try to specify the namespace.
And it is the same for the creator:

  • With A, the creator always passes a path and never the namespace. (The only constraint is that the path must start with a one-char subdirectory.)
  • With B, the constraint is gone and the creator can use whatever it wants.

The only differences in treatment are points 2 and 4. Everything else is transparent. Point 2 is made here and Point 4 is here. And if you are surprised by the naming of the internal functions, it is because they have not changed between Type0 and Type1.

The no-namespace thing is just about releasing the constraint.
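
The five steps above can be sketched with a small in-memory model. This is only an illustration of the described behavior, not libzim's actual code: `makeArchive` and the `{ namespace, url }` records are hypothetical, and only the method names `getEntryByPath` and `getPath` are borrowed from the API mentioned in this comment.

```javascript
// Hypothetical in-memory model of the five steps; not libzim's code.
// storedEntries: array of { namespace, url } records as stored in the file.
function makeArchive(zimType, storedEntries) {
  const index = new Map(storedEntries.map((e) => [e.namespace + '/' + e.url, e]));
  return {
    getEntryByPath(path) {
      // Step 2: Type0 parses the namespace out of the path, Type1 selects C.
      const namespace = zimType === 0 ? path[0] : 'C';
      const url = zimType === 0 ? path.slice(2) : path;
      // Step 3: search for the url in that namespace.
      const stored = index.get(namespace + '/' + url);
      if (!stored) return null;
      // Step 4: Type0 prepends the namespace again, Type1 returns the url as is.
      return {
        getPath: () => (zimType === 0 ? stored.namespace + '/' + stored.url : stored.url),
      };
    },
  };
}

const type0 = makeArchive(0, [{ namespace: 'A', url: 'foo.html' }]);
const type1 = makeArchive(1, [{ namespace: 'C', url: 'A/foo.html' }]);
console.log(type0.getEntryByPath('A/foo.html').getPath()); // A/foo.html
console.log(type1.getEntryByPath('A/foo.html').getPath()); // A/foo.html
```

In this model the caller uses `A/foo.html` for both archives and gets the same path back; asking the Type1 archive for `C/A/foo.html` returns `null`, because the C namespace holds no entry whose url starts with `C/`.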

zimwriterfs and the Python scrapers have been adapted to the fact that they can remove this one-char subdirectory, and so don't use it. Because of that, Type1 zim files appear not to contain namespaces. But that is because the creators have been adapted, not because of Type1 (assuming points 2 and 4 are done).
warc2zim has not been adapted. The creator still adds an A subdirectory, so readers must still create/use paths with an A subdirectory.


About your image of a Type1 search starting with C/: if you search for things starting with C/ on Type1, a correct implementation should return nothing,
because there are no entries starting with C/ in the C namespace.

@mgautierfr
Contributor

Indeed, headers are crucial for many JS-driven situations. Pareto principle at play here

The problem is that no one knows (or wants to explain) what the 20% situation is 😉

@rgaudin
Member

rgaudin commented Jul 12, 2022

It's not hard to understand that JS code can expect specific headers/value-in-headers to work properly.
As this is supposed to be generic, the crawler just blindly stores the requests and the replayer blindly replays the request.

Now I know you want specific examples (I do too) but I don't have them. I can imagine they exist, though. I think I remember a Twitter feed being mentioned. I know YT videos use a complex scenario with POST requests and processing to get the actual video stream URL, but I don't know if it involves headers or not.

This is probably not the right place to ask about this, though; https://github.com/webrecorder/replayweb.page would be a better one.

@Jaifroid
Author

About your image about Type1 search starting with C/. If you search for things starting with C/ on Type1, a correct implementation should return nothing.
Because there is no entries starting with C/ in C namespace.

Sorry for slow response (I'm travelling). As mentioned, there is clearly a difference in the treatment of namespaces in libzim and in the Kiwix JS emulation of it in the backend. You are hiding it from the main programme, whereas our code is peppered with lines that concatenate the title URL with the namespace to produce a full ZIM URL, usually looking like dirEntry.namespace + '/' + dirEntry.url. This is what I meant when I said that I would end up with A/A/someURL.html if I additionally add another A in the front end for Type 0.

In any case, it shouldn't take me too long to handle Type 1 Zimit ZIMs, and once we adopt the Emscripten port of libzim in Kiwix JS, we'll have more conformant backend code.

My preference would have been to maintain support for an H/ namespace in Type 1 ZIMs (specifically for Zimit use), and to swap A/ for C/ for the content exposed to the end user. This would unclutter the C/ namespace and remove otherwise redundant legacy "pseudo-namespaces" (prefixes). But doing this would imply some development work on both libzim and WARC2ZIM, I guess.


So, to conclude this discussion: I think we've decided to keep the pseudo namespaces (inside C/) for Type 1 ZIMs, and we can assume that this format will be stable for the foreseeable future and should develop with this assumption. Could I suggest we document this somewhere (at least explaining the difference between Type 0 and Type 1 with regard to namespaces being converted into prefixes)? Thank you all for your time and patience in working through this.

@Jaifroid
Author

Jaifroid commented Jul 14, 2022

I know YT videos uses a complex scenario with POST request and processing to get the actual video stream URL, but I don't know if it involves headers or not.

I can confirm this, having looked into it a little. When it works, it's a case of redirection. You ask for a local resource (a "fuzzy URL" for the video), and the header provides you (or should provide you, because it was broken in many cases) with a long and complex WARC-Refers-To-Target-URI on either YouTube or "googlevideo". See openzim/zimit#122 for discussion of the problem.

@Jaifroid
Author

There is one thing in the conversation above that suggests further changes:

warc2zim has not been adapted. The creator still adds a A subdirectory so readers must still create/use a path with an A subdirectory.

If this adaptation is likely to change the format again, is it possible to establish a guideline for that now? I want to fix my code in a way that is reasonably future proof.

@rgaudin
Member

rgaudin commented Jul 17, 2022

From the discussion I had the feeling we should stick to the current behavior "/A" and "/H" prefixes.

My initial thinking was to remove "/A" for articles but the benefit seems limited (and replay code depends on a prefix)

@Jaifroid
Author

@rgaudin OK, thanks for confirming. I'll develop to this. I'm closing this issue for now.

@Jaifroid
Author

Jaifroid commented Aug 2, 2022

I just wanted to add a note (without re-opening this issue) to say that I've implemented various fixes that now support reading Type 1 WARC-based ZIMs in Kiwix JS for Windows and Linux v2.1.0. The fix is also in the PWA at https://pwa.kiwix.org (be sure it has self-updated to v2.1.0). The fixes are based on what was discussed and agreed above.

As a reminder, the approach to reading these archives is based on transforming HTML/CSS/JS and ZIM URLs, and does not currently use the Replay system. This provides some advantages (no need for a Service Worker, some ZIMs are accessible even in Internet Explorer) and some disadvantages (it is less robust).

NB multimedia support is still missing, but this has nothing to do with the Type 1 format -- just lack of time to address it.
