
Extremely long dirEntry URLs (>4KB) produced by warc2zim #102

Closed
Jaifroid opened this issue Jul 31, 2022 · 9 comments
Labels: question (Further information is requested)
Milestone: 1.6.0

@Jaifroid commented Jul 31, 2022

At least in Type 1 WARC ZIMs, some ridiculously long URLs are being stored in the ZIM -- larger than 4096 bytes. For example, in armypubs_en_all_2022-06.zim, I got a warning about this dirEntry:

[screenshot: warning showing the oversized dirEntry URL]

There are similar huge URLs in edu.gcfglobal.org_es_all_2022-06.zim (also a Type 1 ZIM). In all the Type 0 WARC ZIMs I tested, I never encountered a URL longer than 2KB (it would have caused a hang in Kiwix JS Windows). The ZIM type is probably irrelevant here, but it may have something to do with the more recent Replay implementation now being used to generate these Type 1 ZIMs? Or it may just be a coincidence that I've only seen them in two Type 1 ZIMs.

In any case, the reason I'm flagging this (apart from the fact that it caused a serious issue with the Kiwix JS / Windows backend, fixed in kiwix/kiwix-js#876) is that:

It shouldn't be too hard to reduce them -- e.g. if a URL is longer than 2048 bytes, calculate its unique SHA-256 hash and substitute that for the URL (after the namespace prefix) when the dirEntries are generated for the ZIM.
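A minimal sketch of that idea in Python (warc2zim's implementation language); the threshold, function name, and path layout here are assumptions for illustration, not existing warc2zim code:

```python
import hashlib

MAX_URL_BYTES = 2048  # suggested cut-off; any value well below 4KB would do

def shorten_path(path: str) -> str:
    """Replace an oversized ZIM path with '<namespace>/<sha256-hex>'.

    The namespace prefix (e.g. 'A' or 'H') is preserved so the entry
    stays in the right part of the ZIM; the 64-character digest is
    unique per URL, so internal links can be rewritten consistently.
    """
    if len(path.encode("utf-8")) <= MAX_URL_BYTES:
        return path
    namespace, _, rest = path.partition("/")
    digest = hashlib.sha256(rest.encode("utf-8")).hexdigest()
    return f"{namespace}/{digest}"
```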

As a side issue, most of these extreme URLs are associated with Google: they are either analytics or some other Google API (including ads). I wonder if it might be possible to filter out analytics, tracking and ads by default, by applying ad and tracking filters when spidering a site -- at least for ZIMs that are officially published by Kiwix? I realize that is a separate issue, and it wouldn't catch all cases.

@rgaudin (Member) commented Aug 1, 2022

Looks like a Kiwix-JS-specific issue. I am not in favor of adding a special case for long URIs; that would just add to the format-in-format issue you've been (rightfully) complaining about. This kind of thing is to be expected with generic scraping.

As for blocking ads, we already block some ad domains in /etc/hosts in zimit. That doesn't prevent entries from being added to the ZIM, though; it's just that those fail to return content during scraping.
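For context, the /etc/hosts mechanism looks roughly like this (example domains; zimit's actual blocklist is not reproduced here):

```
# Requests to these hosts resolve to a non-routable address and return
# no content, but entries for their URLs still end up in the ZIM.
0.0.0.0  doubleclick.net
0.0.0.0  googlesyndication.com
0.0.0.0  google-analytics.com
```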

@Jaifroid (Author) commented Aug 1, 2022

While the issue was discovered as a Kiwix-JS-specific problem, the reason for flagging it here is that the length exceeds the de facto limit identified in the StackOverflow answer, and may therefore cause issues in other clients on other platforms, for example in server software (the article flags a specific issue there). But if you prefer, I can close this issue and add it to the list of issues to be aware of in #86.

@Jaifroid (Author) commented Aug 1, 2022

Closing in favour of #86.

@rgaudin (Member) commented Aug 1, 2022

Let's wait for @kelson42 and @mgautierfr's opinion on this.

@Jaifroid reopened this Aug 1, 2022
@mgautierfr (Contributor) commented
We can agree that a >4KB URL is ridiculous. But as @rgaudin said, it is not "our fault" and we must be prepared for it.

Being prepared doesn't necessarily mean being able to read it, though. We could set an arbitrary limit on the size of the URL. Maybe we can simply say that entries with too long a path are not handled at all (at least in a generic scraper such as warc2zim)? (I'm not sure of the best way to handle this problem, but it seems OK to discard this kind of entry.)

FYI: on the libzim side, we assume that URL/title are pretty short, so we read 256 bytes and try to parse the dirent; if that fails, we read 256 more bytes and reparse, and so on.
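A rough Python sketch of that incremental strategy (libzim itself is C++; the dirent layout and parser below are simplified stand-ins, not libzim's real format):

```python
class NeedMoreData(Exception):
    """Raised when the buffer ends before the dirent is complete."""

def parse_dirent(buf: bytes):
    # Stand-in parser: pretend a dirent is just a NUL-terminated URL
    # followed by a NUL-terminated title.
    parts = buf.split(b"\x00")
    if len(parts) < 3:  # URL or title not yet fully read
        raise NeedMoreData
    return parts[0].decode(), parts[1].decode()

def read_dirent(f, offset: int, chunk: int = 256):
    """Read 256 bytes and try to parse; on failure, read 256 more and retry."""
    buf = b""
    while True:
        f.seek(offset + len(buf))
        more = f.read(chunk)
        if not more:
            raise EOFError("truncated dirent")
        buf += more
        try:
            return parse_dirent(buf)
        except NeedMoreData:
            continue  # a >4KB URL forces many extra rounds of this loop
```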

> That doesn't prevent entries from being added to the ZIM, though; it's just that those fail to return content during scraping.

Can we change that so the entry is not added at all? It's a pity that we have to store 4KB of uncompressed data just to learn that the entry is empty. It may be better to let libzim fail to find the entry in the ZIM file (especially for ads).

@rgaudin (Member) commented Aug 2, 2022

> That doesn't prevent entries from being added to the ZIM, though; it's just that those fail to return content during scraping.
>
> Can we change that so the entry is not added at all? It's a pity that we have to store 4KB of uncompressed data just to learn that the entry is empty. It may be better to let libzim fail to find the entry in the ZIM file (especially for ads).

I'd need to check the consequences of such a move, but I believe the browser would act the same in both scenarios.

@wsdookadr (Contributor) commented Aug 29, 2022

In a project I'm working on, I filter WARC records and drop them depending on their attributes. This happens before conversion to ZIM. I imagine a similar approach could be used here.
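A minimal sketch of such a pre-conversion filter using warcio (which warc2zim itself builds on); the 2048-byte cut-off and the file names are illustrative assumptions:

```python
from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

MAX_URI_LEN = 2048  # illustrative threshold

with open("input.warc.gz", "rb") as inp, open("filtered.warc.gz", "wb") as out:
    writer = WARCWriter(out, gzip=True)
    for record in ArchiveIterator(inp):
        uri = record.rec_headers.get_header("WARC-Target-URI") or ""
        if len(uri) > MAX_URI_LEN:
            continue  # drop records with oversized target URIs
        writer.write_record(record)
```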

wsdookadr added a commit to wsdookadr/warctools that referenced this issue Sep 3, 2022:

- added second pass to find missing resources
- made sure to filter by URI length to address issue openzim/warc2zim#102
- started writing a script to get all the missing resources (in progress)
- added a working bin/wayback_dl.sh for archive.org; it will have to be rewritten in Python
@kelson42 (Contributor) commented
@Jaifroid What is the concrete impact of ZIM files having these very long URLs?

@kelson42 added the "question" (Further information is requested) label Apr 24, 2023
@kelson42 added this to the 1.6.0 milestone Apr 24, 2023
@Jaifroid (Author) commented Apr 24, 2023

> @Jaifroid What is the concrete impact of ZIM files having these very long URLs?

@kelson42 As stated above, two impacts:

  • The de facto browser limit for URLs is 2000 characters, so some browsers could fail on these enormous URLs (depending on the reading system employed and whether it can introduce workarounds);
  • Over 4KB of uncompressed data is stored as the directory entry URL, usually twice -- with both an A/ entry and an H/ entry (>8KB per URL) -- quite often pointing to empty data (since these are often blocked ads), so it wastes valuable space.

However, I had previously closed this issue in favour of #86 (where it is listed), so no need to keep it open any more, as realistically we're probably not going to change this.
