Extremely long dirEntry URLs (>4KB) produced by warc2zim #102
Comments
Looks like a Kiwix-JS specific issue. I am not in favor of adding a special case for long URIs. That would just add to the format-in-format issue you've been (rightfully) complaining about. Those kinds of things are to be expected with generic scraping. As for blocking ads, we already block some ad domains in /etc/hosts in zimit. That doesn't prevent entries from being added to the ZIM though; it's just that those fail to return content during scraping.
While the issue was discovered as a Kiwix-JS-specific problem, the reason for flagging it here is that the length is greater than the de facto limit identified in the StackOverflow answer, and it may therefore cause issues in other clients on different platforms, for example in server software (the article flags a specific issue there). But if you prefer, I can close this issue and add this to the list of issues to be aware of in #86.
Closing in favour of #86.
Let's wait for @kelson42 and @mgautierfr's opinion on this.
We can agree that a >4KB URL is ridiculous. But as @rgaudin said, it is not "our fault" and we must be prepared for it. But being prepared doesn't mean being able to read it. We can have an arbitrary limit on the size of the URL. FYI: on the libzim side, we assume that the URL and title are fairly short, so we read 256 bytes and try to parse the dirent; if that fails, we read 256 more bytes and reparse, and so on.
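To make the read-and-retry strategy above concrete, here is a minimal Python sketch. It is not libzim's actual code (libzim is C++). It assumes a simplified dirent layout, a fixed 16-byte header followed by a NUL-terminated URL and a NUL-terminated title, and the names `read_dirent`, `CHUNK`, `HEADER_SIZE` and `max_size` are hypothetical. The optional `max_size` illustrates the "arbitrary limit" idea.

```python
import io

CHUNK = 256        # bytes fetched per attempt, as described in the comment above
HEADER_SIZE = 16   # assumed fixed-size header before the URL (illustrative only)


def read_dirent(stream, offset, chunk=CHUNK, max_size=None):
    """Return (url, title, reads) for the dirent starting at `offset`.

    The buffer grows by `chunk` bytes per attempt and is reparsed from the
    start each time, until both NUL-terminated strings fit inside it.
    If `max_size` is set, give up once the buffer reaches that size.
    """
    buf = b""
    reads = 0
    while True:
        stream.seek(offset + len(buf))
        more = stream.read(chunk)
        reads += 1
        buf += more

        # Try to parse: URL terminator first, then title terminator.
        body = buf[HEADER_SIZE:]
        url_end = body.find(b"\0")
        title_end = body.find(b"\0", url_end + 1) if url_end != -1 else -1
        if title_end != -1:
            url = body[:url_end].decode("utf-8")
            title = body[url_end + 1:title_end].decode("utf-8")
            return url, title, reads

        if not more:
            raise ValueError("truncated dirent")
        if max_size is not None and len(buf) >= max_size:
            raise ValueError("dirent exceeds max_size")
```

With these assumptions, a short entry is parsed after a single read, while an entry with a URL of several kilobytes needs many reads and reparses. Passing something like `max_size=4096` would make the reader reject such entries instead of reading them.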
Can we change that so the entry is not added? It is a pity that we have to store 4KB of uncompressed data just to learn that the entry is empty. It may be better to let libzim fail to find the entry in the ZIM file (especially for ads).
I'd need to check the consequences of such a move, but I believe the browser would act the same in both scenarios.
In a project I'm working on, I filter WARC records and drop them depending on their attributes. This happens before conversion to ZIM. I imagine a similar approach can be used for this.
- added second pass to find missing resources
- made sure to filter by uri length to address issue 102 openzim/warc2zim#102
- started writing a script to get all the missing resources (in progress)
- added a working bin/wayback_dl.sh for archive.org. it will have to be rewritten to python
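A rough sketch of what such a pre-conversion filter might look like, in Python. The `Record` class is a hypothetical stand-in for a parsed WARC record (it is not the API of any WARC library), and the 2048-byte threshold is an assumption borrowed from the figure mentioned in this issue.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

MAX_URI_BYTES = 2048  # assumed threshold, taken from the figure discussed in this issue


@dataclass
class Record:
    """Hypothetical stand-in for a parsed WARC record."""
    uri: str
    payload: bytes


def drop_long_uris(records: Iterable[Record],
                   max_bytes: int = MAX_URI_BYTES) -> Iterator[Record]:
    """Yield only the records whose UTF-8 encoded URI fits within max_bytes."""
    for record in records:
        if len(record.uri.encode("utf-8")) <= max_bytes:
            yield record
```

Measuring the encoded length rather than the character count matters here, because the size that ends up in the dirEntry is in bytes.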
@Jaifroid What is the concrete impact we have with ZIM files having these very long URLs?
@kelson42 As stated above, two impacts:
However, I had previously closed this issue in favour of #86 (where it is listed), so no need to keep it open any more, as realistically we're probably not going to change this.
At least in Type 1 WARC ZIMs, some ridiculously long URLs are being stored in the ZIM -- larger than 4096 bytes. For example, in armypubs_en_all_2022-06.zim, I got a warning about this dirEntry:
There are similar huge URLs in edu.gcfglobal.org_es_all_2022-06.zim (also a Type 1 ZIM). In all the Type 0 WARC ZIMs I tested, I never encountered a URL longer than 2KB (it would have caused a hang in Kiwix JS Windows). The ZIM type is probably irrelevant here, but it may have something to do with the more recent Replay implementation now being used to generate these Type 1 ZIMs? Or it may just be a coincidence that I've only seen them in two Type 1 ZIMs.
In any case, the reason I'm flagging this (apart from the fact that it caused a serious issue with the Kiwix JS / Windows backend, being fixed in kiwix/kiwix-js#876), is because:
A/ prefix or namespace, and once with an H/ prefix or namespace. A ZIM could contain hundreds of such URLs.
It shouldn't be too hard to reduce them -- e.g. if URL > 2048 bytes, calculate the unique SHA-256 and substitute that for the URL (after the prefix) when the dirEntries are being generated for the ZIM.
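As an illustration of the substitution suggested above, a minimal Python sketch follows. This is not warc2zim code; `shorten_url`, `dirent_path` and `MAX_URL_BYTES` are hypothetical names, and using the hex digest as the replacement is an assumption about how the hash would be written out.

```python
import hashlib

MAX_URL_BYTES = 2048  # threshold proposed in this issue


def shorten_url(url: str, max_bytes: int = MAX_URL_BYTES) -> str:
    """Return the URL unchanged if it fits, else its SHA-256 hex digest."""
    raw = url.encode("utf-8")
    if len(raw) <= max_bytes:
        return url
    # 64 hex characters, regardless of how long the original URL was
    return hashlib.sha256(raw).hexdigest()


def dirent_path(prefix: str, url: str) -> str:
    """Build the stored path: the prefix stays, only the URL part is replaced."""
    return prefix + shorten_url(url)
```

Because the mapping is deterministic, the same over-long URL would produce the same short path under both the A/ and H/ prefixes. Whatever resolves requests against the ZIM would presumably have to apply the same function before lookup, which is a cost of this approach.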
As a side issue, most of these extreme URLs are associated with Google: they are either analytics, or some other Google API (including ads). I wonder if it might be possible to filter out the analytics, tracking and ads by default, by using ad and tracking filters when spidering a site -- at least for ZIMs that are officially published by Kiwix? I realize that is a separate issue, and wouldn't catch all cases.