
Extremely long dirEntry URLs (>4KB) produced by warc2zim #102

Closed
Jaifroid opened this issue Jul 31, 2022 · 9 comments
Labels: question (Further information is requested)
Milestone: 1.6.0

@Jaifroid commented Jul 31, 2022

At least in Type 1 WARC ZIMs, some ridiculously long URLs are being stored in the ZIM -- larger than 4096 bytes. For example, in armypubs_en_all_2022-06.zim, I got a warning about this dirEntry:

[screenshot: warning showing the oversized dirEntry URL]

There are similar huge URLs in edu.gcfglobal.org_es_all_2022-06.zim (also a Type 1 ZIM). In all the Type 0 WARC ZIMs I tested, I never encountered a URL longer than 2KB (it would have caused a hang in Kiwix JS Windows). The ZIM type is probably irrelevant here, but it may have something to do with the more recent Replay implementation now being used to generate these Type 1 ZIMs? Or it may just be a coincidence that I've only seen them in two Type 1 ZIMs.

In any case, the reason I'm flagging this (apart from the fact that it caused a serious issue with the Kiwix JS / Windows backend, fixed in kiwix/kiwix-js#876) is that:

It shouldn't be too hard to reduce them -- e.g. if a URL is longer than 2048 bytes, calculate its unique SHA-256 hash and substitute that for the URL (after the namespace prefix) when the dirEntries are generated for the ZIM.
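A minimal sketch of that idea in Python (warc2zim's implementation language); the threshold, function name, and path layout here are assumptions for illustration, not existing warc2zim code:

```python
import hashlib

MAX_URL_BYTES = 2048  # suggested cut-off; any value well below 4KB would do

def shorten_path(path: str) -> str:
    """Replace an oversized ZIM path with '<namespace>/<sha256-hex>'.

    The namespace prefix (e.g. 'A' or 'H') is preserved so the entry
    stays in the right part of the ZIM; the 64-character digest is
    unique per URL, so internal links can be rewritten consistently.
    """
    if len(path.encode("utf-8")) <= MAX_URL_BYTES:
        return path
    namespace, _, rest = path.partition("/")
    digest = hashlib.sha256(rest.encode("utf-8")).hexdigest()
    return f"{namespace}/{digest}"
```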

As a side issue, most of these extreme URLs are associated with Google: they are either analytics or some other Google API (including ads). I wonder if it might be possible to filter out analytics, tracking and ads by default, by applying ad and tracking filters when spidering a site -- at least for ZIMs that are officially published by Kiwix? I realize that is a separate issue, and it wouldn't catch all cases.

@rgaudin (Member) commented Aug 1, 2022

Looks like a Kiwix-JS-specific issue. I am not in favor of adding a special case for long URIs; that would just add to the format-in-format issue you've been (rightfully) complaining about. This kind of thing is to be expected with generic scraping.

As for blocking ads, we already block some ad domains in /etc/hosts in zimit. That doesn't prevent entries from being added to the ZIM, though; it's just that those fail to return content during scraping.
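For context, the /etc/hosts mechanism looks roughly like this (example domains; zimit's actual blocklist is not reproduced here):

```
# Requests to these hosts resolve to a non-routable address and return
# no content, but entries for their URLs still end up in the ZIM.
0.0.0.0  doubleclick.net
0.0.0.0  googlesyndication.com
0.0.0.0  google-analytics.com
```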

@Jaifroid (Author) commented Aug 1, 2022

While the issue was discovered as a Kiwix-JS-specific problem, the reason for flagging it here is that the length exceeds the de facto limit identified in the StackOverflow answer, and may therefore cause issues in other clients on other platforms, for example in server software (the article flags a specific issue there). But if you prefer, I can close this issue and add it to the list of issues to be aware of in #86.

@Jaifroid (Author) commented Aug 1, 2022

Closing in favour of #86.

@rgaudin (Member) commented Aug 1, 2022

Let's wait for @kelson42 and @mgautierfr's opinion on this.

@Jaifroid reopened this Aug 1, 2022
@mgautierfr (Contributor) commented
We can agree that a >4KB URL is ridiculous. But as @rgaudin said, it is not "our fault" and we must be prepared for it.

Being prepared doesn't necessarily mean being able to read it, though. We could set an arbitrary limit on the size of the URL. Maybe we can simply say that entries with too long a path are not handled at all (at least in a generic scraper such as warc2zim)? (I'm not sure of the best way to handle this problem, but it seems OK to discard this kind of entry.)

FYI: on the libzim side, we assume that URL/title are pretty short, so we read 256 bytes and try to parse the dirent; if that fails, we read 256 more bytes and reparse, and so on.
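A rough Python sketch of that incremental strategy (libzim itself is C++; the dirent layout and parser below are simplified stand-ins, not libzim's real format):

```python
class NeedMoreData(Exception):
    """Raised when the buffer ends before the dirent is complete."""

def parse_dirent(buf: bytes):
    # Stand-in parser: pretend a dirent is just a NUL-terminated URL
    # followed by a NUL-terminated title.
    parts = buf.split(b"\x00")
    if len(parts) < 3:  # URL or title not yet fully read
        raise NeedMoreData
    return parts[0].decode(), parts[1].decode()

def read_dirent(f, offset: int, chunk: int = 256):
    """Read 256 bytes and try to parse; on failure, read 256 more and retry."""
    buf = b""
    while True:
        f.seek(offset + len(buf))
        more = f.read(chunk)
        if not more:
            raise EOFError("truncated dirent")
        buf += more
        try:
            return parse_dirent(buf)
        except NeedMoreData:
            continue  # a >4KB URL forces many extra rounds of this loop
```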

> That doesn't prevent entries from being added to the ZIM, though; it's just that those fail to return content during scraping.

Can we change that so the entry is not added at all? It's a pity that we have to store 4KB of uncompressed data just to learn that the entry is empty. It may be better to let libzim fail to find the entry in the ZIM file (especially for ads).

@rgaudin (Member) commented Aug 2, 2022

> That doesn't prevent entries from being added to the ZIM, though; it's just that those fail to return content during scraping.
>
> Can we change that so the entry is not added at all? It's a pity that we have to store 4KB of uncompressed data just to learn that the entry is empty. It may be better to let libzim fail to find the entry in the ZIM file (especially for ads).

I'd need to check the consequences of such a move, but I believe the browser would act the same in both scenarios.

@wsdookadr (Contributor) commented Aug 29, 2022

In a project I'm working on, I filter WARC records and drop them depending on their attributes. This happens before conversion to ZIM. I imagine a similar approach could be used here.
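A minimal sketch of such a pre-conversion filter using warcio (which warc2zim itself builds on); the 2048-byte cut-off and the file names are illustrative assumptions:

```python
from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

MAX_URI_LEN = 2048  # illustrative threshold

with open("input.warc.gz", "rb") as inp, open("filtered.warc.gz", "wb") as out:
    writer = WARCWriter(out, gzip=True)
    for record in ArchiveIterator(inp):
        uri = record.rec_headers.get_header("WARC-Target-URI") or ""
        if len(uri) > MAX_URI_LEN:
            continue  # drop records with oversized target URIs
        writer.write_record(record)
```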

wsdookadr added a commit to wsdookadr/warctools that referenced this issue Sep 3, 2022:

- added second pass to find missing resources
- made sure to filter by URI length to address issue openzim/warc2zim#102
- started writing a script to get all the missing resources (in progress)
- added a working bin/wayback_dl.sh for archive.org; it will have to be rewritten in Python
@kelson42 (Contributor) commented
@Jaifroid What is the concrete impact of ZIM files having these very long URLs?

@kelson42 added the "question" (Further information is requested) label Apr 24, 2023
@kelson42 added this to the 1.6.0 milestone Apr 24, 2023
@Jaifroid (Author) commented Apr 24, 2023

> @Jaifroid What is the concrete impact of ZIM files having these very long URLs?

@kelson42 As stated above, two impacts:

  • The de facto browser limit for URLs is 2000 characters, so some browsers could fail on these enormous URLs (depending on the reading system employed and whether it can introduce workarounds);
  • Over 4KB of uncompressed data is stored as the directory entry URL, usually twice -- with both an A/ entry and an H/ entry (>8KB per URL) -- quite often pointing to empty data (since these are often blocked ads), so it wastes valuable space.

However, I had previously closed this issue in favour of #86 (where it is listed), so no need to keep it open any more, as realistically we're probably not going to change this.
