Document limitation around content-type: attachement
benoit74 authored May 30, 2024
1 parent 97eb10b commit 0219021
Showing 1 changed file with 2 additions and 1 deletion.
3 changes: 2 additions & 1 deletion README.md
@@ -38,6 +38,7 @@ Scenario which are known to work well:
- servers do not serve two different resources at the same hostname and path on different ports. E.g. if `http://www.acme.com:80/resource1` and `http://www.acme.com:8080/resource1` both exist AND lead to different resources, the scraper will include in the ZIM only the first resource fetched and silently ignore all other resources in conflict
- corollary: servers do not serve two different resources at the same hostname and path over HTTP and HTTPS. E.g. if `http://www.acme.com/resource1` and `https://www.acme.com/resource1` both exist AND lead to different resources, the scraper will include in the ZIM only the first resource fetched and silently ignore all other resources in conflict
- Scraper does not store HTTP response headers: these headers are not stored inside the ZIM and are not replayed; any website requiring them will be broken
- Files with a `Content-Disposition: attachment` response header are expected to be automatically saved by the browser. This does not happen for now (see https://github.com/openzim/warc2zim/issues/288).
- Scraper does not take HTTP request headers into account: if different request header values lead to two different pages / resources, the scraper ignores this information
- User-Agent: as a corollary of the point above on HTTP request headers, the scraper assumes a single User-Agent has been used to create the WARC files; if the website provides different content based on the User-Agent, only one version will be used
- HTTP return codes have known limitations:
@@ -57,7 +58,7 @@ Scenario which are known to work well:
- Web workers are not yet supported (see https://github.com/openzim/warc2zim/issues/272)
- Service workers are not supported and will most probably never be
- Inline JS code inside an onxxx HTML event attribute (e.g. onclick, onhover, ...) is rewritten, so for instance a redirection to another page handled with these events works
- However, since URL rewriting is performed with dynamic JS rewriting, at this stage the scraper cannot know what is inside the ZIM and what is external; all URLs are hence assumed to be internal, which might break some dynamic redirections to an online website

It is also important to note that warc2zim is inherently limited to what is present inside the WARC. A bad WARC can only produce a bad ZIM. Garbage in, garbage out.
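
The port and scheme limitation listed in this README hunk can be illustrated with a minimal sketch. It is not warc2zim's actual code: the helper names `conflict_key` and `keep_first` are hypothetical, and the assumption that the conflict key is made of hostname, path and query string (dropping scheme and port) is only inferred from the wording of the limitation.

```python
from urllib.parse import urlsplit


def conflict_key(url: str) -> str:
    """Hypothetical key: hostname + path (+ query), ignoring scheme and port."""
    parts = urlsplit(url)
    key = (parts.hostname or "") + parts.path
    if parts.query:
        key += "?" + parts.query
    return key


def keep_first(urls):
    """Keep the first URL seen for each key; collect the ones silently ignored."""
    kept = {}
    ignored = []
    for url in urls:
        key = conflict_key(url)
        if key in kept:
            # A resource with the same hostname and path was already stored.
            ignored.append(url)
        else:
            kept[key] = url
    return kept, ignored


kept, ignored = keep_first(
    [
        "http://www.acme.com:80/resource1",
        "http://www.acme.com:8080/resource1",
        "https://www.acme.com/resource1",
    ]
)
print(kept)
print(ignored)
```

Under these assumptions, only the first of the three URLs is kept; the other two collide on the same key and are dropped, which matches the "first resource fetched" behaviour described in the list.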
