Use Warc-Resource-Type header to decide how to rewrite a WARC record #296
The resource is present at https://www.synology.com/font/fontawesome-webfont.woff2?v=4.7.0 and the failure occurs when trying to include it in the ZIM, at the point where the scraper checks whether it has to be rewritten (HTML/JS/CSS ...). The stacktrace is something like this (reproduced locally at 060cbd6):
The scraper hence considered that this had to be rewritten as HTML and tried to get a decoded string from the binary content of the woff2 font, which fails for obvious reasons. These are the details we have about the WARC record:
As one can see, the declared mimetype does not match the actual content of the record. Currently the scraper uses this mimetype (from the Content-Type response header) to decide if / how the WARC record needs to be rewritten: warc2zim/src/warc2zim/content_rewriting/generic.py, lines 124 to 150 in 060cbd6
Basing the decision only on the Content-Type header is obviously a tradeoff between rewriting too much (as here) and too little (not rewriting something that did in fact need it). I propose however to be more resilient by taking advantage of the new WARC-Resource-Type header. I propose to alter the logic to:
This can clearly wait for 2.1, since the core problem is that the server is lying to the scraper, and such a change will need a bit of testing before we can declare that it has only the expected impact.
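As a rough illustration of the idea (this is not the actual warc2zim code), the decision could prefer the resource type recorded by the crawler and fall back to the Content-Type mimetype only when that hint is missing. The function name, the rewrite-mode labels, and the resource-type values (`document`, `stylesheet`, `script`, `font`) are all assumptions made for this sketch.

```python
from typing import Optional

# Hypothetical mappings, for illustration only: the real values written in
# WARC-Resource-Type and the real rewrite modes may differ.
RESOURCE_TYPE_TO_MODE = {
    "document": "html",
    "stylesheet": "css",
    "script": "js",
}
MIMETYPE_TO_MODE = {
    "text/html": "html",
    "text/css": "css",
    "text/javascript": "js",
    "application/javascript": "js",
}


def get_rewrite_mode(
    resource_type: Optional[str], content_type: Optional[str]
) -> Optional[str]:
    """Return "html", "css" or "js", or None when the record is stored untouched."""
    if resource_type:
        # The hint is present: trust it, even when it means "do not rewrite"
        # (for instance a font), whatever the Content-Type header claims.
        return RESOURCE_TYPE_TO_MODE.get(resource_type.strip().lower())
    if content_type:
        # Fallback: drop parameters such as "; charset=UTF-8" before the lookup.
        mimetype = content_type.split(";", 1)[0].strip().lower()
        return MIMETYPE_TO_MODE.get(mimetype)
    return None
```

With such a function, a record flagged as `font` would be left untouched even if its Content-Type claimed HTML, while records without the hint would keep the current mimetype-based behavior.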
LGTM, except that we are a bit unclear on the impact, as you said. I think it's a better approach than the current one, as there is no obligation to return a Content-Type, nor to return a valid one. It's a convention, and with the professionalization of the web and the weight of tech giants it is now mainstream. But zimit's goal is browsing fidelity, not tech-spec validation, so whatever works in the browser should be the goal. In that sense, using those hints from the browser makes a lot more sense and should be preferred when available.
I just realized we could (and probably should) easily keep both approaches in parallel for 2.1: use the result from the new approach, but raise WARNINGs when the results of the two approaches differ. This will help to check for regressions during 2.1 tests AND help to diagnose problems in production once 2.1 is released.
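A minimal sketch of that side-by-side check, assuming both decisions have already been computed elsewhere and are passed in as plain values. The logger name, the function name and its parameters are hypothetical and do not come from warc2zim.

```python
import logging
from typing import Optional

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("rewrite_mode_check")


def pick_mode_with_warning(
    url: str, new_mode: Optional[str], old_mode: Optional[str]
) -> Optional[str]:
    """Return the mode chosen by the new approach, warning if the old one disagrees.

    new_mode: decision derived from the resource-type hint.
    old_mode: decision derived from the Content-Type mimetype.
    """
    if new_mode != old_mode:
        logger.warning(
            "Rewrite mode mismatch for %s: resource type gives %r, content-type gives %r",
            url,
            new_mode,
            old_mode,
        )
    # The new approach always wins; the old one is only kept for diagnosis.
    return new_mode
```

Because the old decision never affects the return value, the comparison can be removed later without changing behavior.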
This also caused the failure of https://farm.openzim.org/pipeline/32a2ad19-1ceb-4679-9d16-0b7d92f46c23
Logs:
Impossible to decode item www.synology.com/font/fontawesome-webfont.woff2?v=4.7.0
URL:
https://www.synology.com/en-br