Use Warc-Resource-Type header to decide how to rewrite a WARC record #296

Closed
benoit74 opened this issue Jun 3, 2024 · 4 comments · Fixed by #306

@benoit74
Collaborator

benoit74 commented Jun 3, 2024

Logs: Impossible to decode item www.synology.com/font/fontawesome-webfont.woff2?v=4.7.0

URL: https://www.synology.com/en-br

@benoit74 benoit74 added this to the 2.0.0 milestone Jun 3, 2024
@benoit74
Collaborator Author

benoit74 commented Jun 3, 2024

Resource is present at https://www.synology.com/font/fontawesome-webfont.woff2?v=4.7.0

Failure occurs when trying to include the resource in the ZIM, considering it might have to be rewritten (HTML/JS/CSS ...).

Stacktrace is something like this (this has been reproduced locally at 060cbd6):

[warc2zim::2024-06-03 18:37:10,694] INFO:Expecting 7252 ZIM entries including redirects
[warc2zim::2024-06-03 18:37:12,041] ERROR:Problem encountered while processing https://www.synology.com/font/fontawesome-webfont.woff2?v=4.7.0.
Traceback (most recent call last):
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 85, in content_str
    result = to_string(self.content, self.encoding)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/utils.py", line 212, in to_string
    raise ValueError(f"Impossible to decode content {input_[:200]}")
ValueError: Impossible to decode content b'wOF2\x00\x01\x00\x00\x00\x01-h\x00\r\x00\x00\x00\x02\x00\x01-\x0e\x00\x04\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00?FFTM\x1c\x1a \x06`\x00r\x11\x08\n(X\x016\x02$\x03p\x0b\x10\x00\x04 \x05\x06\x07u[R\trGa\r\':\x1a&=r*\n\x02\x19\x07nF|\x14\x08fm`$\xd8\x91@d[BQ\x11$([U<+(@P\x1e\x0e;lh\xd4\xa8y%\xdb\x81^\x14G3\x12nDp\\Yr Lt)6R"S\x0bL~CXR\x15\t4y\\[\x1ds\xe0\xbb\x8cq\x1eM%K\x17.\xdb\xba\x0e,\x0bt\'M\x1d,\x11\x15cs^.\x07\x0ch&gb\'\x0f6:'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/converter.py", line 321, in run
    self.add_items_for_warc_record(record)
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/converter.py", line 733, in add_items_for_warc_record
    payload_item = WARCPayloadItem(
                   ^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/items.py", line 43, in __init__
    ).rewrite(pre_head_template, post_head_template)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 106, in rewrite
    return self.rewrite_html(pre_head_template, post_head_template)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 185, in rewrite_html
    ).rewrite(self.content_str)
              ^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 98, in content_str
    raise RuntimeError(f"Impossible to decode item {self.path.value}") from e
RuntimeError: Impossible to decode item www.synology.com/font/fontawesome-webfont.woff2?v=4.7.0
[warc2zim::2024-06-03 18:37:12,042] ERROR:Scraper will stop. Pass --verbose flag for more details.
Traceback (most recent call last):
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 85, in content_str
    result = to_string(self.content, self.encoding)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/utils.py", line 212, in to_string
    raise ValueError(f"Impossible to decode content {input_[:200]}")
ValueError: Impossible to decode content b'wOF2\x00\x01\x00\x00\x00\x01-h\x00\r\x00\x00\x00\x02\x00\x01-\x0e\x00\x04\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00?FFTM\x1c\x1a \x06`\x00r\x11\x08\n(X\x016\x02$\x03p\x0b\x10\x00\x04 \x05\x06\x07u[R\trGa\r\':\x1a&=r*\n\x02\x19\x07nF|\x14\x08fm`$\xd8\x91@d[BQ\x11$([U<+(@P\x1e\x0e;lh\xd4\xa8y%\xdb\x81^\x14G3\x12nDp\\Yr Lt)6R"S\x0bL~CXR\x15\t4y\\[\x1ds\xe0\xbb\x8cq\x1eM%K\x17.\xdb\xba\x0e,\x0bt\'M\x1d,\x11\x15cs^.\x07\x0ch&gb\'\x0f6:'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/benoit/Repos/openzim/warc2zim/.hatch/warc2zim/bin/warc2zim", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/main.py", line 115, in main
    return converter.run()
           ^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/converter.py", line 321, in run
    self.add_items_for_warc_record(record)
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/converter.py", line 733, in add_items_for_warc_record
    payload_item = WARCPayloadItem(
                   ^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/items.py", line 43, in __init__
    ).rewrite(pre_head_template, post_head_template)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 106, in rewrite
    return self.rewrite_html(pre_head_template, post_head_template)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 185, in rewrite_html
    ).rewrite(self.content_str)
              ^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 98, in content_str
    raise RuntimeError(f"Impossible to decode item {self.path.value}") from e
RuntimeError: Impossible to decode item www.synology.com/font/fontawesome-webfont.woff2?v=4.7.0

The scraper hence considered that this resource had to be rewritten as HTML and tried to get a decoded string from the binary content of the woff2 font, which fails for obvious reasons.

These are the details we have about the WARC record:

### REC Headers ###
WARC/1.1
WARC-Page-ID: 593863b3-215a-4b5d-883c-e42296b62846
WARC-Resource-Type: font
WARC-JSON-Metadata: {"ipType":"Public"}
WARC-Target-URI: https://www.synology.com/font/fontawesome-webfont.woff2?v=4.7.0
WARC-Date: 2024-06-03T15:09:58.603Z
WARC-Type: response
WARC-Record-ID: <urn:uuid:bf04c2f7-2efb-4e15-ae8c-d7d5663f6cdd>
Content-Type: application/http; msgtype=response
WARC-Payload-Digest: sha256:d5dbe350d2e95210ec0e04b251afb682403dbb851f7e408778fd509498511bf4
WARC-Block-Digest: sha256:16c28808ec3911005aacf07d250ca06c98bedefd2adaac5f56ba2b26f2b0859f
Content-Length: 33418

### HTTP Headers ###
HTTP/1.1 200 OK
content-type: text/html
server: nginx
last-modified: Mon, 21 Jun 2021 08:56:33 GMT
strict-transport-security: max-age=31536000; preload
x-content-type-options: nosniff
x-xss-protection: 1; mode=block
date: Sun, 02 Jun 2024 20:30:41 GMT
etag: W/"60d05441-12d68"
vary: Accept-Encoding
x-cache: Hit from cloudfront
via: 1.1 79b38e01cf5e16de2ad2a0ec2187e7f4.cloudfront.net (CloudFront)
x-amz-cf-pop: HEL50-C2
x-amz-cf-id: GYC8i3zVgw31oKQx5PWzHPVKU_9buT1NhGGNjmZuZvpjcqPmM_f5ZA==
age: 74325

As one can see, the content-type returned by the web server is wrong: text/html is not the correct mimetype for a woff2 font.

Currently the scraper uses this mimetype (from the content-type response header) to decide if / how the WARC record needs to be rewritten:

def get_rewrite_mode(self, record, mimetype):
    if mimetype == "text/html":
        if getattr(record, "method", "GET") == "POST":
            return None
        # TODO : Handle header "Accept" == "application/json"
        return "html"
    if mimetype == "text/css":
        return "css"
    if mimetype in [
        "text/javascript",
        "application/javascript",
        "application/x-javascript",
    ]:
        if extract_jsonp_callback(self.orig_url_str):
            return "jsonp"
        if self.path.value.endswith(".json"):
            return "json"
        return "javascript"
    if mimetype == "application/json":
        return "json"
    return None

Basing the decision only on the content-type header is obviously a tradeoff between rewriting too much (as here) and rewriting too little (not rewriting something that in fact needed it).

I propose however to be more resilient by taking advantage of the new WARC-Resource-Type WARC header, whose values come from https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType ; since it tells us how the browser treated the resource for its own usage, it is clearly more in line with the information we need.

I propose to alter the logic to:

  • rewrite as HTML if WARC-Resource-Type is Document and HTTP method is GET (rather than "not POST" as of today, since PUT, PATCH and DELETE responses probably deserve the same treatment as POST)
  • rewrite as CSS if WARC-Resource-Type is Stylesheet
  • if WARC-Resource-Type is Script, then continue same logic as today based on the mimetype to differentiate javascript from json and other mimetypes
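
The three rules above could be sketched roughly as follows. This is only an illustration under assumptions, not the code merged in the fix: the function name `rewrite_mode_from_resource_type` and its parameters are hypothetical, the comparison is made case-insensitive because the record above carries `font` in lowercase, and returning None for a Script with an unrecognized mimetype is a guess about what "other mimetypes" should mean. What to do when the header is absent is not covered here.

```python
from typing import Optional

# Same JavaScript mimetypes as in the current get_rewrite_mode
JS_MIMETYPES = {
    "text/javascript",
    "application/javascript",
    "application/x-javascript",
}


def rewrite_mode_from_resource_type(
    resource_type: Optional[str],
    mimetype: str,
    method: str = "GET",
    path: str = "",
    is_jsonp: bool = False,
) -> Optional[str]:
    """Hypothetical helper: pick a rewrite mode from WARC-Resource-Type."""
    # Assumption: normalize case, since the sample record holds "font"
    # while the proposal is written with "Document", "Stylesheet", "Script".
    kind = (resource_type or "").lower()
    if kind == "document":
        # Only GET responses are rewritten as HTML
        return "html" if method.upper() == "GET" else None
    if kind == "stylesheet":
        return "css"
    if kind == "script":
        # Same mimetype-based distinction as today, limited to scripts
        if mimetype in JS_MIMETYPES:
            if is_jsonp:
                return "jsonp"
            if path.endswith(".json"):
                return "json"
            return "javascript"
        if mimetype == "application/json":
            return "json"
        return None  # assumption: unknown script mimetypes are left as-is
    # Fonts, images, media, ... are never rewritten
    return None
```

With such a helper, the font record from this issue (resource type `font`, content-type `text/html`) would no longer be sent to the HTML rewriter.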

This can clearly wait for 2.1, since the core problem is that the server is lying to the scraper, and such a change will need a bit of testing before we can declare that it has only the expected impact.

@benoit74 benoit74 modified the milestones: 2.0.0, 2.1.0 Jun 3, 2024
@benoit74 benoit74 changed the title Yet another decoding issue on fontawesome-webfont.woff2 Use Warc-Resource-Type header to decide how to rewrite a WARC record Jun 3, 2024
@rgaudin
Member

rgaudin commented Jun 4, 2024

LGTM except we are a bit unclear on the impact, as you said.

I think it's a better approach than the current one, as there is no obligation to return a content-type, nor to return a valid one. It's a convention, and with the professionalization of the web and the weight of tech giants, it is now mainstream.

But zimit goal is a browsing fidelity one, not a tech-spec-validator, so whatever works in the browser should be the goal. In that sense, using those hints from the browser makes a lot more sense and should be preferred when available.

@benoit74
Collaborator Author

benoit74 commented Jun 4, 2024

I just realized we could (and probably should) easily keep both approaches in parallel for 2.1: use the result from the new approach, but raise WARNINGs when the results of the two approaches differ. This will help to check for non-regression during 2.1 tests AND help to diagnose problems in production once 2.1 is released.
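
A minimal sketch of that idea, assuming both modes have already been computed for a record. The function name `choose_rewrite_mode` and the logger name are hypothetical and only serve the illustration.

```python
import logging
from typing import Optional

# Hypothetical logger name, not the scraper's actual logger
logger = logging.getLogger("rewrite-mode-check")


def choose_rewrite_mode(
    url: str,
    mode_from_mimetype: Optional[str],
    mode_from_resource_type: Optional[str],
) -> Optional[str]:
    """Use the new approach, but warn when the old one disagrees."""
    if mode_from_mimetype != mode_from_resource_type:
        logger.warning(
            "Rewrite mode mismatch for %s: mimetype says %r, "
            "WARC-Resource-Type says %r",
            url,
            mode_from_mimetype,
            mode_from_resource_type,
        )
    # The resource-type based result always wins
    return mode_from_resource_type
```

For the font from this issue, the mimetype-based mode would be "html" and the resource-type based mode would be None, so a warning would be logged and the record would not be rewritten.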

@benoit74
Collaborator Author

benoit74 commented Jun 7, 2024

This also caused the failure of https://farm.openzim.org/pipeline/32a2ad19-1ceb-4679-9d16-0b7d92f46c23
