Use Warc-Resource-Type header to decide how to rewrite a WARC record #296

benoit74 · 2024-06-03T13:06:54Z

Logs: Impossible to decode item www.synology.com/font/fontawesome-webfont.woff2?v=4.7.0

URL: https://www.synology.com/en-br

benoit74 · 2024-06-03T19:13:52Z

Resource is present at https://www.synology.com/font/fontawesome-webfont.woff2?v=4.7.0

The failure occurs when adding the resource to the ZIM, at the step where the scraper decides whether the content has to be rewritten (HTML/JS/CSS ...).

Stacktrace is something like this (this has been reproduced locally at 060cbd6):

[warc2zim::2024-06-03 18:37:10,694] INFO:Expecting 7252 ZIM entries including redirects
[warc2zim::2024-06-03 18:37:12,041] ERROR:Problem encountered while processing https://www.synology.com/font/fontawesome-webfont.woff2?v=4.7.0.
Traceback (most recent call last):
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 85, in content_str
    result = to_string(self.content, self.encoding)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/utils.py", line 212, in to_string
    raise ValueError(f"Impossible to decode content {input_[:200]}")
ValueError: Impossible to decode content b'wOF2\x00\x01\x00\x00\x00\x01-h\x00\r\x00\x00\x00\x02\x00\x01-\x0e\x00\x04\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00?FFTM\x1c\x1a \x06`\x00r\x11\x08\n(X\x016\x02$\x03p\x0b\x10\x00\x04 \x05\x06\x07u[R\trGa\r\':\x1a&=r*\n\x02\x19\x07nF|\x14\x08fm`$\xd8\x91@d[BQ\x11$([U<+(@P\x1e\x0e;lh\xd4\xa8y%\xdb\x81^\x14G3\x12nDp\\Yr Lt)6R"S\x0bL~CXR\x15\t4y\\[\x1ds\xe0\xbb\x8cq\x1eM%K\x17.\xdb\xba\x0e,\x0bt\'M\x1d,\x11\x15cs^.\x07\x0ch&gb\'\x0f6:'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/converter.py", line 321, in run
    self.add_items_for_warc_record(record)
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/converter.py", line 733, in add_items_for_warc_record
    payload_item = WARCPayloadItem(
                   ^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/items.py", line 43, in __init__
    ).rewrite(pre_head_template, post_head_template)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 106, in rewrite
    return self.rewrite_html(pre_head_template, post_head_template)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 185, in rewrite_html
    ).rewrite(self.content_str)
              ^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 98, in content_str
    raise RuntimeError(f"Impossible to decode item {self.path.value}") from e
RuntimeError: Impossible to decode item www.synology.com/font/fontawesome-webfont.woff2?v=4.7.0
[warc2zim::2024-06-03 18:37:12,042] ERROR:Scraper will stop. Pass --verbose flag for more details.
Traceback (most recent call last):
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 85, in content_str
    result = to_string(self.content, self.encoding)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/utils.py", line 212, in to_string
    raise ValueError(f"Impossible to decode content {input_[:200]}")
ValueError: Impossible to decode content b'wOF2\x00\x01\x00\x00\x00\x01-h\x00\r\x00\x00\x00\x02\x00\x01-\x0e\x00\x04\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00?FFTM\x1c\x1a \x06`\x00r\x11\x08\n(X\x016\x02$\x03p\x0b\x10\x00\x04 \x05\x06\x07u[R\trGa\r\':\x1a&=r*\n\x02\x19\x07nF|\x14\x08fm`$\xd8\x91@d[BQ\x11$([U<+(@P\x1e\x0e;lh\xd4\xa8y%\xdb\x81^\x14G3\x12nDp\\Yr Lt)6R"S\x0bL~CXR\x15\t4y\\[\x1ds\xe0\xbb\x8cq\x1eM%K\x17.\xdb\xba\x0e,\x0bt\'M\x1d,\x11\x15cs^.\x07\x0ch&gb\'\x0f6:'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/benoit/Repos/openzim/warc2zim/.hatch/warc2zim/bin/warc2zim", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/main.py", line 115, in main
    return converter.run()
           ^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/converter.py", line 321, in run
    self.add_items_for_warc_record(record)
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/converter.py", line 733, in add_items_for_warc_record
    payload_item = WARCPayloadItem(
                   ^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/items.py", line 43, in __init__
    ).rewrite(pre_head_template, post_head_template)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 106, in rewrite
    return self.rewrite_html(pre_head_template, post_head_template)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 185, in rewrite_html
    ).rewrite(self.content_str)
              ^^^^^^^^^^^^^^^^
  File "/home/benoit/Repos/openzim/warc2zim/src/warc2zim/content_rewriting/generic.py", line 98, in content_str
    raise RuntimeError(f"Impossible to decode item {self.path.value}") from e
RuntimeError: Impossible to decode item www.synology.com/font/fontawesome-webfont.woff2?v=4.7.0

The scraper hence decided that this resource had to be rewritten as HTML and tried to get a decoded string from the binary content of the woff2 font, which fails for obvious reasons.

These are the details we have about the WARC record:

### REC Headers ###
WARC/1.1
WARC-Page-ID: 593863b3-215a-4b5d-883c-e42296b62846
WARC-Resource-Type: font
WARC-JSON-Metadata: {"ipType":"Public"}
WARC-Target-URI: https://www.synology.com/font/fontawesome-webfont.woff2?v=4.7.0
WARC-Date: 2024-06-03T15:09:58.603Z
WARC-Type: response
WARC-Record-ID: <urn:uuid:bf04c2f7-2efb-4e15-ae8c-d7d5663f6cdd>
Content-Type: application/http; msgtype=response
WARC-Payload-Digest: sha256:d5dbe350d2e95210ec0e04b251afb682403dbb851f7e408778fd509498511bf4
WARC-Block-Digest: sha256:16c28808ec3911005aacf07d250ca06c98bedefd2adaac5f56ba2b26f2b0859f
Content-Length: 33418

### HTTP Headers ###
HTTP/1.1 200 OK
content-type: text/html
server: nginx
last-modified: Mon, 21 Jun 2021 08:56:33 GMT
strict-transport-security: max-age=31536000; preload
x-content-type-options: nosniff
x-xss-protection: 1; mode=block
date: Sun, 02 Jun 2024 20:30:41 GMT
etag: W/"60d05441-12d68"
vary: Accept-Encoding
x-cache: Hit from cloudfront
via: 1.1 79b38e01cf5e16de2ad2a0ec2187e7f4.cloudfront.net (CloudFront)
x-amz-cf-pop: HEL50-C2
x-amz-cf-id: GYC8i3zVgw31oKQx5PWzHPVKU_9buT1NhGGNjmZuZvpjcqPmM_f5ZA==
age: 74325

As one can see, the content-type returned by the webserver is wrong: text/html is not the correct mimetype for a woff2 font.

Currently the scraper uses this mimetype (from the content-type response header) to decide if / how the WARC record needs to be rewritten:

warc2zim/src/warc2zim/content_rewriting/generic.py

Lines 124 to 150 in 060cbd6

    def get_rewrite_mode(self, record, mimetype):
        if mimetype == "text/html":
            if getattr(record, "method", "GET") == "POST":
                return None
            # TODO : Handle header "Accept" == "application/json"
            return "html"

        if mimetype == "text/css":
            return "css"

        if mimetype in [
            "text/javascript",
            "application/javascript",
            "application/x-javascript",
        ]:
            if extract_jsonp_callback(self.orig_url_str):
                return "jsonp"
            if self.path.value.endswith(".json"):
                return "json"
            return "javascript"

        if mimetype == "application/json":
            return "json"

        return None

Basing the decision only on the content-type header is obviously a tradeoff between rewriting too much (as here) and rewriting too little (skipping a resource that did in fact need to be rewritten).

I propose however to be more resilient by taking advantage of the new WARC-Resource-Type WARC header, whose values come from https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType. Since it records how the browser itself treated the resource, it is clearly more in line with the information we need.

I propose to alter the logic to:

- rewrite as HTML if WARC-Resource-Type is Document and the HTTP method is GET (rather than "anything but POST" as today: PUT, PATCH and DELETE responses probably deserve the same treatment as POST)
- rewrite as CSS if WARC-Resource-Type is Stylesheet
- if WARC-Resource-Type is Script, keep today's mimetype-based logic to differentiate javascript from json and other mimetypes
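
The three rules above could be sketched roughly as follows. This is only an illustration, not the code that was eventually merged: the function name `rewrite_mode_from_resource_type` and its parameters are hypothetical, the resource type is compared case-insensitively (the record above carries `font` while the DevTools protocol spells its values `Font`, `Document`, ...), and a Script with an unrecognised mimetype, as well as a record without the header, is simply left untouched here, which is an assumption.

```python
from __future__ import annotations

# Mimetypes treated as JavaScript by the existing mimetype-based logic.
JS_MIMETYPES = frozenset(
    {"text/javascript", "application/javascript", "application/x-javascript"}
)


def rewrite_mode_from_resource_type(
    resource_type: str | None,
    mimetype: str | None,
    method: str = "GET",
    path: str = "",
    is_jsonp: bool = False,
) -> str | None:
    """Return "html", "css", "javascript", "json", "jsonp" or None (no rewrite).

    Hypothetical helper: `is_jsonp` stands in for the result of
    extract_jsonp_callback() on the original URL.
    """
    # Assumption: header values are compared case-insensitively.
    kind = (resource_type or "").lower()

    if kind == "document":
        # Only GET responses are rewritten as HTML.
        return "html" if method.upper() == "GET" else None
    if kind == "stylesheet":
        return "css"
    if kind == "script":
        # Same mimetype-based disambiguation as today.
        if mimetype in JS_MIMETYPES:
            if is_jsonp:
                return "jsonp"
            if path.endswith(".json"):
                return "json"
            return "javascript"
        if mimetype == "application/json":
            return "json"
        # Assumption: a Script with any other mimetype is not rewritten.
        return None
    # Fonts, images, media, ... (and records without the header in this
    # simplified sketch) are stored as-is.
    return None
```

With such a function, the failing record (WARC-Resource-Type: font, content-type: text/html) would no longer be sent to the HTML rewriter, because the wrong mimetype is never consulted for a font.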

This can clearly wait for 2.1: the core problem is that the server is lying to the scraper, and such a change will need a bit of testing before we can declare that it only has the expected impact.

rgaudin · 2024-06-04T08:51:27Z

LGTM except we are a bit unclear on the impact, as you said.

I think it's a better approach than the current one, as there is no obligation to return a content-type, nor to return a valid one. It's only a convention, although with the professionalization of the web and the weight of tech giants, it is now mainstream.

But zimit's goal is browsing fidelity, not tech-spec validation, so whatever works in the browser should be the goal. In that sense, using those hints from the browser makes a lot more sense and should be preferred when available.

benoit74 · 2024-06-04T09:20:59Z

I just realized we could (and probably should) easily keep both approaches in parallel for 2.1: use the result of the new approach, but raise WARNINGs when the two approaches give different results. This will help to check for regressions during 2.1 tests AND help to diagnose problems in production once 2.1 is released.
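
A minimal sketch of that side-by-side check, assuming the two rewrite modes have already been computed elsewhere. The function name `pick_rewrite_mode`, the logger name and the message wording are all hypothetical, not taken from warc2zim:

```python
from __future__ import annotations

import logging

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("rewrite_mode_check")


def pick_rewrite_mode(
    url: str, legacy_mode: str | None, new_mode: str | None
) -> str | None:
    """Return the resource-type-based mode, warning when both approaches differ.

    legacy_mode: mode derived from the content-type response header.
    new_mode: mode derived from the WARC-Resource-Type header.
    """
    if legacy_mode != new_mode:
        logger.warning(
            "Rewrite mode mismatch for %s: mimetype-based=%r, resource-type-based=%r",
            url,
            legacy_mode,
            new_mode,
        )
    # The new approach always wins; the legacy result is only used for diagnosis.
    return new_mode
```

For the font discussed in this issue, the legacy mode would be "html" and the new mode None, so the resource would be stored untouched and a single warning would point at the misbehaving server.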

benoit74 · 2024-06-07T08:20:30Z

This also caused the failure of https://farm.openzim.org/pipeline/32a2ad19-1ceb-4679-9d16-0b7d92f46c23

benoit74 added this to the 2.0.0 milestone Jun 3, 2024

benoit74 mentioned this issue Jun 3, 2024

Week 23 routine kiwix/operations#200

Closed

18 tasks

benoit74 modified the milestones: 2.0.0, 2.1.0 Jun 3, 2024

benoit74 changed the title ~~Yet another decoding issue on fontawesome-webfont.woff2~~ Use Warc-Resource-Type header to decide how to rewrite a WARC record Jun 3, 2024

benoit74 modified the milestones: 2.1.0, 2.0.1 Jun 10, 2024

benoit74 self-assigned this Jun 11, 2024

benoit74 mentioned this issue Jun 11, 2024

Detect content type based on WARC-Resource-Type #306

Merged

benoit74 closed this as completed in #306 Jun 13, 2024

benoit74 mentioned this issue Jun 25, 2024

Some resources rewrite mode are not correctly identified #326

Closed
