LookupError: unknown encoding: unicode #331

Closed
rgaudin opened this issue Jun 24, 2024 · 10 comments · Fixed by #347

Comments

@rgaudin
Member

rgaudin commented Jun 24, 2024

Not sure if this is still valid (I think we merged related changes last week), but this zimit.kiwix.org run failed during rewriting:

Traceback (most recent call last):
  File "/usr/bin/zimit", line 8, in <module>
    sys.exit(zimit.zimit())
             ^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/zimit/zimit.py", line 585, in zimit
    run(sys.argv[1:])
  File "/app/zimit/lib/python3.12/site-packages/zimit/zimit.py", line 507, in run
    return warc2zim(warc2zim_args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/main.py", line 146, in main
    return converter.run()
           ^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/converter.py", line 330, in run
    self.add_items_for_warc_record(record)
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/converter.py", line 748, in add_items_for_warc_record
    payload_item = WARCPayloadItem(
                   ^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/items.py", line 56, in __init__
    ).rewrite(pre_head_template, post_head_template)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/content_rewriting/generic.py", line 108, in rewrite
    return self.rewrite_html(pre_head_template, post_head_template)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/content_rewriting/generic.py", line 247, in rewrite_html
    ).rewrite(self.content_str)
              ^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/content_rewriting/generic.py", line 93, in content_str
    return to_string(
           ^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/utils.py", line 175, in to_string
    return input_.decode(head_encoding, errors="replace")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
LookupError: unknown encoding: unicode
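
The last frame shows `bytes.decode` being handed an encoding name that Python's codec registry does not know. A minimal sketch of a defensive decode follows. `safe_decode` and its `fallback` parameter are hypothetical names for illustration, not warc2zim's actual `to_string`, and falling back to UTF-8 is an assumption rather than the fix that was merged.

```python
import codecs


def safe_decode(data: bytes, declared: str, fallback: str = "utf-8") -> str:
    """Decode `data` with `declared` if Python knows that codec, else use `fallback`.

    Hypothetical helper: it only illustrates checking the codec registry
    before calling bytes.decode, so that an unknown name cannot raise.
    """
    try:
        # codecs.lookup raises LookupError for names such as "unicode"
        codecs.lookup(declared)
    except LookupError:
        declared = fallback
    return data.decode(declared, errors="replace")


# The charset name from the traceback no longer aborts the conversion
text = safe_decode("é".encode("utf-8"), "unicode")
```

Silently falling back can hide a wrong guess, so logging the unknown name alongside the fallback would probably be worthwhile in practice.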

@rgaudin rgaudin added the bug Something isn't working label Jun 24, 2024
@rgaudin
Member Author

rgaudin commented Jun 24, 2024

@benoit74
Collaborator

Thank you!

@benoit74 benoit74 self-assigned this Jun 27, 2024
@benoit74 benoit74 added this to the 2.1.0 milestone Jun 27, 2024
@benoit74
Collaborator

Same problem with iso-utf-8 on https://farm.openzim.org/pipeline/d1fa0c7a-29c2-4229-80cf-686aba6ac0f5/debug

Note that on https://farm.zimit.kiwix.org/pipeline/2c1770f9-1281-48f0-ab8a-4bb2a727552b/debug the problem was with iso-8559-1 (a typo in the declared charset; the intended encoding is most probably iso-8859-1)

@benoit74
Collaborator

benoit74 commented Jul 1, 2024

Same kind of problem on page http://api.map.baidu.com/mapCard/js/finish/scriptURL.js, where the charset regex picks up a bogus match inside JS code.
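
To illustrate the false positive, here is a rough sketch. `CHARSET_RE` is a made-up pattern standing in for whatever regex warc2zim really uses, and `sniff_charset` and `HTML_MIMETYPES` are hypothetical names. It shows how a permissive pattern matches a JS assignment, and how restricting the sniffing to HTML mimetypes would avoid that.

```python
import re
from typing import Optional

# Hypothetical, deliberately permissive pattern (not warc2zim's actual regex)
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE)

# Assumption: only these mimetypes carry an in-document charset declaration
HTML_MIMETYPES = {"text/html", "application/xhtml+xml"}


def sniff_charset(content: bytes, mimetype: str) -> Optional[str]:
    """Return the charset declared near the top of an HTML document, else None."""
    base_type = mimetype.split(";")[0].strip().lower()
    if base_type not in HTML_MIMETYPES:
        # Never run the regex on JS, CSS, JSON, ...
        return None
    match = CHARSET_RE.search(content[:1024])
    return match.group(1).decode("ascii") if match else None


js = b'var s = document.createElement("script"); s.charset = "unicode";'
html = b'<html><head><meta charset="utf-8"></head><body></body></html>'

# The bare regex happily "finds" a charset in the JS snippet ...
bogus = CHARSET_RE.search(js).group(1)
# ... while the mimetype guard ignores it and still handles HTML
js_charset = sniff_charset(js, "application/javascript")
html_charset = sniff_charset(html, "text/html; charset=whatever")
```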

@benoit74
Collaborator

benoit74 commented Jul 1, 2024

@benoit74
Collaborator

benoit74 commented Jul 1, 2024

@benoit74
Collaborator

benoit74 commented Jul 8, 2024

Again with https://cdn.cookielaw.org/consent/796524c5-24c2-4cdf-907c-573192ba6a9d/otSDKStub.js: the regex wrongly matches a charset inside JS code.

@benoit74
Collaborator

benoit74 commented Jul 8, 2024

It looks like there are two problems behind this one error log:

  • The code that guesses the charset is too permissive and picks up a significant number of false positives in JS documents. Is there really a reason to run this regex on non-HTML documents? AFAIK, only HTML has a spec for declaring the charset in its header, so I suggest detecting the charset in the document header by regex only for HTML documents.
  • A variety of websites declare wrong encodings, or at least encodings unknown to Python (unicode, windows-utf-8, windows-874, iso-8559-1, iso-utf-8). Adding automated support for these is going to be cumbersome, because we never really know which encoding is correct. I suggest that we simply add yet another parameter to configure additional encoding aliases on top of the many Python already has (unicode=utf-8,iso-8559-1=iso-8859-1,windows-874=cp874).
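
The alias parameter from the second point could look roughly like the sketch below. `parse_encoding_aliases` and `decode_with_aliases` are hypothetical names, and the `alias=target` comma-separated syntax is taken from the example value above. This is not the implementation that ended up in the fix.

```python
import codecs


def parse_encoding_aliases(spec: str) -> dict[str, str]:
    """Parse "alias=target,alias=target" into a lowercase alias mapping."""
    aliases: dict[str, str] = {}
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue
        alias, sep, target = pair.partition("=")
        alias, target = alias.strip().lower(), target.strip().lower()
        if not sep or not alias or not target:
            raise ValueError(f"invalid alias definition: {pair!r}")
        # Fail early (LookupError) if the target is itself unknown to Python
        codecs.lookup(target)
        aliases[alias] = target
    return aliases


def decode_with_aliases(data: bytes, encoding: str, aliases: dict[str, str]) -> str:
    """Decode `data`, first mapping a known-bad encoding name to its alias target."""
    resolved = aliases.get(encoding.strip().lower(), encoding)
    return data.decode(resolved, errors="replace")


aliases = parse_encoding_aliases("unicode=utf-8,iso-8559-1=iso-8859-1,windows-874=cp874")
page = decode_with_aliases("é".encode("utf-8"), "unicode", aliases)
```

Validating each target at parse time means a typo in the parameter itself is reported before the run starts, rather than midway through a conversion.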
