LookupError: unknown encoding: unicode #331

rgaudin · 2024-06-24T10:32:37Z

Not sure if this is still valid (I think we merged related changes last week), but this zimit.kiwix.org run failed during rewriting:

Traceback (most recent call last):
  File "/usr/bin/zimit", line 8, in <module>
    sys.exit(zimit.zimit())
             ^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/zimit/zimit.py", line 585, in zimit
    run(sys.argv[1:])
  File "/app/zimit/lib/python3.12/site-packages/zimit/zimit.py", line 507, in run
    return warc2zim(warc2zim_args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/main.py", line 146, in main
    return converter.run()
           ^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/converter.py", line 330, in run
    self.add_items_for_warc_record(record)
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/converter.py", line 748, in add_items_for_warc_record
    payload_item = WARCPayloadItem(
                   ^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/items.py", line 56, in __init__
    ).rewrite(pre_head_template, post_head_template)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/content_rewriting/generic.py", line 108, in rewrite
    return self.rewrite_html(pre_head_template, post_head_template)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/content_rewriting/generic.py", line 247, in rewrite_html
    ).rewrite(self.content_str)
              ^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/content_rewriting/generic.py", line 93, in content_str
    return to_string(
           ^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/utils.py", line 175, in to_string
    return input_.decode(head_encoding, errors="replace")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
LookupError: unknown encoding: unicode

rgaudin · 2024-06-24T10:34:02Z

And https://farm.zimit.kiwix.org/pipeline/2c1770f9-1281-48f0-ab8a-4bb2a727552b/debug

benoit74 · 2024-06-24T11:46:47Z

Thank you!

benoit74 · 2024-06-27T14:47:55Z

Same problem with iso-utf-8 on https://farm.openzim.org/pipeline/d1fa0c7a-29c2-4229-80cf-686aba6ac0f5/debug

Note that on https://farm.zimit.kiwix.org/pipeline/2c1770f9-1281-48f0-ab8a-4bb2a727552b/debug the problem was with iso-8559-1 (a typo on the site's side; the intended encoding is most probably iso-8859-1)
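For reference, here is a quick way to check which declared names Python's codec registry can resolve. This is only a minimal sketch built on the standard library's `codecs.lookup`; the helper name `is_known_encoding` is made up for illustration and is not part of warc2zim.

```python
import codecs


def is_known_encoding(name: str) -> bool:
    """Return True if Python's codec registry can resolve this encoding name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


# Names seen in this issue, plus the correctly spelled iso-8859-1 for comparison.
for name in ("iso-8859-1", "iso-8559-1", "iso-utf-8", "unicode"):
    print(name, is_known_encoding(name))
```

Only the correctly spelled name resolves; the other three raise the same `LookupError` as in the traceback above.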

benoit74 · 2024-06-27T14:52:36Z

So the problematic pages are https://www.qsl.net/emporiaars/newsletter.html, https://marxists.incn.su/history/etol/writers/goldman/1936/09/campaign.html and https://www.qsl.net/vk2jem/swlogs.htm

benoit74 · 2024-07-01T13:07:05Z

Same kind of problem on page http://api.map.baidu.com/mapCard/js/finish/scriptURL.js, where the regex wrongly detects a charset declaration inside JS code.
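To illustrate how such a false positive can happen, here is a small sketch. Both the pattern and the JS snippet are assumptions written for this example: the pattern is a simplified stand-in for a permissive charset regex, not the one warc2zim actually uses, and the JS line is not taken from the file above.

```python
import re
from typing import Optional

# Hypothetical permissive pattern: any "charset = <token>" anywhere in the bytes.
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE)


def guess_charset(content: bytes) -> Optional[str]:
    """Return the first token that looks like a charset declaration, if any."""
    match = CHARSET_RE.search(content)
    return match.group(1).decode("ascii") if match else None


html = b'<html><head><meta charset="utf-8"></head></html>'
js = b'var s = document.createElement("script"); s.charset = this.charset;'

print(guess_charset(html))  # a real declaration
print(guess_charset(js))    # a JS property assignment mistaken for one
```

On the JS input the pattern captures `this`, which is not an encoding name, so a later `bytes.decode` call with that value would raise `LookupError`.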

benoit74 · 2024-07-01T13:08:57Z

Again on https://openai.com/_next/static/chunks/8492.86a146b3037613c7.js

benoit74 · 2024-07-01T13:12:47Z

Again with https://educacionenquimica.com.ar/index.php/edenlaq/article/download/76/135?inline=1 (unicode)

benoit74 · 2024-07-08T09:36:40Z

Again with https://cdn.cookielaw.org/consent/796524c5-24c2-4cdf-907c-573192ba6a9d/otSDKStub.js (this), a bad regex match in JS code

benoit74 · 2024-07-08T13:06:13Z

And

https://mi_shell.gitlab.io/models_online/ose_webgl/js/three.min.js (JS problem also)
https://wiki.opensourceecology.org/wiki/Main_Page (JS problem also)
https://vercel.com/_next/static/chunks/83266-67bf9ae568a2b318.js?dpl=dpl_2hmfJkPd9Zdg3JRu5ruKqVazZ59j (JS problem also)
https://tertullian.org/french/adversus_omnes_haereses.htm (windows-utf-8 in HTML document)
https://www.answering-islam.org/Thai/index.html (windows-874 in HTML document)

benoit74 · 2024-07-08T13:21:37Z

It looks like there are two problems behind this one error:

The code that guesses the charset is too permissive and picks up a significant number of false positives in JS documents. Is there really a reason to run this regex on non-HTML documents? AFAIK, only HTML has a spec for declaring the charset in the document header. I suggest detecting the charset in the document header with the regex only on HTML documents.
A variety of websites declare wrong encodings, or at least encodings unknown to Python (unicode, windows-utf-8, windows-874, iso-8559-1, iso-utf-8). Adding automated support for these encodings is going to be cumbersome, because we never really know which one is correct to use. I suggest that we simply add another parameter to configure additional encoding aliases, on top of the many Python already has (unicode=utf-8,iso-8559-1=iso-8859-1,windows-874=cp874).
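The two suggestions above could be sketched roughly as follows. This is an illustration under stated assumptions, not the change that was eventually merged: the function names (`parse_aliases`, `declared_charset`, `to_text`), the regex, the 1024-byte search window and the UTF-8 fallback are all hypothetical choices made for this example.

```python
import codecs
import re
from typing import Dict, Optional

# Hypothetical pattern; stands in for whatever regex the project uses.
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE)


def parse_aliases(spec: str) -> Dict[str, str]:
    """Parse 'bad=good,bad2=good2' into a lowercase lookup table."""
    aliases: Dict[str, str] = {}
    for pair in spec.split(","):
        if not pair.strip():
            continue
        bad, _, good = pair.partition("=")
        aliases[bad.strip().lower()] = good.strip().lower()
    return aliases


def declared_charset(content: bytes, mimetype: str) -> Optional[str]:
    """Search for a charset declaration, but only in HTML documents."""
    if mimetype.split(";")[0].strip().lower() != "text/html":
        return None
    # Assumption: the declaration sits near the start of the document.
    match = CHARSET_RE.search(content[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def to_text(content: bytes, mimetype: str, aliases: Dict[str, str]) -> str:
    """Decode content, mapping known-bad names through the alias table."""
    name = declared_charset(content, mimetype) or "utf-8"
    name = aliases.get(name, name)
    try:
        codecs.lookup(name)
    except LookupError:
        name = "utf-8"  # assumption: fall back instead of crashing
    return content.decode(name, errors="replace")


aliases = parse_aliases("unicode=utf-8,iso-8559-1=iso-8859-1,windows-874=cp874")
page = b'<html><head><meta charset="iso-8559-1"></head><body>caf\xe9</body></html>'
print(to_text(page, "text/html", aliases))
```

With the alias table, the misspelled `iso-8559-1` is decoded as `iso-8859-1`, and JS content is never scanned for a charset at all. Whether falling back to UTF-8 for names that are still unknown is acceptable is a separate design question.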

rgaudin added the bug label Jun 24, 2024

benoit74 self-assigned this Jun 27, 2024

benoit74 added this to the 2.1.0 milestone Jun 27, 2024

benoit74 mentioned this issue Jun 27, 2024

New ZIM request: Marxist Internet Archive openzim/zim-requests#311

Open

This was referenced Jul 8, 2024

Further enhance the situation regarding unknown encoding #347

Merged

benoit74 closed this as completed in #347 Aug 2, 2024
