xed's detection is a bit better than cchardet's #103

Open
JCCyC opened this issue Sep 5, 2024 · 0 comments

OS/Arch

system='Linux', node='jclvdell', release='6.8.0-40-generic', version='#40~22.04.3-Ubuntu SMP PREEMPT_DYNAMIC Tue Jul 30 17:30:19 UTC 2', machine='x86_64'

Python version

3.10.12

cChardet version

2.1.7

What is the problem?

The attached file contains a Euro sign. The xed editor correctly detects it as ISO-8859-15, but cchardet reports ISO-8859-1.

Expected behavior

Corações Psicodélicos Nélida Piñón, § 2º, alínea 4ª, a 47° do eixo x. Custo: 50000¥ (ou €313,84)

Actual behavior

Corações Psicodélicos Nélida Piñón, § 2º, alínea 4ª, a 47° do eixo x. Custo: 50000¥ (ou ¤313,84)

(Euro symbol appears as "¤")
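
For anyone who wants to see the difference without the attachment, here is a minimal sketch. It assumes the Euro sign in the file is stored as the single byte 0xA4, which is where ISO-8859-15 puts "€" and where ISO-8859-1 puts "¤". The `raw` bytes are a hand-made sample, not the actual file contents.

```python
# Hand-made sample (an assumption, not read from pagininha2.html):
# the tail of the test sentence with the Euro sign as byte 0xA4.
raw = b"(ou \xa4313,84)"

# Same bytes, two single-byte codecs that disagree on 0xA4.
as_latin1 = raw.decode("iso-8859-1")
as_latin9 = raw.decode("iso-8859-15")

print(as_latin1)  # currency sign, as in "Actual behavior"
print(as_latin9)  # Euro sign, as in "Expected behavior"
```

Both decodes succeed because every byte is valid in both encodings, so a detector has to rely on statistics rather than on decode errors to tell them apart.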

Steps to reproduce the behavior

  1. Get this file: pagininha2.html.gz

  2. Do this:

$ gunzip pagininha2.html.gz
$ python
>>> import cchardet as chardet
>>> with open("pagininha2.html", "rb") as f:
...   msg = f.read()
...   result = chardet.detect(msg)
...   print(result)
... 
{'encoding': 'ISO-8859-1', 'confidence': 0.7640712261199951}
>>> 
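
In case it helps triage, below is a rough diagnostic sketch that lists which bytes in a buffer would decode differently under the two encodings. The helper names `differing_bytes` and `ambiguous_positions` are hypothetical and not part of cchardet; the sample buffer is again hand-made under the assumption that the Euro sign is byte 0xA4.

```python
def differing_bytes():
    """Byte values that ISO-8859-1 and ISO-8859-15 map to different characters."""
    return {
        b
        for b in range(256)
        if bytes([b]).decode("iso-8859-1") != bytes([b]).decode("iso-8859-15")
    }


def ambiguous_positions(data):
    """Return (offset, hex value) for every byte whose meaning depends on the choice."""
    diff = differing_bytes()
    return [(i, hex(b)) for i, b in enumerate(data) if b in diff]


# Hand-made sample; to check the real file, pass the `msg` bytes
# read in the session above instead.
sample = b"Custo: 50000\xa5 (ou \xa4313,84)"
print(ambiguous_positions(sample))
```

Only a handful of byte values differ between the two encodings, so if the only such byte in the file is 0xA4, the detector's choice comes down to that single character.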