Replies: 4 comments 7 replies
-
So, I think a good place to start with any review on the behaviour here would have to be evidence-led. There's a bunch of different possible charset decoding strategies and they all have different trade-offs. Perhaps a smart thing to do here would be to find a "top-1000" list from someplace, and start by determining:
-
Hi,

Reboot

I am going to assume this was not much to proceed with.

Yes, you are absolutely right. But something does not seem right with how you decided that the detection was going to change. To be evidence-led, I may certainly say that your assumptions are most likely false. Those are dangerous assumptions.

Looking at https://w3techs.com/technologies/overview/character_encoding may ...

This statistic does not offer any weighting, so one should not read it as "I have 97 % chance of hitting UTF-8 content on HTML content".

(2021 Top 1000 sites from 80 countries in the world according to Data for SEO) https://github.com/potiuk/test-charset-normalizer

Neither httpx, chardet, nor charset-normalizer is dedicated to HTML content. It is very hard to find any stats at all regarding this matter. Users' usages can be very dispersed, so making ...

The real debate is whether the detection is an HTTP client matter or not. That is more complicated and not my field.

Initial thoughts

What I was saying, in the beginning, was very simple. Httpx decodes the content at least twice. That is bad, period. No matter what strategy httpx opts for, we all agree that it won't be without some tradeoff.

What I would suggest today is to try only UTF-8, using strict mode for error handling. If it fails, either raise a warning, raise an exception, or just return bytes, in addition to guiding users on how they should handle this matter. Or optionally reintroduce detection if any engine is available.

Regards,
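To make the suggestion above concrete, here is a minimal sketch of "try strict UTF-8 once, otherwise warn and hand back the bytes". It uses only the standard library. The helper name `decode_utf8_or_bytes` is hypothetical and is not part of httpx; choosing a warning over an exception is just one of the three options mentioned.

```python
import warnings
from typing import Union


def decode_utf8_or_bytes(content: bytes) -> Union[str, bytes]:
    """Hypothetical helper: a single strict UTF-8 attempt, no detection."""
    try:
        # One decoding pass only; strict mode raises on any invalid byte.
        return content.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # Assumption: warn and return the raw payload so the caller decides.
        warnings.warn(
            f"Payload is not valid UTF-8 (byte offset {exc.start}); "
            "returning raw bytes.",
            UnicodeWarning,
        )
        return content
```

A caller would then check `isinstance(result, bytes)` and pick an encoding explicitly, which is where user guidance would come in.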
-
Always the best way of doing things.
I did not implement it, indeed. I think that Chardet implemented it out of performance concerns.
The more content, the better. But one thing to keep in mind is that ... Another thing would be to use the public ...
I am open to easing the detection process in that case by providing a proper implementation.
-
We also need the ability to set the encoding and the errors policy ('replace' or 'ignore') manually.
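For reference, this is what the three error policies do at the level of Python's built-in `bytes.decode`. The wrapper `decode_with` is a hypothetical name used only for illustration; it is not an httpx function.

```python
def decode_with(content: bytes, encoding: str = "utf-8", errors: str = "strict") -> str:
    """Hypothetical wrapper: the caller picks both the codec and the errors policy."""
    return content.decode(encoding, errors=errors)


raw = "café".encode("latin-1")  # b'caf\xe9', which is not valid UTF-8

replaced = decode_with(raw, "utf-8", errors="replace")  # invalid byte becomes U+FFFD
ignored = decode_with(raw, "utf-8", errors="ignore")    # invalid byte is dropped
correct = decode_with(raw, "latin-1")                   # right codec, nothing lost
```

With 'replace' the damage stays visible in the text, while 'ignore' loses data silently, so the choice is best left to the user.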
-
Hi,
I have an idea of how to improve the text decoder default behavior without trying any detection by confidence.
While the assertion "UTF-8 is prevalent in WWW HTML content" is true, it is not the case for non-HTML content. Even in the top 1000 websites, there are still servers that do not disclose the charset in headers.
Currently, httpx does:
utf_8 codec with strict error policy

There is one main thing that needs to be addressed as-is:
A small performance issue is going to happen over large payloads.
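As an illustration of how a strict UTF-8 attempt can be kept to a single pass over a large payload, here is a sketch using the standard library's incremental decoder. This is an assumption about one possible approach, not the change proposed in this comment, and `decode_utf8_incrementally` is a hypothetical name.

```python
import codecs
from typing import Iterable, Optional


def decode_utf8_incrementally(chunks: Iterable[bytes]) -> Optional[str]:
    """Decode chunk by chunk in strict mode; return None on the first invalid byte."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    parts = []
    try:
        for chunk in chunks:
            # The decoder buffers incomplete multi-byte sequences between chunks.
            parts.append(decoder.decode(chunk))
        # final=True makes a truncated trailing sequence raise as well.
        parts.append(decoder.decode(b"", final=True))
    except UnicodeDecodeError:
        # Stop early: the rest of the payload is never decoded.
        return None
    return "".join(parts)
```

Failing fast on the first bad byte means a non-UTF-8 payload costs only as much as the prefix that was valid.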
I propose to change the default behavior by: