Including charset in HTTP breaks non UTF-8 sites #2203
Comments
As of #2008, content type is already auto-detected (from the first 512 bytes through Looking at the importers list, it appears that charset auto-detection isn't used in most major Go web frameworks, yet it is used to format the diff content in gogs: https://github.com/gogits/gogs/blob/c199703e2aa746f10ff2b584513fe3af810b6c1a/models/git_diff.go. There should not be a major performance hit from checking the first 1 KB of each piece of content served by the gateway. |
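For context, here is a minimal sketch of 512-byte content sniffing using the Go standard library's `http.DetectContentType`. Whether the gateway calls exactly this function is an assumption, and the helper name `sniffContentType` is hypothetical. It shows why sniffing alone cannot help here: for anything recognized as HTML, the standard library reports `charset=utf-8` no matter how the bytes are actually encoded.

```go
package main

import "net/http"

// sniffContentType is a hypothetical helper: it guesses a media type from
// at most the first 512 bytes of body using net/http's DetectContentType.
// Assumption: this mirrors what the gateway does; the thread does not
// confirm the exact call.
func sniffContentType(body []byte) string {
	if len(body) > 512 {
		body = body[:512]
	}
	return http.DetectContentType(body)
}
```

Calling `sniffContentType([]byte("<html><body>caf\xe9</body></html>"))` on a Latin-1 encoded page still yields `text/html; charset=utf-8`.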
I tried the |
ISO-8859-1 is roughly equivalent to windows-1252, so it might be good. More precisely, windows-1252 is a superset of the printable characters of ISO-8859-1. |
Is there any way we can avoid setting the type and leave it up to browsers, or to the Go library? |
We could try detecting whether the document itself contains a charset declaration and, if it does, not setting the charset in the HTTP header. This would solve the problem, because the browser would then use the charset declared in the document. The heuristic would be very simple, and probably fast.
Any cons of this method? |
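The heuristic proposed above could be sketched roughly as follows. This is only an illustration, not the gateway's code: the function name `declaresCharset`, the 1024-byte window, and the regular expression are all assumptions, and a regex match can produce false positives (for example, on a `charset=` that appears inside an unrelated tag whose name starts with `meta`).

```go
package main

import "regexp"

// metaCharsetRe loosely matches a <meta ...> tag that carries a charset
// attribute or a charset parameter inside its content attribute.
// Assumption: a case-insensitive regex is good enough for a fast heuristic.
var metaCharsetRe = regexp.MustCompile(`(?i)<meta[^>]*charset\s*=`)

// declaresCharset is a hypothetical helper: it reports whether the first
// 1024 bytes of body appear to contain a charset declaration.
func declaresCharset(body []byte) bool {
	const sniffLen = 1024 // assumed window size, not taken from the thread
	if len(body) > sniffLen {
		body = body[:sniffLen]
	}
	return metaCharsetRe.Match(body)
}
```

A caller would skip the `charset` parameter in the response header whenever `declaresCharset` returns true, leaving the decision to the browser.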
Hmm: note: I have tested #2230 and it does render the lua page properly. |
Also, @Kubuxu, |
Awesome. My response was to jbenet's question about whether we can leave it up to the browser. |
The HTTP API seems to include charsets in the headers arbitrarily: sometimes it does, and sometimes it does not. Should it be included all of the time, @Kubuxu? |
It shouldn't always be included, because it overrides the setting in HTML tags, and the default HTML charset is also different. |
I guess I am confused: how would a Header response from the API override HTML tags? Does the go HTTP API ever return HTML? |
Ahh, this issue is about the charset in the Gateway, sorry for confusing you. About the API: it should be included almost always, AFAIK (JSON implicitly defines UTF-8, but there are some edge cases in both directions, whether it is included or not; since it is already included in some places, it should probably be included all the time). |
Makes sense. Thanks! Sorry for redirecting the flow of this issue. |
2019 is almost over and the issue is still not resolved. This should be labeled as a bug since it actually "breaks" some web content; for example, trying to mirror non-UTF-8-encoded Project Gutenberg HTML books like this one: Utilitarianism (this is just a trivial example). This is the output I got by executing
|
The fix should be pretty simple. We need to modify Want to try tackling it? |
@Stebalien Oh, yes, I'd like to try. |
License: MIT Signed-off-by: Abdeldjalil Hebal <dreamski21@gmail.com>
fix #2203: omit the charset attribute when Content-Type is text/html
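As a rough illustration of what this commit message describes, the sketch below drops the `charset` parameter only when the media type is `text/html`. It is not the actual patch: the function name `stripHTMLCharset` is hypothetical, and using the standard library's `mime` package for parsing is an assumption about how one might do it.

```go
package main

import "mime"

// stripHTMLCharset is a hypothetical helper: it removes the charset
// parameter from a Content-Type value when the media type is text/html,
// so that the browser can honor the charset declared in the document.
// Other types (and unparsable values) are returned unchanged.
func stripHTMLCharset(contentType string) string {
	mediaType, params, err := mime.ParseMediaType(contentType)
	if err != nil || mediaType != "text/html" {
		return contentType
	}
	delete(params, "charset")
	return mime.FormatMediaType(mediaType, params)
}
```

With this, `text/html; charset=utf-8` becomes `text/html`, while `text/plain; charset=utf-8` is left as is, which keeps the charset for text files as the issue description asks.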
The charset returned in the HTTP header overrides the charset in the meta tag, but it is essential for text files and other content.
Encoding is broken on some sites that use a non-UTF-8 encoding, such as: https://ipfs.io/ipfs/QmYJxHZ5MeqKF5vbp8Wei71fNkBPdBSB3CRYSW75fA8yFE/
A solution would be to run simple charset meta tag detection. It could be a heuristic, for the sake of speed.
Good caching would also be great.