Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Including charset in HTTP breaks non UTF-8 sites #2203

Closed
Kubuxu opened this issue Jan 14, 2016 · 16 comments · Fixed by #6743
Closed

Including charset in HTTP breaks non UTF-8 sites #2203

Kubuxu opened this issue Jan 14, 2016 · 16 comments · Fixed by #6743
Labels
help wanted Seeking public contribution on this issue kind/bug A bug in existing code (including security flaws)

Comments

@Kubuxu
Copy link
Member

Kubuxu commented Jan 14, 2016

Charset returned in HTTP header overrides charset in meta tag but is essential for text files and other content.

Encoding is broken on some sites like: https://ipfs.io/ipfs/QmYJxHZ5MeqKF5vbp8Wei71fNkBPdBSB3CRYSW75fA8yFE/ that use non UTF-8 encoding.

Solution would be to run simple charset meta tag detection. It could be heuristic for sake of speed.

Also good caching would be great.

@Kubuxu Kubuxu changed the title Returing charset breaks non UTF-8 sites Including charset in HTTP breaks non UTF-8 sites Jan 14, 2016
@rht
Copy link
Contributor

rht commented Jan 21, 2016

As of #2008, content type is already auto-detected (from the first 512 bytes through http.DetectContentType).
To detect charset, however, requires 1. importing https://godoc.org/golang.org/x/net/html/charset, 2. rewriting http.ServeContent to use charset.DetermineEncoding.

Looking at the importers list, it appears that charset auto-detect hasn't been used in most of major golang web frameworks, yet is used to format the diff content in gogs: https://github.com/gogits/gogs/blob/c199703e2aa746f10ff2b584513fe3af810b6c1a/models/git_diff.go.

There should not be a major perf hit with checking the first 1kb of each content served in the gateway.

@rht rht added the help wanted Seeking public contribution on this issue label Jan 21, 2016
@rht
Copy link
Contributor

rht commented Jan 21, 2016

I tried the charset.DetermineEncoding, but it detected the charset in https://ipfs.io/ipfs/QmYJxHZ5MeqKF5vbp8Wei71fNkBPdBSB3CRYSW75fA8yFE/ as windows-125223.

@Kubuxu
Copy link
Member Author

Kubuxu commented Jan 21, 2016

ISO-8859 is equivalent to windows-1252 so it might be good.

More precisely windows-1252 is superset of ISO-8859

@jbenet
Copy link
Member

jbenet commented Jan 24, 2016

any way we can not set the type and leave it up to browsers? or the golang lib?

@Kubuxu
Copy link
Member Author

Kubuxu commented Jan 24, 2016

We could try detecting if there is charset declaration in then not setting charset in the HTTP header. This would solve this problem as browser will then use charset from .

Heuristic would be very simple, and probably fast.

  1. Take 1500 bytes.
  2. Truncate it to .
  3. Check if regex matches: /meta[^>]+charset=/i.
    Example result: http://regexr.com/3claa, http://regexr.com/3clad

Any cons of this method?

@rht
Copy link
Contributor

rht commented Jan 25, 2016

Hmm:
"Although this parameter is optional, we recommend that it always be present (in the Content-Type)"[1].

note: I have tested #2230 and it does render the lua page properly.

ref:
[1] https://www.w3.org/TR/html4/conform.html

@rht
Copy link
Contributor

rht commented Jan 25, 2016

Also, @Kubuxu, charset.DetermineEncoding does the heuristic which is based on http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding. It does look for the charset attribute as in https://github.com/golang/net/blob/master/html/charset/charset.go#L194.

@Kubuxu
Copy link
Member Author

Kubuxu commented Jan 25, 2016

Awesome, my response was to jbenet's question if we can leave it up to browser.

@RichardLitt
Copy link
Member

The HTTP Api seems to arbitrarily include charsets in the headers, at times, and not, at other times. Should it be included all of the time, @Kubuxu?

@Kubuxu
Copy link
Member Author

Kubuxu commented Mar 3, 2016

It shouldn't be included always as it overrides setting in HTML tags and also default HTML charset is different.

@RichardLitt
Copy link
Member

I guess I am confused: how would a Header response from the API override HTML tags? Does the go HTTP API ever return HTML?

@Kubuxu
Copy link
Member Author

Kubuxu commented Mar 3, 2016

Ahh, this issue is about charset in Gateway, sorry for confusing you. About API: it should be included almost always AFIAK, (json implicitly defines utf-8 but there are some edge cases in both directions, including it or not, but as it is included in some places is should probably be included all the time).

@RichardLitt
Copy link
Member

Makes sense. Thanks! Sorry for redirecting the flow of this issue.

@djalilhebal
Copy link
Contributor

2019 is almost over and the issue is still not resolved.

This should be labeled as a bug since it actually "breaks" some web contents; for example, trying to mirror non-UTF-8-encoded Project Guterberg HTML books like this one: Utilitarianism (this is a just trivial example).


This is the output I got by executing curl --head "https://gateway.ipfs.io/ipfs/QmXjU4HTejtdY5bouHmXny5NMyE4RfsnXNqizcYdVusr46/books/utilitarianism.html"

HTTP/2 200 
server: nginx
date: Fri, 01 Nov 2019 20:00:13 GMT
content-type: text/html; charset=utf-8
content-length: 186678
vary: Accept-Encoding
accept-ranges: bytes
access-control-allow-methods: GET
cache-control: public, max-age=29030400, immutable
etag: "QmSFFoikBNJZmN2muk4CbpAXtCRdJQjv2ZsrKArEP587tt"
last-modified: Thu, 01 Jan 1970 00:00:01 GMT
suborigin: ipfs000bciqixel7d5xij5x4pugxurohowwtxgex2embtxb747dyeucl4el2zjy
x-ipfs-gateway-host: gateway-bank1-mrs1
x-ipfs-path: /ipfs/QmXjU4HTejtdY5bouHmXny5NMyE4RfsnXNqizcYdVusr46/books/utilitarianism.html
access-control-allow-origin: *
access-control-allow-methods: GET, POST, OPTIONS
access-control-allow-headers: X-Requested-With, Range, Content-Range, X-Chunked-Output, X-Stream-Output
access-control-expose-headers: Content-Range, X-Chunked-Output, X-Stream-Output
x-ipfs-pop: gateway-bank1-mrs1
strict-transport-security: max-age=31536000; includeSubDomains; preload

@Stebalien Stebalien added the kind/bug A bug in existing code (including security flaws) label Nov 2, 2019
@Stebalien
Copy link
Member

The fix should be pretty simple. We need to modify serveFile in github.com/ipfs/go-ipfs/core/corehttp/gateway_handler.go to do the file type detection that ServeContent normally does, then strip the charset if the content type is text/html (or maybe just strip the charset).

Want to try tackling it?

@djalilhebal
Copy link
Contributor

@Stebalien Oh, yes, I'd like to try.

djalilhebal added a commit to djalilhebal/go-ipfs that referenced this issue Nov 2, 2019
License: MIT
Signed-off-by: Abdeldjalil Hebal <dreamski21@gmail.com>
Stebalien added a commit that referenced this issue Dec 2, 2019
fix #2203: omit the charset attribute when Content-Type is text/html
pontiyaraja pushed a commit to pontiyaraja/go-ipfs that referenced this issue Dec 27, 2019
License: MIT
Signed-off-by: Abdeldjalil Hebal <dreamski21@gmail.com>
ralendor pushed a commit to ralendor/go-ipfs that referenced this issue Jun 6, 2020
License: MIT
Signed-off-by: Abdeldjalil Hebal <dreamski21@gmail.com>
ralendor pushed a commit to ralendor/go-ipfs that referenced this issue Jun 8, 2020
License: MIT
Signed-off-by: Abdeldjalil Hebal <dreamski21@gmail.com>
ralendor pushed a commit to ralendor/go-ipfs that referenced this issue Jun 8, 2020
License: MIT
Signed-off-by: Abdeldjalil Hebal <dreamski21@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Seeking public contribution on this issue kind/bug A bug in existing code (including security flaws)
Projects
None yet
6 participants