Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unicode chars doesn't work #193

Closed
eoli3n opened this issue May 10, 2020 · 7 comments
Closed

unicode chars doesn't work #193

eoli3n opened this issue May 10, 2020 · 7 comments

Comments

@eoli3n
Copy link

eoli3n commented May 10, 2020

At the bottom you can see that the same chars are working in my terminal.
img

@osa1
Copy link
Owner

osa1 commented May 10, 2020

Thanks for reporting.

I think this is an encoding issue. The original IRC protocols (RFC 1459 and 2812) are ASCII-based so encoding of unicode characters is not specified. I may be wrong, but I suspect current servers do not care about encoding of user messages, for example if a user sends unicode characters encoded in, say, UTF-16, servers are happy to send those to receivers in the same encoding (is there even a way for servers to figure out encoding of messages, in general?), but because there isn't a standard encoding specified, a client that expects messages to be encoded in UTF-8 (e.g. tiny) will decode it incorrectly. I think this is precisely what's happening here (except the sender's encoding may not be UTF-16 but something else).

It'd be good to know:

  1. The encoding used by the sender in this screenshot
  2. How other widely used IRC clients decode incoming messages

@eoli3n
Copy link
Author

eoli3n commented May 11, 2020

The encoding used by the sender in this screenshot

How to give you that ?

How other widely used IRC clients decode incoming messages

I don't have that problem with weechat.
Just tested irssi, no problem too.

@FreeFull
Copy link

irssi makes an effort to detect what encoding was used, and re-encode the message. Weechat probably does too. That sort of approach is based on heuristics so is imperfect, and I'm not sure how complicated it might be to implement.

@osa1
Copy link
Owner

osa1 commented May 11, 2020

@eoli3n how are you testing this in weechat and irssi? I'd like to use a similar setup to test tiny.

@osa1 osa1 added this to the 0.5.2 milestone May 11, 2020
@eoli3n
Copy link
Author

eoli3n commented May 11, 2020

Actually... on 2 hosts, without changing anything, it seems to work now... there's something i don't get here.

@osa1
Copy link
Owner

osa1 commented May 18, 2020

I briefly looked at hexchat and irssi for how they decode incoming messages.

  • Relevant code in irssi: from a quick look it seems like users can specify an encoding and UTF-8 is the default. There's also a "fallback" option which is CP1252. Not sure what happens if the incoming message is not encoded in any of these.

  • Relevant code in hexchat: User can specify an encoding (in server settings), UTF-8 is the default. Invalid sequences are replaced with "The Unicode replacement character" (0xFFFD).

Implementing (2) is trivial as we already have the function in std. I think we should just do that.

osa1 added a commit that referenced this issue May 18, 2020
IRC protocol is ASCII-based and encoding of non-ASCII (e.g. unicode)
characters is not specified. We expect UTF-8, but previously did not
handle other cases correctly and unsafely generated UTF-8 strings from
wire messages. This caused #194.

We now remove all unchecked indexing and conversion to UTF-8 and use
"lossy" conversion which generates a UTF-8 string even in the presence
of invalid UTF-8 sequences. For invalid sequences 'U+FFFD REPLACEMENT
CHARACTER' is generated.

Fixes #194

See also discussion in #193.
@osa1
Copy link
Owner

osa1 commented May 18, 2020

I'm closing this as we don't have a reproducer, and @eoli3n mentioned above that they can't reproduce this.

UTF-8-encoded characters always worked fine, for non-UTF-8 encodings, we previously did some unsafe stuff which I just fixed, in the worst case you should now see characters.

Note that we also assume that the terminal encoding is UTF-8 so if your terminal is not configured for that, changing that may fix the original problem you reported.

@eoli3n please re-open if you have this problem again.

@osa1 osa1 closed this as completed May 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants