unicode chars doesn't work #193

eoli3n · 2020-05-10T20:13:54Z

At the bottom you can see that the same chars are working in my terminal.

osa1 · 2020-05-10T23:10:41Z

Thanks for reporting.

I think this is an encoding issue. The original IRC specifications (RFC 1459 and RFC 2812) are ASCII-based, so the encoding of Unicode characters is not specified. I may be wrong, but I suspect current servers do not care about the encoding of user messages. For example, if a user sends Unicode characters encoded in, say, UTF-16, servers are happy to forward them to receivers in the same encoding (is there even a way for servers to figure out the encoding of messages, in general?). Because no standard encoding is specified, a client that expects messages to be encoded in UTF-8 (e.g. tiny) will decode them incorrectly. I think this is precisely what's happening here (except that the sender's encoding may not be UTF-16 but something else).
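To illustrate the failure mode described above, here is a minimal sketch. It assumes the sender used Latin-1 (ISO-8859-1), which is only a guess, since the thread does not establish the actual encoding. The helper name `decode_strict` is hypothetical and is not taken from tiny's code base.

```rust
/// Hypothetical helper: decode wire bytes the way a UTF-8-only client would.
fn decode_strict(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(bytes)
}

fn main() {
    // "café" encoded as UTF-8: 'é' is the two bytes 0xC3 0xA9.
    let utf8 = [b'c', b'a', b'f', 0xC3, 0xA9];
    // "café" encoded as Latin-1 (assumed sender encoding): 'é' is the single byte 0xE9.
    let latin1 = [b'c', b'a', b'f', 0xE9];

    // The UTF-8 bytes decode cleanly; the Latin-1 bytes are rejected as invalid UTF-8.
    println!("{:?}", decode_strict(&utf8));
    println!("{:?}", decode_strict(&latin1));
}
```

A server that forwards bytes untouched gives the receiving client no way to tell these two messages apart other than by inspecting the bytes themselves.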

It'd be good to know:

The encoding used by the sender in this screenshot
How other widely used IRC clients decode incoming messages

eoli3n · 2020-05-11T07:22:32Z

> The encoding used by the sender in this screenshot

How can I give you that?

> How other widely used IRC clients decode incoming messages

I don't have that problem with weechat.
I just tested irssi, and there's no problem there either.

FreeFull · 2020-05-11T07:59:30Z

irssi makes an effort to detect what encoding was used, and re-encode the message. Weechat probably does too. That sort of approach is based on heuristics so is imperfect, and I'm not sure how complicated it might be to implement.

osa1 · 2020-05-11T10:44:10Z

@eoli3n how are you testing this in weechat and irssi? I'd like to use a similar setup to test tiny.

eoli3n · 2020-05-11T11:28:49Z

Actually... on 2 hosts, without changing anything, it seems to work now... there's something I don't get here.

osa1 · 2020-05-18T01:11:30Z

I briefly looked at hexchat and irssi for how they decode incoming messages.

1. Relevant code in irssi: from a quick look, it seems users can specify an encoding, and UTF-8 is the default. There's also a "fallback" option, which is CP1252. I'm not sure what happens if the incoming message is not encoded in either of these.
2. Relevant code in hexchat: the user can specify an encoding (in the server settings), and UTF-8 is the default. Invalid sequences are replaced with the Unicode replacement character (U+FFFD).
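For comparison, the "try UTF-8, then fall back" idea can be sketched as follows. This is not irssi's implementation. The helper `decode_with_fallback` is hypothetical, and it falls back to Latin-1 rather than CP1252 to keep the example free of lookup tables (the two differ in the 0x80..=0x9F range).

```rust
/// Hypothetical helper: try UTF-8 first, then fall back to Latin-1.
/// Assumption: Latin-1 stands in for CP1252 here; they differ in 0x80..=0x9F.
fn decode_with_fallback(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        Ok(s) => s.to_owned(),
        // Latin-1 maps every byte to the code point with the same numeric value,
        // so this branch cannot fail, but it may produce the wrong characters
        // if the sender actually used some other encoding.
        Err(_) => bytes.iter().map(|&b| b as char).collect(),
    }
}

fn main() {
    let utf8 = [b'c', b'a', b'f', 0xC3, 0xA9];
    let latin1 = [b'c', b'a', b'f', 0xE9];
    println!("{}", decode_with_fallback(&utf8));
    println!("{}", decode_with_fallback(&latin1));
}
```

The trade-off is that a fallback always yields some text, so a wrong guess shows up as mojibake instead of as an error.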

Implementing (2) is trivial as we already have the function in std. I think we should just do that.
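The std function in question is presumably `String::from_utf8_lossy`. A minimal sketch of how it behaves follows; the wrapper `decode_lossy` is a hypothetical name for illustration and is not taken from tiny's source.

```rust
use std::borrow::Cow;

/// Hypothetical wrapper: always produce valid UTF-8 from wire bytes.
/// Invalid sequences become U+FFFD REPLACEMENT CHARACTER.
fn decode_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

fn main() {
    // Valid UTF-8 is borrowed as-is, with no allocation.
    println!("{}", decode_lossy(b"hello"));

    // 0xE9 on its own is not valid UTF-8, so it is replaced with U+FFFD.
    let invalid = [b'c', b'a', b'f', 0xE9];
    println!("{}", decode_lossy(&invalid));
}
```

Because the result is always a valid `str`, later code can use ordinary checked string operations without any `unsafe` conversion.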

The IRC protocol is ASCII-based, and the encoding of non-ASCII (e.g. Unicode) characters is not specified. We expect UTF-8, but previously did not handle other cases correctly and unsafely generated UTF-8 strings from wire messages. This caused #194. We now remove all unchecked indexing and unchecked conversion to UTF-8, and use "lossy" conversion, which generates a UTF-8 string even in the presence of invalid UTF-8 sequences. Each invalid sequence is replaced with U+FFFD REPLACEMENT CHARACTER. Fixes #194. See also the discussion in #193.

osa1 · 2020-05-18T10:38:37Z

I'm closing this as we don't have a reproducer, and @eoli3n mentioned above that they can't reproduce this.

UTF-8-encoded characters always worked fine. For non-UTF-8 encodings, we previously did some unsafe stuff, which I just fixed; in the worst case you should now see � characters.

Note that we also assume that the terminal encoding is UTF-8 so if your terminal is not configured for that, changing that may fix the original problem you reported.

@eoli3n please re-open if you have this problem again.

osa1 mentioned this issue May 10, 2020

Panic in char::is_whitespace #194

Closed

osa1 added bug question labels May 10, 2020

osa1 added this to the 0.5.2 milestone May 11, 2020

osa1 closed this as completed May 18, 2020
