Allow Unicode in nicknames #259

ttepasse · 2016-05-03T16:00:55Z

RFC 1459 only allows ASCII letters, numerals and some special characters in nicknames, leaving people from non-anglophone countries at a disadvantage. Using the wealth of human writing is possible in the body of messages; it should be possible in nicknames too.

SadieCat · 2016-05-03T16:39:56Z

There are existing implementations of this (e.g. InspIRCd's m_nationalchars) but nothing standard. I believe that @DanielOaks was looking into trialling RFC 3454 in @mammon-ircd with a desire for standardising it though.

It isn't as simple as just allowing it though. Compatibility is a concern (there are clients which break when they get a CASEMAPPING which is not ascii or rfc1459) as well as masquerading with characters that look similar (e.g. character 97 "a" looks very similar to character 1072 "а").
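The masquerading risk can be made concrete. The following is a minimal Python sketch, assuming a crude per-character script guess taken from the first word of the Unicode character name. That heuristic is chosen purely for illustration and is not the UTS 39 algorithm; `scripts_in` and `looks_mixed` are hypothetical helper names.

```python
import unicodedata

def scripts_in(nick):
    """Guess which scripts the letters of a nick come from (heuristic)."""
    scripts = set()
    for ch in nick:
        if ch.isalpha():
            # The first word of the character name is usually the script,
            # e.g. "LATIN SMALL LETTER A" or "CYRILLIC SMALL LETTER A".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def looks_mixed(nick):
    """True if letters from more than one script appear in the nick."""
    return len(scripts_in(nick)) > 1

latin_a = "a"            # character 97
cyrillic_a = "\u0430"    # character 1072, renders almost identically
print(ord(latin_a), unicodedata.name(latin_a))
print(ord(cyrillic_a), unicodedata.name(cyrillic_a))
print(looks_mixed("p" + cyrillic_a + "ypal"))
```

A check like this only flags nicks that mix scripts. A nick written entirely in look-alike characters from one script would pass, so it is a partial mitigation at best.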

clokep · 2016-05-03T16:51:02Z

There are also cases of servers improperly implementing rfc1459 vs. strict-rfc1459 (see inspircd/inspircd#1017).

Ideally, wouldn't we want this to match how it is done for channel names?

TingPing · 2016-05-03T17:57:09Z

For what it's worth, I made a test branch for hexchat supporting rfc3454, though afaik no network implements it, so there is nowhere to try it.

clokep · 2016-05-03T18:08:58Z

For what it's worth, we just experimented a bit on moznet, and things like the zero-width space character are accepted as a valid room name... which shows up as empty in whois:

(Additionally, there's also a channel which is just the prefix, #, which is a bit funky.)

dequis · 2016-05-03T19:07:27Z

Relevant reading:

UTR 36: Unicode Security Considerations

UTS 39: Unicode Security Mechanisms

Bitlbee has a 'utf8_nicks' setting, disabled by default and with a small warning about potential breakage in the help text. It doesn't perform any cleanup, deferring that to the IM server (XMPP, for example, cleans them with the nodeprep/resourceprep profiles of stringprep), but I'd really like to change this.

I haven't heard of clients with big issues when enabling this, just minor visual issues like miscalculating the width when displaying the nicks in a terminal.

MicroDroid · 2016-05-03T21:38:32Z

How are we going to maintain the backward compatibility?
I'd upvote this otherwise

grawity · 2016-05-03T21:52:40Z

In practice, it's probably already compatible because many clients don't care.

MicroDroid · 2016-05-03T22:19:10Z

Hmm, well then this should really be in IRCv3.2, it's awesome

DanielOaks · 2016-05-03T22:32:11Z

masquerading with characters that look similar (e.g. character 97 "a" looks very similar to character 1072 "а").

With rfc3454 casemapping I believe we use the nameprep profile to prevent issues like this. It would be good to read through documents like those in detail to make sure we do things right if we're standardising it though.

So long as you continue to disallow characters that break the protocol (i.e. commas, periods in client names, etc), and reject nicks/channel names that fail to casefold (i.e. strings that fail because they contain a character prohibited by the profile), I haven't seen too many issues with it.
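A rough Python sketch of that "fold or reject" rule, using the RFC 3454 tables from the standard `stringprep` module. It follows the nameprep mapping and prohibition steps but omits the bidirectional check, and it normalizes with the interpreter's current Unicode data rather than Unicode 3.2. The `PROTOCOL_BREAKING` set and the `fold_nick` name are assumptions made for illustration, not taken from any IRCd.

```python
import stringprep
import unicodedata

# Assumption: characters treated as protocol-breaking in this sketch.
PROTOCOL_BREAKING = set(" ,.*?!@#:\r\n\0")

# Prohibition tables listed by the nameprep profile (RFC 3491, section 5).
PROHIBITED = (
    stringprep.in_table_c12, stringprep.in_table_c22,
    stringprep.in_table_c3, stringprep.in_table_c4,
    stringprep.in_table_c5, stringprep.in_table_c6,
    stringprep.in_table_c7, stringprep.in_table_c8,
    stringprep.in_table_c9,
)

def fold_nick(nick):
    """Return the casefolded nick, or None if the nick must be rejected."""
    if any(ch in PROTOCOL_BREAKING for ch in nick):
        return None
    # Map: drop characters "commonly mapped to nothing" (B.1), casefold (B.2).
    mapped = "".join(
        stringprep.map_table_b2(ch)
        for ch in nick
        if not stringprep.in_table_b1(ch)
    )
    folded = unicodedata.normalize("NFKC", mapped)
    # Reject empty results and anything containing a prohibited character.
    if not folded or any(check(ch) for ch in folded for check in PROHIBITED):
        return None
    return folded
```

With this sketch, a name consisting only of a zero-width space folds to the empty string and is rejected, and two nicks are considered equal when their folded forms are equal.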

kaniini · 2016-05-03T23:22:11Z

In charybdis, we plan to implement rfc7700 "casemapping", which is the same as rfc3454 nameprep except using IDN2008 rules, with specific requirements for "nicknames".

TingPing · 2016-05-04T00:26:14Z

How are we going to maintain the backward compatibility?

It is a joke, but it is a solution: convert it to punycode (or similar) for non-Unicode clients.
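For illustration only, here is what such a conversion could look like in Python using the standard `punycode` codec. The `xn--` marker is borrowed from IDNA as a hypothetical prefix, and `nick_to_ascii` / `nick_from_ascii` are made-up names; nothing in this thread specifies them.

```python
PREFIX = "xn--"  # hypothetical marker, borrowed from IDNA for this sketch

def nick_to_ascii(nick):
    """Return an ASCII-only form of a nick for clients without Unicode support."""
    if nick.isascii():
        return nick
    return PREFIX + nick.encode("punycode").decode("ascii")

def nick_from_ascii(wire):
    """Reverse nick_to_ascii for nicks carrying the marker."""
    if wire.startswith(PREFIX):
        return wire[len(PREFIX):].encode("ascii").decode("punycode")
    return wire

encoded = nick_to_ascii("\u0105\u00e3\u00e5")  # the nick "ąãå"
print(encoded, "->", nick_from_ascii(encoded))
```

One obvious gap in this sketch: a genuine ASCII nick that already starts with the marker would be misread on the way back, so a real scheme would need to reserve the prefix.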

In practice, it's probably already compatible because many clients don't care.

Not sure what you mean by that, many clients respect the casemapping and rely upon its behavior.

DanielOaks · 2016-05-04T00:49:33Z

@kaniini That makes sense, once it's implemented/specced out give me a yell and I can see about switching my personal stuff over to use it as well.

kaniini · 2016-05-04T01:41:58Z

How are we going to maintain the backward compatibility?

There is no plan in charybdis for backwards compatibility. Deployments which switch from rfc1459 to rfc7700 casemapping will assume clients support UTF-8 properly. Networks will decide on their own when to make the switch, or whether to make it at all.

Mikaela · 2016-05-04T06:20:25Z

How would tab completion work if someone used a nick on international channel that is not in latin alphabet?

What if the client is configured to use not-UTF-8-charset?

What if the person using UTF-8 nick uses something that my client cannot show due to old glibc in my system which is in the wild? (ref: weechat/weechat#79)

DanielOaks · 2016-05-04T06:46:19Z

How would tab completion work if someone used a nick on international channel that is not in latin alphabet?

Presumably the same as it does with the latin alphabet.

What if the client is configured to use not-UTF-8-charset?

I think detecting a specifically UTF-8-based casemapping from the server should make the client default to using UTF-8, if they're not already. If the user decides not to use it, they may get corrupted characters, just like what happens today when two clients using utf8 and non-utf8 try to send weird characters to each other.

What if the person using UTF-8 nick uses something that my client cannot show due to old glibc in my system which is in the wild?

Then it will not show those characters because your client (or the system you're using it on) does not work properly. I don't think this is an issue for us to worry about, it's a bug that will get fixed by more distros over time, and I especially think will be fixed enough for us to not care about it by the time a unicode casemapping actually gets into proper usage.

attilamolnar · 2016-05-04T08:49:04Z

I fully support moving away from legacy rfc1459 towards rfc7700.

grawity · 2016-05-04T11:09:44Z

How would tab completion work if someone used a nick on international channel that is not in latin alphabet?

Not sure if possible, but ideally a<Tab> would also include nicks beginning with ą ã å あ etc., similar to how in some clients it already skips over any leading punctuation (a<Tab> → [Attila]).

What if the client is configured to use not-UTF-8-charset?

Clients which support CASEMAPPING=rfc7700 would always decode nicknames as UTF-8, regardless of the configured message encoding.

Existing clients would work the same way they already do when someone sends a UTF-8 message (i.e. some would detect UTF-8 anyway, others would mis-decode it as ISO-8859-42 or whatever such).
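A small Python sketch of that client-side decision, assuming the server advertises `CASEMAPPING=rfc7700` in its ISUPPORT (005) tokens as suggested above. The function names and the `latin-1` fallback are assumptions for illustration.

```python
def parse_isupport(params):
    """Turn ISUPPORT tokens such as 'CASEMAPPING=rfc7700' into a dict."""
    tokens = {}
    for param in params:
        key, _, value = param.partition("=")
        tokens[key] = value
    return tokens

def decode_nick(raw, casemapping, fallback="latin-1"):
    """Decode nickname bytes according to the advertised casemapping."""
    if casemapping == "rfc7700":
        # A Unicode casemapping implies UTF-8 nicknames, regardless of the
        # encoding the user configured for message bodies.
        return raw.decode("utf-8", errors="replace")
    # Legacy behaviour: try UTF-8 first, then the configured fallback.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)

tokens = parse_isupport(["CASEMAPPING=rfc7700", "NICKLEN=30"])
print(decode_nick("\u0105\u00e3\u00e5".encode("utf-8"), tokens["CASEMAPPING"]))
```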

What if the person using UTF-8 nick uses something that my client cannot show due to old glibc in my system which is in the wild?

🤷

I guess it'd be less likely to happen if only "word" characters were accepted, similar to how Python etc. filter characters allowed in variable names.

DanielOaks · 2016-05-04T11:19:23Z

ideally a<Tab> would also include nicks beginning with ą ã å あ etc., similar to how in some clients it already skips over any leading punctuation (a<Tab> → [Attila]).

So long as the client takes the casefolding into account when evaluating tab-complete matches, should work without an issue I'd imagine.
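As a sketch of what that could mean in a client, the Python below compares casefolded forms and also ignores leading punctuation. Python's `str.casefold()` stands in for whatever casemapping the server advertises, and `LEADING_PUNCT`, `match_key` and `complete` are hypothetical names.

```python
# Assumption: punctuation a client might skip at the start of a nick.
LEADING_PUNCT = "[]{}|\\`^_-"

def match_key(nick):
    """Key used for matching: leading punctuation dropped, then casefolded."""
    return nick.lstrip(LEADING_PUNCT).casefold()

def complete(prefix, nicks):
    """Return nicks whose raw or punctuation-stripped form starts with prefix."""
    wanted = prefix.casefold()
    return [
        nick for nick in nicks
        if match_key(nick).startswith(wanted)
        or nick.casefold().startswith(wanted)
    ]

print(complete("a", ["[Attila]", "bob", "Alice"]))
```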

MicroDroid · 2016-05-04T13:27:10Z

Maybe we can do some math in the IRC server to create an alias and send to the client? so the client uses the alias to complete the actual nick?

Like if the nick is ąãå, then the IRC server does some math and create aaa out of it, and send to the client during NAMES or something

This way the user can just put like aa<Tab> → ąãå

Or, the math part might be left up to the clients, as the whole thing is really client side anyways.

clokep · 2016-05-04T13:29:44Z

I'd suggest it's just up to the clients to implement tab completion in a sane manner. User interfaces shouldn't be specced in a protocol.

MicroDroid · 2016-05-04T13:30:51Z

Right.
So either way this problem is avoidable.

grawity · 2016-05-04T13:48:48Z

Hmm, how do people use tab-completion in the existing ISO-2022-JP networks?

Mikaela · 2016-05-04T15:10:21Z

Like if the nick is ąãå, then the IRC server does some math and create aaa out of it, and send to the client during NAMES or something

This way the user can just put like aa → ąãå

Wouldn't this just mean that aaa / ąãå were the same nick and all variations of ąãå which the IRCd would interpret to aaa and get very confusing? This is why I gave 👎 to your comment.

RyanSquared · 2016-05-04T15:13:49Z

Like if the nick is ąãå, then the IRC server does some math and create aaa out of it, and send to the client during NAMES or something

This way the user can just put like aa → ąãå

Wouldn't this just mean that aaa / ąãå were the same nick and all variations of ąãå which the IRCd would interpret to aaa and get very confusing? This is why I gave 👎 to your comment.

Another idea could be to treat it like capitalizations? ąãå == aaa == AAA but it's not translated to the same characters.

kaniini · 2016-05-04T15:38:15Z

Proposed client behaviour would be in a non-normative part of the spec at best, so it's not even worth bothering with. I suspect with the way this discussion is going, this will be an area where the IRCv3 process fails us and we just form a coalition of IRCd vendors to make it happen, and then IRCv3 maybe documents it after the point.

grawity · 2016-05-04T16:53:15Z

So business as usual, then?

DanielOaks · 2016-05-04T18:44:25Z

Pretty much what @kaniini says. It's not a huge issue to worry about.

sdaugherty · 2016-07-28T11:04:32Z

I'd still be concerned about breakage - even clients which support UTF-8 messages likely have made assumptions about nicknames, particularly any clients which support tab completion or which maintain a cached member list for channels for some purpose. I'd be afraid that this is likely to expose a lot of undefined behaviors around input sanitation of nicknames received from the server (or the lack thereof).

Some possible manifestations of incompatibility with UTF-8 nicknames

Commands applied to the wrong user
Broken tab completion
"null" users in internal caches
garbled characters

Some of these issues already exist today with channel names, and chat messages, but nicknames are more fundamental, as they are identifiers that the client absolutely has to deal with correctly - if a channel name breaks a client, the user can avoid that channel, a user can't necessarily choose to avoid all users with UTF-8 nicknames.

There's also a severe usability concern that needs to be addressed - a channel operator MUST be able to quickly and unambiguously specify nicknames for use in commands with only keyboard input, regardless of what language's characters might happen to be in those nicknames. Even if that client properly supports UTF-8 nicknames, if the use of such nicknames complicates the effective management of channels in the slightest, then user acceptance of internationalized nicknames will either be dead in the water as a feature users rebel against, or there will be demands for restrictive channel modes to prohibit all internationalized nicknames on a channel.

(Yes, I realize that in most cases, a user has access to a GUI, tab completion, or copy/paste, but there is no guarantee of this - there are environments where none of these will be a viable option. Tab completion, for example, often requires the user specify at least a partial match, or requires them to iterate through every nickname on the channel, copy/paste may not be available if the user is at an actual console session rather than running a terminal inside a GUI, GUI userlists aren't available in a terminal, and so on.)

kaniini · 2016-07-29T18:16:22Z

rfc7700, when properly implemented, handles all of those issues and more. have you read it?

sdaugherty · 2016-07-29T23:38:50Z

I have, and it is so extremely light on practical details about exactly how it would be implemented within the IRC protocol that it leaves more questions than answers. While IRC is mentioned as a possible application, aside from that mention, the rest of the RFC consists of a set of guidelines that can be generically applied to problems inherent in nickname internationalization across a wide variety of existing and future protocols.

While the specifications set out in the RFC address a number of potential issues, the lack of any formal guidance of how to integrate them into the IRC protocol, combined with a lack of IRC specific recommendations effectively make it nothing more than a building block, and my concerns from a user standpoint above about IRC-specific implementation details remain at most partially addressed by RFC7700.

Of more concern, there are some security considerations that should be readily apparent to any long time user of IRC, which are not mentioned - specifically, the potential for disruption if the effective use of channel management and ignore functionality is obstructed or defeated by internationalized nicknames. This is especially important here because users might first have to learn how to deal with inputting i18n nicknames while under the pressure of an ongoing disruption or attack.

Any demonstration or reference implementations will have to be especially aware of these and other considerations, to avoid an implementation that is perceived as creating more problems than it solves.

realJoshByrnes · 2016-08-12T04:02:15Z

If you look at the IRCX Draft v04 (Microsoft, 1998) it provides a way to allow Unicode nicknames in IRC.

Clients that don't support Unicode (non-IRCX in the draft) see:

Non unicode nicknames as usual
Unicode nicknames as '^' followed by the hex representation of the nickname.

This has been supported in many clients / servers since the 1990s, why not use it?
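A Python sketch of the fallback as described in this comment. It assumes that "hex representation" means the hexadecimal form of the nick's UTF-8 bytes, which is an interpretation and has not been checked against the IRCX draft; `to_legacy_nick` and `from_legacy_nick` are hypothetical names.

```python
def to_legacy_nick(nick):
    """Form shown to clients without Unicode support: '^' + hex of UTF-8 bytes."""
    if nick.isascii():
        return nick
    return "^" + nick.encode("utf-8").hex().upper()

def from_legacy_nick(wire):
    """Best-effort reverse mapping; anything that is not valid hex is left alone."""
    if wire.startswith("^"):
        try:
            return bytes.fromhex(wire[1:]).decode("utf-8")
        except (ValueError, UnicodeDecodeError):
            # '^' is also a legal character in plain RFC 1459 nicks.
            return wire
    return wire

print(to_legacy_nick("\u0105"))  # the nick "ą"
```

Note that the encoded form grows quickly: every non-ASCII character costs four or more hex digits, which matters on networks with short nickname length limits.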

TingPing · 2016-08-12T09:00:25Z

This has been supported in many clients / servers since the 1990s, why not use it?

Curious which ones?

Marqin · 2016-08-20T08:44:21Z

Just remember to ban for security reasons all Unicode confusable symbols (allow only one version of those chars).

dpyro · 2016-10-18T20:58:23Z

What about handling emojis? 👸🏻 may appear as either multiple characters or a single character, while being visually different from or identical to 👸 depending on the system or application support. Additionally, many clients will use a shortcode such as :joy: to ease input of emoji. If a user wants to join #😂, they may use #:joy:, which would be a different room entirely.

Both #😂 and #:joy: appear to be valid and distinct channel names on QuakeNet. 💩.la is a valid domain name and website link on my system (macOS/Safari).
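The difference is visible at the code point level. A short Python sketch, assuming the two emoji above are U+1F478 with and without the U+1F3FB skin tone modifier; `code_points` is a hypothetical helper.

```python
import unicodedata

def code_points(text):
    """List each code point in text with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '?')}" for ch in text]

princess = "\U0001F478"
princess_light = "\U0001F478\U0001F3FB"  # base emoji plus skin tone modifier

# One rendered glyph on many systems, but two code points on the wire.
print(len(princess_light), code_points(princess_light))
# A naive comparison treats the two names as different strings.
print(princess == princess_light)
# The shortcode is plain ASCII text and never equals the emoji itself.
print("#\U0001F602" == "#:joy:")
```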

DanielOaks · 2016-10-19T01:08:23Z

Shortcodes are handled explicitly by the client (i.e. if the client wants to convert them then cool), the protocol doesn't treat shortcodes any differently or give them any special conversion. At least in #272 right now, it allows emoji as a part of names so far as rfc7700 does, but servers are free to block whatever characters they want.

fantasai · 2017-03-29T03:28:35Z

@grawity NFKD + case folding is likely to help with the tab completion for accented characters. Decomposing characters will let you handle diacritics by either matching or skipping them, and compatibility decompositions will handle a lot of other stuff. (It's designed for search operations.) See http://unicode.org/reports/tr15/

But Unicode mapping tables won't handle things like a -> あ, since they don't have tables for romanization of non-Latin scripts. It's not really an easily standardizable thing... in many languages, there's multiple possibilities; e.g. Chinese has several formalized romanization schemes in common use, and Persian is romanized rather haphazardly by Persian-speakers.
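A minimal Python sketch of the NFKD plus case folding approach for completion, where diacritics are handled by skipping combining marks after decomposition. `search_key` and `complete_fuzzy` are hypothetical names, and this is a client-side matching aid only, not a server casemapping.

```python
import unicodedata

def search_key(text):
    """NFKD-decompose, drop combining marks, then casefold."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def complete_fuzzy(prefix, nicks):
    """Return nicks whose search key starts with the prefix's search key."""
    wanted = search_key(prefix)
    return [nick for nick in nicks if search_key(nick).startswith(wanted)]

nicks = ["\u0105\u00e3\u00e5", "\u3042\u3044", "bob"]  # "ąãå", "あい", "bob"
print(complete_fuzzy("a", nicks))
```

Here a<Tab> finds ąãå because each letter decomposes to "a" plus a combining mark, while あい is not matched, since no Unicode decomposition romanizes kana.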

jwheare · 2018-04-23T09:58:13Z

Worth considering whether just using a metadata key might resolve this sufficiently. e.g. display-name described here: #336

jwheare mentioned this issue Jul 28, 2016

nickname i18n using a variation of punycode ircv3/ircv3-ideas#2

Closed

DanielOaks mentioned this issue Sep 15, 2016

Add document for Unicode casemapping #272

Closed

Mikaela mentioned this issue Dec 5, 2016

Channel names do not support unicode characters freenode/ircd-seven#20

Open

jwheare added the protocol label Jan 7, 2017

slingamn mentioned this issue Sep 26, 2017

Allow specific emoji in nicks/channels ergochat/ergo#137

Closed

lorkki mentioned this issue Feb 4, 2019

Add fallback encoding for non-UTF-8 lines kiwiirc/irc-framework#142

Closed
