Allow Unicode in nicknames #259

Open
ttepasse opened this issue May 3, 2016 · 37 comments

@ttepasse

ttepasse commented May 3, 2016

RFC 1459 only allows ASCII letters, numerals and some special characters in nicknames, leaving people from non-anglophone countries at a disadvantage. Using the wealth of human writing is already possible in the body of messages; it should be possible in nicknames too.

@SadieCat
Contributor

SadieCat commented May 3, 2016

There are existing implementations of this (e.g. InspIRCd's m_nationalchars) but nothing standard. I believe that @DanielOaks was looking into trialling RFC 3454 in @mammon-ircd with a desire for standardising it though.

It isn't as simple as just allowing it though. Compatibility is a concern (there are clients which break when they get a CASEMAPPING which is not ascii or rfc1459) as well as masquerading with characters that look similar (e.g. character 97 "a" looks very similar to character 1072 "а").
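
To make the masquerading concern concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The helper name `same_after_folding` is made up for this illustration. It shows that the two characters are distinct code points and that ordinary normalization plus case folding does not merge them, so a server would need an extra confusables check on top of any casemapping.

```python
import unicodedata

latin_a = chr(97)       # LATIN SMALL LETTER A
cyrillic_a = chr(1072)  # CYRILLIC SMALL LETTER A, visually near-identical

def same_after_folding(left: str, right: str) -> bool:
    """Compare two strings after NFKC normalization and case folding."""
    def fold(text: str) -> str:
        return unicodedata.normalize("NFKC", text).casefold()
    return fold(left) == fold(right)

print(unicodedata.name(latin_a))
print(unicodedata.name(cyrillic_a))

# Case differences are folded away, the Latin/Cyrillic difference is not.
print(same_after_folding("Admin", "admin"))
print(same_after_folding("admin", cyrillic_a + "dmin"))
```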

@clokep
Contributor

clokep commented May 3, 2016

There are also cases of servers improperly implementing rfc1459 vs. strict-rfc1459 (see inspircd/inspircd#1017).

Ideally, wouldn't we want this to match how it is done for channel names?

@TingPing
Contributor

TingPing commented May 3, 2016

For what it's worth, I made a test branch for hexchat supporting rfc3454, though afaik no network implements it, so there is nowhere to try it.

@clokep
Contributor

clokep commented May 3, 2016

For what it's worth, we just experimented a bit on moznet, and a zero-width space character is accepted as a valid room name, which shows up as an empty name in whois:

(Additionally, there's also a channel which is just the prefix, #, which is a bit funky.)

[Screenshot, 2016-05-03 at 2:06:50 pm]
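
A minimal Python sketch of a check that would catch both cases above (a name made only of invisible characters, and a bare prefix). The function name and the set of "invisible" categories are assumptions chosen for this illustration, not something any IRCd is known to use.

```python
import unicodedata

# General categories treated as invisible in this sketch: controls, format
# characters (which include U+200B ZERO WIDTH SPACE) and separators.
INVISIBLE_CATEGORIES = {"Cc", "Cf", "Zs", "Zl", "Zp"}

def has_visible_name(channel: str, prefix: str = "#") -> bool:
    """Return True if something visible remains after the channel prefix."""
    name = channel[len(prefix):] if channel.startswith(prefix) else channel
    return any(unicodedata.category(ch) not in INVISIBLE_CATEGORIES for ch in name)

print(has_visible_name("#ircv3"))
print(has_visible_name("#\u200b"))  # only a zero-width space after the prefix
print(has_visible_name("#"))        # just the prefix
```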

@dequis
Contributor

dequis commented May 3, 2016

Relevant reading:

UTR 36: Unicode Security Considerations

UTS 39: Unicode Security Mechanisms

Bitlbee has a 'utf8_nicks' setting, disabled by default and with a small warning about potential breakage in the help text. It doesn't perform any cleanup, deferring that to the IM server (XMPP, for example, cleans them with the nodeprep/resourceprep profiles of stringprep), but I'd really like to change this.

I haven't heard of clients with big issues when enabling this, just minor visual issues like miscalculating the width when displaying the nicks in a terminal.
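
The width miscalculation usually comes from counting code points instead of terminal columns. A rough Python sketch of a column count, assuming the common convention that East Asian wide and fullwidth characters take two columns and combining marks take none (`display_width` is a made-up helper; real terminals differ in details such as emoji and zero-width format characters):

```python
import unicodedata

def display_width(text: str) -> int:
    """Estimate how many terminal columns a string occupies."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn on top of the previous character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide and fullwidth characters take two columns
        else:
            width += 1
    return width

print(len("あ"), display_width("あ"))
print(len("a\u0301"), display_width("a\u0301"))
```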

@MicroDroid

How are we going to maintain the backward compatibility?
I'd upvote this otherwise

@grawity
Contributor

grawity commented May 3, 2016

In practice, it's probably already compatible because many clients don't care.

@MicroDroid

Hmm, well then this should really be in IRCv3.2, it's awesome

@DanielOaks
Member

masquerading with characters that look similar (e.g. character 97 "a" looks very similar to character 1072 "а").

With rfc3454 casemapping I believe we use the nameprep profile to prevent issues like this. It would be good to read through documents like those in detail to make sure we do things right if we're standardising it though.

So long as you continue to disallow characters that break the protocol (i.e. commas, periods in client names, etc), and reject nicks/channel names that fail to casefold (i.e. strings that fail because they contain a character prohibited by the profile), I haven't seen too many issues with it.
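
A minimal sketch of that rule in Python, using the `nameprep` function that ships with the standard library's `encodings.idna` module. The `PROTOCOL_BREAKING` set and the `fold_nick` helper are assumptions made up for this example; a real server would use its own list of forbidden characters.

```python
from encodings.idna import nameprep
from typing import Optional

# Hypothetical set of characters that would break IRC message parsing.
PROTOCOL_BREAKING = set(" ,.*?!@:\r\n\x00")

def fold_nick(nick: str) -> Optional[str]:
    """Return the casefolded nick, or None if the nick must be rejected."""
    if not nick or any(ch in PROTOCOL_BREAKING for ch in nick):
        return None
    try:
        folded = nameprep(nick)  # maps, NFKC-normalizes, checks prohibited characters
    except UnicodeError:
        return None              # contains a character the profile prohibits
    return folded or None        # reject names that fold to the empty string

print(fold_nick("\u00c4rger"))     # folds to lowercase
print(fold_nick("a,b"))            # None: comma breaks the protocol
print(fold_nick("bad\ue000"))      # None: private-use character is prohibited
```

Two nicks would then be considered the same whenever `fold_nick` returns equal strings for them.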

@kaniini
Contributor

kaniini commented May 3, 2016

In charybdis, we plan to implement rfc7700 "casemapping", which is the same as rfc3454 nameprep except using IDN2008 rules, with specific requirements for "nicknames".

@TingPing
Contributor

TingPing commented May 4, 2016

How are we going to maintain the backward compatibility?

It is a joke, but it is a solution: convert it to punycode (or similar) for non-Unicode clients.
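
For what the punycode idea could look like, here is a small Python sketch using the standard `punycode` codec. The `xn--` prefix is borrowed from IDNA purely for illustration, and `to_legacy` / `from_legacy` are hypothetical helpers; nothing in this thread specifies such a mapping.

```python
LEGACY_PREFIX = "xn--"  # assumption: marker borrowed from IDNA for this sketch

def to_legacy(nick: str) -> str:
    """ASCII-only rendering of a nick for clients that cannot handle Unicode."""
    if nick.isascii():
        return nick
    return LEGACY_PREFIX + nick.encode("punycode").decode("ascii")

def from_legacy(shown: str) -> str:
    """Inverse of to_legacy for names carrying the prefix."""
    if not shown.startswith(LEGACY_PREFIX):
        return shown
    return shown[len(LEGACY_PREFIX):].encode("ascii").decode("punycode")

print(to_legacy("alice"))
print(to_legacy("b\u00fccher"))
print(from_legacy(to_legacy("ąãå")))
```

One obvious weakness: an ordinary ASCII nick that happens to start with the prefix becomes ambiguous, so a real scheme would have to reserve it.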

In practice, it's probably already compatible because many clients don't care.

Not sure what you mean by that; many clients respect the casemapping and rely upon its behavior.

@DanielOaks
Member

@kaniini That makes sense, once it's implemented/specced out give me a yell and I can see about switching my personal stuff over to use it as well.

@kaniini
Contributor

kaniini commented May 4, 2016

How are we going to maintain the backward compatibility?

There is no plan in charybdis for backwards compatibility. Deployments which switch from rfc1459 to rfc7700 casemapping will assume clients support UTF-8 properly. Networks will decide on their own when to make the switch, or whether to make it at all.

@Mikaela
Contributor

Mikaela commented May 4, 2016

How would tab completion work if someone on an international channel used a nick that is not in the Latin alphabet?

What if the client is configured to use a non-UTF-8 charset?

What if the person using a UTF-8 nick uses something that my client cannot show due to the old glibc on my system, which is in the wild? (ref: weechat/weechat#79)

@DanielOaks
Member

How would tab completion work if someone on an international channel used a nick that is not in the Latin alphabet?

Presumably the same as it does with the Latin alphabet.

What if the client is configured to use a non-UTF-8 charset?

I think detecting a specifically UTF-8-based casemapping from the server should make the client default to using UTF-8, if they're not already. If the user decides not to use it, they may get corrupted characters, just like what happens today when two clients using utf8 and non-utf8 try to send weird characters to each other.
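
A minimal Python sketch of that detection, assuming the server advertises the casemapping in a standard `005` (RPL_ISUPPORT) line. The token values `rfc3454` and `rfc7700` are the names used in this thread, not registered values, and both helper functions are made up for this example.

```python
UTF8_CASEMAPPINGS = {"rfc3454", "rfc7700"}  # assumption: names as used in this thread

def parse_isupport(line: str) -> dict:
    """Parse ':server 005 nick KEY=VALUE ... :trailing text' into a dict."""
    head, _, _trailing = line.partition(" :")
    tokens = head.split()[3:]  # skip server prefix, numeric and our own nick
    parsed = {}
    for token in tokens:
        key, _, value = token.partition("=")
        parsed[key] = value
    return parsed

def nickname_encoding(isupport: dict, configured: str) -> str:
    """Use UTF-8 when the casemapping implies it, else the user's setting."""
    casemapping = isupport.get("CASEMAPPING", "rfc1459").lower()
    return "utf-8" if casemapping in UTF8_CASEMAPPINGS else configured

line = ":irc.example.net 005 alice CASEMAPPING=rfc7700 NICKLEN=30 :are supported by this server"
print(nickname_encoding(parse_isupport(line), configured="iso-8859-15"))
```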

What if the person using a UTF-8 nick uses something that my client cannot show due to the old glibc on my system, which is in the wild?

Then it will not show those characters because your client (or the system you're using it on) does not work properly. I don't think this is an issue for us to worry about. It's a bug that will get fixed by more distros over time, and I especially think it will be fixed enough for us not to care about it by the time a Unicode casemapping actually gets into proper usage.

@attilamolnar
Contributor

I fully support moving away from legacy rfc1459 towards rfc7700.

@grawity
Contributor

grawity commented May 4, 2016

How would tab completion work if someone on an international channel used a nick that is not in the Latin alphabet?

Not sure if possible, but ideally a<Tab> would also include nicks beginning with ą ã å あ etc., similar to how in some clients it already skips over any leading punctuation (a<Tab>[Attila]).

What if the client is configured to use a non-UTF-8 charset?

Clients which support CASEMAPPING=rfc7700 would always decode nicknames as UTF-8, regardless of the configured message encoding.

Existing clients would work the same way they already do when someone sends a UTF-8 message (i.e. some would detect UTF-8 anyway, others would mis-decode it as ISO-8859-42 or whatever such).
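
Both behaviours can be sketched in a few lines of Python. `decode_nick` is a hypothetical helper, and the ISO-8859-1 fallback stands in for whatever message encoding the user has configured.

```python
def decode_nick(raw: bytes, utf8_casemapping: bool, fallback: str = "iso-8859-1") -> str:
    """Decode a nickname received from the server."""
    if utf8_casemapping:
        # The casemapping implies UTF-8: ignore the configured message encoding.
        return raw.decode("utf-8", errors="replace")
    try:
        # Legacy behaviour in some clients: try UTF-8 first...
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ...and fall back to the configured legacy charset.
        return raw.decode(fallback, errors="replace")

print(decode_nick("ąãå".encode("utf-8"), utf8_casemapping=True))
print(decode_nick(b"\xe5ke", utf8_casemapping=False))  # not valid UTF-8, decoded as Latin-1
```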

What if the person using a UTF-8 nick uses something that my client cannot show due to the old glibc on my system, which is in the wild?

🤷

I guess it'd be less likely to happen if only "word" characters were accepted, similar to how Python etc. filter characters allowed in variable names.

@DanielOaks
Member

ideally a<Tab> would also include nicks beginning with ą ã å あ etc., similar to how in some clients it already skips over any leading punctuation (a<Tab>[Attila]).

So long as the client takes the casefolding into account when evaluating tab-complete matches, it should work without an issue, I'd imagine.

@MicroDroid

Maybe we can do some math in the IRC server to create an alias and send it to the client, so the client uses the alias to complete the actual nick?

Like if the nick is ąãå, then the IRC server does some math, creates aaa out of it, and sends it to the client during NAMES or something

This way the user can just put like aa<Tab>ąãå

Or, the math part might be left up to the clients, as the whole thing is really client side anyways.

@clokep
Contributor

clokep commented May 4, 2016

I'd suggest it's just up to the clients to implement tab completion in a sane manner. User interfaces shouldn't be specced in a protocol.

@MicroDroid

Right.
So either way this problem is avoidable.

@grawity
Contributor

grawity commented May 4, 2016

Hmm, how do people use tab-completion in the existing ISO-2022-JP networks?

@Mikaela
Contributor

Mikaela commented May 4, 2016

Like if the nick is ąãå, then the IRC server does some math, creates aaa out of it, and sends it to the client during NAMES or something

This way the user can just put like aa → ąãå

Wouldn't this just mean that aaa and ąãå were the same nick, along with every variation of ąãå that the IRCd would interpret as aaa, which would get very confusing? This is why I gave 👎 to your comment.

@RyanSquared
Contributor

Like if the nick is ąãå, then the IRC server does some math, creates aaa out of it, and sends it to the client during NAMES or something

This way the user can just put like aa → ąãå

Wouldn't this just mean that aaa and ąãå were the same nick, along with every variation of ąãå that the IRCd would interpret as aaa, which would get very confusing? This is why I gave 👎 to your comment.

Another idea could be to treat it like capitalizations? ąãå == aaa == AAA but it's not translated to the same characters.

@kaniini
Contributor

kaniini commented May 4, 2016

Proposed client behaviour would be in a non-normative part of the spec at best, so it's not even worth bothering with. I suspect with the way this discussion is going, this will be an area where the IRCv3 process fails us and we just form a coalition of IRCd vendors to make it happen, and then IRCv3 maybe documents it after the point.

@grawity
Contributor

grawity commented May 4, 2016

So business as usual, then?

@DanielOaks
Member

Pretty much what @kaniini says. It's not a huge issue to worry about.

@sdaugherty

sdaugherty commented Jul 28, 2016

I'd still be concerned about breakage. Even clients which support UTF-8 messages have likely made assumptions about nicknames, particularly any clients which support tab completion or which maintain a cached member list for channels for some purpose. I'd be afraid that this is likely to expose a lot of undefined behaviors around input sanitization of nicknames received from the server (or the lack thereof).

Some possible manifestations of incompatibility with UTF-8 nicknames:

  • Commands applied to the wrong user
  • Broken tab completion
  • "null" users in internal caches
  • Garbled characters

Some of these issues already exist today with channel names and chat messages, but nicknames are more fundamental, as they are identifiers that the client absolutely has to deal with correctly. If a channel name breaks a client, the user can avoid that channel; a user can't necessarily choose to avoid all users with UTF-8 nicknames.

There's also a severe usability concern that needs to be addressed: a channel operator MUST be able to quickly and unambiguously specify nicknames for use in commands with only keyboard input, regardless of what language's characters might happen to be in those nicknames. Even if that client properly supports UTF-8 nicknames, if the use of such nicknames complicates the effective management of channels in the slightest, then user acceptance of internationalized nicknames will either be dead in the water as a feature users rebel against, or there will be demands for restrictive channel modes to prohibit all internationalized nicknames on a channel.

(Yes, I realize that in most cases, a user has access to a GUI, tab completion, or copy/paste, but there is no guarantee of this - there are environments where none of these will be a viable option. Tab completion, for example, often requires the user specify at least a partial match, or requires them to iterate through every nickname on the channel, copy/paste may not be available if the user is at an actual console session rather than running a terminal inside a GUI, GUI userlists aren't available in a terminal, and so on.)

@kaniini
Contributor

kaniini commented Jul 29, 2016

rfc7700, when properly implemented, handles all of those issues and more. have you read it?

@sdaugherty

sdaugherty commented Jul 29, 2016

I have, and it is so extremely light on practical details about exactly how it would be implemented within the IRC protocol that it leaves more questions than answers. While IRC is mentioned as a possible application, aside from that mention, the rest of the RFC consists of a set of guidelines that can be generically applied to problems inherent in nickname internationalization across a wide variety of existing and future protocols.

While the specifications set out in the RFC address a number of potential issues, the lack of any formal guidance on how to integrate them into the IRC protocol, combined with a lack of IRC-specific recommendations, effectively makes it nothing more than a building block, and my concerns from a user standpoint above about IRC-specific implementation details remain at most partially addressed by RFC 7700.

Of more concern, there are some security considerations that should be readily apparent to any long-time user of IRC, which are not mentioned: specifically, the potential for disruption if the effective use of channel management and ignore functionality is obstructed or defeated by internationalized nicknames. This is especially important here because users might first have to learn how to deal with inputting i18n nicknames while under the pressure of an ongoing disruption or attack.

Any demonstration or reference implementations will have to be especially aware of these and other considerations, to avoid an implementation that is perceived as creating more problems than it solves.

@realJoshByrnes
Contributor

If you look at the IRCX Draft v04 (Microsoft, 1998) it provides a way to allow Unicode nicknames in IRC.

Clients that don't support Unicode (non-IRCX in the draft) see:

  1. Non unicode nicknames as usual
  2. Unicode nicknames as '^' followed by the hex representation of the nickname.

This has been supported in many clients / servers since the 1990s, why not use it?
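
As a rough illustration of that second rule only: the Python sketch below assumes the hex string is the uppercase hex of the nickname's UTF-8 bytes, which is a guess made for this example and not taken from the IRCX draft itself. Both function names are hypothetical.

```python
def legacy_view(nick: str) -> str:
    """What a non-Unicode client would see under the assumed hex scheme."""
    if nick.isascii():
        return nick  # rule 1: non-Unicode nicknames are shown as usual
    # rule 2 (assumed encoding): '^' followed by the hex of the UTF-8 bytes
    return "^" + nick.encode("utf-8").hex().upper()

def legacy_parse(shown: str) -> str:
    """Recover the Unicode nickname from the '^' + hex form."""
    if not shown.startswith("^"):
        return shown
    return bytes.fromhex(shown[1:]).decode("utf-8")

print(legacy_view("alice"))
print(legacy_view("ąãå"))
print(legacy_parse(legacy_view("ąãå")))
```

Since `^` is itself a legal character in RFC 1459 nicknames, a scheme like this would need some rule for ordinary nicks that start with it.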

@TingPing
Contributor

This has been supported in many clients / servers since the 1990s, why not use it?

Curious which ones?

@Marqin

Marqin commented Aug 20, 2016

Just remember to ban for security reasons all Unicode confusable symbols (allow only one version of those chars).
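
A crude Python sketch of one way to approach this: flag nicknames whose letters come from more than one script. It guesses the script from the first word of each character's Unicode name, which is only a heuristic and not the confusables data from UTS 39; the function names are made up for this example.

```python
import unicodedata

def scripts_used(nick: str) -> set:
    """Approximate the scripts in a nick from Unicode character names."""
    scripts = set()
    for ch in nick:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER A" -> "LATIN"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def looks_mixed_script(nick: str) -> bool:
    """True if letters from more than one script appear in the nick."""
    return len(scripts_used(nick)) > 1

print(looks_mixed_script("paypal"))
print(looks_mixed_script("p\u0430ypal"))  # second letter is Cyrillic
```

This would wrongly flag legitimate mixtures such as Japanese names combining hiragana, katakana and kanji, so a real check needs per-script exceptions.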

@dpyro

dpyro commented Oct 18, 2016

What about handling emojis? 👸🏻 may appear as either multiple characters or a single character while being visually different or identical to 👸 depending on the system or application support. Additionally, many clients will use a shortcode such as :joy: to ease input of emoji. If a user wants to join #😂, they may use #:joy: which would be a different room entirely.

Both #😂 and #:joy: appear to be valid and distinct channel names on QuakeNet. 💩.la is a valid domain name and website link on my system (macOS/Safari).
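
The difference is easy to see at the code point level. A short Python sketch (the `code_points` helper is made up for this example):

```python
plain = "\U0001F478"            # princess emoji
toned = "\U0001F478\U0001F3FB"  # same emoji followed by a skin-tone modifier

def code_points(text: str) -> list:
    """List the code points of a string in U+XXXX notation."""
    return ["U+%04X" % ord(ch) for ch in text]

print(code_points(plain), len(plain.encode("utf-8")))
print(code_points(toned), len(toned.encode("utf-8")))

# To the protocol these are simply two unrelated channel names.
print("#\U0001F602" == "#:joy:")
```

A server or client that compares names byte by byte will treat the two princess variants as different names even where a font renders them identically.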

@DanielOaks
Member

DanielOaks commented Oct 19, 2016

Shortcodes are handled explicitly by the client (i.e. if the client wants to convert them then cool), the protocol doesn't treat shortcodes any differently or give them any special conversion. At least in #272 right now, it allows emoji as a part of names so far as rfc7700 does, but servers are free to block whatever characters they want.

@fantasai

fantasai commented Mar 29, 2017

@grawity NFKD + case folding is likely to help with the tab completion for accented characters. Decomposing characters will let you handle diacritics by either matching or skipping them, and compatibility decompositions will handle a lot of other stuff. (It's designed for search operations.) See http://unicode.org/reports/tr15/

But Unicode mapping tables won't handle things like a -> あ, since they don't have tables for romanization of non-Latin scripts. It's not really an easily standardizable thing... in many languages, there are multiple possibilities; e.g. Chinese has several formalized romanization schemes in common use, and Persian is romanized rather haphazardly by Persian-speakers.
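
A minimal Python sketch of the NFKD plus case folding approach for client-side tab completion. The helper names are hypothetical, and the choice to skip diacritics (rather than match them) is just one of the two options mentioned above.

```python
import unicodedata

def completion_key(nick: str) -> str:
    """Decompose, drop combining marks (diacritics), then case fold."""
    decomposed = unicodedata.normalize("NFKD", nick)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def complete(prefix: str, nicks: list) -> list:
    """Return the nicks whose completion key starts with the typed prefix."""
    key = completion_key(prefix)
    return [nick for nick in nicks if completion_key(nick).startswith(key)]

nicks = ["ąãå", "Attila", "あきら", "bob"]
print(complete("a", nicks))
```

As noted, the hiragana nick is not matched by `a`, because no Unicode table romanizes it.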

@jwheare
Member

jwheare commented Apr 23, 2018

Worth considering whether just using a metadata key might resolve this sufficiently. e.g. display-name described here: #336
