What to do about ZWJ emoji sequences #40071

Keno · 2021-03-16T23:59:06Z

Some emoji are composed of sequences of other emoji, joined by ZWJ (zero-width joiner, U+200D). At the moment, ZWJ is disallowed in the parser, so these emoji cannot be used, e.g.:

julia> 🏳️‍🌈
ERROR: syntax: invisible character \u200d near column 2
Stacktrace:
 [1] top-level scope
   @ none:1

This is because 🏳️‍🌈 is really 🏳️[ZWJ]🌈. We should decide what to do here, since use of these sequences is likely to expand in future Unicode versions. One option is, of course, to do nothing and continue to disallow these. Another option may be to normalize out the ZWJ in emoji sequences and treat the result as equivalent to the constituent emoji next to each other (since that's what they look like if ZWJ sequences are not supported by the font renderer).
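For anyone who wants to see the pieces, here is a small sketch that spells the flag out with escapes so the invisible characters show up in the source. The variable name `flag` is just for illustration.

```julia
# The rainbow flag written with explicit escapes:
# WAVING WHITE FLAG, VARIATION SELECTOR-16, ZERO WIDTH JOINER, RAINBOW.
flag = "\U1F3F3\uFE0F\u200D\U1F308"

# Iterating a String yields code points (Char), not user-perceived characters.
for c in flag
    println("U+", uppercase(string(codepoint(c), base = 16, pad = 4)))
end

@assert length(flag) == 4      # four code points in total
@assert '\u200d' in flag       # one of them is the ZWJ the parser rejects
```

The loop prints U+1F3F3, U+FE0F, U+200D and U+1F308, one per line.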

JeffBezanson · 2021-03-17T20:39:35Z

Just allowing ZWJ as a (non-initial) identifier character is not a bad option. Sequences of emoji with and without the ZWJ are distinct, so to the extent we allow emoji it seems we should simply allow it.

Keno · 2021-03-17T20:43:26Z

What about ZWJ outside of emoji sequences? Unicode doesn't specify what emoji sequences are valid to join, but we could look at the character class to disallow it if what's being joined is not an emoji.

JeffBezanson · 2021-03-17T20:47:26Z

Sounds good to me.

mgkuhn · 2021-03-20T21:56:28Z

See also Unicode Standard Annex #31 “Unicode Identifier and Pattern Syntax” Section 2.3:

Implementations that allow emoji characters in identifiers should also normally allow emoji sequences. These are defined in ED-17, emoji sequence in UTS51. In particular, that means allowing ZWJ characters, emoji presentation selector (U+FE0F), and TAG characters, but only in the particular defined contexts described in UTS51.

PallHaraldsson · 2021-10-21T14:53:42Z

"We should decide what to do here"

Maybe do nothing; is the status quo just OK? Are there any emoji that we need to support (e.g. some ZWJ sequences for math)? Otherwise you'll be opening a large can of worms:

"This Emoji ZWJ Sequence has not been Recommended For General Interchange (RGI) by Unicode. Expect limited cross-platform support." for e.g. https://emojipedia.org/family-woman-woman-boy-girl/

Copying and pasting it into the REPL works, but it renders incorrectly, as four heads rather than as a single square image. Otherwise I have nothing against gender-neutral emoji, or "food baby": https://blog.emojipedia.org/why-is-there-a-pregnant-man-emoji/

Maybe just close this, since we already have the best Unicode support?

stevengj · 2023-10-23T21:22:40Z

Allowing ZWJ/U+200D as a non-initial character would be the simplest option, but if you allow it between arbitrary characters then it does allow a whole new type of obfuscated code, e.g. "a\u200dweird\u200didentifier" displays as a‍weird‍identifier (looking like aweirdidentifier) in some software (and as a weird identifier in some other terminals).
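A quick sketch of why this is a problem: the two strings below may render identically, yet they are different strings and so would be different identifiers. The variable names are only for illustration.

```julia
# Two spellings that can look the same on screen.
with_zwj = "a\u200dweird\u200didentifier"
plain    = "aweirdidentifier"

@assert with_zwj != plain                       # distinct strings
@assert length(with_zwj) == length(plain) + 2   # two extra invisible code points

# How these print depends entirely on the terminal or editor.
println(with_zwj)
println(plain)
```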

And if you only allow it in emoji sequences it makes the parser more complicated, though not insurmountably so. (This is what Unicode Annex 31 recommends for languages adopting the emoji profile. Not that we hew particularly close to Annex 31 in any case.)

One compromise option would be to allow ZWJ as any non-initial character, but to normalize it away in non-emoji sequences (or normalize it away entirely) — that way the complexity is pushed to the symbol normalization, out of the parser.
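As a rough sketch of the simplest variant (normalizing ZWJ away entirely), assuming a hypothetical helper `strip_zwj` that would run as part of symbol normalization; this is not an existing function in Julia:

```julia
# Hypothetical helper: drop every ZWJ (U+200D) from a name.
strip_zwj(name::AbstractString) = replace(name, '\u200d' => "")

obfuscated = "a\u200dweird\u200didentifier"
@assert strip_zwj(obfuscated) == "aweirdidentifier"

# Caveat: this also maps an emoji ZWJ sequence to its unjoined constituents.
flag = "\U1F3F3\uFE0F\u200D\U1F308"
@assert strip_zwj(flag) == "\U1F3F3\uFE0F\U1F308"
```

The second assertion shows the cost of this variant: the joined flag and the two separate emoji would become the same identifier, even though they are distinct sequences. Normalizing only in non-emoji sequences would avoid that, but it needs a way to tell emoji sequences apart.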

stevengj · 2023-10-24T13:03:09Z

I just had a brainwave — what we really want here is to not break identifiers within graphemes. So, an improved rule for identifiers would be:

An identifier consists of a sequence of graphemes, each of which may begin with one of the currently allowed characters (i.e. same allowed characters at the start of the identifier, and same allowed characters afterwards).

This way we only need one more bit of state during parsing, for the grapheme-break state in utf8proc, and it will handle all of the emoji rules etcetera for us, and won't allow ZWJ in non-emoji sequences.

Edit: actually, because of grapheme rule GB9, this will allow ZWJ in non-emoji sequences at the end of the identifier. So, we would need one more rule: identifiers cannot end with ZWJ.
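A minimal sketch of the rule as stated above, operating on a whole string rather than inside a lexer. The function name `is_grapheme_identifier` is hypothetical. `Base.is_id_start_char` and `Base.is_id_char` are internal, unexported Base functions, and the sketch assumes they stand in for "the currently allowed characters". This is not the code that ended up in the parser.

```julia
using Unicode  # standard library; provides graphemes()

# Hypothetical checker for the grapheme-based rule (illustration only).
function is_grapheme_identifier(s::AbstractString)
    isempty(s) && return false
    # Extra rule from the edit: an identifier cannot end with ZWJ.
    last(s) == '\u200d' && return false
    for (i, g) in enumerate(graphemes(s))
        c = first(g)  # only the first code point of each grapheme is checked
        ok = i == 1 ? Base.is_id_start_char(c) : Base.is_id_char(c)
        ok || return false
    end
    return true
end

@assert is_grapheme_identifier("x1")
@assert !is_grapheme_identifier("1x")        # a digit cannot start an identifier
@assert !is_grapheme_identifier("a\u200d")   # trailing ZWJ is rejected

# Result depends on the grapheme rules of the utf8proc version Julia ships with.
println(is_grapheme_identifier("\U1F3F3\uFE0F\u200D\U1F308"))
```

Whatever follows the first code point inside a grapheme is accepted without further checks, which is what lets the grapheme-break rules decide where ZWJ and variation selectors may appear.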

Keno added 😃🍕 and other emoji unicode Related to unicode characters and encodings labels Mar 17, 2021

JeffBezanson added the parser Language parsing and surface syntax label Mar 17, 2021

giordano mentioned this issue Sep 30, 2022: Flags don't look right in the repl #46982 (Open)

mbauman mentioned this issue Aug 31, 2023: Exclude emojis - to e.g. be gender-neutral #51130 (Closed)

giordano mentioned this issue Oct 20, 2023: support Unicode 15.1 via utf8proc 2.9.0 #51799 (Merged)

stevengj mentioned this issue Oct 24, 2023: handle ZWJ and emoji sequences, don't break identifiers within graphemes JuliaLang/JuliaSyntax.jl#372 (Merged)

stevengj closed this as completed in JuliaLang/JuliaSyntax.jl#372 Nov 1, 2023
