What to do about ZWJ emoji sequences #40071

Closed
Keno opened this issue Mar 16, 2021 · 7 comments · Fixed by JuliaLang/JuliaSyntax.jl#372
Labels
😃🍕 and other emoji parser Language parsing and surface syntax unicode Related to unicode characters and encodings

Comments

@Keno
Member

Keno commented Mar 16, 2021

Some emoji are composed of sequences of other emoji, combined by ZWJ (zero-width joiner, U+200D). At the moment, ZWJ is disallowed in the parser, so these emoji cannot be used, e.g.:

julia> 🏳️‍🌈
ERROR: syntax: invisible character \u200d near column 2
Stacktrace:
 [1] top-level scope
   @ none:1

This is because 🏳️‍🌈 is really 🏳️[ZWJ]🌈. We should decide what to do here, since use of these sequences is likely to expand in future Unicode versions. One option is of course to just do nothing and continue to disallow these. Another option may be to normalize out the ZWJ in emoji sequences and treat the result as equivalent to the constituent emoji next to each other (since that's what they look like if ZWJ sequences are not supported by the font renderer).
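To make the invisible character visible, here is a minimal sketch that lists the code points of the flag sequence and then applies the "normalize out the ZWJ" option. The flag is written with escapes so every code point is explicit; the variable names `flag` and `stripped` are just for illustration.

```julia
# 🏳️‍🌈 spelled out: white flag, variation selector-16, ZWJ, rainbow
flag = "\U1F3F3\uFE0F\u200D\U1F308"

# Print each code point in U+XXXX form
for c in flag
    println("U+", uppercase(string(UInt32(c), base = 16, pad = 4)))
end

# The "normalize out the ZWJ" option: drop U+200D, keep the constituents
stripped = replace(flag, '\u200d' => "")
```

The third code point printed is U+200D, which is the character the parser rejects.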

@Keno Keno added 😃🍕 and other emoji unicode Related to unicode characters and encodings labels Mar 17, 2021
@JeffBezanson JeffBezanson added the parser Language parsing and surface syntax label Mar 17, 2021
@JeffBezanson
Member

Just allowing ZWJ as a (non-initial) identifier character is not a bad option. Sequences of emoji with and without the ZWJ are distinct, so to the extent we allow emoji it seems we should simply allow it.

@Keno
Member Author

Keno commented Mar 17, 2021

What about ZWJ outside of emoji sequences? Unicode doesn't specify what emoji sequences are valid to join, but we could look at the character class to disallow it if what's being joined is not an emoji.
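A rough sketch of that check, under a loud assumption: "is an emoji" is approximated here by the Unicode general category So (Symbol, other), which is not the same as the emoji properties defined in UTS51 (skin-tone modifiers, for example, fall outside it). The helper names `is_emoji_like` and `zwj_only_between_emoji` are hypothetical, and `Base.Unicode.category_code` is an internal, non-exported function.

```julia
const ZWJ  = '\u200d'   # zero-width joiner
const VS16 = '\ufe0f'   # emoji presentation selector

# Assumption: treat general category So as "emoji-like".
is_emoji_like(c::Char) =
    Base.Unicode.category_code(c) == Base.Unicode.UTF8PROC_CATEGORY_SO

# True if every ZWJ in `s` has an emoji-like character on both sides.
function zwj_only_between_emoji(s::AbstractString)
    cs = collect(s)
    for (i, c) in enumerate(cs)
        c == ZWJ || continue
        # A ZWJ needs a character on each side.
        (1 < i < length(cs)) || return false
        j = i - 1
        # Step back over an emoji presentation selector, if present.
        if cs[j] == VS16 && j > 1
            j -= 1
        end
        (is_emoji_like(cs[j]) && is_emoji_like(cs[i + 1])) || return false
    end
    return true
end

zwj_only_between_emoji("\U1F3F3\uFE0F\u200D\U1F308")    # flag sequence
zwj_only_between_emoji("a\u200dweird\u200didentifier")  # ZWJ between letters
```

A real implementation would presumably use the emoji data tables rather than a general-category test.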

@JeffBezanson
Member

Sounds good to me.

@mgkuhn
Contributor

mgkuhn commented Mar 20, 2021

See also Unicode Standard Annex #31 “Unicode Identifier and Pattern Syntax” Section 2.3:

Implementations that allow emoji characters in identifiers should also normally allow emoji sequences. These are defined in ED-17, emoji sequence in UTS51. In particular, that means allowing ZWJ characters, emoji presentation selector (U+FE0F), and TAG characters, but only in the particular defined contexts described in UTS51.

@PallHaraldsson
Contributor

We should decide what to do here

Maybe do nothing, and the status quo is just OK? Are there any emoji that we need to support (e.g. some ZWJ sequences for math)? Because you'll be opening a large can of worms:

"This Emoji ZWJ Sequence has not been Recommended For General Interchange (RGI) by Unicode. Expect limited cross-platform support." for e.g. https://emojipedia.org/family-woman-woman-boy-girl/

Copying and pasting it into the REPL works, but incorrectly, as four heads, not as a square image. Otherwise I have nothing against gender neutral, or "food baby": https://blog.emojipedia.org/why-is-there-a-pregnant-man-emoji/

Maybe just close this, we already have the best Unicode support?

@stevengj
Member

stevengj commented Oct 23, 2023

Allowing ZWJ/U+200D as a non-initial character would be the simplest option, but if you allow it between arbitrary characters then it does allow a whole new type of obfuscated code, e.g. "a\u200dweird\u200didentifier" displays as a‍weird‍identifier (looking like aweirdidentifier) in some software (and as a weird identifier in some other terminals).

And if you only allow it in emoji sequences it makes the parser more complicated, though not insurmountably so. (This is what Unicode Annex 31 recommends for languages adopting the emoji profile. Not that we hew particularly close to Annex 31 in any case.)

One compromise option would be to allow ZWJ as any non-initial character, but to normalize it away in non-emoji sequences (or normalize it away entirely) — that way the complexity is pushed to the symbol normalization, out of the parser.
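As an illustration of the "normalize it away entirely" variant, here is a minimal sketch. The helper `strip_zwj` is hypothetical and stands in for whatever step the symbol normalization would actually perform.

```julia
const ZWJ = '\u200d'

# "Normalize it away entirely": drop every ZWJ before comparing names.
strip_zwj(s::AbstractString) = replace(s, ZWJ => "")

obfuscated = "a\u200dweird\u200didentifier"
plain      = "aweirdidentifier"

obfuscated == plain             # false: the strings differ by two code points
strip_zwj(obfuscated) == plain  # true: after normalization they are one name
```

Note that stripping ZWJ everywhere would also make 🏳️‍🌈 and 🏳️🌈 the same identifier, which is why the other variant limits normalization to non-emoji sequences.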

@stevengj
Member

stevengj commented Oct 24, 2023

I just had a brainwave — what we really want here is to not break identifiers within graphemes. So, an improved rule for identifiers would be:

  • identifier consists of a sequence of graphemes, each of which may begin with one of the currently allowed characters (i.e. same allowed characters at the start of the identifier, and same allowed characters afterwards).

This way we only need one more bit of state during parsing, for the grapheme-break state in utf8proc, and it will handle all of the emoji rules etcetera for us, and won't allow ZWJ in non-emoji sequences.

Edit: actually, because of grapheme rule GB9, this will allow ZWJ in non-emoji sequences at the end of the identifier. So, we would need one more rule: identifiers cannot end with ZWJ.
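A literal reading of this rule can be sketched with the `Unicode` standard library. This is only an approximation of what a parser would do: `grapheme_identifier_ok` is a hypothetical name, and `Base.is_id_start_char` and `Base.is_id_char` are internal, non-exported functions, assumed here to reflect the "currently allowed characters".

```julia
using Unicode  # standard library; provides `graphemes`

const ZWJ = '\u200d'

# Accept `s` if each grapheme begins with an allowed character
# and the string does not end with a ZWJ.
function grapheme_identifier_ok(s::AbstractString)
    isempty(s) && return false
    last(s) == ZWJ && return false   # extra rule: no trailing ZWJ
    for (i, g) in enumerate(graphemes(s))
        c = first(g)
        # First grapheme uses the start-character rule, later ones the
        # continuation rule; the rest of each grapheme is not inspected.
        ok = i == 1 ? Base.is_id_start_char(c) : Base.is_id_char(c)
        ok || return false
    end
    return true
end

grapheme_identifier_ok("abc")
grapheme_identifier_ok("abc\u200d")   # rejected by the trailing-ZWJ rule
```

Because GB9 attaches a ZWJ to whatever character precedes it, this literal check only looks at the character that starts each grapheme, so how non-emoji ZWJ in the middle of a name is treated would depend on how the rule is refined.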


5 participants