Exclude emojis - to e.g. be gender-neutral #51130

Closed
PallHaraldsson opened this issue Aug 31, 2023 · 5 comments

PallHaraldsson commented Aug 31, 2023

I came across this proposal for C23: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2836.pdf
"C Identifier Syntax using Unicode Standard Annex 31"

Prior art: C++, Rust

The allowed Unicode code points in identifiers include many that are unassigned or unnecessary, and
others that are actually counter-productive.

What will this proposal change?
All emoji become excluded, instead of just some
Emoji with code points less than FFFF, such as ✌, and ❤ are currently excluded from identifiers.

[I couldn't copy-paste the emojis from the PDF correctly, so I haven't tested them in Julia. Do we allow all emojis in Julia, or just some?]

bool 👷‍♀️ = false; // Female Construction Worker
// ({Construction Worker}{ZWJ}{Female Sign})
but this is valid:
bool 👷 = true; // (Male) Construction Worker

Other oddities of disallowed vs. allowed: [an example with different clock emojis, which I couldn't copy-paste either.]

Zero Width Joiner and Zero Width Non-Joiner become excluded.

The available set of identifiers changes over time

Why now
One driving factor for addressing this now is that GCC has fixed their long standing bug 67224 "UTF-8
support for identifier names in GCC". Clang has always supported all the allowed code points in
source code. MSVC in its usual configuration defaults to code page 1252, but can be told to accept
UTF-8 source.

It's possible their reasons don't apply to Julia: C and C++ are standardized, and they want exactly the same identifiers to work across compilers, no more and no less. Julia could allow a superset of what they allow, but we should look into whether they have some other good reason to exclude something.

For us, restricting identifiers is a breaking change, at least technically. If we want to do it, we should do it as soon as possible: on master first, and then in 1.10 before its release, in case 1.10 becomes the LTS. It would be strange to restrict later while the LTS allows more.

We could always expand the set again in a later (even point) release, and (conservatively) on 1.10/LTS.

Since emoji identifiers have very little value, I propose blocking at least all human-looking emojis (and then smileys too?). It's a can of worms: do we then want to support skin tones as well? I have nothing against skin tones for emojis in general, only in identifiers, where they could be easily confused in the REPL. And if we already disallow skin tones, we could be accused of racial bias in favor of white (actually yellow?) and against e.g. black.

If in doubt, just block all emojis with a hammer? That can be undone later, whereas discussing each one could be time-consuming.

vtjnash commented Aug 31, 2023

Julia already explicitly permits most emoji, except gender modifiers, which have been disallowed since at least v1.0 (though I don't recall the rationale)

julia> 👷= false
false

julia> 👷‍♀️=false
ERROR: ParseError:
# Error @ REPL[5]:1:2
👷‍♀️=false
# └ ── invisible character '\u200d'
Stacktrace:
 [1] top-level scope
   @ none:1

PallHaraldsson commented

By "explicitly permits most emoji, except gender modifiers", you mean a) (at least) these:

julia> REPL.REPLCompletions.emoji_symbols
Dict{String, String} with 1185 entries:
  "\\:ghost:"                => "👻"
  "\\:briefs:"               => "🩲"
  "\\:metro:"                => "🚇"
  "\\:children_crossing:"    => "🚸"
[..]
  "\\:clock9:"               => "🕘"
[..]
  "\\:person_frowning:"      => "🙍"
[..]

[I didn't check all of them one by one.]

and b) do you mean that you possibly want it to stay that way? I'm not proposing dropping anything from that list. They can all stay for purposes such as typing something into strings. For identifiers, though, likely none are useful. For example, person_frowning is problematic, though we could allow some (maybe most or all) from that list: "\\:children_crossing:" => "🚸" doesn't actually seem problematic (still, I very much doubt it's useful). clock9 seems to be one of the problematic ones, at least for C, e.g. if people were to say "type in the clock emoji" vs. "alarm clock"...

As a start, we could exclude all emojis not on that list (and possibly more, if we know they would be an issue).

I'm not criticizing this (longer) list in any context:

julia> REPL.REPLCompletions.latex_symbols
Dict{String, String} with 2509 entries:

mbauman commented Aug 31, 2023

Being able to tab-complete a given symbol is wholly orthogonal to its use as an identifier. There are many tab-completable characters that are not allowed in identifiers, and even more so vice versa. What's allowed as an identifier is defined in broad strokes by Unicode character categories, with some refinements that explicitly allow or disallow groups of characters:

return (cat == UTF8PROC_CATEGORY_LU || cat == UTF8PROC_CATEGORY_LL ||
cat == UTF8PROC_CATEGORY_LT || cat == UTF8PROC_CATEGORY_LM ||
cat == UTF8PROC_CATEGORY_LO || cat == UTF8PROC_CATEGORY_NL ||
cat == UTF8PROC_CATEGORY_SC || // allow currency symbols
// other symbols, but not arrows or replacement characters
(cat == UTF8PROC_CATEGORY_SO && !(wc >= 0x2190 && wc <= 0x21FF) &&
wc != 0xfffc && wc != 0xfffd &&

Emoji are largely So (symbol, other). The ZWJ used for compound emoji is Cf (Other, format), which also includes things like the BIDI mark, which we definitely wouldn't want to include.
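To make the category point concrete, here is a minimal sketch that lists the code points and general categories of the two construction-worker identifiers from earlier in the thread. The `describe` helper is hypothetical (not part of Julia), and `Base.Unicode.category_abbrev` and `Base.isidentifier` are unexported Base functions, so treat this as an illustration rather than a stable API.

```julia
# Hypothetical helper: list each character of `s` as a pair of
# (code point, Unicode general category abbreviation such as "So" or "Cf").
function describe(s::AbstractString)
    return [(string("U+", uppercase(string(codepoint(c), base = 16, pad = 4))),
             Base.Unicode.category_abbrev(c)) for c in s]
end

worker = "\U1F477"                            # construction worker
female_worker = "\U1F477\u200d\u2640\ufe0f"   # worker + ZWJ + female sign + variation selector

# Show which character of the compound emoji falls outside the symbol categories.
for (cp, cat) in describe(female_worker)
    println(cp, "  ", cat)
end

# Unexported check of whether a string is a valid identifier; the result for
# the ZWJ sequence depends on the Julia version, so it is only printed here.
println(Base.isidentifier(worker))
println(Base.isidentifier(female_worker))
```

The ZWJ (U+200D) is the only character in the sequence with category Cf, which matches the "invisible character '\u200d'" parse error shown above.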

Keno commented Aug 31, 2023

I'd rather go the opposite way and include all emojis as valid identifiers. We're not in the business of policing people's expression ;). If some organizations want to disallow certain classes of emoji, they're welcome to do that as a lint check.
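A lint check along those lines could look roughly like the sketch below. It is only an assumption about how such a check might work: it approximates "emoji" as any character in category So, which also catches non-emoji symbols, and the names `has_symbol_other` and `flagged_names!` are made up for illustration.

```julia
# Hypothetical predicate: true if the name contains a character whose Unicode
# general category is So (symbol, other). This is a rough stand-in for "emoji".
has_symbol_other(name::Symbol) =
    any(c -> Base.Unicode.category_abbrev(c) == "So", string(name))

# Recursively walk a parsed expression and collect every flagged symbol.
function flagged_names!(out::Set{Symbol}, ex)
    if ex isa Symbol
        has_symbol_other(ex) && push!(out, ex)
    elseif ex isa Expr
        foreach(arg -> flagged_names!(out, arg), ex.args)
    end
    return out
end

# Usage: parse a line of source and report the names a linter would flag.
flagged = flagged_names!(Set{Symbol}(), Meta.parse("👷 = x + 1"))
println(flagged)
```

An organization could tighten or loosen the predicate (for example, to specific code point ranges) without any change to the language itself.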

mbauman commented Aug 31, 2023

Removing emoji as identifiers would be quite breaking. The actionable part here would be what we do with zero-width joiners, which is a duplicate of #40071.

mbauman closed this as not planned on Aug 31, 2023