
Warn on confusable non-ascii identifiers (UTS 39, C2) #11582

Merged (13 commits) on Jan 21, 2022

Conversation

mrluc
Contributor

@mrluc mrluc commented Jan 18, 2022

This PR adds warnings for confusable identifiers, corresponding to clause C2 of UTS 39 (a previous PR handled C1); the reference implementation and the docs/tests accompanying this PR describe what that means in more detail.

The major change from the reference implementation is that we track whether a file has any non-ASCII tokens, and only run the whole-file Unicode lints in that case.

We expect to add another lint, covering Clause 3 of UTS39 (warning on 'mixed-script confusable characters', regardless of whether pairs of confusable identifiers exist), in a future PR.

(This was extracted from the reference implementation of UTS39, which has a higher-level overview + some examples).
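
For illustration (a made-up example, not a test case from this PR): two identifiers whose Unicode 'skeletons' collide, which is exactly what clause C2 warns about:

# Latin "a" and Cyrillic "а" (U+0430) look identical but are different identifiers;
# the second binding is expected to trigger the new confusability warning
a = 1
а = 2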


maybe_warn_on_unicode_security(Tokens, File, #elixir_tokenizer{ascii_identifiers_only=false, identifier_tokenizer=IdentifierTokenizer, warnings=Warnings} = Scope) ->
  UnicodeWarnings =
    case erlang:function_exported(IdentifierTokenizer, unicode_lint_warnings, 1) of
Member

We don't need the function_exported check; we can always just call it, and it will never be called by the bare tokenizer, since that one only handles ASCII anyway. :)

Member

In fact, I would make it call something like 'String.Confusable':lint(...) and decouple from the tokenizer. :)
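
For concreteness, a toy sketch of that decoupled shape -- with a stand-in module name and a stand-in skeleton table rather than the real confusables data:

defmodule ConfusableCheck do
  # Toy skeleton map: a real implementation derives this from Unicode's confusables data
  @skeleton_map %{?а => ?a, ?е => ?e, ?о => ?o, ?р => ?p, ?с => ?c}

  # Takes identifier names and returns groups whose skeletons collide,
  # i.e. identifiers that render alike but are actually different
  def lint(identifiers) do
    identifiers
    |> Enum.group_by(&skeleton/1)
    |> Enum.filter(fn {_skel, names} -> length(Enum.uniq(names)) > 1 end)
    |> Enum.map(fn {_skel, names} -> {:confusable, Enum.uniq(names)} end)
  end

  defp skeleton(name) do
    name
    |> to_charlist()
    |> Enum.map(&Map.get(@skeleton_map, &1, &1))
  end
end

# ConfusableCheck.lint(["admin", "аdmin"]) #=> [confusable: ["admin", "аdmin"]]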

Contributor Author

it will never be called by the bare tokenizer, since that one only handles ASCII anyway. :)

Oh yeah, we can avoid that check now -- before, I was calling it at the end of every file, heh. Re: naming, and splitting it out of the tokenizer: makes sense to me; I have a comment below asking about the right way to do that. Re: String as the top namespace: actually, the script resolution IS something that I'd love to have available in String eventually, even if right now it'll be an undocumented impl. detail like Tokenizer.tokenize. Rust's unicode-scripts provides that for all strings, and for some things, like username validation, I want my validation rule to be 'single-script usernames only', following UTS 39's definition of single script.
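
(A hypothetical sketch of that username rule -- the module name and the regex-based approximation are invented for illustration; real UTS 39 single-script resolution is more involved:)

defmodule UsernameCheck do
  # Heavily simplified "single-script usernames only" rule using regex script
  # properties; digits and "_" match none of these, so they stay neutral
  def single_script?(name) do
    scripts = [~r/\p{Latin}/u, ~r/\p{Cyrillic}/u, ~r/\p{Greek}/u, ~r/\p{Han}/u]
    Enum.count(scripts, &Regex.match?(&1, name)) <= 1
  end
end

# UsernameCheck.single_script?("maria_99")  #=> true
# UsernameCheck.single_script?("mаria")     #=> false (Cyrillic "а" mixed with Latin)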

@@ -1,5 +1,34 @@
type_path = Path.join(__DIR__, "IdentifierType.txt")
Member

Let's move the changes to this module to a separate file called "confusables.ex" or similar. We probably don't need to keep them in the tokenizer!

Contributor Author

I remember I didn't know how to do it with the bootstrap process and my first try didn't work, 🤦 so I punted -- until now I guess!

However, these lines specifically are the pieces from the last PR, just hoisted up (the restricted idents / uncommon codepoints), so I want to make sure I understand the suggestion; this data is used at compile time in the Tokenizer, but also in the mixed-script confusables check, cutting the list of potential confusables down from 6k-8k to a few hundred.

There are a couple of ways I can see it being split out; which of these makes more sense? (Or neither, heh.)

  1. Split it all, including the stuff from the last PR, into maybe unicode/security.ex, add it ahead of the tokenizer in the bootstrap, and tokenizer.ex could get e.g. the restricted codepoints via String.UnicodeSecurity.restricted_codepoints(). (There's a line I need to cut to do that, but looking at it again, it needs to be cut anyway :P -- arrows going the wrong way, depending on the tokenizer in confusables when it should just use the list of restricted codepoints.)

  2. Split out just the changes from this file into confusables.ex, add it ahead of the tokenizer in the bootstrap, and it's OK to e.g. re-read IdentifierType.txt there if we need to.

Member

It would be option 2, but you are right that handling bootstrapping corner cases is annoying, so let's merge this in a single file as-is right now, and I can take care of splitting it up later on. :)


* Some characters are in use in so many writing systems that they have been classified by Unicode as 'Common' or 'Inherited'; these include things like numbers, underscores, etc. Elixir will not warn about mixing of ALL-script characters, like `幻ㄒㄧㄤ1 = :foo; 幻ㄒㄧㄤ2 = :bar`.

However, there are some script combinations with no overlap in characters, like {Cyrillic} and {Latin} -- in Unicode terms, the 'resolved script set' would be empty. So if that kind of script mixing occurs in an identifier, and the only Cyrillic characters in the file are those confusable with characters in other scripts, Elixir will emit a warning to that effect. (If, however, the file also contains non-confusable Cyrillic characters in source code, then the programmer can visually detect that another script is being used, and no warning is issued.)
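
For example (illustrative only, not taken verbatim from the docs in this diff):

# Likely warns: the only Cyrillic character in this file is "о" (U+043E),
# which is confusable with Latin "o", mixed into an otherwise-Latin identifier
vоlume = 1

# Likely does not warn: the file also uses Cyrillic characters that are not
# Latin lookalikes (assuming "ж" is not in the confusables data), so a reader
# can already see that another script is present
vоlume = 1
жанр = 2
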
Member

Hrm... I am thinking about this... if we only emit warnings when all usages are confusable, doesn't it mean we already get the confusable warning? If so, what is the advantage of this additional rule?

Member

Maybe we should focus on single-script, mixed-script, and whole-script confusables for this PR, and see what C3 brings us further down the road?

I am actually keen on tightening this more than the rule above. For example, I would warn if any mixed script outside of ASCII is used. But then I would rather merge C2 first and then have this discussion.

Sorry for the back and forth on this, but those layers are being uncovered as I go deeper in the reviews (and as I re-read UTS 39 -- this was the third time :D).

Contributor Author

Yeah, I wondered that at first too -- they often overlap, but not always; this handles cases where there is no other token in the file whose skeleton clashes, and in that case the identifier would still be suspicious. (Example, maybe a bit contrived: an upstream package/dependency has had an additional, visually identical function added to it; or this is the means used to define something in another file, and it's used in the file being considered here.)

In that case, usage of is_admin() may only happen once, so it's not actually confusable with anything in the file, but if it includes only confusable characters from Cyrillic mixed in there -- that's a fishy-enough situation to warn on (and is the example they show in Trojan Source: the sayHello() fn from a dependency, where the H is from another script). (I actually renamed the lint in this PR -- MixedScriptUsingOnlyConfusableCharacters -- to try to make that a bit clearer. But as with a lot of things in this standard, it can only get so clear :( because, for instance, 'mixed script' intuitively meant something quite different to me before actually implementing the standard; or rather, now I have a couple of things in mind that someone could mean when they say that.)

Member

If I got it right: C2 warns for existing confusables. C3 warns for potential confusables. Is that it?

Member

@mrluc if we go with a more restrictive approach, as described here:

Highly Restrictive

    The string qualifies as Single Script, or
    The string is covered by any of the following sets of scripts, according to the definition in Section 5.1:
        Latin + Han + Hiragana + Katakana; or equivalently: Latn + Jpan
        Latin + Han + Bopomofo; or equivalently: Latn + Hanb
        Latin + Han + Hangul; or equivalently: Latn + Kore

Would that simplify the implementation or make it more complex?

Contributor Author

@mrluc mrluc Jan 18, 2022

Yeah, I am on the same page now, thanks! Now my question is: why not ship a more strict version of the warning? Such as only allowing Latin with Jpan, Hanb, and Kore?

Want to clarify -- do you mean one of the standard's definitions of restriction levels, like its definition of 'Highly Restrictive', or just filtering it down to only those languages?

I don't think the standard recommends only allowing those; notice that it mentions "single script", which requires full scriptset resolution to determine:

Highly Restrictive:
The string qualifies as Single Script, or
The string is covered by any of the following sets of scripts, according to the definition in Section 5.1:
Latin + Han + Hiragana + Katakana; or equivalently: Latn + Jpan
Latin + Han + Bopomofo; or equivalently: Latn + Hanb
Latin + Han + Hangul; or equivalently: Latn + Kore

Regarding whether or not we could do something more restrictive -- YES, that's a lang design question for sure.

  • For instance, I initially thought "single script is good enough for identifiers"; it ignores Common (and all the mathy chars are Common), it resolves multi-script Asian languages, etc.

  • The counter-example there would be variables written by people in other scripts, who are now forced to come up with a transliteration of 'twitter' or 'api' in variables like "mycompany_twitter" and "mycompany_api", where 'mycompany' is in another script.

However, how much simpler does it make things?

  • It could remove statefulness if we move to being more restrictive on individual idents vs. checking the uses across the file. That would remove basically all of the fns check_confusable_mixed_script and warning_per_problematic_script, plus a little more, and it could also in theory be done at tokenization time.

  • But I think almost everything we do with mixed-script would still require scriptset resolution (see the toy sketch below), which is the bulk of the C3 additions.
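
To make 'scriptset resolution' concrete, here is a toy sketch of the resolution step; the module name, per-character script sets, and atoms are invented stand-ins for the real Scripts.txt / ScriptExtensions.txt data:

defmodule ScriptSetSketch do
  # Toy augmented script sets; :all marks Common/Inherited characters
  # such as digits and "_"
  @sets %{
    ?a => [:latn],
    ?б => [:cyrl],
    ?幻 => [:hani, :hanb, :jpan, :kore],
    ?ㄒ => [:bopo, :hanb],
    ?1 => :all,
    ?_ => :all
  }

  # UTS 39-style resolution: intersect the per-character sets, ignoring :all.
  # A non-empty result means the identifier qualifies as single-script.
  def resolve(identifier) do
    identifier
    |> String.to_charlist()
    |> Enum.map(&Map.get(@sets, &1, [:zzzz]))
    |> Enum.reject(&(&1 == :all))
    |> case do
      [] ->
        :all

      [first | rest] ->
        Enum.reduce(rest, MapSet.new(first), fn set, acc ->
          MapSet.intersection(acc, MapSet.new(set))
        end)
    end
  end
end

# ScriptSetSketch.resolve("幻ㄒ1")  #=> MapSet containing only :hanb (single script)
# ScriptSetSketch.resolve("aб")    #=> empty MapSet (mixed script, no resolution)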

Contributor Author

@mrluc mrluc Jan 18, 2022

@josevalim GitHub UI lag got me on your question here; I responded to it above -- "Single Script" is the key there, because UTS 39's definition of single script requires script resolution. So it would simplify things somewhat, removing a couple of functions (EDIT: and allow it to be stateless, so only Confusables would require a pass over all tokens), but the effect on the size of the implementation wouldn't be drastic, since resolution is the bulk of it.

Member

@mrluc I see. Let's go with C2 for now only. Then, after merging, I can split it into a separate module, which should leave us in a better position to accept C3. WDYT?

Contributor Author

Sounds good to me!

After talking over the C3 stuff, C2 seems blessedly simple! 😆 Since it's relatively simple, it may as well be the only thing in the PR. I'll probably be able to do that + incorporate the feedback tomorrow.

Member

Thanks, and again, sorry for the false positive on accepting both C2 and C3 at once. :)

@mrluc mrluc changed the title Warn on identifiers that use unicode in potentially confusing/unsafe ways (UTS 39, C2-C3) Warn on confusable non-ascii identifiers (UTS 39, C2) Jan 20, 2022
@mrluc
Contributor Author

mrluc commented Jan 20, 2022

@josevalim I took a pass at cutting it down to just C2 capabilities for now:

  • applied a couple of suggested changes in .erl/.hrl
  • split UTS39 C2 stuff out into its own file
  • removed UTS39 C3 support for now, per discussion of how much is implied by C3 (like 'resolved script sets'), how much of the ref. impl is dedicated to it, etc., though I named things/adjusted docs in a way that lends itself to adding C3 next.

@josevalim josevalim merged commit a468926 into elixir-lang:main Jan 21, 2022
@josevalim
Member

💚 💙 💜 💛 ❤️

@josevalim
Member

Oh, quick question: should we also include atoms in your confusable lookups? A confusable atom can also be used to hide conditionals and so forth.

@mrluc
Contributor Author

mrluc commented Jan 21, 2022

D'oh, sorry for missing deletion of all of those C3 .txt/comments. 🤦

Re: atoms,

  • Practically, :foo and foo shouldn't be considered confusable with each other, because one has the leading :, so two lookups would need to be maintained -- one just for atom literals in source code.

  • A 'cop-out' would be to say that, in a language like Elixir where atoms are written in a visually distinctive way, they're no different from other data literals -- we don't warn that a string literal is confusable, for instance. However, as you point out, atoms do seem different from charlists and strings, and we're not talking about generated atoms, just the ones that are used meaningfully in code.

I think C3, if added in a form similar to the ref impl, would be a natural place to give atoms a measure of protection, without needing to add an atom confusability lookup to the C2 check. That check scans tokens and builds up all unique chars -- currently in identifiers/aliases only, but we could extend it to atoms too -- and that would let us catch atoms with mixed-script confusables (see the sketch below).

It's an interesting question to what extent languages with pattern matching should warn on what we might consider "literals", heh ... I'm not sure I have a strong opinion on how far it should go, but the above feels like one layer/measure of coverage.
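
A tiny sketch of the 'two lookups' point above -- keying the skeleton index by token kind so an atom can only be confusable with another atom; the names and shapes here are illustrative, not the PR's code:

defmodule SkeletonIndex do
  # Key the skeleton lookup by token kind, so an atom can only collide with
  # another atom, and an identifier/alias with another identifier/alias
  def add(lookup, kind, name, skeleton) when kind in [:identifier, :alias, :atom] do
    Map.update(lookup, {kind, skeleton}, [name], &[name | &1])
  end

  def confusable_groups(lookup) do
    for {{kind, _skeleton}, names} <- lookup, length(Enum.uniq(names)) > 1 do
      {kind, Enum.uniq(names)}
    end
  end
end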

@josevalim
Member

A 'cop-out' would be to say that, in a language like Elixir where atoms are written in a visually distinctive way, they're no different from other data literals -- we don't warn that a string literal is confusable, for instance.

That was my thinking, especially because of this:

Foo.admin()
apply(Foo, :admin, [])

My biggest concern with atoms is code like this:

if user.type == :regular do
  ...
else
  # do admin stuff
end

And someone then sneaks a :regular confusable elsewhere.

It is one of the reasons why I think for C3 we should go with a more restrictive approach.

@mrluc
Contributor Author

mrluc commented Jan 21, 2022

(My example was wrong -- I was thinking of :foo and foo being found confusable, which isn't what we're talking about here).

I do think C2 confusable protection should include atoms 👍. Would you like me to open a new PR for that tweak, and add a line to the test? Super-simple change.

Re: drawing the line somewhere -- in theory there are these variations, too:

if user.type == :regular do ...
if user.type == 'regular' do ...
if user.type == "regular" do ...

But, given that Atoms are used heavily for pattern matching in control-flow, I'd agree that, wherever the line is, Atoms are probably inside it.

It is one of the reasons why I think for C3 we should go with a more restrictive approach.

That makes sense to me generally; it does feel like Elixir has the opportunity to make things a lot easier for itself, while still letting people use their language. In the limit: maybe even enforcing UTS 39's definition of "single script" for Unicode idents. Would have to consider how to handle a bunch of examples -- for instance, the case of the ..._twitter, ..._api variable names using partial English/Latin terms/acronyms -- but can leave that for a future PR.

@josevalim
Member

I do think C2 confusable protection should include atoms 👍. Would you like me to open a new PR for that tweak, and add a line to the test? Super-simple change.

I have already pushed it. :)

Re: drawing the line somewhere -- in theory there are these variations, too:

I agree. At the same time, anything that is "quoted" kinda means "here be dragons". For example, we don't currently check for quoted atoms.

Would have to consider how to handle a bunch of examples -- for instance, the case of the ..._twitter, ..._api variable names using partial English/Latin terms/acronyms -- but can leave that for a future PR.

Yeah, this can come later. I was thinking about such rules, but the report is pretty clear that Turkish and Cyrillic should never be mixed with ASCII; probably it is just too easy to sneak something in. At the same time, you probably have a very reasonable way of writing _api in Cyrillic, so it is fine to go Cyrillic all the way, but mixing could be useful for some of the Asian languages. That's why I landed on "Highly Restrictive".
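
For illustration, some made-up identifiers and how a Highly Restrictive rule would treat them (these are not test cases from the PR):

# Allowed: Latin mixed with Japanese, covered by the Latn + Jpan combination
東京_api = 1

# Allowed: all-Cyrillic identifier, no mixing within the identifier
москва_апи = 1

# Rejected under Highly Restrictive: Latin mixed with Cyrillic in one identifier
москва_api = 1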

@mrluc
Contributor Author

mrluc commented Feb 2, 2022

@josevalim it's been a while -- so I spiked doing C3 protections via 'Highly Restrictive' identifiers, and it looks promising.

I have a couple of clarifying questions, and then below I have 2 examples of tradeoffs, where it's better / worse than the ref impl.

  1. Warning or error? Error was easier for my initial impl, which is based on an additional per-token check when ascii_letters? = false in String.Tokenizer.validate. I can imagine either, since mixing scripts in idents should be uber-rare.

  2. Let's say there's a single-char identifier, Cyrillic 'a', used like a = 1, in a file with otherwise only Latin identifiers (Latin being the default script for ASCII). Should we consider that concerning and warn/err on it?

    I would say 'no, not concerning' if we like Highly Restrictive identifier rules; they allow it, because the identifier is single-script (Cyrillic). It IS an identifier comprised only of characters that are mixed-script confusables, yes -- but (a) scripts are not being mixed within the identifier itself, (b) confusability protections already prevent Cyrillic and Latin versions of the same single-char identifier from existing in one file, and (c) the example used in the 'Trojan Source' report the CVE was issued for is that of an identifier that mixes scripts within the same identifier.

But to be clear, these are 2 different approaches to addressing UTS39 C3 / Trojan Source's 2nd CVE; each has strengths and weaknesses:

  • treat 'mixed-script confusable' characters as dangerous/suspicious until a non-confusable character is seen (Rust's / the ref impl's approach); downside: it requires an all-tokens check, and further, a single innocent character whitelists all script mixing!
  • disallow all 'script mixing' in identifiers (Highly Restrictive identifiers), and use confusables detection that's mixed-script-capable, and thus prevent all 'mixed-script confusable identifiers'.

Comparisons below:

Where is the C3 check from the reference impl 'better' than C3 via Highly Restrictive identifiers?

Rust, and the reference implementation in Elixir, try to catch the case of 'the only use of non-ASCII/non-Latin script X in this project is comprised only of confusable characters'.

Since confusable detection prevents one class of problem that can arise from that, the remaining issue is with identifiers imported into scope from a dependency, perhaps even autocompleted instead of typed -- the Rust & reference-implementation approach would catch an import from a compromised upstream:

# fns using Cyrillic lowercase letters, let's say
import BadUpstream, only: [math: 1, top: 1, key: 1, batch: 1]

Highly Restrictive wouldn't catch that, because no 'script mixing' would be happening in the identifier, which could be made up entirely of lookalike letters from, for instance, {Grek, Cyrl, Latn}:

[image: examples of such lookalike letters, not reproduced here]
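
To make that concrete with a made-up example (not the original image): Cyrillic с, о, р, and у render just like Latin c, o, p, and y, so an all-Cyrillic name passes a 'no mixing within the identifier' rule while reading as Latin:

# "сору" below is entirely Cyrillic, yet renders the same as Latin "copy"
import BadUpstream, only: [сору: 1]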

Where are Highly Restrictive idents 'better' than reference impl's/Rust's C3 check?

This one is interesting, as it's even more the case in Rust than in the Elixir ref impl, due to crate vs. file scope for this check.

If you allow a SINGLE non-confusable Cyrillic character anywhere in a whole crate, then all mixings of Cyrillic and Latin are now completely allowed:

  // this first variable, by itself, would produce warnings, like
  // 'oh no! the project uses only confusable chars from Cyrillic!'
  // п is confusable with Latin n
  let п = 3.14;
  let л = 3.14;
  // ^^ ahhhh -- if you let this one 'not confusable' char in, any mixed-script {Latin, Cyrillic}
  //    identifier won't elicit warnings in the rest of the crate, and the following compiles cleanly:
  let аdmin = 1;

Compare that behavior with the Trojan Source recommendation to "throw errors or warnings for ... identifiers with mixed-script confusable characters".

I hasten to add that Rust obviously does great with UTS 39 -- this just shows that even in the few languages that have implemented UTS 39 protections, there are tradeoffs! UTS 39 acknowledges that, even once you have mixed-script and confusable detection capabilities, nuance is still necessary when determining what to consider suspicious, or there will be a ton of false positives.

Let me know if any of those tradeoffs change what we'd ideally like to have in Elixir, so I can include it in the C3 PR.

I could see Highly Restrictive identifiers being good enough given the C2 protections already in place.

We could also have both -- the Rust-style check from the reference impl (in the case of a file with Unicode in idents only), and Highly Restrictive identifiers, which as you noted make good sense -- and that would be pretty robust!

@josevalim
Member

Hi @mrluc, thank you for another great write-up!

I could see Highly Restrictive identifiers being good enough given the C2 protections already in place.

Yes, this is my understanding too, so that's the route I would go. I also think we could embed C3 into the existing tokenizer, but that's something I would do as a fourth step (I will investigate it once the Highly Restrictive reference implementation is in, you don't have to worry!).
