
Warn on confusable non-ascii identifiers (UTS 39, C2) #11582

Merged (13 commits) on Jan 21, 2022

Conversation

mrluc
Contributor

@mrluc mrluc commented Jan 18, 2022

This PR adds warnings for confusable identifiers, corresponding to clause C2 of UTS 39 (a previous PR handled C1); the reference implementation and the docs/tests accompanying this PR describe what that means in more detail.

The major change from the reference implementation is that we track whether a file has any non-ASCII tokens, and only run the whole-file Unicode lints in that case.

We expect to add another lint, covering Clause 3 of UTS39 (warning on 'mixed-script confusable characters', regardless of whether pairs of confusable identifiers exist), in a future PR.

(This was extracted from the reference implementation of UTS39, which has a higher-level overview + some examples).
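
For illustration (a made-up example, not a test case from this PR): two identifiers whose Unicode 'skeletons' collide, which is exactly what clause C2 warns about:

# Latin "a" and Cyrillic "а" (U+0430) look identical but are different identifiers;
# the second binding is expected to trigger the new confusability warning
a = 1
а = 2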


maybe_warn_on_unicode_security(Tokens, File, #elixir_tokenizer{ascii_identifiers_only=false, identifier_tokenizer=IdentifierTokenizer, warnings=Warnings} = Scope) ->
  UnicodeWarnings =
    case erlang:function_exported(IdentifierTokenizer, unicode_lint_warnings, 1) of
Member

We don't need the function_exported check; we can always just call it, and it will never be called by the bare tokenizer, since that one only handles ASCII anyway. :)

Member

In fact, I would make it call something like 'String.Confusable':lint(...) and decouple from the tokenizer. :)
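
For concreteness, a toy sketch of that decoupled shape -- with a stand-in module name and a stand-in skeleton table rather than the real confusables data:

defmodule ConfusableCheck do
  # Toy skeleton map: a real implementation derives this from Unicode's confusables data
  @skeleton_map %{?а => ?a, ?е => ?e, ?о => ?o, ?р => ?p, ?с => ?c}

  # Takes identifier names and returns groups whose skeletons collide,
  # i.e. identifiers that render alike but are actually different
  def lint(identifiers) do
    identifiers
    |> Enum.group_by(&skeleton/1)
    |> Enum.filter(fn {_skel, names} -> length(Enum.uniq(names)) > 1 end)
    |> Enum.map(fn {_skel, names} -> {:confusable, Enum.uniq(names)} end)
  end

  defp skeleton(name) do
    name
    |> to_charlist()
    |> Enum.map(&Map.get(@skeleton_map, &1, &1))
  end
end

# ConfusableCheck.lint(["admin", "аdmin"]) #=> [confusable: ["admin", "аdmin"]]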

Contributor Author

it will never be called by the bare tokenizer, since that one only handles ASCII anyway. :)

Oh yeah, we can avoid that check now -- before, I was calling it at the end of every file, heh. Re: naming, and splitting it out of the tokenizer: makes sense to me; I have a comment below asking about the right way to do that. Re: String as the top namespace: actually, the script resolution IS something that I'd love to have available in String eventually, even if right now it'll be an undocumented impl. detail like Tokenizer.tokenize. Rust's unicode-scripts provides that for all strings, and for some things, like username validation, I want my validation rule to be 'single-script usernames only', following UTS 39's definition of single script.
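
(A hypothetical sketch of that username rule -- the module name and the regex-based approximation are invented for illustration; real UTS 39 single-script resolution is more involved:)

defmodule UsernameCheck do
  # Heavily simplified "single-script usernames only" rule using regex script
  # properties; digits and "_" match none of these, so they stay neutral
  def single_script?(name) do
    scripts = [~r/\p{Latin}/u, ~r/\p{Cyrillic}/u, ~r/\p{Greek}/u, ~r/\p{Han}/u]
    Enum.count(scripts, &Regex.match?(&1, name)) <= 1
  end
end

# UsernameCheck.single_script?("maria_99")  #=> true
# UsernameCheck.single_script?("mаria")     #=> false (Cyrillic "а" mixed with Latin)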

@@ -1,5 +1,34 @@
type_path = Path.join(__DIR__, "IdentifierType.txt")
Member

Let's move the changes to this module to a separate file called "confusables.ex" or similar. We probably don't need to keep them in the tokenizer!

Contributor Author

I remember I didn't know how to do it with the bootstrap process and my first try didn't work, 🤦 so I punted -- until now I guess!

However, these lines specifically are the pieces from the last PR, just hoisted up (the restricted idents / uncommon codepoints), so I want to make sure I understand the suggestion; this data is used at compile time in the Tokenizer, but also in the mixed-script confusables check, cutting the list of potential confusables down from 6k-8k to a few hundred.

There are a couple of ways I can see it being split out; which of these makes more sense? (Or neither, heh.)

  1. Split it all, including the stuff from the last PR, into maybe unicode/security.ex, add it ahead of the tokenizer in the bootstrap, and tokenizer.ex could get e.g. the restricted codepoints via String.UnicodeSecurity.restricted_codepoints(). (There's a line I need to cut to do that, but looking at it again, it needs to be cut anyway :P -- arrows going the wrong way, depending on the tokenizer in confusables when it should just use the list of restricted codepoints.)

  2. Split out just the changes from this file into confusables.ex, add it ahead of the tokenizer in the bootstrap, and it's OK to e.g. re-read IdentifierType.txt there if we need to.

Member

It would be option 2, but you are right that handling bootstrapping corner cases is annoying, so let's merge this in a single file as-is right now, and I can take care of splitting it up later on. :)


* Some characters are in use in so many writing systems that they have been classified by Unicode as 'Common' or 'Inherited'; these include things like numbers, underscores, etc. Elixir will not warn about mixing of ALL-script characters, like `幻ㄒㄧㄤ1 = :foo; 幻ㄒㄧㄤ2 = :bar`.

However, there are some script combinations with no overlap in characters, like {Cyrillic} and {Latin} -- in Unicode terms, the 'resolved script set' would be empty. So if that kind of script mixing occurs in an identifier, and the only Cyrillic characters in the file are those confusable with characters in other scripts, Elixir will emit a warning to that effect. (If, however, the file also contains non-confusable Cyrillic characters in source code, then the programmer can visually detect that another script is being used, and no warning is issued.)
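
For example (illustrative only, not taken verbatim from the docs in this diff):

# Likely warns: the only Cyrillic character in this file is "о" (U+043E),
# which is confusable with Latin "o", mixed into an otherwise-Latin identifier
vоlume = 1

# Likely does not warn: the file also uses Cyrillic characters that are not
# Latin lookalikes (assuming "ж" is not in the confusables data), so a reader
# can already see that another script is present
vоlume = 1
жанр = 2
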
Member

Hrm... I am thinking about this... if we only emit warnings when all usages are confusable, doesn't it mean we already get the confusable warning? If so, what is the advantage of this additional rule?

Member

Maybe we should focus on single-script, mixed-script, and whole-script confusables for this PR, and see what C3 brings us further down the road?

I am actually keen on tightening this more than the rule above. For example, I would warn if any mixed script outside of ASCII is used. But then I would rather merge C2 first and then have this discussion.

Sorry for the back and forth on this, but those layers are being uncovered as I go deeper in the reviews (and as I re-read UTS 39 -- this was the third time :D).

Contributor Author

Yeah, I wondered that at first too -- they often overlap, but not always; this handles cases where there is no other token in the file whose skeleton clashes, and in that case the identifier would still be suspicious. (Example, maybe a bit contrived: an upstream package/dependency has had an additional, visually identical function added to it; or this is the means used to define something in another file, and it's used in the file being considered here.)

In that case, usage of is_admin() may only happen once, so it's not actually confusable with anything in the file, but if it includes only confusable characters from Cyrillic mixed in there -- that's a fishy-enough situation to warn on (and is the example they show in Trojan Source: the sayHello() fn from a dependency, where the H is from another script). (I actually renamed the lint in this PR -- MixedScriptUsingOnlyConfusableCharacters -- to try to make that a bit clearer. But as with a lot of things in this standard, it can only get so clear :( because, for instance, 'mixed script' intuitively meant something quite different to me before actually implementing the standard; or rather, now I have a couple of things in mind that someone could mean when they say that.)

Member

If I got it right: C2 warns for existing confusables. C3 warns for potential confusables. Is that it?

Member

@mrluc if we go with a more restrictive approach, as described here:

Highly Restrictive

    The string qualifies as Single Script, or
    The string is covered by any of the following sets of scripts, according to the definition in Section 5.1:
        Latin + Han + Hiragana + Katakana; or equivalently: Latn + Jpan
        Latin + Han + Bopomofo; or equivalently: Latn + Hanb
        Latin + Han + Hangul; or equivalently: Latn + Kore

Would that simplify the implementation or make it more complex?

Contributor Author

@mrluc mrluc Jan 18, 2022

Yeah, I am on the same page now, thanks! Now my question is: why not ship a more strict version of the warning? Such as only allowing Latin with Jpan, Hanb, and Kore?

Want to clarify -- do you mean one of the standard's definitions of restriction levels, like its definition of 'Highly Restrictive', or just filtering it down to only those languages?

I don't think the standard recommends only allowing those; notice that it mentions "single script", which requires full scriptset resolution to determine:

Highly Restrictive:
The string qualifies as Single Script, or
The string is covered by any of the following sets of scripts, according to the definition in Section 5.1:
Latin + Han + Hiragana + Katakana; or equivalently: Latn + Jpan
Latin + Han + Bopomofo; or equivalently: Latn + Hanb
Latin + Han + Hangul; or equivalently: Latn + Kore

Regarding whether or not we could do something more restrictive -- YES, that's a lang design question for sure.

  • For instance, I initially thought "single script is good enough for identifiers"; it ignores Common (and all the mathy chars are Common), it resolves multi-script Asian languages, etc.

  • The counter-example there would be variables written by people in other scripts, who are now forced to come up with a transliteration of 'twitter' or 'api' in variables like "mycompany_twitter" and "mycompany_api", where 'mycompany' is in another script.

However, how much simpler does it make things?

  • It could remove statefulness if we move to being more restrictive on individual idents vs. checking the uses across the file. That would remove basically all of the fns check_confusable_mixed_script and warning_per_problematic_script, plus a little more, and it could also in theory be done at tokenization time.

  • But I think almost everything we do with mixed-script would still require scriptset resolution (see the toy sketch below), which is the bulk of the C3 additions.
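
To make 'scriptset resolution' concrete, here is a toy sketch of the resolution step; the module name, per-character script sets, and atoms are invented stand-ins for the real Scripts.txt / ScriptExtensions.txt data:

defmodule ScriptSetSketch do
  # Toy augmented script sets; :all marks Common/Inherited characters
  # such as digits and "_"
  @sets %{
    ?a => [:latn],
    ?б => [:cyrl],
    ?幻 => [:hani, :hanb, :jpan, :kore],
    ?ㄒ => [:bopo, :hanb],
    ?1 => :all,
    ?_ => :all
  }

  # UTS 39-style resolution: intersect the per-character sets, ignoring :all.
  # A non-empty result means the identifier qualifies as single-script.
  def resolve(identifier) do
    identifier
    |> String.to_charlist()
    |> Enum.map(&Map.get(@sets, &1, [:zzzz]))
    |> Enum.reject(&(&1 == :all))
    |> case do
      [] ->
        :all

      [first | rest] ->
        Enum.reduce(rest, MapSet.new(first), fn set, acc ->
          MapSet.intersection(acc, MapSet.new(set))
        end)
    end
  end
end

# ScriptSetSketch.resolve("幻ㄒ1")  #=> MapSet containing only :hanb (single script)
# ScriptSetSketch.resolve("aб")    #=> empty MapSet (mixed script, no resolution)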

Contributor Author

@mrluc mrluc Jan 18, 2022

@josevalim GitHub UI lag got me on your question here; I responded to it above -- "Single Script" is the key there, because UTS 39's definition of single script requires script resolution. So it would simplify things somewhat, removing a couple of functions (EDIT: and allow it to be stateless, so only Confusables would require a pass over all tokens), but the effect on the size of the implementation wouldn't be drastic, since resolution is the bulk of it.

Member

@mrluc I see. Let's go with C2 for now only. Then, after merging, I can split it into a separate module, which should leave us in a better position to accept C3. WDYT?

Contributor Author

Sounds good to me!

After talking over the C3 stuff, C2 seems blessedly simple! 😆 Since it's relatively simple, it may as well be the only thing in the PR. I'll probably be able to do that + incorporate the feedback tomorrow.

Member

Thanks, and again, sorry for the false positive on accepting both C2 and C3 at once. :)

@mrluc mrluc changed the title Warn on identifiers that use unicode in potentially confusing/unsafe ways (UTS 39, C2-C3) Warn on confusable non-ascii identifiers (UTS 39, C2) Jan 20, 2022
@mrluc
Contributor Author

mrluc commented Jan 20, 2022

@josevalim I took a pass at cutting it down to just C2 capabilities for now:

  • applied a couple of suggested changes in .erl/.hrl
  • split UTS39 C2 stuff out into its own file
  • removed UTS39 C3 support for now, per discussion of how much is implied by C3 (like 'resolved script sets'), how much of the ref. impl is dedicated to it, etc., though I named things/adjusted docs in a way that lends itself to adding C3 next.

@josevalim josevalim merged commit a468926 into elixir-lang:main Jan 21, 2022
@josevalim
Member

💚 💙 💜 💛 ❤️

@josevalim
Member

Oh, quick question: should we also include atoms in your confusable lookups? A confusable atom can also be used to hide conditionals and so forth.

@mrluc
Contributor Author

mrluc commented Jan 21, 2022

D'oh, sorry for missing deletion of all of those C3 .txt/comments. 🤦

Re: atoms,

  • Practically, :foo and foo shouldn't be considered confusable with each other, because one has the leading :, so two lookups would need to be maintained -- one just for atom literals in source code.

  • A 'cop-out' would be to say that, in a language like Elixir where atoms are written in a visually distinctive way, they're no different from other data literals -- we don't warn that a string literal is confusable, for instance. However, as you point out, atoms do seem different from charlists and strings, and we're not talking about generated atoms, just the ones that are used meaningfully in code.

I think C3, if added in a form similar to the ref impl, would be a natural place to give atoms a measure of protection, without needing to add an atom confusability lookup to the C2 check. That check scans tokens and builds up all unique chars -- currently in identifiers/aliases only, but we could extend it to atoms too -- and that would let us catch atoms with mixed-script confusables (see the sketch below).

It's an interesting question to what extent languages with pattern matching should warn on what we might consider "literals", heh ... I'm not sure I have a strong opinion on how far it should go, but the above feels like one layer/measure of coverage.
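
A tiny sketch of the 'two lookups' point above -- keying the skeleton index by token kind so an atom can only be confusable with another atom; the names and shapes here are illustrative, not the PR's code:

defmodule SkeletonIndex do
  # Key the skeleton lookup by token kind, so an atom can only collide with
  # another atom, and an identifier/alias with another identifier/alias
  def add(lookup, kind, name, skeleton) when kind in [:identifier, :alias, :atom] do
    Map.update(lookup, {kind, skeleton}, [name], &[name | &1])
  end

  def confusable_groups(lookup) do
    for {{kind, _skeleton}, names} <- lookup, length(Enum.uniq(names)) > 1 do
      {kind, Enum.uniq(names)}
    end
  end
end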

@josevalim
Member

A 'cop-out' would be to say that, in a language like Elixir where atoms are written in a visually distinctive way, they're no different from other data literals -- we don't warn that a string literal is confusable, for instance.

That was my thinking, especially because of this:

Foo.admin()
apply(Foo, :admin, [])

My biggest concern with atoms is code like this:

if user.type == :regular do
  ...
else
  # do admin stuff
end

And someone then sneaks a :regular confusable elsewhere.

It is one of the reasons why I think for C3 we should go with a more restrictive approach.

@mrluc
Contributor Author

mrluc commented Jan 21, 2022

(My example was wrong -- I was thinking of :foo and foo being found confusable, which isn't what we're talking about here).

I do think C2 confusable protection should include atoms 👍. Would you like me to open a new PR for that tweak, and add a line to the test? Super-simple change.

Re: drawing the line somewhere -- in theory there are these variations, too:

if user.type == :regular do ...
if user.type == 'regular' do ...
if user.type == "regular" do ...

But, given that Atoms are used heavily for pattern matching in control-flow, I'd agree that, wherever the line is, Atoms are probably inside it.

It is one of the reasons why I think for C3 we should go with a more restrictive approach.

That makes sense to me generally; it does feel like Elixir has the opportunity to make things a lot easier for itself, while still letting people use their language. In the limit: maybe even enforcing UTS 39's definition of "single script" for Unicode idents. Would have to consider how to handle a bunch of examples -- for instance, the case of the ..._twitter, ..._api variable names using partial English/Latin terms/acronyms -- but can leave that for a future PR.

@josevalim
Member

I do think C2 confusable protection should include atoms 👍. Would you like me to open a new PR for that tweak, and add a line to the test? Super-simple change.

I have already pushed it. :)

Re: drawing the line somewhere -- in theory there are these variations, too:

I agree. At the same time, anything that is "quoted" kinda means "here be dragons". For example, we don't currently check for quoted atoms.

Would have to consider how to handle a bunch of examples -- for instance, the case of the ..._twitter, ..._api variable names using partial English/Latin terms/acronyms -- but can leave that for a future PR.

Yeah, this can come later. I was thinking about such rules, but the report is pretty clear that Turkish and Cyrillic should never be mixed with ASCII; probably it is just too easy to sneak something in. At the same time, you probably have a very reasonable way of writing _api in Cyrillic, so it is fine to go Cyrillic all the way, but mixing could be useful for some of the Asian languages. That's why I landed on "Highly Restrictive".
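
For illustration, some made-up identifiers and how a Highly Restrictive rule would treat them (these are not test cases from the PR):

# Allowed: Latin mixed with Japanese, covered by the Latn + Jpan combination
東京_api = 1

# Allowed: all-Cyrillic identifier, no mixing within the identifier
москва_апи = 1

# Rejected under Highly Restrictive: Latin mixed with Cyrillic in one identifier
москва_api = 1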

@mrluc
Contributor Author

mrluc commented Feb 2, 2022

@josevalim it's been a while -- so I spiked doing C3 protections via 'Highly Restrictive' identifiers, and it looks promising.

I have a couple of clarifying questions, and then below I have 2 examples of tradeoffs, where it's better / worse than the ref impl.

  1. Warning or error? Error was easier for my initial impl, which is based on an additional per-token check when ascii_letters? = false in String.Tokenizer.validate. I can imagine either, since mixing scripts in idents should be uber-rare.

  2. Let's say there's a single-char identifier, Cyrillic 'a', used like a = 1, in a file with otherwise only Latin identifiers (Latin being the default script for ASCII). Should we consider that concerning and warn/err on it?

    I would say 'no, not concerning' if we like Highly Restrictive identifier rules; they allow it, because the identifier is single-script (Cyrillic). It IS an identifier comprised only of characters that are mixed-script confusables, yes -- but (a) scripts are not being mixed within the identifier itself, (b) confusability protections already prevent Cyrillic and Latin versions of the same single-char identifier from existing in one file, and (c) the example used in the 'Trojan Source' report the CVE was issued for is that of an identifier that mixes scripts within the same identifier.

But to be clear, these are 2 different approaches to addressing UTS39 C3 / Trojan Source's 2nd CVE; each has strengths and weaknesses:

  • treat 'mixed-script confusable' characters as dangerous/suspicious until a non-confusable character is seen (Rust's / the ref impl's approach); downside: it requires an all-tokens check, and further, a single innocent character whitelists all script mixing!
  • disallow all 'script mixing' in identifiers (Highly Restrictive identifiers), and use confusables detection that's mixed-script-capable, and thus prevent all 'mixed-script confusable identifiers'.

Comparisons below:

Where is the C3 check from the reference impl 'better' than C3 via Highly Restrictive identifiers?

Rust, and the reference implementation in Elixir, try to catch the case of 'the only use of non-ASCII/non-Latin script X in this project is comprised only of confusable characters'.

Since confusable detection prevents one class of problem that can arise from that, the remaining issue is with identifiers imported into scope from a dependency, perhaps even autocompleted instead of typed -- the Rust & reference-implementation approach would catch an import from a compromised upstream:

# fns using Cyrillic lowercase letters, let's say
import BadUpstream, only: [math: 1, top: 1, key: 1, batch: 1]

Highly Restrictive wouldn't catch that, because no 'script mixing' would be happening in the identifier, which could be made up entirely of lookalike letters from, for instance, {Grek, Cyrl, Latn}:

[image: examples of such lookalike letters, not reproduced here]
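
To make that concrete with a made-up example (not the original image): Cyrillic с, о, р, and у render just like Latin c, o, p, and y, so an all-Cyrillic name passes a 'no mixing within the identifier' rule while reading as Latin:

# "сору" below is entirely Cyrillic, yet renders the same as Latin "copy"
import BadUpstream, only: [сору: 1]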

Where are Highly Restrictive idents 'better' than reference impl's/Rust's C3 check?

This one is interesting, as it's even more the case in Rust than in the Elixir ref impl, due to crate vs. file scope for this check.

If you allow a SINGLE non-confusable Cyrillic character anywhere in a whole crate, then all mixings of Cyrillic and Latin are now completely allowed:

  // this first variable, by itself, would produce warnings, like
  // 'oh no! the project uses only confusable chars from Cyrillic!'
  // п is confusable with Latin n
  let п = 3.14;
  let л = 3.14;
  // ^^ ahhhh -- if you let this one 'not confusable' char in, any mixed-script {Latin, Cyrillic}
  //    identifier won't elicit warnings in the rest of the crate, and the following compiles cleanly:
  let аdmin = 1;

Compare that behavior with the Trojan Source recommendation to "throw errors or warnings for ... identifiers with mixed-script confusable characters".

I hasten to add that Rust obviously does great with UTS 39 -- this just shows that even in the few languages that have implemented UTS 39 protections, there are tradeoffs! UTS 39 acknowledges that, even once you have mixed-script and confusable detection capabilities, nuance is still necessary when determining what to consider suspicious, or there will be a ton of false positives.

Let me know if any of those tradeoffs change what we'd ideally like to have in Elixir, so I can include it in the C3 PR.

I could see Highly Restrictive identifiers being good enough given the C2 protections already in place.

We could also have both -- the Rust-style check from the reference impl (in the case of a file with Unicode in idents only), and Highly Restrictive identifiers, which as you noted make good sense -- and that would be pretty robust!

@josevalim
Member

Hi @mrluc, thank you for another great write-up!

I could see Highly Restrictive identifiers being good enough given the C2 protections already in place.

Yes, this is my understanding too, so that's the route I would go. I also think we could embed C3 into the existing tokenizer, but that's something I would do as a fourth step (I will investigate it once the Highly Restrictive reference implementation is in, you don't have to worry!).
