Warn on confusable non-ascii identifiers (UTS 39, C2) #11582
Conversation
lib/elixir/src/elixir_tokenizer.erl
```erlang
maybe_warn_on_unicode_security(Tokens, File, #elixir_tokenizer{ascii_identifiers_only=false, identifier_tokenizer=IdentifierTokenizer, warnings=Warnings} = Scope) ->
  UnicodeWarnings =
    case erlang:function_exported(IdentifierTokenizer, unicode_lint_warnings, 1) of
```
We don't need the function exported, we can always just call it and it will never be called by the bare tokenizer since it only handles ascii anyway. :)
In fact, I would make it call something like 'String.Confusable':lint(...)
and decouple from the tokenizer. :)
> it will never be called by the bare tokenizer since it only handles ascii anyway. :)
Oh yeah, can avoid that check now -- before, I was calling it at the end of every file, heh.

Re: naming, and splitting it out of the tokenizer, makes sense to me; I have a comment below asking about the right way to do that.

Re: String as top namespace, actually the script resolution IS something that I'd love to have available in String eventually, even if right now it'll be an undocumented impl. detail like Tokenizer.tokenize. Rust's unicode-scripts provides that for all strings, and for some things like username validation, I want my validation rule to be 'single-script usernames only', following UTS 39's definition of single script.
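(As an aside, that kind of single-script username rule could be sketched roughly like this; a minimal illustration in which `SingleScriptSketch`, `@scripts`, and `script_of/1` are hypothetical names, and the script table is a toy subset -- a real implementation would derive its data from Unicode's Scripts.txt rather than hard-coding a few ranges.)

```elixir
defmodule SingleScriptSketch do
  # Toy script table: a real implementation would derive this from
  # Unicode's Scripts.txt / ScriptExtensions.txt data files.
  @scripts %{
    ?a..?z => :latin,
    ?A..?Z => :latin,
    0x0430..0x044F => :cyrillic,
    0x0391..0x03C9 => :greek,
    ?0..?9 => :common,
    [?_] => :common
  }

  defp script_of(cp) do
    Enum.find_value(@scripts, :unknown, fn {range, script} ->
      if cp in range, do: script
    end)
  end

  # UTS 39-style "single script": after dropping Common/Inherited,
  # at most one script remains for the whole string.
  def single_script?(string) do
    string
    |> String.to_charlist()
    |> Enum.map(&script_of/1)
    |> Enum.reject(&(&1 == :common))
    |> Enum.uniq()
    |> length()
    |> Kernel.<=(1)
  end
end

SingleScriptSketch.single_script?("my_user_42")  # true: Latin + Common only
SingleScriptSketch.single_script?("пароль")      # true: all Cyrillic
SingleScriptSketch.single_script?("аdmin")       # false: Cyrillic а mixed with Latin
```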
lib/elixir/unicode/tokenizer.ex
```elixir
type_path = Path.join(__DIR__, "IdentifierType.txt")
```
Let's move the changes to this module to a separate file called "confusables.ex" or similar. We probably don't need to keep them in the tokenizer!
I remember I didn't know how to do it with the bootstrap process and my first try didn't work, 🤦 so I punted -- until now I guess!
However these lines specifically are the pieces from the last PR, just hoisted up (the Restricted idents / uncommon codepoints) so I want to make sure I understand the suggestion; it's used at compile-time in Tokenizer, but also in the mixed-script confusables, cutting the list of potential confusables down from 6k-8k to a few hundred.
There are a couple of ways I can see it being split out, which of these makes more sense? (or neither, heh)
- split it all, including the stuff from last PR, into maybe unicode/security.ex, add it ahead of tokenizer to bootstrap, and tokenizer.ex could get e.g. restricted codepoints via `String.UnicodeSecurity.restricted_codepoints()`. (There's a line I need to cut to do that, but looking at it again, it needs to be cut anyway :P arrows going wrong way, depending on tokenizer in confusables when it should just use the list of restricted codepoints.)
- split out just the changes from this file into confusables.ex, add it ahead of tokenizer to bootstrap, and it's OK to e.g. re-read IdentifierType.txt if we need to there.
It would be option 2, but you are right that handling bootstrapping corner cases is annoying, so let's merge this in a single file as is right now and I can take care of splitting it up later on. :)
lib/elixir/pages/unicode-security.md
* Some characters are in use in so many writing systems that they have been classified by Unicode as 'Common' or 'Inherited'; these include things like numbers, underscores, etc. Elixir will not warn about mixing of ALL-script characters, like `幻ㄒㄧㄤ1 = :foo; 幻ㄒㄧㄤ2 = :bar`.
However, there are some script combinations with no overlap in characters, like {Cyrillic} and {Latin} -- in Unicode terms, the 'resolved script set' would be empty. So if that kind of script mixing occurs in an identifier, and the only Cyrillic characters in the file are those confusable with characters in other languages, it will emit a warning to that effect. (If, however, the file also contains non-confusable Cyrillic characters in source code, then the programmer can visually detect that another script is being used, and no warning is issued.)
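To illustrate the 'resolved script set' idea that this paragraph relies on: per UTS 39 section 5.1, it is the intersection of the per-character script sets, and an empty result signals disallowed mixing. A minimal sketch under that definition (`ResolvedScriptSet` and the atom script names are hypothetical; real per-character sets come from Scripts.txt / ScriptExtensions.txt):

```elixir
defmodule ResolvedScriptSet do
  # Resolved script set of a string = intersection of the script sets
  # of its characters. Common/Inherited characters behave as "all
  # scripts", so they are represented as :all and skipped here.
  def resolve(script_sets) do
    script_sets
    |> Enum.reject(&(&1 == :all))
    |> case do
      [] -> :all
      [first | rest] -> Enum.reduce(rest, first, &MapSet.intersection/2)
    end
  end
end

# A Cyrillic-only char mixed with a Latin-only char: empty set -> warn-worthy
ResolvedScriptSet.resolve([MapSet.new([:cyrillic]), MapSet.new([:latin])])

# A Han char (shared by several East Asian writing systems) plus a
# Hiragana char resolves to a non-empty set -> fine
ResolvedScriptSet.resolve([MapSet.new([:han, :japanese]), MapSet.new([:hiragana, :japanese])])
```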
Hrm... I am thinking about this... if we only emit warnings when all usages are confusable, doesn't it mean we already get the confusable warning? If so, what is the advantage of this additional rule?
Maybe we should focus on single-script, mixed-script, and whole-script confusables for this PR, and see what C3 brings us further down the road?
I am actually keen on tightening this more than the rule above. For example, I would warn if any mixed script outside of ASCII is used. But then I would rather merge C2 first and then have this discussion.
Sorry for the back and forth on this, but those layers are being uncovered as I go deeper in the reviews (and as I re-read UTS 39; this was the third time :D).
Yeah, I wondered that at first too -- they often overlap, but not always; this handles cases where there is no other token in the file whose skeleton clashes, but where the situation would still be suspicious. (Example, may be a bit contrived): a package/dependency upstream has had an additional, visually identical function added to it; or this is the means used to define something in another file, and it's used in the file being considered here.

In that case, usage of `is_admin()` may only happen once, so it's not actually confusable with anything in the file; but if it includes only confusable characters from Cyrillic mixed in there, that's a fishy-enough situation to warn on (and is the example they show in Trojan Source: the sayHello() fn from a dependency, where the H is from another script).

(I actually renamed the lint in this PR -- `MixedScriptUsingOnlyConfusableCharacters` -- to try to make that a bit clearer. But as with a lot of things in this standard, it can only get so clear :( because, for instance, 'mixed script' intuitively meant something quite different to me before actually implementing the standard; or rather, now I have a couple of things in mind that someone could mean when they say that.)
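For readers following along, the confusable (C2) check under discussion rests on UTS 39's skeleton function: two identifiers are confusable when their skeletons collide. A toy sketch, assuming a tiny hand-picked subset of the real confusables.txt mapping (`SkeletonSketch` is a hypothetical name):

```elixir
defmodule SkeletonSketch do
  # Tiny subset of a confusables mapping; the real data is Unicode's
  # confusables.txt, which maps each confusable codepoint to a prototype.
  @confusables %{
    0x0430 => "a",  # CYRILLIC SMALL LETTER A -> latin a
    0x043E => "o",  # CYRILLIC SMALL LETTER O -> latin o
    0x0456 => "i"   # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I -> latin i
  }

  # Replace each codepoint with its prototype, leaving others untouched.
  def skeleton(identifier) do
    identifier
    |> String.to_charlist()
    |> Enum.map_join(fn cp -> Map.get(@confusables, cp, <<cp::utf8>>) end)
  end

  # C2-style check: two distinct identifiers sharing a skeleton are confusable.
  def confusable?(a, b), do: a != b and skeleton(a) == skeleton(b)
end

SkeletonSketch.confusable?("admin", "аdmin")  # true (first letter is Cyrillic а)
SkeletonSketch.confusable?("admin", "admin")  # false (identical strings)
```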
If I got it right: C2 warns for existing confusables. C3 warns for potential confusables. Is that it?
@mrluc if we go with a more restrictive approach, as described here:

> Highly Restrictive
> The string qualifies as Single Script, or
> The string is covered by any of the following sets of scripts, according to the definition in Section 5.1:
> - Latin + Han + Hiragana + Katakana; or equivalently: Latn + Jpan
> - Latin + Han + Bopomofo; or equivalently: Latn + Hanb
> - Latin + Han + Hangul; or equivalently: Latn + Kore

Would that simplify the implementation or make it more complex?
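For reference, the 'covered by' test in that definition is just a subset check against a few permitted script sets, plus the Single Script case; producing the resolved script set is the harder part discussed in this thread. A minimal sketch with a hypothetical module name:

```elixir
defmodule HighlyRestrictiveSketch do
  # The three permitted augmented script sets from the quoted definition,
  # as toy atom sets.
  @allowed_covers [
    MapSet.new([:latin, :han, :hiragana, :katakana]),  # Latn + Jpan
    MapSet.new([:latin, :han, :bopomofo]),             # Latn + Hanb
    MapSet.new([:latin, :han, :hangul])                # Latn + Kore
  ]

  # `scripts` is the identifier's resolved script set, with
  # Common/Inherited already removed.
  def highly_restrictive?(scripts) do
    single_script? = MapSet.size(scripts) <= 1

    single_script? or
      Enum.any?(@allowed_covers, fn cover -> MapSet.subset?(scripts, cover) end)
  end
end

HighlyRestrictiveSketch.highly_restrictive?(MapSet.new([:latin, :hiragana]))  # true: Latn + Jpan
HighlyRestrictiveSketch.highly_restrictive?(MapSet.new([:cyrillic]))          # true: single script
HighlyRestrictiveSketch.highly_restrictive?(MapSet.new([:latin, :cyrillic]))  # false: not covered
```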
> Yeah, I am on the same page now, thanks! Now my question is: why not ship a more strict version of the warning? Such as only allowing Latin with Jpan, Hanb, and Kore?

Want to clarify -- do you mean one of the standard's definitions of restriction levels, like its definition of 'Highly Restrictive', or just filtering it down to only those languages?

I don't think the standard recommends only allowing those; notice that it mentions "single script", which requires full scriptset resolution to determine:

> Highly Restrictive:
> The string qualifies as Single Script, or
> The string is covered by any of the following sets of scripts, according to the definition in Section 5.1:
> - Latin + Han + Hiragana + Katakana; or equivalently: Latn + Jpan
> - Latin + Han + Bopomofo; or equivalently: Latn + Hanb
> - Latin + Han + Hangul; or equivalently: Latn + Kore
Regarding whether or not we could do something more restrictive -- YES, that's a lang design question for sure.
- For instance, I initially thought "single script is good enough for identifiers"; it ignores Common, and all mathy chars are Common; it resolves multiple-script Asian languages; etc.
- The counter-example there would be variables written by people in other scripts, who are now forced to come up with a transliteration of 'twitter' or 'api' in variables like "mycompany_twitter" and "mycompany_api", where 'mycompany' is in another script.

However, how much simpler does it make things?

- It could remove statefulness if we move to being more restrictive on idents vs. checking the uses in the file. That would remove, basically, all of the fns `check_confusable_mixed_script` and `warning_per_problematic_script`, plus a little more, and it could also in theory be done at tokenization time.
- But I think almost everything we do with mixed-script would still require scriptset resolution, which is the bulk of the C3 additions.
@josevalim github UI lag got me on your question here; I responded to it above -- "Single Script" is the key there, due to how UTS 39's definition of single script requires script resolution. So it would simplify it somewhat, removing a couple of functions (EDIT: and allow it to be stateless, so only Confusables would require a pass over all tokens), but the effect on the size of the implementation wouldn't be drastic, since resolution is the bulk of it.
@mrluc I see. Let's go with C2 for now only. Then, after merging, I can split it into a separate module, which should leave us in a better position to accept C3. WDYT?
Sounds good to me!
After talking over the C3 stuff, C2 seems blessedly simple! 😆 Since it's relatively simple, it may as well be the only thing in the PR. I'll probably be able to do that + incorporate the feedback tomorrow.
Thanks, and again, sorry for the false positive on accepting both C2 and C3 at once. :)
Co-authored-by: José Valim <jose.valim@gmail.com>
@josevalim I took a pass at cutting it down to just C2 capabilities for now:
💚 💙 💜 💛 ❤️
Oh, quick question: should we also include atoms in your confusable lookups? A confusable atom can also be used to hide conditionals and so forth.
D'oh, sorry for missing deletion of all of those C3 .txt/comments. 🤦

Re: atoms, I think C3, if added in a form similar to the ref impl, would be a natural place to give atoms a measure of protection, without needing to add an atom confusability lookup to the C2 check. That check scans tokens and builds up all unique chars -- currently in identifiers/aliases only, but we could extend it for atoms too -- and that would let us catch atoms with mixed-script confusables.

It's an interesting question to what extent languages with pattern matching should warn on what we might consider "literals", heh ... I'm not sure I have a strong opinion on how far it should go, but the above feels like one layer/measure of coverage.
That was my thinking, especially because of this:

```elixir
Foo.admin()
apply(Foo, :admin, [])
```

My biggest concern with atoms is code like this:

```elixir
if user.type == :regular do
  ...
else
  # do admin stuff
end
```

And someone then sneaks a confusable atom in. It is one of the reasons why I think for C3 we should go with a more restrictive approach.
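To make that concern concrete: an atom containing a confusable Cyrillic letter is a completely different atom from its ASCII lookalike, so the comparison above silently falls through to the `else` branch. A quick demonstration (the lookalike atom here is constructed with an explicit `\u` escape for clarity):

```elixir
ascii = :regular
# visually identical, but the "a" is CYRILLIC SMALL LETTER A (U+0430)
lookalike = String.to_atom("regul\u0430r")

ascii == lookalike
# => false -- two distinct atoms, so a check against :regular won't match

byte_size(Atom.to_string(lookalike))
# => 8 -- the Cyrillic а occupies two bytes in UTF-8, vs 7 bytes for :regular
```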
(My example was wrong -- I was thinking of a different case.) I do think C2 confusable protection should include atoms 👍. Would you like me to open a new PR for that tweak, and add a line to the test? Super-simple change.

Re: drawing the line somewhere -- in theory there are these variations, too:

```elixir
if user.type == :regular do ...
if user.type == 'regular' do ...
if user.type == "regular" do ...
```

But, given that Atoms are used heavily for pattern matching in control-flow, I'd agree that, wherever the line is, Atoms are probably inside it.
That makes sense to me generally; it does feel like Elixir has the opportunity to make things a lot easier for itself, while still letting people use their language. In the limit: maybe even enforcing UTS 39's definition of "single script" for unicode idents. Would have to consider how to handle a bunch of examples -- for instance, the mixed-script "mycompany_api" case mentioned above.
I have already pushed it. :)
I agree. At the same time, anything that is "quoted" kinda means "here be dragons". For example, we don't currently check for quoted atoms.
Yeah, this can come later. I was thinking about such rules, but the report is pretty clear that Turkish and Cyrillic should never be mixed with ASCII; probably it is just too easy to sneak something in. At the same time, you probably have a very reasonable way of writing them.
@josevalim it's been a while -- so I spiked doing C3 protections via 'Highly Restrictive' identifiers, looks promising. I have a couple of clarifying questions, and then below I have 2 examples of tradeoffs, where it's better / worse than the ref impl.
But to be clear, these are 2 different approaches to addressing UTS39 C3 / Trojan Source's 2nd CVE; each has strengths and weaknesses:
Comparisons below:

**Where is the C3 check from the reference impl 'better' than C3 via Highly Restrictive identifiers?**

Rust, and the reference implementation in Elixir, try to catch the case of 'the only use of non-ascii/Latin script X in this project is comprised only of confusable characters'. Since confusable detection prevents one class of problem that can arise from that, the remaining issue is with identifiers imported into scope from a dependency, perhaps even autocompleted instead of typed -- the Rust & reference implementation approach would catch an import from a compromised upstream:

```elixir
# fns using Cyrillic lowercase letters, let's say
import BadUpstream, only: [math: 1, top: 1, key: 1, batch: 1]
```

Highly Restrictive wouldn't catch that, because no 'script mixing' would be happening in the identifier, which could be made up of combos of these letters, for instance {Grk, Cyrl, Latn}.

**Where are Highly Restrictive idents 'better' than reference impl's/Rust's C3 check?**

This one is interesting, as it's even more the case in Rust than the Elixir ref impl, due to crate vs. file scope for this. If you allow a SINGLE non-confusable Cyrillic character anywhere in a whole crate, then all mixings of Cyrillic and Latin are now completely allowed:

```rust
// this first variable, by itself, would produce warnings, like
// 'oh no! the project uses only confusable chars from Cyrillic!'
// п is confusable with Latin n
let п = 3.14;
let л = 3.14;
// ^^ ahhhh -- if you let this one 'not confusable' char in, any mixed-script {Latin, Cyrillic}
// identifier won't elicit warnings in the rest of the crate, and the following compiles cleanly:
let аdmin = 1;
```

Compare that behavior with the Trojan Source recco to "throw errors or warnings for ... identifiers with mixed-script confusable characters".

Hasten to add, Rust obviously does great with UTS39 -- this just shows that even in the few languages to have implemented UTS39 protections, there are tradeoffs! UTS39 acknowledges that, even once you have mixed-script and confusable detection capabilities, nuance is still necessary when determining what to consider suspicious, or there will be a ton of false positives.

Let me know if any of those tradeoffs change what we'd ideally like to have in Elixir, so I can include it in the C3 PR. I could see Highly Restrictive identifiers being good enough, given the C2 protections already in place. We could also have both -- the Rust-style check from the reference impl (in case of a file with unicode in idents only), and Highly Restrictive identifiers, which as you noted make good sense -- which would be pretty robust!
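As a rough sketch of the whole-file rule being compared here (hypothetical names and toy data; the real check would track scripts for every codepoint across all tokens): a script is flagged when the file uses it, but only via codepoints that are confusable with another script.

```elixir
defmodule WholeScriptSketch do
  # Toy data: a few Cyrillic codepoints that are confusable with ASCII.
  @confusable_cyrillic MapSet.new([0x0430, 0x043E, 0x0440, 0x0441])  # а о р с

  # The Cyrillic block, as a rough stand-in for real script data.
  @cyrillic 0x0400..0x04FF

  # Flag Cyrillic if the file uses it, but ONLY via confusable codepoints;
  # a single non-confusable Cyrillic char makes the script visually evident.
  def suspicious_cyrillic?(file_codepoints) do
    cyrillic = Enum.filter(file_codepoints, &(&1 in @cyrillic))

    cyrillic != [] and
      Enum.all?(cyrillic, &MapSet.member?(@confusable_cyrillic, &1))
  end
end

# only confusable а and о appear -> warn
WholeScriptSketch.suspicious_cyrillic?(String.to_charlist("аdmin_tоken"))  # true
# a non-confusable л appears too -> visually detectable, no warning
WholeScriptSketch.suspicious_cyrillic?(String.to_charlist("аdmin_лог"))   # false
```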
Hi @mrluc, thank you for another great write-up!

Yes, this is my understanding too, so that's the route I would go. I also think we could embed C3 into the existing tokenizer, but that's something I would do as a fourth step (I will investigate it once the Highly Restrictive reference implementation is in, you don't have to worry!).
This PR adds warnings for confusable identifiers, corresponding to clause C2 of UTS39 (previous PR handled C1); the reference implementation, and/or the docs/tests accompanying this PR, describe what that means a bit more.
The major change from the reference implementation is tracking whether a whole file has any non-ascii tokens, and only running the whole-file unicode lints in that case.
We expect to add another lint, covering Clause 3 of UTS39 (warning on 'mixed-script confusable characters', regardless of whether pairs of confusable identifiers exist), in a future PR.
(This was extracted from the reference implementation of UTS39, which has a higher-level overview + some examples).