
draft of warning on confusable identifiers #11429

Closed

Conversation

@mrluc (Contributor) commented Dec 1, 2021

Protects against confusable identifiers in our own code by default. Adds confusables.txt and takes the approach suggested in UTS 39: condense each identifier down to its prototypical form, one character at a time, via the confusables.txt 'confusable char -> prototypical char' mapping, and warn on clashes.
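
A minimal sketch of that per-character condensing, with a hypothetical one-entry @confusables map standing in for the full confusables.txt data:

defmodule SkeletonSketch do
  # Hypothetical stand-in for confusables.txt: Cyrillic а (U+0430) -> Latin "a"
  @confusables %{0x0430 => "a"}

  def skeleton(identifier) do
    identifier
    |> String.to_charlist()
    |> Enum.map(fn cp -> Map.get(@confusables, cp, <<cp::utf8>>) end)
    |> IO.iodata_to_binary()
  end
end

# Two identifiers clash when their skeletons are equal:
# SkeletonSketch.skeleton("аdmin") == SkeletonSketch.skeleton("admin") #=> true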

Example: given elixir code like this:

# myfile.exs
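# NB: the 'аdmin' on the next line starts with Cyrillic 'а' (U+0430), not Latin 'a'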
аdmin = true
should_be_admin = false
admin = should_be_admin

IO.puts "Admin status: #{admin}. <-- False, like we expect."

if аdmin do
  IO.puts "But actually, #JustAdminThings"
end

This PR adds warnings:

$ bin/elixir myfile.exs
warning: confusable: 'admin' on L3 looks like 'аdmin' up on L1
  myfile.exs:3:1

Admin status: false. <-- False, like we expect.
But actually, #JustAdminThings

Additions: mostly, this adds confusables.txt and a new module in tokenizer.ex; then some glue lines in elixir_tokenizer.erl and some (probably insufficient) tests.

Limitations:

  • Though I looked at the way Rust handles this, this PR only supports confusable detection for now, not all of the mitigations that Rust has. The unicode-security crate and the non_ascii_idents lint that uses it support ALL of the mitigations mentioned in UTS 39 -- so this doesn't have MixedScript, UncommonCodepoints, or GeneralSecurityProfile (yet).
  • I spiked something kinda fun but ultimately pulled it: a way to protect against lookalikes of the default kernel functions, not just identifiers in our own code, by pre-seeding the 'skeletons' lookup with names from e.g. Kernel.__info__ or Kernel.SpecialForms.__info__. But since import ..., except: ... is a thing, it didn't make much sense to me; it was fun to see 'if', 'unless', 'fn' etc. all protected from lookalikes, but protecting against a lookalike while still allowing the legitimate takeover of the name didn't feel that necessary. 😆 It feels like assuming anything beyond the internal consistency of names in one file is a losing game in a flexible language. Plus, other layers can catch those lookalikes -- in Rust, 'uncommon codepoints' seems to warn on all of the lookalikes I can come up with for 'if', 'while', etc.

Next steps on this PR:

  • Discussion -- wanted to see if it makes sense to continue adding UTS 39 protections.
  • Crude benchmarking -- timing make clean test on main vs. this branch?
  • Better benchmarking -- so far it looks like an average of around a 1% slowdown. IMO any measurable slowdown is unnecessary; that can probably be fixed.
  • Write more tests ... this PR only adds 2 tests. (Though since this is about tokenization, the codebase compiling at all in order to run the tests covers a good number of cases, sort of.) Also, one of the tests I added isn't through the public API :( -- since I wanted to assert on the warning raised, it calls :elixir_tokenizer.tokenize(...) directly and is pretty intimate with its output.
  • Draft -> R4R if CI + perf seem okay.

Next steps on unicode security:

  • It seems fun to add more of these protections; if I did, I'd currently lean towards adding them to the same module, String.UnicodeSecurity, in the tokenizer.ex file.
  • MixedScript and UncommonCodepoints would be next in general.
  • Specifically, the next one I'd be interested in adding is 'uncommon codepoints'; I got interested in UTS 39 by recreating in .ex a PoC I'd seen in another language, which used an uncommon codepoint -- a Hangul filler -- as a visually invisible variable. Confusable detection doesn't catch it because its prototype is another Hangul filler, not a space.

@mrluc (Contributor, Author) commented Dec 1, 2021

Re: perf

Running $ time make clean test (tests for all apps) on this branch, on a (/me checks cli) UDOO Bolt V8, gives us:

real	3m58.639s
user	6m16.748s
sys	0m59.551s

real	3m55.786s
user	6m9.571s
sys	0m58.634s

And on the main branch:

real	3m55.956s
user	6m14.368s
sys	0m59.557s

real	3m56.490s
user	6m11.985s
sys	0m59.472s

Seems sane so far; of course that may not be a relevant benchmark if some groups of tests are responsible for most of that time, etc. But I'd bet there's some tooling to catch perf regressions in this repo's scripts/workflows too.

@mrluc marked this pull request as ready for review on December 1, 2021 09:28
@josevalim (Member) commented:

Thanks @mrluc for looking into this!

Using make clean test is not a good comparison because the overall time will be dominated by running the tests, which is not largely affected by this. You want to measure make compile. Something like: make compile && touch lib/elixir/lib/kernel.ex && time make stdlib.

But even then, I would say the best test is to clone a large project, like credo, and then do:

$ mix compile
$ mix compile --force

My initial feedback is that this approach will most likely be expensive, and the fact that it is looking for conflicts means the complexity is not linear. I would prefer to go the Rust route and warn on mixed scripts instead, which is linear.
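
For illustration, a rough sketch of what a linear mixed-script check could look like (the script_of/1 mapping is a hypothetical stub; a real one would consult Unicode's Scripts.txt):

defmodule MixedScriptSketch do
  def mixed_script?(identifier) do
    scripts =
      identifier
      |> String.to_charlist()
      |> Enum.map(&script_of/1)
      # :common covers characters shared across scripts, e.g. digits
      |> Enum.reject(&(&1 == :common))
      |> Enum.uniq()

    length(scripts) > 1
  end

  # Hypothetical stub covering just enough for the example
  defp script_of(cp) when cp in ?a..?z or cp in ?A..?Z, do: :latin
  defp script_of(cp) when cp in 0x0400..0x04FF, do: :cyrillic
  defp script_of(_cp), do: :common
end

# MixedScriptSketch.mixed_script?("аdmin") #=> true (Cyrillic а + Latin rest)
# One pass over the identifier's codepoints, hence linear.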

@mrluc (Contributor, Author) commented Dec 1, 2021

I would say the best test is to clone a large project, like credo, and then do ...

@josevalim sounds good, I'll look at impacts on compilation specifically for large projects and post results here. If I'm missing something, that will uncover it. 👍

Re: complexity and following Rust's approach: I've been following along with it (the non_ascii_idents.rs lint + the unicode-security crate), though so far only with the confusable bits (which is just one piece).

I'm not 100% sure whether confusable detection is on by default in Rust, so I should probably have checked that, lol -- I will do that! 🤦

But yeah, to do confusables the way they do: they build up a 'symbol gallery' (which they added to the language to support unicode security), and then the lint does one pass over it, building up the lookup of skeletons and flagging collisions in the lookup as it goes.
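
Roughly, that one pass has this shape (a sketch, not the lint's real types: tokens are assumed to be {identifier, line} pairs, and skeleton_fun stands in for the confusables.txt mapping):

defmodule OnePassSketch do
  def find_confusables(tokens, skeleton_fun) do
    tokens
    |> Enum.reduce({%{}, []}, fn {ident, line}, {seen, warnings} ->
      key = skeleton_fun.(ident)

      case seen do
        # same skeleton, different spelling: a confusable pair
        %{^key => {other, other_line}} when other != ident ->
          {seen, [{ident, line, other, other_line} | warnings]}

        _ ->
          {Map.put(seen, key, {ident, line}), warnings}
      end
    end)
    |> elem(1)
  end
end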

But as I say, I can just see how it actually performs, vs. how I hand-wavily think it should behave, by doing what you mention. 😄 I've never contributed to elixir-lang before, so I obvs don't have a great intuition for that sort of thing. Will do that this afternoon, I think.

@josevalim (Member) commented:

Yeah, I think the best scenario for this particular version is to have a separate library with a mix task, maybe mix confusables, that traverses the lib directory, calling Code.string_to_quoted and building the lookup that you mentioned. The benefit is that it could also look across files?
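
A minimal sketch of what such a task could look like (module and helper names are hypothetical; skeleton/1 is a placeholder for the confusables.txt mapping):

defmodule Mix.Tasks.Confusables do
  use Mix.Task

  @shortdoc "Warns on confusable identifiers across the lib directory"
  def run(_args) do
    Path.wildcard("lib/**/*.ex")
    |> Enum.reduce(%{}, fn file, seen ->
      {:ok, ast} = Code.string_to_quoted(File.read!(file))

      {_ast, seen} =
        Macro.prewalk(ast, seen, fn
          # bare variables appear as {name, meta, context} with an atom context
          {name, meta, ctx} = node, seen when is_atom(name) and is_atom(ctx) ->
            key = name |> Atom.to_string() |> skeleton()

            case seen do
              %{^key => {other, where}} when other != name ->
                Mix.shell().error(
                  "confusable: #{name} (#{file}:#{meta[:line]}) looks like #{other} (#{where})"
                )

                {node, seen}

              _ ->
                {node, Map.put(seen, key, {name, "#{file}:#{meta[:line]}"})}
            end

          node, seen ->
            {node, seen}
        end)

      seen
    end)

    :ok
  end

  # Placeholder: a real version would fold each codepoint through the
  # confusables.txt "confusable -> prototype" mapping.
  defp skeleton(string), do: string
end

Because the accumulator survives across files, this naturally checks across file boundaries, which a per-file compiler hook can't.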

But for Elixir itself, if we want to always run by default, I would go with the simpler mixed script check.

@mrluc (Contributor, Author) commented Dec 1, 2021

But for Elixir itself, if we want to always run by default, I would go with the simpler mixed script check.

Cool -- I'll also look at implementing just that; I've got the Rust impl to look at 👍

@mrluc (Contributor, Author) commented Dec 2, 2021

Did some perf checks in the vein of:

... the best test is to clone a large project, like credo, and then do:

$ mix compile
$ mix compile --force

Over 5 runs each with 3 large repos, comparing the base v1.13 version against the modified version from this PR, compilation times range from 0.38% faster to 1.9% slower with this PR.

Repos: credo, ecto_sql, ex_venture:

Compilation times (before = v1.13, after = this PR):

             before                            after
Run       credo    ex_venture  ecto_sql    credo    ex_venture  ecto_sql
Run 1     6.707    17.988      2.599       6.895    18.136      2.599
Run 2     6.751    17.973      2.736       6.928    18.107      2.613
Run 3     6.839    17.874      2.634       6.909    18.027      2.642
Run 4     6.883    18.107      2.552       6.931    18.032      2.606
Run 5     6.778    18.032      2.583       6.948    17.87       2.594
Average   6.7916   17.9948     2.6208      6.9222   18.0344     2.6108

My hacky-but-probably-OK-for-this test harness is basically this script; it has some sanity checks, like adding a 'canary' module to each repo that triggers the changed behavior, so we know we're using the right versions.

My initial feedback is that this approach will most likely be expensive and the fact that it is looking for conflicts means the complexity is not linear.

IMO right now it has a constant but measurable perf impact -- I'd shoot for zero measurable impact. It being basically linear makes sense to me, because currently it just iterates once over the built-up list of tokens for each file, and the number of entries in the map is limited to the number of unique identifiers. And there's low-hanging fruit for optimization (iterating over only unique tokens for the checks, vs. every token as it does right now).

I would prefer to go the Rust route ...

Matching Rust's default behavior for this makes sense to me. For context, all 3 of the lints in non_ascii_idents are on by default in Rust -- they might be stable now.

Obvs this PR isn't there yet, so I might close it to work on it more. But the confusables detection (confusable_idents) is actually pretty close, if you squint past it currently being per-file instead of per-application/global. Uncommon codepoints and mixed-script confusables are the others.

it could also lookup across files?

Yes, I'd plan on adding that -- what Rust did is add a lookup, the 'symbol gallery', that all symbols in a 'parse session' get added to (which, per the Rust book, is per-crate). That involves less work on "correct implementation of the UTS 39 logic/protections" and more implementation-specific ideas/questions (would such a gallery be best as a server + ETS table with a lifecycle bound up with the compilation of one application, and how would one do that...).

But I figured an elixir draft of implementing the 3 protections that rust has added could be useful one way or another.

And if not useful, maybe fun 😆

@josevalim (Member) commented:

Actually, I just realized one thing: when we tokenize, we already return whether tokens are ASCII-only. We only need to run those checks if at least one token is not ASCII-only, which is going to be false in the majority of cases. So I think we are fine with going this route, and there should be some obvious optimizations.
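
A minimal sketch of that fast path (the per-token ascii_only? flag is assumed here for illustration; the real tokenizer's token shape may differ):

# Skip the confusable pass entirely in the common all-ASCII case.
defp maybe_check_confusables(tokens) do
  if Enum.all?(tokens, & &1.ascii_only?) do
    :ok
  else
    check_confusables(tokens)
  end
end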

The only downside is that, if we want to make it part of the compiler, then it is definitely per file.

@mrluc (Contributor, Author) commented Dec 3, 2021

[...] So I think we are fine with going this route and there should be some obvious optimizations.

❤️ Super cool. I'll probably take this PR down then, and do a pass on getting the remaining UTS 39 stuff in place, with parity with Rust's impl in mind.

I've got an intuition about how to move beyond single-file without messing up parallel compilation, but I'll save that for the next 'checkpoint' PR, which hopefully will be closer to usable.

@josevalim (Member) commented:

Given those are multiple checks, I would rather implement each of them individually than all at once. The simplest starting point is the restricted characters, many of which we already consider: https://hexdocs.pm/elixir/1.12/unicode-syntax.html

We already require NFC, no zero-width white space, and so on. But there are likely further restrictions and allowances to consider. I would also allow technical characters, especially math ones, but that will require careful revision.

@mrluc (Contributor, Author) commented Dec 3, 2021

Given those are multiple checks, I would rather implement each of them individually than all at once.

Roger that -- sorry, I didn't mean that I'd batch them into the same PR, just that I want to get through the remaining missing pieces of a 'reference implementation of UTS 39 for Elixir' (and since Rust sticks very closely to the standard, it's helpful to look at). It makes sense to do individual PRs for the added checks; even if we don't like some of them, they could be useful as a reference for an alternative implementation of similar protections.

The simplest starting point is the restricted characters [...]

I would also allow technical characters, especially math ones, but it will require careful revision.

Definitely easier; it's per-char, whereas Confusables and Mixed-Script confusables following UTS 39's algorithm assume some kind of 'symbol gallery' is accumulated, with the skeleton computation + lookup-collision-or-insert done against it for every new name (per-crate in Rust, per-file in this PR).
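
(The per-char check really is just set membership -- something like this sketch, where allowed is assumed to be a MapSet of codepoints derived from IdentifierStatus.txt:)

def all_allowed?(identifier, allowed) do
  identifier
  |> String.to_charlist()
  |> Enum.all?(&MapSet.member?(allowed, &1))
end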

Re: allowing mathy chars, the relevant section of UTS39:

An implementation following the General Security Profile does not permit any characters in \p{Identifier_Status=Restricted}, unless it documents the additional characters that it does allow. Such documentation can specify characters via properties, such as \p{Identifier_Status=Technical}, or by explicit lists, or by combinations of these. Implementations may also specify that fewer characters are allowed than implied by \p{Identifier_Status=Restricted}; for example, they can restrict characters to only those permitted by [IDNA2008].

This section is in the context of 2 files UTS39 provides (which Rust sticks to by default), from this directory:

  • IdentifierStatus.txt is basically a 'whitelist', if you want to think of it as 'outsourcing this decision to UTS 39 defaults'.
  • IdentifierType.txt has the categorization of reasons why a character is or isn't in that whitelist, like 'Technical' or 'Uncommon_Use'.

So maybe Elixir decides it doesn't want the exact same defaults as Rust, i.e. doesn't want to use only UTS 39's data for its definition of what's Restricted or not -- for instance, I couldn't sneak a single Greek or mathy character through the Rust playground just now without #![allow(uncommon_codepoints)], a full-crate allowance.

But we can have 'UTS 39 compliance with $LANG characteristics' as long as the deviations are documented, so if I do this one next I'll want to implement it in a way where (a) documentation is automatic and (b) it's easy to add categories, or ad-hoc chars, if you want to tweak it.

(Aside: these single-char checks are how I got here -- I recreated in Elixir a JS PoC I'd seen that used the invisible 'Hangul filler' char, then asked 'well, how does one prevent this?', and from those learnings I saw that 'uncommon codepoints' in UTS 39 catches it in Rust. Go didn't catch it when I checked, actually. So I started down the UTS 39 rabbit hole once I saw it was the way to go, and this PR is the first result of that; I started with 'confusables' because they are well-known, and complexity-wise they sit midway between 'uncommon codepoints' and 'mixed-script confusables' in terms of pain.)

@josevalim (Member) commented:

I think the first PR should fully implement UTS 39 with no exceptions; we can discuss allowances later.

The reason we may want allowances is that we don't have parsing directives and we don't plan on adding them, so users won't have a fallback to enable those characters if they want to.

@mrluc (Contributor, Author) commented Dec 3, 2021

I think the first PR should fully implement UTS 39 with no exceptions; we can discuss allowances later ...

Super cool. I should probably get out of our collective hair until I have that 'reference impl' then. 👍 Thanks for your time and all you do for the community. 😄

@mrluc closed this on Dec 3, 2021