draft of warning on confusable identifiers #11429
Conversation
Re: perf
And on the main branch:
Seems sane so far; of course that may not be a relevant benchmark if some groups of tests are responsible for most of that time, etc. But I'd bet there's some tooling to catch perf regressions in this repo's scripts/workflows too.
Thanks @mrluc for looking into this! But even then, I would say the best time is to clone a large project, like credo, and then do:

$ mix compile
$ mix compile --force

My initial feedback is that this approach will most likely be expensive, and the fact that it is looking for conflicts means the complexity is not linear. I would prefer to go the Rust route and warn for mixed scripts instead, which is linear.
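For illustration, a linear mixed-script check might look like the sketch below (Python rather than Elixir, and with each character's script approximated from the first word of its Unicode name; real implementations use the Script property from Unicode's Scripts.txt):

```python
import unicodedata

def scripts_used(identifier):
    """Approximate the set of scripts used in an identifier.

    Illustrative heuristic only: the script is taken from the first
    word of each character's Unicode name (e.g. 'CYRILLIC SMALL
    LETTER A' -> 'CYRILLIC'); real checks use the Script property.
    """
    scripts = set()
    if any(ch.isascii() and ch.isalpha() for ch in identifier):
        scripts.add("LATIN")  # fast path: ASCII letters are Latin
    for ch in identifier:
        if not ch.isascii():
            scripts.add(unicodedata.name(ch).split()[0])
    return scripts

def is_mixed_script(identifier):
    # Linear in the identifier's length: one pass, no cross-identifier state.
    return len(scripts_used(identifier)) > 1
```

With this sketch, `p\u0430yment` (a Cyrillic 'а' among Latin letters) is flagged as mixed-script, while an all-Latin or all-Cyrillic name is not.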
@josevalim sounds good, I'll look at impacts on compilation specifically for large projects and post results here. If I'm missing something, that will uncover it. 👍

Re: complexity and following Rust's approach, I've been following along with it. I'm not 100% sure whether confusable detection is on by default in Rust, so I should probably have checked that, lol -- I will do that! 🤦

But yeah, to do confusables the way they do: they build up a 'symbol gallery' (which they added to the language to support unicode security), and then the lint does one pass iterating over it, building up the lookup of skeletons and finding collisions in the lookup as it goes.

But as I say, I can just see how it actually performs, vs. how I hand-wavily think it should behave, by doing what you mention. 😄 I've never contributed to elixir-lang, so I obviously don't have a great intuition for that sort of thing. Will do that this afternoon, I think.
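As an illustration of that skeleton/gallery idea (not the PR's actual code), here is a minimal Python sketch; the two-entry `CONFUSABLES` table stands in for the real confusables.txt data, and the real UTS39 skeleton algorithm also NFD-normalizes before mapping:

```python
# Toy sketch of UTS39 skeleton-based confusable detection. CONFUSABLES
# stands in for the real confusables.txt data (thousands of entries);
# the two entries below are illustrative only.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A -> 'a'
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I -> 'i'
}

def skeleton(identifier):
    # Condense to a prototypical form, one character at a time.
    return "".join(CONFUSABLES.get(ch, ch) for ch in identifier)

def find_confusables(identifiers):
    """One pass over the 'symbol gallery': map skeleton -> first-seen
    name, reporting a clash when a different name hits a seen skeleton."""
    seen = {}
    clashes = []
    for name in identifiers:
        sk = skeleton(name)
        if sk in seen and seen[sk] != name:
            clashes.append((name, seen[sk]))
        seen.setdefault(sk, name)
    return clashes
```

The lookup-collision-or-insert is O(1) per identifier, which is why the one-pass approach stays roughly linear in the number of symbols in the gallery.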
Yeah, I think the best scenario for this particular version is to have a separate library with a mix task. But for Elixir itself, if we want to always run it by default, I would go with the simpler mixed-script check.
Cool -- I'll also look at implementing just that; I've got the Rust impl to look at 👍
Did some perf checks in the vein of:
Over 5 runs with each of 3 large repos, in the base v1.13 version and the modified version from this PR, compilation times range from -0.38% to 1.9% more with this PR. Repos:
My hacky-but-probably-ok-for-this test harness is basically this script; it has some sanity-checks like adding a 'canary' module to each repo that triggers the changed behavior, so we know we're using the right versions.
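The timing core of such a harness can be quite small; a hypothetical Python sketch (for the real measurement, `argv` would be `['mix', 'compile', '--force']`, run inside each test repo):

```python
import statistics
import subprocess
import time

def time_command(argv, runs=5):
    """Time a command over several runs; report mean/stdev in seconds.

    Sketch of a harness's timing core only -- the real harness would
    also inject a 'canary' module and diff base vs. PR compilers.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True, capture_output=True)
        samples.append(time.perf_counter() - start)
    return {"mean": statistics.mean(samples), "stdev": statistics.stdev(samples)}
```

Reporting the stdev alongside the mean helps judge whether a sub-2% difference is signal or run-to-run noise.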
IMO right now it has a constant, but measurable, perf impact -- I'd shoot for it to have zero measurable impact. It being basically linear/constant makes sense to me, because currently it just iterates once over the built-up list of tokens for each file, and the number of entries in the map is limited to the number of unique identifiers. And there's low-hanging fruit for optimization (iterating only over unique tokens for the checks, vs. every token as it does right now).
Matching Rust's default behavior for this makes sense to me. For context, all 3 of the lints in non_ascii_idents are on by default in Rust -- might be stable now. Obviously this PR isn't there yet, so I might close it to work on it more.
Yes, I'd plan on adding that -- what Rust did is add a lookup, the 'symbol gallery', that all symbols in a 'parse session' get added to (which, per the Rust book, is per-crate). That involves less work on "correct implementation of the UTS39 logic/protections", and more implementation-specific ideas/questions (would such a gallery be best as a server + ets table with a lifecycle bound up with the compilation of one application, and how to do so...). But I figured an elixir draft of implementing the 3 protections that Rust has added could be useful one way or another. And if not useful, maybe fun 😆
Actually, I just realized one thing: when we tokenize, we already return whether tokens are ASCII-only. We only need to run those checks if at least one token is not ASCII-only, which is going to be false in the majority of cases. So I think we are fine with going this route and there should be some obvious optimizations. The only downside is that, if we want to make it part of the compiler, then it is definitely per file.
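That fast path could be sketched like this (illustrative Python; `expensive_unicode_checks` is a hypothetical placeholder for the real analysis):

```python
def expensive_unicode_checks(identifiers):
    # Hypothetical placeholder for the confusable/mixed-script analysis;
    # here it just reports which identifiers are not ASCII-only.
    return [name for name in identifiers if not name.isascii()]

def check_identifiers(identifiers):
    """Run unicode-security checks only when at least one identifier
    is not ASCII-only -- the common all-ASCII case costs one scan."""
    if all(name.isascii() for name in identifiers):
        return []  # fast path: nothing to check
    return expensive_unicode_checks(identifiers)
```

Since the tokenizer already tracks ASCII-only-ness per token, the guard itself costs no extra per-character work in the real compiler.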
❤️ super cool. I'll probably take this PR down then, and do a pass on getting the remaining UTS39 stuff in place, with parity with Rust's impl in mind. I've got an intuition about how to move beyond single-file without messing up parallel compilation, but will save that for the next 'checkpoint' PR, which hopefully will be closer to usable.
Given those are multiple checks, I would rather implement each of them individually than all at once. The simplest starting point is the restricted characters, many of which we already consider: https://hexdocs.pm/elixir/1.12/unicode-syntax.html We already require NFC, disallow zero-width joiners and similar whitespace, and so on. But there are likely further restrictions and allowances to consider. I would also allow technical characters, especially math ones, but it will require careful revision.
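A per-character restricted check in that spirit might look like this sketch (illustrative only; the real rules come from UTS39's data files and Elixir's unicode-syntax docs):

```python
import unicodedata

def restricted_reasons(identifier):
    """Flag an identifier for per-character restriction violations.

    Illustrative only -- real rules come from UTS39's data files and
    Elixir's unicode-syntax documentation. Here we flag non-NFC
    identifiers and Cf (format) characters, a category that includes
    zero-width joiners and similar invisibles.
    """
    reasons = []
    if unicodedata.normalize("NFC", identifier) != identifier:
        reasons.append("not NFC-normalized")
    for ch in identifier:
        if unicodedata.category(ch) == "Cf":
            reasons.append(f"format character U+{ord(ch):04X}")
    return reasons
```

Because the check is purely per-character (plus one normalization pass), it stays linear and needs none of the cross-identifier state that confusable detection does.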
Roger that -- sorry, I didn't mean that I'd batch them into the same PR, just that I wanted to get through the remaining missing pieces of 'reference implementation of UTS39 for elixir' (and since Rust sticks very closely to the standard, it's helpful for me to look at). Makes sense to do individual PRs for the added checks; even if we don't like some of them, it could be useful as a reference for an alternative implementation of similar protections.
Definitely easier; it's per-char, whereas Confusables and Mixed-Script Confusables following UTS39's algorithm assume some kind of 'symbol gallery' is accumulated, and that the computation + lookup-collision-or-insert will be done against it for every new name (per crate in Rust, per file in this PR). Re: allowing mathy chars, the relevant section of UTS39:
This section is in the context of 2 files UTS39 provides (which Rust sticks to by default), from this directory:
So maybe Elixir decides it doesn't want the exact same decisions as Rust, i.e. doesn't want to use only UTS39's data for its definition of what's Restricted or not -- for instance, I couldn't sneak a single greek or mathy character through the rs playground just now. But we can have 'UTS 39 compliance with $LANG characteristics' as long as the deviations are documented, so if I do this one next I'll want to implement it in a way that (a) documentation is automatic and (b) it's easy to add categories, or ad-hoc chars, if you want to tweak it.

(Aside, these single-char checks are how I got here -- I recreated a JS PoC I'd seen using the invisible 'Hangul filler' char, in Elixir, then asked 'well, how does one prevent it?', and from those learnings I saw that 'uncommon codepoints' in UTS39 catches it in Rust. Go didn't catch it when I checked, actually. So I started down the UTS39 rabbit hole when I saw that it was the way to go, and this PR is the first result; I started with 'confusables' because they are well-known, and complexity-wise they're midway between 'uncommon codepoints' and 'mixed-script confusables' in terms of pain.)
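For reference, catching invisibles like the Hangul Filler can be demonstrated with a tiny hand-picked deny-set (a partial, illustrative list; the real 'uncommon codepoints' check draws on full Unicode property data such as Default_Ignorable_Code_Point):

```python
# A few Default_Ignorable code points that render as nothing. This
# hand-picked set is only illustrative; a real check uses the full
# Unicode property data rather than an explicit list.
INVISIBLES = {
    "\u3164",  # HANGUL FILLER (the PoC character mentioned above)
    "\u115f",  # HANGUL CHOSEONG FILLER
    "\u1160",  # HANGUL JUNGSEONG FILLER
    "\u200b",  # ZERO WIDTH SPACE
    "\u2060",  # WORD JOINER
}

def invisible_chars(identifier):
    """Return the invisible code points hiding in an identifier."""
    return [f"U+{ord(ch):04X}" for ch in identifier if ch in INVISIBLES]
```

An identifier like `if` followed by a Hangul Filler renders identically to `if` in most editors, which is what makes this class of character worth warning on.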
I think the first PR should fully implement UTS 39 with no exceptions; we can discuss allowances later. The reason we may want allowances is because we don't have parsing directives and we don't plan on adding them, so users won't have a fallback to enable those characters if they want to.
Super cool. I should probably get out of our collective hair until I have that 'reference impl' then. 👍 Thanks for your time and all you do for the community. 😄
Protects against confusable identifiers in our own code by default; adds `confusables.txt` and tries to take the approach suggested in UTS39, of condensing identifiers down to their prototypical form, a character at a time via the confusables.txt 'confusable char -> prototypical char' mapping, and warning on clashes.

Example: given elixir code like this:

This PR adds warnings:

Additions: mostly, this adds `confusables.txt` and a new module in `tokenizer.ex`; then some glue lines in `elixir_tokenizer.erl` and some (probably insufficient) tests.

Referenced:
Limitations:
* Rust's `unicode-security` crate and the `non_ascii_idents` lint that uses it actually support ALL of the mitigations mentioned in UTS39 -- so this doesn't have: MixedScript, UncommonCodepoints, GeneralSecurityProfile (yet).
* … `Kernel.__info__` or `Kernel.SpecialForms.__info__`. But ultimately, since `import ..., except: ...` is a thing, it didn't make that much sense to me; it was fun to see 'if', 'unless', 'fn' etc. all protected from lookalikes, but protecting against that, and not the legit takeover of the name, didn't feel that necessary. 😆 It feels like assuming anything other than the internal consistency of names in one file is a losing game in a flexible language. Plus, other layers can catch those lookalikes -- in Rust, 'uncommon codepoints' seems to warn on all of the lookalikes I come up with for 'if', 'while', etc.

Next steps on this PR:
* `make clean test` on main vs this branch?
* … `elixir_tokenizer.tokenize(...)` and is pretty intimate with its output.

Next steps on unicode security: