Guard against confusables incl. hangul space #11410
* seen in article of same name
* hangul space (0x3164) has a PoC; others are the subset of rust's confusables that are currently valid as elixir identifiers, minus the few cjk and hebrew confusables that occurred unescaped in eg. string_test.exs
Thank you @mrluc. This looks like a good direction. The place that you changed is related to strings. I assume we still want to allow those in strings (after all, hangul space has correct uses in strings) and we want to restrict them only in variables and atoms, correct? If so, the correct approach is to handle those here: https://github.com/elixir-lang/elixir/blob/main/lib/elixir/unicode/tokenizer.ex Here would be the steps for adding this:
Upon further inspection, it seems this list is incomplete. For example,
To be clear then, both |
@josevalim thanks for the pointers 🙏 We may need to split addressing this PoC out from the concept of confusables (or add it as an exception/additional filter on top of a more comprehensive confusables-handling PR, since Rust-style confusable handling may not catch it...).

Re: the proof-of-concept -- it seems like "a bug in unicode" and also "a bug in unicode's confusables.txt" -- because it visually looks like it should be categorized as whitespace, and as confusable with space, and it's not. For that, adding just a guard against

Re: looking at what Rust's done:
I took a closer look at what Rust's done and I see that the file I had looked at was a bit of a red herring; I have a better idea now of what's involved -- accumulate a lookup of homoglyph 'skeletons' and warn on collisions. This looks like the relevant Rust tracking issue -- a place where they discuss/link to the work of computing the 'skeleton' lookup for homoglyphs. I like the distinction in your description -- the lookup is for tokens, not just for valid elixir identifiers, so assuming space is a token, then if
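The "accumulate skeletons and warn on collisions" idea can be sketched roughly as follows. This is a minimal Python illustration, not the Rust or Elixir implementation: the `PROTOTYPES` table is a tiny hand-picked subset standing in for Unicode's confusables.txt, and the names `skeleton` and `confusable_collisions` are hypothetical. The normalize, map, normalize shape follows the skeleton definition in Unicode's UTS #39.

```python
import unicodedata

# Illustrative subset only (an assumption for this sketch); the full mapping
# is published in Unicode's confusables.txt.
PROTOTYPES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def skeleton(name):
    """Reduce a name to its skeleton: NFD, map to prototypes, NFD again."""
    decomposed = unicodedata.normalize("NFD", name)
    mapped = "".join(PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def confusable_collisions(identifiers):
    """Return (first_seen, later) pairs of distinct names sharing a skeleton."""
    seen = {}
    collisions = []
    for ident in identifiers:
        # Remember the first name seen for each skeleton.
        previous = seen.setdefault(skeleton(ident), ident)
        if previous != ident:
            collisions.append((previous, ident))
    return collisions


if __name__ == "__main__":
    # The second name uses CYRILLIC SMALL LETTER O in place of Latin "o".
    for first, later in confusable_collisions(["foo", "f\u043eo", "bar"]):
        print(f"warning: {later!r} is confusable with {first!r}")
```

A tokenizer would presumably feed every identifier it sees in a file through something like `confusable_collisions` and emit a warning, rather than raise, for each pair.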
Ah, cool, Rust warns of 'uncommon codepoints' by default (playground). So whatever they do is probably 👍 . If someone wants to grab this earlier, feel free; otherwise I can probably schedule some "shop time" for this next Thurs/Fri. Edit: 'what rust does' seems to be 2-layered, which seems cool
Anyway I'd probably want to look at the impl. for both of those.
Yes, the uncommon is a separate part which we should discuss separately (and confusables are more important IMO).
EDIT: closed because the solution that this PR had tested was overly simplistic; see the last 3 comments or so.
I took a pass at protecting against 'confusables', using the subset of Rust's list that are valid as .ex identifiers, and adding one character that Rust's list doesn't catch (hangul space). This would be to complement the recently-added protections against bidirectional characters.
This was just to get the ball rolling -- currently it raises, when maybe it should warn? Other considerations are noted at the end of the PR desc.
I got here kinda by accident; I was talking with someone about that 'hangul space' character enabling a neat supply-chain attack in javascript -- and I confirmed that it works in Elixir too, look for the invisible variable below in 1.13.0-rc.1:
Commented proof-of-concept gist showing how it could be exploited in .ex is here. Security-wise it feels okay to discuss in the open; the article I mentioned has done the rounds, 'trojan codes' has raised awareness generally, vendors like github are adding detections/warnings, the bidirectional chars PR was handled in the open, the PoC relies on a contrived scenario, etc.
I saw a comment on the 'bidirectional' PR that pointed at the Rust confusable protections; I processed that, filtered it down from several hundred to only the 48 that are valid identifiers in elixir, and added the hangul character from my PoC (google 'invisible backdoor javascript' for the original article I read on this) which the Rust list didn't include.
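As a rough illustration of why the hangul character needs its own entry, the sketch below inspects U+3164 with Python's standard `unicodedata` module and then checks a name against a one-entry denylist. The `DENIED` set and the helper names are hypothetical and are not the list from this PR; the point is only that Unicode classifies the character as a letter rather than as whitespace.

```python
import unicodedata

HANGUL_FILLER = "\u3164"  # the "hangul space" from the PoC

# Hypothetical denylist for this sketch: just the one character from the PoC.
DENIED = {HANGUL_FILLER}


def describe(ch):
    """Return (name, general category, is-whitespace) for one character."""
    return (unicodedata.name(ch), unicodedata.category(ch), ch.isspace())


def denied_codepoints(identifier, denied=DENIED):
    """List the denied codepoints found in an identifier, as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in identifier if ch in denied]


if __name__ == "__main__":
    # The character renders as blank, yet its category is "Lo" (other letter).
    print(describe(HANGUL_FILLER))
    print(denied_codepoints("a" + HANGUL_FILLER + "b"))
```

Because the general category is a letter category, whitespace-based checks do not catch it, which is why an explicit entry is needed on top of any confusables-derived list.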
Notes: