Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't allow restricted characters in identifiers (UTS 39, C1) #11580

Conversation

mrluc
Copy link
Contributor

@mrluc mrluc commented Jan 17, 2022

(This was originally extracted from the reference implementation of UTS39, which has a much higher-level overview).

This PR prevents Elixir identifiers from including \p{Identifier_Status=Restricted} codepoints; implemented using the unicode 'IdentifierTypes' that correspond to the 'Restricted' status, so that if Elixir wanted to allow some additional 'IdentifierTypes' like eg. 'Technical' characters it could do so, while still preventing 'Uncommon_Use' or 'Obsolete' codepoints (if documented, still conformant w/uts39). This impl conforms with UTS 39, clause 1 ('general security profile for identifiers'); added a new page to docs, with subheading for Clauses C1-C5 from the standard, similar to how unicode-syntax.md has R1-R6 for the Requirements in that annex of the standard, as I think the plan would be to add the 3 primary protections from the reference impl (which Rust also uses).

Example of change, w/an invisible Hangul Filler character which was previously valid in identifiers:

iex> ㅤ = 1
 ... unexpected token: "ㅤ" \(column 1, code point U\+3164\)

@mrluc
Copy link
Contributor Author

mrluc commented Jan 17, 2022

Since the direction taken is direct integration w/Tokenization, I started without any run-time component; the next PR would be MixedScriptConfusable protections, which will need to resolve scripts at runtime for non-Ascii tokens, but we'll still try to minimize the runtime bits.

I'll circle back to that later in the week as I get some 'shop time' 😄.

id_continue = continue -- patterns
type_path = Path.join(__DIR__, "IdentifierType.txt")

restricted =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

37k items in here -- we could flip it around, keep the smaller list, and then subtract via

(id_upper -- patterns) -- (id_upper -- allowed)

But it didn't feel like it made any difference ie when reloading in iex; obvs 37k item list in memory has some impact though depending on people w/use cases that involve 'compile elixir myself, on low-spec machine'.

@josevalim
Copy link
Member

Looks perfect. We just need to add an entry here and we are good: https://github.com/elixir-lang/elixir/blob/main/lib/elixir/unicode/unicode.ex#L7

@mrluc
Copy link
Contributor Author

mrluc commented Jan 17, 2022

@josevalim 👍 done, thank you. MixedScript will be much more fun!

@josevalim
Copy link
Member

josevalim commented Jan 17, 2022

@mrluc just so you know, I dropped another comment in the reference implementation. I think we should only do the remaining checks at the end of the file if any non-ASCII identifier is found, so it will be much closer to your reference implementation. :) We should continue the discussion there but I wanted to be sure you are not sent on a path that won't be merged!

@josevalim josevalim merged commit 4a5fcbe into elixir-lang:main Jan 17, 2022
@josevalim
Copy link
Member

💚 💙 💜 💛 ❤️

@josevalim
Copy link
Member

@eksperimental, would you be so kind to update the Unicode scripts? If you prefer to wait, that's fine too, as more files will likely be added within the next weeks. :)

@eksperimental
Copy link
Contributor

@eksperimental, would you be so kind to update the Unicode scripts? If you prefer to wait, that's fine too, as more files will likely be added within the next weeks. :)

Sure! I will be updating as the PRs are getting merged,
Please ping me in the following PRs, so the load is less.

I will report back to this thread once the changes are implemented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants