Reference implementation of Unicode Security (UTS39) #11569

Closed
61 changes: 61 additions & 0 deletions lib/elixir/pages/unicode-security.md
@@ -0,0 +1,61 @@
# Unicode Security

(See [Unicode Syntax](unicode-syntax.html) for more information on Unicode usage in Elixir).

Since Elixir v1.15, Elixir warns on confusing or suspicious uses of Unicode in identifiers, as defined in the [Unicode Technical Standard #39](https://unicode.org/reports/tr39/) on Security.

The focus of this document is to describe how Elixir implements the conformance clauses from that standard, referred to as C1, C2, and so on. All quotes are from the spec unless otherwise noted.

## C1. General Security Profile for Identifiers

Elixir will issue 'uncommon codepoint' warnings on identifiers with codepoints in `\p{Identifier_Status=Restricted}`.

> An implementation following the General Security Profile does not permit any characters in \p{Identifier_Status=Restricted}, ...

For instance, the 'HANGUL FILLER' (`ㅤ`) character, which is often invisible, is an uncommon codepoint and will trigger this warning.

Elixir also allows some Restricted codepoints by explicit lists.

> ... unless it documents the additional characters that it does allow. Such documentation can specify characters via properties, such as \p{Identifier_Status=Technical}, or by explicit lists, or by combinations of these.

Currently there is a whitelist of mathematical characters that can be used in identifier names. It may be expanded over time:

λ∆Ø∂δω∇Φϕσμπκxαθ

The C1 check is implemented via the `String.Unicode.Security.UncommonCodepoints` module. The name 'uncommon codepoints' reflects the reason for this check given in the spec, which is more meaningful than 'general profile':

> The Restricted characters are characters not in common use
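
The rule above (warn on Restricted codepoints unless they are explicitly allowed) can be sketched as follows. This is only an illustration, not the code of `String.Unicode.Security.UncommonCodepoints`: the module name `UncommonCodepointSketch`, the function `uncommon/1`, and the two-entry `@restricted` list are hypothetical, and treating NABLA (U+2207) as Restricted is an assumption made so that the whitelist has something to override.

```elixir
defmodule UncommonCodepointSketch do
  # Hypothetical stand-in for \p{Identifier_Status=Restricted}. A real check
  # would derive this set from the Unicode data files.
  # 0x3164 is HANGUL FILLER; 0x2207 (NABLA) is assumed Restricted here.
  @restricted [0x3164, 0x2207]

  # The whitelist of mathematical characters quoted in the text above.
  @allowed String.to_charlist("λ∆Ø∂δω∇Φϕσμπκxαθ")

  # Returns the codepoints of `identifier` that would trigger a warning:
  # those that are Restricted and not covered by the whitelist.
  def uncommon(identifier) do
    for cp <- String.to_charlist(identifier),
        cp in @restricted and cp not in @allowed,
        do: cp
  end
end

# HANGUL FILLER is reported; a whitelisted character is not.
UncommonCodepointSketch.uncommon("\u{3164}x")
UncommonCodepointSketch.uncommon("\u{2207}f")
```

The first call returns a list containing only the HANGUL FILLER codepoint, and the second returns an empty list because the whitelist takes precedence over the Restricted status.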


## C2. Confusable detection

Elixir will warn on identifiers that look the same, but aren't. As an example, in `а = a = 1`, the two 'a' characters are Cyrillic and Latin, and could be confused for each other. Confusable identifiers can lead to hard-to-catch bugs (say, in copy-pasted code) and can represent an attack vector, so we will warn about identifiers within a single file that could be confused with each other.

The means used for confusable detection are those of Section 4, Confusable Detection, with one noted modification:

> Alternatively, it shall declare that it uses a modification, and provide a precise list of character mappings that are added to or removed from the provided ones.

Elixir will not warn on confusability for identifiers made up exclusively of characters in a-z, A-Z, 0-9, and _. This is because ASCII identifiers have existed for so long that the programming community has developed its own means of dealing with confusability between identifiers like `l,1` or `O,0`.
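
As a rough sketch of the idea, assuming a skeleton-style comparison as in Section 4 of the spec: two identifiers are confusable when they map to the same skeleton, and groups made up only of ASCII identifiers are skipped. The module `ConfusableSketch`, its functions, and the seven-entry `@prototypes` table are hypothetical; the real mapping comes from the Unicode confusables data, and the exact scope of the ASCII exemption in Elixir may differ from the one assumed here.

```elixir
defmodule ConfusableSketch do
  # Tiny, hand-picked excerpt standing in for the confusables mapping.
  @prototypes %{
    0x0430 => ?a, # CYRILLIC SMALL LETTER A
    0x0435 => ?e, # CYRILLIC SMALL LETTER IE
    0x043E => ?o, # CYRILLIC SMALL LETTER O
    0x0440 => ?p, # CYRILLIC SMALL LETTER ER
    0x0441 => ?c, # CYRILLIC SMALL LETTER ES
    ?0 => ?O,
    ?1 => ?l
  }

  # Normalize, replace each codepoint by its prototype, normalize again.
  def skeleton(identifier) do
    identifier
    |> String.normalize(:nfd)
    |> String.to_charlist()
    |> Enum.map(&Map.get(@prototypes, &1, &1))
    |> List.to_string()
    |> String.normalize(:nfd)
  end

  def ascii_only?(identifier), do: identifier =~ ~r/\A[a-zA-Z0-9_]+\z/

  # Groups identifiers by skeleton and keeps the groups that would warn:
  # more than one distinct identifier, and not all of them ASCII-only.
  def confusables(identifiers) do
    identifiers
    |> Enum.uniq()
    |> Enum.group_by(&skeleton/1)
    |> Map.values()
    |> Enum.filter(fn names ->
      length(names) > 1 and not Enum.all?(names, &ascii_only?/1)
    end)
  end
end

# The first identifier starts with a Cyrillic 'а'; `x0`/`xO` are ASCII-only.
ConfusableSketch.confusables(["\u{0430}dmin", "admin", "x0", "xO"])
```

With this input, only the Cyrillic/Latin `admin` pair is returned. The `x0`/`xO` pair shares a skeleton too, but is dropped by the ASCII exemption.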

## C3. Mixed script detection

Elixir will warn on identifiers that contain a mix of characters from different scripts, but only when those scripts are not normally used together in a writing system, and even then, only when usage of that script consists solely of confusable characters.

* Some languages naturally use multiple scripts. For instance, the Japanese writing system may use multiple scripts, like Hiragana, Katakana, and Han -- so an identifier in Elixir could be composed of characters from all of those scripts (as well as Common characters, like _ and 0-9; see below).

* Some letters may be used in multiple writing systems; for instance, a codepoint could appear in scripts used in the Japanese, Korean, and Chinese writing systems.

* Some characters are in use in so many writing systems that they have been classified by Unicode as 'Common' or 'Inherited'; these include things like numbers, underscores, etc; Elixir will not warn about mixing of ALL-script characters, like `幻ㄒㄧㄤ1 = :foo; 幻ㄒㄧㄤ2 = :bar`.

However, there is no writing system that mixes Cyrillic and Latin characters, and so if that occurs in an identifier in a file, Elixir will examine the file more closely.

* If the only Cyrillic characters in the file are ones confusable with characters in other scripts, it will emit a warning to that effect.

* If, however, the file contains non-confusable Cyrillic characters as well, then a warning will not be emitted.
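
The two-step rule above can be sketched for the Cyrillic/Latin case as follows. This is a simplified illustration under stated assumptions: the module `MixedScriptSketch`, its functions, the range-based `script/1` lookup, and the five-entry `@confusable_cyrillic` list are all hypothetical, and a real check would use the Unicode script and confusables data for every script rather than two hard-coded ones.

```elixir
defmodule MixedScriptSketch do
  # Cyrillic letters with a Latin look-alike: а, е, о, р, с (illustrative subset).
  @confusable_cyrillic [0x0430, 0x0435, 0x043E, 0x0440, 0x0441]

  # Very rough script lookup; everything else is treated as Common.
  def script(cp) when cp in ?a..?z or cp in ?A..?Z, do: :latin
  def script(cp) when cp in 0x0400..0x04FF, do: :cyrillic
  def script(_cp), do: :common

  # An identifier is mixed-script when it uses more than one non-Common script.
  def mixed?(identifier) do
    identifier
    |> String.to_charlist()
    |> Enum.map(&script/1)
    |> Enum.uniq()
    |> Enum.reject(&(&1 == :common))
    |> length()
    |> Kernel.>(1)
  end

  # File-level rule: any non-confusable Cyrillic character anywhere in the
  # file "verifies" Cyrillic, and then no identifier is reported.
  def warnings(identifiers) do
    cyrillic =
      for id <- identifiers,
          cp <- String.to_charlist(id),
          script(cp) == :cyrillic,
          do: cp

    verified? = Enum.any?(cyrillic, &(&1 not in @confusable_cyrillic))

    if verified?, do: [], else: Enum.filter(identifiers, &mixed?/1)
  end
end

# `рос_api` alone is reported; adding `слово` to the same file verifies Cyrillic.
MixedScriptSketch.warnings(["\u{0440}\u{043E}\u{0441}_api"])
MixedScriptSketch.warnings(["\u{0440}\u{043E}\u{0441}_api", "\u{0441}\u{043B}\u{043E}\u{0432}\u{043E}"])
```

The first call returns the single mixed-script identifier, while the second returns an empty list because `слово` contains non-confusable Cyrillic letters.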


## C4, C5 (inapplicable)

'C4 - Restriction Level detection' conformance is inapplicable to identifiers and is NOT claimed; it applies to classifying the level of safety of a given arbitrary string into one of 5 restriction levels.

'C5 - Mixed number detection' conformance is not claimed. However, Mixed Script and Confusable detections provide a level of safety against most confusing uses of mixed-script numbers. For instance, Section 5.3 of the spec gives the example of BENGALI DIGIT FOUR (৪), which can be mistaken for DIGIT EIGHT (8): `utf৪ = true` would produce a mixed-script warning.
2 changes: 2 additions & 0 deletions lib/elixir/pages/unicode-syntax.md
@@ -6,6 +6,8 @@ Quoted identifiers, such as strings (`"olá"`) and charlists (`'olá'`), support

Elixir also supports Unicode in identifiers since Elixir v1.5, as defined in the [Unicode Annex #31](https://unicode.org/reports/tr31/). The focus of this document is to describe how Elixir implements the requirements outlined in the Unicode Annex. These requirements are referred to as R1, R6 and so on.

Elixir will warn on confusing or suspicious uses of Unicode in identifiers since Elixir v1.15, as defined in the [Unicode Technical Standard #39](https://unicode.org/reports/tr39/) on Security. The [Unicode Security](unicode-security.html) document describes how Elixir implements the requirements of that Standard.

To check the Unicode version of your current Elixir installation, run `String.Unicode.version()`.

## R1. Default Identifiers
20 changes: 14 additions & 6 deletions lib/elixir/src/elixir_tokenizer.erl
@@ -135,13 +135,13 @@ tokenize(String, Line, Opts) ->
tokenize(String, Line, 1, Opts).

tokenize([], Line, Column, #elixir_tokenizer{cursor_completion=Cursor} = Scope, Tokens) when Cursor /= false ->
- #elixir_tokenizer{terminators=Terminators, warnings=Warnings} = Scope,
-
+ #elixir_tokenizer{file=File, identifier_tokenizer=IdentifierTokenizer, terminators=Terminators, warnings=Warnings} = Scope,
{CursorColumn, CursorTerminators, CursorTokens} =
add_cursor(Line, Column, Cursor, Terminators, Tokens),

AccTokens = cursor_complete(Line, CursorColumn, CursorTerminators, CursorTokens),
- {ok, Line, Column, Warnings, AccTokens};
+ UnicodeWarnings = unicode_lint_warnings(IdentifierTokenizer, Tokens, File),
+ {ok, Line, Column, Warnings ++ UnicodeWarnings, AccTokens};

tokenize([], EndLine, Column, #elixir_tokenizer{terminators=[{Start, StartLine, _} | _]} = Scope, Tokens) ->
End = terminator(Start),
@@ -150,8 +150,10 @@ tokenize([], EndLine, Column, #elixir_tokenizer{terminators=[{Start, StartLine,
Formatted = io_lib:format(Message, [End, Start, StartLine]),
error({EndLine, Column, [Formatted, Hint], []}, [], Scope, Tokens);

- tokenize([], Line, Column, #elixir_tokenizer{warnings=Warnings}, Tokens) ->
- {ok, Line, Column, Warnings, lists:reverse(Tokens)};
+ tokenize([], Line, Column, #elixir_tokenizer{file=File, identifier_tokenizer=IdentifierTokenizer, warnings=Warnings}, TokensReversed) ->
+ Tokens = lists:reverse(TokensReversed),
+ UnicodeWarnings = unicode_lint_warnings(IdentifierTokenizer, Tokens, File),
+ {ok, Line, Column, Warnings ++ UnicodeWarnings, Tokens};

% VC merge conflict

@@ -1666,4 +1668,10 @@ prune_tokens([], [], Terminators) ->

drop_including([{Token, _} | Tokens], Token) -> Tokens;
drop_including([_ | Tokens], Token) -> drop_including(Tokens, Token);
drop_including([], _Token) -> [].
+
+ unicode_lint_warnings(IdentifierTokenizer, Tokens, File) ->
+ case erlang:function_exported(IdentifierTokenizer, unicode_lint_warnings, 1) of
+ true -> IdentifierTokenizer:unicode_lint_warnings(Tokens, File);
+ false -> []
+ end.
62 changes: 62 additions & 0 deletions lib/elixir/test/elixir/kernel/warning_test.exs
@@ -26,6 +26,68 @@ defmodule Kernel.WarningTest do
assert {:error, _} = Code.string_to_quoted(~s[:"foobar" do])
end

describe "unicode identifier security" do
test "uncommon codepoints" do
assert capture_err(fn -> Code.eval_string("ㅤx = x = 2") end) =~
"identifier 'ㅤx' has uncommon codepoint \\u3164"

assert capture_err(fn -> Code.eval_string("defmodule M do def ∆ᵥ(_), do: 1 end") end) == ""
assert capture_err(fn -> Code.eval_string("∆_whitelisted = 1") end) == ""
end

test "warns on confusables" do
assert capture_err(fn -> Code.eval_string("аdmin=1; admin=1") end) =~
"confusable identifier: 'admin' looks like 'аdmin' on line 1"

assert capture_err(fn -> Code.eval_string("力=1; カ=1") end) =~
"confusable identifier: 'カ' looks like '力' on line 1"

# by convention, doesn't warn on ascii-only confusables
assert capture_err(fn -> Code.eval_string("x0 = xO = 1") end) == ""
assert capture_err(fn -> Code.eval_string("l1 = ll = 1") end) == ""
end

test "warnings on mixed scripts" do
output = capture_err(fn -> Code.eval_string("cirсlе = 1") end)

warning = ~S"""
The only uses of Cyrillic in this file are mixed-script confusables, like 'cirсlе' on line 1:
\u0063 'c' {Latin}
\u0069 'i' {Latin}
\u0072 'r' {Latin}
\u0441 'с' {Cyrillic} <- mixed-script confusable
\u006C 'l' {Latin}
\u0435 'е' {Cyrillic} <- mixed-script confusable
Resolved script set (intersection): {∅}
"""

assert output =~ warning
end

test "does not warn on valid uses of multiple scripts" do
# writing systems with multiple scripts, and with Common chars like '_'
assert capture_err(fn -> Code.eval_string("幻ㄒㄧㄤ = 1") end) == ""
assert capture_err(fn -> Code.eval_string("幻ㄒㄧㄤ1 = 1") end) == ""
assert capture_err(fn -> Code.eval_string("__सवव_1? = 1") end) == ""

# mixed scripts, but verified, by using non-confusable characters too
assert capture_err(fn -> Code.eval_string("夏の幻ㄒㄧㄤ = 2") end) == ""
assert capture_err(fn -> Code.eval_string("_सवव_twitter_api = 1") end) == ""
assert capture_err(fn -> Code.eval_string("слово_api = 1") end) == ""

# tokens from the whole file are considered in this check, and
# any use of a non-confusable character verifies that script.
assert capture_err(fn -> Code.eval_string("рос_api = 1") end) =~ "mixed-script"

assert capture_err(fn ->
Code.eval_string("""
рос_api = 1 # mixed-script with confusable Cyrillic
слово = 1 # verifies Cyrillic in the file
""")
end) == ""
end
end

test "operators formed by many of the same character followed by that character" do
output =
capture_err(fn ->