elixir-lang · mrluc · Jan 9, 2022 · Jan 14, 2022 · Jan 14, 2022
diff --git a/lib/elixir/pages/unicode-security.md b/lib/elixir/pages/unicode-security.md
@@ -0,0 +1,61 @@
+# Unicode Security
+
+(See [Unicode Syntax](unicode-syntax.html) for more information on Unicode usage in Elixir.)
+
+Since v1.15, Elixir warns on confusing or suspicious uses of Unicode in identifiers, as defined in [Unicode Technical Standard #39](https://unicode.org/reports/tr39/) on Security.
+
+The focus of this document is to describe how Elixir implements the conformance clauses from that standard, referred to as C1, C2, and so on. All quotes are from the spec unless otherwise noted.
+
+## C1. General Security Profile for Identifiers
+
+Elixir will issue 'uncommon codepoint' warnings on identifiers with codepoints in `\p{Identifier_Status=Restricted}`.
+
+> An implementation following the General Security Profile does not permit any characters in \p{Identifier_Status=Restricted}, ...
+
+For instance, the 'HANGUL FILLER' (`ㅤ`) character, which is often invisible, is an uncommon codepoint and will trigger this warning.
+
+Elixir also allows some Restricted codepoints by explicit lists.
+
+> ... unless it documents the additional characters that it does allow. Such documentation can specify characters via properties, such as \p{Identifier_Status=Technical}, or by explicit lists, or by combinations of these.
+
+Currently there is a whitelist of mathematical characters that can be used in identifier names; it may be expanded over time:
+
+    λ∆Ø∂δω∇Φϕσμπκxαθ
+
+The C1 check is implemented via the `String.Unicode.Security.UncommonCodepoints` module. The name 'uncommon codepoints' reflects the reason for this check given in the spec, which is more meaningful than 'general profile':
+
+> The Restricted characters are characters not in common use
+
+
+## C2. Confusable detection
+
+Elixir will warn on identifiers that look the same, but aren't. As an example, in `а = a = 1`, the two 'a' characters are Cyrillic and Latin, and could be confused for each other. Confusable identifiers can lead to hard-to-catch bugs (say, in copy-pasted code) and can represent an attack vector, so we will warn about identifiers within a single file that could be confused with each other.
+
+The means used for confusable detection are those of Section 4, Confusable Detection, with one noted modification:
+
+> Alternatively, it shall declare that it uses a modification, and provide a precise list of character mappings that are added to or removed from the provided ones.
+
+Elixir will not warn on confusability for identifiers made up exclusively of characters in a-z, A-Z, 0-9, and _. This is because ASCII identifiers have existed for so long that the programming community has developed its own means of dealing with confusability between characters like `l` and `1`, or `O` and `0`.
+
+## C3. Mixed script detection
+
+Elixir will warn on identifiers that contain a mix of characters from different scripts, but only when those scripts are not normally used together in a writing system, and even then only when the file's usage of one of those scripts consists solely of confusable characters.
+
+* Some languages naturally use multiple scripts. For instance, the Japanese writing system may use multiple scripts, like Hiragana, Katakana, and Han, so an identifier in Elixir could be composed of characters from all of those scripts (as well as Common characters, like _ and 0-9; see below).
+
+* Some letters may be used in multiple writing systems; for instance, a codepoint could appear in scripts used in the Japanese, Korean, and Chinese writing systems.
+
+* Some characters are in use in so many writing systems that Unicode classifies them as 'Common' or 'Inherited'; these include digits, underscores, and so on. Elixir will not warn when such ALL-script characters are mixed with other scripts, as in `幻ㄒㄧㄤ1 = :foo; 幻ㄒㄧㄤ2 = :bar`.
+
+However, there is no writing system that mixes Cyrillic and Latin characters, and so if that occurs in an identifier in a file, Elixir will examine the file more closely.
+
+* If the only Cyrillic characters in the file are those confusable with characters in other scripts, it will emit a warning to that effect.
+
+* If, however, the file contains non-confusable Cyrillic characters as well, then a warning will not be emitted.
+
+
+## C4, C5 (inapplicable)
+
+'C4 - Restriction Level detection' conformance is inapplicable to identifiers and is NOT claimed; it applies to classifying the level of safety of a given arbitrary string into one of 5 restriction levels.
+
+'C5 - Mixed number detection' conformance is not claimed. However, Mixed Script and Confusable detection provide a level of safety against most confusing uses of mixed-script numbers. For instance, take the example in Section 5.3 of BENGALI DIGIT FOUR (৪), which resembles DIGIT EIGHT (8): `utf৪ = true` would produce a mixed-script warning.
diff --git a/lib/elixir/pages/unicode-syntax.md b/lib/elixir/pages/unicode-syntax.md
@@ -6,6 +6,8 @@ Quoted identifiers, such as strings (`"olá"`) and charlists (`'olá'`), support
 
 Elixir also supports Unicode in identifiers since Elixir v1.5, as defined in the [Unicode Annex #31](https://unicode.org/reports/tr31/). The focus of this document is to describe how Elixir implements the requirements outlined in the Unicode Annex. These requirements are referred to as R1, R6 and so on.
 
+Since v1.15, Elixir warns on confusing or suspicious uses of Unicode in identifiers, as defined in [Unicode Technical Standard #39](https://unicode.org/reports/tr39/) on Security. The [Unicode Security](unicode-security.html) document describes how Elixir implements the requirements of that Standard.
+
 To check the Unicode version of your current Elixir installation, run `String.Unicode.version()`.
 
 ## R1. Default Identifiers

diff --git a/lib/elixir/src/elixir_tokenizer.erl b/lib/elixir/src/elixir_tokenizer.erl
@@ -135,13 +135,13 @@ tokenize(String, Line, Opts) ->
   tokenize(String, Line, 1, Opts).
 
 tokenize([], Line, Column, #elixir_tokenizer{cursor_completion=Cursor} = Scope, Tokens) when Cursor /= false ->
-  #elixir_tokenizer{terminators=Terminators, warnings=Warnings} = Scope,
-
+  #elixir_tokenizer{file=File, identifier_tokenizer=IdentifierTokenizer, terminators=Terminators, warnings=Warnings} = Scope,
   {CursorColumn, CursorTerminators, CursorTokens} =
     add_cursor(Line, Column, Cursor, Terminators, Tokens),
 
   AccTokens = cursor_complete(Line, CursorColumn, CursorTerminators, CursorTokens),
-  {ok, Line, Column, Warnings, AccTokens};
+  UnicodeWarnings = unicode_lint_warnings(IdentifierTokenizer, lists:reverse(Tokens), File),
+  {ok, Line, Column, Warnings ++ UnicodeWarnings, AccTokens};
 
 tokenize([], EndLine, Column, #elixir_tokenizer{terminators=[{Start, StartLine, _} | _]} = Scope, Tokens) ->
   End = terminator(Start),
@@ -150,8 +150,10 @@ tokenize([], EndLine, Column, #elixir_tokenizer{terminators=[{Start, StartLine,
   Formatted = io_lib:format(Message, [End, Start, StartLine]),
   error({EndLine, Column, [Formatted, Hint], []}, [], Scope, Tokens);
 
-tokenize([], Line, Column, #elixir_tokenizer{warnings=Warnings}, Tokens) ->
-  {ok, Line, Column, Warnings, lists:reverse(Tokens)};
+tokenize([], Line, Column, #elixir_tokenizer{file=File, identifier_tokenizer=IdentifierTokenizer, warnings=Warnings}, TokensReversed) ->
+  Tokens = lists:reverse(TokensReversed),
+  UnicodeWarnings = unicode_lint_warnings(IdentifierTokenizer, Tokens, File),
+  {ok, Line, Column, Warnings ++ UnicodeWarnings, Tokens};
 
 % VC merge conflict
 
@@ -1666,4 +1668,10 @@ prune_tokens([], [], Terminators) ->
 
 drop_including([{Token, _} | Tokens], Token) -> Tokens;
 drop_including([_ | Tokens], Token) -> drop_including(Tokens, Token);
-drop_including([], _Token) -> [].
+drop_including([], _Token) -> [].
+
+unicode_lint_warnings(IdentifierTokenizer, Tokens, File) ->
+  case erlang:function_exported(IdentifierTokenizer, unicode_lint_warnings, 2) of
+    true -> IdentifierTokenizer:unicode_lint_warnings(Tokens, File);
+    false -> []
+  end.
diff --git a/lib/elixir/test/elixir/kernel/warning_test.exs b/lib/elixir/test/elixir/kernel/warning_test.exs
@@ -26,6 +26,68 @@ defmodule Kernel.WarningTest do
     assert {:error, _} = Code.string_to_quoted(~s[:"foobar" do])
   end
 
+  describe "unicode identifier security" do
+    test "uncommon codepoints" do
+      assert capture_err(fn -> Code.eval_string("ㅤx = x = 2") end) =~
+               "identifier 'ㅤx' has uncommon codepoint \\u3164"
+
+      assert capture_err(fn -> Code.eval_string("defmodule M do def ∆ᵥ(_), do: 1 end") end) == ""
+      assert capture_err(fn -> Code.eval_string("∆_whitelisted = 1") end) == ""
+    end
+
+    test "warns on confusables" do
+      assert capture_err(fn -> Code.eval_string("аdmin=1; admin=1") end) =~
+               "confusable identifier: 'admin' looks like 'аdmin' on line 1"
+
+      assert capture_err(fn -> Code.eval_string("力=1; カ=1") end) =~
+               "confusable identifier: 'カ' looks like '力' on line 1"
+
+      # by convention, doesn't warn on ascii-only confusables
+      assert capture_err(fn -> Code.eval_string("x0 = xO = 1") end) == ""
+      assert capture_err(fn -> Code.eval_string("l1 = ll = 1") end) == ""
+    end
+
+    test "warns on mixed scripts" do
+      output = capture_err(fn -> Code.eval_string("cirсlе = 1") end)
+
+      warning = ~S"""
+      The only uses of Cyrillic in this file are mixed-script confusables, like 'cirсlе' on line 1:
+       \u0063 'c' {Latin}
+       \u0069 'i' {Latin}
+       \u0072 'r' {Latin}
+       \u0441 'с' {Cyrillic} <- mixed-script confusable
+       \u006C 'l' {Latin}
+       \u0435 'е' {Cyrillic} <- mixed-script confusable
+       Resolved script set (intersection): {∅}
+      """
+
+      assert output =~ warning
+    end
+
+    test "does not warn on valid uses of multiple scripts" do
+      # writing systems with multiple scripts, and with Common chars like '_'
+      assert capture_err(fn -> Code.eval_string("幻ㄒㄧㄤ = 1") end) == ""
+      assert capture_err(fn -> Code.eval_string("幻ㄒㄧㄤ1 = 1") end) == ""
+      assert capture_err(fn -> Code.eval_string("__सवव_1? = 1") end) == ""
+
+      # mixed scripts, but verified, by using non-confusable characters too
+      assert capture_err(fn -> Code.eval_string("夏の幻ㄒㄧㄤ = 2") end) == ""
+      assert capture_err(fn -> Code.eval_string("_सवव_twitter_api = 1") end) == ""
+      assert capture_err(fn -> Code.eval_string("слово_api = 1") end) == ""
+
+      # tokens from the whole file are considered in this check, and
+      # any use of a non-confusable character verifies that script.
+      assert capture_err(fn -> Code.eval_string("рос_api = 1") end) =~ "mixed-script"
+
+      assert capture_err(fn ->
+               Code.eval_string("""
+               рос_api = 1 # mixed-script with confusable Cyrillic
+               слово = 1   # verifies Cyrillic in the file
+               """)
+             end) == ""
+    end
+  end
+
   test "operators formed by many of the same character followed by that character" do
     output =
       capture_err(fn ->