diff --git a/spec/formatting.md b/spec/formatting.md
index c14145121..f04897565 100644
--- a/spec/formatting.md
+++ b/spec/formatting.md
@@ -502,7 +502,7 @@ Next, using `res`, resolve the preferential order for all message keys:
       1. Let `key` be the `var` key at position `i`.
       1. If `key` is not the catch-all key `'*'`:
          1. Assert that `key` is a _literal_.
-         1. Let `ks` be the resolved value of `key`.
+         1. Let `ks` be the resolved value of `key` in Unicode Normalization Form C.
          1. Append `ks` as the last element of the list `keys`.
    1. Let `rv` be the resolved value at index `i` of `res`.
    1. Let `matches` be the result of calling the method MatchSelectorKeys(`rv`, `keys`)
@@ -516,6 +516,9 @@ The returned list MAY be empty.
 The most-preferred key is first,
 with each successive key appearing in order by decreasing preference.
 
+The resolved value of each _key_ MUST be in Unicode Normalization Form C ("NFC"),
+even if the _literal_ for the _key_ is not.
+
 If calling MatchSelectorKeys encounters any error,
 a _Bad Selector_ error is emitted
 and an empty list is returned.
diff --git a/spec/syntax.md b/spec/syntax.md
index ea55af8a0..24ea52318 100644
--- a/spec/syntax.md
+++ b/spec/syntax.md
@@ -444,6 +444,12 @@ A _key_ can be either a _literal_ value or the "catch-all" key `*`.
 The **_catch-all key_** is a special key, represented by `*`,
 that matches all values for a given _selector_.
 
+The value of each _key_ MUST be treated as if it were in
+[Unicode Normalization Form C](https://unicode.org/reports/tr15/) ("NFC").
+Two _keys_ are considered equal if they are canonically equivalent strings,
+that is, if they consist of the same sequence of Unicode code points after
+Unicode Normalization Form C has been applied to both.
+
 ## Expressions
 
 An **_expression_** is a part of a _message_ that will be determined
@@ -690,6 +696,20 @@ except for U+0000 NULL or the surrogate code points U+D800 through U+DFFF.
 
 All code points are preserved.
 
+> [!IMPORTANT]
+> Most text, including that produced by common keyboards and input methods,
+> is already encoded in the canonical form known as
+> [Unicode Normalization Form C](https://unicode.org/reports/tr15) ("NFC").
+> A few languages, legacy character encoding conversions, or operating environments
+> can result in _literal_ values that are not in this form.
+> Some uses of _literals_ in MessageFormat,
+> notably as the value of _keys_,
+> apply NFC to the _literal_ value during processing or comparison.
+> While there is no requirement that the _literal_ value actually be entered
+> in a normalized form,
+> users are cautioned to employ the same character sequences
+> for equivalent values and, whenever possible, ensure _literals_ are in NFC.
+
 A **_quoted literal_** begins and ends with U+005E VERTICAL BAR `|`.
 The characters `\` and `|` within a _quoted literal_ MUST be
 escaped as `\\` and `\|`.
@@ -714,21 +734,6 @@ number-literal = ["-"] (%x30 / (%x31-39 *DIGIT)) ["." 1*DIGIT] [%i"e" ["-" / "
 
 ### Names and Identifiers
 
-An **_identifier_** is a character sequence that
-identifies a _function_, _markup_, or _option_.
-Each _identifier_ consists of a _name_ optionally preceeded by
-a _namespace_.
-When present, the _namespace_ is separated from the _name_ by a
-U+003A COLON `:`.
-Built-in _functions_ and their _options_ do not have a _namespace_ identifier.
-
-The _namespace_ `u` (U+0075 LATIN SMALL LETTER U)
-is reserved for future standardization.
-
-_Function_ _identifiers_ are prefixed with `:`.
-_Markup_ _identifiers_ are prefixed with `#` or `/`.
-_Option_ _identifiers_ have no prefix.
-
 A **_name_** is a character sequence used in an _identifier_
 or as the name for a _variable_
 or the value of an _unquoted literal_.
@@ -740,6 +745,20 @@ when matching _name_ or _identifier_ strings or _unquoted literal_ values.
 
 _Variable_ _names_ are prefixed with `$`.
 
+Two _names_ are considered equal if they are canonically equivalent strings,
+that is, if they consist of the same sequence of Unicode code points after
+[Unicode Normalization Form C](https://unicode.org/reports/tr15/) ("NFC")
+has been applied to both.
+
+> [!NOTE]
+> Implementations are not required to normalize all _names_.
+> Comparisons of _name_ values only need be done "as-if" normalization
+> has occurred.
+> Since most text in the wild is already in NFC
+> and since checking for NFC is fast and efficient,
+> implementations can often substitute checking for actually applying normalization
+> to _name_ values.
+
 Valid content for _names_ is based on Namespaces in XML 1.0's
 [NCName](https://www.w3.org/TR/xml-names/#NT-NCName).
 This is different from XML's [Name](https://www.w3.org/TR/xml/#NT-Name)
@@ -751,6 +770,21 @@ Otherwise, the set of characters allowed in a _name_ is large.
 > Such variables cannot be referenced in a _message_,
 > but are not otherwise errors.
 
+An **_identifier_** is a character sequence that
+identifies a _function_, _markup_, or _option_.
+Each _identifier_ consists of a _name_ optionally preceded by
+a _namespace_.
+When present, the _namespace_ is separated from the _name_ by a
+U+003A COLON `:`.
+Built-in _functions_ and their _options_ do not have a _namespace_ identifier.
+
+The _namespace_ `u` (U+0075 LATIN SMALL LETTER U)
+is reserved for future standardization.
+
+_Function_ _identifiers_ are prefixed with `:`.
+_Markup_ _identifiers_ are prefixed with `#` or `/`.
+_Option_ _identifiers_ have no prefix.
+
 Examples:
 > A variable:
 >```