| 1 | +# MF2.0 compromise syntax |
| 2 | + |
| 3 | +# Intro |
| 4 | + |
| 5 | +This syntax builds on the one from https://github.com/unicode-org/message-format-wg/pull/230 |
| 6 | +but is modified to address |
| 7 | +[@markusicu’s comments there](https://github.com/unicode-org/message-format-wg/pull/230#issuecomment-1116903103). |
| 8 | + |
| 9 | +# Basic syntax |
| 10 | + |
| 11 | +Messages need to delineate between literal text, placeholders, and other “code”. |
| 12 | +We should start in “code mode” and always enclose “patterns” (text+placeholders) in curly braces. |
| 13 | +``` |
| 14 | +{Hello world!} |
| 15 | +{Hello {$name}!} |
| 16 | +``` |
| 17 | + |
| 18 | +This is unusual for formatting syntaxes, but useful. |
| 19 | +We need to support selecting among multiple patterns anyway, |
| 20 | +and delimiting the patterns makes it unambiguous |
| 21 | +which white space is part of the pattern and which merely separates “code” tokens. |
| 22 | +For consistency, we should always enclose a pattern, |
| 23 | +even if the message consists only of that pattern. |
| 24 | +That also helps with embedding messages in various resource file formats, |
| 25 | +because they can freely trim surrounding white space without |
| 26 | +requiring escapes when a message pattern wants to start or end with spaces. |
| 27 | + |
| 28 | +By contrast, consider the experience with the existing ICU MessageFormat syntax, |
| 29 | +which does start in “text mode”. |
| 30 | +ICU MessageFormat pioneered the selection among multiple patterns based on run-time arguments. |
| 31 | +It represents selection using complex placeholders, |
| 32 | +which has the side effect of allowing literal text and other placeholders |
| 33 | +before and after the top-level selection placeholder. |
| 34 | +However, for reliable translations, |
| 35 | +there should be no translatable contents before or after the selection placeholder; |
| 36 | +instead, each selectable pattern should form one complete “translation unit”. |
| 37 | +Because the existing ICU MessageFormat starts in “text mode”, |
| 38 | +even though it looks like there is no extraneous text, |
| 39 | +spurious white space creeps in from developers’ line breaking of long message strings. |
| 40 | +The remedy is to always use syntax to indicate the start of translatable contents. |
| 41 | + |
| 42 | +We use curly braces to delimit patterns because |
| 43 | +`{}` are the paired ASCII punctuation characters least commonly used in normal text. |
| 44 | +For the same reason, we also use them for embedding placeholders in patterns. |
| 45 | + |
| 46 | +Literal text can use any characters except for curly braces, |
| 47 | +and except for the backslash, which we use as usual for escaping. |
| 48 | +That is, the only special characters inside a pattern are `{}\`. |
| 49 | +The only allowed escape sequences are `\{`, `\}`, and `\\`. |
| 50 | +It is an error if `\` is followed by any other character. |
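
The escaping rules above can be sketched as a small unescaping routine. This is only an illustration: the function name `unescape_pattern_text` is hypothetical, and treating a lone trailing backslash as an error is an assumption that the text does not spell out.

```python
def unescape_pattern_text(text: str) -> str:
    """Resolve the three pattern escapes in a run of literal text."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # A backslash must be followed by one of the three special
            # characters. Assumption: a trailing backslash is also an error.
            if i + 1 >= len(text) or text[i + 1] not in "{}\\":
                raise ValueError(f"invalid escape at index {i}")
            out.append(text[i + 1])
            i += 2
        elif ch in "{}":
            # Unescaped braces delimit patterns and placeholders,
            # so they cannot occur inside literal text.
            raise ValueError(f"unescaped brace at index {i}")
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For instance, `unescape_pattern_text(r"\{x\}")` yields the three characters `{x}`, while a sequence such as `\n` is rejected.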
| 51 | + |
| 52 | +The message syntax does not use `'` or `"`, |
| 53 | +so that it is easy to hard-code message strings in programming language source code. |
| 54 | + |
| 55 | +# Placeholders |
| 56 | + |
| 57 | +Formatting a message replaces placeholders with values based on run-time arguments or special functions. |
| 58 | +We also allow for value literals specified inside the placeholder, |
| 59 | +instead of using an argument name; |
| 60 | +and we also allow for invoking functions without using argument names or value literals. |
| 61 | +``` |
| 62 | +{$name} |
| 63 | +{$count :number} |
| 64 | +{$fraction :number style=percent minFractions=2} |
| 65 | +{<25> :number} |
| 66 | +{:specialFunction optionKey=optionValue key2=<value with spaces>} |
| 67 | +``` |
| 68 | + |
| 69 | +An argument name is a `$` immediately followed by an identifier. |
| 70 | +A message formatting function will typically accept a Map of argument keys to values |
| 71 | +where the keys match argument name identifiers in the patterns of the message. |
| 72 | + |
| 73 | +TODO: For the definition of identifiers we should consult with the Unicode Source Code Working Group. |
| 74 | + |
| 75 | +If the placeholder specifies only an argument name, |
| 76 | +then the formatting function is inferred from the run-time type of the argument value. |
| 77 | +For example, a string value would simply be inserted, |
| 78 | +and a numeric type could be formatted using some kind of default number formatter. |
| 79 | +- TODO: In the registry, specify the default formatters for a small set of value types. |
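
As a rough illustration of inferring the formatter from the run-time type, a dispatcher might look like the sketch below. The function name `default_format` and the comma-grouped number output are placeholders only; the actual defaults would come from the registry and would be locale-aware.

```python
def default_format(value: object) -> str:
    """Pick a default formatter from the run-time type of an argument value."""
    if isinstance(value, str):
        # Strings are inserted as they are.
        return value
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        # Stand-in for a default number formatter; not locale-aware.
        return f"{value:,}"
    # Anything else falls back to its string form in this sketch.
    return str(value)
```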
| 80 | + |
| 81 | +The function is specified via a `:` immediately followed by an identifier. |
| 82 | +If an argument name or a value literal is given, |
| 83 | +then the function is usually a formatter for its expected input types. |
| 84 | +- TODO: There still seems to be discussion about the function prefix character. |
| 85 | + It could be some other ASCII punctuation, for example `@`. |
| 86 | +- TODO: Functions must be listed in a registry. |
| 87 | +- TODO: Functions that accept value literals must specify their syntax. |
| 88 | +- TODO: Reserve a naming convention for private use functions (not in the standard registry). Examples: |
| 89 | + - Starts with `_` |
| 90 | + - Starts with `x` |
| 91 | + - Contains interior dots, e.g., `com.google.fancyNumber` |
| 92 | + |
| 93 | +When a function is specified, it can optionally be followed by options, which are key-value pairs |
| 94 | +with `=` (and no white space) between the key identifier and the value. |
| 95 | +The option value can contain any character other than curly braces and white space, |
| 96 | +unless delimited like literal values. |
| 97 | +- TODO: Each registered function must define the available options and their value syntax. |
| 98 | +- TODO: If we allow white space in option values, then we need optional delimiters for such values. Probably the same delimiters as for literal values. |
| 99 | + |
| 100 | +Options are not allowed when no function is specified. |
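
The placeholder rules described so far can be sketched as a small parser. Several details here are assumptions rather than decisions of this proposal: identifiers are taken to be ASCII-only, value literals use the `<>` delimiters shown in the examples with no escaping inside them, and the name `parse_placeholder` is hypothetical.

```python
import re

# Assumptions: ASCII identifiers, the four white space characters named in
# this document, and <> as literal delimiters without inner escapes.
IDENT = r"[A-Za-z_][A-Za-z0-9_]*"
WS = r"[\t\n\r ]+"
LITERAL = r"<([^<>{}]*)>"
VALUE = r"<[^<>{}]*>|[^\s{}]+"
OPTION = re.compile(rf"{WS}({IDENT})=({VALUE})")


def parse_placeholder(source: str) -> dict:
    """Split one placeholder into argument, literal, function, and options."""
    if not (source.startswith("{") and source.endswith("}")):
        raise ValueError("a placeholder must be enclosed in {}")
    body = source[1:-1]
    if not body:
        raise ValueError("a placeholder must not be empty")
    result = {"argument": None, "literal": None, "function": None, "options": {}}

    # Operand: $name or <literal>, immediately after the opening brace.
    m = re.match(rf"\$({IDENT})", body)
    if m:
        result["argument"] = m.group(1)
    else:
        m = re.match(LITERAL, body)
        if m:
            result["literal"] = m.group(1)
    has_operand = m is not None
    if has_operand:
        body = body[m.end():]

    # Function: ":name", separated from a preceding operand by white space.
    m = re.match(rf"{WS}:({IDENT})" if has_operand else rf":({IDENT})", body)
    if m:
        result["function"] = m.group(1)
        body = body[m.end():]
        # Options are only read after a function, so they are rejected
        # below when no function was given.
        while body:
            m = OPTION.match(body)
            if not m:
                break
            value = m.group(2)
            if value.startswith("<") and value.endswith(">"):
                value = value[1:-1]
            result["options"][m.group(1)] = value
            body = body[m.end():]

    if body:
        raise ValueError(f"unexpected text in placeholder: {body!r}")
    return result
```

With this sketch, `{$name style=percent}` is rejected because the option appears without a function, and `{ $name}` is rejected because of the white space after the opening brace.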
| 101 | + |
| 102 | +Value literals are important for developers to control the output. |
| 103 | +For example, certain strings may need to be inlined as literals so that |
| 104 | +they are not changed during translation. |
| 105 | +Numeric constants need to be formatted differently depending on the target language |
| 106 | +(e.g., which digits and separators, and the grouping style). |
| 107 | +Date constants need to be formatted according to the target language’s calendar system. |
| 108 | + |
| 109 | +If only a value literal is given, without specifying a function, |
| 110 | +then its string value is used verbatim and it is read-only for translators. |
| 111 | +- TODO: Value literals need to be delimited (they may contain spaces), |
| 112 | + and the starting delimiter needs to be distinct from the prefixes for |
| 113 | + argument names and functions. |
| 114 | + Reasonable choices include `<>`, `()`, `[]`, or `||`. |
| 115 | + Consider that the same delimiters should also be usable (not visually confusing) |
| 116 | + when used in a list of selection values (see below); that probably excludes `||` and `[]`. |
| 117 | +- TODO: Define escaping inside constant values. |
| 118 | + Probably the pattern escapes plus escapes for the constant delimiters. |
| 119 | + |
| 120 | +A placeholder must not be an empty pair of `{}` braces. |
| 121 | + |
| 122 | +Any character that does not fit defined syntax is an error. |
| 123 | +This leaves room for future extensions. |
| 124 | +For example, a placeholder must start with `{` immediately followed by |
| 125 | +the prefix character for an argument name, literal value, or function; |
| 126 | +and after the function name there must be only white-space-separated options which |
| 127 | +start with identifier-start characters. |
| 128 | + |
| 129 | +# Syntactic white space |
| 130 | + |
| 131 | +We use white space inside placeholders and in “code mode” (outside patterns) as token separators. |
| 132 | +White space is a sequence of one or more of the characters TAB, LF, CR, SP, and maybe some more. |
| 133 | +- TODO: For the definition of white space we should consult with the Unicode Source Code Working Group. |
| 134 | +- TODO: Decide whether to use Unicode Pattern_White_Space or otherwise allow RLM and LRM characters. |
| 135 | + |
| 136 | +White space can also be useful for line breaking long messages, indentation, and alignment. |
| 137 | +However, we should not allow white space everywhere possible, |
| 138 | +because that just leads to confusing variations in style, |
| 139 | +and the creation of formatting tools to enforce certain styles. |
| 140 | +For example, there is no reason to allow white space between a name or function prefix and its identifier, |
| 141 | +around the `=` of an option, after the `{` of a placeholder, or before the `}` of a placeholder. |
| 142 | + |
| 143 | +# Pattern selection |
| 144 | + |
| 145 | +Messages need the ability to choose among variants of a pattern based on certain argument values. |
| 146 | +Common examples include selecting the right plural form, and variants for different person genders. |
| 147 | + |
| 148 | +There should be a single level of selection (not nested like in ICU MessageFormat). |
| 149 | +It needs to support multiple selectors. |
| 150 | + |
| 151 | +In this syntax, a list of N selectors is followed by a list of pairs where |
| 152 | +the first element of each pair is a list of N value literals and |
| 153 | +the second element of each pair is a pattern. |
| 154 | +A `_` is a wildcard value that always matches. |
| 155 | +The last variant must have a list consisting only of wildcard values. |
| 156 | +``` |
| 157 | +[{$count :plural offset=1 grouping=always} {$gender}] |
| 158 | +[1 female] {{$name} added you to her circles.} |
| 159 | +[1 male] {{$name} added you to his circles.} |
| 160 | +[1 _] {{$name} added you to their circles.} |
| 161 | +[_ _] {{$name} added you and {#count} others to their circles.} |
| 162 | +``` |
| 163 | + |
| 164 | +Lists are enclosed in square brackets, reminiscent of Python lists. |
| 165 | +The opening `[` also distinguishes the selection syntax from a simple pattern. |
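
One way to read the variant list is as an ordered table where the first matching row wins. The sketch below assumes that first-match ordering (the text above does not state the matching order explicitly) and assumes that each selector has already been resolved to a string; the name `select_pattern` is hypothetical.

```python
WILDCARD = "_"


def select_pattern(selector_values, variants):
    """Return the pattern of the first variant whose keys all match.

    selector_values: one already-resolved string per selector.
    variants: ordered list of (keys, pattern) pairs.
    """
    if not variants or any(key != WILDCARD for key in variants[-1][0]):
        raise ValueError("the last variant must consist only of wildcards")
    for keys, pattern in variants:
        if len(keys) != len(selector_values):
            raise ValueError("each variant needs one key per selector")
        if all(k == WILDCARD or k == v for k, v in zip(keys, selector_values)):
            return pattern
    # Not reachable when the key counts match, because of the wildcard row.
    raise ValueError("no variant matched")
```

Because the final all-wildcard variant always matches, a well-formed message always selects some pattern.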
| 166 | + |
| 167 | +TODO: Decide whether to enclose each value literal in |
| 168 | +the same pair of delimiters as literals in placeholder (for consistency), |
| 169 | +or whether to make that optional. |
| 170 | +(The `[]` value list syntax already indicates that value literals are enclosed.) |
| 171 | +Some literals may require it if they contain spaces. |
| 172 | +The `_` should probably never be enclosed in literal delimiters. |
| 173 | + |
| 174 | +Selector syntax follows placeholder syntax, |
| 175 | +except that a function must be specified. |
| 176 | +For the purpose of selection, there are three types of functions: |
| 177 | +1. Select-and-format functions combine the two functionalities, |
| 178 | + and the selection is informed by the formatting. |
| 179 | + For example, selectors for plural variants |
| 180 | + (different selectors for cardinal-number vs. ordinal-number variants) |
| 181 | + have to take into account how the number is formatted. |
| 182 | +2. Format-only functions can be used as selectors via |
| 183 | + simple string matching of their output with the variant values. |
| 184 | +3. Select-only functions select among variant values, but they cannot be used in pattern placeholders. |
| 185 | + |
| 186 | +There is a simple format-only function that can be used for simple string matching. |
| 187 | +TODO: Decide on a name for this format-only function. Consider `:string`. |
| 188 | + |
| 189 | +Inside a selection-variant pattern, |
| 190 | +there is a special placeholder syntax for inserting the formatting result of a select-and-format function. |
| 191 | +This placeholder only specifies the selector’s argument name with a distinct prefix. |
| 192 | +It must not specify a function. |
| 193 | +In the example above, the `{#count}` value is the input `$count` minus the offset, |
| 194 | +like the `#` in an ICU PluralFormat, which is the input to the plural rules evaluation. |
| 195 | +This is not allowed for argument names used in select-only functions. |
| 196 | +- TODO: Bike-shedding on the prefix character, shown as `#` here. |
| 197 | + |
| 198 | +Inside selected patterns, |
| 199 | +the selector argument variables must not be used with the normal `$` placeholder syntax – |
| 200 | +for example, the patterns in the preceding example must not use `{$count}`. |
| 201 | +Allowing that would be doubly confusing: |
| 202 | +- It would not be clear which value is inserted. |
| 203 | + In the example, the plural offset is subtracted from the input value, |
| 204 | + and the formatted version of that is what is used for |
| 205 | + evaluating the plural rules and inserting into the pattern. |
| 206 | +- It would not be clear what formatting is applied. |
| 207 | + The formatting function and options specified in the selector must be used, |
| 208 | + but `{$count}` would look like the default formatter might be used. |
| 209 | + Allowing a function-and-options specification here would be even worse. |
| 210 | +- (If a developer does need a pattern with both the selector-modified value and the original value, |
| 211 | + then they can pass the value twice into the message formatting function, |
| 212 | + under different argument names.) |
| 213 | + |
| 214 | +# Named expressions |
| 215 | + |
| 216 | +When a message contains many variants, it is tedious, verbose, and error-prone to |
| 217 | +repeat complicated placeholders in many of those variants. |
| 218 | +We allow the definition of named expressions before the selection. |
| 219 | +The patterns could then use those names. |
| 220 | +``` |
| 221 | +$relDate={$date :relativeDateTime fields=Mdjm} |
| 222 | +[{$count :plural offset=1} {$gender}] |
| 223 | +[1 female] {{$name} added you to her circles {$relDate}.} |
| 224 | +[1 male] {{$name} added you to his circles {$relDate}.} |
| 225 | +[1 _] {{$name} added you to their circles {$relDate}.} |
| 226 | +[_ _] {{$name} added you and {#count} others to their circles {$relDate}.} |
| 227 | +``` |
| 228 | + |
| 229 | +When a named expression is used in a pattern placeholder, a function must not be specified. |
| 230 | +The formatting is determined by the given expression. |
| 231 | +- TODO: Decide whether to use a different prefix for |
| 232 | + a pattern placeholder that refers to a named expression. |
| 233 | + Using `$` looks familiar, but |
| 234 | + a distinct prefix would signal that this is not a normal placeholder, |
| 235 | + and it would allow for a syntax definition (in the BNF) limited to |
| 236 | + only the named-expression insertion. |
| 237 | + |
| 238 | +An expression name must not be the same as the name of any placeholder argument. |