Commit 0475f8f

MF2.0 compromise syntax
1 parent fe595d5 commit 0475f8f

1 file changed

+238
-0
lines changed

spec/compromise-syntax.md

@@ -0,0 +1,238 @@
# MF2.0 compromise syntax

# Intro

This syntax builds on the one from https://github.com/unicode-org/message-format-wg/pull/230
but is modified to address
[@markusicu’s comments there](https://github.com/unicode-org/message-format-wg/pull/230#issuecomment-1116903103).

# Basic syntax

Messages need to distinguish between literal text, placeholders, and other “code”.
We should start in “code mode” and always enclose “patterns” (text+placeholders) in curly braces.
```
{Hello world!}
{Hello {$name}!}
```

This is unusual for formatting syntaxes, but useful.
We need to support selecting among multiple patterns anyway,
and delimiting the patterns makes it unambiguous
which white space is part of the pattern and which merely separates “code” tokens.
For consistency, we should always enclose a pattern,
even if the message consists only of that pattern.
That also helps with embedding messages in various resource file formats,
because they can freely trim surrounding white space without
requiring escapes when a message pattern wants to start or end with spaces.

By contrast, consider the experience with the existing ICU MessageFormat syntax,
which does start in “text mode”.
ICU MessageFormat pioneered the selection among multiple patterns based on run-time arguments.
It represents selection using complex placeholders,
which has the side effect of allowing literal text and other placeholders
before and after the top-level selection placeholder.
However, for reliable translations,
there should be no translatable contents before or after the selection placeholder;
instead, each selectable pattern should form one complete “translation unit”.
Because the existing ICU MessageFormat starts in “text mode”,
even though it looks like there is no extraneous text,
spurious white space creeps in from developers’ line breaking of long message strings.
The remedy is to always use syntax to indicate the start of translatable contents.

We use curly braces to delimit patterns because
`{}` are the paired ASCII punctuation characters least commonly used in normal text.
For the same reason, we also use them for embedding placeholders in patterns.

Literal text can use any characters except for curly braces,
and except for the backslash, which we use as usual for escaping.
That is, the only special characters inside a pattern are `{}\`.
The only allowed escape sequences are `\{`, `\}`, and `\\`.
It is an error if `\` is followed by any other character.

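The escaping rules above can be sketched in a few lines of Python. This is only an illustration, not part of the proposal: the function name `unescape_literal` is hypothetical, it handles literal text only (no placeholders, so any unescaped brace is rejected), and it assumes that a trailing lone `\` is also an error.

```python
def unescape_literal(text: str) -> str:
    """Resolve the escapes in literal pattern text (no placeholders).

    Assumptions for this sketch: an unescaped brace is rejected because
    placeholders are out of scope here, and a backslash at the very end
    of the text counts as an invalid escape.
    """
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # Only \{, \} and \\ are allowed escape sequences.
            if i + 1 >= len(text) or text[i + 1] not in "{}\\":
                raise ValueError(f"invalid escape at offset {i}")
            out.append(text[i + 1])
            i += 2
        elif ch in "{}":
            raise ValueError(f"unescaped {ch!r} at offset {i}")
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For example, `unescape_literal(r"a \{b\} c")` yields `a {b} c`, while `unescape_literal(r"\n")` raises an error because `\n` is not one of the three allowed sequences.
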
The message syntax does not use `'` or `"`,
so that it is easy to hard-code message strings in programming language source code.

# Placeholders

Formatting a message replaces placeholders with values based on run-time arguments or special functions.
We also allow for value literals specified inside the placeholder,
instead of using an argument name;
and we also allow for invoking functions without using argument names or value literals.
```
{$name}
{$count :number}
{$fraction :number style=percent minFractions=2}
{<25> :number}
{:specialFunction optionKey=optionValue key2=<value with spaces>}
```

An argument name is a `$` immediately followed by an identifier.
A message formatting function will typically accept a Map of argument keys to values
where the keys match argument name identifiers in the patterns of the message.

TODO: For the definition of identifiers we should consult with the Unicode Source Code Working Group.

If the placeholder specifies only an argument name,
then the formatting function is inferred from the run-time type of the argument value.
For example, a string value would simply be inserted,
and a numeric type could be formatted using some kind of default number formatter.
- TODO: In the registry, specify the default formatters for a small set of value types.

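As a rough illustration of inferring the formatter from the run-time type, the following Python sketch looks up a default formatter by the exact type of the argument value. Everything here is an assumption made for the example: the table `DEFAULT_FORMATTERS`, the function `format_bare_argument`, and the non-locale-aware number formatting stand in for whatever the registry would eventually define.

```python
from typing import Any, Callable, Dict, Mapping

# Hypothetical defaults; the draft leaves the real set to the registry.
# The number formatting is a stand-in and is not locale-aware.
DEFAULT_FORMATTERS: Dict[type, Callable[[Any], str]] = {
    str: lambda value: value,               # strings are inserted as-is
    int: lambda value: f"{value:,}",        # default integer formatting
    float: lambda value: f"{value:,.2f}",   # default decimal formatting
}


def format_bare_argument(name: str, args: Mapping[str, Any]) -> str:
    """Format a placeholder such as {$name} that specifies no function."""
    value = args[name]
    # Exact type lookup, so that e.g. bool is not silently treated as int.
    formatter = DEFAULT_FORMATTERS.get(type(value))
    if formatter is None:
        raise TypeError(f"no default formatter for {type(value).__name__}")
    return formatter(value)
```

With these assumed defaults, `format_bare_argument("name", {"name": "Maria"})` returns the string unchanged, and an integer argument is passed through the default integer formatter.
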
The function is specified via a `:` immediately followed by an identifier.
If an argument name or a value literal is given,
then the function is usually a formatter for its expected input types.
- TODO: There still seems to be discussion about the function prefix character.
  It could be some other ASCII punctuation, for example `@`.
- TODO: Functions must be listed in a registry.
- TODO: Functions that accept value literals must specify their syntax.
- TODO: Reserve a naming convention for private use functions (not in the standard registry). Examples:
  - Starts with `_`
  - Starts with `x`
  - Contains interior dots – e.g., com.google.fancyNumber

When a function is specified, it can be optionally followed by options which are key-value pairs,
with `=` (and no white space) between the key identifier and the value.
The option value can contain any character other than curly braces and white space,
unless delimited like literal values.
- TODO: Each registered function must define the available options and their value syntax.
- TODO: If we allow white space in option values, then we need optional delimiters for such values. Probably the same delimiters as for literal values.

Options are not allowed when no function is specified.

Value literals are important for developers to control the output.
For example, certain strings may need to be inlined as literals so that
they are not changed during translation.
Numeric constants need to be formatted differently depending on the target language
(e.g., which digits and separators, and the grouping style).
Date constants need to be formatted according to the target language’s calendar system.

If only a value literal is given, without specifying a function,
then its string value is used verbatim and it is read-only for translators.
- TODO: Value literals need to be delimited (they may contain spaces),
  and the starting delimiter needs to be distinct from the prefixes for
  argument names and functions.
  Reasonable choices include `<>`, `()`, `[]`, or `||`.
  Consider that the same delimiters should also be usable (not visually confusing)
  when used in a list of selection values (see below); that probably excludes `||` and `[]`.
- TODO: Define escaping inside constant values.
  Probably the pattern escapes plus escapes for the constant delimiters.

A placeholder must not be an empty pair of `{}` braces.

Any character that does not fit defined syntax is an error.
This leaves room for future extensions.
For example, a placeholder must start with `{` immediately followed by
the prefix character for an argument name, literal value, or function;
and after the function name there must be only white-space-separated options which
start with identifier-start characters.

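To make the placeholder rules above concrete, here is a small Python sketch of a strict placeholder parser. It is not a reference implementation. It assumes the open choices that the examples in this section happen to use: `$` for argument names, `:` for functions, and `<>` for literals. The identifier pattern is an ASCII-only guess, escapes inside literals are not handled, and the names `Placeholder` and `parse_placeholder` are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional

# Assumption: ASCII identifiers with interior dots; the draft leaves this as a TODO.
IDENT = r"[A-Za-z_][A-Za-z0-9_.]*"
# Assumption: literals are delimited by <>; escapes inside them are not handled.
LITERAL = r"<[^<>{}]*>"
# A token is a run of non-space characters; a <...> literal inside it may hold spaces.
TOKEN = rf"(?:{LITERAL}|[^\s<>{{}}])+"
BODY = re.compile(rf"{TOKEN}(?:\s+{TOKEN})*")
OPTION = re.compile(rf"({IDENT})=(.+)", re.DOTALL)


@dataclass
class Placeholder:
    argument: Optional[str] = None   # name without the `$` prefix
    literal: Optional[str] = None    # text between `<` and `>`
    function: Optional[str] = None   # name without the `:` prefix
    options: Dict[str, str] = field(default_factory=dict)


def parse_placeholder(source: str) -> Placeholder:
    if len(source) < 2 or source[0] != "{" or source[-1] != "}":
        raise ValueError("a placeholder must be enclosed in curly braces")
    body = source[1:-1]
    if not body:
        raise ValueError("a placeholder must not be an empty pair of braces")
    # Rejects stray braces or delimiters, and white space next to `{` or `}`.
    if not BODY.fullmatch(body):
        raise ValueError("unexpected character or white space in placeholder")
    tokens = re.findall(TOKEN, body)
    result = Placeholder()
    if tokens[0].startswith("$"):
        name = tokens.pop(0)[1:]
        if not re.fullmatch(IDENT, name):
            raise ValueError(f"invalid argument name: {name!r}")
        result.argument = name
    elif tokens[0].startswith("<"):
        literal = tokens.pop(0)
        if not re.fullmatch(LITERAL, literal):
            raise ValueError(f"invalid literal: {literal!r}")
        result.literal = literal[1:-1]
    if tokens and tokens[0].startswith(":"):
        name = tokens.pop(0)[1:]
        if not re.fullmatch(IDENT, name):
            raise ValueError(f"invalid function name: {name!r}")
        result.function = name
    elif result.argument is None and result.literal is None:
        raise ValueError("a placeholder must start with `$`, `<`, or `:`")
    if tokens and result.function is None:
        raise ValueError("options are not allowed when no function is specified")
    for token in tokens:
        match = OPTION.fullmatch(token)
        if not match:
            raise ValueError(f"invalid option: {token!r}")
        key, value = match.groups()
        if re.fullmatch(LITERAL, value):
            value = value[1:-1]      # strip the literal delimiters
        result.options[key] = value
    return result
```

Parsing `{$fraction :number style=percent minFractions=2}` with this sketch yields the argument `fraction`, the function `number`, and two options, while `{}` and `{$name style=x}` are rejected. Treating everything that does not match as an error follows the idea that unused syntax stays available for future extensions.
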
# Syntactic white space

We use white space inside placeholders and in “code mode” (outside patterns) as token separators.
White space is a sequence of one or more of the characters TAB, LF, CR, SP, and maybe some more.
- TODO: For the definition of white space we should consult with the Unicode Source Code Working Group.
- TODO: Decide whether to use Unicode Pattern_White_Space or otherwise allow RLM and LRM characters.

White space can also be useful for line breaking long messages, indentation, and alignment.
However, we should not allow white space everywhere possible,
because that just leads to confusing variations in style,
and the creation of formatting tools to enforce certain styles.
For example, there is no reason to allow white space between a name or function prefix and its identifier,
around the `=` of an option, after the `{` of a placeholder, or before the `}` of a placeholder.

# Pattern selection

Messages need the ability to choose among variants of a pattern based on certain argument values.
Common examples include selecting the right plural form, and variants for different person genders.

There should be a single level of selection (not nested like in ICU MessageFormat).
It needs to support multiple selectors.

In this syntax, a list of N selectors is followed by a list of pairs where
the first element of each pair is a list of N value literals and
the second element of each pair is a pattern.
A `_` is a wildcard value that always matches.
The last variant must have a list of all wildcard values.
```
[{$count :plural offset=1 grouping=always} {$gender}]
[1 female] {{$name} added you to her circles.}
[1 male] {{$name} added you to his circles.}
[1 _] {{$name} added you to their circles.}
[_ _] {{$name} added you and {#count} others to their circles.}
```

Lists are enclosed in square brackets, reminiscent of Python lists.
The opening `[` also distinguishes the selection syntax from a simple pattern.

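The matching of selector values against variant keys can be sketched as follows. This Python example assumes that the first matching variant in source order wins, which the text above suggests (the all-wildcard variant comes last) but does not state outright. It also treats the selector results as plain strings that have already been computed; how a selector function produces them is out of scope here. The names `select_pattern` and `VARIANTS` are hypothetical.

```python
from typing import List, Sequence, Tuple

WILDCARD = "_"

# One variant: (list of N keys, pattern source).
Variant = Tuple[Sequence[str], str]


def select_pattern(selector_values: Sequence[str], variants: List[Variant]) -> str:
    """Return the pattern of the first variant whose keys all match.

    Assumption: variants are tried in source order and the first match wins.
    """
    if not variants:
        raise ValueError("at least one variant is required")
    for keys, _pattern in variants:
        if len(keys) != len(selector_values):
            raise ValueError("each variant needs exactly one key per selector")
    if any(key != WILDCARD for key in variants[-1][0]):
        raise ValueError("the last variant must consist of wildcards only")
    for keys, pattern in variants:
        if all(key in (WILDCARD, value) for key, value in zip(keys, selector_values)):
            return pattern
    raise AssertionError("unreachable: the all-wildcard variant always matches")


# The variants from the example above, with patterns kept as unparsed source text.
VARIANTS: List[Variant] = [
    (["1", "female"], "{{$name} added you to her circles.}"),
    (["1", "male"], "{{$name} added you to his circles.}"),
    (["1", "_"], "{{$name} added you to their circles.}"),
    (["_", "_"], "{{$name} added you and {#count} others to their circles.}"),
]
```

With these variants, the selector values `["1", "female"]` pick the first pattern, and any combination that matches no earlier row falls through to the final all-wildcard pattern.
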
TODO: Decide whether to enclose each value literal in
the same pair of delimiters as literals in placeholders (for consistency),
or whether to make that optional.
(The `[]` value list syntax already indicates that value literals are enclosed.)
Some literals may require it if they contain spaces.
The `_` should probably never be enclosed in literal delimiters.

Selector syntax follows placeholder syntax,
except that a function must be specified.
For the purpose of selection, there are three types of functions:
1. Select-and-format functions combine the two functionalities,
   and the selection is informed by the formatting.
   For example, selectors for plural variants
   (different selectors for cardinal-number vs. ordinal-number variants)
   have to take into account how the number is formatted.
2. Format-only functions can be used as selectors via
   simple string matching of their output with the variant values.
3. Select-only functions select among variant values, but they cannot be used in pattern placeholders.

There is a simple format-only function that can be used for simple string matching.
TODO: Decide on a name for this format-only function. Consider `:string`.

Inside a selection-variant pattern,
there is a special placeholder syntax for inserting the formatting result of a select-and-format function.
This placeholder only specifies the selector’s argument name with a distinct prefix.
It must not specify a function.
In the example above, the `{#count}` value is the input `$count` minus the offset,
like the `#` in an ICU PluralFormat, which is the input to the plural rules evaluation.
This is not allowed for argument names used in select-only functions.
- TODO: Bike-shedding on the prefix character, shown as `#` here.

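The relationship between the input value, the offset, and the `{#count}` insertion can be illustrated with a short Python sketch. It is deliberately simplified: the plural rule is a stand-in for English cardinal integers only, the number formatting is not locale-aware, and the names `english_cardinal_category` and `plural_select_and_format` are hypothetical.

```python
from typing import Tuple


def english_cardinal_category(n: int) -> str:
    # Simplified stand-in for English cardinal plural rules on integers.
    return "one" if n == 1 else "other"


def plural_select_and_format(count: int, offset: int = 0) -> Tuple[str, str]:
    """Sketch of a select-and-format function with an offset.

    Returns (plural category, formatted value). Both are derived from
    count - offset, so the text inserted via {#count} agrees with the
    value that the plural rules were evaluated on.
    """
    adjusted = count - offset
    return english_cardinal_category(adjusted), f"{adjusted:,}"
```

For `count=3` and `offset=1`, the adjusted value is 2, so a pattern such as `{{$name} added you and {#count} others to their circles.}` would insert `2` rather than the original `3`.
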
Inside selected patterns,
the selector argument variables must not be used with the normal `$` placeholder syntax –
for example, the patterns in the preceding example must not use `{$count}`.
Allowing that would be doubly confusing:
- It would not be clear which value is inserted.
  In the example, the plural offset is subtracted from the input value,
  and the formatted version of that is what is used for
  evaluating the plural rules and inserting into the pattern.
- It would not be clear what formatting is applied.
  The formatting function and options specified in the selector must be used,
  but `{$count}` would look like the default formatter might be used.
  Allowing a function-and-options specification here would be even worse.
- (If a developer does need a pattern with both the selector-modified and also the original value,
  then they can pass the value twice into the message formatting function,
  under different argument names.)

# Named expressions
215+
216+
When a message contains many variants, it is tedious, verbose, and error-prone to
217+
repeat complicated placeholders in many of those variants.
218+
We allow the definition of named expressions before the selection.
219+
The patterns could then use those names.
220+
```
221+
$relDate={$date :relativeDateTime fields=Mdjm}
222+
[{$count :plural offset=1} {$gender}]
223+
[1 female] {{$name} added you to her circles {$relDate}.}
224+
[1 male] {{$name} added you to his circles {$relDate}.}
225+
[1 _] {{$name} added you to their circles {$relDate}.}
226+
[_ _] {{$name} added you and {#count} others to their circles {$relDate}.}
227+
```
228+
229+
When a named expression is used in a pattern placeholder, then no function must be specified.
230+
The formatting is determined by the given expression.
231+
- TODO: Decide whether to use a different prefix for
232+
a pattern placeholder that refers to a named expression.
233+
Using `$` looks familiar, but
234+
a distinct prefix would signal that this is not a normal placeholder,
235+
and it would allow for a syntax definition (in the BNF) limited to
236+
only the named-expression insertion.
237+
238+
The expression name must not be the same as that for any placeholder argument.
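
The name-clash rule and the lookup of a `{$name}` reference can be sketched in Python as follows. This is an illustration only: it assumes named expressions share the `$` prefix with arguments, as in the example above (the draft leaves the prefix open), and the helpers `check_named_expressions` and `resolve_reference` are hypothetical.

```python
from typing import Any, Iterable, Mapping, Tuple


def check_named_expressions(named: Mapping[str, str],
                            argument_names: Iterable[str]) -> None:
    """Reject expression names that collide with placeholder argument names."""
    clashes = sorted(set(named) & set(argument_names))
    if clashes:
        raise ValueError(f"names used for both an expression and an argument: {clashes}")


def resolve_reference(name: str,
                      named: Mapping[str, str],
                      args: Mapping[str, Any]) -> Tuple[str, Any]:
    """Resolve {$name} to a named expression's source or to an argument value."""
    if name in named:
        return ("expression", named[name])
    if name in args:
        return ("argument", args[name])
    raise KeyError(name)
```

Because the names may not overlap, the order of the two lookups in `resolve_reference` does not change the result for any message that passes `check_named_expressions`.
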
