22 changes: 18 additions & 4 deletions spec/appendices.md
@@ -14,17 +14,31 @@ host environments, their serializations and resource formats,
that might be sufficient to prevent most problems.
However, MessageFormat itself does not supply such a restriction.

MessageFormat _messages_ permit nearly all Unicode code points,
with the exception of surrogates,
MessageFormat _messages_ permit nearly all Unicode code points
to appear in _literals_, including the text portions of a _pattern_.
This means that it can be possible for a _message_ to contain invisible characters
(such as bidirectional controls,
ASCII control characters in the range U+0000 to U+001F,
(such as bidirectional controls, ASCII control characters in the range U+0000 to U+001F,
or characters that might be interpreted as escapes or syntax in the host format)
that abnormally affect the display of the _message_
when viewed as source code, or in resource formats or translation tools,
but do not generate errors from MessageFormat parsers or processing APIs.

> [!IMPORTANT]
> _Text_ and _quoted literals_ allow unpaired surrogate code points
> (`U+D800` to `U+DFFF`).
Collaborator

How/why is this important to note as a security consideration?

Member

I previously suggested the same thing:

> I... don't think this is a security consideration? Maybe move this note to the section on message, just after the note that begins:
>
> > This syntax is designed to be embeddable into many different programming languages...
>
> I think that location is a good one because we put a whole wodge of notes there that are "read once and forget" about the content of messages.

> This is for compatibility with formats or data structures
> that use the UTF-16 encoding
> and do not check for unpaired surrogates.
> (Strings in Java or JavaScript are examples of this.)
> These code points SHOULD NOT be used in a _message_.
> Unpaired surrogate code points are likely an indication of mistakes
> or errors in the creation, serialization, or processing of the _message_.
> Many processes will convert them to
> � U+FFFD REPLACEMENT CHARACTER
> during processing or display.
> Implementations not based on UTF-16 might not be able to represent
> a _message_ containing such code points.

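As an illustration of the note above, here is a minimal sketch of how a UTF-16 based host could detect unpaired surrogates in a _message_ string and substitute U+FFFD for them. The function names `findUnpairedSurrogates` and `replaceUnpairedSurrogates` are hypothetical and are not part of any MessageFormat API. The sketch assumes the message is held in a JavaScript/TypeScript string, which is a sequence of UTF-16 code units.

```typescript
// Hypothetical helper (an assumption for illustration, not a MessageFormat API):
// returns the code unit indexes of every unpaired surrogate in `s`.
function findUnpairedSurrogates(s: string): number[] {
  const found: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: well-formed only when a low surrogate follows.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low half of a well-formed pair
      } else {
        found.push(i);
      }
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate that was not consumed as part of a pair.
      found.push(i);
    }
  }
  return found;
}

// Hypothetical helper: mimics processes that convert unpaired surrogates
// to U+FFFD REPLACEMENT CHARACTER.
function replaceUnpairedSurrogates(s: string): string {
  const bad = findUnpairedSurrogates(s);
  let out = "";
  for (let i = 0; i < s.length; i++) {
    out += bad.indexOf(i) !== -1 ? "\ufffd" : s.charAt(i);
  }
  return out;
}
```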
Bidirectional text containing right-to-left characters (such as used for Arabic or Hebrew)
also poses a potential source of confusion for users.
Since MessageFormat 2.0's syntax makes use of
3 changes: 1 addition & 2 deletions spec/message.abnf
@@ -76,8 +76,7 @@ content-char = %x01-08 ; omit NULL (%x00), HTAB (%x09) and LF (%x0A)
/ %x41-5B ; omit \ (%x5C)
/ %x5D-7A ; omit { | } (%x7B-7D)
/ %x7E-2FFF ; omit IDEOGRAPHIC SPACE (%x3000)
/ %x3001-D7FF ; omit surrogates
/ %xE000-10FFFF
/ %x3001-10FFFF ; allowing surrogates is intentional

; Character escapes
escaped-char = backslash ( backslash / "{" / "|" / "}" )
21 changes: 14 additions & 7 deletions spec/syntax.md
@@ -60,7 +60,8 @@ The syntax specification takes into account the following design restrictions:
control characters such as U+0000 NULL and U+0009 TAB, permanently reserved noncharacters
(U+FDD0 through U+FDEF and U+<i>n</i>FFFE and U+<i>n</i>FFFF where <i>n</i> is 0x0 through 0x10),
private-use code points (U+E000 through U+F8FF, U+F0000 through U+FFFFD, and
U+100000 through U+10FFFD), unassigned code points, and other potentially confusing content.
U+100000 through U+10FFFD), unassigned code points, unpaired surrogates in messages and
quoted literals only (U+D800 through U+DFFF), and other potentially confusing content.
Collaborator

The qualifier about where surrogates are valid is unnecessary, especially in this context, where the other "potentially confusing content" is for the most part allowed only in the same places.

Suggested change
U+100000 through U+10FFFD), unassigned code points, unpaired surrogates in messages and
quoted literals only (U+D800 through U+DFFF), and other potentially confusing content.
U+100000 through U+10FFFD), unassigned code points,
unpaired surrogates (U+D800 through U+DFFF), and other potentially confusing content.


## Messages and their Syntax

@@ -274,8 +275,11 @@ A _quoted pattern_ MAY be empty.
### Text

**_<dfn>text</dfn>_** is the translateable content of a _pattern_.
Any Unicode code point is allowed, except for U+0000 NULL
and the surrogate code points U+D800 through U+DFFF inclusive.
Any Unicode code point is allowed, except for U+0000 NULL.
> [!NOTE]
> Unpaired surrogate code points (`U+D800` through `U+DFFF` inclusive)
> are allowed for compatibility with UTF-16 based implementations
> that do not check for this encoding error.
Collaborator

This note would be better placed a bit further down in this section, and it needs to be separated from the surrounding content by an empty line. This would also be a more appropriate place for the concerns that are currently hidden away in the Security Considerations appendix.

The characters U+005C REVERSE SOLIDUS `\`,
U+007B LEFT CURLY BRACKET `{`, and U+007D RIGHT CURLY BRACKET `}`
MUST be escaped as `\\`, `\{`, and `\}` respectively.
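
The escaping requirement above can be sketched as a small helper. This is only an illustration: `escapePatternText` is a hypothetical name, and it covers just the three escapes listed in this sentence, not any other rule of the _text_ production (such as the exclusion of U+0000 NULL).

```typescript
// Hypothetical helper (an assumption for illustration): escapes the three
// characters that MUST be escaped in text: `\`, `{`, and `}`.
function escapePatternText(text: string): string {
  // Prefix each backslash or curly bracket with a backslash.
  return text.replace(/[\\{}]/g, (ch) => "\\" + ch);
}

// Example: the input  a{b}\c  becomes  a\{b\}\\c
const escapedText = escapePatternText("a{b}\\c");
```
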
@@ -301,8 +305,7 @@ content-char = %x01-08 ; omit NULL (%x00), HTAB (%x09) and LF (%x0A)
/ %x41-5B ; omit \ (%x5C)
/ %x5D-7A ; omit { | } (%x7B-7D)
/ %x7E-2FFF ; omit IDEOGRAPHIC SPACE (%x3000)
/ %x3001-D7FF ; omit surrogates
/ %xE000-10FFFF
/ %x3001-10FFFF ; allowing surrogates is intentional
```

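To make the effect of this grammar change concrete, the following sketch models only the final alternatives of `content-char` that this hunk touches: `%x3001-D7FF / %xE000-10FFFF` before the change, and `%x3001-10FFFF` after it. The function name `matchesTailRange` and its `allowSurrogates` flag are hypothetical; the earlier alternatives of the rule are deliberately not modeled.

```typescript
// Hypothetical sketch (not a full content-char matcher): checks a code point
// against the last alternatives of the rule, before or after this change.
function matchesTailRange(cp: number, allowSurrogates: boolean): boolean {
  // Both versions start at U+3001 and end at U+10FFFF.
  if (cp < 0x3001 || cp > 0x10ffff) return false;
  const isSurrogate = cp >= 0xd800 && cp <= 0xdfff;
  // Old grammar: surrogates are omitted. New grammar: they are allowed.
  return allowSurrogates || !isSurrogate;
}

// U+D800 is rejected by the old ranges and accepted by the new one.
const oldResult = matchesTailRange(0xd800, false); // false
const newResult = matchesTailRange(0xd800, true); // true
```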
When a _pattern_ is quoted by embedding the _pattern_ in curly brackets, the
@@ -691,8 +694,7 @@ A _literal_ can appear
as a _key_ value,
as the _operand_ of a _literal-expression_,
or in the value of an _option_.
A _literal_ MAY include any Unicode code point
except for U+0000 NULL or the surrogate code points U+D800 through U+DFFF.
A _literal_ MAY include any Unicode code point except for U+0000 NULL.

All code points are preserved.

@@ -714,6 +716,11 @@ A **_<dfn>quoted literal</dfn>_** begins and ends with U+005E VERTICAL BAR `|`.
The characters `\` and `|` within a _quoted literal_ MUST be
escaped as `\\` and `\|`.

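As a sketch of the escaping rule above, a serializer could produce a _quoted literal_ as follows. `quoteLiteral` is a hypothetical name, and the sketch assumes the value has already been checked against the other constraints on literals (for example, that it contains no U+0000 NULL).

```typescript
// Hypothetical helper (an assumption for illustration): wraps a value in `|`
// quotes, escaping `\` and `|` as `\\` and `\|`.
function quoteLiteral(value: string): string {
  const escaped = value.replace(/[\\|]/g, (ch) => "\\" + ch);
  return "|" + escaped + "|";
}

// Example: the value  a|b\c  is serialized as  |a\|b\\c|
const quoted = quoteLiteral("a|b\\c");
```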
> [!NOTE]
> Unpaired surrogate code points (`U+D800` through `U+DFFF` inclusive)
> are allowed in quoted literals for compatibility with UTF-16 based
> implementations that do not check for this encoding error.

An **_<dfn>unquoted literal</dfn>_** is a _literal_ that does not require the `|`
quotes around it to be distinct from the rest of the _message_ syntax.
An _unquoted literal_ MAY be used when the content of the _literal_