Commit 5e5c3cd

RFC: Allow full unicode range
This spec text implements #687 (full context and details there) and also introduces a new escape sequence.

Three distinct changes:

1. Change SourceCharacter to allow code points above 0xFFFF, up to 0x10FFFF.
2. Allow surrogate pairs within StringValue. Illegal pairs are handled with a parse error.
3. Introduce new syntax for a full-range code point EscapedUnicode. This syntax (`\u{1F37A}`) has been adopted by many other languages and I propose GraphQL adopt it as well.

(As a bonus, this removes the last instance of a regex in the lexer grammar!)
1 parent 00b88f0 commit 5e5c3cd

2 files changed: +43 -7 lines changed

spec/Appendix B -- Grammar Summary.md

Lines changed: 11 additions & 2 deletions
@@ -7,7 +7,7 @@ SourceCharacter ::
 - "U+0009"
 - "U+000A"
 - "U+000D"
-- "U+0020–U+FFFF"
+- "U+0020–U+10FFFF"

 ## Ignored Tokens
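
The widened range can be checked with a one-line membership test. The sketch below is only an illustration of the grammar in this hunk; the function name `is_source_character` is hypothetical and does not come from the spec or any implementation.

```python
def is_source_character(code_point: int) -> bool:
    # Mirrors the SourceCharacter alternatives after this change:
    # U+0009, U+000A, U+000D, or anything in U+0020..U+10FFFF.
    return code_point in (0x0009, 0x000A, 0x000D) or 0x0020 <= code_point <= 0x10FFFF


# U+1F37A lies above U+FFFF, so the old upper bound excluded it.
print(is_source_character(0x1F37A))
print(is_source_character(0x0000))
```
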

@@ -113,7 +113,16 @@ StringCharacter ::
 - `\u` EscapedUnicode
 - `\` EscapedCharacter

-EscapedUnicode :: /[0-9A-Fa-f]{4}/
+EscapedUnicode ::
+
+- HexDigit HexDigit HexDigit HexDigit
+- `{` HexDigit+ `}` "but only if <= 0x10FFFF"
+
+HexDigit :: one of
+
+- `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
+- `A` `B` `C` `D` `E` `F`
+- `a` `b` `c` `d` `e` `f`

 EscapedCharacter :: one of `"` `\` `/` `b` `f` `n` `r` `t`
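
To make the two EscapedUnicode alternatives concrete, here is a minimal sketch of how a lexer might read them. It is not taken from any GraphQL implementation: `read_escaped_unicode`, its `(value, next_pos)` return shape, and the use of `ValueError` as a stand-in for a parse error are all assumptions made for illustration.

```python
from typing import Tuple

HEX_DIGITS = frozenset("0123456789ABCDEFabcdef")


def read_escaped_unicode(text: str, pos: int) -> Tuple[int, int]:
    # `pos` points just past the backslash-u prefix.
    # Returns (code point value, position after the escape).
    if text[pos:pos + 1] == "{":
        # Braced form: `{` HexDigit+ `}`
        end = text.find("}", pos)
        digits = text[pos + 1:end] if end != -1 else ""
        next_pos = end + 1
    else:
        # Fixed-width form: exactly four HexDigit
        digits = text[pos:pos + 4]
        next_pos = pos + 4
        if len(digits) != 4:
            raise ValueError("expected four hex digits")
    if not digits or any(ch not in HEX_DIGITS for ch in digits):
        raise ValueError("invalid or unterminated unicode escape")
    value = int(digits, 16)
    if value > 0x10FFFF:
        # The grammar's "but only if <= 0x10FFFF" side condition.
        raise ValueError("code point out of range")
    return value, next_pos


value, next_pos = read_escaped_unicode(r"\u{1F37A}", 2)
print(hex(value), next_pos)
```

Checking each character against an explicit `HEX_DIGITS` set, rather than relying on `int(..., 16)` alone, keeps the sketch aligned with the HexDigit production: Python's `int` would otherwise accept extras such as underscores or surrounding whitespace.
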

spec/Section 2 -- Language.md

Lines changed: 32 additions & 5 deletions
@@ -50,7 +50,7 @@ SourceCharacter ::
 - "U+0009"
 - "U+000A"
 - "U+000D"
-- "U+0020–U+FFFF"
+- "U+0020–U+10FFFF"

 GraphQL documents are expressed as a sequence of
 [Unicode](https://unicode.org/standard/standard.html) code points (informally
@@ -812,7 +812,16 @@ StringCharacter ::
 - `\u` EscapedUnicode
 - `\` EscapedCharacter

-EscapedUnicode :: /[0-9A-Fa-f]{4}/
+EscapedUnicode ::
+
+- HexDigit HexDigit HexDigit HexDigit
+- `{` HexDigit+ `}` "but only if <= 0x10FFFF"
+
+HexDigit :: one of
+
+- `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
+- `A` `B` `C` `D` `E` `F`
+- `a` `b` `c` `d` `e` `f`

 EscapedCharacter :: one of `"` `\` `/` `b` `f` `n` `r` `t`

@@ -897,16 +906,34 @@ StringValue :: `""`

 StringValue :: `"` StringCharacter+ `"`

-- Return the sequence of all {StringCharacter} code points.
+- Let {string} be the sequence of all {StringCharacter} code points.
+- For each {codePoint} at {index} in {string}:
+  - If {codePoint} is >= 0xD800 and <= 0xDBFF (a
+    [_High Surrogate_](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)):
+    - Let {lowPoint} be the code point at {index} + {1} in {string}.
+    - Assert {lowPoint} is >= 0xDC00 and <= 0xDFFF (a
+      [_Low Surrogate_](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)).
+    - Let {decodedPoint} = ({codePoint} - 0xD800) × 0x400 + ({lowPoint} -
+      0xDC00) + 0x10000.
+    - Within {string}, replace {codePoint} and {lowPoint} with {decodedPoint}.
+  - Otherwise, assert {codePoint} is not >= 0xDC00 and <= 0xDFFF (a
+    [_Low Surrogate_](https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs)).
+- Return {string}.
+
+Note: {StringValue} should avoid encoding code points as surrogate pairs. While
+services must interpret them accordingly, a braced escape (for example
+`"\u{1F4A9}"`) is a clearer way to encode code points outside of the
+[Basic Multilingual Plane](https://unicodebook.readthedocs.io/unicode.html#bmp).

 StringCharacter :: SourceCharacter but not `"` or `\` or LineTerminator

 - Return the code point {SourceCharacter}.

 StringCharacter :: `\u` EscapedUnicode

-- Let {value} be the 16-bit hexadecimal value represented by the sequence of
-  hexadecimal digits within {EscapedUnicode}.
+- Let {value} be the 21-bit hexadecimal value represented by the sequence of
+  {HexDigit} within {EscapedUnicode}.
+- Assert {value} <= 0x10FFFF.
 - Return the code point {value}.

 StringCharacter :: `\` EscapedCharacter
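
The StringValue steps added in this hunk can be sketched in a few lines. This is a rough illustration rather than the reference algorithm: `combine_surrogates` is a hypothetical name, the input is assumed to be a list of integer code points, the sketch builds a new list instead of replacing elements in place, and `ValueError` stands in for the parse error the asserts imply.

```python
from typing import List


def combine_surrogates(code_points: List[int]) -> List[int]:
    result: List[int] = []
    index = 0
    while index < len(code_points):
        code_point = code_points[index]
        if 0xD800 <= code_point <= 0xDBFF:
            # High surrogate: the next code point must be a low surrogate.
            if index + 1 >= len(code_points):
                raise ValueError("high surrogate at end of string")
            low_point = code_points[index + 1]
            if not 0xDC00 <= low_point <= 0xDFFF:
                raise ValueError("high surrogate not followed by low surrogate")
            decoded = (code_point - 0xD800) * 0x400 + (low_point - 0xDC00) + 0x10000
            result.append(decoded)
            index += 2
        elif 0xDC00 <= code_point <= 0xDFFF:
            # A low surrogate that was not consumed by a preceding high one.
            raise ValueError("unpaired low surrogate")
        else:
            result.append(code_point)
            index += 1
    return result


# 0xD83C 0xDF7A is the surrogate pair for U+1F37A:
# (0xD83C - 0xD800) * 0x400 + (0xDF7A - 0xDC00) + 0x10000 = 0x1F37A
print([hex(cp) for cp in combine_surrogates([0xD83C, 0xDF7A])])
```

Under these assumptions, the escaped pair `\uD83C\uDF7A` and the braced escape `\u{1F37A}` from the commit message yield the same single code point, which is the equivalence the Note in this hunk relies on.
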
