@@ -45,32 +45,22 @@ match, however some lookahead restrictions include additional constraints.
4545
4646## Source Text
4747
48- SourceCharacter ::
48+ SourceCharacter :: "Any Unicode scalar value"
4949
50- - "U+0009"
51- - "U+000A"
52- - "U+000D"
53- - "U+0020–U+10FFFF"
50+ GraphQL documents are interpreted from a source text, which is a sequence of
51+ {SourceCharacter}, each {SourceCharacter} being a _Unicode scalar value_ which
52+ may be any Unicode code point from U+0000 to U+D7FF or U+E000 to U+10FFFF
53+ (informally referred to as _"characters"_ through most of this specification).
5454
55- GraphQL documents are expressed as a sequence of
56- [ Unicode] ( https://unicode.org/standard/standard.html ) code points (informally
57- referred to as _ "characters"_ through most of this specification). However, with
58- few exceptions, most of GraphQL is expressed only in the original non-control
59- ASCII range so as to be as widely compatible with as many existing tools,
60- languages, and serialization formats as possible and avoid display issues in
61- text editors and source control.
55+ A GraphQL document may be expressed only in the ASCII range to be compatible
56+ with as many existing tools, languages, and serialization formats as possible
57+ and to avoid display issues in text editors and source control. Non-ASCII
58+ Unicode scalar values may appear within {StringValue} and {Comment}.
6259
63- Note: Non-ASCII Unicode characters may appear freely within {StringValue} and
64- {Comment} portions of GraphQL.
65-
66- ### Unicode
67-
68- UnicodeBOM :: "Byte Order Mark (U+FEFF)"
69-
70- The "Byte Order Mark" is a special Unicode character which may appear at the
71- beginning of a file containing Unicode which programs may use to determine the
72- fact that the text stream is Unicode, what endianness the text stream is in, and
73- which of several Unicode encodings to interpret.
60+ Note: An implementation which uses _UTF-16_ to represent GraphQL documents in
61+ memory (for example, JavaScript or Java) may encounter a _surrogate pair_. This
62+ encodes a _supplementary code point_ and is a single valid source character;
63+ however, an unpaired _surrogate code point_ is not a valid source character.
7464
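As an illustration of the range check and the UTF-16 note above, here is a minimal sketch in Python. The function names `is_unicode_scalar_value` and `first_unpaired_surrogate` are hypothetical and not part of the specification; the second function assumes the document is held as a list of UTF-16 code units.

```python
from typing import List, Optional


def is_unicode_scalar_value(code_point: int) -> bool:
    """True for U+0000..U+D7FF and U+E000..U+10FFFF (surrogates excluded)."""
    return 0x0000 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0x10FFFF


def first_unpaired_surrogate(utf16_units: List[int]) -> Optional[int]:
    """Return the index of the first unpaired surrogate code unit, or None.

    Assumption: the caller stores the document as UTF-16 code units, as a
    JavaScript or Java implementation might.
    """
    i = 0
    while i < len(utf16_units):
        unit = utf16_units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A leading surrogate is only valid when a trailing one follows.
            has_next = i + 1 < len(utf16_units)
            if has_next and 0xDC00 <= utf16_units[i + 1] <= 0xDFFF:
                i += 2  # the pair is a single valid source character
                continue
            return i
        if 0xDC00 <= unit <= 0xDFFF:
            # A trailing surrogate with no leading surrogate before it.
            return i
        i += 1
    return None
```

A surrogate pair such as `[0xD83D, 0xDCA9]` passes this check as one source character, while either half on its own is reported as invalid.
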
7565### White Space
7666
@@ -175,6 +165,17 @@ significant way, for example a {StringValue} may contain white space characters.
175165No {Ignored} may appear _ within_ a {Token}, for example no white space
176166characters are permitted between the characters defining a {FloatValue}.
177167
168+ **Byte Order Mark**
169+
170+ UnicodeBOM :: "Byte Order Mark (U+FEFF)"
171+
172+ The _Byte Order Mark_ is a special Unicode code point which may appear at the
173+ beginning of a file and which programs may use to determine that the text
174+ stream is Unicode and what specific encoding has been used.
175+
176+ As files are often concatenated, a _Byte Order Mark_ may appear anywhere within
177+ a GraphQL document and is {Ignored}.
178+
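One way a lexer might honor this rule is to skip any run of U+FEFF between tokens. The sketch below is illustrative only: `skip_byte_order_marks` is a hypothetical helper, and a real lexer would skip the other {Ignored} productions in the same loop and would not apply this inside a token such as a {StringValue}.

```python
BOM = "\uFEFF"


def skip_byte_order_marks(source: str, position: int) -> int:
    """Advance past any Byte Order Marks starting at `position`.

    Returns the index of the first character that is not U+FEFF.
    """
    while position < len(source) and source[position] == BOM:
        position += 1
    return position
```

Because the check runs at every token boundary rather than only at index 0, a mark left in the middle of a concatenated file is skipped the same way as one at the start.
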
178179### Punctuators
179180
180181Punctuator :: one of ! $ & ( ) ... : = @ [ ] { | }
@@ -814,8 +815,8 @@ StringCharacter ::
814815
815816EscapedUnicode ::
816817
818+ - ` { ` HexDigit+ ` } `
817819- HexDigit HexDigit HexDigit HexDigit
818- - ` { ` HexDigit+ ` } ` "but only if <= 0x10FFFF"
819820
820821HexDigit :: one of
821822
@@ -830,19 +831,58 @@ BlockStringCharacter ::
830831- SourceCharacter but not ` """ ` or ` \""" `
831832- ` \""" `
832833
833- Strings are sequences of characters wrapped in quotation marks (U+0022). (ex.
834- {` "Hello World" ` }). White space and other otherwise-ignored characters are
835- significant within a string value.
834+ {StringValue} is a sequence of characters wrapped in quotation marks (U+0022).
835+ (ex. {` "Hello World" ` }). White space and other characters ignored in other parts
836+ of a GraphQL document are significant within a string value.
837+
838+ A {StringValue} is evaluated to a Unicode text value, a sequence of Unicode
839+ scalar values, by interpreting all escape sequences using the static semantics
840+ defined below.
836841
837842The empty string {` "" ` } must not be followed by another {` " ` } otherwise it would
838843be interpreted as the beginning of a block string. As an example, the source
839844{` """""" ` } can only be interpreted as a single empty block string and not three
840845empty strings.
841846
842- Non-ASCII Unicode characters are allowed within single-quoted strings. Since
843- {SourceCharacter} must not contain some ASCII control characters, escape
844- sequences must be used to represent these characters. The {` \ ` }, {` " ` }
845- characters also must be escaped. All other escape sequences are optional.
847+ ** Escape Sequences**
848+
849+ In a single-quoted {StringValue}, any Unicode scalar value may be expressed
850+ using an escape sequence. GraphQL strings allow both C-style escape sequences
851+ (for example `\n`) and two forms of Unicode escape sequences: one with a
852+ fixed width of 4 hexadecimal digits (for example `\u000A`) and one with a
853+ variable width, most useful for representing a _supplementary character_ such as
854+ an emoji (for example `\u{1F4A9}`).
855+
856+ The hexadecimal number encoded by a Unicode escape sequence must describe a
857+ Unicode scalar value, otherwise parsing should stop with an early error. For
858+ example, neither `"\uDEAD"` nor `"\u{110000}"` should be considered a
859+ valid {StringValue}.
860+
861+ Escape sequences are only meaningful within a single-quoted string. Within a
862+ block string, they are simply that sequence of characters (for example
863+ `"""\n"""` represents the Unicode text [U+005C, U+006E]). Within a comment, an
864+ escape sequence is not a significant sequence of characters. Escape sequences
865+ may not appear elsewhere in a GraphQL document.
866+
867+ Since {StringCharacter} must not contain some characters, escape sequences must
868+ be used to represent these characters. All other escape sequences are optional
869+ and unescaped non-ASCII Unicode characters are allowed within strings. If using
870+ GraphQL within a system which only supports ASCII, then escape sequences may be
871+ used to represent all Unicode characters outside of the ASCII range.
872+
873+ For legacy reasons, a _supplementary character_ may be escaped by two
874+ fixed-width Unicode escape sequences forming a _surrogate pair_. For example, the
875+ input `"\uD83D\uDCA9"` is a valid {StringValue} which represents the same
876+ Unicode text as `"\u{1F4A9}"`. While this legacy form is allowed, it should be
877+ avoided, as a variable-width Unicode escape sequence is a clearer way to encode
878+ such code points.
879+
880+ When producing a {StringValue}, implementations should use escape sequences to
881+ represent non-printable control characters (U+0000 to U+001F and U+007F to
882+ U+009F). Other escape sequences are not necessary; however, an implementation may
883+ use escape sequences to represent any other range of code points. If an
884+ implementation chooses to escape a _supplementary character_, it should not use
885+ a fixed-width surrogate pair Unicode escape sequence.
846886
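The guidance on producing a {StringValue} can be sketched as follows. This is a minimal illustration, not a reference printer: `print_string_value` is a hypothetical name, the C-style table follows the {EscapedCharacter} forms defined elsewhere in the specification, and the sketch chooses to leave every character other than control characters, quotation marks, and backslashes unescaped.

```python
# C-style escapes for the characters that must or can be written this way.
C_STYLE_ESCAPES = {
    '"': '\\"',
    "\\": "\\\\",
    "\b": "\\b",
    "\f": "\\f",
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
}


def print_string_value(text: str) -> str:
    """Serialize Unicode text as a single-quoted GraphQL StringValue."""
    parts = ['"']
    for char in text:
        code_point = ord(char)
        if char in C_STYLE_ESCAPES:
            parts.append(C_STYLE_ESCAPES[char])
        elif code_point <= 0x1F or 0x7F <= code_point <= 0x9F:
            # Non-printable control characters get a fixed-width escape.
            parts.append("\\u%04X" % code_point)
        else:
            # Everything else, including supplementary characters, is
            # emitted as-is, so no surrogate pair escape is ever produced.
            parts.append(char)
    parts.append('"')
    return "".join(parts)
```

An implementation targeting an ASCII-only system could instead emit `\u{...}` for every code point above U+007E; the only form to avoid when producing output is the fixed-width surrogate pair.
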
847887** Block Strings**
848888
@@ -898,44 +938,57 @@ Note: If non-printable ASCII characters are needed in a string value, a standard
898938quoted string with appropriate escape sequences must be used instead of a block
899939string.
900940
901- ** Semantics**
941+ **Static Semantics**
942+
943+ A {StringValue} describes a Unicode text value, a sequence of *Unicode scalar
944+ value*s. These semantics describe how to apply the {StringValue} grammar to a
945+ source text to evaluate a Unicode text. Errors encountered during this
946+ evaluation are considered a failure to apply the {StringValue} grammar to a
947+ source and result in a parsing error.
902948
903949StringValue :: ` "" `
904950
905951- Return an empty sequence.
906952
907953StringValue :: ` " ` StringCharacter+ ` " `
908954
909- - Let {string} be the sequence of all {StringCharacter} code points.
910- - For each {codePoint} at {index} in {string}:
911- - If {codePoint} is >= 0xD800 and <= 0xDBFF (a
912- [ _ High Surrogate_ ] ( https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs ) ):
913- - Let {lowPoint} be the code point at {index} + {1} in {string}.
914- - Assert {lowPoint} is >= 0xDC00 and <= 0xDFFF (a
915- [ _ Low Surrogate_ ] ( https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs ) ).
916- - Let {decodedPoint} = ({codePoint} - 0xD800) × 0x400 + ({lowPoint} -
917- 0xDC00) + 0x10000.
918- - Within {string}, replace {codePoint} and {lowPoint} with {decodedPoint}.
919- - Otherwise, assert {codePoint} is not >= 0xDC00 and <= 0xDFFF (a
920- [ _ Low Surrogate_ ] ( https://unicodebook.readthedocs.io/unicode_encodings.html#utf-16-surrogate-pairs ) ).
921- - Return {string}.
922-
923- Note: {StringValue} should avoid encoding code points as surrogate pairs. While
924- services must interpret them accordingly, a braced escape (for example
925- ` "\u{1F4A9}" ` ) is a clearer way to encode code points outside of the
926- [ Basic Multilingual Plane] ( https://unicodebook.readthedocs.io/unicode.html#bmp ) .
955+ - Return the sequence of _Unicode scalar value_ formed by evaluating each
956+   {StringCharacter} in order and concatenating the results.
927957
928958StringCharacter :: SourceCharacter but not ` " ` or ` \ ` or LineTerminator
929959
930- - Return the code point {SourceCharacter}.
960+ - Return the _Unicode scalar value_ {SourceCharacter}.
931961
932962StringCharacter :: ` \u ` EscapedUnicode
933963
934- - Let {value} be the 21-bit hexadecimal value represented by the sequence of
935- {HexDigit} within {EscapedUnicode}.
936- - Assert {value} <= 0x10FFFF.
964+ - Let {value} be the hexadecimal value represented by the sequence of {HexDigit}
965+ within {EscapedUnicode}.
966+ - Assert {value} is within the _Unicode scalar value_ range (>= 0x0000 and <=
967+   0xD7FF or >= 0xE000 and <= 0x10FFFF).
937968- Return the code point {value}.
938969
970+ StringCharacter :: ` \u ` HexDigit HexDigit HexDigit HexDigit ` \u ` HexDigit
971+ HexDigit HexDigit HexDigit
972+
973+ - Let {leadingValue} be the hexadecimal value represented by the first sequence
974+ of {HexDigit}.
975+ - Let {trailingValue} be the hexadecimal value represented by the second
976+ sequence of {HexDigit}.
977+ - If {leadingValue} is >= 0xD800 and <= 0xDBFF (a _Leading Surrogate_):
978+   - Assert {trailingValue} is >= 0xDC00 and <= 0xDFFF (a _Trailing Surrogate_).
979+   - Return ({leadingValue} - 0xD800) × 0x400 + ({trailingValue} - 0xDC00) +
980+     0x10000.
981+ - Otherwise:
982+   - Assert {leadingValue} is within the _Unicode scalar value_ range.
983+   - Assert {trailingValue} is within the _Unicode scalar value_ range.
984+   - Return the sequence of the code point {leadingValue} followed by the code
985+     point {trailingValue}.
986+
987+ Note: If both escape sequences encode a _Unicode scalar value_, then this
988+ semantic is identical to applying the prior semantic on each fixed-width escape
989+ sequence. A variable-width escape sequence must only encode a _Unicode scalar
990+ value_.
991+
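The two Unicode escape semantics above can be sketched together in Python. This is an illustration under stated assumptions rather than the reference algorithm: `evaluate_unicode_escapes` is a hypothetical helper that receives the text between the quotation marks, handles only `\u` escapes (C-style escapes such as `\\` are out of scope), and passes malformed sequences through as literal characters instead of rejecting them as a full parser would.

```python
import re

# Matches the variable-width form \u{...} (group 1) or the fixed-width
# form \uXXXX (group 2).
_UNICODE_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]+)\}|\\u([0-9A-Fa-f]{4})")


def _is_scalar_value(value: int) -> bool:
    return 0x0000 <= value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF


def evaluate_unicode_escapes(body: str) -> str:
    """Evaluate \\uXXXX and \\u{X...} escapes, raising ValueError on errors."""
    out = []
    pos = 0
    while pos < len(body):
        match = _UNICODE_ESCAPE.match(body, pos)
        if match is None:
            out.append(body[pos])
            pos += 1
            continue
        pos = match.end()
        if match.group(1) is not None:
            value = int(match.group(1), 16)
        else:
            value = int(match.group(2), 16)
            if 0xD800 <= value <= 0xDBFF:
                # Leading surrogate: a fixed-width trailing surrogate
                # must follow immediately.
                trailing = _UNICODE_ESCAPE.match(body, pos)
                if trailing is None or trailing.group(2) is None:
                    raise ValueError("leading surrogate is not paired")
                low = int(trailing.group(2), 16)
                if not 0xDC00 <= low <= 0xDFFF:
                    raise ValueError("leading surrogate is not paired")
                value = (value - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000
                pos = trailing.end()
        if not _is_scalar_value(value):
            raise ValueError("escape does not encode a Unicode scalar value")
        out.append(chr(value))
    return "".join(out)
```

With this sketch, `\uD83D\uDCA9` and `\u{1F4A9}` evaluate to the same single code point, while `\uDEAD`, `\u{110000}`, and a braced surrogate such as `\u{D83D}` all raise an error.
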
939992StringCharacter :: ` \ ` EscapedCharacter
940993
941994- Return the code point represented by {EscapedCharacter} according to the table
@@ -954,13 +1007,13 @@ StringCharacter :: `\` EscapedCharacter
9541007
9551008StringValue :: ` """ ` BlockStringCharacter\* ` """ `
9561009
957- - Let {rawValue} be the Unicode character sequence of all {BlockStringCharacter}
958- Unicode character values (which may be an empty sequence).
1010+ - Let {rawValue} be the concatenated sequence of _Unicode scalar value_ by
1011+   evaluating all {BlockStringCharacter} (which may be an empty sequence).
9591012- Return the result of {BlockStringValue(rawValue)}.
9601013
9611014BlockStringCharacter :: SourceCharacter but not ` """ ` or ` \""" `
9621015
963- - Return the character value of {SourceCharacter}.
1016+ - Return the _Unicode scalar value_ {SourceCharacter}.
9641017
9651018BlockStringCharacter :: ` \""" `
9661019