@@ -45,32 +45,22 @@ match, however some lookahead restrictions include additional constraints.
4545
4646## Source Text
4747
48- SourceCharacter ::
48+ SourceCharacter :: "Any Unicode scalar value"
4949
50- - "U+0009"
51- - "U+000A"
52- - "U+000D"
53- - "U+0020–U+FFFF"
50+ GraphQL documents are interpreted from a source text, which is a sequence of
51+ {SourceCharacter}, each {SourceCharacter} being a _Unicode scalar value_ which
52+ may be any Unicode code point from U+0000 to U+D7FF or U+E000 to U+10FFFF
53+ (informally referred to as _"characters"_ through most of this specification).
5454
55- GraphQL documents are expressed as a sequence of
56- [ Unicode] ( https://unicode.org/standard/standard.html ) code points (informally
57- referred to as _ "characters"_ through most of this specification). However, with
58- few exceptions, most of GraphQL is expressed only in the original non-control
59- ASCII range so as to be as widely compatible with as many existing tools,
60- languages, and serialization formats as possible and avoid display issues in
61- text editors and source control.
55+ A GraphQL document may be expressed only in the ASCII range so as to be
56+ compatible with as many existing tools, languages, and serialization formats as
57+ possible and to avoid display issues in text editors and source control.
58+ Non-ASCII Unicode scalar values may appear within {StringValue} and {Comment}.
6259
63- Note: Non-ASCII Unicode characters may appear freely within {StringValue} and
64- {Comment} portions of GraphQL.
65-
66- ### Unicode
67-
68- UnicodeBOM :: "Byte Order Mark (U+FEFF)"
69-
70- The "Byte Order Mark" is a special Unicode character which may appear at the
71- beginning of a file containing Unicode which programs may use to determine the
72- fact that the text stream is Unicode, what endianness the text stream is in, and
73- which of several Unicode encodings to interpret.
60+ Note: An implementation which uses _UTF-16_ to represent GraphQL documents in
61+ memory (for example, JavaScript or Java) may encounter a _surrogate pair_. This
62+ encodes one _supplementary code point_ and is a single valid source character;
63+ however, an unpaired _surrogate code point_ is not a valid source character.
7464
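Review note on the hunk above: the new {SourceCharacter} definition and the UTF-16 note can be illustrated with a small sketch. This is not taken from any reference implementation; the function names `is_unicode_scalar_value` and `find_unpaired_surrogate` are hypothetical, and the second function assumes the document is held in memory as a sequence of UTF-16 code units.

```python
from typing import Sequence


def is_unicode_scalar_value(code_point: int) -> bool:
    """True for U+0000..U+D7FF and U+E000..U+10FFFF, the range named in the diff."""
    return 0x0000 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0x10FFFF


def find_unpaired_surrogate(code_units: Sequence[int]) -> int:
    """Return the index of the first unpaired surrogate code unit, or -1 if none.

    Assumption: code_units is a UTF-16 representation of the source text.
    """
    i = 0
    while i < len(code_units):
        unit = code_units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A leading surrogate is only valid when a trailing surrogate follows.
            if i + 1 < len(code_units) and 0xDC00 <= code_units[i + 1] <= 0xDFFF:
                i += 2  # the pair is one valid source character
                continue
            return i
        if 0xDC00 <= unit <= 0xDFFF:
            # A trailing surrogate with no leading surrogate before it.
            return i
        i += 1
    return -1
```

A lexer built along these lines would report a syntax error at the returned index rather than passing the unpaired surrogate through as a source character.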
7565### White Space
7666
@@ -115,10 +105,9 @@ CommentChar :: SourceCharacter but not LineTerminator
115105GraphQL source documents may contain single-line comments, starting with the
116106{` # ` } marker.
117107
118- A comment can contain any Unicode code point in {SourceCharacter} except
119- {LineTerminator} so a comment always consists of all code points starting with
120- the {` # ` } character up to but not including the {LineTerminator} (or end of the
121- source).
108+ A comment may contain any {SourceCharacter} except {LineTerminator} so a comment
109+ always consists of all {SourceCharacter} starting with the {`#`} character up to
110+ but not including the {LineTerminator} (or end of the source).
122111
123112Comments are {Ignored} like white space and may appear after any token, or
124113before a {LineTerminator}, and have no significance to the semantic meaning of a
@@ -175,6 +164,16 @@ significant way, for example a {StringValue} may contain white space characters.
175164No {Ignored} may appear _ within_ a {Token}, for example no white space
176165characters are permitted between the characters defining a {FloatValue}.
177166
167+ **Byte order mark**
168+
169+ UnicodeBOM :: "Byte Order Mark (U+FEFF)"
170+
171+ The _Byte Order Mark_ is a special Unicode code point which may appear at the
172+ beginning of a file and which programs may use to determine that the text
173+ stream is Unicode and which specific encoding has been used. As files are often
174+ concatenated, a _Byte Order Mark_ may appear before or after any lexical token
175+ and is {Ignored}.
176+
178177### Punctuators
179178
180179Punctuator :: one of ! $ & ( ) ... : = @ [ ] { | }
@@ -812,7 +811,16 @@ StringCharacter ::
812811- ` \u ` EscapedUnicode
813812- ` \ ` EscapedCharacter
814813
815- EscapedUnicode :: /[ 0-9A-Fa-f] {4}/
814+ EscapedUnicode ::
815+
816+ - `{` HexDigit+ `}`
817+ - HexDigit HexDigit HexDigit HexDigit
818+
819+ HexDigit :: one of
820+
821+ - `0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
822+ - `A` `B` `C` `D` `E` `F`
823+ - `a` `b` `c` `d` `e` `f`
816824
817825EscapedCharacter :: one of ` " ` ` \ ` ` / ` ` b ` ` f ` ` n ` ` r ` ` t `
818826
@@ -821,19 +829,57 @@ BlockStringCharacter ::
821829- SourceCharacter but not ` """ ` or ` \""" `
822830- ` \""" `
823831
824- Strings are sequences of characters wrapped in quotation marks (U+0022). (ex.
825- {` "Hello World" ` }). White space and other otherwise-ignored characters are
826- significant within a string value.
832+ A {StringValue} is evaluated to a _Unicode text_ value, a sequence of _Unicode
833+ scalar values_, by interpreting all escape sequences using the static semantics
834+ defined below. White space and other characters ignored between lexical tokens
835+ are significant within a string value.
827836
828837The empty string {` "" ` } must not be followed by another {` " ` } otherwise it would
829838be interpreted as the beginning of a block string. As an example, the source
830839{` """""" ` } can only be interpreted as a single empty block string and not three
831840empty strings.
832841
833- Non-ASCII Unicode characters are allowed within single-quoted strings. Since
834- {SourceCharacter} must not contain some ASCII control characters, escape
835- sequences must be used to represent these characters. The {` \ ` }, {` " ` }
836- characters also must be escaped. All other escape sequences are optional.
842+ **Escape Sequences**
843+
844+ In a single-quoted {StringValue}, any _Unicode scalar value_ may be expressed
845+ using an escape sequence. GraphQL strings allow both C-style escape sequences
846+ (for example `\n`) and two forms of Unicode escape sequences: one with a
847+ fixed width of 4 hexadecimal digits (for example `\u000A`) and one with a
848+ variable width, most useful for representing a _supplementary character_ such as
849+ an Emoji (for example `\u{1F4A9}`).
850+
851+ The hexadecimal number encoded by a Unicode escape sequence must describe a
852+ _Unicode scalar value_; otherwise, a parse error must result. For example,
853+ neither of the sources `"\uDEAD"` and `"\u{110000}"` is a valid
854+ {StringValue}.
855+
856+ Escape sequences are only meaningful within a single-quoted string. Within a
857+ block string, they are simply that sequence of characters (for example
858+ `"""\n"""` represents the _Unicode text_ [U+005C, U+006E]). Within a comment an
859+ escape sequence is not a significant sequence of characters. Escape sequences
860+ may not appear elsewhere in a GraphQL document.
861+
862+ Since {StringCharacter} must not contain some code points directly (for example,
863+ a {LineTerminator}), escape sequences must be used to represent them. All other
864+ escape sequences are optional and unescaped non-ASCII Unicode characters are
865+ allowed within strings. If using GraphQL within a system which only supports
866+ ASCII, then escape sequences may be used to represent all Unicode characters
867+ outside of the ASCII range.
868+
869+ For legacy reasons, a _supplementary character_ may be escaped by two
870+ fixed-width Unicode escape sequences forming a _surrogate pair_. For example the
871+ input `"\uD83D\uDCA9"` is a valid {StringValue} which represents the same
872+ _Unicode text_ as `"\u{1F4A9}"`. While this legacy form is allowed, it should be
873+ avoided as a variable-width Unicode escape sequence is a clearer way to encode
874+ such code points.
875+
876+ When producing a {StringValue}, implementations should use escape sequences to
877+ represent non-printable control characters (U+0000 to U+001F and U+007F to
878+ U+009F). Other escape sequences are not necessary; however, an implementation
879+ may use escape sequences to represent any other range of code points (for
880+ example, when producing ASCII-only output). If an implementation chooses to
881+ escape a _supplementary character_, it should only use a variable-width Unicode
882+ escape sequence.
837883
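Review note on the "When producing a {StringValue}" paragraph above: a printer following that guidance could look roughly like the sketch below. It is an illustration only, not the behaviour of any particular GraphQL library; `print_string_value`, `SHORT_ESCAPES`, and the `ascii_only` flag are hypothetical names, and the input is assumed to already be valid Unicode text (no lone surrogates).

```python
# C-style escapes for control characters that have a short form.
SHORT_ESCAPES = {"\b": "\\b", "\f": "\\f", "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def print_string_value(text: str, ascii_only: bool = False) -> str:
    """Serialize text as a single-quoted StringValue.

    Assumption: text contains only Unicode scalar values.
    """
    out = ['"']
    for ch in text:
        cp = ord(ch)
        if ch == '"':
            out.append('\\"')
        elif ch == "\\":
            out.append("\\\\")
        elif ch in SHORT_ESCAPES:
            out.append(SHORT_ESCAPES[ch])
        elif cp <= 0x1F or 0x7F <= cp <= 0x9F:
            # Remaining non-printable control characters: fixed-width escape.
            out.append("\\u%04X" % cp)
        elif ascii_only and cp > 0x7F:
            if cp > 0xFFFF:
                # Supplementary character: variable-width form only,
                # never a legacy surrogate pair.
                out.append("\\u{%X}" % cp)
            else:
                out.append("\\u%04X" % cp)
        else:
            out.append(ch)
    out.append('"')
    return "".join(out)
```

With `ascii_only=False` a supplementary character is emitted unescaped; with `ascii_only=True` it is written in the variable-width form, in line with the last sentence of the paragraph.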
838884**Block Strings**
839885
@@ -889,51 +935,84 @@ Note: If non-printable ASCII characters are needed in a string value, a standard
889935quoted string with appropriate escape sequences must be used instead of a block
890936string.
891937
892- ** Semantics**
938+ **Static Semantics**
939+
940+ :: A {StringValue} describes a _Unicode text_ value, which is a sequence of
941+ _Unicode scalar values_.
942+
943+ These semantics describe how to apply the {StringValue} grammar to a source text
944+ to evaluate a _Unicode text_. Errors encountered during this evaluation are
945+ considered a failure to apply the {StringValue} grammar to a source and must
946+ result in a parsing error.
893947
894948StringValue :: ` "" `
895949
896950- Return an empty sequence.
897951
898952StringValue :: ` " ` StringCharacter+ ` " `
899953
900- - Return the sequence of all {StringCharacter} code points.
954+ - Return the _Unicode text_ by concatenating the evaluation of all
955+ {StringCharacter}.
901956
902957StringCharacter :: SourceCharacter but not ` " ` or ` \ ` or LineTerminator
903958
904- - Return the code point {SourceCharacter}.
959+ - Return the _Unicode scalar value_ {SourceCharacter}.
905960
906961StringCharacter :: ` \u ` EscapedUnicode
907962
908- - Let {value} be the 16-bit hexadecimal value represented by the sequence of
909- hexadecimal digits within {EscapedUnicode}.
910- - Return the code point {value}.
963+ - Let {value} be the hexadecimal value represented by the sequence of {HexDigit}
964+ within {EscapedUnicode}.
965+ - Assert {value} is within the _Unicode scalar value_ range (>= 0x0000 and <=
966+ 0xD7FF or >= 0xE000 and <= 0x10FFFF).
967+ - Return the _Unicode scalar value_ {value}.
968+
969+ StringCharacter :: `\u` HexDigit HexDigit HexDigit HexDigit `\u` HexDigit
970+ HexDigit HexDigit HexDigit
971+
972+ - Let {leadingValue} be the hexadecimal value represented by the first sequence
973+ of {HexDigit}.
974+ - Let {trailingValue} be the hexadecimal value represented by the second
975+ sequence of {HexDigit}.
976+ - If {leadingValue} is >= 0xD800 and <= 0xDBFF (a _Leading Surrogate_):
977+   - Assert {trailingValue} is >= 0xDC00 and <= 0xDFFF (a _Trailing Surrogate_).
978+   - Return ({leadingValue} - 0xD800) × 0x400 + ({trailingValue} - 0xDC00) +
979+     0x10000.
980+ - Otherwise:
981+   - Assert {leadingValue} is within the _Unicode scalar value_ range.
982+   - Assert {trailingValue} is within the _Unicode scalar value_ range.
983+   - Return the sequence of the _Unicode scalar value_ {leadingValue} followed by
984+     the _Unicode scalar value_ {trailingValue}.
985+
986+ Note: If both escape sequences encode a _Unicode scalar value_, then this
987+ semantic is identical to applying the prior semantic on each fixed-width escape
988+ sequence. A variable-width escape sequence must only encode a _Unicode scalar
989+ value_.
911990
912991StringCharacter :: ` \ ` EscapedCharacter
913992
914- - Return the code point represented by {EscapedCharacter} according to the table
915- below.
993+ - Return the _Unicode scalar value_ represented by {EscapedCharacter} according
994+ to the table below.
916995
917- | Escaped Character | Code Point | Character Name |
918- | ----------------- | ---------- | ---------------------------- |
919- | {` " ` } | U+0022 | double quote |
920- | {` \ ` } | U+005C | reverse solidus (back slash) |
921- | {` / ` } | U+002F | solidus (forward slash) |
922- | {` b ` } | U+0008 | backspace |
923- | {` f ` } | U+000C | form feed |
924- | {` n ` } | U+000A | line feed (new line) |
925- | {` r ` } | U+000D | carriage return |
926- | {` t ` } | U+0009 | horizontal tab |
996+ | Escaped Character | Scalar Value | Character Name               |
997+ | ----------------- | ------------ | ---------------------------- |
998+ | {`"`}             | U+0022       | double quote                 |
999+ | {`\`}             | U+005C       | reverse solidus (back slash) |
1000+ | {`/`}             | U+002F       | solidus (forward slash)      |
1001+ | {`b`}             | U+0008       | backspace                    |
1002+ | {`f`}             | U+000C       | form feed                    |
1003+ | {`n`}             | U+000A       | line feed (new line)         |
1004+ | {`r`}             | U+000D       | carriage return              |
1005+ | {`t`}             | U+0009       | horizontal tab               |
9271006
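Review note on the {StringCharacter} semantics above: the three escape rules (variable-width, fixed-width with optional surrogate pair, and C-style) can be exercised with a short sketch. It is one possible reading of the algorithm, not an excerpt from an implementation. The name `evaluate_string_body` is hypothetical, and the function assumes it receives only the text between the quotation marks of a single-quoted string that has already been tokenized, so it does not check for raw `"` or {LineTerminator}.

```python
import re

# One escape sequence: \u{X...}, \uXXXX, or a C-style escape.
_ESCAPE = re.compile(r'\\u\{([0-9A-Fa-f]+)\}|\\u([0-9A-Fa-f]{4})|\\(["\\/bfnrt])')
_FIXED = re.compile(r"\\u([0-9A-Fa-f]{4})")
_SIMPLE = {'"': '"', "\\": "\\", "/": "/", "b": "\b",
           "f": "\f", "n": "\n", "r": "\r", "t": "\t"}


def _is_scalar(value: int) -> bool:
    return 0x0000 <= value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF


def evaluate_string_body(body: str) -> str:
    """Evaluate escape sequences; raise ValueError where the spec asserts."""
    out = []
    i = 0
    while i < len(body):
        if body[i] != "\\":
            out.append(body[i])
            i += 1
            continue
        m = _ESCAPE.match(body, i)
        if m is None:
            raise ValueError(f"invalid escape sequence at index {i}")
        braced, fixed, simple = m.groups()
        start, i = i, m.end()
        if simple is not None:
            out.append(_SIMPLE[simple])
            continue
        value = int(braced if braced is not None else fixed, 16)
        if fixed is not None and 0xD800 <= value <= 0xDBFF:
            # Leading surrogate: a fixed-width trailing surrogate must follow.
            m2 = _FIXED.match(body, i)
            trailing = int(m2.group(1), 16) if m2 else -1
            if not 0xDC00 <= trailing <= 0xDFFF:
                raise ValueError(f"unpaired leading surrogate at index {start}")
            value = (value - 0xD800) * 0x400 + (trailing - 0xDC00) + 0x10000
            i = m2.end()
        elif not _is_scalar(value):
            raise ValueError(f"not a Unicode scalar value at index {start}")
        out.append(chr(value))
    return "".join(out)
```

Under this sketch both `\uD83D\uDCA9` and `\u{1F4A9}` evaluate to the same single scalar value, while `\uDEAD` and `\u{110000}` raise, which mirrors the examples given in the prose of the diff.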
9281007StringValue :: ` """ ` BlockStringCharacter\* ` """ `
9291008
930- - Let {rawValue} be the Unicode character sequence of all {BlockStringCharacter}
931- Unicode character values (which may be an empty sequence).
1009+ - Let {rawValue} be the _Unicode text_ by concatenating the evaluation of all
1010+ {BlockStringCharacter} (which may be an empty sequence).
9321011- Return the result of {BlockStringValue(rawValue)}.
9331012
9341013BlockStringCharacter :: SourceCharacter but not ` """ ` or ` \""" `
9351014
936- - Return the character value of {SourceCharacter}.
1015+ - Return the _Unicode scalar value_ {SourceCharacter}.
9371016
9381017BlockStringCharacter :: ` \""" `
9391018