[RFC] Clarify and restrict unicode support #96

Merged: 1 commit, Sep 25, 2015
13 changes: 8 additions & 5 deletions spec/Appendix A -- Notation Conventions.md
@@ -166,13 +166,16 @@ Example_param :
This specification describes the semantic value of many grammar productions in
the form of a list of algorithmic steps.

For example, this describes how a parser should interpret a Unicode escape
sequence which appears in a string literal:
For example, this describes how a parser should interpret a string literal:

EscapedUnicode :: u /[0-9A-Fa-f]{4}/
StringValue :: `""`

* Let {codePoint} be the number represented by the four-digit hexadecimal sequence.
* The string value is the Unicode character represented by {codePoint}.
* Return an empty Unicode character sequence.

StringValue :: `"` StringCharacter+ `"`

* Return the Unicode character sequence of all {StringCharacter}
Unicode character values.


## Algorithms
20 changes: 9 additions & 11 deletions spec/Appendix B -- Grammar Summary.md
@@ -1,31 +1,29 @@
# B. Appendix: Grammar Summary

SourceCharacter :: "Any Unicode code point"
SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/


## Ignored Tokens

Ignored ::
- UnicodeBOM
- WhiteSpace
- LineTerminator
- Comment
- Comma

UnicodeBOM :: "Byte Order Mark (U+FEFF)"

WhiteSpace ::
- "Horizontal Tab (U+0009)"
- "Vertical Tab (U+000B)"
- "Form Feed (U+000C)"
- "Space (U+0020)"
- "No-break Space (U+00A0)"

LineTerminator ::
- "New Line (U+000A)"
- "Carriage Return (U+000D)"
- "Line Separator (U+2028)"
- "Paragraph Separator (U+2029)"
- "Carriage Return (U+000D)" [ lookahead ! "New Line (U+000A)" ]
- "Carriage Return (U+000D)" "New Line (U+000A)"

Comment ::
- `#` CommentChar*
Comment :: `#` CommentChar*

CommentChar :: SourceCharacter but not LineTerminator

@@ -76,10 +74,10 @@ StringValue ::

StringCharacter ::
- SourceCharacter but not `"` or \ or LineTerminator
- \ EscapedUnicode
- \u EscapedUnicode
- \ EscapedCharacter

EscapedUnicode :: u /[0-9A-Fa-f]{4}/
EscapedUnicode :: /[0-9A-Fa-f]{4}/

EscapedCharacter :: one of `"` \ `/` b f n r t

75 changes: 61 additions & 14 deletions spec/Section 2 -- Language.md
@@ -13,45 +13,61 @@ double-colon `::`).

## Source Text

SourceCharacter :: "Any Unicode character"
SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/

GraphQL documents are expressed as a sequence of
[Unicode](http://unicode.org/standard/standard.html) characters. However, with
few exceptions, most of GraphQL is expressed only in the original ASCII range
so as to be as widely compatible with as many existing tools, languages, and
serialization formats as possible. Other than within comments, Non-ASCII Unicode
characters are only found within {StringValue}.
few exceptions, most of GraphQL is expressed only in the original non-control
ASCII range so as to be as widely compatible with as many existing tools,
languages, and serialization formats as possible and avoid display issues in
text editors and source control.
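
The restricted {SourceCharacter} range can be checked before lexing. The following is a minimal sketch rather than part of the specification: the name `first_invalid_offset` is hypothetical, and it assumes the source is already decoded into a Python `str`. Because Python strings are sequences of code points, a character above U+FFFF is reported as outside the range here, whereas an implementation working on UTF-16 code units would see two surrogates that each fall inside it.

```python
import re
from typing import Optional

# Complement of the SourceCharacter pattern /[\u0009\u000A\u000D\u0020-\uFFFF]/.
NON_SOURCE_CHARACTER = re.compile(r"[^\u0009\u000A\u000D\u0020-\uFFFF]")


def first_invalid_offset(source: str) -> Optional[int]:
    """Return the offset of the first character outside SourceCharacter, or None."""
    match = NON_SOURCE_CHARACTER.search(source)
    return match.start() if match else None
```

A lexer could call this once up front and report a syntax error at the returned offset, for example for a stray Vertical Tab (U+000B), which the new range excludes.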


### Unicode

UnicodeBOM :: "Byte Order Mark (U+FEFF)"

Non-ASCII Unicode characters may freely appear within {StringValue} and
{Comment} portions of GraphQL.

The "Byte Order Mark" is a special Unicode character which
may appear at the beginning of a file containing Unicode which programs may use
to determine the fact that the text stream is Unicode, what endianness the text
stream is in, and which of several Unicode encodings to interpret.
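
As an illustration only, a tool that reads GraphQL files might drop a leading Byte Order Mark before handing the text to a lexer. The helper name `strip_leading_bom` is hypothetical, and the sketch assumes the bytes have already been decoded to a `str`, so the BOM shows up as the single character U+FEFF.

```python
BOM = "\ufeff"  # Byte Order Mark (U+FEFF) after decoding


def strip_leading_bom(source: str) -> str:
    """Return source without a single leading BOM; other text is left untouched."""
    return source[1:] if source.startswith(BOM) else source
```

A lexer that treats {UnicodeBOM} as an ignored token would not need this step; the sketch only shows the simplest pre-processing approach.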


### White Space

WhiteSpace ::
- "Horizontal Tab (U+0009)"
- "Vertical Tab (U+000B)"
- "Form Feed (U+000C)"
- "Space (U+0020)"
- "No-break Space (U+00A0)"

White space is used to improve legibility of source text and act as separation
between tokens, and any amount of white space may appear before or after any
token. White space between tokens is not significant to the semantic meaning of
a GraphQL query document, however white space characters may appear within a
{String} or {Comment} token.

Note: GraphQL intentionally does not consider Unicode "Zs" category characters
as white-space, avoiding misinterpretation by text editors and source
control tools.

### Line Terminators

LineTerminator ::
- "New Line (U+000A)"
- "Carriage Return (U+000D)"
- "Line Separator (U+2028)"
- "Paragraph Separator (U+2029)"
- "Carriage Return (U+000D)" [ lookahead ! "New Line (U+000A)" ]
- "Carriage Return (U+000D)" "New Line (U+000A)"

Like white space, line terminators are used to improve the legibility of source
text, any amount may appear before or after any other token and have no
significance to the semantic meaning of a GraphQL query document. Line
terminators are not found within any other token.

Note: Any error reporting which provides the line number in the source of the
offending syntax should use the number of preceding {LineTerminator} tokens to
produce the line number.
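
The line-number rule in the note can be sketched as follows. This is an illustrative assumption, not reference code: `line_number` is a hypothetical name, lines are assumed to be numbered from 1, and `offset` is assumed to be a character index into the decoded source. The regular expression tries `\r\n` first so that a Carriage Return followed by a New Line counts as one {LineTerminator}, matching the lookahead in the grammar.

```python
import re

# One LineTerminator: CR LF as a unit, or a lone CR, or a lone LF.
LINE_TERMINATOR = re.compile(r"\r\n|\r|\n")


def line_number(source: str, offset: int) -> int:
    """Return the 1-based line of `offset` by counting preceding LineTerminators."""
    preceding = LINE_TERMINATOR.finditer(source, 0, offset)
    return 1 + sum(1 for _ in preceding)
```

For the source `"a\r\nb\nc\rd"`, the characters `a`, `b`, `c` and `d` fall on lines 1, 2, 3 and 4 respectively.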


### Comments

@@ -101,9 +117,11 @@ defined here in a lexical grammar by patterns of source Unicode characters.
Tokens are later used as terminal symbols in a GraphQL query document syntactic
grammars.


### Ignored Tokens

Ignored ::
- UnicodeBOM
- WhiteSpace
- LineTerminator
- Comment
@@ -639,17 +657,46 @@ StringValue ::

StringCharacter ::
- SourceCharacter but not `"` or \ or LineTerminator
- \ EscapedUnicode
- \u EscapedUnicode
- \ EscapedCharacter

EscapedUnicode :: u /[0-9A-Fa-f]{4}/
EscapedUnicode :: /[0-9A-Fa-f]{4}/

EscapedCharacter :: one of `"` \ `/` b f n r t

Strings are lists of characters wrapped in double-quotes `"`. (ex.
Strings are sequences of characters wrapped in double-quotes (`"`). (ex.
`"Hello World"`). White space and other otherwise-ignored characters are
significant within a string value.

Note: Unicode characters are allowed within String value literals, however
GraphQL source must not contain some ASCII control characters so escape
sequences must be used to represent these characters.

**Semantics**

StringValue :: `""`

* Return an empty Unicode character sequence.

StringValue :: `"` StringCharacter+ `"`

* Return the Unicode character sequence of all {StringCharacter}
Unicode character values.

StringCharacter :: SourceCharacter but not `"` or \ or LineTerminator

* Return the character value of {SourceCharacter}.

StringCharacter :: \u EscapedUnicode

* Return the character value represented by the UTF-16 hexadecimal
identifier {EscapedUnicode}.

StringCharacter :: \ EscapedCharacter

* Return the character value of {EscapedCharacter}.
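
The semantic steps above can be sketched as a small function. This is a hedged illustration, not the reference implementation: `string_value` and `SIMPLE_ESCAPES` are hypothetical names, the input is assumed to be the whole literal including both quotes, and the mapping of `b f n r t` to backspace, form feed, new line, carriage return and tab is assumed to follow the JSON convention. A `\u` escape in the surrogate range is returned as a lone surrogate, since nothing here says how pairs should be combined.

```python
import re

# Assumed JSON-style meaning of each EscapedCharacter.
SIMPLE_ESCAPES = {
    '"': '"', "\\": "\\", "/": "/",
    "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t",
}
HEX4 = re.compile(r"[0-9A-Fa-f]{4}")  # EscapedUnicode


def string_value(literal: str) -> str:
    """Return the character sequence denoted by a double-quoted string literal."""
    if len(literal) < 2 or literal[0] != '"' or literal[-1] != '"':
        raise ValueError("string literal must be wrapped in double quotes")
    body = literal[1:-1]
    chars = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == '"' or ch in "\r\n":
            raise ValueError(f"unescaped character at offset {i + 1}")
        if ch != "\\":
            # StringCharacter :: SourceCharacter but not " or \ or LineTerminator
            chars.append(ch)
            i += 1
            continue
        escape = body[i + 1:i + 2]  # empty string if the backslash is last
        if escape == "u":
            # StringCharacter :: \u EscapedUnicode
            digits = body[i + 2:i + 6]
            if not HEX4.fullmatch(digits):
                raise ValueError(f"invalid unicode escape at offset {i + 1}")
            chars.append(chr(int(digits, 16)))
            i += 6
        elif escape in SIMPLE_ESCAPES:
            # StringCharacter :: \ EscapedCharacter
            chars.append(SIMPLE_ESCAPES[escape])
            i += 2
        else:
            raise ValueError(f"invalid escape sequence at offset {i + 1}")
    return "".join(chars)
```

For example, `string_value(r'"a\u0041\n"')` yields the three characters `a`, `A` and a new line, and `string_value('""')` yields the empty sequence.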


#### Enum Value

EnumValue : Name but not `true`, `false` or `null`