Tweak Some Unicode-Related Text #1103
base: draft-v8

@@ -10,7 +10,7 @@ Conceptually speaking, a program is compiled using three steps:
1. Lexical analysis, which translates a stream of Unicode input characters into a stream of tokens.
1. Syntactic analysis, which translates the stream of tokens into executable code.

-Conforming implementations shall accept Unicode compilation units encoded with the UTF-8 encoding form (as defined by the Unicode standard), and transform them into a sequence of Unicode characters. Implementations can choose to accept and transform additional character encoding schemes (such as UTF-16, UTF-32, or non-Unicode character mappings).
+Apart from accepting UTF-8 encoded input (as required by [§5](conformance.md#5-conformance)), a conforming implementation can choose to accept and transform additional character encoding schemes (such as UTF-16, UTF-32, or non-Unicode character mappings).

> *Note*: The handling of the Unicode NULL character (U+0000) is implementation-specific. It is strongly recommended that developers avoid using this character in their source code, for the sake of both portability and readability. When the character is required within a character or string literal, the escape sequences `\0` or `\u0000` may be used instead. *end note*
<!-- markdownlint-disable MD028 -->
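
For illustration only (this snippet is not part of the diff), the two escape forms that the note recommends for embedding U+0000 in a literal:

```csharp
// Both escape forms place the Unicode NULL character (U+0000) in the string
// without embedding a raw U+0000 in the source file itself.
string viaSimpleEscape  = "abc\0def";      // simple escape sequence \0
string viaUnicodeEscape = "abc\u0000def";  // Unicode escape sequence \u0000

Console.WriteLine(viaSimpleEscape == viaUnicodeEscape); // True
Console.WriteLine(viaSimpleEscape.Length);              // 7
```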

@@ -351,7 +351,7 @@ token

### 6.4.2 Unicode character escape sequences

-A Unicode escape sequence represents a Unicode code point. Unicode escape sequences are processed in identifiers ([§6.4.3](lexical-structure.md#643-identifiers)), character literals ([§6.4.5.5](lexical-structure.md#6455-character-literals)), regular string literals ([§6.4.5.6](lexical-structure.md#6456-string-literals)), and interpolated regular string expressions ([§12.8.3](expressions.md#1283-interpolated-string-expressions)). A Unicode escape sequence is not processed in any other location (for example, to form an operator, punctuator, or keyword).
+A Unicode character escape sequence represents a Unicode code point. Unicode escape sequences are processed in identifiers ([§6.4.3](lexical-structure.md#643-identifiers)), character literals ([§6.4.5.5](lexical-structure.md#6455-character-literals)), regular string literals ([§6.4.5.6](lexical-structure.md#6456-string-literals)), and interpolated regular string expressions ([§12.8.3](expressions.md#1283-interpolated-string-expressions)). A Unicode escape sequence is not processed in any other location (for example, to form an operator, punctuator, or keyword).

```ANTLR
fragment Unicode_Escape_Sequence
@@ -361,7 +361,7 @@ fragment Unicode_Escape_Sequence
    ;
```

-A Unicode character escape sequence represents the single Unicode code point formed by the hexadecimal number following the “\u” or “\U” characters. Since C# uses a 16-bit encoding of Unicode code points in character and string values, a Unicode code point in the range `U+10000` to `U+10FFFF` is represented using two Unicode surrogate code units. Unicode code points above `U+FFFF` are not permitted in character literals. Unicode code points above `U+10FFFF` are invalid and are not supported.
+A *Unicode_Escape_Sequence* represents the Unicode code point whose value is the hexadecimal number following the “\u” or “\U” characters. Since C# uses UTF-16 encoding in `char` and `string` values, a Unicode code point in the range `U+10000` to `U+10FFFF` is represented using two UTF-16 surrogate code units. Unicode code points above `U+FFFF` are not permitted in character literals. Unicode code points above `U+10FFFF` are invalid and are not supported.

Given the discussion over §8.2.5 I suspect this para will require some careful rewriting if it is agreed implementations may use different (and multiple) storage models for strings as long as they conform to the API presenting them as UTF-16.

Multiple translations are not performed. For instance, the string literal `"\u005Cu005C"` is equivalent to `"\u005C"` rather than `"\"`.
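
As a quick illustration of the paragraph above (a sketch, not proposed spec text), a code point above U+FFFF written with `\U` surfaces as two UTF-16 code units in a string, while the same escape is rejected in a character literal; the snippet also shows an escape being processed in an identifier:

```csharp
// U+1F600 lies above U+FFFF, so in a string literal the \U escape yields a
// surrogate pair: two UTF-16 code units.
string s = "\U0001F600";
Console.WriteLine(s.Length);    // 2
Console.WriteLine((int)s[0]);   // 55357 (0xD83D, high surrogate)
Console.WriteLine((int)s[1]);   // 56832 (0xDE00, low surrogate)

// Unicode escapes are also processed in identifiers:
int \u0061bc = 1;               // declares an identifier named "abc"
Console.WriteLine(abc);         // 1

// char c = '\U0001F600';       // compile-time error: code points above U+FFFF
                                // are not permitted in character literals
```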

@@ -805,7 +805,7 @@ The value of a real literal of type `float` or `double` is determined by using t

#### 6.4.5.5 Character literals

-A character literal represents a single character, and consists of a character in quotes, as in `'a'`.
+A character literal represents a single character as a UTF-16 code unit, and consists of a character or *Unicode_Escape_Sequence* in quotes, as in `'a'`, `'\u0061'`, or `'\U00000061'`.

```ANTLR
Character_Literal
```

@@ -850,7 +850,7 @@ fragment Hexadecimal_Escape_Sequence
>
> *end note*

-A hexadecimal escape sequence represents a single Unicode UTF-16 code unit, with the value formed by the hexadecimal number following “`\x`”.
+A hexadecimal escape sequence represents a UTF-16 code unit, with the value formed by the hexadecimal number following “`\x`”.

If the value represented by a character literal is greater than `U+FFFF`, a compile-time error occurs.
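
A minimal illustration (not part of the diff) of hexadecimal escapes and the U+FFFF limit for character literals:

```csharp
// 'A', '\x41', and '\u0041' all denote the same UTF-16 code unit, 0x0041.
char a1 = 'A';
char a2 = '\x41';    // hexadecimal escape sequence
char a3 = '\u0041';  // Unicode escape sequence
Console.WriteLine(a1 == a2 && a2 == a3); // True

// char bad = '\U0001F600';  // compile-time error: the value exceeds U+FFFF
```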

@@ -876,7 +876,7 @@ The type of a *Character_Literal* is `char`.

#### 6.4.5.6 String literals

-C# supports two forms of string literals: ***regular string literals*** and ***verbatim string literals***. A regular string literal consists of zero or more characters enclosed in double quotes, as in `"hello"`, and can include both simple escape sequences (such as `\t` for the tab character), and hexadecimal and Unicode escape sequences.
+C# supports two forms of string literals: ***regular string literals*** and ***verbatim string literals***. A regular string literal consists of zero or more characters enclosed in double quotes, as in `"hello"`, and can include both simple escape sequences (such as `\t` for the tab character), and hexadecimal and Unicode escape sequences. Both forms use UTF-16 encoding.

Did we miss this when considering interpolated string literals? It feels odd to not mention interpolated string literals anywhere within this section. (Probably not something to fix in this PR, but potentially worth a new issue. See what you think. Feel free to create one and assign it to me if you agree.)

@jskeet – Since when did C# have interpolated string literals? ;-) That said we should make sure that the text around interpolated string expressions is correct Unicode-wise relative to this PR. (The rules used in the definition of interpolated string expressions refer to the same …)

Hmm… I've always regarded interpolated strings as a form of string literal. Looks like I was wrong – 12.8.3 explicitly says:

A verbatim string literal consists of an `@` character followed by a double-quote character, zero or more characters, and a closing double-quote character.
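
A short sketch (again, not proposed spec text) contrasting the two forms; all three literals below produce the same sequence of UTF-16 code units:

```csharp
// Escape sequences are processed in a regular string literal...
string regular  = "C:\\Users\\alice";           // simple escape \\ for a backslash
string unicode  = "C:\u005CUsers\u005Calice";   // Unicode escape for the same backslash
// ...but not in a verbatim string literal, where backslashes are taken literally.
string verbatim = @"C:\Users\alice";

Console.WriteLine(regular == verbatim && unicode == verbatim); // True
```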

@@ -28,6 +28,7 @@ A conforming implementation is required to document its choice of behavior in ea
1. The maximum value allowed for `Decimal_Digit+` in `PP_Line_Indicator` ([§6.5.8](lexical-structure.md#658-line-directives)).
1. The interpretation of the *input_characters* in the *pp_pragma-text* of a #pragma directive ([§6.5.9](lexical-structure.md#659-pragma-directives)).
1. The values of any application parameters passed to `Main` by the host environment prior to application startup ([§7.1](basic-concepts.md#71-application-startup)).
+1. The endianness of UTF-16 code units in a UTF-16-encoded string literal or an instance of the class `string` ([§8.2.5](types.md#825-the-string-type)).

How would this be detected anyway? I don't know offhand whether unsafe code can fix a …

1. The precise structure of the expression tree, as well as the exact process for creating it, when an anonymous function is converted to an expression-tree ([§10.7.3](conversions.md#1073-evaluation-of-lambda-expression-conversions-to-expression-tree-types)).
1. The value returned when a stack allocation of size zero is made ([§12.8.21](expressions.md#12821-stack-allocation)).
1. Whether a `System.ArithmeticException` (or a subclass thereof) is thrown or the overflow goes unreported with the resulting value being that of the left operand, when in an `unchecked` context and the left operand of an integer division is the maximum negative `int` or `long` value and the right operand is `–1` ([§12.10.3](expressions.md#12103-division-operator)).

@@ -107,7 +107,7 @@ The `dynamic` type is further described in [§8.7](types.md#87-the-dynamic-type)

### 8.2.5 The string type

-The `string` type is a sealed class type that inherits directly from `object`. Instances of the `string` class represent Unicode character strings.
+The `string` type is a sealed class type that inherits directly from `object`. Instances of the `string` class represent a sequence of UTF-16 code units, whose endianness is implementation-defined.

If the representation (including endianness) of UTF-16 code units in strings were not the same as in …

However, I'm not sure the standard should require implementations to store all strings in UTF-16 all the time. It could allow implementations to use a tighter encoding such as ISO-8859-1 for some strings, provided that the implementation can expand the string to UTF-16 when a …

In other words, I'd slightly prefer if the standard said that a string represents a sequence of UTF-16 code units (which do not have to be valid UTF-16) but did not prescribe any in-memory format or endianness for strings. AFAICT, the endianness of …

@KalleOlaviNiemitalo Regarding "Instances of the …"

Why would you say that the ordering of the UTF-16 code units is implementation-defined? What kind of freedom would implementations have there? For example, in the string "cat", the UTF-16 code units must be (U+0063, U+0061, U+0074). Do you think some implementation would prefer storing those in the opposite order in memory, i.e. { 0x0074, 0x0061, 0x0063 }? As …

Or, did you mean an implementation might swap the code units in each surrogate pair? I mean, store …

@KalleOlaviNiemitalo Re endianness, here's what my notes say, which is very likely where I got the idea to add the text I did: https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-encoding-introduction#utf-8-and-utf-32 says the following:

> In .NET, the UTF-16 code units of a string are stored in contiguous memory as a sequence of 16-bit integers (char instances). The bits of individual code units are laid out according to the endianness of the current architecture. On a little-endian architecture, the string consisting of the UTF-16 code points [ D801 DCCC ] would be laid out in memory as the bytes [ 0x01, 0xD8, 0xCC, 0xDC ]. On a big-endian architecture that same string would be laid out in memory as the bytes [ 0xD8, 0x01, 0xDC, 0xCC ]. Computer systems that communicate with each other must agree on the representation of data crossing the wire. Most network protocols use UTF-8 as a standard when transmitting text, partly to avoid issues that might result from a big-endian machine communicating with a little-endian machine. The string consisting of the UTF-8 code points [ F0 90 93 8C ] will always be represented as the bytes [ 0xF0, 0x90, 0x93, 0x8C ] regardless of endianness.

I think I agree with @KalleOlaviNiemitalo here, as per my previous comment… it's possible that requirements of interop mean we need this, but we should discuss it further in a meeting rather than making an assumption. It's possible that we could change the "implementation-defined" line to constrain it to "when the underlying memory of a string is exposed for interop reasons" or similar.

I don't think endianness should be mentioned here. The Standard says little about interop in general and mentions endianness once non-normatively (§23.5.1 example). I would expect any implementation concerned with interop would document the endian order used and be consistent. I also agree with @KalleOlaviNiemitalo that the Standard should not prescribe the in-memory structure of strings – that is down to the implementation – just that the type presents them, as per the API, as a sequence of UTF-16 code units. AFAIK this Standard, the CLR one, and the MS .NET documentation also do not (rightly or wrongly) go into the efficiency/performance of the API in general, so we need not say this presentation shall be efficient but hope that it is!
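
As an aside on how the in-memory layout could be observed at all, here is a minimal sketch that assumes a .NET implementation providing `string.AsSpan` and `System.Runtime.InteropServices.MemoryMarshal` (an implementation detail, not something the Standard requires):

```csharp
using System;
using System.Runtime.InteropServices;

class EndiannessProbe
{
    static void Main()
    {
        // U+104CC is stored as the surrogate pair D801 DCCC (the example used in the
        // .NET documentation quoted above).
        string s = "\U000104CC";
        Console.WriteLine($"{(int)s[0]:X4} {(int)s[1]:X4}"); // D801 DCCC

        // Reinterpret the char span as raw bytes to observe the platform byte order.
        ReadOnlySpan<byte> bytes = MemoryMarshal.AsBytes(s.AsSpan());
        Console.WriteLine(BitConverter.ToString(bytes.ToArray()));
        // Little-endian runtimes print 01-D8-CC-DC; a big-endian one would print D8-01-DC-CC.
    }
}
```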

Values of the `string` type can be written as string literals ([§6.4.5.6](lexical-structure.md#6456-string-literals)).

@@ -311,7 +311,7 @@ C# supports nine integral types: `sbyte`, `byte`, `short`, `ushort`, `int`, `uin
- The `uint` type represents unsigned 32-bit integers with values from `0` to `4294967295`, inclusive.
- The `long` type represents signed 64-bit integers with values from `-9223372036854775808` to `9223372036854775807`, inclusive.
- The `ulong` type represents unsigned 64-bit integers with values from `0` to `18446744073709551615`, inclusive.
-- The `char` type represents unsigned 16-bit integers with values from `0` to `65535`, inclusive. The set of possible values for the `char` type corresponds to the Unicode character set.
+- The `char` type represents unsigned 16-bit integers with values from `0` to `65535`, inclusive, as a UTF-16 code unit.
> *Note*: Although `char` has the same representation as `ushort`, not all operations permitted on one type are permitted on the other. *end note*
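
A small sketch (not part of the diff) of the note's point that `char` and `ushort` share a representation but not an operation set:

```csharp
char c = 'A';            // the UTF-16 code unit 0x0041
ushort u = c;            // implicit conversion: char -> ushort preserves the value
// c = u;                // error: ushort -> char requires an explicit cast
char d = (char)(c + 1);  // arithmetic on char yields int, so a cast back is needed
Console.WriteLine($"{c} {u} {d}"); // A 65 B
```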

All signed integral types are represented using two’s complement format.

I don't think this is correct. The interpolated string expression syntax in a code file leads to a `System.IFormattable`, `System.FormattableString`, or `string` instance at runtime, and that `string` instance is presented as UTF-16. The syntax itself only exists in the code file and is therefore in whatever encoding the code file is in – UTF-8, ASCII, EBCDIC…

The same applies to string and character literals – they themselves can be in any encoding supported by the implementation for code files, while the values they produce at runtime are presented as UTF-16.

Offhand I've no alternative wording suggestion and so would just not make any change.