
Tweak Some Unicode-Related Text #1103

Open: wants to merge 5 commits into draft-v8 from TweakUnicodeStuff
Conversation

RexJaeschke (Contributor):

Recently, I started writing a formal spec for the V11 feature, UTF-8 string literals. However, as I was reading the Draft V8 Ecma C# spec, it became clear to me that it didn’t say enough and/or wasn’t as clear as it could be w.r.t. the (presumed) current intention regarding Unicode support. So, before I go back to writing that V11 feature spec, I created this PR, which contains a set of proposed improvements to the current Ecma spec. Basically, I want a better base on which to add the V11 (and possibly later) extensions. This PR includes the following kinds of edits:

  1. Use grammar rule name instead of the descriptive English equivalent.
  2. Use string instead of "string", as we'll have different kinds of strings once UTF-8 support is added.
  3. Use char instead of "character" where that makes it more precise.
  4. Remove duplication of normative text (w.r.t conformance).
  5. Use terms consistently.

I don't expect this to be controversial. The only new normative text has to do with explicitly saying that type string uses UTF-16 encoding.

@jskeet I added you as a reviewer, as I know you have written about some Unicode issues.
@KalleOlaviNiemitalo If you have expertise in this area, I'd appreciate your feedback.

@RexJaeschke RexJaeschke added the type: clarity While not technically incorrect, the Standard is potentially confusing label May 4, 2024
@RexJaeschke RexJaeschke requested a review from jskeet May 4, 2024 15:58
@RexJaeschke RexJaeschke self-assigned this May 4, 2024
@@ -107,7 +107,7 @@ The `dynamic` type is further described in [§8.7](types.md#87-the-dynamic-type)

### 8.2.5 The string type

The `string` type is a sealed class type that inherits directly from `object`. Instances of the `string` class represent Unicode character strings.
The `string` type is a sealed class type that inherits directly from `object`. Instances of the `string` class represent a sequence of UTF-16 code units, whose endianness is implementation-defined.

KalleOlaviNiemitalo (Contributor), May 4, 2024:

If the representation (including endianness) of UTF-16 code units in strings were not the same as in `char`, then memory allocation would be necessary in the conversion from `string` to `ReadOnlySpan<char>`, and in `fixed (char* p = str)`.

However, I'm not sure the standard should require implementations to store all strings in UTF-16 all the time. It could allow implementations to use a tighter encoding such as ISO-8859-1 for some strings, provided that the implementation can expand the string to UTF-16 when a `ReadOnlySpan<char>` or `char*` is needed. (A variable-length encoding such as UTF-8 could be too costly to implement, because of the `string[int]` indexer.)
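The zero-copy conversions this comment refers to can be sketched in C# (the class name here is illustrative; the `fixed` statement requires compiling with unsafe code enabled):

```csharp
using System;

class StringViewSketch
{
    static unsafe void Main()
    {
        string s = "cat";

        // Zero-copy view: reinterprets the string's existing UTF-16 buffer.
        // This works without allocation only because the buffer already
        // holds contiguous char-sized (UTF-16) code units.
        ReadOnlySpan<char> span = s.AsSpan();
        Console.WriteLine(span[2]);        // t

        // Pinning exposes the same buffer as a raw char*.
        fixed (char* p = s)
        {
            Console.WriteLine((int)p[0]);  // 99 (U+0063, 'c')
        }
    }
}
```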

Contributor:

In other words, I'd slightly prefer if the standard said that a string represents a sequence of UTF-16 code units (which do not have to be valid UTF-16) but did not prescribe any in-memory format or endianness for strings.

AFAICT, the endianness of int is currently unspecified rather than implementation-defined. I don't see why the endianness of char (and ushort, which has the same representation) would be more important for implementations to document.
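The "do not have to be valid UTF-16" point is observable in C# today; a small sketch (class name illustrative):

```csharp
using System;

class LoneSurrogateSketch
{
    static void Main()
    {
        // A legal C# string whose single code unit forms no valid UTF-16 text:
        string lone = "\uD800";   // unpaired high surrogate
        Console.WriteLine(lone.Length);                    // 1
        Console.WriteLine(char.IsHighSurrogate(lone[0]));  // True

        // The string is still an ordinary sequence of code units; validity
        // only matters when the value is converted to a real encoding.
    }
}
```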

RexJaeschke (Contributor Author):

@KalleOlaviNiemitalo Regarding "Instances of the string class represent a sequence of UTF-16 code units, whose endianness is implementation-defined.", note the comma. This is intended to say that the ordering of the UTF-16 code units in the (singular) string instance is implementation-defined, rather than the bits in any char in that string (which is how I would read that sentence without the comma). I say that because I'm confused by your mention of endianness with scalar types.

Contributor:

Why would you say that the ordering of the UTF-16 code units is implementation-defined? What kind of freedom would implementations have there?

For example, in the string "cat", the UTF-16 code units must be (U+0063, U+0061, U+0074). Do you think some implementation would prefer storing those in the opposite order in memory, i.e. { 0x0074, 0x0061, 0x0063 }? As "cat"[2] must still be 't', I expect that would only be harder to implement.

Or, did you mean an implementation might swap the code units in each surrogate pair? I mean, store "👁eye" (U+1F441, U+0065, U+0079, U+0065) as { 0xDC41, 0xD83D, 0x0065, 0x0079, 0x0065 } rather than { 0xD83D, 0xDC41, 0x0065, 0x0079, 0x0065 }. I expect this would make sorting slower.
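The code-unit ordering being discussed is directly observable through the indexer; a sketch (class name illustrative):

```csharp
using System;

class CodeUnitOrderSketch
{
    static void Main()
    {
        // The indexer addresses code units in logical order:
        Console.WriteLine("cat"[2]);               // t

        // U+1F441 is stored as the surrogate pair 0xD83D, 0xDC41, in that order:
        string eye = "\U0001F441eye";
        Console.WriteLine((int)eye[0] == 0xD83D);  // True
        Console.WriteLine((int)eye[1] == 0xDC41);  // True
        Console.WriteLine(eye.Length);             // 5 code units for 4 code points
    }
}
```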

RexJaeschke (Contributor Author):

@KalleOlaviNiemitalo Re endianness, here's what my notes say, which is very likely where I got the idea to add the text I did:

https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-encoding-introduction#utf-8-and-utf-32 says the following:

In .NET, the UTF-16 code units of a string are stored in contiguous memory as a sequence of 16-bit integers (char instances). The bits of individual code units are laid out according to the endianness of the current architecture.

On a little-endian architecture, the string consisting of the UTF-16 code points [ D801 DCCC ] would be laid out in memory as the bytes [ 0x01, 0xD8, 0xCC, 0xDC ]. On a big-endian architecture that same string would be laid out in memory as the bytes [ 0xD8, 0x01, 0xDC, 0xCC ].

Computer systems that communicate with each other must agree on the representation of data crossing the wire. Most network protocols use UTF-8 as a standard when transmitting text, partly to avoid issues that might result from a big-endian machine communicating with a little-endian machine. The string consisting of the UTF-8 code points [ F0 90 93 8C ] will always be represented as the bytes [ 0xF0, 0x90, 0x93, 0x8C ] regardless of endianness.
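The byte layouts quoted above can be reproduced with System.Text.Encoding (a sketch; note that Encoding.Unicode is little-endian UTF-16):

```csharp
using System;
using System.Text;

class EndiannessSketch
{
    static void Main()
    {
        // The surrogate pair D801 DCCC (code point U+104CC) from the quote:
        string s = "\uD801\uDCCC";

        // Little-endian UTF-16 byte order:
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(s)));
        // 01-D8-CC-DC

        // Big-endian UTF-16 byte order:
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(s)));
        // D8-01-DC-CC

        // UTF-8 is a byte sequence, so no endianness question arises:
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(s)));
        // F0-90-93-8C
    }
}
```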

Contributor:

I think I agree with @KalleOlaviNiemitalo here, as per my previous comment... it's possible that requirements of interop mean we need this, but we should discuss it further in a meeting rather than making an assumption. It's possible that we could change the "implementation-defined" line to constrain it to "when the underlying memory of a string is exposed for interop reasons" or similar.

Contributor:

I don’t think endianness should be mentioned here. The Standard says little about interop in general and mentions endianness only once, non-normatively (§23.5.1 example). I would expect any implementation concerned with interop to document the endian order used and to be consistent.

I also agree with @KalleOlaviNiemitalo that the Standard should not prescribe the in-memory structure of strings – that is down to the implementation – just that the type presents them, as per the API, as a sequence of UTF-16 code units.

AFAIK this Standard, the CLR one, and the MS .NET documentation also do not (rightly or wrongly) go into the efficiency/performance of the API in general, so we need not say this presentation shall be efficient but hope that it is!

@RexJaeschke RexJaeschke deleted the TweakUnicodeStuff branch June 24, 2024 20:29
@BillWagner BillWagner restored the TweakUnicodeStuff branch June 24, 2024 20:36
@BillWagner BillWagner reopened this Jun 24, 2024
@RexJaeschke RexJaeschke added the meeting: proposal There is an informal proposal in the issue, worth discussing in a meeting label Jun 24, 2024
@@ -876,7 +876,7 @@ The type of a *Character_Literal* is `char`.

#### 6.4.5.6 String literals

C# supports two forms of string literals: ***regular string literals*** and ***verbatim string literals***. A regular string literal consists of zero or more characters enclosed in double quotes, as in `"hello"`, and can include both simple escape sequences (such as `\t` for the tab character), and hexadecimal and Unicode escape sequences.
C# supports two forms of string literals: ***regular string literals*** and ***verbatim string literals***. A regular string literal consists of zero or more characters enclosed in double quotes, as in `"hello"`, and can include both simple escape sequences (such as `\t` for the tab character), and hexadecimal and Unicode escape sequences. Both forms use UTF-16 encoding.
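The regular/verbatim distinction in the quoted paragraph can be sketched as follows (class name illustrative):

```csharp
using System;

class StringLiteralSketch
{
    static void Main()
    {
        // Regular literal: simple, hexadecimal and Unicode escapes are processed.
        string regular = "col1\tcol2 \x48 \u0048";   // tab, then 'H' twice
        Console.WriteLine(regular.Contains('\t'));   // True

        // Verbatim literal: backslashes are taken literally.
        string verbatim = @"col1\tcol2";
        Console.WriteLine(verbatim.Contains('\\'));  // True
        Console.WriteLine(verbatim.Length);          // 10
    }
}
```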

Contributor:

Did we miss this when considering interpolated string literals? It feels odd to not mention interpolated string literals anywhere within this section. (Probably not something to fix in this PR, but potentially worth a new issue. See what you think. Feel free to create one and assign it to me if you agree.)

Contributor:

@jskeet – Since when did C# have interpolated string literals? ;-)

That said we should make sure that the text around interpolated string expressions is correct Unicode-wise relative to this PR.

(The rules used in the definition of interpolated string expressions refer to the same Simple_Escape_Sequence, Hexadecimal_Escape_Sequence and Unicode_Escape_Sequence rules used in the definitions of string and character literals. However, the clause for interpolated string expressions (§12.8.3) makes no reference to the escape sequences other than using these three rules in the grammar section.)

Contributor:

Hmm... I've always regarded interpolated strings as a form of string literal. Looks like I was wrong - 12.8.3 explicitly says:

Interpolated string expressions have two forms; regular (interpolated_regular_string_expression) and verbatim (interpolated_verbatim_string_expression); which are lexically similar to, but differ semantically from, the two forms of string literals (§6.4.5.6).

@@ -28,6 +28,7 @@ A conforming implementation is required to document its choice of behavior in ea
1. The maximum value allowed for `Decimal_Digit+` in `PP_Line_Indicator` ([§6.5.8](lexical-structure.md#658-line-directives)).
1. The interpretation of the *input_characters* in the *pp_pragma-text* of a #pragma directive ([§6.5.9](lexical-structure.md#659-pragma-directives)).
1. The values of any application parameters passed to `Main` by the host environment prior to application startup ([§7.1](basic-concepts.md#71-application-startup)).
1. The endianness of UTF-16 code units in a UTF-16-encoded string literal or an instance of the class `string` ([§8.2.5](types.md#825-the-string-type)).

Contributor:

How would this be detected anyway? I don't know offhand whether unsafe code can fix a string into a byte*, which seems the most obvious way of detecting it. If it's undetectable within any specified means, perhaps we don't need to include this?
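For what it's worth, unsafe code can obtain a byte view of a string by fixing it to a char* and casting; a sketch of such a probe (not a recommendation; requires unsafe code, class name illustrative):

```csharp
using System;

class EndianProbeSketch
{
    static unsafe void Main()
    {
        string s = "A";   // single code unit U+0041
        fixed (char* p = s)
        {
            byte* b = (byte*)p;
            // 41-00 on a little-endian implementation, 00-41 on a big-endian one.
            Console.WriteLine(b[0] == 0x41 ? "little-endian" : "big-endian");
        }
        // A safe equivalent: System.Runtime.InteropServices.MemoryMarshal.AsBytes(s.AsSpan()).
    }
}
```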

@@ -107,7 +107,7 @@ The `dynamic` type is further described in [§8.7](types.md#87-the-dynamic-type)

### 8.2.5 The string type

The `string` type is a sealed class type that inherits directly from `object`. Instances of the `string` class represent Unicode character strings.
The `string` type is a sealed class type that inherits directly from `object`. Instances of the `string` class represent a sequence of UTF-16 code units, whose endianness is implementation-defined.

Contributor:

I think I agree with @KalleOlaviNiemitalo here, as per my previous comment... it's possible that requirements of interop mean we need this, but we should discuss it further in a meeting rather than making an assumption. It's possible that we could change the "implementation-defined" line to constrain it to "when the underlying memory of a string is exposed for interop reasons" or similar.

@@ -1334,7 +1334,7 @@ An *interpolated_string_expression* consists of `$`, `$@`, or `@$`, immediately

Interpolated string expressions have two forms; regular (*interpolated_regular_string_expression*)
and verbatim (*interpolated_verbatim_string_expression*); which are lexically similar to, but differ semantically from, the two forms of string
literals ([§6.4.5.6](lexical-structure.md#6456-string-literals)).
literals ([§6.4.5.6](lexical-structure.md#6456-string-literals)). Both forms use UTF-16 encoding.

Contributor:

I don’t think this is correct. The interpolated string expression syntax in a code file leads to a System.IFormattable, System.FormattableString or string instance at runtime, and that string instance is presented as UTF-16.

The syntax itself only exists in the code file and is therefore in whatever encoding the code file is in – UTF-8, ASCII, EBCDIC…

The same applies to string and character literals – they themselves can be in any encoding supported by the implementation for code files, while the values they produce at runtime are presented as UTF-16.

Offhand I’ve no alternative wording suggestion and so would just not make any change.
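The runtime-instance point made in this comment can be illustrated: the interpolation syntax exists only in source, while the value it produces is an ordinary string, or a FormattableString when so targeted. A sketch (class name illustrative):

```csharp
using System;

class InterpolationResultSketch
{
    static void Main()
    {
        int n = 3;

        // Usual case: the interpolated string expression converts to string.
        string s = $"n = {n}";
        Console.WriteLine(s);                    // n = 3

        // Target-typed: the same syntax yields a FormattableString.
        FormattableString f = $"n = {n}";
        Console.WriteLine(f.Format);             // n = {0}
        Console.WriteLine(f.GetArguments()[0]);  // 3
    }
}
```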

@@ -361,7 +361,7 @@ fragment Unicode_Escape_Sequence
;
```

A Unicode character escape sequence represents the single Unicode code point formed by the hexadecimal number following the “\u” or “\U” characters. Since C# uses a 16-bit encoding of Unicode code points in character and string values, a Unicode code point in the range `U+10000` to `U+10FFFF` is represented using two Unicode surrogate code units. Unicode code points above `U+FFFF` are not permitted in character literals. Unicode code points above `U+10FFFF` are invalid and are not supported.
A *Unicode_Escape_Sequence* represents the Unicode code point whose value is the hexadecimal number following the “\u” or “\U” characters. Since C# uses UTF-16 encoding in `char` and `string` values, a Unicode code point in the range `U+10000` to `U+10FFFF` is represented using two UTF-16 surrogate code units. Unicode code points above `U+FFFF` are not permitted in character literals. Unicode code points above `U+10FFFF` are invalid and are not supported.
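The surrogate-pair behavior described in the revised paragraph, observed from code (a sketch; class name illustrative):

```csharp
using System;

class UnicodeEscapeSketch
{
    static void Main()
    {
        // One code point above U+FFFF occupies two UTF-16 code units:
        string s = "\U0001F441";
        Console.WriteLine(s.Length);                              // 2
        Console.WriteLine(char.IsHighSurrogate(s[0]));            // True
        Console.WriteLine(char.ConvertToUtf32(s, 0) == 0x1F441);  // True

        // char c = '\U0001F441';  // error: above U+FFFF, not allowed in a character literal
    }
}
```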

Contributor:

Given the discussion over §8.2.5, I suspect this para will require some careful rewriting if it is agreed that implementations may use different (and multiple) storage models for strings as long as they conform to the API presenting them as UTF-16.

Labels: meeting: proposal; type: clarity

5 participants