This repository has been archived by the owner on Aug 2, 2023. It is now read-only.

Scenarios and Design Philosophy - UTF-8 string support #2368

Closed
GrabYourPitchforks opened this issue Jun 19, 2018 · 9 comments
Labels
area-System.Text.Utf8String design OpenBeforeArchiving These issues were open before the repo was archived. For re-open them, file them in the new repo

@GrabYourPitchforks
Member

(This is a living document. Expect edits.)

Scenarios

High-performance networking stacks

(This includes HttpClient, ASP.NET's Kestrel, and similar APIs.)

High-performance networking stacks want to be able to read incoming data directly into a buffer and perform text processing operations over it. This buffer is most likely going to be a byte[] or similar, and the operations they'll want to perform run the gamut from text tokenization (looking for HTTP headers) to full parsing (e.g., interpreting a sequence of bytes as a human-readable date).

Importantly, these stacks don't want to incur the cost of transcoding or of allocating lots of little string instances just to call methods like String.Split or Int32.Parse. We need to provide an API set that has near-parity (where sensible) with the existing String APIs but which can operate over arbitrary spans of UTF-8 data.

Finally, because these buffers are generally going to be represented as byte[], we need to ensure that callers can perform UTF-8 operations over instances of this type without falling back to unsafe code or bouncing off the Unsafe or MemoryMarshal classes. If the developer needs to wrap the byte[] in a different type before calling the UTF-8 APIs, the wrapping logic should be (a) constant-time, (b) non-allocating, and (c) a single line of code at most.
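As a rough illustration of the scenario above, the sketch below tokenizes and parses a header line directly from UTF-8 bytes without creating any String. It uses the existing System.Buffers.Text.Utf8Parser and span APIs rather than the API set proposed here; the HeaderParsing class and its method name are hypothetical and exist only for this example.

```csharp
using System;
using System.Buffers.Text;

static class HeaderParsing
{
    // Hypothetical helper: parses the integer after the first ':' in a
    // "Name: value" line held as raw UTF-8 bytes. No transcoding and no
    // string allocation takes place.
    public static bool TryParseIntHeaderValue(ReadOnlySpan<byte> line, out int value)
    {
        value = 0;

        // Tokenize: locate the separator by scanning the bytes directly.
        int colon = line.IndexOf((byte)':');
        if (colon < 0)
        {
            return false;
        }

        // Slicing a span is constant-time and does not copy.
        ReadOnlySpan<byte> rest = line.Slice(colon + 1);
        while (!rest.IsEmpty && rest[0] == (byte)' ')
        {
            rest = rest.Slice(1);
        }

        // Utf8Parser reads the leading digits and stops at the first
        // byte that is not part of the number (for example '\r').
        return Utf8Parser.TryParse(rest, out value, out _);
    }
}
```

A byte[] converts implicitly to ReadOnlySpan<byte>, so a caller can pass its receive buffer (or a slice of it) without any wrapping code beyond the call itself.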

Interop with Go / Swift / Python

We want developers to be able to port their applications from other frameworks and have them run on .NET. Currently this may require developers to keep transcoding concerns top-of-mind due to the difference in how they ingest data from other languages (sometimes UTF-8) and how .NET's String type expects that data to be represented (UTF-16).

For example, consider a scenario where a Go developer persists a string (perhaps representing a JSON payload) to a file, then the developer later wants to consume that file from .NET code. .NET's File.ReadAllText and File.WriteAllText APIs use UTF-8 by default, so the transcoding step is hidden from the developer and they just have a pleasant development experience with these APIs returning String.

Unfortunately such goodness does not extend to other concepts in the framework. One example of a problem area is p/invoking with a library that expects string data in UTF-8 format. Using such a library is certainly achievable, but we force developers to have subject matter expertise both in Unicode transcoding concepts and in interop concepts in order to fulfill this scenario.

Another example is the existence of some frequently-used concepts that other languages have (consider Go's rune or Swift's Character) that we hide behind complex APIs. If we made these APIs more approachable, developers would have greater confidence migrating this kind of code to our platform.

Cheap slicing

(This is similar to the high-performance networking stack scenario but deserves its own callout due to the prevalence of such code.)

It is extraordinarily common for developers to call String's Substring, Split, and Trim methods in order to get substrings back from the original string. Our data show that the majority of applications call at least one of the aforementioned APIs. These APIs are particularly prevalent in parsing code paths.

More compact memory representation

Evidence shows that most of the data present in String instances is ASCII. This is due to a number of factors, including the prevalence of Latin-based text on the web. Even in applications that cater to Chinese audiences and other speakers whose languages don't use Latin-based characters, we find that things like OS identifiers are predominantly English. Since UTF-8 ASCII text takes half the memory size of UTF-16 ASCII text, it stands to reason that changing the internal representation of this data can provide significant savings. (Our internal tests over first-party code have borne out this theory.)
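The size claim can be checked with the existing Encoding APIs. This is only a sketch of the arithmetic (payload bytes, ignoring object headers); the EncodedSizes class is a hypothetical name used for illustration.

```csharp
using System.Text;

static class EncodedSizes
{
    // Number of bytes the text occupies when encoded as UTF-8.
    public static int Utf8Bytes(string s) => Encoding.UTF8.GetByteCount(s);

    // Number of bytes the text occupies when encoded as UTF-16
    // (Encoding.Unicode is little-endian UTF-16).
    public static int Utf16Bytes(string s) => Encoding.Unicode.GetByteCount(s);
}
```

For the ASCII identifier "Content-Type" this gives 12 bytes in UTF-8 against 24 in UTF-16. The saving is not universal: the two-character string "中文" needs 6 bytes in UTF-8 but only 4 in UTF-16, which is why the argument rests on most real-world string data being ASCII.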

Reducing the total allocation cost of strings has other beneficial side effects. Oracle's own experiments showed that when they added the compact string feature to Java 9, the reduced memory footprint allowed them to reduce the frequency of garbage collection events, and the time spent in each event decreased (reference, PDF link).

We're not ready to change the internal representation of our String type, but the Utf8String feature does allow developers to use a new API that follows familiar behaviors and patterns while also reducing overall load on the managed runtime.

Serialization / deserialization

Some serializers (JSON, XML) are defined to produce textual data instead of binary data. However, in some situations we know the output of these serializers is going to be written to the wire, and it would improve our runtime performance if we're able to write to the expected wire format (generally UTF-8) directly rather than go through a serialize-then-transcode process.

APIs like StreamWriter also fall under this general umbrella. The most common encoding to provide to the StreamWriter constructor is UTF-8 (this is also the default encoding), which means that any input passed to any of the writer's Write APIs needs first to be converted to a String, then converted to binary, then copied to the output buffer. This is the case even for string literals passed to the APIs! We should instead ensure that the string data is already in the correct encoding format before it's passed to the APIs, and at that point we have a simple memcpy operation.
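A minimal sketch of the "already in the correct encoding" idea, using only existing APIs: the literal is encoded to UTF-8 once and each subsequent write is a plain byte copy. The PreEncodedLiterals class and its members are hypothetical; a first-class UTF-8 string type would presumably remove the need for the explicit one-time encoding step.

```csharp
using System.IO;
using System.Text;

static class PreEncodedLiterals
{
    // Transcoded from UTF-16 to UTF-8 exactly once, when the type is initialized.
    private static readonly byte[] s_header = Encoding.UTF8.GetBytes("<h1>Hello</h1>");

    // Each call copies the pre-encoded bytes to the output; no per-call
    // transcoding of the literal takes place.
    public static void WriteHeader(Stream output)
    {
        output.Write(s_header, 0, s_header.Length);
    }
}
```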

Other candidates for optimization here include ASP.NET's Razor web pages, which operate analogously to StreamWriter. Large chunks of the data written to these pages are literal strings, and having first-class support for a UTF-8 string type would help them avoid unnecessary runtime transcoding.

Interop with code and services

We've been approached by first-party and third-party clients who are consuming SDKs which expect all text data to be in UTF-8 format. That is, they're removing their dependencies on consuming wchar_t*, instead relying on utf8char_t* (aliased as char*). Giving developers the option to use Utf8String instead of String in their code minimizes the overhead of p/invoke and data exchange between the managed and unmanaged worlds.

ML.NET

The ML.NET team currently has the following scenario.

  • Data exists in a Python buffer as UTF-8 encoded binary data.
  • In C++, this UTF-8 binary data is converted to UTF-16 and stored in a std::wstring. (memcpy + transcode)
  • In C#, the wchar_t* is fetched and copied into a managed String instance. (memcpy)
  • A custom struct DvText is used to cheaply slice and parse the string. This is used for tokenization and general primitive type parsing.
  • Eventually, C# p/invokes back into C++, communicating a series of pointers which are translated back into std::wstring instances. (memcpy of substrings)
  • The C++ bridge converts all of these std::wstring instances into a single binary payload in a format expected by Python. (memcpy + transcode)

We can eliminate the transcoding step and provide a type which supports cheap slicing, which should cut down significantly on the overhead the team is seeing in their scenarios. Additionally, we have the opportunity to eliminate the initial memcpy step entirely if that also proves to be a bottleneck.

Design philosophy

Usage, usability, and behaviors

We find ourselves with two conflicting goals. The first goal is performance above all else: fill a buffer with inbound network data, reinterpret_cast it as UTF-8 data, and operate on it. Network protocol stacks are the big consumer here. This can be achieved by providing UTF-8 manipulation methods which operate directly on spans, which has the added benefit of allowing the consumer to remain in full control of all memory allocations.

The second goal is to provide a friendly, usable API surface on an object which represents UTF-8 data. This helps with application migration, data exchange, and using UTF-8 SDKs. Most (but not all) comparable languages represent strings as immutable objects with their own dedicated backing memory. For example, in the Go language, the string([]byte) -> string API converts a byte array to a string, but it does so by making a copy of the underlying data. The returned object has an independent lifetime from the original input array. (In Swift and C++, strings are copy-by-value.)

This implies that for usability's sake we should have a Utf8String type which mimics the behavior developers have come to expect from String: it's an immutable object which holds on to its own copy of the data, and you're able to retrieve the underlying read-only span from the object. This provides something of a universal exchange type, as APIs which need to hydrate a standalone instance of data can return an instance of this type, and developers can always go from Utf8String to span if they need access to the more powerful span-only APIs.

Since we have an immutable reference type, we can also make certain optimizations like ensuring a null terminator (important for p/invoke scenarios) or repurposing flags in the object header to improve the performance of string inspection or manipulation operations.

Ideally at some point in the future we can have full globalization support for UTF-8 sequences, including culture-aware sorting and case conversion routines. This will likely require a sizeable change to the globalization APIs, so it's possible that such a feature would be several versions out. We should at minimum support limited globalization-related operations on UTF-8 sequences, including Ordinal and OrdinalIgnoreCase comparisons, ToUpperInvariant and friends, and allowing the invariant culture to be passed to ToUtf8String APIs.

Performance

Utf8String should have complexity characteristics similar to those of String: constant-time indexing (by code unit), linear-time allocation and searching, etc. For marshaling, we may wish to consider optimizations similar to those that exist on String, e.g., stack-copying small objects rather than pinning the object in the managed heap. It is not a goal to provide constant-time indexing of scalar values or graphemes within either a String or a Utf8String.

While Utf8String is useful for representing incoming UTF-8 data without the need for transcoding, it does still incur the cost of an allocation per instance. As part of this work we may want to consider making StringSlice or Utf8StringSlice first-class types in the framework. One could imagine these types as being thin wrappers (perhaps aliases?) for ReadOnlyMemory<char> and ReadOnlyMemory<Char8> along with most (but not all) of the instance methods on String and Utf8String.

Security and validation

UTF-8 processing has traditionally been a source of security vulnerabilities for applications and frameworks. There are subtleties with data processing that commonly lead to buffer overflows or exceptions in unexpected places.

.NET applications have historically not been subject to these same vulnerabilities because of our internal representations of strings as UTF-16. It's generally difficult for ill-formed UTF-16 sequences to make their way into the system because client-submitted data on the wire is normally in UTF-8 format, and the conversion process from UTF-8 to UTF-16 will naturally replace invalid sequences with a replacement character. When vulnerabilities have been found the culprit has generally been serializers like JSON readers which blindly splat "\uXXXX" sequences into a String rather than go through a proper encode / decode routine.

UTF-8 is much more prone to misuse due to the fact that remote client input is already expected to be in UTF-8 format. Since there's no need for transcoding, there's a greater temptation to reinterpret cast the provided data directly into a UTF-8 container without running through a verifier. This behavior generally leads to problems like those mentioned in the earlier CVE link. Therefore, as much as possible, we should strive to ensure that Utf8String instances represent well-formed UTF-8 data, where well-formed is defined in The Unicode Standard, Chapter 3, Table 3-7.

Any Utf8String factory (where "factory" is anything that returns a Utf8String, including constructors) should perform validation on its inputs, replacing ill-formed sequences with the replacement character U+FFFD. The validation logic should be compatible with the UTF8Encoding class used by the full framework.
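For reference, this is how the existing UTF8Encoding class treats ill-formed input when decoding to UTF-16. It is a sketch of the behavior the factories are meant to stay compatible with, not of the Utf8String factories themselves; the Utf8Validation class is a hypothetical name.

```csharp
using System;
using System.Text;

static class Utf8Validation
{
    // A strict encoding instance that throws instead of substituting U+FFFD.
    private static readonly UTF8Encoding s_strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Default behavior: ill-formed sequences are replaced with U+FFFD.
    public static string DecodeWithReplacement(byte[] bytes) => Encoding.UTF8.GetString(bytes);

    // Validation only: returns false if the input contains an ill-formed sequence.
    public static bool IsWellFormed(byte[] bytes)
    {
        try
        {
            s_strict.GetCharCount(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

Given the bytes 0x41 0xFF 0x42, DecodeWithReplacement returns "A\uFFFDB" (0xFF can never appear in well-formed UTF-8), and IsWellFormed returns false.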

There are a handful of exceptions to this rule. Some callers may know that the input data is already well-formed, perhaps because it has been loaded from a trusted source (like a resource string) or because it has already been validated. There must be "no-validate" equivalents of the factories to allow the caller to avoid the performance hit.

In a nutshell, though we do not require Utf8String instances to be well-formed, our APIs should encourage this as much as possible. APIs which operate on Utf8String instances must be prepared to handle ill-formed input and should behave predictably. For example, the enumerator over Utf8String.AsScalars() must have well-defined behavior for all possible inputs.

One interesting consideration for validation by default is Substring and related APIs. While it's true that this could theoretically be used to split a Utf8String in the middle of a multibyte sequence, in practice developers tend to use this API in a safe fashion. Consider the following two examples.

Utf8String str;
if (str.StartsWith("Foo")) { str = str.Substring(3); }

Utf8String str;
int idx = str.IndexOf("Foo");
if (idx >= 0) { str = str.Substring(idx); }

In both cases, the string is split at a proper scalar boundary due to the fact that the target string is well-formed. And since the target string is almost always a literal (or itself a Utf8String, which we encourage to be well-formed), the split string will likewise be well-formed. Since this represents the typical use case of Substring, we can optimistically avoid validation on this and related calls. (Though if we wanted to perform validation, that's easy enough to do very cheaply.)
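To make the boundary argument concrete, the sketch below works on raw UTF-8 bytes with existing span APIs (Utf8String itself is only proposed here). It shows a byte-wise search landing on a scalar boundary, and one possible form of the cheap check alluded to above: a split point is safe as long as the byte at that position is not a continuation byte. The ScalarBoundaries class and its members are hypothetical.

```csharp
using System;
using System.Text;

static class ScalarBoundaries
{
    // In well-formed UTF-8, continuation bytes have the bit pattern 10xxxxxx.
    // An index is a scalar boundary if it is the end of the data or if the
    // byte at that index is not a continuation byte.
    public static bool IsScalarBoundary(ReadOnlySpan<byte> utf8, int index)
    {
        if ((uint)index > (uint)utf8.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(index));
        }
        return index == utf8.Length || (utf8[index] & 0xC0) != 0x80;
    }

    // Byte-wise ordinal search of one UTF-8 sequence inside another.
    // Returns the byte offset of the first match, or -1.
    public static int IndexOfUtf8(string haystack, string needle)
    {
        ReadOnlySpan<byte> h = Encoding.UTF8.GetBytes(haystack);
        ReadOnlySpan<byte> n = Encoding.UTF8.GetBytes(needle);
        return h.IndexOf(n);
    }
}
```

For the haystack "héFoo" (where é occupies two bytes), searching for "Foo" yields byte offset 3, which is a scalar boundary; offset 2, the second byte of é, is not.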

Validation and inspection

We should expose APIs that allow developers to gather useful information about UTF-8 sequences (not just Utf8String instances), including validation, transcoding, and enumeration of these sequences. There are three kinds of enumeration that are useful for both UTF-16 strings and UTF-8 strings.

  • Enumeration by code unit (Char8 or char) - Provides access to the raw bit data of the string.
  • Enumeration by scalar (UnicodeScalar) - Provides access to the decoded data of the string. Can be used for transcoding purposes or to make ordinal comparisons between strings of different representations.
  • Enumeration by text element - Provides access to the displayed graphemes of the string. Can be used to extract individual "linguistic characters" from the string, including allowing manipulation such as character deletion or string reversal.
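The three kinds of enumeration can be compared on a UTF-16 string with APIs that exist today. This sketch assumes .NET Core 3.0 or later, where System.Text.Rune fills the role this document gives to UnicodeScalar; the EnumerationKinds class is a hypothetical name.

```csharp
using System.Globalization;
using System.Text;

static class EnumerationKinds
{
    // Code units: 8-bit units for UTF-8, 16-bit units (char) for UTF-16.
    public static int CountUtf8CodeUnits(string s) => Encoding.UTF8.GetByteCount(s);
    public static int CountUtf16CodeUnits(string s) => s.Length;

    // Scalars: each Rune is one decoded Unicode scalar value.
    public static int CountScalars(string s)
    {
        int count = 0;
        foreach (Rune rune in s.EnumerateRunes())
        {
            count++;
        }
        return count;
    }

    // Text elements: what a reader would perceive as individual characters.
    public static int CountTextElements(string s)
    {
        int count = 0;
        TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(s);
        while (elements.MoveNext())
        {
            count++;
        }
        return count;
    }
}
```

For the string "e\u0301\U0001F600" (an e followed by a combining acute accent, then an emoji) the counts are 7 UTF-8 code units, 4 UTF-16 code units, 3 scalars, and 2 text elements.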

The APIs we provide should be powerful and low-level enough for developers to build their own higher-level APIs on top, adding value where those developers see fit. As a concrete example, we needn't provide an API which says "the next scalar in the input string is CYRILLIC SMALL LETTER IOTIFIED A". But we should have an API which allows the developer to see that the next scalar in the input string is U+A657, allowing the developer to build their own higher-level API which then maps U+A657 to "CYRILLIC SMALL LETTER IOTIFIED A" (see code chart PDF).

Open question: should the framework provide a text element / grapheme enumerator? Or does it perhaps fall into the "separate component provides this facility using our lower-level APIs as implementation details" category?

Code units, byte, Char8, and UnicodeScalar

First, some quick definitions:

  • Code unit: The fundamental data type of a string. The code unit for UTF-16 text is a 16-bit integral type (char, distinct from ushort and short). The code unit for UTF-8 text is an 8-bit integral type (tentatively Char8, distinct from byte and sbyte).

  • Code point: Any value in the Unicode codespace (U+0000..U+10FFFF). Not all code points have representations in UTF text; for example, the code points U+D800..U+DFFF are reserved exclusively for UTF-16 text and make sense only when combined to form a scalar value.

  • Scalar value: Any value in the range [U+0000..U+D7FF], inclusive; or [U+E000..U+10FFFF], inclusive. In other words, the set of all code points minus the set of all UTF-16 surrogate code points. All scalar values have unique representations in UTF-8, UTF-16, and UTF-32 text. Well-formed UTF-8, UTF-16, and UTF-32 text is defined as a sequence of scalar values which have been properly encoded (into one or more code units per scalar value) per the UTF being targeted.

  • Grapheme: A single display character which may be composed of one or more scalar values. For example, the "woman firefighter" emoji is a single grapheme which consists of the three-scalar sequence [ U+1F469 (woman), U+200D (zero width joiner), U+1F692 (fire engine) ]. A more layman's way to think of a grapheme is in the context of a text editor: if the user hits the backspace key, what symbols would they reasonably expect to disappear from the screen? (We don't consider in-box grapheme support in this proposal.)
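The code point and scalar value definitions above translate directly into range checks. The following is a small sketch with hypothetical names (CodePoints and its methods); the UTF-8 length table follows the standard encoding ranges.

```csharp
using System;

static class CodePoints
{
    // Any value in the Unicode codespace U+0000..U+10FFFF.
    public static bool IsCodePoint(int value) => (uint)value <= 0x10FFFF;

    // The surrogate code points U+D800..U+DFFF.
    public static bool IsSurrogate(int value) => value >= 0xD800 && value <= 0xDFFF;

    // Scalar value: a code point that is not a surrogate.
    public static bool IsScalarValue(int value) => IsCodePoint(value) && !IsSurrogate(value);

    // Number of UTF-8 code units needed to encode a scalar value.
    public static int Utf8SequenceLength(int scalar)
    {
        if (!IsScalarValue(scalar))
        {
            throw new ArgumentOutOfRangeException(nameof(scalar));
        }
        if (scalar < 0x80) return 1;
        if (scalar < 0x800) return 2;
        if (scalar < 0x10000) return 3;
        return 4;
    }
}
```

On .NET Core 3.0 and later, System.Text.Rune.IsValid(int) performs the same scalar value check.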

High-performance scenarios require that we add APIs which operate on UTF-8 text in the form of spans. The simplest way to do this is to add UTF-8 extension methods on ReadOnlySpan<byte>, but this comes with two big problems. First, containers of byte are primarily used as an exchange for binary blob data. Any extension methods on ReadOnlySpan<byte> will show up both for spans that contain UTF-8 text and for spans that contain binary data. Additionally, managed type systems tend to draw a strong distinction between objects which contain different types of data (and developers generally depend on compile time type checks to catch mistakes), and we don't want to subvert the type system in this manner.

Our solution to this is to introduce an integral Char8 type to represent the code unit of UTF-8 textual data. Just as ReadOnlySpan<char> (represents a UTF-16 string) is distinct from ReadOnlySpan<ushort> (represents integral data), ReadOnlySpan<Char8> (represents a UTF-8 string) is distinct from ReadOnlySpan<byte> (represents binary data). Any span-based extension methods we create will take ReadOnlySpan<Char8> as the this parameter.

To support this concept, we'll also need to add AsUtf8(this ReadOnlySpan<byte>) : ReadOnlySpan<Char8> and AsBytes(this ReadOnlySpan<Char8>) : ReadOnlySpan<byte> extension methods. This does technically allow subverting the type system in the sense that it allows reinterpreting textual data as binary data and vice versa, but it has the advantage that the call site makes very clear that the developer is changing the representation of the data between text and binary.
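One way the two conversions could look, as a sketch only: the Char8 struct below is a hypothetical stand-in for the proposed type (a single byte of state and nothing else), and the Utf8SpanExtensions class name is made up. The reinterpretation uses the existing MemoryMarshal.Cast, which is constant-time and non-allocating.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical stand-in for the proposed Char8 code unit type.
public readonly struct Char8
{
    private readonly byte _value;

    public Char8(byte value) => _value = value;

    public byte Value => _value;
}

public static class Utf8SpanExtensions
{
    // Reinterprets binary data as UTF-8 code units. No copy and no validation.
    public static ReadOnlySpan<Char8> AsUtf8(this ReadOnlySpan<byte> bytes)
        => MemoryMarshal.Cast<byte, Char8>(bytes);

    // Reinterprets UTF-8 code units as binary data. No copy.
    public static ReadOnlySpan<byte> AsBytes(this ReadOnlySpan<Char8> text)
        => MemoryMarshal.Cast<Char8, byte>(text);
}
```

Because both types are one byte wide, the resulting span has the same length as the source, and the explicit AsUtf8() or AsBytes() call is what marks the text/binary transition at the call site.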

Finally, we introduce a UnicodeScalar type that can contain any valid Unicode scalar value. Instances of this type can be converted to any representation (UTF-8 / UTF-16 / UTF-32). Overloads of Utf8String.IndexOf and Utf8String.Contains take instances of UnicodeScalar in place of Char8 because a single Char8 in isolation can really only represent an ASCII character. In addition, we'll provide APIs for enumerating UnicodeScalar instances from a UTF-8 sequence both forward and reverse. (We can extend this support to enumerating scalars from UTF-16 sequences in the future if demand exists.)

@svick
Contributor

svick commented Jun 19, 2018

Since this represents the typical use case of Substring, we can optimistically avoid validation on this and related calls. (Though if we wanted to perform validation, that's easy enough to do very cheaply.)

What is the reason to avoid validation if it's very cheap?

we'll provide APIs for enumerating UnicodeScalar instances from a UTF-8 sequence [in] reverse

What is the use case for that? I've heard "reverse a string" being used as an interview question, but when is it actually useful?

@JesperTreetop

JesperTreetop commented Jun 19, 2018

What is the use case for that? I've heard "reverse a string" being used as an interview question, but when is it actually useful?

Reversing a string is not the only use for reverse enumeration.

Reverse enumeration helps if you're parsing a text file or text-based format where you know you only need data at the end. tail is a good example; grabbing the last row of a CSV file (being very careful about the particulars of course) is another. In general, you could "skip back" a bit and go forward from there, but in UTF-8 in particular, you have to finagle the details to find out if you're in the middle of a multi-byte sequence or not. The code to be able to go backwards in this sequence should be written once, be tuned and tested error-free and live precisely in this type, if you ask me.

That said, in this case, it would probably be more useful to application code to enumerate the graphemes instead of the scalars... but that could be said about nearly all of the exposed APIs, and doing that starts with being able to enumerate the scalars.

@cdorst

cdorst commented Jun 19, 2018

To add data-persistence to the above scenarios: it would be useful if EF Core used the distinction between System.Utf8String & System.String when mapping objects (& generating migrations) to varchar (Utf8String) & nvarchar (String) SQL data types.

Ex: class Foo { [StringLength(50)] public Utf8String Bar { get; set; } } => SQL table Foo w/ varchar(50) Bar column; (the value-add is to not have to explicitly define the [Column(TypeName = "varchar(50)")] metadata on the entity type or in the DbContext for Utf8String properties)

@JesperTreetop

@cdorst Ignoring that these methods would need to exist in the framework or in packages first for EF Core to have an opinion about them, varchar fields (without the n) are specifically tied to the collation and contain bytes in the encoding specified by its code page. There's no guarantee that the encoding is UTF-8 compatible, or even that it uses a single or a variable number of bytes per character, since some code pages specify double-byte character sets. This could work okay with databases defined from the ground up by EF Core's migrations, which could do the thing you intend, but it would mismatch with fields that already have collations set, or that use the default collation. This definitely does not apply safely universally.

@cdorst

cdorst commented Jun 19, 2018

Thank you for the reply @JesperTreetop! That makes sense about being mindful of the db’s collation/encoding & not expecting it to be universally safe.

@GrabYourPitchforks
Member Author

@JesperTreetop @cdorst The typical example I give for reverse enumeration is a Trim API. In UTF-16 this is easy enough because all whitespace characters fit cleanly into a 16-bit char, so you can just use a simple forward indexer and reverse indexer. In UTF-8 some whitespace characters are represented as a multi-byte sequence, so you can't use a simple indexer; enumeration is easier for these scenarios.
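A sketch of what such a trim could look like over raw UTF-8 bytes, assuming .NET Core 3.0 or later, where Rune.DecodeLastFromUtf8 provides reverse decoding of a single scalar. The Utf8Trimming class is a hypothetical name; this is not the API proposed in this issue.

```csharp
using System;
using System.Buffers;
using System.Text;

static class Utf8Trimming
{
    // Removes trailing whitespace scalars from a UTF-8 sequence by decoding
    // one scalar at a time from the end. Stops at the first non-whitespace
    // scalar or at ill-formed data, which is left in place.
    public static ReadOnlySpan<byte> TrimEnd(ReadOnlySpan<byte> utf8)
    {
        while (!utf8.IsEmpty)
        {
            OperationStatus status = Rune.DecodeLastFromUtf8(utf8, out Rune rune, out int bytesConsumed);
            if (status != OperationStatus.Done || !Rune.IsWhiteSpace(rune))
            {
                break;
            }

            // Drop the whole scalar, which may span several bytes.
            utf8 = utf8.Slice(0, utf8.Length - bytesConsumed);
        }
        return utf8;
    }
}
```

For example, the UTF-8 encoding of "abc" followed by a space and U+3000 (IDEOGRAPHIC SPACE, three bytes in UTF-8) is 7 bytes long and trims to the 3 bytes of "abc".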

@scalablecory

In addition, we'll provide APIs for enumerating UnicodeScalar instances from a UTF-8 sequence both forward and reverse. (We can extend this support to enumerating scalars from UTF-16 sequences in the future if demand exists.)

If it speeds up developing this spec, I think this part should be left out.

First reason being: 99% of apps are only copying and concatenating strings, and don't care about any text processing.

Second reason being: of the 1% of other apps, not many of them are doing it correctly and the ones that are doing it correctly are all duplicating code. It's probably a larger topic to discover which APIs could be made to help here such as iterating/indexing by code point and grapheme cluster.

@jaykrell

jaykrell commented Dec 4, 2018

Reversing a string is useful in converting integer to string.
Though it is borderline.
You either make a pass over the integer to compute the length, allocate the buffer, iterate over the integer again, placing characters correctly.
Or you make a pass over the integer, placing characters backwards, and then reverse the string.

Reversing a string is useful for plain text search maybe.
Suffix search is prefix search over reversed data?
Or maybe there is a more direct approach.

Reverse is sometimes useful for file path canonicalization.
Instead of having .. back up, you have .. increase a counter and skip forward, having reversed the string at the start and again when done.
I guess the other way is iteration from the end, copy to the end, and then one extra copy to the front when done. Two passes instead of three.

Notice that reverse can often be encoding-ignorant, if the special characters being searched for are single byte. Also the integer to string case, assuming 0-9, can just be byte-wise.
Likewise, iterating from the end can be encoding-ignorant depending on what is being done.

For some reason the reversing algorithms come to mind first, but the non-reversing algorithms are slightly more efficient.

@roji
Member

roji commented Dec 11, 2018

Seeing the great work being done here (i.e. https://github.com/dotnet/corefx/issues/30503), I'm hoping that a no-copy, zero-allocation alternative/complement to UTF8String is still being considered. As above:

We find ourselves with two conflicting goals. The first goal is performance above all else: fill a buffer with inbound network data, reinterpret_cast it as UTF-8 data, and operate on it. Network protocol stacks are the big consumer here. This can be achieved by providing UTF-8 manipulation methods which operate directly on spans, which has the added benefit of allowing the consumer to remain in full control of all memory allocations.

and

While Utf8String is useful for representing incoming UTF-8 data without the need for transcoding, it does still incur the cost of an allocation per instance. As part of this work we may want to consider making StringSlice or Utf8StringSlice first-class types in the framework. One could imagine these types as being thin wrappers (perhaps aliases?) for ReadOnlyMemory<char> and ReadOnlyMemory<Char8> along with most (but not all) of the instance methods on String and Utf8String.

Indeed, some sort of way to work directly on a Span<byte> as UTF8 data seems to be highly desirable, with .NET's current perf goals and approach. Have I missed a conversation around this elsewhere or has one not yet been started?

On a related note, if we do go down this route we'd have at least three string-like types:

  • String
  • UTF8String
  • Utf8StringSlice (or similar)

It seems like it would be good to have some uniform way which would allow performing string operations while hiding the particular underlying implementation (a bit like how spans can be constructed over byte[], byte*...).
