This repository has been archived by the owner on Aug 2, 2023. It is now read-only.

Utf8String design proposal #2350

Closed
GrabYourPitchforks opened this issue Jun 6, 2018 · 188 comments

Utf8String design discussion - last edited 14-Sep-19

Utf8String design overview

Audience and scenarios

Utf8String and related concepts are meant for modern internet-facing applications that need to speak "the language of the web" (or I/O in general, really). Today such applications spend time transcoding UTF-8 data to and from representations that aren't particularly useful to them, which wastes CPU cycles and memory.

A naive way to accomplish this would be to represent UTF-8 data as byte[] / Span<byte>, but this leads to a usability pit of failure. Developers would then become dependent on situational awareness and code hygiene to be able to know whether a particular byte[] instance is meant to represent binary data or UTF-8 textual data, leading to situations where it's very easy to write code like byte[] imageData = ...; imageData.ToUpperInvariant();. This defeats the purpose of using a typed language.

We want to expose enough functionality to make the Utf8String type usable and desirable by our developer audience, but it's not intended to serve as a full drop-in replacement for its sibling type string. For example, we might add Utf8String-related overloads to existing APIs in the System.IO namespace, but we wouldn't add an overload Assembly.LoadFrom(Utf8String assemblyName).

In addition to networking and I/O scenarios, it's expected that there will be an audience who will want to use Utf8String for interop scenarios, especially when interoperating with components written in Rust or Go. Both of these languages use UTF-8 as their native string representation, and providing a type which can be used as a data exchange type for that audience will make their scenarios a bit easier.

Finally, we should afford power developers the opportunity to improve their throughput and memory utilization by limiting data copying where feasible. This doesn't imply that we must be allocation-free or zero-copy for every scenario. But it does imply that we should investigate common operations and consider alternative ways of performing these tasks as long as it doesn't compromise the usability of the mainline scenarios.

It's important to call out that Utf8String is not intended to be a replacement for string. The standard UTF-16 string will remain the core primitive type used throughout the .NET ecosystem and will enjoy the largest supported API surface area. We expect that developers who use Utf8String in their code bases will do so deliberately, either because they're working in one of the aforementioned scenarios or because they find other aspects of Utf8String (such as its API surface or behavior guarantees) desirable.

Design decisions and type API

To make internal Utf8String implementation details easier, and to allow consumers to better reason about the type's behavior, the Utf8String type maintains the following invariants:

  • Instances are immutable. Once data is copied to the Utf8String instance, it is unchanging for the lifetime of the instance. All members on Utf8String are thread-safe.

  • Instances are heap-allocated. This is a standard reference type, like string and object.

  • The backing data is guaranteed well-formed UTF-8. It can be round-tripped through string (or any other Unicode-compatible encoding) and back without any loss of fidelity. It can be passed verbatim to any other component whose contract requires that it operate only on well-formed UTF-8 data.

  • The backing data is null-terminated. If the Utf8String instance is pinned, the resulting byte* can be passed to any API which takes a LPCUTF8STR parameter. (Like string, Utf8String instances can contain embedded nulls.)

These invariants help shape the proposed API and usage examples as described throughout this document.
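As an illustration of the null-termination invariant, a pinned instance can be handed directly to native code. The following is a hedged sketch against the proposed surface; NativeApiTakingLpcUtf8Str is a hypothetical p/invoke, not part of this proposal:

```csharp
// Sketch against the proposed API surface; 'NativeApiTakingLpcUtf8Str' is a
// hypothetical p/invoke that accepts a null-terminated LPCUTF8STR parameter.
Utf8String fileName = new Utf8String("log.txt");

unsafe
{
    // GetPinnableReference participates in the pattern-based 'fixed' statement.
    // Because the backing data is null-terminated, the raw pointer can be
    // passed as-is with no copying or terminator fix-up.
    fixed (byte* pFileName = fileName)
    {
        NativeApiTakingLpcUtf8Str(pFileName);
    }
}
```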

[Serializable]
public sealed class Utf8String : IComparable<Utf8String>, IEquatable<Utf8String>, ISerializable
{
    public static readonly Utf8String Empty; // matches String.Empty

    /*
     * CTORS AND FACTORIES
     *
     * These ctors all have "throw on invalid data" behavior since it's intended that data should
     * be faithfully retained and should be round-trippable back to its original encoding.
     */

    public Utf8String(byte[]? value, int startIndex, int length);
    public Utf8String(char[]? value, int startIndex, int length);
    public Utf8String(ReadOnlySpan<byte> value);
    public Utf8String(ReadOnlySpan<char> value);
    public Utf8String(string value);

    // These ctors expect null-terminated UTF-8 or UTF-16 input.
    // They'll compute strlen / wcslen on the caller's behalf.

    public unsafe Utf8String(byte* value);
    public unsafe Utf8String(char* value);

    public static Utf8String Create<TState>(int length, TState state, SpanAction<byte, TState> action);

    // "Try" factories are non-throwing equivalents of the above methods. They use a try pattern instead
    // of throwing if invalid input is detected.

    public static bool TryCreateFrom(ReadOnlySpan<byte> buffer, out Utf8String? value);
    public static bool TryCreateFrom(ReadOnlySpan<char> buffer, out Utf8String? value);

    // "Loose" factories also perform validation, but if an invalid sequence is detected they'll
    // silently fix it up by performing U+FFFD substitution in the returned Utf8String instance
    // instead of throwing.

    public static Utf8String CreateFromLoose(ReadOnlySpan<byte> buffer);
    public static Utf8String CreateFromLoose(ReadOnlySpan<char> buffer);
    public static Utf8String CreateLoose<TState>(int length, TState state, SpanAction<byte, TState> action);

    // "Unsafe" factories skip validation entirely. It's up to the caller to uphold the invariant
    // that Utf8String instances only ever contain well-formed UTF-8 data.

    [RequiresUnsafe]
    public static Utf8String UnsafeCreateWithoutValidation(ReadOnlySpan<byte> utf8Contents);
    [RequiresUnsafe]
    public static Utf8String UnsafeCreateWithoutValidation<TState>(int length, TState state, SpanAction<byte, TState> action);

    /*
     * ENUMERATION
     *
     * Since there's no this[int] indexer on Utf8String, these properties allow enumeration
     * of the contents as UTF-8 code units (Bytes), as UTF-16 code units (Chars), or as
     * Unicode scalar values (Runes). The enumerable struct types are defined at the bottom
     * of this type.
     */

    public ByteEnumerable Bytes { get; }
    public CharEnumerable Chars { get; }
    public RuneEnumerable Runes { get; }

    // Also allow iterating over extended grapheme clusters (not yet ready).
    // public GraphemeClusterEnumerable GraphemeClusters { get; }

    /*
     * COMPARISON
     *
     * All comparisons are Ordinal unless the API takes a parameter such
     * as a StringComparison or CultureInfo.
     */

    // The "AreEquivalent" APIs compare UTF-8 data against UTF-16 data for equivalence, where
    // equivalence is defined as "the texts would transcode as each other".
    // (Shouldn't these methods really be on a separate type?)

    public static bool AreEquivalent(Utf8String? utf8Text, string? utf16Text);
    public static bool AreEquivalent(Utf8Span utf8Text, ReadOnlySpan<char> utf16Text);
    public static bool AreEquivalent(ReadOnlySpan<byte> utf8Text, ReadOnlySpan<char> utf16Text);
    
    public int CompareTo(Utf8String? other);
    public int CompareTo(Utf8String? other, StringComparison comparisonType);

    public override bool Equals(object? obj); // 'obj' must be Utf8String, not string
    public static bool Equals(Utf8String? left, Utf8String? right);
    public static bool Equals(Utf8String? left, Utf8String? right, StringComparison comparisonType);
    public bool Equals(Utf8String? value);
    public bool Equals(Utf8String? value, StringComparison comparisonType);

    public static bool operator !=(Utf8String? left, Utf8String? right);
    public static bool operator ==(Utf8String? left, Utf8String? right);

    /*
     * SEARCHING
     *
     * Like comparisons, all searches are Ordinal unless the API takes a
     * parameter dictating otherwise.
     */
    
    public bool Contains(char value);
    public bool Contains(char value, StringComparison comparisonType);
    public bool Contains(Rune value);
    public bool Contains(Rune value, StringComparison comparisonType);
    public bool Contains(Utf8String value);
    public bool Contains(Utf8String value, StringComparison comparisonType);

    public bool EndsWith(char value);
    public bool EndsWith(char value, StringComparison comparisonType);
    public bool EndsWith(Rune value);
    public bool EndsWith(Rune value, StringComparison comparisonType);
    public bool EndsWith(Utf8String value);
    public bool EndsWith(Utf8String value, StringComparison comparisonType);

    public bool StartsWith(char value);
    public bool StartsWith(char value, StringComparison comparisonType);
    public bool StartsWith(Rune value);
    public bool StartsWith(Rune value, StringComparison comparisonType);
    public bool StartsWith(Utf8String value);
    public bool StartsWith(Utf8String value, StringComparison comparisonType);

    // TryFind is the equivalent of IndexOf. It returns a Range instead of an integer
    // index because there's no this[int] indexer on the Utf8String type, and encouraging
    // developers to slice by integer indices will almost certainly lead to bugs.
    // More on this later.

    public bool TryFind(char value, out Range range);
    public bool TryFind(char value, StringComparison comparisonType, out Range range);
    public bool TryFind(Rune value, out Range range);
    public bool TryFind(Rune value, StringComparison comparisonType, out Range range);
    public bool TryFind(Utf8String value, out Range range);
    public bool TryFind(Utf8String value, StringComparison comparisonType, out Range range);

    public bool TryFindLast(char value, out Range range);
    public bool TryFindLast(char value, StringComparison comparisonType, out Range range);
    public bool TryFindLast(Rune value, out Range range);
    public bool TryFindLast(Rune value, StringComparison comparisonType, out Range range);
    public bool TryFindLast(Utf8String value, out Range range);
    public bool TryFindLast(Utf8String value, StringComparison comparisonType, out Range range);

    /*
     * SLICING
     *
     * All slicing operations uphold the "well-formed data" invariant and
     * validate that creating the new substring instance will not split a
     * multi-byte UTF-8 subsequence. This check is O(1).
     */

    public Utf8String this[Range range] { get; }

    public (Utf8String Before, Utf8String? After) SplitOn(char separator);
    public (Utf8String Before, Utf8String? After) SplitOn(char separator, StringComparison comparisonType);
    public (Utf8String Before, Utf8String? After) SplitOn(Rune separator);
    public (Utf8String Before, Utf8String? After) SplitOn(Rune separator, StringComparison comparisonType);
    public (Utf8String Before, Utf8String? After) SplitOn(Utf8String separator);
    public (Utf8String Before, Utf8String? After) SplitOn(Utf8String separator, StringComparison comparisonType);

    public (Utf8String Before, Utf8String? After) SplitOnLast(char separator);
    public (Utf8String Before, Utf8String? After) SplitOnLast(char separator, StringComparison comparisonType);
    public (Utf8String Before, Utf8String? After) SplitOnLast(Rune separator);
    public (Utf8String Before, Utf8String? After) SplitOnLast(Rune separator, StringComparison comparisonType);
    public (Utf8String Before, Utf8String? After) SplitOnLast(Utf8String separator);
    public (Utf8String Before, Utf8String? After) SplitOnLast(Utf8String separator, StringComparison comparisonType);

    /*
     * INSPECTION & MANIPULATION
     */

    // some number of overloads to help avoid allocation in the common case
    public static Utf8String Concat<T>(params IEnumerable<T> values);
    public static Utf8String Concat<T0, T1>(T0 value0, T1 value1);
    public static Utf8String Concat<T0, T1, T2>(T0 value0, T1 value1, T2 value2);

    public bool IsAscii();

    public bool IsNormalized(NormalizationForm normalizationForm = NormalizationForm.FormC);

    public static Utf8String Join<T>(char separator, params IEnumerable<T> values);
    public static Utf8String Join<T>(Rune separator, params IEnumerable<T> values);
    public static Utf8String Join<T>(Utf8String? separator, params IEnumerable<T> values);

    public Utf8String Normalize(NormalizationForm normalizationForm = NormalizationForm.FormC);

    // Do we also need Insert, Remove, etc.?

    public Utf8String Replace(char oldChar, char newChar); // Ordinal
    public Utf8String Replace(char oldChar, char newChar, StringComparison comparison);
    public Utf8String Replace(char oldChar, char newChar, bool ignoreCase, CultureInfo culture);
    public Utf8String Replace(Rune oldRune, Rune newRune); // Ordinal
    public Utf8String Replace(Rune oldRune, Rune newRune, StringComparison comparison);
    public Utf8String Replace(Rune oldRune, Rune newRune, bool ignoreCase, CultureInfo culture);
    public Utf8String Replace(Utf8String oldText, Utf8String newText); // Ordinal
    public Utf8String Replace(Utf8String oldText, Utf8String newText, StringComparison comparison);
    public Utf8String Replace(Utf8String oldText, Utf8String newText, bool ignoreCase, CultureInfo culture);

    public Utf8String ToLower(CultureInfo culture);
    public Utf8String ToLowerInvariant();

    public Utf8String ToUpper(CultureInfo culture);
    public Utf8String ToUpperInvariant();

    // The Trim* APIs only trim whitespace for now. When we figure out how to trim
    // additional data we can add the appropriate overloads.

    public Utf8String Trim();
    public Utf8String TrimStart();
    public Utf8String TrimEnd();

    /*
     * PROJECTING
     */

    public ReadOnlySpan<byte> AsBytes(); // perhaps an extension method instead?
    public static explicit operator ReadOnlySpan<byte>(Utf8String? value);
    public static implicit operator Utf8Span(Utf8String? value);

    /*
     * MISCELLANEOUS
     */
    
    public override int GetHashCode(); // Ordinal
    public int GetHashCode(StringComparison comparisonType);

    // Used for pinning and passing to p/invoke. If the input Utf8String
    // instance is empty, returns a reference to the null terminator.

    [EditorBrowsable(EditorBrowsableState.Never)]
    public ref readonly byte GetPinnableReference();

    public static bool IsNullOrEmpty(Utf8String? value);
    public static bool IsNullOrWhiteSpace(Utf8String? value);

    public override string ToString(); // transcode to UTF-16

    /*
     * SERIALIZATION
     * (Throws an exception on deserialization if data is invalid.)
     */
    
    // Could also use an IObjectReference if we didn't want to implement the deserialization ctor.
    private Utf8String(SerializationInfo info, StreamingContext context);
    void ISerializable.GetObjectData(SerializationInfo info, StreamingContext context);

    /*
     * HELPER NESTED STRUCTS
     */

    public readonly struct ByteEnumerable : IEnumerable<byte> { /* ... */ }
    public readonly struct CharEnumerable : IEnumerable<char> { /* ... */ }
    public readonly struct RuneEnumerable : IEnumerable<Rune> { /* ... */ }
}

public static class MemoryExtensions
{
    public static ReadOnlyMemory<byte> AsMemory(Utf8String value);
    public static ReadOnlyMemory<byte> AsMemory(Utf8String value, int offset);
    public static ReadOnlyMemory<byte> AsMemory(Utf8String value, int offset, int count);
}
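To make the Range-based search surface concrete, here is a hedged usage sketch (assuming the proposed TryFind method and range indexer). The Range flows straight back into the indexer, so the caller never does byte arithmetic that could split a multi-byte sequence:

```csharp
// Sketch against the proposed API: TryFind returns a Range rather than an
// integer index, so the result composes directly with the this[Range] indexer.
Utf8String greeting = new Utf8String("Hello, world!");

if (greeting.TryFind(',', out Range match))
{
    Utf8String beforeComma = greeting[..match.Start]; // u"Hello"
    Utf8String afterComma  = greeting[match.End..];   // u" world!"
}
```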

Non-allocating types

While Utf8String is an allocating, heap-based, null-terminated type, there are scenarios where a developer may want to represent a segment (or "slice") of UTF-8 data from an existing buffer without incurring an allocation.

The Utf8Segment (alternative name: Utf8Memory) and Utf8Span types can be used for this purpose. They represent a view into UTF-8 data, with the following guarantees:

  • They are immutable views into immutable data.
  • They are guaranteed well-formed UTF-8 data. (Tearing will be covered shortly.)

These types have Utf8String-like methods hanging off of them as instance methods where appropriate. Additionally, they can be projected as ROM<byte> and ROS<byte> for developers who want to deal with the data at the raw binary level or who want to call existing extension methods on the ROM and ROS types.

Since Utf8Segment and Utf8Span are standalone types distinct from ROM and ROS, they can have behaviors that developers have come to expect from string-like types. For example, Utf8Segment (unlike ROM<char> or ROM<byte>) can be used as a key in a dictionary without jumping through hoops:

Dictionary<Utf8Segment, int> dict = ...;

Utf8String theString = u"hello world";
Utf8Segment segment = theString.AsMemory(0, 5); // u"hello"

if (dict.TryGetValue(segment, out int value))
{
    Console.WriteLine(value);
}

Utf8Span instances can be compared against each other:

Utf8Span data1 = ...;
Utf8Span data2 = ...;

int hashCode = data1.GetHashCode(); // Marvin32 hash

if (data1 == data2) { /* ordinal comparison of contents */ }

An alternative design that was considered was to introduce a type Char8 that would represent an 8-bit code unit - it would serve as the elemental type of Utf8String and its slices. However, ReadOnlyMemory<Char8> and ReadOnlySpan<Char8> were a bit unwieldy for a few reasons.

First, there was confusion as to what ROS<Char8> actually meant when the developer could use ROS<byte> for everything. Was ROS<Char8> actually providing guarantees that ROS<byte> couldn't? (No.) When would I ever want to use a lone Char8 by itself rather than as part of a larger sequence? (You probably wouldn't.)

Second, it introduced a complication that if you had a ROM<Char8>, it couldn't be converted to a ROM<byte>. This impacted the ability to perform text manipulation and then act on the data in a binary fashion, such as sending it across the network.

Creating segment types

Segment types can be created safely from Utf8String backing objects. As mentioned earlier, we enforce that data in the UTF-8 segment types is well-formed. This implies that an instance of a segment type cannot represent data that has been sliced in the middle of a multibyte boundary. Calls to slicing APIs will throw an exception if the caller tries to slice the data in such a manner.

The Utf8Segment type introduces additional complexity in that it could be torn in a multi-threaded application, and that tearing may invalidate the well-formedness assumption by causing the torn segment to begin or end in the middle of a multi-byte UTF-8 subsequence. To resolve this issue, any instance method on Utf8Segment (including its projection to ROM<byte>) must first validate that the instance has not been torn. If the instance has been torn, an exception is thrown. This check is O(1) algorithmic complexity.
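One plausible way to make that check O(1) - illustrative only, not necessarily the actual implementation - relies on the fact that a slice boundary in well-formed UTF-8 is valid exactly when it does not land on a continuation byte:

```csharp
// Illustrative O(1) boundary test: UTF-8 continuation bytes have the bit
// pattern 0b10xxxxxx, so a boundary into a well-formed buffer is valid iff
// the byte it points at (if any) is not a continuation byte.
static bool IsValidBoundary(ReadOnlySpan<byte> data, int index)
{
    if (index == 0 || index == data.Length)
    {
        return true; // the ends of a well-formed buffer are always valid
    }

    return (data[index] & 0xC0) != 0x80;
}
```

Two such checks (one per endpoint) suffice to detect a torn segment that begins or ends mid-sequence, without scanning the contents.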

It is possible that the developer will want to create a Utf8Segment or Utf8Span instance from an existing buffer (such as a pooled buffer). There are zero-cost APIs to allow this to be done; however, they are unsafe because they easily allow the developer to violate invariants held by these types.

If the developer wishes to call the unsafe factories, they must ensure the following three invariants hold.

  1. The provided buffer (ROM<byte> or ROS<byte>) remains "alive" and immutable for the duration of the Utf8Segment or Utf8Span's existence. Whichever component receives a Utf8Segment or Utf8Span - however the instance has been created - must never observe that the underlying contents change or that dereferencing the contents might result in an AV or other undefined behavior.

  2. The provided buffer contains only well-formed UTF-8 data, and the boundaries of the buffer do not split a multibyte UTF-8 sequence.

  3. For Utf8Segment in particular, the caller must not create a Utf8Segment instance wrapped around a ROM<byte> in circumstances where the component which receives the newly created Utf8Segment might tear it. The reason for this is that the "check that the Utf8Segment instance was not torn across a multi-byte subsequence" protection is only reliable when the Utf8Segment instance is backed by a Utf8String. The Utf8Segment type makes a best effort to offer protection for other backing buffers, but this protection is not ironclad in those scenarios. This could lead to a violation of invariant (2) immediately above.
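As a hedged sketch of what upholding these invariants might look like in practice with a pooled buffer (FillWithKnownGoodUtf8 and UseTheSpan are hypothetical stand-ins for the producer and consumer):

```csharp
// Sketch against the proposed API; the caller vouches for all three invariants.
byte[] rented = ArrayPool<byte>.Shared.Rent(256);
int written = FillWithKnownGoodUtf8(rented); // hypothetical: writes only well-formed UTF-8

// Invariant (2) is on us: 'written' must not split a multi-byte sequence.
Utf8Span span = Utf8Span.UnsafeCreateWithoutValidation(rented.AsSpan(0, written));

UseTheSpan(span); // hypothetical consumer; must finish before the buffer is reused

// Invariant (1): return the buffer only once no component can still observe 'span'.
ArrayPool<byte>.Shared.Return(rented);
```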

The type design here - including the constraints placed on segment types and the elimination of the Char8 type - also draws inspiration from the Go, Swift, and Rust communities.

public readonly ref struct Utf8Span
{
    public Utf8Span(Utf8String? value);

    // This "Unsafe" ctor wraps a Utf8Span around an arbitrary span. It is non-copying.
    // The caller must uphold Utf8Span's invariants: that it's immutable and well-formed
    // for the lifetime that any component might be consuming the Utf8Span instance.
    // Consumers (and Utf8Span's own internal APIs) rely on this invariant, and
    // violating it could lead to undefined behavior at runtime.

    [RequiresUnsafe]
    public static Utf8Span UnsafeCreateWithoutValidation(ReadOnlySpan<byte> buffer);

    // The equality operators and GetHashCode() operate on the underlying buffers.
    // Two Utf8Span instances containing the same data will return equal and have
    // the same hash code, even if they're referencing different memory addresses.

    [EditorBrowsable(EditorBrowsableState.Never)]
    [Obsolete("Equals(object) on Utf8Span will always throw an exception. Use Equals(Utf8Span) or == instead.")]
    public override bool Equals(object? obj);
    public bool Equals(Utf8Span other);
    public bool Equals(Utf8Span other, StringComparison comparison);
    public static bool Equals(Utf8Span left, Utf8Span right);
    public static bool Equals(Utf8Span left, Utf8Span right, StringComparison comparison);
    public override int GetHashCode();
    public int GetHashCode(StringComparison comparison);
    public static bool operator !=(Utf8Span left, Utf8Span right);
    public static bool operator ==(Utf8Span left, Utf8Span right);

    // Unlike Utf8String.GetPinnableReference, Utf8Span.GetPinnableReference returns
    // null if the span is zero-length. This is because we're not guaranteed that the
    // backing data has a null terminator at the end, so we don't know whether it's
    // safe to dereference the element just past the end of the span.

    public bool IsEmpty { get; }

    [EditorBrowsable(EditorBrowsableState.Never)]
    public ref readonly byte GetPinnableReference();

    // For the most part, Utf8Span's remaining APIs mirror APIs already on Utf8String.
    // There are some exceptions: methods like ToUpperInvariant have a non-allocating
    // equivalent that allows the caller to specify the buffer which should
    // contain the result of the operation. Like Utf8String, all APIs are assumed
    // Ordinal unless the API takes a parameter which provides otherwise.

    public static Utf8Span Empty { get; }

    public ReadOnlySpan<byte> Bytes { get; } // returns ROS<byte>, not custom enumerable
    public CharEnumerable Chars { get; }
    public RuneEnumerable Runes { get; }

    // Also allow iterating over extended grapheme clusters (not yet ready).
    // public GraphemeClusterEnumerable GraphemeClusters { get; }

    public int CompareTo(Utf8Span other);
    public int CompareTo(Utf8Span other, StringComparison comparison);

    public bool Contains(char value);
    public bool Contains(char value, StringComparison comparison);
    public bool Contains(Rune value);
    public bool Contains(Rune value, StringComparison comparison);
    public bool Contains(Utf8Span value);
    public bool Contains(Utf8Span value, StringComparison comparison);

    public bool EndsWith(char value);
    public bool EndsWith(char value, StringComparison comparison);
    public bool EndsWith(Rune value);
    public bool EndsWith(Rune value, StringComparison comparison);
    public bool EndsWith(Utf8Span value);
    public bool EndsWith(Utf8Span value, StringComparison comparison);

    public bool IsAscii();

    public bool IsEmptyOrWhiteSpace();

    public bool IsNormalized(NormalizationForm normalizationForm = NormalizationForm.FormC);

    public Utf8String Normalize(NormalizationForm normalizationForm = NormalizationForm.FormC);
    public int Normalize(Span<byte> destination, NormalizationForm normalizationForm = NormalizationForm.FormC);

    public Utf8Span this[Range range] { get; }

    public SplitResult SplitOn(char separator);
    public SplitResult SplitOn(char separator, StringComparison comparisonType);
    public SplitResult SplitOn(Rune separator);
    public SplitResult SplitOn(Rune separator, StringComparison comparisonType);
    public SplitResult SplitOn(Utf8String separator);
    public SplitResult SplitOn(Utf8String separator, StringComparison comparisonType);

    public SplitResult SplitOnLast(char separator);
    public SplitResult SplitOnLast(char separator, StringComparison comparisonType);
    public SplitResult SplitOnLast(Rune separator);
    public SplitResult SplitOnLast(Rune separator, StringComparison comparisonType);
    public SplitResult SplitOnLast(Utf8String separator);
    public SplitResult SplitOnLast(Utf8String separator, StringComparison comparisonType);

    public bool StartsWith(char value);
    public bool StartsWith(char value, StringComparison comparison);
    public bool StartsWith(Rune value);
    public bool StartsWith(Rune value, StringComparison comparison);
    public bool StartsWith(Utf8Span value);
    public bool StartsWith(Utf8Span value, StringComparison comparison);

    public int ToChars(Span<char> destination);

    public Utf8String ToLower(CultureInfo culture);
    public int ToLower(Span<byte> destination, CultureInfo culture);

    public Utf8String ToLowerInvariant();
    public int ToLowerInvariant(Span<byte> destination);

    public override string ToString();

    public Utf8String ToUpper(CultureInfo culture);
    public int ToUpper(Span<byte> destination, CultureInfo culture);

    public Utf8String ToUpperInvariant();
    public int ToUpperInvariant(Span<byte> destination);

    public Utf8String ToUtf8String();

    // Should we also have Trim* overloads that return a range instead
    // of the span directly? Does this actually enable any new scenarios?

    public Utf8Span Trim();
    public Utf8Span TrimStart();
    public Utf8Span TrimEnd();

    public bool TryFind(char value, out Range range);
    public bool TryFind(char value, StringComparison comparisonType, out Range range);
    public bool TryFind(Rune value, out Range range);
    public bool TryFind(Rune value, StringComparison comparisonType, out Range range);
    public bool TryFind(Utf8Span value, out Range range);
    public bool TryFind(Utf8Span value, StringComparison comparisonType, out Range range);

    public bool TryFindLast(char value, out Range range);
    public bool TryFindLast(char value, StringComparison comparisonType, out Range range);
    public bool TryFindLast(Rune value, out Range range);
    public bool TryFindLast(Rune value, StringComparison comparisonType, out Range range);
    public bool TryFindLast(Utf8Span value, out Range range);
    public bool TryFindLast(Utf8Span value, StringComparison comparisonType, out Range range);

    /*
     * HELPER NESTED STRUCTS
     */

    public readonly ref struct CharEnumerable { /* pattern match for 'foreach' */ }
    public readonly ref struct RuneEnumerable { /* pattern match for 'foreach' */ }

    public readonly ref struct SplitResult
    {
        private SplitResult();

        [EditorBrowsable(EditorBrowsableState.Never)]
        public void Deconstruct(out Utf8Span before, out Utf8Span after);
    }
}

public readonly struct Utf8Segment : IComparable<Utf8Segment>, IEquatable<Utf8Segment>
{
    private readonly ReadOnlyMemory<byte> _data;

    public Utf8Span Span { get; }

    // Not all span-based APIs are present. APIs on Utf8Span that would
    // return a new Utf8Span (such as Trim) should be present here, but
    // other APIs that return bool / int (like Contains, StartsWith)
    // should only be present on the Span type to discourage heavy use
    // of APIs hanging directly off of this type.

    public override bool Equals(object? other); // ok to call
    public bool Equals(Utf8Segment other); // defaults to Ordinal
    public bool Equals(Utf8Segment other, StringComparison comparison);

    public override int GetHashCode(); // Ordinal
    public int GetHashCode(StringComparison comparison);

    // Caller is responsible for ensuring:
    // - Input buffer contains well-formed UTF-8 data.
    // - Input buffer is immutable and accessible for the lifetime of this Utf8Segment instance.
    public static Utf8Segment UnsafeCreateWithoutValidation(ReadOnlyMemory<byte> data);
}

Supporting types

Like StringComparer, there's also a Utf8StringComparer which can be passed into the Dictionary<,> and HashSet<> constructors. This Utf8StringComparer also implements IEqualityComparer<Utf8Segment>, which allows using Utf8Segment instances directly as the keys inside dictionaries and other collection types.

The Dictionary<,> class is also being enlightened to understand that these types have both non-randomized and randomized hash code calculation routines. This allows dictionaries instantiated with TKey = Utf8String or TKey = Utf8Segment to enjoy the same performance optimizations as dictionaries instantiated with TKey = string.

Finally, the Utf8StringComparer type has convenience methods to compare Utf8Span instances against one another. This will make it easier to compare texts using specific cultures, even if that specific culture is not the current thread's active culture.

public abstract class Utf8StringComparer : IComparer<Utf8Segment>, IComparer<Utf8String?>, IEqualityComparer<Utf8Segment>, IEqualityComparer<Utf8String?>
{
    private Utf8StringComparer(); // all implementations are internal

    public static Utf8StringComparer CurrentCulture { get; }
    public static Utf8StringComparer CurrentCultureIgnoreCase { get; }
    public static Utf8StringComparer InvariantCulture { get; }
    public static Utf8StringComparer InvariantCultureIgnoreCase { get; }
    public static Utf8StringComparer Ordinal { get; }
    public static Utf8StringComparer OrdinalIgnoreCase { get; }

    public static Utf8StringComparer Create(CultureInfo culture, bool ignoreCase);
    public static Utf8StringComparer Create(CultureInfo culture, CompareOptions options);
    public static Utf8StringComparer FromComparison(StringComparison comparisonType);

    public abstract int Compare(Utf8Segment x, Utf8Segment y);
    public abstract int Compare(Utf8String? x, Utf8String? y);
    public abstract int Compare(Utf8Span x, Utf8Span y);
    public abstract bool Equals(Utf8Segment x, Utf8Segment y);
    public abstract bool Equals(Utf8String? x, Utf8String? y);
    public abstract bool Equals(Utf8Span x, Utf8Span y);
    public abstract int GetHashCode(Utf8Segment obj);
    public abstract int GetHashCode(Utf8String obj);
    public abstract int GetHashCode(Utf8Span obj);
}
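A hedged usage sketch of the comparer with the proposed collection integration, keying a dictionary case-insensitively without first transcoding incoming UTF-8 data to UTF-16:

```csharp
// Sketch against the proposed API: Utf8StringComparer plugs into the standard
// Dictionary<,> constructor just as StringComparer does for string keys.
var headers = new Dictionary<Utf8String, int>(Utf8StringComparer.OrdinalIgnoreCase)
{
    [new Utf8String("Content-Length")] = 0
};

Utf8String incoming = new Utf8String("content-length");
bool found = headers.TryGetValue(incoming, out int index); // true under OrdinalIgnoreCase
```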

Manipulating UTF-8 data

CoreFX and Azure scenarios

  • What exchange types do we use when passing around UTF-8 data into and out of Framework APIs?

  • How do we generate UTF-8 data in a low-allocation manner?

  • How do we apply a series of transformations to UTF-8 data in a low-allocation manner?

    • Leave everything as Span<byte>, use a special Utf8StringBuilder type, or something else?

    • Do we need to support UTF-8 string interpolation?

    • If we have builders, who is ultimately responsible for lifetime management?

    • Perhaps should look at ValueStringBuilder for inspiration.

    • A MutableUtf8Buffer type would be promising, but we'd need to be able to generate Utf8Span slices from it, and if the buffer is being modified continually the spans could end up holding invalid data. Example below:

      MutableUtf8Buffer buffer = GetBuffer();
      Utf8Span theSpan = buffer[0..1];
      
      buffer.InsertAt(0, utf8("💣")); // U+1F4A3 ([ F0 9F 92 A3 ])

      
      // 'theSpan' now contains only the first byte ([ F0 ]).
      // Trying to use it could corrupt the application.
      //
      // Any such mutable UTF-8 type would necessarily be unsafe. This
      // also matches Rust's semantics: direct byte manipulation can only
      // take place within an unsafe context.
      // See:
      // * https://doc.rust-lang.org/std/string/struct.String.html#method.as_mut_vec
      // * https://doc.rust-lang.org/std/primitive.str.html#method.as_bytes_mut
  • Some folks will want to perform operations in-place.

Sample operations on arbitrary buffers

(Devs may want to perform these operations on arbitrary byte buffers, even if those buffers aren't guaranteed to contain valid UTF-8 data.)

  • Validate that buffer contains well-formed UTF-8 data.

  • Convert ASCII data to upper / lower in-place, leaving all non-ASCII data untouched.

  • Split on byte patterns. (Probably shouldn't split on runes or UTF-8 string data, since we can't guarantee data is well-formed UTF-8.)

These operations could be on the newly-introduced System.Text.Unicode.Utf8 static class. They would take ROS<byte> and Span<byte> as input parameters because they can operate on arbitrary byte buffers. Their runtime performance would be subpar compared to similar methods on Utf8String, Utf8Span, or other types where we can guarantee that no invalid data will be seen, as the APIs which operate on raw byte buffers would need to be defensive and would probably operate over the input in an iterative fashion rather than in bulk. One potential behavior could be skipping over invalid data and leaving it unchanged as part of the operation.
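A possible shape for these helpers is sketched below, in the same style as the other API listings in this document. The method names (IsWellFormed, ToUpperAsciiInPlace, TryFind) are illustrative assumptions, not a committed API surface:

```csharp
// Hypothetical members of the System.Text.Unicode.Utf8 static class that
// operate on arbitrary byte buffers which may contain ill-formed data.
namespace System.Text.Unicode
{
    public static partial class Utf8
    {
        // Returns true iff 'buffer' contains only well-formed UTF-8 sequences.
        public static bool IsWellFormed(ReadOnlySpan<byte> buffer);

        // Uppercases ASCII bytes [ 61 .. 7A ] in place; all other bytes
        // (including ill-formed sequences) are left untouched.
        public static void ToUpperAsciiInPlace(Span<byte> buffer);

        // Finds an exact byte pattern, returning the range of the first match.
        // Splitting on byte patterns (rather than runes) avoids assuming the
        // buffer is well-formed UTF-8.
        public static bool TryFind(ReadOnlySpan<byte> buffer, ReadOnlySpan<byte> pattern, out Range match);
    }
}
```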

Sample Utf8StringBuilder implementation for private use

internal ref struct Utf8StringBuilder
{
    public void Append<T>(T value) where T : IUtf8Formattable;
    public void Append<T>(T value, string format, CultureInfo culture) where T : IUtf8Formattable;

    public void Append(Utf8String value);
    public void Append(Utf8Segment value);
    public void Append(Utf8Span value);

    // Some other Append methods, resize methods, etc.
    // Methods to query the length.

    public Utf8String ToUtf8String();

    public void Dispose(); // when done with the instance
}

// Would be implemented by numeric types (int, etc.),
// DateTime, String, Utf8String, Guid, other primitives,
// Uri, and anything else we might want to throw into
// interpolated data.
internal interface IUtf8Formattable
{
    void Append(ref Utf8StringBuilder builder);
    void Append(ref Utf8StringBuilder builder, string format, CultureInfo culture);
}
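As an illustration (not a committed design) of how interpolation lowering could drive this builder, the sketch below assembles a UTF-8 string in a single final allocation. The u"..." literal syntax is the prefix proposed later in this document:

```csharp
using System;
using System.Globalization;

internal static class GreetingBuilder
{
    // Illustrative lowering of a UTF-8 interpolation such as
    // u$"Hello, {name}! It is {now:O}." into builder calls.
    internal static Utf8String Build(Utf8String name, DateTime now)
    {
        var builder = new Utf8StringBuilder();
        try
        {
            builder.Append(u"Hello, ");
            builder.Append(name);
            builder.Append(u"! It is ");
            builder.Append(now, "O", CultureInfo.InvariantCulture); // DateTime as IUtf8Formattable
            builder.Append(u".");
            return builder.ToUtf8String(); // single Utf8String allocation
        }
        finally
        {
            builder.Dispose(); // returns any pooled buffers
        }
    }
}
```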

Code samples and metadata representation

The C# compiler could detect support for UTF-8 strings by looking for the existence of the System.Utf8String type and the appropriate helper APIs on RuntimeHelpers as called out in the samples below. If these APIs don't exist, then the target framework does not support the concept of UTF-8 strings.

Literals

Literal UTF-8 strings would appear as regular strings in source code, but would be prefixed by a u as demonstrated below. The u prefix would denote that the return type of this literal string expression should be Utf8String instead of string.

Utf8String myUtf8String = u"A literal string!";
// Normal ldstr to literal UTF-16 string in PE string table, followed by
// call to helper method which translates this to a UTF-8 string literal.
// The end result of these calls is that a Utf8String instance sits atop
// the stack.

ldstr "A literal string!"
call class System.Utf8String System.Runtime.CompilerServices.RuntimeHelpers.InitializeUtf8StringLiteral(string)

The u prefix would also be combinable with the @ prefix and the $ prefix (more on this below).

Additionally, literal UTF-8 strings must be well-formed Unicode strings.

// Below line would be a compile-time error since it contains ill-formed Unicode data.
Utf8String myUtf8String = u"A malformed \ud800 literal string!";

Three alternative designs were considered. One was to use RVA statics (through ldsflda) instead of literal UTF-16 strings (through ldstr) before calling a "load from RVA" method on RuntimeHelpers. The overhead of using RVA statics is somewhat greater than the overhead of using the normal UTF-16 string table, so the normal UTF-16 string literal table should still be the more optimized case for small-ish strings, which we believe to be the common case.

Another alternative considered was to introduce a new opcode ldstr.utf8, which would act as a UTF-8 equivalent to the normal ldstr opcode. This would be a breaking change to the .NET tooling ecosystem, and the ultimate decision was that there would be too much pain to the ecosystem to justify the benefit.

The third alternative considered was to smuggle UTF-8 data in through a normal UTF-16 string in the string table, then call a RuntimeHelpers method to reinterpret the contents. This would result in a "garbled" string for anybody looking at the raw IL. While that in itself isn't terrible, there is the possibility that smuggling UTF-8 data in this manner could result in a literal string which has ill-formed UTF-16 data. Not all .NET tooling is resilient to this. For example, xunit's test runner produces failures if it sees attributes initialized from literal strings containing ill-formed UTF-16 data. There is a risk that other tooling would behave similarly, potentially modifying the DLL in such a manner that errors only manifest themselves at runtime. This could result in difficult-to-diagnose bugs.

We may wish to reconsider this decision in the future. For example, if we see that it is common for developers to use large UTF-8 literal strings, maybe we'd want to dynamically switch to using RVA statics for such strings. This would lower the resulting DLL size. However, this would add extra complexity to the compilation process, so we'd want to tread lightly here.

Constant handling

class MyClass
{
    public const Utf8String MyConst = u"A const string!";
}
// Literal field initialized to literal UTF-16 value. The runtime doesn't care about
// this (modulo FieldInfo.GetRawConstantValue, which perhaps we could fix up), so
// only the C# compiler would need to know that this is a UTF-8 constant and that
// references to it should get the same (ldstr, call) treatment as stated above.

.field public static literal class System.Utf8String MyConst = "A const string!";

String concatenation

There would be APIs on Utf8String which mirror the string.Concat APIs. The compiler should special-case the + operator to call the appropriate n-ary overload of Concat.

Utf8String a = ...;
Utf8String b = ...;

Utf8String c = a + u", " + b; // calls Utf8String.Concat(...)

Since we expect use of Utf8String to be "deliberate" when compared to string (see the beginning of this document), we should consider that a developer who is using UTF-8 wants to stay in UTF-8 during concatenation operations. This means that if there's a line which involves the concatenation of both a Utf8String and a string, the final type post-concatenation should be Utf8String.

Utf8String a = ...;
string b = ...;

Utf8String concatFoo = a + b;
string concatBar = (object)a + b; // compiler can't statically determine that any argument is Utf8String

This is still open for discussion, as the behavior may be surprising to people. Another alternative is to produce a build warning if somebody tries to mix-and-match UTF-8 strings and UTF-16 strings in a single concatenation expression.

If string interpolation is added in the future, this shouldn't result in ambiguity. The $ interpolation operator will be applied to a literal Utf8String or a literal string, and that would dictate the overall return type of the operation.

Equality comparisons

There are standard == and != operators defined on the Utf8String class.

public static bool operator ==(Utf8String a, Utf8String b);
public static bool operator !=(Utf8String a, Utf8String b);

The C# compiler should special-case when either side of an equality expression is known to be a literal null object, and if so the compiler should emit a referential check against the null object instead of calling the operator method. This matches the if (myString == null) behavior that the string type enjoys today.

Additionally, equality / inequality comparisons between Utf8String and string should produce compiler warnings, as they will never succeed.

Utf8String a = ...;
string b = ...;

// Below line should produce a warning since it will end up being the equivalent
// of Object.ReferenceEquals, which will only succeed if both arguments are null.
// This probably wasn't what the developer intended to check.

if (a == b) { /* ... */ }

I attempted to define operator ==(Utf8String a, string b) so that I could slap [Obsolete] on it and generate the appropriate warning, but this had the side effect of disallowing the user to write the code if (myUtf8String == null) since the compiler couldn't figure out which overload of operator == to call. This was also one of the reasons I had opened dotnet/csharplang#2340.

Marshaling behaviors

Like the string type, the Utf8String type shall be marshalable across p/invoke boundaries. The corresponding unmanaged type shall be LPCUTF8 (equivalent to a BYTE* pointing to null-terminated UTF-8 data) unless a different unmanaged type is specified in the p/invoke signature.

If a different [MarshalAs] representation is specified, the stub routine creates a temporary copy in the desired representation, performs the p/invoke, then destroys the temporary copy or allows the GC to reclaim the temporary copy.

class NativeMethods
{
    [DllImport]
    public static extern int MyPInvokeMethod(
        [In] Utf8String marshaledAsLPCUTF8,
        [In, MarshalAs(UnmanagedType.LPUTF8Str)] Utf8String alsoMarshaledAsLPCUTF8,
        [In, MarshalAs(UnmanagedType.LPWStr)] Utf8String marshaledAsLPCWSTR,
        [In, MarshalAs(UnmanagedType.BStr)] Utf8String marshaledAsBSTR);
}

If a Utf8String must be marshaled from native-to-managed (e.g., a reverse p/invoke takes place on a delegate which has a Utf8String parameter), the stub routine is responsible for fixing up invalid UTF-8 data before creating the Utf8String instance (or it may let the Utf8String constructor perform the fixup automatically).

Unmanaged routines must not modify the contents of any Utf8String instance marshaled across the p/invoke boundary. Utf8String instances are assumed to be immutable once created, and violating this assumption could cause undefined behaviors within the runtime.

There is no default marshaling behavior for Utf8Segment or Utf8Span since they are not guaranteed to be null-terminated. If in the future the runtime allows marshaling {ReadOnly}Span<T> across a p/invoke boundary (presumably as a non-null-terminated array equivalent), library authors may fetch the underlying ReadOnlySpan<byte> from the Utf8Segment or Utf8Span instance and directly marshal that span across the p/invoke boundary.
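In the meantime, a library author can marshal the underlying bytes manually by pinning them and passing an explicit (pointer, length) pair. The Bytes property name below is an assumption about how the raw storage would be exposed, and nativelib / write_utf8 are placeholder names:

```csharp
using System;
using System.Runtime.InteropServices;

// Manual marshaling sketch for Utf8Span, which has no default marshaler.
// The native side receives an explicit length and must not assume a null terminator.
internal static unsafe class Utf8Interop
{
    [DllImport("nativelib")]
    private static extern int write_utf8(byte* data, int byteCount);

    public static int Write(Utf8Span span)
    {
        ReadOnlySpan<byte> bytes = span.Bytes; // assumed accessor for the raw bytes
        fixed (byte* pBytes = bytes)
        {
            return write_utf8(pBytes, bytes.Length);
        }
    }
}
```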

Automatic coercion of UTF-16 literals to UTF-8 literals

If possible, it would be nice if UTF-16 literals (not arbitrary string instances) could be automatically coerced to UTF-8 literals (via the ldstr / call routines mentioned earlier). This coercion would only be considered if attempting to leave the data as a string would have caused a compilation error. This could help eliminate some errors resulting from developers forgetting to put the u prefix in front of the string literal, and it could make the code cleaner. Some examples follow.

// String literal being assigned to a member / local of type Utf8String.
public const Utf8String MyConst = "A literal!";

public void Foo(string s);
public void Foo(Utf8String s);

public void FooCaller()
{
    // Calls Foo(string) since it's an exact match.
    Foo("A literal!");
}

public void Bar(object o);
public void Bar(Utf8String s);

public void BarCaller()
{
    // Calls Bar(object), passing in the string literal,
    // since it's the closest match.
    Bar("A literal!");
}

public void Baz(int i);
public void Baz(Utf8String s);

public void BazCaller1()
{
    // Calls Baz(Utf8String), passing in the UTF-8 literal,
    // since there's no closer match.
    Baz("A literal!");
}

public void BazCaller2(string someInput)
{
    // Compiler error. The input isn't a literal, so no auto-coercion
    // takes place. Dev should call Baz(new Utf8String(someInput)).
    Baz(someInput);
}

public void Quux<T>(ReadOnlySpan<T> value);
public void Quux(Utf8String s);

public void QuuxCaller()
{
    // Calls Quux<char>(ReadOnlySpan<char>), passing in the string literal,
    // since string satisfies the constraints.
    Quux("A literal!");
}

public void Glomp(Utf8Span value);

public void GlompCaller()
{
    // Calls Glomp(Utf8Span), passing in the UTF-8 literal, since there's
    // no closer match and Utf8String can be implicitly cast to Utf8Span.
    Glomp("A literal!");
}

UTF-8 String interpolation

The string interpolation feature is undergoing significant churn (see dotnet/csharplang#2302). I envision that when a final design is chosen, there would be a UTF-8 counterpart for symmetry. The internal IUtf8Formattable interface as proposed above is being designed partly with this feature in mind in order to allow single-allocation Utf8String interpolation.

ustring contextual language keyword

For simplicity, we may want to consider a contextual language keyword which corresponds to the System.Utf8String type. The exact name is still up for debate, as is whether we'd want it at all, but we could consider something like the below.

Utf8String a = u"Some UTF-8 string.";

// 'ustring' and 'System.Utf8String' are aliases, as shown below.

ustring b = a;
Utf8String c = b;

The name ustring is intended to evoke "Unicode string". Another leading candidate was utf8. We may wish not to ship with this keyword support in v1 of the Utf8String feature. If we opt not to do so we should be mindful of how we might be able to add it in the future without introducing breaking changes.

An alternative design is to use a u suffix instead of a u prefix. I'm mostly indifferent to this, but there is a nice symmetry to having the characters u, $, and @ all available as prefixes on literal strings.

We could also drop the u prefix entirely and rely solely on type targeting:

ustring a = "Literal string type-targeted to UTF-8.";
object b = (ustring)"Another literal string type-targeted to UTF-8.";

This has implications for string interpolation, as it wouldn't be possible to prepend both the (ustring) coercion hint and the $ interpolation operator simultaneously.

Switching and pattern matching

If a value whose type is statically known to be Utf8String is passed to a switch statement, the corresponding case statements should allow the use of literal Utf8String values.

Utf8String value = ...;

switch (value)
{
    case u"Some literal": /* ... */
    case u"Some other literal": /* ... */
    case "Yet another literal": /* target typing also works */
}

Since pattern matching operates on input values of arbitrary types, I'm pessimistic that pattern matching will be able to take advantage of target typing. This may instead require that developers specify the u prefix on Utf8String literals if they wish such values to participate in pattern matching.

A brief interlude on indexers and IndexOf

Utf8String and related types do not expose an elemental indexer (this[int]) or a typical IndexOf method because they're trying to rid the developer of the notion that bytewise indices into UTF-8 buffers can be treated equivalently as charwise indices into UTF-16 buffers. Consider the naïve implementation of a typical "string split" routine as presented below.

void SplitString(string source, string target, StringComparison comparisonType, out string beforeTarget, out string afterTarget)
{
    // Locates 'target' within 'source', splits on it, then populates the two out parameters.
    // ** NOTE ** This code has a bug, as will be explained in detail below.

    int index = source.IndexOf(target, comparisonType);
    if (index < 0) { throw new Exception("Target string not found!"); }

    beforeTarget = source.Substring(0, index);
    afterTarget = source.Substring(index + target.Length, source.Length - index - target.Length);
}

One subtlety of the above code is that when culture-sensitive or case-insensitive comparers are used (such as passing OrdinalIgnoreCase as the comparisonType argument above), the target string doesn't have to be an exact char-for-char match of a sequence present in the source string. For example, consider the UTF-16 string "GREEN" ([ 0047 0052 0045 0045 004E ]). Performing an OrdinalIgnoreCase search for the substring "e" ([ 0065 ]) will result in a match, as 'e' (U+0065) and 'E' (U+0045) compare as equal under an OrdinalIgnoreCase comparer.

As another example, consider the UTF-16 string "preſs" ([ 0070 0072 0065 017F 0073 ]), whose fourth character is the Latin long s 'ſ' (U+017F). Performing an OrdinalIgnoreCase search for the substring "S" ([ 0053 ]) will result in a match, as 'ſ' (U+017F) and 'S' (U+0053) compare as equal under an OrdinalIgnoreCase comparer.

There are also scenarios where the length of the match within the search string might not be equal to the length of the target string. Consider the UTF-16 string "encyclopædia" ([ 0065 006E 0063 0079 0063 006C 006F 0070 00E6 0064 0069 0061 ]), whose ninth character is the ligature 'æ' (U+00E6). Performing an InvariantCultureIgnoreCase search for the substring "ae" ([ 0061 0065 ]) will result in a match at index 8, as "æ" ([ 00E6 ]) and "ae" ([ 0061 0065 ]) compare as equal under an InvariantCultureIgnoreCase comparer.

This result is interesting and should give us pause. Since "æ".Length == 1 and "ae".Length == 2, the arithmetic at the end of the method will actually result in the wrong substrings being returned to the caller.

beforeTarget = source.Substring(0, 8 /* index */); // = "encyclop"
afterTarget = source.Substring(
    10 /* index + target.Length */,
    2 /* source.Length - index - target.Length */); // = "ia" (expected "dia"!)

Due to the nature of UTF-16 (used by string), when performing an Ordinal or an OrdinalIgnoreCase comparison, the length of the matched substring within the source will always have a char count equal to target.Length. The length mismatch as demonstrated by "encyclopædia" above can only happen with a culture-sensitive comparer or any of the InvariantCulture comparers.

However, in UTF-8, these same guarantees do not hold. Under UTF-8, only when performing an Ordinal comparison is there a guarantee that the length of the matched substring within the source will have a byte count equal to the target. All other comparers - including OrdinalIgnoreCase - have the behavior that the byte length of the matched substring can change (either shrink or grow) when compared to the byte length of the target string.

As an example of this, consider the string "preſs" from earlier, but this time in its UTF-8 representation ([ 70 72 65 C5 BF 73 ]). Performing an OrdinalIgnoreCase search for the target UTF-8 string "S" ([ 53 ]) will match on the ([ C5 BF ]) portion of the source string. (This is the UTF-8 representation of the letter 'ſ'.) To properly split the source string along this search target, the caller needs to know not only where the match was, but also how long the match was within the original source string.
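The underlying byte-length asymmetry is observable today with ordinary Encoding APIs, independent of any proposed type:

```csharp
using System;
using System.Text;

// The long s (U+017F) occupies two bytes in UTF-8 while its case-insensitive
// counterpart 'S' occupies one, so a matched UTF-8 substring need not have the
// same byte length as the search target.
byte[] longS = Encoding.UTF8.GetBytes("\u017F"); // ſ => [ C5 BF ]
byte[] upperS = Encoding.UTF8.GetBytes("S");     // S => [ 53 ]

Console.WriteLine(longS.Length);  // prints 2
Console.WriteLine(upperS.Length); // prints 1
```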

This fundamental problem is why Utf8String and related types don't expose a standard IndexOf function or a standard this[int] indexer. It's still possible to index directly into the underlying byte buffer by using an API which projects the data as a ROS<byte>. But for splitting operations, these types instead offer a simpler API that performs the split on the caller's behalf, handling the length adjustments appropriately. For callers who want the equivalent of IndexOf, the types instead provide TryFind APIs that return a Range instead of a typical integral index value. This Range represents the matching substring within the original source string, and new C# language features make it easy to take this result and use it to create slices of the original source input string.

This also addresses feedback that was given in a previous prototype: users weren't sure how to interpret the result of the IndexOf method. (Is it a byte count? Is it a char count? Is it something else?) Similarly, there was confusion as to what parameters should be passed to a this[int] indexer or a Substring(int, int) method. By having the APIs promote use of Range and related C# language features, this confusion should subside. Power developers can inspect the Range instance directly to extract raw byte offsets if needed, but most devs shouldn't need to query such information.

API usage samples

Scenario: Split an incoming string of the form "LastName, FirstName" into individual FirstName and LastName components.

// Using Utf8String input and producing Utf8String instances
void SplitSample(ustring input)
{
    // Method 1: Use the Split API to find the ',' char, then trim manually.

    (ustring lastName, ustring firstName) = input.Split(',');
    if (firstName is null) { /* ERROR: no ',' detected in input */ }

    lastName = lastName.Trim();
    firstName = firstName.Trim();

    // Method 2: Use the Split API to find the ", " target string, assuming no trim needed.

    (ustring lastName, ustring firstName) = input.Split(u", ");
    if (firstName is null) { /* ERROR: no ", " detected in input */ }
}

// Using Utf8Span input and producing Utf8Span instances
void SplitSample(Utf8Span input)
{
    // Method 1: Use the Split API to find the ',' char, then trim manually.

    (Utf8Span lastName, Utf8Span firstName) = input.Split(',');
    lastName = lastName.Trim();
    firstName = firstName.Trim();
    if (firstName.IsEmpty) { /* ERROR: trailing ',', or no ',' detected in input */ }

    // Method 2: Use the Split API to find the ", " target string, assuming no trim needed.

    (Utf8Span lastName, Utf8Span firstName) = input.Split(", ");
    if (firstName.IsEmpty) { /* ERROR: trailing ", ", or no ", " detected in input */ }
}

Additionally, the SplitResult struct returned by Utf8Span.Split implements both a standard IEnumerable<T> pattern and the C# deconstruct pattern, which allows it to be used separately from enumeration for simple cases where only a small handful of values are returned.

Utf8Span str = ...;

// The result of Utf8Span.Split can be used in an enumerator

foreach (Utf8Span substr in str.Split(','))
{
    /* operate on substr */
}

// Or it can be used in tuple deconstruction
// (See docs for description of behavior for each arity.)

(Utf8Span before, Utf8Span after) = str.Split(',');
(Utf8Span part1, Utf8Span part2, Utf8Span part3, ...) = str.Split(',');

Scenario: Split a comma-delimited input into substrings, then perform an operation with each substring.

// Using Utf8String input and producing Utf8String instances
// The Utf8Span code would look identical (sub. 'Utf8Span' for 'ustring')

void SplitSample(ustring input)
{
    while (input.Length > 0)
    {
        // 'TryFind' is the 'IndexOf' equivalent. It returns a Range instead
        // of an integer index because there's no this[int] indexer on Utf8String.

        if (!input.TryFind(',', out Range matchedRange))
        {
            // The remainder of the input string is empty, but no comma
            // was found in the remaining portion. Process the remainder
            // of the input string, then finish.

            ProcessValue(input);
            break;
        }

        // We found a comma! Substring and process.
        // The 'matchedRange' local contains the range for the ',' that we found.

        ProcessValue(input[..matchedRange.Start]); // fetch segment to the left of the comma, then process it
        input = input[matchedRange.End..]; // set 'input' to the remainder of the input string and loop
    }

    // Could also have an IEnumerable<ustring>-returning version if we wanted, I suppose.
}

Miscellaneous topics and open questions

What about comparing UTF-16 and UTF-8 data?

Currently there is a set of APIs Utf8String.AreEquivalent which will decode sequences of UTF-16 and UTF-8 data and compare them for ordinal equality. The general code pattern is below.

ustring a = ...;
string b = ...;

// The below line fails to compile because there's no operator==(Utf8String, string) defined.

bool result = (a == b);

// The below line is probably what the developer intended to write.

bool result = ustring.AreEquivalent(a, b);

// The below line should compile since literal strings can be type targeted to Utf8String.

bool result = (a == "Hello!");

Do we want to add an operator==(Utf8String, string) overload which would allow easy == comparison of UTF-8 and UTF-16 data? There are three main downsides to this which caused me to vote no, but I'm open to reconsideration.

  1. The compiler would need to special-case if (myUtf8String == null), which would now be ambiguous between the two overloads. (If the compiler is already special-casing null checks, this is a non-issue.)

  2. The performance of UTF-16 to UTF-8 comparison is much worse than the performance of UTF-16 to UTF-16 (or UTF-8 to UTF-8) comparison. When the representation is the same on both sides, certain shortcuts can be implemented to avoid the O(n) comparison, and even the O(n) comparison itself can be implemented as a simple memcmp operation. When the representations are heterogeneous, the opportunity for taking shortcuts is much more restricted, and the O(n) comparison itself has a higher constant factor. Developers might not expect such a performance characteristic from an equality operator.

  3. Comparing a Utf8String against a literal string would no longer go through the fast path, as target typing would cause the compiler to emit a call to operator==(Utf8String, string) instead of operator==(Utf8String, Utf8String). The comparison itself would then have the lower performance described by bullet (2) above.

One potential upside to having such a comparison is that it would prevent developers from using the antipattern if (myUtf8String.ToString() == someString), which would result in unnecessary allocations. If we are concerned about this antipattern one way to address it would be through a Code Analyzer.

What if somebody passes invalid data to the "skip validation" factories?

When calling the "unsafe" APIs, callers are fully responsible for ensuring that the invariants are maintained. Our debug builds could double-check some of these invariants (such as the initial Utf8String creation consisting only of well-formed data). We could also consider allowing applications to opt-in to these checks at runtime by enabling an MDA or other diagnostic facility. But as a guiding principle, when "unsafe" APIs are called the Framework should trust the developer and should have as little overhead as possible.
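To make the trust boundary concrete, a no-validation factory might look like the sketch below. The names UnsafeCreateWithoutValidation, FastAllocate, DangerousGetMutableSpan, and IsWellFormedUtf8 are illustrative assumptions, not committed API names:

```csharp
using System;
using System.Diagnostics;

// Hypothetical "skip validation" factory. Callers guarantee 'utf8Data' is
// well-formed UTF-8; release builds trust this, debug builds double-check.
public static Utf8String UnsafeCreateWithoutValidation(ReadOnlySpan<byte> utf8Data)
{
    Utf8String newString = FastAllocate(utf8Data.Length);  // assumed allocator
    utf8Data.CopyTo(newString.DangerousGetMutableSpan());  // assumed accessor

#if DEBUG
    Debug.Assert(IsWellFormedUtf8(utf8Data),
        "Caller violated the contract of an unsafe factory: input is not well-formed UTF-8.");
#endif

    return newString;
}
```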

Consider consolidating the unsafe factory methods under a single unsafe type.

This would prevent pollution of the type's normal API surface and could help write tools which audit use of a single "unsafe" type.

Some of the methods may need to be extension methods instead of normal static factories. (Example: Unsafe slicing routines, should we choose to expose them.)

Potential APIs to enlighten

System namespace

Include Utf8String / Utf8Span overloads on Console.WriteLine. Additionally, perhaps introduce an API Console.ReadLineUtf8.

System.Data.* namespace

Include generalized support for serializing Utf8String properties as a primitive with appropriate mapping to nchar or nvarchar.

System.Diagnostics.* namespace

Enlighten EventSource so that a caller can write Utf8String / Utf8Span instances cheaply. Additionally, some types like ActivitySpanId already have ROS<byte> ctors; overloads can be introduced here.

System.Globalization.* namespace

The CompareInfo type has many members which operate on string instances. These should be spanified foremost, and Utf8String / Utf8Span overloads should be added. Good candidates are Compare, GetHashCode, IndexOf, IsPrefix, and IsSuffix.

The TextInfo type has members which should be treated similarly. ToLower and ToUpper are good candidates. Can we get away without enlightening ToTitleCase?

System.IO.* namespace

BinaryReader and BinaryWriter should have overloads which operate on Utf8String and Utf8Span. These overloads could potentially be cheaper than the normal string / ROS<char> based overloads, since the reader / writer instances may in fact be backed by UTF-8 under the covers. If this is the case then writing is simple projection, and reading is validation (faster than transcoding).
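Possible overload shapes for this enlightenment are sketched below, in the listing style used elsewhere in this document; the member names are illustrative, not committed:

```csharp
// Hypothetical Utf8String overloads on the System.IO reader/writer types.
public partial class BinaryWriter
{
    // If the underlying stream is UTF-8-backed, writing is a length-prefixed
    // projection of the bytes with no transcoding step.
    public virtual void Write(Utf8String value);
    public virtual void Write(Utf8Span value);
}

public partial class BinaryReader
{
    // Reading requires only validation (cheaper than transcoding) when the
    // payload is already UTF-8.
    public virtual Utf8String ReadUtf8String();
}
```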

File: WriteAllLines, WriteAllText, AppendAllText, etc. are good candidates for overloads to be added. On the read side, there's ReadAllTextUtf8 and ReadAllLinesUtf8.

TextReader.ReadLine and TextWriter.Write are also good candidates to overload. This follows the same general premise as BinaryReader and BinaryWriter as mentioned above.

Should we also enlighten SerialPort or GPIO APIs? I'm not sure if UTF-8 is a bottleneck here.

System.Net.Http.* namespace

Introduce Utf8StringContent, which automatically sets the charset header. This type already exists in the System.Utf8String.Experimental package.
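An illustrative usage follows; the exact constructor shape (including the mediaType parameter) is an assumption about the experimental package's API:

```csharp
using System.Net.Http;

// Utf8StringContent sets the charset automatically, so the Content-Type
// header goes out as "application/json; charset=utf-8" without the caller
// touching the headers collection.
using var client = new HttpClient();
var content = new Utf8StringContent(u"{\"status\":\"ok\"}", mediaType: "application/json");
HttpResponseMessage response = await client.PostAsync("https://example.com/api", content);
```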

System.Text.* namespace

UTF8Encoding: Overload candidates are GetChars, GetString, and GetCharCount (of Utf8String or Utf8Span). These would be able to skip validation after transcoding as long as the developer hasn't subclassed the type.

Rune: Add ToUtf8String API. Add IsDefined API to query the OS's NLS tables (could help with databases and other components that need to adhere to strict case / comparison processing standards).

TextEncoder: Add Encode(Utf8String): Utf8String and FindFirstIndexToEncode(Utf8Span): Index. This is useful for HTML-escaping, JSON-escaping, and related operations.

Utf8JsonReader: Add read APIs (GetUtf8String) and overloads to both the ctor and ValueTextEquals.

JsonEncodedText: Add an EncodedUtf8String property.

Regex is a bit of a special case because there has been discussion about redoing the regex stack all-up. If we did proceed with redoing the stack, then it would make sense to add first-class support for UTF-8 here.

@benaadams
Member

Even though byte / charu8 is the underlying elemental type of Utf8String, none of the APIs outside of the constructor actually take those types as input. The input parameter types to IndexOf and similar APIs is UnicodeScalar, which represents an arbitrary Unicode scalar value and can be 1 - 4 code units wide when transcoded to UTF-8.

Does that mean

var ss = s.Substring(s.IndexOf(','));

Would be a double traversal? i.e. any use of IndexOf would lead to a double traversal for its return value to be meaningful?

@GrabYourPitchforks
Member Author

Yes, I know this is dated from the future! :)
It's our agenda and review doc for the in-person meeting before it goes to wider community review. Not everything is captured here, especially things related to runtime interaction.

@GrabYourPitchforks
Member Author

GrabYourPitchforks commented Jun 6, 2018

@benaadams No, it's a single traversal, just like if s were typed as System.String in your example. The IndexOf is O(n) up to the first found ',' character (using a vectorized search if available), and the Substring is O(n) from the indexed position to the end of the string. So the total number of bytes observed is index /* IndexOf */ + (Length - index) /* memcpy */ = Length = single traversal.

@benaadams
Member

But if IndexOf is returning the number of UnicodeScalars which can be 1-4 bytes; passing that int return value into Substring doesn't it then have to rescan from the start of the Utf8String to find that start position? i.e. IndexOf isn't returning (int scalarPosition, int byteOffset)

@GrabYourPitchforks
Member Author

APIs that operate on indices (like IndexOf, Substring, etc.) go by code unit count, not scalar count.

(I get that it might be confusing since enumeration of Utf8String instances goes by scalar, not by code unit, so now we have a disparity on the type. That's why I'd proposed as an open question that maybe we kill the enumerator entirely and just have Bytes and Scalars properties, which removes the disparity.)
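For a concrete illustration of code-unit indexing (Rust is used here as an analogue, since its String type is UTF-8-backed like the proposed Utf8String): find returns a byte offset, and slicing by that offset needs no rescan, so the IndexOf + Substring pattern stays a single traversal.

```rust
fn main() {
    // 'é' and 'ö' are each 2 UTF-8 code units wide.
    let s = "héllo,wörld";

    // find() returns a *byte* (code unit) offset, not a scalar count.
    let idx = s.find(',').unwrap();
    assert_eq!(idx, 6); // h(1) + é(2) + l(1) + l(1) + o(1) = 6 bytes

    // Slicing by that byte offset jumps straight to the position;
    // no second scan from the start of the string is needed.
    let ss = &s[idx..];
    assert_eq!(ss, ",wörld");
}
```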

@stephentoub
Member

stephentoub commented Jun 6, 2018

Thanks, Levi. Some questions/comments:

  1. Should be straightforward and O(1) to create a Utf8String instance from an existing String / ReadOnlySpan<char> or from a ReadOnlySpan<byte> coming in from the wire.

I don't understand how this is possible. With Utf8String as a reference type, getting the data into it will necessitate a memcpy at a minimum, which is not O(1).

Must allow querying total length (in code units) as O(1) operation.

I would expect a requirement would also be being able to query the total length in bytes in O(1) (which is also possible with string).

The five requirements below are drawn from String

This is already making some trade-offs. If I've read the data off the wire, I already have it in some memory, which I can then process as a ReadOnlySpan<byte>. To use it as a Utf8String, I then need to allocate and copy. So we're trading off usability for perf. I'm a bit surprised that's the right trade-off for the target audience, but the doc also doesn't specify who the target developers are, provide example scenarios for where/how this will be used, etc.

public ReadOnlySpan<byte> Bytes { get; }
public ReadOnlyMemory<byte> AsMemory();

Why is the to-memory conversion called AsMemory but the to-span conversion called Bytes?

public bool Contains(UnicodeScalar value);

I'm surprised not to see overloads of methods like Contains (IndexOf, EndsWith, etc.) that accept string or char. For char, even if you add an implicit cast from char to UnicodeScalar, we just had that discussion about not relying on implicit casts from a usability perspective in cases like this. And for string, with the currently defined methods someone would need to actually convert a string to a Utf8String, which is not cheap, in order to call these methods.

public int IndexOfAny(ReadOnlySpan value);
public int LastIndexOfAny(ReadOnlySpan value);

string.{Last}IndexOfAny calls this argument anyOf.

public Utf8String ToLowerInvariant();
public Utf8String ToUpperInvariant();

Presumably Utf8String will have culture support and will also have ToLower/Upper methods that are culture-sensitive?

public int IndexOf(UnicodeScalar value);
public int IndexOf(UnicodeScalar value, int startIndex);

What does the return value mean? Is that the number of the byte offset of the UnicodeScalar, or is it the number of the UnicodeScalar? Similarly, for startIndex. Assuming it's the number of UnicodeScalars, if I want to get Bytes and index into it starting at this UnicodeScalar, how do I convert that UnicodeScalar-offset to a byte offset?

Once culture support comes online, we should add CompareTo and related APIs.

From a design discussion perspective, I would think we'd want this outline to represent the ultimate shape we want, and the implementations can throw NotImplementedException until the functionality is available (before it ships).

public readonly struct UnicodeScalar

What's the plan for integration of this with the existing unicode support in .NET? For example, how do I get a System.Globalization.UnicodeCategory for one of these?

public readonly struct Utf8StringSegment

Similar questions related to the APIs on Utf8String.

And, presumably we wouldn't define any APIs (outside of Utf8String/Utf8StringSegment) that accept a Utf8String, instead accepting a Utf8StringSegment, since the former can cheaply convert to the latter but not vice versa?

For me, it also begs the question why do we need both? If we're going to have Utf8StringSegment, presumably that becomes the thing that most APIs would be written in terms of, because it can cheaply represent both the whole and slices. And once you have that, which effectively has the same surface area as Utf8String, why not just make it Utf8String, still as a struct, and get rid of the class-equivalent and duplication. It can then be constructed from a byte[] or a ReadOnlyMemory<byte> without any extra allocation or copying, can be cheaply sliced, etc. Utf8StringSegment (when named Utf8String) is then essentially as a nice wrapper / package for a lot of the functionality that exists in System.Memory as static methods.

n.b. This type is not pinnable because we cannot guarantee null termination.

I don't see why we'd place this restriction. Arrays don't guarantee null termination but are pinnable. Lots of types don't guarantee null termination but are pinnable.

// Pass a Utf8String instance across a p/invoke boundary

I would hope that before or as part of enabling this, we add support for Span<T> and ReadOnlySpan<T>. We still have debt to be paid down there and should address that before adding this as well.

Culture-aware processing code is currently implemented in terms of UTF-16 across all platforms. We don't expect this to change appreciably in the near future, which means that any operations which use culture data will almost certainly require two transcoding steps, making them expensive for UTF-8 data.

I didn't understand this part. Don't both Windows and ICU provide UTF8-based support in addition to the UTF16-based support that's currently being used?

Other stuff

Equivalents for String.Format?

@whoisj
Contributor

whoisj commented Jun 6, 2018

Don't both Windows and ICU provide UTF8-based support in addition to the UTF16-based support that's currently being used?

Not that I know of. Windows is, with very good legacy reasons, very UTF-16/UCS-2 focused.

@whoisj
Contributor

whoisj commented Jun 6, 2018

What about Equals(string other) or CompareTo(string other) ?

Seems like not implementing this would make it difficult for existing ecosystems to adopt this type.

@KrzysztofCwalina
Member

KrzysztofCwalina commented Jun 6, 2018

  1. The proposal lists servers and IoT as main scenarios. I think we need to add ML.NET. They explicitly requested UTF8 string support.
  2. The ML.NET team requires allocation free slicing. I am not sure if they need the slices to be heap-friendly or not. Something you should research.
  3. It would be good to drill into reasons for each of the pri 0 requirements. They all start with "must" and some are very limiting.
  4. I think the requirements should include slicing (even if we decide that slices are a different type and/or not heapable). Non-allocating slicing is a must have for high performance string manipulation.
  5. As a validation exercise, it would be good to rewrite ASP.NET platform server using this string (the code now uses custom AsciiString) and see if we can keep the same performance.
  6. EndsWith (and all similar operations) should have overloads that take ReadOnlySpan<some_type>, and C# should support conveniently creating literals of this span on the stack, e.g. (pseudocode): myString.EndsWith(stackalloc u8"World!"). Currently all the APIs take Utf8String (which allocates) or a scalar (which is a single "char", i.e. not super useful).
  7. In the language support section you state that a literal assignment to Utf8String will result in conversion. Why? We should do target typing in the case you outline and avoid any conversions at runtime.
  8. Nit: I find the "u8" prefix super ugly.
  9. We use ReadOnlySpan<Char> as a representation of a slice of UTF16 string. You are proposing we use Utf8StringSegment. Is the discrepancy ok?
  10. Re Open Question #1: I don't think doing LINQ over scalars is a good practice.

@nil4
Contributor

nil4 commented Jun 6, 2018

The signature public Utf8String[] Split(Utf8String separator) implies a lot of allocations and memory copies.

First, an array must be allocated for the return value.

Then, each element in the array must be a copy of each match, into a newly-allocated buffer, as Utf8String mandates null-termination but the input will not have nulls after each separator.

If I understand this correctly, except for the trivial case when the separator is not present at all, this signature would basically require copying the whole input string.

Would it make sense to return a custom enumerator of Utf8StringSegment instead, similar to SplitByScalarEnumerator or SplitBySubstringEnumerator ?
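For comparison, Rust's str::split behaves exactly like the suggested enumerator: it lazily yields borrowed subslices of the input, so no per-segment allocation or copying happens.

```rust
fn main() {
    let csv = "alpha,beta,gamma";

    // split() is lazy: each item is a &str borrowing into `csv`,
    // so no per-segment buffer is allocated and nothing is copied.
    let mut parts = csv.split(',');
    assert_eq!(parts.next(), Some("alpha"));
    assert_eq!(parts.next(), Some("beta"));
    assert_eq!(parts.next(), Some("gamma"));
    assert_eq!(parts.next(), None);

    // Only collecting forces allocation, and then only of the container:
    let v: Vec<&str> = csv.split(',').collect();
    assert_eq!(v.len(), 3);
}
```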

@svick
Contributor

svick commented Jun 7, 2018

I think the biggest issue with the proposed API is confusion between UTF8 code units and Unicode scalar values, especially when it comes to lengths and indexes. Would it make sense to alleviate that confusion by more explicit names, like ByteLength instead of Length or startByteIndex instead of startIndex?
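Rust, whose strings are also UTF-8, illustrates the same distinction between the two kinds of length: the code unit (byte) length is O(1), while the scalar count requires a traversal.

```rust
fn main() {
    let s = "naïve"; // 'ï' is 2 UTF-8 code units wide

    // len() is the byte (code unit) length: O(1).
    assert_eq!(s.len(), 6);

    // Counting scalars requires walking the string: O(n).
    assert_eq!(s.chars().count(), 5);
}
```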


[EditorBrowsable(EditorBrowsableState.Never)]
public static Utf8String DangerousCreateWithoutValidation(ReadOnlySpan<byte> value);

Is EditorBrowsableState.Never the right way to hide dangerous methods? I don't like it, because it means such methods are hard to use, when I think the actual goal is to limit their discoverability, not their usability. Wouldn't putting them into a separate type be a better solution, similar to how dangerous Span APIs were put into the MemoryMarshal type?


One potential workaround is to make the JIT recognize a ldstr opcode immediately followed by a newobj Utf8String(string) opcode. This pattern can be special-cased to behave similarly to the standalone ldstr today, where the address of the literal String (or Utf8String) object is known at JIT time and a single mov reg, imm instruction is generated.

Would this mean that if I write new Utf8String("foo"), which would produce the same sequence of opcodes, it might not actually create a new instance of Utf8String? I think that would be very confusing, since it's not how any other type behaves, not even string. It would also be a violation of the C# specification, which says that for a class, new has to allocate a new instance:

The run-time processing of an object_creation_expression of the form new T(A), […] consists of the following steps:

  • If T is a class_type:
    • A new instance of class T is allocated. […]

What is the relationship between UnicodeScalar and Rune (https://github.com/dotnet/corefx/issues/24093)?


We can also consider introducing a type StringSegment which is the String-backed analog of this type.

There was an issue about creating StringSegment in corefx, which was closed a month ago, with the justification that ReadOnlyMemory<char> and ReadOnlySpan<char> are good enough: https://github.com/dotnet/corefx/issues/20378. Does that mean it's now on the table again?


The code comments on the StringSegment type go into much more detail on the benefits of this type when compared to ReadOnlyMemory<T> / ReadOnlySpan<T>.

Where can I find those comments? I didn't find the StringSegment type in any dotnet repo.


More generally, with this proposal we will have: string, char[], Span<char>, ReadOnlySpan<char>, Memory<char>, ReadOnlyMemory<char>, Utf8String, byte[], Span<byte>, ReadOnlySpan<byte>, Memory<byte> and ReadOnlyMemory<byte>. Do we really need Utf8StringSegment as yet another string-like type?

@GrabYourPitchforks
Member Author

I don't understand how this is possible. With Utf8String as a reference type, getting the data into it will necessitate a memcpy at a minimum, which is not O(1).

Yes, this is a typo.

I would expect a requirement would also be being able to query the total length in bytes in O(1) (which is also possible with string).

This is possible via Utf8String.Length or Utf8String.Bytes.Length, both of which return the byte count.

I'm surprised not to see overloads of methods like Contains (IndexOf, EndsWith, etc.) that accept string or char.

I struggled with this, and the reason I ultimately decided not to include it is because I think the majority of calls to these methods involve searching for literal substrings, and I'd rather rely on a one-time compiler conversion of the search target from UTF-16 to UTF-8 than a constantly-reoccurring runtime conversion from UTF-16 to UTF-8. I'm concerned that the presence of these overloads would encourage callers to inadvertently use a slow path that requires transcoding. We can go over this in Friday's discussion.

What's the plan for integration of [UnicodeScalar] with the existing unicode support in .NET?

I had planned APIs like UnicodeScalar.GetUnicodeCategory() in a future release, but we can go over them in Friday's meeting.

We use ReadOnlySpan<char> as a representation of a slice of UTF16 string. You are proposing we use Utf8StringSegment. Is the discrepancy ok?

Check the comment at the top of https://github.com/dotnet/corefxlab/blob/utf8string/src/System.Text.Utf8/System/Text/StringSegment.cs. It explains in detail why I think this type provides significant benefits that we can't get simply from using ReadOnlySpan<char>.

It would also be a violation of the C# specification, which says that for a class, new has to allocate a new instance.

We do violate the specification in a few cases. For instance, new String(new char[0]) returns String.Empty. Not a new string that happens to be equivalent to String.Empty - the actual String.Empty instance itself. Similarly, the Roslyn compiler can sometimes optimize new statements away. See for example dotnet/roslyn@13adbac.

What is the relationship between UnicodeScalar and Rune (dotnet/corefx#24093)?

UnicodeScalar is validated: it is contractually guaranteed to represent a value in the range U+0000..U+D7FF or U+E000..U+10FFFF. Scalars have unique transcodings to UTF-8 and UTF-16 code unit sequences. Such transcoding operations are guaranteed always to succeed. Rune (which is not in this proposal) wraps a 32-bit integer which is ostensibly a Unicode code point value but which is not required to be valid. This means that developers consuming invalid Rune instances must be prepared for some operations on those instances to fail.
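Rust's char type follows the same validated-scalar contract described here; a sketch of the valid range, with the surrogate gap excluded:

```rust
fn main() {
    // char is Rust's validated scalar type: construction from a
    // surrogate code point (U+D800..U+DFFF) or an out-of-range
    // value fails, so a char always holds a valid scalar.
    assert!(char::from_u32(0x0041).is_some());   // 'A'
    assert!(char::from_u32(0xD7FF).is_some());   // last scalar before the gap
    assert!(char::from_u32(0xD800).is_none());   // surrogate: invalid
    assert!(char::from_u32(0xE000).is_some());   // first scalar after the gap
    assert!(char::from_u32(0x10FFFF).is_some()); // last valid scalar
    assert!(char::from_u32(0x110000).is_none()); // out of range
}
```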

@svick
Contributor

svick commented Jun 8, 2018

@GrabYourPitchforks

For instance, new String(new char[0]) returns String.Empty. Not a new string that happens to be equivalent to String.Empty - the actual String.Empty instance itself.

I didn't know that, interesting.

Similarly, the Roslyn compiler can sometimes optimize new statements away. See for example dotnet/roslyn@13adbac.

As far as I can tell, that commit is about Span<T>, which is a struct, so it doesn't violate the C# specification.

UnicodeScalar is validated: it is contractually guaranteed to represent a value in the range U+0000..U+D7FF or U+E000..U+10FFFF. […] Rune (which is not in this proposal) wraps a 32-bit integer which is ostensibly a Unicode code point value but which is not required to be valid.

That doesn't sound like a good enough reason to have two different types to me, especially since you can create an invalid UnicodeScalar. Maybe the two groups could work together to create a single type for representing Unicode scalar values?

@GrabYourPitchforks
Member Author

As far as I can tell, that commit is about Span, which is a struct, so it doesn't violate the C# specification.

new byte[] { ... } isn't a struct type. :)

That doesn't sound like a good enough reason to have two different types to me

This proposal assumes that Rune is never committed. So there's only one type in the end.

@whoisj
Contributor

whoisj commented Jun 8, 2018

I see that it's already committed, but can I just go on record as saying that UnicodeScalar is just a plain terrible name? It really is. It's long, it's generic enough to mean nearly nothing, and it is not even a term the Unicode group uses. I had the same complaints about Rune (with the exception that Rune is at least short).

This type really ought to be named Character or CodePoint.

I'm mostly OK with the rest of it, though it would be nice if .Split didn't have to allocate quite as much. The underlying data is already read-only - can't Span<T> be used here or something?

@svick
Contributor

svick commented Jun 8, 2018

@whoisj

I see that it's already committed, but can I just go on record as saying that UnicodeScalar is just a plain terrible name? It really is. It's long, it's generic enough to mean nearly nothing, and it is not even a term the Unicode group uses.

"Unicode Scalar Value" is the term Unicode uses for this.

This type really ought to be named Character or CodePoint.

"Character" doesn't really mean anything (Unicode lists 4 different meanings) and would be easily confused with System.Char/char.

"Code Point" is closer, but that term includes invalid Unicode Scalar Values (the range from U+D800 to U+DFFF).

@benaadams
Member

benaadams commented Jun 8, 2018

It's long, ...

The question is, what would the C# keyword be? (Int32 vs int); something like uchar is short 😉 or nchar to match databases

@whoisj
Contributor

whoisj commented Jun 9, 2018

The question is, what would the C# keyword be? (Int32 vs int); something like uchar is short 😉 or nchar to match databases

This.

Will there be a language keyword for the type? If there is, you can call the type ThatUnicodeValueWhichNobodyCouldAgreeOnAGoodNameForSoThisIsIt for all I care. I vote for c8 but I also like Rust. Keeping C# in mind, uchar seems like the no-brainer to me.

@svick yeah, I know that "character" is nearly meaningless, hence my suggesting it. I prefer "code point" because how on Earth are you going to prevent me from writing invalid values to a UnicodeScalar's memory? Preventing unsafe is a recipe for a performance disaster; and making unsafe (in the real meaning of the word) assumptions about what values a block of memory can contain will lead to fragile and exploitable software design.

@GrabYourPitchforks
Member Author

how on Earth are you going to prevent me from writing invalid values to a UnicodeScalar's memory?

Nobody's stopping you. In fact, there's a public static factory that skips validation and allows you to create such an invalid value. But if you do this you're now violating the contractual guarantees offered by the type, so I'd recommend not doing this. :)

To be clear, creating an invalid UnicodeScalar won't AV the process or anything quite so dire. But it could make the APIs behave in very strange and unexpected manners, leading to errors on the consumption side. For example, UnicodeScalar.Utf8SequenceLength could return -17 if constructed from invalid input. Such are the consequences of violating invariants.

Unlike the UnicodeScalar type, the Utf8String type specifically does not offer a contractual guarantee that instances of the type contain only well-formed UTF-8 sequences.

@whoisj
Contributor

whoisj commented Jun 11, 2018

In fact, there's a public static factory that skips validation and allows you to create such an invalid value.

Sure, great, but a lot of the data being read into these structures will be coming from external sources. Very happy to hear that there are no validation steps being taken as the data is read in (because it would be horribly expensive), but still very concerned about:

But if you do this you're now violating the contractual guarantees offered by the type

BUT there is no guarantee - you've said so in your previous statement. There's an assumption, but no guarantee; so let's be careful how we describe this.

@GrabYourPitchforks
Member Author

GrabYourPitchforks commented Jun 11, 2018

The Utf8String and UnicodeScalar types make different contractual guarantees. I'll try to clarify them.

The Utf8String type encourages but does not require the caller to provide it a string consisting of only valid UTF-8 sequences. All APIs hanging off it have well-defined behaviors even in the face of invalid input. For example, enumerating scalars over an ill-formed Utf8String instance will return U+FFFD when an invalid subsequence is encountered. (Not just that, but the number of bytes we skip in the face of an invalid subsequence is also well-defined and predictable.) This extends to ToUpperInvariant() / ToLowerInvariant() and other manipulation APIs. Their behavior is well-defined even in the face of invalid input.
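This is the same well-defined behavior that Rust's lossy UTF-8 decoding exhibits: each invalid subsequence is replaced by U+FFFD, and the number of bytes consumed per replacement is specified rather than arbitrary.

```rust
fn main() {
    // 0xFF can never appear in well-formed UTF-8.
    let bytes = b"ab\xFFcd";

    // Lossy decoding replaces the invalid subsequence with U+FFFD
    // (the Unicode replacement character) instead of failing.
    let s = String::from_utf8_lossy(bytes);
    assert_eq!(s, "ab\u{FFFD}cd");
}
```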

Exception: If you construct a Utf8String instance and use unsafe code or private reflection to manipulate its data after it has been constructed, the APIs have undefined behavior.

The UnicodeScalar type requires construction from a Unicode scalar value. The API behavior is only well-defined when the instance itself is well-formed. If the caller knows ahead of time that the value it's providing is valid, it can call the "skip validation" factory method. If the instance members off a UnicodeScalar instance misbehave, it means that the caller who originally constructed it violated an invariant at construction time.

The reason for the difference is that it's going to be common to construct a Utf8String instance from some unknown data coming in over i/o. It's not common to construct a UnicodeScalar instance from arbitrary data. Instances of this type are generally constructed from enumerating over UTF-8 / UTF-16 data, and significant bit twiddling needs to happen during enumeration anyway in order to transcode the original data stream into a proper scalar value. Detection of invalid subsequences would necessarily need to occur during enumeration, which means the caller already has the responsibility of fixing up invalid values. The "skip validation" factory is simply a convenience for callers who have already performed this fixup step to avoid the additional validation logic in hot code paths.
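Rust models the same checked / skip-validation split with char::from_u32 and the unsafe char::from_u32_unchecked; a sketch of how the two factories relate:

```rust
fn main() {
    let v = 0x1F600; // a valid scalar, already validated by the caller

    // Checked factory: validates the range, returns Option.
    let checked = char::from_u32(v).unwrap();

    // "Skip validation" factory: the caller asserts validity up front.
    // Passing an invalid value here is undefined behavior, mirroring
    // the contract of a DangerousCreateWithoutValidation-style API.
    let unchecked = unsafe { char::from_u32_unchecked(v) };

    assert_eq!(checked, unchecked);
    assert_eq!(checked, '😀');
}
```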

So when I use the term "contractual guarantee", it's really shorthand for "This API behaves as expected as long as the caller didn't do anything untoward while constructing the instance. If the API misbehaves, take it up with whoever constructed this instance, as they expressly ignored the overloads that tried to save them from themselves and went straight for the 'I know what I'm doing' APIs."

@GrabYourPitchforks
Member Author

FWIW, the reason for this design is that it means that consumers of these types don't have to worry about any of this. Just call the APIs like normal and trust that they'll give you sane values. If you take UnicodeScalar as a parameter to your method, you don't need to perform an IsValid check on it before you use it. Rely on the type system's enforcement to have prohibited the caller from even constructing a bad instance in the first place. (Modulo the caller doing something explicitly marked as dangerous, of course.)

This philosophy is different from the Rune proposal, where if you take a Rune as a parameter into your method you need to perform an IsValid check as part of your normal parameter validation logic since there's otherwise no guarantee that the type was constructed correctly.

@whoisj
Contributor

whoisj commented Jun 12, 2018

I suppose those are safe-enough trade-offs. Still, too bad the name has to be so unwieldy. 🤷‍♂️

@GrabYourPitchforks
Member Author

The name doesn't have to be unwieldy. If there's consensus that it should be named Rune or similar, I'll relent on the naming. :)

@KrzysztofCwalina
Member

We should not call it a "Rune" if it's not a representation for the Unicode Code Point, i.e. let's not hijack a good unambiguous term and use it for something else.

@Tornhoof
Contributor

Tornhoof commented Jun 13, 2018

ᚺᛖᛚᛚᛟ᛫ᚹᛟᚱᛚᛞ
Are you sure about Rune? It's a Unicode Block after all.
Maybe rather grapheme?

@KrzysztofCwalina
Member

KrzysztofCwalina commented Jun 13, 2018

I think in graphemics (the branch of science studying writing) a rune is indeed a grapheme. I think in software engineering, a rune is a code point. But possibly it's not as clear-cut as I think. The point I was trying to make is that using "rune" to mean Unicode Scalar would be at least yet another overload of the word "rune".

@jeremyVignelles

Thanks for your reply, I really appreciate the open discussion here 🙂

If you say that a "char-size agnostic" option could be implemented, that would be an awesome option as it could be progressive through the ecosystem.
My fear was that it would change the size of some structs (that contain a char, for example), thus potentially breaking the ABI (if that term means anything in the C# world).
It would also change the range of values of the char type, but I don't know if that's a big deal, I wouldn't represent integer values in a char in C# (while I might be tempted to do so in C).

Thanks for the discussion, I'll keep following the thread here, but barring a blocker, I'd go for the bold idea of letting things break and get updated incrementally until we get a better .NET.

My estimate is that 99% of the time, upgrading a library or an application to the new version will be as easy as enabling the flag (as opposed to the nullable story where you had to break many things). I'd say that most .net devs don't care about how a string is represented in memory, the same way they don't need to worry about the size of an int and other internal implementations, until they reach the limit or work with low-level code, as you probably do on a daily basis.

@tannergooding
Member

agnostic assemblies that do dynamic checks for sizeof(char) as necessary.

How would this handle the user string table (that is C# string literals)? Today, it is UTF16 encoded. I would have assumed that if we switched to UTF8, then it would become UTF8 encoded rather than carrying both or having a conversion cost on startup by the runtime.

@jkotas
Member

jkotas commented Jan 17, 2021

How would this handle the user string table (that is C# string literals)? Today, it is UTF16 encoded.

I think it would stay UTF16 encoded. The Utf8-optimized runtime mode would pay for the conversion, but that should not be a problem. For JITed cases, the conversion cost is minuscule compared to the cost of JITing and type loading. For AOT cases, we can store the converted string in the AOT binary if it makes a difference. Also, keeping the user string blob UTF16 encoded allows the agnostic binaries to work with existing tools and on older runtimes in the "I know what I'm doing" mode.

There are a number of encoding inefficiencies in the IL format. If we believe that it is important to do something about them and rev the format, it should be a separate exercise.

@jaredpar
Member

Bit of a long thread so is there a quick summary of why we feel like we need char to change to be one byte in length vs. the idea of exposing byte based operations on string?

My mental model of this approach, based on discussions we had back in the Midori days, is that the path forward here would be to change string to define its operations in terms of byte not char. Then we effectively deprecate char at that point and all the methods on string that expose char. Essentially let it become an artifact of history and move on with byte as the new standard.

Is the issue that we feel like we'd need to update too much code that is already written in terms of char today?

Let's leave issues of how C# uses string in the new world aside. If we make the jump at the runtime level to have utf8 be an opt-in story then we can likely do the same for the compiler. Imagine at a high level a /utf8string option where we effectively ban / deprecate all the old members on string and move to the new utf8 ones. Essentially foreach suddenly becomes byte based, not char based.

@ceztko

ceztko commented Jan 19, 2021

Bit of a long thread so is there a quick summary of why we feel like we need char to change to be one byte in length vs. the idea of exposing byte based operations on string?

I think the whole discussion, especially since @jkotas intervened, has moved to a lower-level approach that aims to solve the problem not through an additive API but by changing the lower-level storage for the regular string type. This has the advantage of not jeopardizing the .NET API surface, which remains the same, covers interop scenarios, and should also bring performance improvements in common workloads. The cost of this solution is handling ABI incompatibilities, and some situations where the size/indexing of the storage array matters. I don't know if the MS runtime team is going to have a resolution on this soon: they may stand by and wait for some experimenting with the lower-level approach before deciding which approach to follow. If MS is offering a remote short-term contract to work on this I would love to apply :)

@jkotas
Member

jkotas commented Jan 19, 2021

Is the issue that we feel like we'd need to update to much code that is already written in terms of char today?

Yes, it is the crux of the problem. We have a lot of APIs and code written in terms of string and char today. We are trying to find the best way to move the code to Utf8, while maximizing the performance benefits of Utf8 and minimizing additional cognitive load of the platform.

the path forward here would be to change string to define its operations in terms of byte not char.

I agree that this design would be an option. I think the downside of this approach is that we would need to add byte clones of (at least some) APIs that take char, ReadOnlySpan<char> or Span<char>; and in turn change all code using these APIs to use the byte version instead when running in the Utf8 optimized runtime mode.

Then we effectively deprecate char at that point

I do not think that we would be ever able to deprecate char. There will always be a lot of components that just do not care about the performance benefit of Utf8, no matter how we expose it, and we need to make sure that they will continue to work. Any solution in this space needs to be designed as opt-in. The key question for me is whether it is better to opt-in via adding UTF8 clones of many APIs; or whether it is better to opt-in via compiler/runtime mode. I am leaning towards the latter as you can tell.

@mconnew
Member

mconnew commented Jan 19, 2021

Could this be solved with some JIT level marshalling? Whenever we make a native call which involves a string, we have to marshal the string with a full copy anyway. Could the JIT be used so that it ensures that a string passed to a method in a "legacy" assembly is converted to a 16-bit per character string and any callbacks to a newer assembly is converted to utf8. A mechanism could be added to treat an assembly as utf8 safe in the cases where an unmaintained assembly only uses string in a way which is safe (e.g. passes it through to a .NET assembly such as calling a File api with the string). This would ensure everything is safe, although with a potential performance cost. It would be up to the application developer to decide that their own code is safe and that the performance hit is worth it. Some helper classes could help alleviate a lot of the performance issues in targeted places, e.g. a class for newer apps to use that represents a cached UTF16 converted string which marshalling code recognizes and can use for the cached utf16 representation for hot code paths with a lot of reuse.
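As a hypothetical sketch of the cached-representation helper described above (Rust is used as the illustration language; the CachedUtf16String name and its API are invented for this example): a wrapper owns the UTF-8 data and lazily caches the UTF-16 transcoding, so hot code paths that repeatedly marshal to a UTF-16 consumer pay the conversion only once.

```rust
use std::sync::OnceLock;

/// Hypothetical helper: owns UTF-8 data and lazily caches its UTF-16
/// transcoding so repeated marshaling pays the conversion once.
struct CachedUtf16String {
    utf8: String,
    utf16: OnceLock<Vec<u16>>,
}

impl CachedUtf16String {
    fn new(s: &str) -> Self {
        Self { utf8: s.to_string(), utf16: OnceLock::new() }
    }

    fn as_utf8(&self) -> &str {
        &self.utf8
    }

    /// First call transcodes; later calls return the cached buffer.
    fn as_utf16(&self) -> &[u16] {
        self.utf16.get_or_init(|| self.utf8.encode_utf16().collect())
    }
}

fn main() {
    let s = CachedUtf16String::new("héllo");
    assert_eq!(s.as_utf8().len(), 6);  // UTF-8: 6 code units ('é' is 2)
    assert_eq!(s.as_utf16().len(), 5); // UTF-16: 5 code units
    // Second call hits the cache rather than re-transcoding.
    assert_eq!(s.as_utf16()[0], 'h' as u16);
}
```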

@jkotas
Copy link
Member

jkotas commented Jan 19, 2021

Could this be solved with some JIT level marshalling?

I have commented on it in #2350 (comment) . I do not think it is feasible to solve it via JIT level marshaling. I believe the marshaling would be too expensive and it would have to be done for too many situations.

@Serentty
Copy link

What if from the runtime's perspective, the UTF-8 string type were a completely separate type, and the choice between encodings would lie not with the CLR, but with the compiler? For example, the string keyword in C# would instead refer to Utf8String, string literals would default to UTF-8, and so on. Assemblies using either encoding could still communicate, as it would just be a matter of defaults. I know this goes against Roslyn's policy of not having “dialects” of C#, but it seems simpler than a change like this, which would probably create a rift that would last for years, if it ever closes at all.

@jeremyVignelles
Copy link

@Serentty I don't know how it works internally, but I'm under the impression that the runtime is a very small part to change compared to the BCL, which would need to be recompiled and re-checked, because there are assumptions that char represents a UTF-16 code unit (16 bits).

As jaredpar objected, that would create a branch in the ecosystem, and you would need to choose one or the other version of .NET. If you need an old library that won't get compiled for the new .NET, but also have other "new libraries" that use the new string representation, you'll get stuck.

jkotas' proposal seems like a more reasonable approach, where the BCL would become "string representation agnostic" over time and encourage other libraries to do the same. You would only be able to "upgrade" if all your libs are agnostic, but with a less performant yet still working fallback.

@whoisj
Copy link
Contributor

whoisj commented Feb 1, 2021

@mconnew what do you think about a sizeof(char) == 4 setup?

Asking this specifically because only a 4-byte value has sufficient space to fit any Unicode code point. Of course, we'd have to give up on constant-time access to the string[int] { get -> char } indexer, unless it started returning byte or sbyte values; in which case, we're back to the same problem of "I expect the string indexer to return a 'char', which is short for 'character'". Honestly, there's no real winning in this situation.

@jeremyVignelles
Copy link

@whoisj That would waste a lot of memory for most characters, and would require a conversion to and from UTF-8 and UTF-16 everywhere outside the .NET world.

What you're suggesting is basically a UTF-32 encoding, but UTF-8 is the most compatible with other pieces of software, in my opinion.

@mconnew
Copy link
Member

mconnew commented Feb 2, 2021

@whoisj, in addition to what Jeremy said, it still wouldn't be sufficient. Unicode can use composed characters. For example, é can be represented by the two-character pair of the letter e (U+0065) and the combining acute accent (U+0301). If using the composed version of the character (as opposed to the single code point U+00E9), you still need two chars to represent it. This is a simple example, but things like emojis use composition too. For example, all the people and hand emojis where you can choose the skin tone are composed characters: a skin-tone-specifying character followed by the image character. There is no single char representation.
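To make the composed-character point concrete, here's a minimal C# sketch (the culture-sensitive comparison and text-element count assume a modern .NET with ICU-backed globalization):

```csharp
using System;
using System.Globalization;

class ComposedDemo
{
    static void Main()
    {
        string composed = "e\u0301";   // 'e' + combining acute accent (U+0301)
        string precomposed = "\u00E9"; // 'é' as a single code point (U+00E9)

        Console.WriteLine(composed.Length);    // 2 UTF-16 code units
        Console.WriteLine(precomposed.Length); // 1

        // Ordinal comparison sees different code units...
        Console.WriteLine(composed == precomposed); // False

        // ...but linguistic comparison treats them as canonically equivalent.
        Console.WriteLine(string.Equals(composed, precomposed,
            StringComparison.InvariantCulture)); // True

        // Either way, it's a single grapheme ("thing on the screen").
        Console.WriteLine(new StringInfo(composed).LengthInTextElements); // 1
    }
}
```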
The char data type represents a UTF-16 code unit. The Unicode table has 1,114,112 entries, which can take up to 4 bytes to represent. So UTF-8 and UTF-16 have encoding mechanisms to specify that a following value is needed to complete the Unicode table entry (and UTF-32 has some leading zeros, as 4 bytes is bigger than needed). So you have 3 different entities here: a code unit, which could represent a code point by itself or be part of a multi-unit code point; a code point, which is made up of 1 or more code units; and a Rune, which represents a whole Unicode scalar value.
The problem being solved here is that most of the internet uses UTF-8, but .NET uses UTF-16. This means that when reading from the internet, everything needs to be converted from UTF-8 to UTF-16. This isn't just a matter of copying the values into every other byte. A two-unit UTF-8 code point needs to be converted to its UTF-16 representation, which involves some bit shifting. A three-unit UTF-8 code point needs to be converted to 2 UTF-16 values. And then when displaying on the screen, you might have to combine multiple code points into a single on-screen character (a grapheme cluster).
My preference would be to make Rune[] or Span<Rune> the primitive any new APIs use. When reading or writing to a stream, you want to do it one Rune at a time. With ASCII text, that would be one byte per Rune, but with other characters it could be up to 4 bytes. I don't really care how a character is represented under the hood. There are only a few things I care about.

  1. Easy consumption and writing of a series of Runes from a byte stream, which is basically a string.
  2. Easy buffer math, i.e. do I have enough space left in my buffer to write the next Rune?
  3. Easy manipulation of a series of Runes, i.e. Find/Replace, ToUpper/ToLower, concatenation, substring, etc.

I wonder how much mileage you could get out of implicit conversion operators to convert ReadOnlySpan<Rune> or Span<Rune> to byte[], ReadOnlySpan<byte> and old-fashioned strings. Add in some useful methods which mirror the String methods, such as Replace, etc. As long as you can get to/from a byte[] or Span<byte>, you have your I/O covered. Any existing APIs in libraries which need a string would require a conversion, but I don't think that should cost much more than if you had just gone to string to begin with, as all you have done is defer the conversion to UTF-16. As long as you cached the conversion and had some container to hold the Span<Rune> with its cache (or used a weak reference table), I think you could minimize the cost.
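As a sketch of the "buffer math" idea (point 2 above), the existing System.Text.Rune type (available since .NET Core 3.0) already exposes what's needed:

```csharp
using System;
using System.Text;

class RuneBufferMath
{
    static void Main()
    {
        Span<byte> buffer = stackalloc byte[8];
        int written = 0;

        foreach (Rune r in "a美".EnumerateRunes())
        {
            // Do I have enough space left to write the next Rune?
            if (r.Utf8SequenceLength > buffer.Length - written)
                break; // a real writer would flush the buffer here

            written += r.EncodeToUtf8(buffer.Slice(written));
        }

        Console.WriteLine(written); // 4 bytes: 1 for 'a' + 3 for '美'
    }
}
```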

Char is the least useful representation of Unicode. It's neither the raw representation of what gets sent over a stream (file or network), nor a representation of single entities that you see on the screen. It's a halfway representation which has no useful purpose in isolation, unlike byte[] or Rune[].

@jeremyVignelles
Copy link

Iterating over bytes costs O(1) per element, and iterating over UTF-8 code units is doable in a reasonable time, but I don't know what it takes to iterate over Runes. How does one know a composition is possible? Does it need to use a lookup table?
If so, it wouldn't seem reasonable performance-wise to iterate Rune by Rune when writing to a stream. For such cases, a byte[] copy should be preferred.

However, I'm OK with the inclusion of an .AsRuneSpan() or being able to cast to an explicit IEnumerable<Rune> implementation.

@LokiMidgard
Copy link

I think it's a property of the code point. Similar to upper and lower case.

If so, that wouldn't seem reasonable to iterate rune by rune to write in a stream performance-wise. For such cases, as byte[] copy should be preferred.

I'm not sure someone would iterate over a UTF-8 string to write every Rune to a stream. I think I've never seen someone iterate over a string and write single chars to a stream. After all, a stream does not take Runes as input, but bytes.

@GrabYourPitchforks
Copy link
Member Author

I think I've never seen someone iterate over a string and write single chars to a stream.

This would be lossy. As @mconnew mentioned earlier, sometimes you need to look at a pair of chars in a sequence in order to generate the correct UTF-8 output. If you're operating on chars in isolation rather than keeping some kind of state, you could lose data. The System.Text.Encoding.GetEncoder API is a stateful conversion class meant to help with this scenario. (If you're working on the Rune level, you don't need to worry about lossy conversions, as each Rune is absolutely guaranteed to be a standalone well-formed Unicode scalar value.)
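A small illustration of that stateful conversion: Encoding.UTF8.GetEncoder() correctly handles a surrogate pair split across two buffers, where a stateless per-char conversion would lose data.

```csharp
using System;
using System.Text;

class StatefulEncoding
{
    static void Main()
    {
        // "😀" (U+1F600) is a surrogate pair: two chars that are only
        // meaningful together. Pretend they arrive in separate buffers.
        char[] part1 = { '\uD83D' }; // high surrogate
        char[] part2 = { '\uDE00' }; // low surrogate

        Encoder encoder = Encoding.UTF8.GetEncoder(); // stateful
        byte[] output = new byte[8];

        // With flush: false, the encoder buffers the lone high surrogate
        // instead of emitting a replacement character.
        int n1 = encoder.GetBytes(part1, 0, 1, output, 0, flush: false);
        int n2 = encoder.GetBytes(part2, 0, 1, output, n1, flush: true);

        Console.WriteLine(n1);      // 0 bytes so far
        Console.WriteLine(n1 + n2); // 4: the full UTF-8 sequence F0 9F 98 80
    }
}
```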

As Jan said, we're looking into whether it would make sense to back string with UTF-8 instead of UTF-16 as is done today. This would have some interesting behavioral and runtime consequences, and it'd result in a trade-off where memory utilization might go down, but CPU utilization might go up. In other words, common operations like string.IndexOf, string.Equals, and string.ToUpper might become slower, but each string instance would be smaller. Still need to weigh things here.

@mconnew
Copy link
Member

mconnew commented Feb 3, 2021

I think I've never seen someone iterate over a string and write single chars to a stream. After all, a stream does not take Runes as input, but bytes.

Have a look at how many classes use one of the overloads of Encoding.GetCharCount or Encoding.GetByteCount to see how many places care about how many bytes a char[] needs to write to a buffer, or how big a char[] is needed to store a sequence of bytes. Here's some code which writes a char[] to a buffer. It iterates through the char[], looking at each char in turn to see if it can be encoded in a single byte or if it needs multiple bytes. It then copies the values one at a time in that loop. If it finds a multi-byte char, it then searches for multiple in a row and handles all of the consecutive ones using Encoding.GetBytes().
You just haven't needed to write any code which does the actual conversion between UTF-16 and UTF-8. Anyone writing a library which needs to be able to write UTF-8 from a char[] has had to worry about this if they are performance sensitive. If you write apps which are used by people whose languages use multi-byte characters, such as Japanese, there's a good chance you've had to fix bugs in code where single-byte characters were incorrectly presumed. English isn't the only language in existence. There's a reasonable argument to be made that UTF-16 should remain, because switching to UTF-8 is optimizing for English; other languages frequently need 2 bytes and so are easier to work with in memory as UTF-16 than as UTF-8. For example, switching string to use UTF-8 would cause many to have to switch from string.IndexOf(char value) to string.IndexOf(string value), because what they are searching for no longer fits in a single char.
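A quick illustration of both points, i.e. byte counts diverging from char counts, and the IndexOf(char) concern:

```csharp
using System;
using System.Text;

class ByteCounts
{
    static void Main()
    {
        string english = "hello";
        string japanese = "こんにちは"; // 5 chars, but not 5 bytes in UTF-8

        Console.WriteLine(Encoding.UTF8.GetByteCount(english));  // 5
        Console.WriteLine(Encoding.UTF8.GetByteCount(japanese)); // 15 (3 bytes each)

        // The IndexOf concern: '美' fits in one char (one UTF-16 code unit),
        // but its UTF-8 form is the 3-byte sequence E7 BE 8E, so a
        // byte-indexed string could no longer find it via IndexOf(char).
        Console.WriteLine("審美眼".IndexOf('美')); // 1 (a UTF-16 index today)
    }
}
```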

@whoisj
Copy link
Contributor

whoisj commented Feb 14, 2021

@mconnew , I think you're missing the point.

There's no reason a string type needs to be a fancy wrapper around char[]. Additionally, the char type is misleading, because most of the world has agreed on Unicode, and Unicode defines characters using 21 bits of information.

My suggestion was sizeof(char) == 4 (effectively UTF-32), but I was advocating for a string type backed by UTF-8. A UTF-8-backed string type wastes less memory than the current UTF-16 (UCS-2 on Windows) variant does. I was also suggesting that the indexer should return byte to enable quick indexing/parsing of values, but that the type should also provide a method (.CharAt(index)) which returns a 32-bit char value type.

Seems like I might also need to remind some of you that there are platforms besides Windows, and on those platforms forcing all string data into UTF-16 requires a re-encoding of that data and an increase in its memory footprint.

@ceztko
Copy link

ceztko commented Feb 14, 2021


@whoisj

My suggestion was sizeof(char) == 4 (effectively UTF-32), but I was advocating for a string type backed by UTF-8. [...] but it should also provide a method (.CharAt(index)) which returns a 32-bit char value type.

This proposal certainly has some drawbacks; for example, all char[] arrays used so far would suddenly double in storage space. Also, the string class already has a char indexer [1], which would suddenly go from O(1) to O(n) complexity. You'd really want to introduce a brand-new type to define a Unicode code point, IMO.

[1] https://docs.microsoft.com/en-us/dotnet/api/system.string.chars?view=net-5.0#System_String_Chars_System_Int32_

@whoisj
Copy link
Contributor

whoisj commented Feb 15, 2021

@ceztko

the string class already has a char indexer[1], which would become O(n) complexity suddenly from O(1). You'd really want to introduce a brand new type to define an Unicode code point, IMO.

I've implemented UTF-8-based string types in C++ and C# many times. In every case, the optimal path is to keep the indexer returning the internal type's value.

example (in the case where the underlying data type is byte[] for string):

public byte this[int index] { get; }

Example usage would be seeking the next newline character, which absolutely doesn't require a character-by-character search, but merely a byte-by-byte seek.

int index = myString.IndexOf('\n');

Even when seeking a character like '美', one can compose the UTF-32 value into a UTF-8 encoded value using the same 32-bit value space, and seek for the first series of bytes that match. Via unsafe, this can be amazingly quick.

In 99% of string parsing cases (the most likely reason code is using the indexer), reading a byte from the string is more than sufficient and there's generally no reason to read the entire Unicode value.

When code needs to pull each Unicode character out of a string, then this can be expensive and should be done via a utility like an enumerator. In which case, the enumerator handles the character composition for the caller.
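For what it's worth, the byte-seek approach can be sketched safely with spans instead of unsafe code. It works because in UTF-8, bytes below 0x80 never appear inside a multi-byte sequence, so raw byte comparisons can never match in the middle of another character:

```csharp
using System;
using System.Text;

class ByteSeek
{
    static void Main()
    {
        byte[] utf8 = Encoding.UTF8.GetBytes("price:\n美 100");

        // ASCII bytes (< 0x80) never occur inside a multi-byte UTF-8
        // sequence, so a raw byte seek for '\n' is always correct:
        int newline = Array.IndexOf(utf8, (byte)'\n');
        Console.WriteLine(newline); // 6

        // For a non-ASCII character, encode it once and search for the
        // resulting byte sequence, as suggested above:
        Span<byte> needle = stackalloc byte[4];
        int len = new Rune('美').EncodeToUtf8(needle);
        int pos = utf8.AsSpan().IndexOf(needle.Slice(0, len));
        Console.WriteLine(pos); // 7
    }
}
```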

@danmoseley
Copy link
Member

@jkotas shall we transfer this to runtime or runtimelab as we are archiving this repo?

@jkotas
Copy link
Member

jkotas commented Feb 15, 2021

The discussion in this issue is too long and github has troubles rendering it.

I think we should close this issue and start a new one in dotnet/runtime.
