Introduce a Utf8String type #933
Comments
Notes for our initial review are here. |
The indexer of
public ref readonly Utf8Char this[int index] => throw null;
Please add the following member to solve this problem:
[System.Runtime.CompilerServices.SpecialName]
public Utf8Char get_Chars(int index) => throw null;
This is what it looks like in VB:
Public ReadOnly Property Chars(index As Integer) As Utf8Char |
Has there been any update on this in general and what other considerations/design changes have happened since the initial review? The current implementation in the NuGet package is vastly different from the proposed API. |
The NuGet package generally follows the proposal in dotnet/corefxlab#2350, which is where most of the discussion has taken place. It's a bit aggravating that the discussion is split across so many different forums, I know. :( |
The next steps on this are to:
|
We may also want to evaluate an alternative that does not introduce the Utf8String type at all. We had a good discussion about it in dotnet/corefxlab#2350 recently. |
I noticed dotnet/corefxlab#2350 just got closed. Did the discussion about first-class UTF-8 support move somewhere else? |
I'm currently (pre-)processing multi-TB data sets in C#. I have to match and join millions of strings, which are taking up A LOT of memory (100+ GB). Because my machine only has 64GB of memory, I had to switch to a more efficient string representation:
It would've saved me days of work if a UTF-8 string representation were available as a runtime configuration switch.
Main goal: Only use 50% of the memory with equal performance in managed code. I don't care about marshalling performance. Having to change the type in all existing code would be doable, but not ideal.
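To illustrate where the "50% of the memory" figure comes from, here is a minimal sketch that compares the encoded size of a few strings in UTF-16 and UTF-8. The sample keys are hypothetical, and the counts cover only the character payload, not object headers or other per-string overhead.

```csharp
using System;
using System.Text;

// Hypothetical sample data: mostly-ASCII keys, as in a typical match/join workload.
string[] keys = { "customer-0001", "order/2024/03/17", "Zo\u00EB" };

long utf16Bytes = 0;
long utf8Bytes = 0;
foreach (string key in keys)
{
    utf16Bytes += Encoding.Unicode.GetByteCount(key); // 2 bytes per UTF-16 code unit
    utf8Bytes += Encoding.UTF8.GetByteCount(key);     // 1 byte per ASCII character, more for others
}

Console.WriteLine($"UTF-16: {utf16Bytes} bytes, UTF-8: {utf8Bytes} bytes");
```

For pure ASCII data the UTF-8 payload is exactly half the UTF-16 payload; non-ASCII characters such as U+00EB narrow the gap.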
Personally, I'd strongly prefer this approach. |
Maybe we need to step up. If we keep talking about an API for a couple of years, this is going to be as slow and inefficient as the C++ standard. Make Utf8String an option for developers. |
@sgf we are currently discussing options here. We need to be really careful about what we do with UTF-8 strings, because we do not want to duplicate all String APIs with Utf8String overloads, but at the same time we do want UTF-8 strings. Just swapping the internals of string would break lots of apps at the moment, as plenty of them rely on things like: |
What are the chances of getting UTF8 strings into the 7.0.0 milestone? |
Sorry for the low-effort comment. I have not read the whole thread, but my opinion is that an external Utf8String is not convenient in terms of discoverability and readability (it explicitly exposes an implementation detail when the developer's intent is just to mean a string, which increases cognitive overhead). For those reasons, usage will be niche. |
Does the internal representation of String remain UTF-16 or Latin-1 (Compact Strings) in Java 18? Only the default encoding might be changed to UTF-8. |
Right, UTF-8 is not the internal representation as of now, although Java uses Latin-1 for strings whose characters all fit in a single byte. |
If roles become a thing, would it make sense to have |
The primary problem is that all of today's useful API is defined on However, one could augment all the existing types and methods with |
I understand from dotnet/corefxlab#2350 that the |
@timcassell Exactly. Instead of introducing a Utf8String class, I think we should utilize the |
Utf8String is required as an identifiable on-heap type, and it can't be substituted by byte[] by normal means. byte[] can already be used today, and it is badly incompatible with diagnostics/profiling. This is why VST is needed. |
If we introduce a role type from runtime, then the debugger/profiler can recognize the role type to provide diagnostics functionality. |
@hez2010 I'm not against roles. I'm just pointing out that a named heap type (e.g. Utf8String) is useful for memory diagnostics. The two are not mutually exclusive. |
With static interfaces, wouldn't it be possible to introduce an |
Another option instead of a runtime switch is to add new |
I find it hard to imagine a runtime switch to change the representation of |
@Neme12 No it isn't. A UTF-8 string would not index into the code points, but into the raw bytes. It also would not change For cases of normalization, parsing, runes and such, it's always a cat and mouse game, with new "runes" added all the time, which is why I say it's not suited for the BCL itself. |
Indexing a standard UTF-16 string doesn't give you code points; it gives you 16-bit code units. A lot of Unicode characters (code points) take two 16-bit code units. UTF-16 is no different from UTF-8 in this respect. Only UTF-32 allows indexing code points directly. |
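A small sketch of the point about UTF-16 code units, using only the existing System.String and System.Text.Rune APIs. The sample string is made up for illustration.

```csharp
using System;
using System.Text;

// 'a', U+1F600 (outside the Basic Multilingual Plane), 'b'
string s = "a\U0001F600b";

int codeUnits = s.Length; // counts UTF-16 code units, not code points

int codePoints = 0;
foreach (Rune rune in s.EnumerateRunes()) // walks Unicode scalar values
{
    codePoints++;
}

// s[1] is only the first half of the surrogate pair that encodes U+1F600.
bool isHighSurrogate = char.IsHighSurrogate(s[1]);

Console.WriteLine($"{codeUnits} code units, {codePoints} code points, s[1] is a high surrogate: {isHighSurrogate}");
```

The indexer is O(1) precisely because it returns code units; getting at code points requires a scan from the start.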
Which part were you responding to? |
@vrubleg I never said the built in string gives you code points? It just adds to my reasoning that a BCL UTF-8 string should not try to enumerate code points, giving only O(1) access to the raw bytes |
It gives O(1) access to code units that is consistent with standard UTF-16 strings. Yes, in case of UTF-8 code units are equal to raw bytes, and in case of UTF-16 code units are equal to raw words. So what? |
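The same idea for UTF-8, sketched with a plain byte[] standing in for a UTF-8 string type (an assumption for illustration; no Utf8String API is used here). Indexing returns a byte in O(1), and decoding a code point is a separate, explicit step via Rune.DecodeFromUtf8.

```csharp
using System;
using System.Buffers;
using System.Text;

// U+00E9 takes two bytes in UTF-8, so the five-character text occupies six bytes.
byte[] utf8 = Encoding.UTF8.GetBytes("h\u00E9llo");

// O(1) access to a code unit: this is the lead byte of the two-byte sequence.
byte second = utf8[1];

// Decoding the code point that starts at index 1 is an explicit operation.
OperationStatus status = Rune.DecodeFromUtf8(utf8.AsSpan(1), out Rune rune, out int consumed);

Console.WriteLine($"{utf8.Length} bytes, utf8[1] = 0x{second:X2}, decoded U+{rune.Value:X4} from {consumed} bytes ({status})");
```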
Probably, @Neme12 meant that it would be O(n) if just internal representation of |
@vrubleg Oooh my bad, I misinterpreted what they were saying. I apologize, and am in agreement then 🙈 |
@vrubleg If viable this seems a great solution. It's not clear if this is currently the official Microsoft stance, though: are there more references on this? |
A Utf8String type would be a good addition. Having ""u8 produce a ROS (ReadOnlySpan<byte>) isn't great, and having a good option to pass a value of that type directly to a stream, where the encoding is very often UTF-8, would be a huge gain. |
There exists https://github.com/U8String/U8String but likely .NET internals could also benefit from a native UTF-8 string type. |
@ramonsmits Interesting, though if a standard Utf8String type is added, I don't think any kind of parsing should be added like that library does. People using UTF-8 strings are in 99% of cases just moving the string around or doing simple ASCII parsing. Very rarely are people actually rendering it themselves. I think the type .NET gets (if it does get one) should have constant time access to each
Edit: Actually no, I misunderstood, a rune is a Unicode codepoint, in which case that library is perfect I think, not over-extending into areas it shouldn't, nice! |
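For context, this is roughly what the current option looks like: a u8 literal (C# 11 or later) yields a ReadOnlySpan<byte>, which can be handed to Stream.Write without transcoding. A minimal sketch, using a MemoryStream as a stand-in for whatever stream is actually being written to.

```csharp
using System;
using System.IO;

using var stream = new MemoryStream();

// The u8 suffix produces UTF-8 bytes at compile time as a ReadOnlySpan<byte>.
ReadOnlySpan<byte> greeting = "hello"u8;

// Stream.Write has an overload that accepts ReadOnlySpan<byte> directly.
stream.Write(greeting);

long written = stream.Length;
Console.WriteLine($"Wrote {written} bytes");
```

Because the span cannot be stored in a field or boxed, it works well for immediate writes like this but not as a general-purpose string value, which is the gap a dedicated type would fill.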
AB#1117209
This is the API proposal for Utf8String, an immutable, heap-allocated representation of UTF-8 string data. See dotnet/corefxlab#2368 for the scenarios and design philosophy behind this proposal.
Included in this are also APIs to improve text processing across the framework as a whole, including changes to existing types like String and CultureInfo.
Edits:
Nov. 8 - 9, 2018 - Updated API proposals in preparation for upcoming review.