String as more of a byte vector type? #22616
Comments
Solution i. is #22572. OTOH I doubt solution ii. would be really practical: if you index a perfectly valid UTF-8 string at an incorrect index, you would get an invalid char silently, rather than an error, which would be quite confusing. Though I guess it could be allowed with
The idea here is the following:
What do you mean exactly by "a byte Char value"? Wikipedia presents a few common strategies, for example UTF-8B, which Python 3.1 uses. EDIT: Whoops, UTF-8B does not apply here since we're talking about storing bytes inside a
Is there a reliable and efficient way of distinguishing the last two cases? If the index points to a byte which isn't a valid first byte of a codepoint, you can't know whether it's in the middle of a valid codepoint or just a byte outside of any codepoint without going back to the previous bytes (up to 3 bytes). Is this what you envisage in case the index does not point to a valid first byte of a codepoint?
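To make the look-back concrete, here is a minimal sketch in Python (not Julia, and not what any Julia PR actually does). The names `expected_length`, `is_valid_sequence` and `classify` are hypothetical, and validity is delegated to Python's strict UTF-8 decoder. It classifies the byte at index `i` as the start of a valid sequence, a byte in the middle of one, or a stray byte, scanning back at most 3 bytes:

```python
def expected_length(lead: int) -> int:
    """Sequence length implied by a lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte (0x80-0xBF) or a byte that is never valid


def is_valid_sequence(data: bytes, start: int, length: int) -> bool:
    """True if data[start:start+length] is exactly one well-formed sequence."""
    chunk = data[start:start + length]
    if len(chunk) != length:
        return False  # sequence is cut off by the end of the buffer
    try:
        chunk.decode("utf-8")  # strict: rejects overlongs and surrogates
    except UnicodeDecodeError:
        return False
    return True


def classify(data: bytes, i: int) -> str:
    """Return 'start', 'middle' or 'stray' for the byte at index i."""
    n = expected_length(data[i])
    if n and is_valid_sequence(data, i, n):
        return "start"
    # Not a valid start: look back at most 3 bytes for a sequence covering i.
    for back in range(1, 4):
        j = i - back
        if j < 0:
            break
        n = expected_length(data[j])
        if n > back and is_valid_sequence(data, j, n):
            return "middle"
    return "stray"


sample = "a\u00e9\u20ac".encode("utf-8") + b"\x80"  # a, é, €, then a lone 0x80
kinds = [classify(sample, i) for i in range(len(sample))]
```

The cost is bounded (at most three candidate lead bytes are checked), but it is still extra work on every index that does not land on a valid first byte.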
A value of
That's part of this that is somewhat exploratory, but if you look at an 8-byte word (7 bytes, really) starting at
Done by #24999.
This is somewhat speculative, but a possible part of #16107 remaining for 1.0. In short, the concept is the following:
Some aspects of this:
String
object.String
-->Char...
-->String
round-trip for any data including invalid UTF-8.

Item 2 entails some way of encoding invalid UTF-8 data as characters, which could be compatible with UTF-32 – use Int32 code units where negative values represent invalid data? – or incompatible with it – e.g. "exploded" UTF-8 – UTF-8 bytes padded with zeros, stored in UInt32 code units. The former maintains format compatibility for valid Unicode data, which is nice, but the latter might have significant performance benefits.
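As a rough illustration of the "exploded" UTF-8 idea, here is a Python sketch (not Julia code). It rests on assumptions the text above does not spell out: bytes are left-aligned in each 32-bit unit with zero padding on the right, and every byte that is not part of a well-formed sequence gets a unit of its own. The names `split_units`, `explode` and `implode` are hypothetical.

```python
from typing import List


def split_units(data: bytes) -> List[bytes]:
    """Split data into well-formed UTF-8 sequences and single invalid bytes."""
    units = []
    i = 0
    while i < len(data):
        chunk = data[i:i + 1]  # fallback: one invalid byte on its own
        for n in (1, 2, 3, 4):
            candidate = data[i:i + n]
            if len(candidate) < n:
                break  # ran off the end of the buffer
            try:
                candidate.decode("utf-8")  # strict decoder
            except UnicodeDecodeError:
                continue
            chunk = candidate
            break
        units.append(chunk)
        i += len(chunk)
    return units


def explode(data: bytes) -> List[int]:
    """Pack each unit into a 32-bit integer, zero-padded on the right."""
    return [int.from_bytes(u.ljust(4, b"\x00"), "big") for u in split_units(data)]


def implode(units: List[int]) -> bytes:
    """Invert explode(): drop the zero padding, keeping at least one byte."""
    out = bytearray()
    for u in units:
        raw = u.to_bytes(4, "big").rstrip(b"\x00")
        out += raw or b"\x00"  # an all-zero unit is the NUL byte itself
    return bytes(out)


# Valid text followed by invalid bytes: 0xFF and a truncated 3-byte sequence.
data = "a\u00e9\u20ac\U0001F600".encode("utf-8") + b"\xff\xe2\x82"
units = explode(data)
restored = implode(units)
```

Stripping trailing zeros is unambiguous here because a multi-byte unit never ends in a zero byte: continuation bytes are at least 0x80, and a zero byte is only ever stored alone. That is what lets arbitrary data, valid or not, survive the round trip under these assumptions.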
Part of #16107.