String as more of a byte vector type? #22616

Closed
StefanKarpinski opened this issue Jun 29, 2017 · 5 comments

@StefanKarpinski
Member

StefanKarpinski commented Jun 29, 2017

This is somewhat speculative, but a possible part of #16107 remaining for 1.0. In short, the concept is the following:

Treat String the way (sane) file systems and most UNIX utilities do – as bags of bytes which are interpreted as UTF-8 to the extent possible.

Some aspects of this:

  1. Allow any data in a String object.
  2. Make String --> Char... --> String round-trip for any data including invalid UTF-8.
  3. Choose an indexing policy/behavior:
    1. Throw an error for indexing at a point that iteration would never index into
    2. Never throw an error, just decode as much as iterating from that point would produce

Item 2 entails some way of encoding invalid UTF-8 data as characters, which could be compatible with UTF-32 – use Int32 code units where negative values represent invalid data? – or incompatible with it – e.g. "exploded" UTF-8 – UTF-8 bytes padded with zeros, stored in UInt32 code units. The former maintains format compatibility for valid Unicode data, which is nice, but the latter might have significant performance benefits.
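
To make the two candidate representations concrete, here is a minimal sketch in Python (chosen only for illustration; it is not Julia's implementation). The function names are hypothetical, the `-b` mapping for a raw byte is just one possible negative pattern, and placing the UTF-8 bytes in the high-order end of the 32-bit unit is an assumption, since the proposal only says "padded with zeros".

```python
def as_int32(chunk, valid):
    """UTF-32-compatible option: code points keep their usual value,
    a raw (invalid) byte is stored as a negative number."""
    if valid:
        return ord(chunk.decode("utf-8"))
    return -chunk[0]  # hypothetical mapping; any negative pattern would do


def as_exploded(chunk):
    """'Exploded' option: the UTF-8 bytes themselves, zero-padded to 32 bits.
    Left-aligning the bytes is an assumption made for this sketch."""
    return int.from_bytes(chunk.ljust(4, b"\x00"), "big")


# One ASCII character, one two-byte character, one invalid byte.
for chunk, valid in [(b"a", True), (b"\xc3\xa9", True), (b"\xff", False)]:
    print(chunk, as_int32(chunk, valid), hex(as_exploded(chunk)))
```

In this sketch the exploded form needs no arithmetic to convert between the string bytes and the character value, which may be where the performance benefit mentioned above would come from; the Int32 form keeps valid characters identical to UTF-32.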

Part of #16107.

@StefanKarpinski StefanKarpinski added the strings "Strings!" label Jun 29, 2017
@StefanKarpinski StefanKarpinski added this to the 1.0 milestone Jun 29, 2017
@StefanKarpinski StefanKarpinski self-assigned this Jun 29, 2017
@nalimilan
Member

Choose an indexing policy/behavior:

i. Throw an error for indexing at a point that iteration would never index into
ii. Never throw an error, just decode as much as iterating from that point would produce

Solution i. is #22572. OTOH I doubt solution ii. would be really practical: if you index a perfectly valid UTF-8 string at an incorrect index, you would get an invalid char silently, rather than an error, which would be quite confusing. Though I guess it could be allowed with @inbounds.

@JeffBezanson JeffBezanson added triage This should be discussed on a triage call and removed triage This should be discussed on a triage call labels Sep 20, 2017
@StefanKarpinski
Member Author

The idea here is the following:

  • String values can hold arbitrary binary data.
  • Iterating a String yields a series of Char values – it will never throw an error.
  • Char can represent valid code points as well as individual bytes.
  • When decoding a String:
    • a sub-sequence of bytes which is valid UTF-8 yields a code point Char value;
    • a byte which is not part of a valid sub-sequence of bytes yields a byte Char value.
  • Indexing – if the byte at the index is:
    • the first byte of a valid UTF-8 code point encoding, that character is returned as a Char;
    • any other byte of a valid UTF-8 code point encoding, a UnicodeError is thrown;
    • not part of a valid UTF-8 code point encoding, that byte is returned as a Char.
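
The decoding rules above can be sketched as follows. This is an illustrative Python model, not the Julia implementation: `valid_len`, `chars`, and `to_bytes` are hypothetical names, a character is modelled as a `("cp", value)` or `("byte", value)` tuple, and Python's strict UTF-8 decoder is used to decide what counts as a valid encoding.

```python
def valid_len(data, i):
    """Length of the valid UTF-8 encoding starting at data[i], or 0 if none."""
    lead = data[i]
    if lead < 0x80:
        n = 1
    elif 0xC2 <= lead <= 0xDF:
        n = 2
    elif 0xE0 <= lead <= 0xEF:
        n = 3
    elif 0xF0 <= lead <= 0xF4:
        n = 4
    else:
        return 0  # continuation byte or a byte that can never lead
    if i + n > len(data):
        return 0  # truncated sequence
    try:
        data[i:i + n].decode("utf-8")  # rejects overlong forms, surrogates
    except UnicodeDecodeError:
        return 0
    return n


def chars(data):
    """Iterate arbitrary bytes as characters; never raises."""
    i = 0
    while i < len(data):
        n = valid_len(data, i)
        if n:
            yield ("cp", ord(data[i:i + n].decode("utf-8")))
            i += n
        else:
            yield ("byte", data[i])
            i += 1


def to_bytes(items):
    """Inverse of chars(): rebuild the original byte string."""
    return b"".join(
        chr(v).encode("utf-8") if kind == "cp" else bytes([v])
        for kind, v in items
    )


data = b"a\xc3\xa9\xff\xe2\x82"  # valid text, a stray byte, a truncated sequence
print(list(chars(data)))
print(to_bytes(chars(data)) == data)
```

Because every byte ends up either inside exactly one code point character or as its own byte character, converting back reproduces the input, which is the round-trip property the proposal asks for.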

@nalimilan
Member

nalimilan commented Sep 21, 2017

When decoding a String:
a sub-sequence of bytes which is valid UTF-8 yields a code point Char value;
a byte which is not part of a valid sub-sequence of bytes yields a byte Char value.

What do you mean exactly by "a byte Char value"? Wikipedia presents a few common strategies, for example UTF-8B, which Python 3.1 uses. EDIT: Whoops, UTF-8B does not apply here since we're talking about storing bytes inside a UInt32 codepoint. There doesn't seem to be any "standard" way to do this according to Wikipedia.

Indexing – if the byte at the index is:
the first byte of a valid UTF-8 code point encoding, that character is returned as a Char;
any other byte of a valid UTF-8 code point encoding, a UnicodeError is thrown;
not part of a valid UTF-8 code point encoding, that byte is returned as a Char.

Is there a reliable and efficient way of distinguishing the last two cases? If the index points to a byte which isn't a valid first byte of a codepoint, you can't know whether it's in the middle of a valid codepoint or just a byte outside of any codepoint without going back to the previous bytes (up to 3 bytes). Is this what you envisage when the index does not point to a valid first byte of a codepoint?

@StefanKarpinski
Member Author

What do you mean exactly by "a byte Char value"?

A value of Char which represents a raw byte instead of a code point. E.g. use values that are negative when reinterpreted as Int32 for this. The representation doesn't matter and certainly doesn't need to be standard since it's purely internal – using negative Int32 bit patterns would be compatible with the current Char representation since the representation of any valid string data doesn't change. When you write these back out to a UTF-8 encoded stream, you just print the byte. If someone tries to transcode it to any other encoding, then that's an error since it's not a valid code point. But being able to hold (and work with) arbitrary data in String values is useful and convenient, and so is being able to round-trip arbitrary data.
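
As a rough illustration of the output behaviour described here, the sketch below models characters as Python integers, with a raw byte `b` stored as `-b`. That mapping and the function names are assumptions made for the example; the comment above only requires that the pattern be negative as an Int32.

```python
def write_utf8(units):
    """Write characters to a UTF-8 stream: code points are encoded,
    raw bytes are emitted unchanged."""
    out = bytearray()
    for u in units:
        if u < 0:
            out.append(-u)  # raw byte (hypothetical -b mapping)
        else:
            out += chr(u).encode("utf-8")
    return bytes(out)


def transcode_utf16(units):
    """Transcoding to another encoding fails on raw bytes,
    since they are not code points."""
    for u in units:
        if u < 0:
            raise ValueError("raw byte 0x%02x is not a code point" % -u)
    return "".join(chr(u) for u in units).encode("utf-16-le")


print(write_utf8([97, -255, 233]))
print(transcode_utf16([97, 233]))
```

Writing to UTF-8 always succeeds and restores the original bytes, while any other target encoding has to reject the raw-byte values.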

Is there a reliable and efficient way of distinguishing the last two cases?

That's part of this that is somewhat exploratory, but if you look at an 8-byte word (7 bytes, really) starting at i-3 then you can definitely tell if the index is part of a valid UTF-8 code point encoding since that cannot be more than four bytes. In @inbounds mode, we would ignore the middle case and just handle the first and third options, which would likely be considerably less work.
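
A sketch of that look-back, again in Python for illustration only. `char_at` and its `checked` flag are hypothetical stand-ins for indexing with and without `@inbounds`, indices are 0-based here rather than Julia's 1-based, and Python's strict decoder decides validity. The checked path reads at most the bytes from `i-3` to `i+3`, matching the 7-byte window described above.

```python
def valid_len(data, i):
    """Length of the valid UTF-8 encoding starting at data[i], or 0 if none."""
    lead = data[i]
    if lead < 0x80:
        n = 1
    elif 0xC2 <= lead <= 0xDF:
        n = 2
    elif 0xE0 <= lead <= 0xEF:
        n = 3
    elif 0xF0 <= lead <= 0xF4:
        n = 4
    else:
        return 0
    if i + n > len(data):
        return 0
    try:
        data[i:i + n].decode("utf-8")
    except UnicodeDecodeError:
        return 0
    return n


def char_at(data, i, checked=True):
    # Case 1: the index is the first byte of a valid encoding.
    n = valid_len(data, i)
    if n:
        return ("cp", ord(data[i:i + n].decode("utf-8")))
    # Case 2 (skipped when unchecked): the byte lies inside a valid
    # encoding that starts one to three bytes earlier.
    if checked:
        for back in (1, 2, 3):
            j = i - back
            if j >= 0 and valid_len(data, j) > back:
                raise UnicodeError(
                    f"index {i} is inside the character starting at {j}"
                )
    # Case 3: the byte belongs to no valid encoding.
    return ("byte", data[i])


data = b"a\xe2\x82\xacb\x82"  # 'a', the three-byte euro sign, 'b', a stray byte
print(char_at(data, 1))
print(char_at(data, 5))
print(char_at(data, 2, checked=False))
```

With `checked=False` the middle case silently becomes a byte character, which is the cheaper behaviour suggested for `@inbounds`.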

@StefanKarpinski
Member Author

Done by #24999.
