String as more of a byte vector type? #22616

StefanKarpinski · 2017-06-29T18:08:01Z

This is somewhat speculative, but a possible part of #16107 remaining for 1.0. In short, the concept is the following:

Treat String the way (sane) file systems and most UNIX utilities do – as bags of bytes which are interpreted as UTF-8 to the extent possible.

Some aspects of this:

- Allow any data in a String object.
- Make String --> Char... --> String round-trip for any data including invalid UTF-8.
- Choose an indexing policy/behavior:
  1. Throw an error for indexing at a point that iteration would never index into
  2. Never throw an error, just decode as much as iterating from that point would produce

Item 2 entails some way of encoding invalid UTF-8 data as characters. That encoding could be compatible with UTF-32 (e.g. use Int32 code units where negative values represent invalid data) or incompatible with it (e.g. "exploded" UTF-8: the UTF-8 bytes padded with zeros and stored in UInt32 code units). The former maintains format compatibility for valid Unicode data, which is nice, but the latter might have significant performance benefits.
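To make the two candidate representations concrete, here is a minimal Python sketch. The function names, the `-(b + 1)` mapping for raw bytes, and the choice to pad on the right are all assumptions made for illustration; the issue does not fix any of these details.

```python
# Hypothetical illustration only; not Julia's actual Char layout.

def int32_style(value: int, raw_byte: bool) -> int:
    """UTF-32-compatible option: code points are stored unchanged,
    raw bytes become negative Int32 values.
    Assumption: byte b maps to -(b + 1) so that byte 0x00 stays distinct."""
    return -(value + 1) if raw_byte else value

def explode(encoded: bytes) -> int:
    """'Exploded' UTF-8 option: the 1 to 4 bytes of one character (or one
    stray byte), zero-padded into a 32-bit unit.
    Assumption: bytes sit in the high-order positions."""
    if not 1 <= len(encoded) <= 4:
        raise ValueError("expected 1 to 4 bytes")
    return int.from_bytes(encoded.ljust(4, b"\x00"), "big")

def implode(unit: int) -> bytes:
    """Inverse of explode: strip the zero padding again.
    A unit of 0 is the NUL character, which is a single 0x00 byte."""
    return unit.to_bytes(4, "big").rstrip(b"\x00") or b"\x00"
```

In this sketch the exploded form needs no decoding arithmetic in either direction, only a padded copy, which is presumably where the performance benefit would come from. The Int32 form keeps every valid code point bit-identical to UTF-32.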

Part of #16107.


nalimilan · 2017-06-29T22:15:11Z

> Choose an indexing policy/behavior:
>
> i. Throw an error for indexing at a point that iteration would never index into
> ii. Never throw an error, just decode as much as iterating from that point would produce

Solution i. is #22572. OTOH I doubt solution ii. would be really practical: if you index a perfectly valid UTF-8 string at an incorrect index, you would get an invalid char silently, rather than an error, which would be quite confusing. Though I guess it could be allowed with @inbounds.

StefanKarpinski · 2017-09-21T18:44:34Z

The idea here is the following:

- String values can hold arbitrary binary data.
- Iterating a String yields a series of Char values – it will never throw an error.
- Char can represent valid code points as well as individual bytes.
- When decoding a String:
  - a sub-sequence of bytes which is valid UTF-8 yields a code point Char value;
  - a byte which is not part of a valid sub-sequence of bytes yields a byte Char value.
- Indexing – if the byte at the index is:
  - the first byte of a valid UTF-8 code point encoding, that character is returned as a Char;
  - any other byte of a valid UTF-8 code point encoding, a UnicodeError is thrown;
  - not part of a valid UTF-8 code point encoding, that byte is returned as a Char.
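The decoding rules can be sketched in Python as follows. This is only a model of the proposed behavior, not Julia code: `valid_len` and `decode_chars` are hypothetical names, and a Char is modelled as a `("codepoint", n)` or `("byte", b)` tuple. Validity is delegated to Python's strict UTF-8 decoder, which rejects overlong forms and surrogates.

```python
def valid_len(data: bytes, i: int) -> int:
    """Length of the valid UTF-8 sequence starting at index i, or 0 if none."""
    for n in (1, 2, 3, 4):
        chunk = data[i:i + n]
        if len(chunk) < n:          # ran off the end of the data
            break
        try:
            chunk.decode("utf-8")   # strict decode of exactly n bytes
            return n
        except UnicodeDecodeError:
            continue
    return 0

def decode_chars(data: bytes):
    """Iterate over arbitrary bytes without ever raising."""
    i = 0
    while i < len(data):
        n = valid_len(data, i)
        if n:
            # A valid sub-sequence yields a code point Char.
            yield ("codepoint", ord(data[i:i + n].decode("utf-8")))
            i += n
        else:
            # Any other byte yields a byte Char, one byte at a time.
            yield ("byte", data[i])
            i += 1
```

For example, `list(decode_chars(b"a\xc3\xa9\xff"))` gives one Char for `a`, one for `é`, and one byte Char for the stray `0xff`.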

nalimilan · 2017-09-21T20:50:10Z

> When decoding a String:
> a sub-sequence of bytes which is valid UTF-8 yields a code point Char value;
> a byte which is not part of a valid sub-sequence of bytes yields a byte Char value.

What do you mean exactly by "a byte Char value"? Wikipedia presents a few common strategies, for example UTF-8b, which Python 3.1 uses. EDIT: Whoops, UTF-8b does not apply here since we're talking about storing bytes inside a UInt32 codepoint. There doesn't seem to be any "standard" way to do this according to Wikipedia.

> Indexing – if the byte at the index is:
> the first byte of a valid UTF-8 code point encoding, that character is returned as a Char;
> any other byte of a valid UTF-8 code point encoding, a UnicodeError is thrown;
> not part of a valid UTF-8 code point encoding, that byte is returned as a Char.

Is there a reliable and efficient way of distinguishing the last two cases? If the index points to a byte which isn't a valid first byte of a codepoint, you can't know whether it's the middle of a valid codepoint, or just a byte outside of a codepoint, without going back to the previous bytes (up to 3 bytes). Is this what you envisage in case the index does not point to a valid first byte of a codepoint?

StefanKarpinski · 2017-09-21T21:42:53Z

> What do you mean exactly by "a byte Char value"?

A value of Char which represents a raw byte instead of a code point. E.g. use values that are negative when reinterpreted as Int32 for this. The representation doesn't matter and certainly doesn't need to be standard since it's purely internal – using negative Int32 bit patterns would be compatible with the current Char representation since the representation of any valid string data doesn't change. When you write these back out to a UTF-8 encoded stream, you just print the byte. If someone tries to transcode it to any other encoding, then that's an error since it's not a valid code point. But being able to hold (and work with) arbitrary data in String values is useful and convenient, and so is being able to round-trip arbitrary data.
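The output side of this can be sketched the same way. In the Python model below a Char is a `("codepoint", n)` or `("byte", b)` tuple, and `write_utf8` and `transcode_utf16` are hypothetical names chosen for illustration. UTF-16 stands in for "any other encoding", and `ValueError` stands in for whatever error the real implementation would raise.

```python
def write_utf8(chars) -> bytes:
    """Write Chars to a UTF-8 stream: code points are encoded,
    byte Chars are emitted unchanged, so arbitrary data round-trips."""
    out = bytearray()
    for kind, value in chars:
        if kind == "codepoint":
            out += chr(value).encode("utf-8")
        else:
            out.append(value)
    return bytes(out)

def transcode_utf16(chars) -> bytes:
    """Transcode to UTF-16 (little-endian, no BOM).
    A byte Char is not a code point, so this is an error."""
    out = bytearray()
    for kind, value in chars:
        if kind != "codepoint":
            raise ValueError(f"raw byte 0x{value:02x} is not a valid code point")
        out += chr(value).encode("utf-16-le")
    return bytes(out)
```

With the Chars for `a`, `é` and a stray `0xff`, `write_utf8` reproduces the original bytes `b"a\xc3\xa9\xff"`, while `transcode_utf16` raises on the last element.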

> Is there a reliable and efficient way of distinguishing the last two cases?

That part is somewhat exploratory, but if you look at an 8-byte word (7 bytes, really) starting at i-3 then you can definitely tell if the index is part of a valid UTF-8 code point encoding, since such an encoding cannot be more than four bytes. In @inbounds mode, we would ignore the middle case and just handle the first and third options, which would likely be considerably less work.
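A minimal Python sketch of the three indexing cases, using the look-back described here. The names `valid_len` and `char_at` are hypothetical, indices are 0-based (unlike Julia's), and `ValueError` stands in for UnicodeError. Only bytes from `i - 3` to `i + 3` are ever inspected, which is the 7-byte window mentioned above.

```python
def valid_len(data: bytes, i: int) -> int:
    """Length of the valid UTF-8 sequence starting at index i, or 0 if none."""
    for n in (1, 2, 3, 4):
        chunk = data[i:i + n]
        if len(chunk) < n:
            break
        try:
            chunk.decode("utf-8")   # strict: rejects overlongs and surrogates
            return n
        except UnicodeDecodeError:
            continue
    return 0

def char_at(data: bytes, i: int):
    # Case 1: i is the first byte of a valid encoding.
    n = valid_len(data, i)
    if n:
        return ("codepoint", ord(data[i:i + n].decode("utf-8")))
    # Case 2: a valid encoding starting 1 to 3 bytes earlier covers i.
    for back in (1, 2, 3):
        start = i - back
        if start >= 0 and valid_len(data, start) > back:
            raise ValueError(f"index {i} is inside the character at {start}")
    # Case 3: the byte belongs to no valid encoding.
    return ("byte", data[i])
```

Skipping the loop for case 2 corresponds to the cheaper @inbounds behavior: only the first and third cases are distinguished.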

StefanKarpinski · 2017-12-15T03:34:36Z

Done by #24999.

StefanKarpinski added the strings label Jun 29, 2017

StefanKarpinski added this to the 1.0 milestone Jun 29, 2017

StefanKarpinski self-assigned this Jun 29, 2017

JeffBezanson added and removed the triage label Sep 20, 2017

nalimilan mentioned this issue Oct 25, 2017

Restrict indexing into strings to a special ByteIndex or StringIndex type #9297

Closed

StefanKarpinski closed this as completed Dec 15, 2017
