Restrict indexing into strings to a special `ByteIndex` or `StringIndex` type #9297

nalimilan · 2014-12-10T08:47:35Z

This is a proposal that's been done in Rust and apparently is still under discussion (see rust-lang/rust#10044 (comment)), but I thought it could be interesting for Julia.

The idea is that since indexing strings with a number like s[3] only makes sense when 3 actually corresponds to the boundary of a unicode code point, it represents a trap for developers who only test it on ASCII, making bugs appear only in production when used with non-ASCII text. Typical cases are the naive:

julia> s = "noël";
julia> s[4] # Thinking that this provides the fourth "character"
ERROR: invalid UTF-8 character index
 in next at ./utf8.jl:68
 in getindex at string.jl:57

the tempting:

julia> s[end - 1] # Thinking you skip the last character, works well... until it breaks
ERROR: invalid UTF-8 character index
 in next at ./utf8.jl:68
 in getindex at string.jl:57

(Julia equivalent of http://www.reddit.com/r/rust/comments/1zlq21/should_rust_be_more_careful_with_unicode/cfush88)

or the slightly more involved:

julia> s[match(r"l", s).offset - 1]
ERROR: invalid UTF-8 character index
 in next at ./utf8.jl:68
 in getindex at string.jl:57

Instead of letting people do incorrect-but-easy things like this, it could be useful to restrict string indexing to a special type, say ByteIndex, instead of a plain integer. match would provide offset as that type too. It would prevent both naive indexing using integers as well as doing incorrect arithmetic on indexes you get from functions, encouraging people to always use dedicated functions.

It might also make sense to allow arithmetic operations on this type, so that idx + 1 means "the code point after the one at position idx", which would be O(n) but starting from the index -- and usually you don't take very large offsets. I'm not saying this is necessarily a good idea, though, because ByteIndex implies a reasoning in bytes, and then arithmetic operations would switch to a reasoning in code points. It could be named StringIndex instead, and made opaque so that people never see the integer index which is in bytes.

Finally, it might be possible to perform some optimizations by removing checks that the index corresponds to the start of a code point, if the index held a reference to the string it was build from, so that it can be checked that it matches the indexed string. Not sure it would be significant, though.

The text was updated successfully, but these errors were encountered:

elextr · 2014-12-10T09:25:11Z

Remove indexing totally, only allow accessing strings via iterators? Access is O(1), and it knows the string it applies to.

nalimilan · 2014-12-10T10:14:04Z

@elextr Not remove: as the title says, restrict it to ByteIndex/StringIndex.

elextr · 2014-12-10T10:44:07Z

@nalimilan I was continuing on from your last comment about having a reference to the string in the index. That makes it possible to use it as an iterator, so there is no chance of applying it to the wrong string, and it can be passed to functions as a single parameter. And moving away from the indexing syntax removes the implication that it is guaranteed to be O(1).

nalimilan · 2014-12-10T11:19:52Z

Ah, OK. I'm not sure this is very practical since indexing is quite a common pattern, and anyway you will need a function to extract a character or a substring at a given position. Though that would allow for more explicit names like slice_bytes or slice_graphemes, as was proposed in the Rust issue.

elextr · 2014-12-10T11:43:37Z

Yes, byte(ByteIterator), char(CodepointIterator) and grapheme(GraphemeIterator) would definitely make it clear that different things are being found, especially as the last may be more than one codepoint which may be more than one byte whereas string[xxxIndex] doesn't really make it clear it returns an xxx. Looks more interesting the more I think about it, but as you say it may be just too radical for a language steeped in the [] index paradigm.

StefanKarpinski · 2014-12-10T15:07:00Z

I've considered that kind of StringIndex approach but felt that having the index reference the string object might cause performance issues and that it seemed like a kind of clunky, annoying API. If you want to play around with it and see if you can mock something up that has good performance and isn't annoying to use, that would be good and we could consider using it.

nalimilan · 2014-12-10T15:30:54Z

@StefanKarpinski In my mind the reference to the string object was only meant as an optimization, so if it hurts it could be avoided. The minimal version of StringIndex is just a wrapper around an integer; other points are just suggestions. What parts of my proposal do you consider worth investigating?

johnmyleswhite · 2014-12-10T15:43:37Z

I've argued before that indexing should be removed completely in the future, so I'll argue that again.

(1) We know there's no clearly right definition of indexing. This proposal is meant to force people to remove ambiguity about what they hope indexing will mean. So it's a huge step forward. But there's still point (2).

(2) I believe a lot of the people who will find indexing into strings intuitive will be distraught to learn that you can't mutate strings via setindex!. This is why I think indexing is really an operation best left for the mutable bytes level that shadows strings.

nalimilan · 2014-12-10T16:10:12Z

Actually I'd be rather in favor of removing string indexing. Making strings look like array sounds too confusing to me, because of the immutability, but also because of the absence of equivalence between bytes and codepoints.

But we would need to identify common use cases and how they would work with a new API. For example, the incorrect s[match(r"l", s).offset - 1] could become grapheme(s, match(r"l", s).offset - 1) or codepoint(s, match(r"l", s).offset - 1), depending on the application. This could work if match().offset returned a ByteIndex/StringIndex. Would an efficient indexing method based on bytes still be needed? If so, in which cases? Could it be completely avoided, with enough functions returning a ByteIndex/StringIndex for more safety?

nalimilan · 2014-12-10T16:26:20Z

FWIW, I've just bumped into the Rust issue where string indexing was removed: rust-lang/rust#12710 See also http://doc.rust-lang.org/guide-strings.html But they still allow taking byte slices via an explicit call, raising a run-time error on things like "aé".slice(1, 2). Issue rust-lang/rust#10044 was precisely about renaming slice and friends to something more explicit, which sounds like a very sane approach.

Rust also provides iterators for bytes, codepoints and graphemes, which must be chosen explicitly. This also sounds like a good idea to me, removing all issues from naive uses of Unicode strings. And it's possible thanks to @stevengj's PR #9261.

Go doesn't sound like a good model to follow IMHO, not least because they allow arbitrary bytes (in my words, "random garbage") in their str type, and therefore also allow indexing in the middle of codepoints... Though they still special-case UTF-8 in iteration. :-/ http://blog.golang.org/strings

StefanKarpinski · 2014-12-10T17:18:27Z

I don't think disallowing indexing into strings is helpful or sensible. You need a way of talking about positions in strings. Otherwise how do you talk about things like "give me the next character after this position in this string" or "give me the substring between this position and this position"? You can't use characters to talk about positions because characters can occur multiple times and even if they didn't this would imply doing an O(n) search to find a position, which is clearly a non-starter.

What might be sensible is disallowing arithmetic with indices into strings. Whenever someone writes str[i+1] they are writing Unicode-unsafe code and the Unicode-correct version should be str[nextind(str,i)]. The idea behind keeping a reference to the string as part of the StringIndex type is to do something like this:

immutable StringIndex{S<:String}
    string::S
    index::Int
end

Base.getindex(s::String, si::StringIndex) =
    s === si.string ? s[si.index] : s[chr2ind(s, ind2chr(si.string, si.index))]

function +(si::StringIndex, j::Integer)
    j < 0 && return si - (-j)
    i = si.index
    while j > 0
        i = nextind(si.string, i)
        j -= 1
    end
    return StringIndex(si.string, i)
end

function -(si::StringIndex, j::Integer)
    j < 0 && return si + (-j)
    i = si.index
    while j > 0
        i = prevind(si.string, i)
        j -= 1
    end
    return StringIndex(si.string, i)
end

That's pretty complicated code given that by far the most common use case is str[i + 1] which you want to just boil down to str[nextind(str,i)], which basically necessitates very aggressive inlining – more aggressive that we're currently able to do with the above code.

StefanKarpinski · 2014-12-10T17:47:22Z

I've tried a bunch of tricks to try to make this StringIndex approach generate efficient code and I can't manage it. It may be possible, but I'm wary of an approach that makes this so difficult to optimize. Still, it's plausible to have a StringIndex type that is a wrapper around Int disallowing index arithmetic.

nalimilan · 2014-12-10T18:30:41Z

I think there are two issues:

Whether indexing using [] should be allowed: this is a question of syntax/API. Anyway there needs to be a way to extract a codepoint, grapheme or substring.
What kind of indexing should be supported, either by getindex or dedicated functions. And whether arithmetic should be supported on that index.

Regarding 1), I don't have a strong opinion though I'm not a fan of indexing strings. (Actually I'm more concerned about iteration having different legitimate meanings.)

Regarding 2), I agree with @StefanKarpinski that inefficient arithmetic would be a problem, as it would let people use str[i + 1] when the most efficient solution is str[nextind(str,i)] (this could be fixed if/when function specialization depending on argument values was possible). But StringIndex could start as a wrapper to an int, without arithmetic, and it would already be a protection against incorrect uses. Then improvements could always come later. Though as I noted at the end of the description of the issue, storing a reference to the string could allow for some optimizations, even without allowing arithmetic, since it would save some checks in getindex about whether the index is in the middle of a code point or not.

johnmyleswhite · 2014-12-10T19:03:25Z

@StefanKarpinski: I don't believe the concept of "position" has any coherent meaning for strings. After all, our current indexing rules refer to indexing into bytes, whereas Python 3's indexing rules refer to indexing into codepoints. In light of multiple ways to break strings apart into pieces, [] is a fundamentally ambiguous operation -- and I personally think our choice of defaulting to byte indices doesn't offer a good performance/safety tradeoff. By the time you're indexing into bytes for performance, you should be talking about a mutable array of bytes that derives from a specific string encoding that you are fully aware of. For safety, though, you want to talk about codepoints or graphemes without reference to bytes or even encodings.

nalimilan · 2014-12-10T19:14:10Z

@johnmyleswhite No, the byte offset is a good index of a position within a string: it's like a pointer to a memory area. But it's an implementation detail which should never leak to the user. You should be able to ask for the grapheme at position X (in bytes), given that this position is obtained as an opaque StringIndex from another function like match.

StefanKarpinski · 2014-12-10T19:19:20Z

The consequences of not having any notion of position for strings (even an opaque one that just happens to correspond to byte offsets) would be massive string API bloat and complication. Consider the very simple question: given some string, does "foo" occur earlier than "bar" as a substring, assuming both occur? With a notion of position that is at least ordered, we can answer this question very simply:

search(s, "foo")[1] < search(s, "bar")[1]

Without an ordered notion of position, this becomes much harder. What does search(s, "foo") even return? Does it partition s into three substrings: before the match, the match itself, and after the match? That's a super weird, awkward API when all you wanted to know was where something occurs. Let's assume that's how it works (I can't think of anything else if we banish the concept of string positions). Then you could do this:

contains(search(s, "foo").before, "bar")

Of course, that only works for "foo" and "bar" because they share no letters and cannot overlap. What if the strings are "abba" and "baab" instead? I'm not even going to try to express the logic for figuring out which one occurs first in a string without a notion of position.

StefanKarpinski · 2014-12-10T19:30:14Z

I guess one way of doing this would be to compare the lengths of the before parts:

length(search(s, "foo").before) < length(search(s, "bar").before)

Of course, that adds an extra pair of O(n) operations to the solution, which is lousy.

kmsquire · 2014-12-10T20:37:44Z

One idea I've toyed with in the past is maintaining a list of positions
within a string containing the positions and length of characters (code
points or graphemes) represented by more than one byte (or more than the
default if most characters are multibyte).

Indexing into code points or graphemes then becomes well defined. For
ASCII, we get byte indexing, which is fast. For text with few multibyte
code points/graphemes, indexing requires a binary search of a small array.
For Asian character sets, I believe most code points are two bytes, so the
same applies.

The downside, of course, is the overhead of an extra array, which would be
significant when working with large numbers of short strings.

On Wednesday, December 10, 2014, Stefan Karpinski notifications@github.com
wrote:

I guess one way of doing this would be to compare the lengths of the
before parts:

length(search(s, "foo").before) < length(search(s, "bar").before)

Of course, that adds an extra pair of O(n) operations to the solution,
which is lousy.

—
Reply to this email directly or view it on GitHub
#9297 (comment).

stevengj · 2014-12-10T20:44:18Z

@kmsquire, I think the usual feeling (when discussions of variable-length encodings come up) is that schemes like yours are unnecessary, because randomly accessing the n-th codepoint or grapheme in a string is actually quite rare. Nearly all string processing is either sequential or sequential search followed by random access at the indices returned by search (or sequential accesses starting at these indices). That is, you only ever need "random" access at indices that are valid by construction (because they are merely cached).

@StefanKarpinski, I like the idea of an AbstractStringIndex type that is simply a wrapper around Int and supports isless and nextind(s,i) but not +. Seems like it would cut down on a lot of bugs, would have no performance overhead (because it would inline to an integer), and hopefully wouldn't be too annoying to use. You'd have concrete subtypes UTF8StringIndex etcetera, which would prevent you from accidentally using e.g. a UTF-8 string index with a UTF-16 copy of the same string. Of course, it wouldn't prevent you from using a UTF-8 string index with a completely different UTF-8 string, but this kind of mistake (if it is a mistake...it could actually be done correctly if you are careful to do it only if the second string is a substring of the first) seems less likely and would still throw an exception for an invalid index.

StefanKarpinski · 2014-12-10T21:00:29Z

It feels weirdly half-baked to prevent some kinds of cross-indexing errors but not others. In the above StringIndex example code I wrote, you'll note that I do the right thing and translate through code points if you use an index from one string into another. The thing I like about that is that it is pretty annoying to write that code and it's nice that it can just do it for you. Of course, the issue is that I couldn't coax our compiler stack into eliminating this check – it doesn't seem to know that an object is always === to itself. @JeffBezanson, any way we could teach our system that fact?

stevengj · 2014-12-10T21:02:33Z

It doesn't seem weird to me. The main point of a StringIndex type is to detect the common s[i+1] errors, not to prevent BoundsError() in general. The fact that it also detects certain kinds of cross-indexing errors is merely a bonus.

elextr · 2014-12-10T21:03:41Z

@StefanKarpinski an iterator is a position in a string, and, with the usual implementation of pointer to string and offset, it can be ordered simply by overloading <to consider only the offset. Arithmetic will be possible with overloading of + and - and will then DTRT with respect to incrementing by byte, codepoint or graphene and the extraction functions byte(), codepoint() and graphene() can also take an offset argument and again DTRT.

JeffBezanson · 2014-12-10T21:46:44Z

@StefanKarpinski yes we can add codegen cases for ===. We could check if the two arguments are the same symbol, or the same LLVM Value* (plus some constraints probably). Would that address this case?

StefanKarpinski · 2014-12-10T21:57:41Z

Unfortunately, it seems like neither of those things suffice. The necessary steps seem to be inlining of the getindex call and elimination of the construction of the StringIndex object altogether, turning si.string directly into just s, at which point LLVM might be able to tell that they're actually the same object. All that isn't happening in LLVM right now. Seems moderately unreasonable to expect it to.

stevengj · 2014-12-11T00:35:51Z

@elextr, technically an iterable type in Julia is not a position in the string. The iterable combined with the iteration state (which is created by start) is what defines the position. In the case of a string, the iteration state is currently just the index. Unless you are defining "iterator" as the (iterable, state) pair? I suppose that would be the analogue of iterator objects in Python.

elextr · 2014-12-11T01:08:34Z

@JeffBezanson yes, I was thinking like Python/C++ iterators where the state is part of the iterator object.

(my excuse is that searching the docs for "iterator" does not take you to the section http://docs.julialang.org/en/latest/stdlib/base/#iteration :)

nalimilan · 2014-12-13T12:26:45Z

@StefanKarpinski What's the use case for indexing a string with an index obtained on another one? Even if done correctly (by counting code points or graphemes), I guess it wouldn't make much sense unless the two strings start with exactly the same content. And if that's the case, people would better directly index at a given byte offset instead of using the code you suggested, which is is O(n) for different strings.

In short, I think it would make for a more reasonable design if cross-indexing of strings was not supported by getindex. Instead, functions like slice_bytes, slice_codepoint, slice_grapheme should be provided, and used depending on whether you know the strings start with the same content (use the first) or not (use one of the two others -- but this case isn't very frequent AFAICT).

StefanKarpinski · 2014-12-14T19:39:16Z

Slightly contrived example: you have a data parsing framework that lets you specify fields using something like this fmt = "[[[[ ]]].[[ ]]]]]". You parse the format string and get character offsets and then use those to parse out parts of other strings that contain actual data.

PallHaraldsson · 2015-07-23T13:34:02Z

@nalimilan: "No, the byte offset is a good index of a position within a string" - Not even this is safe - [in my idea of a string implementation.. That is, if position means everything after that byte follows in the string.]

I support this issue. Would like to see indexing gone in 0.4 (or deprecated).

malmaud · 2015-07-23T13:44:28Z

This might be a good candidate for .5, where lots of changes to indexing semantics is planned.

PallHaraldsson · 2015-07-23T14:01:15Z

Dropping in 0.5 might be good enough, but do we then need to deprecated in 0.4? Not sure how it is done, just documenting it or needs there be a warning? Anyway I think that is a minor change to code to emit warnings and will not break any code.. [Dropping indexing-functionality is also simple, no, except breaks code.. Not sure if much code breaks however..]

malmaud · 2015-07-23T14:07:46Z

Just to be clear, I'm not necessarily advocating for the removal of string indexing - just pointing out that changes to string indexing behavior might as well happen at the same time as changes to array indexing behavior.

andreasnoack · 2017-10-25T12:48:44Z

Two comments:

Ellide immutable allocation in some simple cases #15259 should be relevant to this discussion since the discussion of overhead from keeping a reference to a String predates the PR. With enough inlining, it might be possible to avoid the overhead.
I'm working on a project where we have to iterate large text files and quite a few of them have invalid UTF8 byte sequences. Throwing when a sequence is invalid makes it inconvenient to handle some of these files. Hence locally, I've decided to change next to return '�' in those cases. This is going a bit in the opposite direction of Make getindex for String check if indices are valid #22572 which throws even more but as mentioned in Improvement of the functions for handling string indexing #23765 this might actually be a problem with the getindex API for strings. Hence, I'd be in favor of a string API that tried to hide the buffer and interpreted a string a sequence of characters (which is like the current string iterator) instead of a sequence of bytes. That would probably mean to stop having the getindex API since it would be too inefficient. Improvement of the functions for handling string indexing #23765 seems like a good direction.

nalimilan · 2017-10-25T13:42:58Z

It would indeed be interesting to try again with #15259 and see whether performance has improved.

Regarding your second point, it seems that the issue of invalid sequences is mostly orthogonal to StringIndex. The same discussions can be developed only with the current API. The revelant issue for this is #22616. next would return Chars with invalid Unicode code points when the data is corrupt instead of throwing an error. getindex would do the same when pointed to a byte which is not part of a valid character. But it would still throw an error when pointing to the middle of a valid char, so it's not opposed to #22572 AFAICT.

The area where StringIndex could help is that we could rely on the assumption that an index does not point to the middle of a valid character, so we could avoid some checks. But these would only be needed for invalid sequences (i.e. continuation bytes not part of a character), since indices to valid characters will never be continuation bytes, so the overhead may not be a serious issue in practice.

nalimilan added the domain:unicode Related to unicode characters and encodings label Dec 10, 2014

stevengj added the needs decision A decision on this change is needed label Dec 10, 2014

quinnj mentioned this issue Jun 16, 2015

Add check_string function that is more generic, thanks to Encodings quinnj/Strings.jl#2

Open

PallHaraldsson mentioned this issue Jul 23, 2015

UTF-8 combining characters and normalization in reverse() #6165

Closed

stevengj mentioned this issue Nov 26, 2015

Invalid second UTF8String indices don't cause errors #14158

Closed

JobJob mentioned this issue Nov 27, 2015

Fixes #14158 #14160

Closed

nalimilan mentioned this issue Jan 18, 2016

RFC, WIP: highlander #14383

Closed

nalimilan mentioned this issue Mar 21, 2016

ReshapedArrays, take2 #15449

Merged

quinnj mentioned this issue Apr 28, 2016

Stringapalooza #16107

Closed

32 tasks

This was referenced Jul 16, 2016

updated to utf8proc v2.0.1, gives support for Unicode 9 #17402

Merged

Chop optional number of characters #17457

Closed

nalimilan mentioned this issue Aug 29, 2016

Improving performance of done(s::AbstractString, i) #18081

Closed

StefanKarpinski added design Design of APIs or of the language itself and removed needs decision A decision on this change is needed labels Sep 13, 2016

StefanKarpinski added this to the 0.6.0 milestone Sep 13, 2016

StefanKarpinski self-assigned this Sep 13, 2016

JeffBezanson modified the milestones: 1.0, 0.6.0 Dec 29, 2016

JeffBezanson modified the milestones: 2.0+, 1.0 May 2, 2017

nalimilan mentioned this issue Jun 28, 2017

Make getindex for String check if indices are valid #22572

Merged

nalimilan mentioned this issue Sep 19, 2017

Improvement of the functions for handling string indexing #23765

Closed

stevengj mentioned this issue Nov 16, 2017

make reverseind generic #24613

Closed

StefanKarpinski closed this as completed Dec 15, 2017

nalimilan mentioned this issue Feb 23, 2018

Issue with splice! on array of strings #25561

Open

stevengj mentioned this issue Feb 28, 2018

keys, eachindex and linearindices consistency #25901

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Restrict indexing into strings to a special `ByteIndex` or `StringIndex` type #9297

Restrict indexing into strings to a special `ByteIndex` or `StringIndex` type #9297

nalimilan commented Dec 10, 2014

elextr commented Dec 10, 2014

nalimilan commented Dec 10, 2014

elextr commented Dec 10, 2014

nalimilan commented Dec 10, 2014

elextr commented Dec 10, 2014

StefanKarpinski commented Dec 10, 2014

nalimilan commented Dec 10, 2014

johnmyleswhite commented Dec 10, 2014

nalimilan commented Dec 10, 2014

nalimilan commented Dec 10, 2014

StefanKarpinski commented Dec 10, 2014

StefanKarpinski commented Dec 10, 2014

nalimilan commented Dec 10, 2014

johnmyleswhite commented Dec 10, 2014

nalimilan commented Dec 10, 2014

StefanKarpinski commented Dec 10, 2014

StefanKarpinski commented Dec 10, 2014

kmsquire commented Dec 10, 2014

stevengj commented Dec 10, 2014

StefanKarpinski commented Dec 10, 2014

stevengj commented Dec 10, 2014

elextr commented Dec 10, 2014

JeffBezanson commented Dec 10, 2014

StefanKarpinski commented Dec 10, 2014

stevengj commented Dec 11, 2014

elextr commented Dec 11, 2014

nalimilan commented Dec 13, 2014

StefanKarpinski commented Dec 14, 2014

PallHaraldsson commented Jul 23, 2015

malmaud commented Jul 23, 2015

PallHaraldsson commented Jul 23, 2015

malmaud commented Jul 23, 2015

andreasnoack commented Oct 25, 2017

nalimilan commented Oct 25, 2017

Restrict indexing into strings to a special ByteIndex or StringIndex type #9297

Restrict indexing into strings to a special ByteIndex or StringIndex type #9297

Comments

nalimilan commented Dec 10, 2014

elextr commented Dec 10, 2014

nalimilan commented Dec 10, 2014

elextr commented Dec 10, 2014

nalimilan commented Dec 10, 2014

elextr commented Dec 10, 2014

StefanKarpinski commented Dec 10, 2014

nalimilan commented Dec 10, 2014

johnmyleswhite commented Dec 10, 2014

nalimilan commented Dec 10, 2014

nalimilan commented Dec 10, 2014

StefanKarpinski commented Dec 10, 2014

StefanKarpinski commented Dec 10, 2014

nalimilan commented Dec 10, 2014

johnmyleswhite commented Dec 10, 2014

nalimilan commented Dec 10, 2014

StefanKarpinski commented Dec 10, 2014

StefanKarpinski commented Dec 10, 2014

kmsquire commented Dec 10, 2014

stevengj commented Dec 10, 2014

StefanKarpinski commented Dec 10, 2014

stevengj commented Dec 10, 2014

elextr commented Dec 10, 2014

JeffBezanson commented Dec 10, 2014

StefanKarpinski commented Dec 10, 2014

stevengj commented Dec 11, 2014

elextr commented Dec 11, 2014

nalimilan commented Dec 13, 2014

StefanKarpinski commented Dec 14, 2014

PallHaraldsson commented Jul 23, 2015

malmaud commented Jul 23, 2015

PallHaraldsson commented Jul 23, 2015

malmaud commented Jul 23, 2015

andreasnoack commented Oct 25, 2017

nalimilan commented Oct 25, 2017

Restrict indexing into strings to a special `ByteIndex` or `StringIndex` type #9297

Restrict indexing into strings to a special `ByteIndex` or `StringIndex` type #9297