Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restrict indexing into strings to a special ByteIndex or StringIndex type #9297

Closed
nalimilan opened this issue Dec 10, 2014 · 92 comments
Closed
Assignees
Labels
breaking This change will break code design Design of APIs or of the language itself unicode Related to unicode characters and encodings
Milestone

Comments

@nalimilan
Copy link
Member

This is a proposal that's been done in Rust and apparently is still under discussion (see rust-lang/rust#10044 (comment)), but I thought it could be interesting for Julia.

The idea is that since indexing strings with a number like s[3] only makes sense when 3 actually corresponds to the boundary of a unicode code point, it represents a trap for developers who only test it on ASCII, making bugs appear only in production when used with non-ASCII text. Typical cases are the naive:

julia> s = "noël";
julia> s[4] # Thinking that this provides the fourth "character"
ERROR: invalid UTF-8 character index
 in next at ./utf8.jl:68
 in getindex at string.jl:57

the tempting:

julia> s[end - 1] # Thinking you skip the last character, works well... until it breaks
ERROR: invalid UTF-8 character index
 in next at ./utf8.jl:68
 in getindex at string.jl:57

(Julia equivalent of http://www.reddit.com/r/rust/comments/1zlq21/should_rust_be_more_careful_with_unicode/cfush88)

or the slightly more involved:

julia> s[match(r"l", s).offset - 1]
ERROR: invalid UTF-8 character index
 in next at ./utf8.jl:68
 in getindex at string.jl:57

Instead of letting people do incorrect-but-easy things like this, it could be useful to restrict string indexing to a special type, say ByteIndex, instead of a plain integer. match would provide offset as that type too. It would prevent both naive indexing using integers as well as doing incorrect arithmetic on indexes you get from functions, encouraging people to always use dedicated functions.

It might also make sense to allow arithmetic operations on this type, so that idx + 1 means "the code point after the one at position idx", which would be O(n) but starting from the index -- and usually you don't take very large offsets. I'm not saying this is necessarily a good idea, though, because ByteIndex implies a reasoning in bytes, and then arithmetic operations would switch to a reasoning in code points. It could be named StringIndex instead, and made opaque so that people never see the integer index which is in bytes.

Finally, it might be possible to perform some optimizations by removing checks that the index corresponds to the start of a code point, if the index held a reference to the string it was build from, so that it can be checked that it matches the indexed string. Not sure it would be significant, though.

@nalimilan nalimilan added the unicode Related to unicode characters and encodings label Dec 10, 2014
@elextr
Copy link

elextr commented Dec 10, 2014

Remove indexing totally, only allow accessing strings via iterators? Access is O(1), and it knows the string it applies to.

@nalimilan
Copy link
Member Author

@elextr Not remove: as the title says, restrict it to ByteIndex/StringIndex.

@elextr
Copy link

elextr commented Dec 10, 2014

@nalimilan I was continuing on from your last comment about having a reference to the string in the index. That makes it possible to use it as an iterator, so there is no chance of applying it to the wrong string, and it can be passed to functions as a single parameter. And moving away from the indexing syntax removes the implication that it is guaranteed to be O(1).

@nalimilan
Copy link
Member Author

Ah, OK. I'm not sure this is very practical since indexing is quite a common pattern, and anyway you will need a function to extract a character or a substring at a given position. Though that would allow for more explicit names like slice_bytes or slice_graphemes, as was proposed in the Rust issue.

@elextr
Copy link

elextr commented Dec 10, 2014

Yes, byte(ByteIterator), char(CodepointIterator) and grapheme(GraphemeIterator) would definitely make it clear that different things are being found, especially as the last may be more than one codepoint which may be more than one byte whereas string[xxxIndex] doesn't really make it clear it returns an xxx. Looks more interesting the more I think about it, but as you say it may be just too radical for a language steeped in the [] index paradigm.

@StefanKarpinski
Copy link
Member

I've considered that kind of StringIndex approach but felt that having the index reference the string object might cause performance issues and that it seemed like a kind of clunky, annoying API. If you want to play around with it and see if you can mock something up that has good performance and isn't annoying to use, that would be good and we could consider using it.

@nalimilan
Copy link
Member Author

@StefanKarpinski In my mind the reference to the string object was only meant as an optimization, so if it hurts it could be avoided. The minimal version of StringIndex is just a wrapper around an integer; other points are just suggestions. What parts of my proposal do you consider worth investigating?

@johnmyleswhite
Copy link
Member

I've argued before that indexing should be removed completely in the future, so I'll argue that again.

(1) We know there's no clearly right definition of indexing. This proposal is meant to force people to remove ambiguity about what they hope indexing will mean. So it's a huge step forward. But there's still point (2).

(2) I believe a lot of the people who will find indexing into strings intuitive will be distraught to learn that you can't mutate strings via setindex!. This is why I think indexing is really an operation best left for the mutable bytes level that shadows strings.

@nalimilan
Copy link
Member Author

Actually I'd be rather in favor of removing string indexing. Making strings look like array sounds too confusing to me, because of the immutability, but also because of the absence of equivalence between bytes and codepoints.

But we would need to identify common use cases and how they would work with a new API. For example, the incorrect s[match(r"l", s).offset - 1] could become grapheme(s, match(r"l", s).offset - 1) or codepoint(s, match(r"l", s).offset - 1), depending on the application. This could work if match().offset returned a ByteIndex/StringIndex. Would an efficient indexing method based on bytes still be needed? If so, in which cases? Could it be completely avoided, with enough functions returning a ByteIndex/StringIndex for more safety?

@nalimilan
Copy link
Member Author

FWIW, I've just bumped into the Rust issue where string indexing was removed: rust-lang/rust#12710 See also http://doc.rust-lang.org/guide-strings.html But they still allow taking byte slices via an explicit call, raising a run-time error on things like "aé".slice(1, 2). Issue rust-lang/rust#10044 was precisely about renaming slice and friends to something more explicit, which sounds like a very sane approach.

Rust also provides iterators for bytes, codepoints and graphemes, which must be chosen explicitly. This also sounds like a good idea to me, removing all issues from naive uses of Unicode strings. And it's possible thanks to @stevengj's PR #9261.

Go doesn't sound like a good model to follow IMHO, not least because they allow arbitrary bytes (in my words, "random garbage") in their str type, and therefore also allow indexing in the middle of codepoints... Though they still special-case UTF-8 in iteration. :-/ http://blog.golang.org/strings

@StefanKarpinski
Copy link
Member

I don't think disallowing indexing into strings is helpful or sensible. You need a way of talking about positions in strings. Otherwise how do you talk about things like "give me the next character after this position in this string" or "give me the substring between this position and this position"? You can't use characters to talk about positions because characters can occur multiple times and even if they didn't this would imply doing an O(n) search to find a position, which is clearly a non-starter.

What might be sensible is disallowing arithmetic with indices into strings. Whenever someone writes str[i+1] they are writing Unicode-unsafe code and the Unicode-correct version should be str[nextind(str,i)]. The idea behind keeping a reference to the string as part of the StringIndex type is to do something like this:

immutable StringIndex{S<:String}
    string::S
    index::Int
end

Base.getindex(s::String, si::StringIndex) =
    s === si.string ? s[si.index] : s[chr2ind(s, ind2chr(si.string, si.index))]

function +(si::StringIndex, j::Integer)
    j < 0 && return si - (-j)
    i = si.index
    while j > 0
        i = nextind(si.string, i)
        j -= 1
    end
    return StringIndex(si.string, i)
end

function -(si::StringIndex, j::Integer)
    j < 0 && return si + (-j)
    i = si.index
    while j > 0
        i = prevind(si.string, i)
        j -= 1
    end
    return StringIndex(si.string, i)
end

That's pretty complicated code given that by far the most common use case is str[i + 1] which you want to just boil down to str[nextind(str,i)], which basically necessitates very aggressive inlining – more aggressive that we're currently able to do with the above code.

@StefanKarpinski
Copy link
Member

I've tried a bunch of tricks to try to make this StringIndex approach generate efficient code and I can't manage it. It may be possible, but I'm wary of an approach that makes this so difficult to optimize. Still, it's plausible to have a StringIndex type that is a wrapper around Int disallowing index arithmetic.

@nalimilan
Copy link
Member Author

I think there are two issues:

  1. Whether indexing using [] should be allowed: this is a question of syntax/API. Anyway there needs to be a way to extract a codepoint, grapheme or substring.
  2. What kind of indexing should be supported, either by getindex or dedicated functions. And whether arithmetic should be supported on that index.

Regarding 1), I don't have a strong opinion though I'm not a fan of indexing strings. (Actually I'm more concerned about iteration having different legitimate meanings.)

Regarding 2), I agree with @StefanKarpinski that inefficient arithmetic would be a problem, as it would let people use str[i + 1] when the most efficient solution is str[nextind(str,i)] (this could be fixed if/when function specialization depending on argument values was possible). But StringIndex could start as a wrapper to an int, without arithmetic, and it would already be a protection against incorrect uses. Then improvements could always come later. Though as I noted at the end of the description of the issue, storing a reference to the string could allow for some optimizations, even without allowing arithmetic, since it would save some checks in getindex about whether the index is in the middle of a code point or not.

@johnmyleswhite
Copy link
Member

@StefanKarpinski: I don't believe the concept of "position" has any coherent meaning for strings. After all, our current indexing rules refer to indexing into bytes, whereas Python 3's indexing rules refer to indexing into codepoints. In light of multiple ways to break strings apart into pieces, [] is a fundamentally ambiguous operation -- and I personally think our choice of defaulting to byte indices doesn't offer a good performance/safety tradeoff. By the time you're indexing into bytes for performance, you should be talking about a mutable array of bytes that derives from a specific string encoding that you are fully aware of. For safety, though, you want to talk about codepoints or graphemes without reference to bytes or even encodings.

@nalimilan
Copy link
Member Author

@johnmyleswhite No, the byte offset is a good index of a position within a string: it's like a pointer to a memory area. But it's an implementation detail which should never leak to the user. You should be able to ask for the grapheme at position X (in bytes), given that this position is obtained as an opaque StringIndex from another function like match.

@StefanKarpinski
Copy link
Member

The consequences of not having any notion of position for strings (even an opaque one that just happens to correspond to byte offsets) would be massive string API bloat and complication. Consider the very simple question: given some string, does "foo" occur earlier than "bar" as a substring, assuming both occur? With a notion of position that is at least ordered, we can answer this question very simply:

search(s, "foo")[1] < search(s, "bar")[1]

Without an ordered notion of position, this becomes much harder. What does search(s, "foo") even return? Does it partition s into three substrings: before the match, the match itself, and after the match? That's a super weird, awkward API when all you wanted to know was where something occurs. Let's assume that's how it works (I can't think of anything else if we banish the concept of string positions). Then you could do this:

contains(search(s, "foo").before, "bar")

Of course, that only works for "foo" and "bar" because they share no letters and cannot overlap. What if the strings are "abba" and "baab" instead? I'm not even going to try to express the logic for figuring out which one occurs first in a string without a notion of position.

@StefanKarpinski
Copy link
Member

I guess one way of doing this would be to compare the lengths of the before parts:

length(search(s, "foo").before) < length(search(s, "bar").before)

Of course, that adds an extra pair of O(n) operations to the solution, which is lousy.

@kmsquire
Copy link
Member

One idea I've toyed with in the past is maintaining a list of positions
within a string containing the positions and length of characters (code
points or graphemes) represented by more than one byte (or more than the
default if most characters are multibyte).

Indexing into code points or graphemes then becomes well defined. For
ASCII, we get byte indexing, which is fast. For text with few multibyte
code points/graphemes, indexing requires a binary search of a small array.
For Asian character sets, I believe most code points are two bytes, so the
same applies.

The downside, of course, is the overhead of an extra array, which would be
significant when working with large numbers of short strings.

On Wednesday, December 10, 2014, Stefan Karpinski notifications@github.com
wrote:

I guess one way of doing this would be to compare the lengths of the
before parts:

length(search(s, "foo").before) < length(search(s, "bar").before)

Of course, that adds an extra pair of O(n) operations to the solution,
which is lousy.


Reply to this email directly or view it on GitHub
#9297 (comment).

@stevengj
Copy link
Member

@kmsquire, I think the usual feeling (when discussions of variable-length encodings come up) is that schemes like yours are unnecessary, because randomly accessing the n-th codepoint or grapheme in a string is actually quite rare. Nearly all string processing is either sequential or sequential search followed by random access at the indices returned by search (or sequential accesses starting at these indices). That is, you only ever need "random" access at indices that are valid by construction (because they are merely cached).

@StefanKarpinski, I like the idea of an AbstractStringIndex type that is simply a wrapper around Int and supports isless and nextind(s,i) but not +. Seems like it would cut down on a lot of bugs, would have no performance overhead (because it would inline to an integer), and hopefully wouldn't be too annoying to use. You'd have concrete subtypes UTF8StringIndex etcetera, which would prevent you from accidentally using e.g. a UTF-8 string index with a UTF-16 copy of the same string. Of course, it wouldn't prevent you from using a UTF-8 string index with a completely different UTF-8 string, but this kind of mistake (if it is a mistake...it could actually be done correctly if you are careful to do it only if the second string is a substring of the first) seems less likely and would still throw an exception for an invalid index.

@stevengj stevengj added the needs decision A decision on this change is needed label Dec 10, 2014
@StefanKarpinski
Copy link
Member

It feels weirdly half-baked to prevent some kinds of cross-indexing errors but not others. In the above StringIndex example code I wrote, you'll note that I do the right thing and translate through code points if you use an index from one string into another. The thing I like about that is that it is pretty annoying to write that code and it's nice that it can just do it for you. Of course, the issue is that I couldn't coax our compiler stack into eliminating this check – it doesn't seem to know that an object is always === to itself. @JeffBezanson, any way we could teach our system that fact?

@stevengj
Copy link
Member

It doesn't seem weird to me. The main point of a StringIndex type is to detect the common s[i+1] errors, not to prevent BoundsError() in general. The fact that it also detects certain kinds of cross-indexing errors is merely a bonus.

@elextr
Copy link

elextr commented Dec 10, 2014

@StefanKarpinski an iterator is a position in a string, and, with the usual implementation of pointer to string and offset, it can be ordered simply by overloading <to consider only the offset. Arithmetic will be possible with overloading of + and - and will then DTRT with respect to incrementing by byte, codepoint or graphene and the extraction functions byte(), codepoint() and graphene() can also take an offset argument and again DTRT.

@JeffBezanson
Copy link
Member

@StefanKarpinski yes we can add codegen cases for ===. We could check if the two arguments are the same symbol, or the same LLVM Value* (plus some constraints probably). Would that address this case?

@StefanKarpinski
Copy link
Member

Unfortunately, it seems like neither of those things suffice. The necessary steps seem to be inlining of the getindex call and elimination of the construction of the StringIndex object altogether, turning si.string directly into just s, at which point LLVM might be able to tell that they're actually the same object. All that isn't happening in LLVM right now. Seems moderately unreasonable to expect it to.

@stevengj
Copy link
Member

@elextr, technically an iterable type in Julia is not a position in the string. The iterable combined with the iteration state (which is created by start) is what defines the position. In the case of a string, the iteration state is currently just the index. Unless you are defining "iterator" as the (iterable, state) pair? I suppose that would be the analogue of iterator objects in Python.

@elextr
Copy link

elextr commented Dec 11, 2014

@JeffBezanson yes, I was thinking like Python/C++ iterators where the state is part of the iterator object.

(my excuse is that searching the docs for "iterator" does not take you to the section http://docs.julialang.org/en/latest/stdlib/base/#iteration :)

@nalimilan
Copy link
Member Author

@StefanKarpinski What's the use case for indexing a string with an index obtained on another one? Even if done correctly (by counting code points or graphemes), I guess it wouldn't make much sense unless the two strings start with exactly the same content. And if that's the case, people would better directly index at a given byte offset instead of using the code you suggested, which is is O(n) for different strings.

In short, I think it would make for a more reasonable design if cross-indexing of strings was not supported by getindex. Instead, functions like slice_bytes, slice_codepoint, slice_grapheme should be provided, and used depending on whether you know the strings start with the same content (use the first) or not (use one of the two others -- but this case isn't very frequent AFAICT).

@StefanKarpinski
Copy link
Member

Slightly contrived example: you have a data parsing framework that lets you specify fields using something like this fmt = "[[[[ ]]].[[ ]]]]]". You parse the format string and get character offsets and then use those to parse out parts of other strings that contain actual data.

@PallHaraldsson
Copy link
Contributor

@nalimilan: "No, the byte offset is a good index of a position within a string" - Not even this is safe - [in my idea of a string implementation.. That is, if position means everything after that byte follows in the string.]

I support this issue. Would like to see indexing gone in 0.4 (or deprecated).

@malmaud
Copy link
Contributor

malmaud commented Jul 23, 2015

This might be a good candidate for .5, where lots of changes to indexing semantics is planned.

@PallHaraldsson
Copy link
Contributor

Dropping in 0.5 might be good enough, but do we then need to deprecated in 0.4? Not sure how it is done, just documenting it or needs there be a warning? Anyway I think that is a minor change to code to emit warnings and will not break any code.. [Dropping indexing-functionality is also simple, no, except breaks code.. Not sure if much code breaks however..]

@malmaud
Copy link
Contributor

malmaud commented Jul 23, 2015

Just to be clear, I'm not necessarily advocating for the removal of string indexing - just pointing out that changes to string indexing behavior might as well happen at the same time as changes to array indexing behavior.

@andreasnoack
Copy link
Member

Two comments:

  1. Ellide immutable allocation in some simple cases #15259 should be relevant to this discussion since the discussion of overhead from keeping a reference to a String predates the PR. With enough inlining, it might be possible to avoid the overhead.

  2. I'm working on a project where we have to iterate large text files and quite a few of them have invalid UTF8 byte sequences. Throwing when a sequence is invalid makes it inconvenient to handle some of these files. Hence locally, I've decided to change next to return '�' in those cases. This is going a bit in the opposite direction of Make getindex for String check if indices are valid #22572 which throws even more but as mentioned in Improvement of the functions for handling string indexing #23765 this might actually be a problem with the getindex API for strings. Hence, I'd be in favor of a string API that tried to hide the buffer and interpreted a string a sequence of characters (which is like the current string iterator) instead of a sequence of bytes. That would probably mean to stop having the getindex API since it would be too inefficient. Improvement of the functions for handling string indexing #23765 seems like a good direction.

@nalimilan
Copy link
Member Author

It would indeed be interesting to try again with #15259 and see whether performance has improved.

Regarding your second point, it seems that the issue of invalid sequences is mostly orthogonal to StringIndex. The same discussions can be developed only with the current API. The revelant issue for this is #22616. next would return Chars with invalid Unicode code points when the data is corrupt instead of throwing an error. getindex would do the same when pointed to a byte which is not part of a valid character. But it would still throw an error when pointing to the middle of a valid char, so it's not opposed to #22572 AFAICT.

The area where StringIndex could help is that we could rely on the assumption that an index does not point to the middle of a valid character, so we could avoid some checks. But these would only be needed for invalid sequences (i.e. continuation bytes not part of a character), since indices to valid characters will never be continuation bytes, so the overhead may not be a serious issue in practice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking This change will break code design Design of APIs or of the language itself unicode Related to unicode characters and encodings
Projects
None yet
Development

No branches or pull requests