-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
string overhaul #24999
string overhaul #24999
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's hard to review so many deep changes, but here are the few comments I can make.
function peekchar(s::IOStream) | ||
if ccall(:ios_peekutf8, Cint, (Ptr{Void}, Ptr{Char}), s, _chtmp) < 0 | ||
chref = Ref{UInt32}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this hurt performance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not according to @Keno who suggested that I move the externally allocated box inside IIRC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
welcome to the brave new world of julia 0.7
base/strings/basic.jl
Outdated
Predicate indicating whether the given index is the start of the encoding of | ||
a character in `s` or not. If `isvalid(s, i)` is true then `s[i]` will return | ||
the character whose encoding starts at that index, if it's false, then `s[i]` | ||
will raise an invalid index error. Behavior of `next(s, i)` is similar except |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would make sense to mention what exception type is thrown exactly.
base/strings/basic.jl
Outdated
a character in `s` or not. If `isvalid(s, i)` is true then `s[i]` will return | ||
the character whose encoding starts at that index, if it's false, then `s[i]` | ||
will raise an invalid index error. Behavior of `next(s, i)` is similar except | ||
that the character is returned along with the index of the following character. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really guarantee that the iterator state is "the index of the following character"? We don't for arrays for example, and some custom string types could do something different. Same remark for the next
docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's a fair amount of generic string code in Base (and presumably elsewhere) that assumes that the string iteration state is the same as the string index. If it's not safe to assume that then we need to make sure that people don't use the iterator state in s[i]
indexing expressions and that they only use it to pass to done
and next
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would very much like the ability to override string iteration for MyStringType (which I believe is currently possible?), and have the docs reflect the default behaviour of next
if you don't also override start
and done
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iamed2, I'm not quite sure what you're asking for – do you want to allow the iteration state not to be the string index? If so, I'm totally fine with that and it's mostly a documentation issue; for the built-in string types they happen to coincide. Might be worth changing the iteration state of SubString
to match its parent's iteration state just to shake out the places where we currently rely on these being the same though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This'll need to be updated in that case to provide whatever is most efficient:
Lines 649 to 650 in 224f1e9
# requires that indices for target are small ordered integers bounded by start and endof | |
function readuntil_indexable(io::IO, target#=::Indexable{T}=#, out) |
Pretty easy just to add the offset as needed, if that's what we want. At the time, I had seen that in a number of other places we define the bounds of the strings indices to be
1:n
, so just went with it:Lines 24 to 25 in 224f1e9
start(s::AbstractString) = 1 | |
done(s::AbstractString,i) = (i > endof(s)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, those are sane fallback definitions. Most strings work that way. The question is just that if someone overrides those, do we support it working. Note that this is not actually pressing for 1.0 since if we don't allow it now, we can always allow it in the future. However, changing the iteration state of substrings might break people's code, so we would want to do that now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iamed2, I'm not quite sure what you're asking for – do you want to allow the iteration state not to be the string index? If so, I'm totally fine with that and it's mostly a documentation issue; for the built-in string types they happen to coincide. Might be worth changing the iteration state of SubString to match its parent's iteration state just to shake out the places where we currently rely on these being the same though.
Yes, exactly :)
base/strings/basic.jl
Outdated
will raise an invalid index error. Behavior of `next(s, i)` is similar except | ||
that the character is returned along with the index of the following character. | ||
In order for `isvalid(s, i)` to be an O(1) function, the encoding of `s` must | ||
be [self-synchronizing](https://en.wikipedia.org/wiki/Self-synchronizing_code); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also use this link for the mention of "self-synchronizing" above.
base/strings/basic.jl
Outdated
sizeof(s::AbstractString) = error("type $(typeof(s)) has no canonical binary representation") | ||
isvalid(s::AbstractString, i::Integer) = typeof(i) === Int ? | ||
throw(MethodError(isvalid, Tuple{typeof(s),Int})) : | ||
isvalid(s, Int(i)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weird indentation. Also why not define two methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are two methods then any string type that defines isvalid(s::StringType, i::Integer)
creates an ambiguity. We could not define any isvalid(s::AbstractString, i::Integer)
method that converts to Int
but that's annoying because then every type that implements the string interface has to implement the same convert to Int
logic. This allows subtypes to just implement the Int
method in the common case or handle Integer
and either one is fine. In the former case this method is called and the error branch is dead code; in the latter case this method is never called at all.
@inbounds b = codeunit(s, i) | ||
b & 0xc0 == 0x80 || @goto ret | ||
u |= UInt32(b); i += 1 | ||
@label ret |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's kind of weird to have no indentation for a label, given that it's only valid inside the parent function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a pretty common style in C/C++ for goto labels, e.g. see http://en.cppreference.com/w/cpp/language/goto. The logic in my mind is that they are part of the control flow construct so having them outdented from the code seems fitting.
base/strings/string.jl
Outdated
while true | ||
(i += 1) ≤ n || break | ||
@inbounds b = codeunit(s, i) # lead byte | ||
@label L |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same remark as above.
test/strings/basic.jl
Outdated
@test_throws BoundsError chr2ind("hello", 10) | ||
@test_throws BoundsError ind2chr("hellø", 10) | ||
@test_throws BoundsError chr2ind("hellø", 10) | ||
@test ind2chr("hello", -1) == -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we really support these? Why would -1 codeunit be equal to -1 character? That doesn't make a lot of sense.
test/strings/basic.jl
Outdated
@test nextind(strs[i], -1, 1) == 1 | ||
@test prevind(strs[i], 20) == 19 | ||
@test prevind(strs[i], 20, 1) == 19 | ||
@test prevind(strs[i], 20, 10) == 7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This kind of behavior sounds weird to me. So we assume that all imaginary characters take one code unit, and once we reach actual characters we use them? I can't imagine a situation where it would be useful, and it could hide bugs...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally wrote the assumption that they all take one code unit, but realized that for substrings it's very convenient to just pass-through to the backing string which means that the "imaginary" characters are whatever size the actual characters in the backing string are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So basically that means that the behavior of prevind
/nextind
is undefined when an out of bounds index is passed, which isn't very Julian IMHO. It would be more standard to throw an error when an out of bounds index is passed, and enable the undefined behavior via @inbounds
, but as you said it's convenient in some cases, so...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not undefined – it satisfies lots of nice, predictable properties:
prevind(s, i) < i < nextind(s, i)
prevind(s, i, n) ≤ i ≤ nextind(s, i, n)
for alln ≥ 0
prevind(s, nextind(s, i, n), n) == thisind(s, i)
for alln > 0
prevind(s, nextind(s, i, n), m) == nextind(s, i, n-m)
whenevern > m
nextind(s, prevind(s, i, n), m) == prevind(s, i, n-m)
whenevern > m
There is a very consistent philosophy to all of these changes: only throw an error when you have to. For character iteration, the error only occurs when you try to ask about the code point of a malformed character – handling a character or a string that has malformed data is fine. For indexing, computing with out-of-bounds indices is fine, the only thing that's a problem is asking for a character or a string that is out of bounds. As I said before, we do this all the time with array indices and it's not a problem this is just another form index arithmetic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I said it's undefined because the number of code units which are added depend on the parent string (if any), so for example it will be different for a plain String
and for a SubString
. So in practice if you pass an out-of-bounds index, call nextind
/prevind
a given number of times, and get an in-bounds index, it's totally unpredictable what character you get.
Array indices are quite different, since in that case the step is always 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where did this out-of-bounds array index come from? If you start in bounds for a given string and then end back in bounds you always get the same answer. Yes, the out of bounds values in the middle may be different but why does that matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, normally you won't pass an out of bounds index. But if that happens because of a bug it would have been nice to be able to catch that by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making out of bound code units always size one won’t help with that. We already discussed the bit about why allowing computing out of bounds values is useful. I guess what hasn’t been shown is that these functions need to accept out of bounds indices as well as produce them.
@@ -46,10 +55,12 @@ for idx in [0, 1, 4] | |||
@test SubString("∀∀", 4, idx) == "∀∀"[4:idx] | |||
end | |||
|
|||
# second index beyond endof("∀∀") | |||
for idx in 5:8 | |||
# index beyond endof("∀∀") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"second" was accurate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't get what it means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It refers to the second index (third argument) passed to SubString
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, that was completely unclear to me since the comment doesn't mention SubString
at all. I think I'd like to deprecate the three-arg constructor for SubString
anyway and always use SubString(s, i:j)
instead.
Regarding indexing past the ends of strings. Here's my reasoning for it. At the very least we need to support indices of My general philosophy here (which I need to write down somewhere in the string docs) is that doing computations with indices outside of the bounds of strings is totally fine – it's only accessing a character past one of the ends of a string that's a problem. By analogy to indexing into an array, it's perfectly normal to compute intermediate values when doing index arithmetic that are outside of the range of the array. The problem is only if at the end you index into the array out of bounds. I ended up not needing to use indices outside of the Here's a motivating example. Suppose you're writing code to find a particular word in some text with What about having |
OK, I see the advantages. But I still still find it disturbing when |
Can you think of an efficient way to handle the word-search-with-context example any other way? |
I was tempted to just make |
5b29efa
to
463ea70
Compare
Well, it would work if
Given how special string indices are, it makes sense to have a dedicated exception for them. |
I suspect I can come up with a slightly more complex use case where coming back in-bounds is required for the above word-with-context example. But there's a reason we don't do saturating integer arithmetic and this is just another form of integer arithmetic. |
I'm getting !! No doc found for reference '[`ncodeunits`](@ref)'. Anyone know why that would be happening? There is a docstring for |
I think you need to add the function to the stdlib part of the manual for Documenter to find it. |
@nalimilan, I've thought about this a bit and while there are examples (like the word-context one) that show that it's useful to produce out-of-bounds indices from string index arithmetic functions, I can't think of a concrete use case for accepting them – except for the only-just-out-of-bounds ones, i.e.
So why not replace the RHS with the LHS which does less work and may avoid going out of bounds where the LHS went out of bounds temporarily? So I guess the conservative thing to do where would be to make these functions raise errors when passed starting indices outside of the range |
Related issue: |
I've pushed a new commit deleting the |
It sounds reasonable to start with the most restrictive behavior (throwing an error), since if it turns out to be problematic in some cases we will be able to change it, which wouldn't be possible in the other direction. |
Yeah, I suppose so, but it just feels a bit weird and undisciplined to allow some out-of-bounds indices – |
I have now added the deprecation of |
@@ -565,14 +565,15 @@ Some other useful functions include: | |||
|
|||
* [`endof(str)`](@ref) gives the maximal (byte) index that can be used to index into `str`. | |||
* [`length(str)`](@ref) the number of characters in `str`. | |||
* [`length(str, i, j)`](@ref) the number of valid character indices in `str` from `i` to `j`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I wonder whether we should add this method, as it looks a bit weird. Would it be enough to recommend length(SubString(str, i:j))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It likely can be free if it gets inlined. But seem to recall the SubString constructor does quite a bit of work to munge the indices into something valid? Anyways, doesn't seem like this API decision should be too much influenced by temporary performance concerns (unless it's only for internal usage, but them maybe call it Base.substr_length
).
Not sure if its related, but I had a similar problem: #23703 (comment) Removing the backticks fixed the issue, but this seemed less than ideal because I thought "auto-referencing" should work with or without backticks. |
6739272
to
5708e3e
Compare
@nalimilan: is it ok with you if I merge my PR as is to avoid it getting lots of conflicts and then make the changes to make indexing more conservative in a separate PR? My general plan is to try this and see how it goes:
|
c7d5ed1
to
46bcd75
Compare
base/strings/basic.jl
Outdated
ncodeunits(s::AbstractString) | ||
|
||
""" | ||
codeunit(s::AbstractString) -> Type{<:Union{UInt8, UInt16, UInt32}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(from a discussion on Discourse)
I think that the function that retrieves the type should perhaps have a different name than the function that returns a codeunit. eg codeunit_type
. Methods of the same function usually share behavior on some abstract level, this is stretching that. I realize that the name is long, but it is only used in a few places (mostly to define the method for various types). I expect that in code outside Base
it will be only used to help type inference so that this will not be cumbersome.
Sorry, I hadn't seen your question. Yes, that's fine with me. (Actually, what worries me more are the conflicts with the search/find PRs, but such is life...) |
- {next,getindex}_continued don't need to re-check bounds - short-circuiting is slightly faster in length
also: avoid second bounds check in `get` with default
also: `next` can assume that incoming indices are valid
94e99d3
to
8de25f5
Compare
Is the memory error on FreeBSD expected these days? cc @iblis17 |
Ok, I've already tested the heck out of this change, the BSD failure seems to be an unrelated memory problem, it passes on Windows and Linux and I can't reproduce the macOS spawn failure locally, so I'm assuming that's unrelated too. |
I did not mean to squash that, but I think it's fine. The intermediate commits are not all that useful. |
@StefanKarpinski about the ENOMEM: #23273 (comment) |
Actually this wasn't squashed, GitHub just changed how the display merge commits. |
Replacement PR for #24439. More description to come...
Edit see this discourse post for a description: https://discourse.julialang.org/t/changes-to-the-representation-of-char/7291/5