-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Separate AbstractString interface from iteration protocol #26133
Conversation
One more thing to note is that is may actually make sense to not deprecate the use of
that would generically work (and give the |
Bikeshedding the name
|
I thought strings had to implement |
No, we have generic versions of those based on Line 462 in 2e1fb5e
|
I kinda like |
676e0e7
to
941c1c5
Compare
I have a There should be a test case that ensures that it is efficient to do comparison of a dynamically un-escaped string respecting the
I'd like to use something like this for splice-editing JSON text I now see that this used to be in base but was removed: #12330 IOString Another variation on This is a string that grows in size as data is received from an |
941c1c5
to
74961ab
Compare
@nanosoldier |
:( cc @ararslan |
That should be fixed by #26186. |
74961ab
to
6d9fb91
Compare
@nanosoldier |
@ararslan kick nanosoldier for me? |
Related: #26202 |
I haven't verified, but from your description, I think this should fix #26202, because the abstract |
Maybe today?: |
6d9fb91
to
03448ed
Compare
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan |
Ok, that's decently encouraging. The only string-performance-sensitive regression I see is join (we have a bunch particularly in dates and the micro benchmarks). join is a notoriously unreliable benchmark (it benchmarks a bit of an unrealistic case, with a very large separator), I believe due to memory allocation patterns, but I'll benchmark locally. My plan is to rename this to |
Up until now, the basic interface new AbstractStrings had to implement was: ``` struct MyString; ...; end next(::MyString, i::Int64)::Tuple{Char, Int64} isvalid(::MyString, i::Int64)::Bool ncodeunits(::MyString)::Int64 ``` In this interface, the iteration state (i.e. the second tuple element returned from `next`) always had to be the next valid string index. This is inconvenient for several reasons: 1. The iteration protocol will change, breaking every use of this API 2. Some strings may want iteration states other than linear indicies for efficiency reasons (e.g. RopeStrings) 3. Strings implementors can no longer assume that the second argument they receive was necessarily produced by them, so may need to do various validation of the iteration sate on every iteration. This PR attempts to remidy this, by introducing a new generic `Next` iterator. The iterator is defined to iterate (values, next index) tuple, which is the return value the `next` method on strings at the moment and thus allows for a natural transition from the older API. Thus, this PR changes, the requisite method to implement from: ``` next(::MyString, i::Int)::Tuple{Char, Int} ``` to ``` next(::StringNext{MyString}, state::Any)::Tuple{Tuple{Char, Int}, Any} ``` where `StringNext{T} = Next{T, EachIndexString{T}}` Efficient implementations of iteration over strings, the indicies as well as `Pairs` can be derived from this iterator. The reason this iterator is useful is perhaps best understood by considering strings to be variable-length encodings of character arrays. In a variable-length encoding, one generally decodes the value and the length (i.e. the index of the next element) at the same time, so it makes sense to base the API on the implementation of an iterator with these semantics. To demonstrate the use and test the new abstract implementations based on this iterator, there are three string types in the test suite: - CharString, as before, which simply wraps an array of `Chars` with direct indexing. The only change to this iterator is to change the signature of the `next` method. - RopeString, which strings together several Strings, and more importantly does not have efficient linear iteration state. - DecodeString, which decodes escape sequences on the fly as part of iteration. This string type demonstrates one string type wrapping another string type to test the interface from both sides
03448ed
to
c2658e1
Compare
Renamed |
Local benchmark shows no difference in |
StringNext(x::T, idx) where {T<:AbstractString} = Next(x, idx) | ||
StringNext(x::T, idx, itr) where {T<:AbstractString} = Next(x, idx, itr) | ||
|
||
start(sp::StringNext) = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be start(sp::StringNext) = firstindex(sp.data)
?
The doc says that string indexing starts at 1
, but there are other places that use firstindex
(same for first(::EachStringIndex) = 1
above).
The more places the 1
is not hard coded, the easier it'll be to eventually support AbstractString
types with non standard indexing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part of the goal of this change is to separate the assumption currently baked into the string code that indices and itartion state of strings are the same thing. We want indices to always be 1
through ncodeunits(s)
but allow iteration state to be anything at all (even non-integers). So firstindex(s)
and start(s)
should be totally unrelated: the former is always 1
but the latter can be anything at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
start(s)
here is just a default (custom string types can override it). It is also going away with the iteration protocol change. I do want to eventually define non-standard indexing for strings, but that's a natural extension of this and string should always support linear indicies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want indices to always be 1 through ncodeunits(s)
I've been trying out various string implementations as a way to understand what the abstract interface contract is. If firstindex(s) == 1
is part of the protocol, so be it.
Is it also to be assumed that isvalid(s, firstindex(s)) == true
? If this wasn't required that would allow "padding" of the codeunit space in cases where that is more efficient.
I have a SplicedString
implementation where the indexes are in the range 1:ncodeunits(s)
(and isvalid(s, 1) == true
), but where the index space is only sparsely populated.
It all works nicely for constant time indexing and iteration, but anyone who writes a i = i+1
loop is going to find it take a long time to run. Is this too heretical?
julia> s = LazyJSON.SplicedString("Foo", " ", "Bar")
"Foo Bar"
julia> [keys(s)...]
7-element Array{Int64,1}:
1
2
3
1099511627777
2199023255553
2199023255554
2199023255555
julia> [(i >> 40, i & (2^40-1)) for i in keys(s)]
7-element Array{Tuple{Int64,Int64},1}:
(0, 1)
(0, 2)
(0, 3)
(1, 1)
(2, 1)
(2, 2)
(2, 3)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
firstindex(s) == 1
I'd be fine relaxing this constraint, but I don't care too much.
isvalid(s, firstindex(s)) == true
This one I do care about.
Is this too heretical?
I mean, you can do various kinds of encoding here, this one included, but I'd rather go the direction where strings can have non-standard index kinds, but all still need to support linear indexing no matter how slow it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not so keen on this, I'm afraid. Overriding next(StringNext{MyString}, state)
seems very convoluted as a way to tell people to define iteration for their custom string types. The fact that the definition seems to typically involved returning the same index value twice seems problematic as well both stylistically, and potentially for performance, although I suppose in any case that's not too slow anyway that duplication will be inlined and eliminated anyway.
The first value returned by the `Next` iterator should correspond to the element at `idx`. | ||
Please note that if you override iteration for `Next{A}` and your iteration state is not | ||
the next index, you will have to additionally overload `Next(data::A, idx, itr::I)` for | ||
four `A`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "for four A
" mean here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo
('a', 2) | ||
|
||
julia> first(Next(['a','b','c'], 3)) | ||
('c', 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a somewhat confusing example. Would a more typical usage example be possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what confuses you about it? The purpose of the second example was to show that the last tuple will have an out-of-bounds index. Would adding first(Next(['a','b','c'], 2))
before this one be better?
StringNext(x::T, idx) where {T<:AbstractString} = Next(x, idx) | ||
StringNext(x::T, idx, itr) where {T<:AbstractString} = Next(x, idx, itr) | ||
|
||
start(sp::StringNext) = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part of the goal of this change is to separate the assumption currently baked into the string code that indices and itartion state of strings are the same thing. We want indices to always be 1
through ncodeunits(s)
but allow iteration state to be anything at all (even non-integers). So firstindex(s)
and start(s)
should be totally unrelated: the former is always 1
but the latter can be anything at all.
I mean that's hardly surprising, because that was the only option at the moment. Also, looking back at this, I realized SubString should use the parent's index state rather than duplicating the index - will fix. Some of this will change with the new iteration protocol anyway. Separating the indexing and iteration state is a new feature. The
Yes, where this is the case, the compiler will eliminate it. Registers are also fairly cheap during return if it doesn't. |
I'm still confused by what problem this PR is solving. Shouldn't we just say that the iteration state i.e. string implementations should define (at least):
(Analogous to how array implementations define indexing, and get iteration for free.) |
Yes, that is one possible solution, but it's unfortunately 2x slower for cases where you need both (because getting the next index generally involves the same work as decoding the character). That's a decently common operation because things like search want to return indices. This solves that problem by making getting both the basic operation from which the others are derived. |
If you want to optimize that, couldn't you implement a That is, the most basic string types implement just indexing. Optimized string types also implement a custom Also, this way people wanting both the indices and the values use the standard |
That's what I thought, but |
@StefanKarpinski Would you be happier if we fell back to assuming the iteration state was the same as the string state:
We'd have to be a bit careful about circular fallbacks, but that would allow defining string iterate on string and having this work automatically. |
👍 |
Clarify that `firstindex(str)` should always be `1` for any `AbstractString`, as mentioned by @StefanKarpinski [here](#26133 (comment)). Also reference `prevind` and `eachindex`. Also introduce the "code unit" terminology and mention the `codeunit` functions.
Sounds good. The iteration state might be a tuple of tree node and local index. Or the iteration state might be an opaque handle from an external system.
This sounds good too, but doesn't it assume that iteration state and index are the same? Maybe the API should have something like this: index(s, iteration_state)::Integer
start(s, i::Integer=firstindex(s)) -> iteration_state A parser that is blindly iterating along and finds an interesting delimiter can call By default: index(::AbstractString, state::Integer) = state But specialiseable like: function index(t::TreeString, state::Tuple{TreeStringNode, Integer})
n, i = state
return length(n.left) + index(t, (n.parent,1)) + i
end |
I think I like |
|
@@ -1070,6 +1124,7 @@ end | |||
function fixpoint_iter_type(itrT::Type, valT::Type, stateT::Type) | |||
nextvalstate = Base._return_type(next, Tuple{itrT, stateT}) | |||
nextvalstate <: Tuple{Any, Any} || return Any | |||
nextvalstate === Union{} && return Union{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly? Prevents things from erroring in the wrong place if you mess up how to do iteration.
@@ -0,0 +1,120 @@ | |||
# A specialized iterator for EachIndex of strings | |||
struct EachStringIndex{T<:AbstractString} | |||
s::T |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somewhat tangential, but while we're moving this type, let's spell the field out as .string
instead of .s
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have nextindex
spelled out instead of nextind
too? 🚲🏚
end | ||
|
||
# Derive iteration over pairs from `StringNext` | ||
const StringPairs{T<:AbstractString} = Iterators.Pairs{Int, Char, EachStringIndex{T}, T} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to @Keno this is:
the kind of Pairs iterator you get when you call
pairs
on a string of typeT
There should probably be a comment to that effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it be written something like const StringPairs{T<:AbstractString} = typeof(pairs(T())
(I had this problem creating a const type alias for view(Vector{UInt8}), the type is really convoluted, so I ended up with const nobytes = UInt8[]; const ByteView = typeof(view(nobytes))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I don't think that will work since there is no actual T
at the time of definition. The RHS needs to use parametric type syntax to work – despite the appearance, this is not a normal assignment, it's what used to be written as typealias
.
Talking with @JeffBezanson and @StefanKarpinski we will punt on generalizing string iteration state for 1.0. In particular, we will:
I will do 1/2 in the iteration PR and once that's in open separate PRs for 3/4. We also discussed whether we would need to do something about |
Let's keep the branch around for a bit, it's useful as a reference. |
Clarify that `firstindex(str)` should always be `1` for any `AbstractString`, as mentioned by @StefanKarpinski [here](#26133 (comment)). Also reference `prevind` and `eachindex`. Also introduce the "code unit" terminology and mention the `codeunit` functions.
Clarify that `firstindex(str)` should always be `1` for any `AbstractString`, as mentioned by @StefanKarpinski [here](#26133 (comment)). Also reference `prevind` and `eachindex`. Also introduce the "code unit" terminology and mention the `codeunit` functions. (cherry picked from commit 3b6773d)
Up until now, the basic interface new AbstractStrings had to implement was:
In this interface, the iteration state (i.e. the second tuple element returned
from
next
) always had to be the next valid string index. This is inconvenientfor several reasons:
reasons (e.g. RopeStrings)
was necessarily produced by them, so may need to do various validation of the
iteration sate on every iteration.
This PR attempts to remedy this, by introducing a new generic
LeadIndPairs
iterator.This iterator behaves similarly to
Pairs
, except that instead of the index of anelement, it gives the index of the next element (in other words, the indicies lead
the values by one element). The astute reader will note that the elements of this
iterator are precisely the elements of the tuple currently returned by next.
Thus, this PR changes, the requisite method to implement from:
to
where
StringLIPairs{T} = LeadIndPairs{Char, Int, EachIndexString{T}, T}
Efficient implementations of iteration over strings, the indicies as well as
Pairs
can be derived from this iterator. The reason this iterator is useful is perhaps best
understood by considering strings to be variable-length encodings of character arrays.
In a variable-length encoding, one generally decodes the value and the length (i.e. the
index of the next element) at the same time, so it makes sense to base the API
on the implementation of an iterator with these semantics.
To demonstrate the use and test the new abstract implementations based on this iterator,
there are three string types in the test suite:
Chars
with direct indexing. Theonly change to this iterator is to change the signature of the
next
method.efficient linear iteration state.
type demonstrates one string type wrapping another string type to test the interface from
both sides
This bootstraps and many of the abstract string operations work on the three test string types I mentioned. Some more cleanup and testing is required, but this seemed like the right time to stop for feedback.