WIP/RFC: Separate AbstractString interface from iteration protocol
Up until now, the basic interface that new `AbstractString`s had to implement was:

```
struct MyString; ...; end
next(::MyString, i::Int64)::Tuple{Char, Int64}
isvalid(::MyString, i::Int64)::Bool
ncodeunits(::MyString)::Int64
```

In this interface, the iteration state (i.e. the second tuple element returned from `next`) always had to be the next valid string index. This is inconvenient for several reasons:

1. The iteration protocol will change, breaking every use of this API.
2. Some strings may want iteration states other than linear indices for efficiency reasons (e.g. RopeStrings).
3. String implementors cannot assume that the second argument they receive was necessarily produced by them, so they may need to validate the iteration state on every iteration.

This PR attempts to remedy this by introducing a new generic `LeadIndPairs` iterator. This iterator behaves similarly to `Pairs`, except that instead of the index of an element, it gives the index of the next element (in other words, the indices lead the values by one element). The astute reader will note that the elements of this iterator are precisely the elements of the tuple currently returned by `next`. Thus, this PR changes the requisite method to implement from:

```
next(::MyString, i::Int64)::Tuple{Char, Int64}
```

to

```
next(::StringLIPairs{MyString}, state::Any)::Tuple{Pair{Int, Char}, Int64}
```

where `StringLIPairs{T} = LeadIndPairs{Int, Char, EachStringIndex{T}, T}`.

Efficient implementations of iteration over strings, over their indices, and over `Pairs` can all be derived from this iterator. The reason this iterator is useful is perhaps best understood by considering strings to be variable-length encodings of character arrays. In a variable-length encoding, one generally decodes the value and the length (i.e. the index of the next element) at the same time, so it makes sense to base the API on the implementation of an iterator with these semantics.
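As a rough illustration of the idea (not the code in this PR), the sketch below builds a lead-index pairs iterator on top of the `iterate` protocol of current Julia 1.x rather than `start`/`next`/`done`. `LeadIndexPairsSketch`, `chars_from`, and `pairs_from` are hypothetical names chosen for this example, and the state is assumed to be a plain `Int` index.

```julia
# Hypothetical stand-in for `LeadIndPairs`; not the type added by this PR.
struct LeadIndexPairsSketch{S<:AbstractString}
    data::S
end

Base.IteratorSize(::Type{<:LeadIndexPairsSketch}) = Base.SizeUnknown()
Base.eltype(::Type{<:LeadIndexPairsSketch}) = Pair{Int,Char}

# Decode the character at `i` and the index of the next character in one
# step, then yield `next_index => char`. The state is simply the next index.
function Base.iterate(it::LeadIndexPairsSketch, i::Int = firstindex(it.data))
    r = iterate(it.data, i)
    r === nothing && return nothing
    c, nexti = r
    return (nexti => c), nexti
end

# Characters: drop the index from every pair.
chars_from(s::AbstractString) = [c for (_, c) in LeadIndexPairsSketch(s)]

# Ordinary `index => char` pairs: carry the previous lead index along.
function pairs_from(s::AbstractString)
    out = Pair{Int,Char}[]
    idx = firstindex(s)
    for (nidx, c) in LeadIndexPairsSketch(s)
        push!(out, idx => c)
        idx = nidx
    end
    return out
end
```

For a string such as `"aβc"`, where `β` occupies two code units, the lead indices are 2, 4, and 5, while the ordinary indices recovered by `pairs_from` are 1, 2, and 4.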
To demonstrate the use of this iterator and to test the new abstract implementations based on it, there are three string types in the test suite:

- CharString, as before, which simply wraps an array of `Char`s with direct indexing. The only change to this type is the signature of the `next` method.
- RopeString, which strings together several Strings and, more importantly, does not have an efficient linear iteration state.
- DecodeString, which decodes escape sequences on the fly as part of iteration. This string type demonstrates one string type wrapping another, to test the interface from both sides.
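To make the RopeString case more concrete, here is a minimal sketch of why a non-linear iteration state helps. It is not the RopeString from the test suite: `RopeSketch` and `RopeLeadPairs` are hypothetical names, the iterator is written against the `iterate` protocol of current Julia 1.x, and the rope is assumed to be a flat vector of `String` pieces.

```julia
# Hypothetical rope: a sequence of `String` pieces treated as one string.
struct RopeSketch
    pieces::Vector{String}
end

# Hypothetical iterator yielding `next_global_index => char`.
struct RopeLeadPairs
    rope::RopeSketch
end

Base.IteratorSize(::Type{RopeLeadPairs}) = Base.SizeUnknown()
Base.eltype(::Type{RopeLeadPairs}) = Pair{Int,Char}

# The state is (piece number, index within that piece, code units in all
# earlier pieces), so advancing never has to search for the piece that
# holds a given global index.
Base.iterate(it::RopeLeadPairs) = iterate(it, (1, 1, 0))

function Base.iterate(it::RopeLeadPairs, state::Tuple{Int,Int,Int})
    p, i, offset = state
    pieces = it.rope.pieces
    while p <= length(pieces)
        r = iterate(pieces[p], i)
        if r !== nothing
            c, nexti = r
            return (offset + nexti => c), (p, nexti, offset)
        end
        # Current piece is exhausted: move on to the next one.
        offset += ncodeunits(pieces[p])
        p += 1
        i = 1
    end
    return nothing
end
```

The yielded indices are still global string indices, but the state that drives the iteration is whatever the implementation finds cheapest to advance.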
Showing 11 changed files with 345 additions and 68 deletions.
@@ -0,0 +1,97 @@

```julia
struct EachStringIndex{T<:AbstractString}
    s::T
end
keys(s::AbstractString) = EachStringIndex(s)

length(e::EachStringIndex) = length(e.s)
first(::EachStringIndex) = 1
last(e::EachStringIndex) = lastindex(e.s)
eltype(::Type{<:EachStringIndex}) = Int

const StringLIPairs{T<:AbstractString} = Iterators.LeadIndPairs{Int, Char, EachStringIndex{T}, T}
const StringPairs{T<:AbstractString} = Iterators.Pairs{Int, Char, EachStringIndex{T}, T}
StringLIPairs{T}(x::T) where {T<:AbstractString} = Iterators.LeadIndPairs(x, eachindex(x))
StringLIPairs(x::T) where {T<:AbstractString} = StringLIPairs{T}(x)
StringPairs{T}(x::T) where {T<:AbstractString} = Iterators.Pairs(x, eachindex(x))
StringPairs(x::T) where {T<:AbstractString} = StringPairs{T}(x)

Iterators.pairs(s::AbstractString) = StringPairs(s)
Iterators.reverse(s::StringPairs) = Iterators.Reverse(s)

start(sp::StringLIPairs) = 1
function done(s::StringLIPairs, i)
    if isa(i, Integer)
        return i > ncodeunits(s.data)
    else
        throw(MethodError(done, (s, i)))
    end
end
function next(s::StringLIPairs, i)
    if isa(i, Integer) && !isa(i, Int64)
        return next(s, Int64(i))
    else
        throw(MethodError(next, (s, i)))
    end
end

# Reverse pair iteration
start(e::Iterators.Reverse{<:StringPairs}) = ncodeunits(e.itr.data)+1
done(e::Iterators.Reverse{<:StringPairs}, idx) = idx == firstindex(e.itr.data)
function next(s::Iterators.Reverse{<:StringPairs}, idx)
    tidx = thisind(s.itr.data, idx-1)
    (nidx, c) = first(leadindpairs(s.itr.data, tidx))
    Pair(tidx, c), tidx
end

function prev(s::AbstractString, idx)
    (i, c), _ = next(Iterators.Reverse(StringPairs(s)), idx)
    (c, i)
end

start(e::StringPairs) = (firstindex(e.data), start(StringLIPairs(e.data)))
done(e::StringPairs, (idx, state)) = done(StringLIPairs(e.data), state)
function next(s::StringPairs, (idx, state))
    ((nidx, c), state) = next(StringLIPairs(s.data), state)
    Pair(idx, c), (nidx, state)
end

start(s::AbstractString) = start(StringLIPairs(s))
done(s::AbstractString, state) = done(StringLIPairs(s), state)
function next(s::AbstractString, state)
    ((idx, c), state) = next(StringLIPairs(s), state)
    (c, state)
end

start(e::EachStringIndex) = start(StringPairs(e.s))
done(e::EachStringIndex, state) = done(StringPairs(e.s), state)
function next(e::EachStringIndex, state)
    ((idx, c), state) = next(StringPairs(e.s), state)
    (idx, state)
end

eltype(::Type{<:AbstractString}) = Char
sizeof(s::AbstractString) = ncodeunits(s) * sizeof(codeunit(s))
firstindex(s::AbstractString) = 1
lastindex(s::AbstractString) = thisind(s, ncodeunits(s))

function getindex(s::AbstractString, i::Integer)
    @boundscheck checkbounds(s, i)
    @inbounds return isvalid(s, i) ? first(leadindpairs(s, i)).second : string_index_err(s, i)
end

getindex(s::AbstractString, i::Colon) = s
# TODO: handle other ranges with stride ±1 specially?
# TODO: add more @propagate_inbounds annotations?
getindex(s::AbstractString, v::AbstractVector{<:Integer}) =
    sprint(io->(for i in v; write(io, s[i]) end), sizehint=length(v))
getindex(s::AbstractString, v::AbstractVector{Bool}) =
    throw(ArgumentError("logical indexing not supported for strings"))

function get(s::AbstractString, i::Integer, default)
    # TODO: use ternary once @inbounds is expression-like
    if checkbounds(Bool, s, i)
        @inbounds return s[i]
    else
        return default
    end
end
```
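The reverse pair iteration in this hunk walks backwards by calling `thisind` on the code unit just before the current state. A minimal sketch of the same walk, assuming current Julia 1.x: `reverse_pairs_sketch` is a hypothetical helper, it decodes with `iterate(s, i)` where the hunk calls `leadindpairs`, and it stops on `idx > firstindex(s)` rather than the equality test used in `done`.

```julia
# Hypothetical helper: collect `index => char` pairs from last to first.
function reverse_pairs_sketch(s::AbstractString)
    out = Pair{Int,Char}[]
    idx = ncodeunits(s) + 1          # one past the final code unit
    while idx > firstindex(s)
        # Snap back to the start of the character that ends just before `idx`.
        tidx = thisind(s, idx - 1)
        c, _ = iterate(s, tidx)      # decode the character stored at `tidx`
        push!(out, tidx => c)
        idx = tidx                   # the new state is the index just visited
    end
    return out
end
```

Because the state is the index of the character just produced, each step needs only one `thisind` call to find the previous character.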