Skip to content

Commit

Permalink
WIP/RFC: Separate AbstractString interface from iteration protocol
Browse files Browse the repository at this point in the history
Up until now, the basic interface new AbstractStrings had to implement was:

```
struct MyString; ...; end
next(::MyString, i::Int64)::Tuple{Char, Int64}
isvalid(::MyString, i::Int64)::Bool
ncodeunits(::MyString)::Int64
```

In this interface, the iteration state (i.e. the second tuple element returned
from `next`) always had to be the next valid string index. This is inconvenient
for several reasons:
1. The iteration protocol will change, breaking every use of this API
2. Some strings may want iteration states other than linear indicies for efficiency
   reasons (e.g. RopeStrings)
3. Strings implementors can no longer assume that the second argument they receive
   was necessarily produced by them, so may need to do various validation of the
   iteration sate on every iteration.

This PR attempts to remidy this, by introducing a new generic `LeadIndPairs` iterator.
This iterator behaves similarly to `Pairs`, except that instead of the index of an
element, it gives the index of the next element (in other words, the indicies lead
the values by one element). The astute reader will note that the elements of this
iterator are precisely the elements of the tuple currently returned by next.
Thus, this PR changes, the requisite method to implement from:

```
next(::MyString, i::Int64)::Tuple{Char, Int64}
```

to

```
next(::StringLIPairs{MyString}, state::Any)::Tuple{Pair{Int, Char}, Int64}
```

where `StringLIPairs{T} = LeadIndPairs{Char, Int, EachIndexString{T}, T}`

Efficient implementations of iteration over strings, the indicies as well as `Pairs`
can be derived from this iterator. The reason this iterator is useful is perhaps best
understood by considering strings to be variable-length encodings of character arrays.
In a variable-length encoding, one generally decodes the value and the length (i.e. the
index of the next element) at the same time, so it makes sense to base the API
on the implementation of an iterator with these semantics.

To demonstrate the use and test the new abstract implementations based on this iterator,
there are three string types in the test suite:
- CharString, as before, which simply wraps an array of `Chars` with direct indexing. The
  only change to this iterator is to change the signature of the `next` method.
- RopeString, which strings together several Strings, and more importantly does not have
  efficient linear iteration state.
- DecodeString, which decodes escape sequences on the fly as part of iteration. This string
  type demonstrates one string type wrapping another string type to test the interface from
  both sides
  • Loading branch information
Keno committed Feb 23, 2018
1 parent f82e845 commit 74961ab
Show file tree
Hide file tree
Showing 11 changed files with 345 additions and 68 deletions.
34 changes: 34 additions & 0 deletions base/iterators.jl
Original file line number Diff line number Diff line change
Expand Up @@ -247,6 +247,39 @@ setindex!(v::Pairs, value, key) = (v.data[key] = value; v)
get(v::Pairs, key, default) = get(v.data, key, default)
get(f::Base.Callable, collection::Pairs, key) = get(f, v.data, key)

"""
Iterators.LeadIndPairs(values, keys=eachindex(values))
Like `Pairs`, but rather than returning the index corresponding to the current element,
return the index corresponding to the next value. In other words, the index leads the
value by one element. The index corresponding to the last element will be `lastindex(keys)+1`.
"""
struct LeadIndPairs{K, V, I, A}
data::A
itr::I
LeadIndPairs(data::A, itr::I) where {A, I} = new{eltype(I), eltype(A), I, A}(data, itr)
end
LeadIndPairs(data) = LeadIndPairs(data, eachindex(data))

start(lip::LeadIndPairs) = start(lip.itr)
done(lip::LeadIndPairs, state) = done(lip.itr, state)
function next(lip::LeadIndPairs, state)
nidx = ns = next(lip.itr, state)
# A bit awkward now, done for consistency with the new iteration protocol
done(lip.itr, ns) && (nidx = lastindex(lip.itr)+1)
Pair(nidx, lip.data[ns]), ns
end

leadindpairs(data) = LeadIndPairs(data)
leadindpairs(data, idx) = Rest(LeadIndPairs(data), idx)

length(lip::LeadIndPairs) = length(lip.itr)
eltype(::Type{LeadIndPairs{K, V}}) where {K, V} = Pair{K, V}

IteratorSize(::Type{LeadIndPairs{<:Any, <:Any, I}}) where {I} = IteratorSize(I)
IteratorEltype(::Type{LeadIndPairs{<:Any, <:Any, I}}) where {I} = IteratorEltype(I)

# zip

abstract type AbstractZipIterator end
Expand Down Expand Up @@ -1070,6 +1103,7 @@ end
function fixpoint_iter_type(itrT::Type, valT::Type, stateT::Type)
nextvalstate = Base._return_type(next, Tuple{itrT, stateT})
nextvalstate <: Tuple{Any, Any} || return Any
nextvalstate === Union{} && return Union{}
nextvalstate = Tuple{
typejoin(valT, fieldtype(nextvalstate, 1)),
typejoin(stateT, fieldtype(nextvalstate, 2))}
Expand Down
4 changes: 3 additions & 1 deletion base/regex.jl
Original file line number Diff line number Diff line change
Expand Up @@ -288,7 +288,9 @@ ncodeunits(s::SubstitutionString) = ncodeunits(s.string)
codeunit(s::SubstitutionString) = codeunit(s.string)
codeunit(s::SubstitutionString, i::Integer) = codeunit(s.string, i)
isvalid(s::SubstitutionString, i::Integer) = isvalid(s.string, i)
next(s::SubstitutionString, i::Integer) = next(s.string, i)
start(s::StringLIPairs{<:SubstitutionString}) = start(StringLIPairs(s.data.string))
next(s::StringLIPairs{<:SubstitutionString}, state) = next(StringLIPairs(s.data.string), state)
done(s::StringLIPairs{<:SubstitutionString}, state) = done(StringLIPairs(s.data.string), state)

function show(io::IO, s::SubstitutionString)
print(io, "s")
Expand Down
51 changes: 5 additions & 46 deletions base/strings/basic.jl
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,8 @@ See also: [`codeunit`](@ref), [`ncodeunits`](@ref), [`thisind`](@ref),
"""
AbstractString

include("strings/iteration.jl")

## required string functions ##

"""
Expand Down Expand Up @@ -121,6 +123,7 @@ Stacktrace:
@propagate_inbounds isvalid(s::AbstractString, i::Integer) = typeof(i) === Int ?
throw(MethodError(isvalid, (s, i))) : isvalid(s, Int(i))

#=
"""
next(s::AbstractString, i::Integer) -> Tuple{Char, Int}
Expand All @@ -135,37 +138,7 @@ See also: [`getindex`](@ref), [`start`](@ref), [`done`](@ref),
"""
@propagate_inbounds next(s::AbstractString, i::Integer) = typeof(i) === Int ?
throw(MethodError(next, (s, i))) : next(s, Int(i))

## basic generic definitions ##

start(s::AbstractString) = 1
done(s::AbstractString, i::Integer) = i > ncodeunits(s)
eltype(::Type{<:AbstractString}) = Char
sizeof(s::AbstractString) = ncodeunits(s) * sizeof(codeunit(s))
firstindex(s::AbstractString) = 1
lastindex(s::AbstractString) = thisind(s, ncodeunits(s))

function getindex(s::AbstractString, i::Integer)
@boundscheck checkbounds(s, i)
@inbounds return isvalid(s, i) ? next(s, i)[1] : string_index_err(s, i)
end

getindex(s::AbstractString, i::Colon) = s
# TODO: handle other ranges with stride ±1 specially?
# TODO: add more @propagate_inbounds annotations?
getindex(s::AbstractString, v::AbstractVector{<:Integer}) =
sprint(io->(for i in v; write(io, s[i]) end), sizehint=length(v))
getindex(s::AbstractString, v::AbstractVector{Bool}) =
throw(ArgumentError("logical indexing not supported for strings"))

function get(s::AbstractString, i::Integer, default)
# TODO: use ternary once @inbounds is expression-like
if checkbounds(Bool, s, i)
@inbounds return s[i]
else
return default
end
end
=#

## bounds checking ##

Expand Down Expand Up @@ -379,6 +352,7 @@ julia> thisind("αβγdef", 10)
julia> thisind("αβγdef", 20)
20
```
"""
thisind(s::AbstractString, i::Integer) = thisind(s, Int(i))

Expand Down Expand Up @@ -470,21 +444,6 @@ function nextind(s::AbstractString, i::Int, n::Int)
return i + n
end

## string index iteration type ##

struct EachStringIndex{T<:AbstractString}
s::T
end
keys(s::AbstractString) = EachStringIndex(s)

length(e::EachStringIndex) = length(e.s)
first(::EachStringIndex) = 1
last(e::EachStringIndex) = lastindex(e.s)
start(e::EachStringIndex) = start(e.s)
next(e::EachStringIndex, state) = (state, nextind(e.s, state))
done(e::EachStringIndex, state) = done(e.s, state)
eltype(::Type{<:EachStringIndex}) = Int

"""
isascii(c::Union{Char,AbstractString}) -> Bool
Expand Down
97 changes: 97 additions & 0 deletions base/strings/iteration.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,97 @@
struct EachStringIndex{T<:AbstractString}
s::T
end
keys(s::AbstractString) = EachStringIndex(s)

length(e::EachStringIndex) = length(e.s)
first(::EachStringIndex) = 1
last(e::EachStringIndex) = lastindex(e.s)
eltype(::Type{<:EachStringIndex}) = Int

const StringLIPairs{T<:AbstractString} = Iterators.LeadIndPairs{Int, Char, EachStringIndex{T}, T}
const StringPairs{T<:AbstractString} = Iterators.Pairs{Int, Char, EachStringIndex{T}, T}
StringLIPairs{T}(x::T) where {T<:AbstractString} = Iterators.LeadIndPairs(x, eachindex(x))
StringLIPairs(x::T) where {T<:AbstractString} = StringLIPairs{T}(x)
StringPairs{T}(x::T) where {T<:AbstractString} = Iterators.Pairs(x, eachindex(x))
StringPairs(x::T) where {T<:AbstractString} = StringPairs{T}(x)

Iterators.pairs(s::AbstractString) = StringPairs(s)
Iterators.reverse(s::StringPairs) = Iterators.Reverse(s)

start(sp::StringLIPairs) = 1
function done(s::StringLIPairs, i)
if isa(i, Integer)
return i > ncodeunits(s.data)
else
throw(MethodError(done, (s, i)))
end
end
function next(s::StringLIPairs, i)
if isa(i, Integer) && !isa(i, Int64)
return next(s, Int64(i))
else
throw(MethodError(next, (s, i)))
end
end

# Reverse pair iteration
start(e::Iterators.Reverse{<:StringPairs}) = ncodeunits(e.itr.data)+1
done(e::Iterators.Reverse{<:StringPairs}, idx) = idx == firstindex(e.itr.data)
function next(s::Iterators.Reverse{<:StringPairs}, idx)
tidx = thisind(s.itr.data, idx-1)
(nidx, c) = first(leadindpairs(s.itr.data, tidx))
Pair(tidx, c), tidx
end

function prev(s::AbstractString, idx)
(i, c), _ = next(Iterators.Reverse(StringPairs(s)), idx)
(c, i)
end

start(e::StringPairs) = (firstindex(e.data), start(StringLIPairs(e.data)))
done(e::StringPairs, (idx, state)) = done(StringLIPairs(e.data), state)
function next(s::StringPairs, (idx, state))
((nidx, c), state) = next(StringLIPairs(s.data), state)
Pair(idx, c), (nidx, state)
end

start(s::AbstractString) = start(StringLIPairs(s))
done(s::AbstractString, state) = done(StringLIPairs(s), state)
function next(s::AbstractString, state)
((idx, c), state) = next(StringLIPairs(s), state)
(c, state)
end

start(e::EachStringIndex) = start(StringPairs(e.s))
done(e::EachStringIndex, state) = done(StringPairs(e.s), state)
function next(e::EachStringIndex, state)
((idx, c), state) = next(StringPairs(e.s), state)
(idx, state)
end

eltype(::Type{<:AbstractString}) = Char
sizeof(s::AbstractString) = ncodeunits(s) * sizeof(codeunit(s))
firstindex(s::AbstractString) = 1
lastindex(s::AbstractString) = thisind(s, ncodeunits(s))

function getindex(s::AbstractString, i::Integer)
@boundscheck checkbounds(s, i)
@inbounds return isvalid(s, i) ? first(leadindpairs(s, i)).second : string_index_err(s, i)
end

getindex(s::AbstractString, i::Colon) = s
# TODO: handle other ranges with stride ±1 specially?
# TODO: add more @propagate_inbounds annotations?
getindex(s::AbstractString, v::AbstractVector{<:Integer}) =
sprint(io->(for i in v; write(io, s[i]) end), sizehint=length(v))
getindex(s::AbstractString, v::AbstractVector{Bool}) =
throw(ArgumentError("logical indexing not supported for strings"))

function get(s::AbstractString, i::Integer, default)
# TODO: use ternary once @inbounds is expression-like
if checkbounds(Bool, s, i)
@inbounds return s[i]
else
return default
end
end
7 changes: 4 additions & 3 deletions base/strings/string.jl
Original file line number Diff line number Diff line change
Expand Up @@ -167,10 +167,11 @@ is_valid_continuation(c) = c & 0xc0 == 0x80

## required core functionality ##

@propagate_inbounds function next(s::String, i::Int)
@propagate_inbounds function next(sp::StringLIPairs{String}, i::Int)
s = sp.data
b = codeunit(s, i)
u = UInt32(b) << 24
between(b, 0x80, 0xf7) || return reinterpret(Char, u), i+1
between(b, 0x80, 0xf7) || return (Pair(i + 1, reinterpret(Char, u)), i+1)
return next_continued(s, i, u)
end

Expand All @@ -193,7 +194,7 @@ function next_continued(s::String, i::Int, u::UInt32)
b & 0xc0 == 0x80 || @goto ret
u |= UInt32(b); i += 1
@label ret
return reinterpret(Char, u), i
return Pair(i, reinterpret(Char, u)), i
end

@propagate_inbounds function getindex(s::String, i::Int)
Expand Down
7 changes: 4 additions & 3 deletions base/strings/substring.jl
Original file line number Diff line number Diff line change
Expand Up @@ -62,10 +62,11 @@ function codeunit(s::SubString, i::Integer)
@inbounds return codeunit(s.string, s.offset + i)
end

function next(s::SubString, i::Integer)
function next(sp::StringLIPairs{<:SubString}, i::Int64)
s = sp.data
@boundscheck checkbounds(s, i)
@inbounds c, i = next(s.string, s.offset + i)
return c, i - s.offset
@inbounds (idx, c), i = next(StringLIPairs(s.string), s.offset + i)
return Pair(idx - s.offset, c), i - s.offset
end

function getindex(s::SubString, i::Integer)
Expand Down
8 changes: 4 additions & 4 deletions base/strings/util.jl
Original file line number Diff line number Diff line change
Expand Up @@ -136,9 +136,9 @@ julia> lstrip(a)
function lstrip(s::AbstractString, chars::Chars=_default_delims)
e = lastindex(s)
for (i, c) in pairs(s)
!(c in chars) && return SubString(s, i, e)
!(c in chars) && return @inbounds SubString(s, i, e)
end
SubString(s, e+1, e)
@inbounds SubString(s, e+1, e)
end

"""
Expand All @@ -161,9 +161,9 @@ julia> rstrip(a)
"""
function rstrip(s::AbstractString, chars::Chars=_default_delims)
for (i, c) in Iterators.reverse(pairs(s))
c in chars || return SubString(s, 1, i)
c in chars || return @inbounds SubString(s, 1, i)
end
SubString(s, 1, 0)
@inbounds SubString(s, 1, 0)
end

"""
Expand Down
2 changes: 1 addition & 1 deletion base/sysimg.jl
Original file line number Diff line number Diff line change
Expand Up @@ -175,7 +175,7 @@ Array{T}(::Missing, d...) where {T} = fill!(Array{T}(uninitialized, d...), missi
include("abstractdict.jl")

include("iterators.jl")
using .Iterators: zip, enumerate
using .Iterators: zip, enumerate, leadindpairs
using .Iterators: Flatten, product # for generators

include("namedtuple.jl")
Expand Down
4 changes: 3 additions & 1 deletion stdlib/Test/src/Test.jl
Original file line number Diff line number Diff line change
Expand Up @@ -1429,7 +1429,9 @@ Base.ncodeunits(s::GenericString) = ncodeunits(s.string)
Base.codeunit(s::GenericString) = codeunit(s.string)
Base.codeunit(s::GenericString, i::Integer) = codeunit(s.string, i)
Base.isvalid(s::GenericString, i::Integer) = isvalid(s.string, i)
Base.next(s::GenericString, i::Integer) = next(s.string, i)
Base.start(s::Base.StringLIPairs{GenericString}) = start(Base.StringLIPairs(s.data.string))
Base.done(s::Base.StringLIPairs{GenericString}, state) = done(Base.StringLIPairs(s.data.string), state)
Base.next(s::Base.StringLIPairs{GenericString}, state) = next(Base.StringLIPairs(s.data.string), state)
Base.reverse(s::GenericString) = GenericString(reverse(s.string))
Base.reverse(s::SubString{GenericString}) =
GenericString(typeof(s.string)(reverse(String(s))))
Expand Down
30 changes: 21 additions & 9 deletions test/strings/basic.jl
Original file line number Diff line number Diff line change
Expand Up @@ -238,7 +238,6 @@ end
@test first(eachindex("foobar")) === 1
@test first(eachindex("")) === 1
@test last(eachindex("foobar")) === lastindex("foobar")
@test done(eachindex("foobar"),7)
@test Int == eltype(Base.EachStringIndex) ==
eltype(Base.EachStringIndex{String}) ==
eltype(Base.EachStringIndex{GenericString}) ==
Expand Down Expand Up @@ -480,17 +479,11 @@ end
@test nextind(s, lastindex(s)) > sizeof(s)
end
end

# Test cmp with AbstractStrings that don't index the same as UTF-8, which would include
# (LegacyString.)UTF16String and (LegacyString.)UTF32String, among others.
include("teststringtypes.jl")

mutable struct CharStr <: AbstractString
chars::Vector{Char}
CharStr(x) = new(collect(x))
end
Base.start(x::CharStr) = start(x.chars)
Base.next(x::CharStr, i::Int) = next(x.chars, i)
Base.done(x::CharStr, i::Int) = done(x.chars, i)
Base.lastindex(x::CharStr) = lastindex(x.chars)
@testset "cmp without UTF-8 indexing" begin
# Simple case, with just ANSI Latin 1 characters
@test "áB" != CharStr("áá") # returns false with bug
Expand Down Expand Up @@ -866,3 +859,22 @@ let x = SubString("ab", 1, 1)
@test y === x
chop("ab") === chop.(["ab"])[1]
end

@testset "Generic String APIs" begin
cs = CharString([' ', 'a', 'b', 'c', ' '])
@test cs == " abc "
@test lstrip(cs) == SubString(cs, 2) == "abc "
@test rstrip(cs) == " abc"
rs = RopeString([" ", "abc", " "])
@test rs == cs
@test lstrip(rs) == lstrip(cs) == "abc "
@test rstrip(rs) == rstrip(cs) == " abc"
ds = DecodeString(raw" \x61b\u0063 ")
@test ds == cs
@test lstrip(ds) == lstrip(cs) == "abc "
@test rstrip(ds) == rstrip(cs) == " abc"
ds2 = DecodeString(RopeString([" ", raw"\x61b\u00", "63 "]))
@test ds2 == cs
@test lstrip(ds2) == lstrip(cs) == "abc "
@test rstrip(ds2) == rstrip(cs) == " abc"
end
Loading

0 comments on commit 74961ab

Please sign in to comment.