Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support indexing with generators #37648

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Support indexing with generators #37648

wants to merge 3 commits into from

Conversation

omus
Copy link
Member

@omus omus commented Sep 18, 2020

Indexing into generators is something I found myself wanting as part of JuliaTime/TimeZones.jl#291. Basically I had function which contained code similar to:

x = [expensive_call(i) for i in range]

In some cases I would need access to all of the elements but in the most performance critical case I only needed access to a single element for which the index was already known. Using an index-able generator was a good solution for this.

I realize that not all generators will be index-able but this allows a generator to use indexing if the underlying iterator used by the generator also supports indexing.

@omus omus added feature Indicates new feature / enhancement requests collections Data structures holding multiple items, e.g. sets labels Sep 18, 2020
@mbauman
Copy link
Member

mbauman commented Sep 18, 2020

I have a branch that begins work on adding indexing to the ProductIterator — I've been out of the loop for a while now but if that's still not supported I think it should be before doing this.

@omus
Copy link
Member Author

omus commented Sep 18, 2020

have a branch that begins work on adding indexing to the ProductIterator — I've been out of the loop for a while now but if that's still not supported I think it should be before doing this.

Adding indexing to ProductIterator would be great! Although having indexing with ProductIterator would definitely be useful here I don't think this PR requires that functionality as we can always add it later.

@rfourquet
Copy link
Member

And for reference, I also have an old PR (#22489) which adds indexing/setindexing to a bunch of iterators from Iterators (but not finished for products ;-) ). but which failed to get traction. Maybe one reason was that it was trying to add also traits to classify indexing capabilities (I could easily remove the traits aspect and keep only indexing methods). Mentioning thishere so that it gets also considered when making a decision for this PR...

@N5N3
Copy link
Member

N5N3 commented Sep 21, 2020

Should these be included in LazyArray.jl as new lazy types?

base/generator.jl Outdated Show resolved Hide resolved
test/functional.jl Show resolved Hide resolved
base/generator.jl Show resolved Hide resolved
@omus
Copy link
Member Author

omus commented Sep 28, 2020

Some benchmarks with comparing indexing into generators versus having to collect the generator to index into them:

julia> g = (string(x) for x in [1 3 5; 2 4 6])
Base.Generator{Matrix{Int64}, typeof(string)}(string, [1 3 5; 2 4 6])

julia> @btime g[end]
  67.197 ns (2 allocations: 96 bytes)
"6"

julia> @btime collect(g)[end]
  286.318 ns (13 allocations: 704 bytes)
"6"

julia> @btime g[1:6]
  292.162 ns (14 allocations: 832 bytes)
6-element Vector{String}:
 "1"
 "2"
 "3"
 "4"
 "5"
 "6"

julia> @btime collect(g)[1:6]
  311.695 ns (14 allocations: 832 bytes)
6-element Vector{String}:
 "1"
 "2"
 "3"
 "4"
 "5"
 "6"

However, I did notice worse performance when handling colon:

julia> @btime g[:]
  471.658 ns (16 allocations: 864 bytes)
6-element Vector{String}:
 "1"
 "2"
 "3"
 "4"
 "5"
 "6"

julia> @btime collect(g)[:]
  307.751 ns (14 allocations: 832 bytes)
6-element Vector{String}:
 "1"
 "2"
 "3"
 "4"
 "5"
 "6"

@nalimilan nalimilan requested a review from mbauman September 28, 2020 16:38
@@ -52,7 +52,16 @@ size(g::Generator) = size(g.iter)
axes(g::Generator) = axes(g.iter)
ndims(g::Generator) = ndims(g.iter)

getindex(g::Generator, I...) = map(g.f, g.iter[I...])
function getindex(g::Generator, I...)
I′ = to_indices(g.iter, I)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_indices doesn't work on non-array indexables like NamedTuple and Dict. As I commented earlier, I don't think we should support vectorized indexing.

You can use LazyArrays.jl or MappedArrays.jl to get a full array API. If we were to add vectorized indexing for lazy mapping in Base, Broadcasted is probably a better interface to do it since it's already very array-like.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I commented earlier, I don't think we should support vectorized indexing.

The tricky part of your statement is how would we limit getindex to just use scalar indexing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just define getindex(::Generator, ::Integer...) and nothing else. I also think this is the right thing to do.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how would we limit getindex to just use scalar indexing?

Ah, good point. How about introducing

function to_scalar_indices(A, I)
    J = to_indices(A, I)
    J isa Tuple{Vararg{Integer}} || error("expected scalar indices. got: ", I)
    return J
end

scalar_getindex(A::AbstractArray, I...) = getindex(A, to_scalar_indices(A, I)...)
scalar_getindex(A, I...) = getindex(A, I...)

and use it for Generator?

Just define getindex(::Generator, ::Integer...)

Isn't it too restrictive and a half-way solution if it were to restrict the API? For example, it'd mean x[1] for x = Iterators.map(string, (a=1, b=2)) works but x[:a] doesn't.

Also, I guess we need a different implementation for AbstractDict or error out? Applying f after getindex is not correct for AbstractDict.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just define getindex(::Generator, ::Integer...)

Mostly that works but we get into some strange corner cases with AbstractDict:

julia> function Base.getindex(g::Base.Generator, I::Integer...)
           g.f(g.iter[I...])
       end

julia> g = (x * y for (x, y) in Dict(3 => 4))
Base.Generator{Dict{Int64,Int64},var"#1#2"}(var"#1#2"(),
Dict(3 => 4))

julia> collect(g)[1]
12

julia> g[1]
ERROR: KeyError: key 1 not found
Stacktrace:
 [1] getindex at ./dict.jl:467 [inlined]
 [2] getindex(::Base.Generator{Dict{Int64,Int64},var"#1#2"}, ::Int64) at ./REPL[1]:2
 [3] top-level scope at REPL[3]:1

This was the main reason I ended up restricting the logic to AbstractArray

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we need a different implementation for AbstractDict or error out

That was also my conclusion. A first pass at the AbstractDict version looks like:

function Base.getindex(g::Base.Generator{<:AbstractDict}, I...)
    subset = collect(pairs(g.iter))[I...]
    if subset isa Pair
        g.f(subset)
    else
        map(g.f, subset)
    end
end

Copy link
Contributor

@rapus95 rapus95 Dec 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in my #37648 (comment) to this whole PR, why do you want to support AbstractDict in general?
IMO Generator should behave like a lazy map. And for map we have:

julia> map(x->println(x), Dict(:a=>2, :c=>"h"))
ERROR: map is not defined on dictionaries
Stacktrace:
 [1] error(::String) at .\error.jl:33
 [2] map(::Function, ::Dict{Symbol,Any}) at .\abstractarray.jl:2190
 [3] top-level scope at REPL[10]:1

So I'd definitively error out, optimally with the same error.

@omus
Copy link
Member Author

omus commented Sep 28, 2020

As pointed out supporting getindex when the iterator is not array-like gets difficult. We could define a fallback of getindex(g::Generator, I...) = collect(g)[I...] which is functional but not very performant. Instead I've limited this PR to only support getindex(::Generator{<:AbstractArray}, I...) which should provide the functionality most people care about while ensuring that users don't use getindex on generators where they should be doing collect(g)[I...] themselves.

@mcabbott
Copy link
Contributor

Should this also define keys? Today I wanted something like this to work:

 findall(c->c[1]>0, eachcol(randn(2,20)))

Adding Base.keys(g::Base.Generator) = keys(g.iter) would make it so, and might fit well with being able to index.

@omus
Copy link
Member Author

omus commented Oct 9, 2020

Should this also define keys?

See: #37648 (comment)

@omus omus force-pushed the cv/generator-indexing branch from 91f4da0 to bad925e Compare October 9, 2020 20:37
@omus
Copy link
Member Author

omus commented Oct 9, 2020

I found a reasonable fallback for "handling" index with iterators. As getindex and iteration don't have to return the same values I need index on the iteration results. There's still a good performance here for cases where f is an expensive call:

julia> g = (string(p) for p in Dict(Pair.('a':'z', 1:26)));

julia> @btime g[2:5]
  1.481 μs (64 allocations: 3.03 KiB)
4-element Vector{String}:
 "'f' => 6"
 "'w' => 23"
 "'d' => 4"
 "'e' => 5"

julia> @btime collect(g)[2:5]
  8.895 μs (409 allocations: 15.81 KiB)
4-element Vector{String}:
 "'f' => 6"
 "'w' => 23"
 "'d' => 4"
 "'e' => 5"

julia> g = (p for p in Dict(Pair.('a':'z', 1:26)));

julia> @btime g[2:5]
  187.209 ns (3 allocations: 784 bytes)
4-element Vector{Pair{Char, Int64}}:
 'f' => 6
 'w' => 23
 'd' => 4
 'e' => 5

julia> @btime collect(g)[2:5]
  172.204 ns (2 allocations: 640 bytes)
4-element Vector{Pair{Char, Int64}}:
 'f' => 6
 'w' => 23
 'd' => 4
 'e' => 5

@rapus95
Copy link
Contributor

rapus95 commented Dec 17, 2020

Suggestion: Change lines 61 and 71 to their lazy equivalents, i.e. Generator(g.f, subset) instead of map(g.f, subset)

Reasoning 1

We shouldn't mix lazy and eager evaluation if the user already decided on lazy evaluation by supplying a generator in the first place.

Reasoning 2

There are two things that Generator(identity, iter)[:] could mean:

a) collect(iter)
b) iter

To me b) seems to be favourable. In that case we must not call map(g.f, iter).

Example

(1:8)[2:3] == 2:3 (not ==[2,3])
If we collect in those cases we would mix up lazy and eager evaluation and probably introduce some unexpected allocations.

Reasoning 3

Given that Generator is the lazy equivalent of map, I would expect to stay in the lazy operation mode as long as possible (i.e. without collecting).
And as such the result of indexing into a generator should also be the lazy, non-collected, equivalent of indexing into an allocated structure.
Ultimately the decision about collection should be made by indexing into the iterator (for which the Generator is just a lazy relay) and thus shouldn't be made by the relay type by explicitly collecting the iterator.

This wouldn't cost us any possibilities because anybody who wants could still apply a manual explicit collect.

Overall question

I know, I'm reasking what already has been asked, but I'm not convinced by the arguments.
Why do we want the lazy map variant to be more capable than the eager variant? I would consider that to be unexpected. Especially I wouldn't want the lazy variant to start collecting just to be more capable than the eager variant (which it wouldn't be anyway since it is itself eager then).

Why not just stick to
getindex(g::Generator, I...) = Generator(g.f, g.iter[I...])
to make it behave exactly like the eager variant and make it an effective pass-through call that respects indexing behaviour of the iterator that is mapped lazily. (i.e. Generator(f, iter)[I...] should be equal to Generator(f, iter[I...]), instead of map(f, iter[I...]))
It'd also align perfectly with #34678
That way repeated sub-indexing into a generator won't necessarily allocate at all and definitively won't apply f to unnecessary elements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
collections Data structures holding multiple items, e.g. sets feature Indicates new feature / enhancement requests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants