Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index GroupedDataFrame with array of keys #2046

Merged
merged 13 commits into from
Feb 1, 2020

Conversation

jlumpe
Copy link
Contributor

@jlumpe jlumpe commented Dec 8, 2019

Use array of keys (GroupKey/Tuple/NamedTuple) in getindex(::GroupedDataFrame, ...).

I also fixed a couple of issues remaining in my last PR: a test for IndexStyle(::GroupKeys) and added some missing bindings to the documentation.

docs/src/lib/indexing.md Outdated Show resolved Hide resolved
docs/src/lib/indexing.md Outdated Show resolved Hide resolved
test/grouping.jl Outdated Show resolved Hide resolved
@jlumpe jlumpe force-pushed the gdf-arrayindexing branch from 138b64d to 84dc1a7 Compare December 9, 2019 02:08
@jlumpe
Copy link
Contributor Author

jlumpe commented Dec 9, 2019

Resolved issues and added support for InvertedIndex.

docs/src/lib/indexing.md Outdated Show resolved Hide resolved
@bkamins
Copy link
Member

bkamins commented Dec 9, 2019

Looks good apart from the comment I left (and maybe @nalimilan will want to add something 😄)

@bkamins
Copy link
Member

bkamins commented Dec 9, 2019

Also please make sure you test:

  • Not(:) (which I think will not work correctly now)
  • end and Not(end) (they should work I think, but have not tested it)
  • : (it is tested as far as I recall but can you please check it)

Thank you.

@bkamins
Copy link
Member

bkamins commented Dec 9, 2019

I have thought over your proposal and I think we need to be more strict in a way we introduce allowed indexing. I will leave the comments in relevant methods.

@jlumpe
Copy link
Contributor Author

jlumpe commented Dec 9, 2019

Before we go any further with this - I went a little further than your request to group all the indexing code together and just split grouping.jl into multiple files which I think are much more manageable. If I did it in this PR it would be hard to tell what else I changed so I opened a separate one (#2050). If you want to merge that I can rebase this PR on top of it.

@bkamins
Copy link
Member

bkamins commented Dec 10, 2019

Let us wait for @nalimilan to comment on #2050 then.

docs/src/lib/indexing.md Outdated Show resolved Hide resolved
src/groupeddataframe/grouping.jl Outdated Show resolved Hide resolved
src/groupeddataframe/grouping.jl Outdated Show resolved Hide resolved
src/groupeddataframe/grouping.jl Outdated Show resolved Hide resolved
src/groupeddataframe/grouping.jl Outdated Show resolved Hide resolved
src/groupeddataframe/grouping.jl Outdated Show resolved Hide resolved
@jlumpe
Copy link
Contributor Author

jlumpe commented Dec 26, 2019

Updates:

  • Rebased onto master with new organization of src/groupeddataframe/ files.
  • getindex(gd, a::AbstractArray) narrowed to getindex(gd, ::AbstractVector)
  • Removed unnecessary where statements
  • Significantly cleaned up code
  • Implemented gd[Not(:)] and test
  • Tests for GroupedDataFrame first(), last(), lastindex(), gd[end]
  • Correctly infer element type of arrays with eltype() == Any or anything else nonspecific, throw an exception for non-homogeneous arrays (even when all elements are individually valid scalar indices)
    • Applies to Not(::AbstractArray) as well

Issues not addressed:

  • Not(end) isn't syntactically valid, x[end] is syntactic sugar for x[lastindex(x)] but this isn't available outside of a :ref expression.
  • AbstractVector{Any} full of Bools does not work as an index. I left it out because AbstractArray doesn't check for it either.

@bkamins
Copy link
Member

bkamins commented Dec 26, 2019

Thank you for the update.

Not(end) isn't syntactically valid

It is valid in DataFrames.jl:

julia> df = DataFrame(rand(3,4))
3×4 DataFrame
│ Row │ x1        │ x2       │ x3       │ x4        │
│     │ Float64   │ Float64  │ Float64  │ Float64   │
├─────┼───────────┼──────────┼──────────┼───────────┤
│ 1   │ 0.0193821 │ 0.489792 │ 0.649841 │ 0.887786  │
│ 2   │ 0.35907   │ 0.727754 │ 0.392126 │ 0.0994079 │
│ 3   │ 0.315079  │ 0.969064 │ 0.956188 │ 0.9801    │

julia> df[:, Not(end)]
3×3 DataFrame
│ Row │ x1        │ x2       │ x3       │
│     │ Float64   │ Float64  │ Float64  │
├─────┼───────────┼──────────┼──────────┤
│ 1   │ 0.0193821 │ 0.489792 │ 0.649841 │
│ 2   │ 0.35907   │ 0.727754 │ 0.392126 │
│ 3   │ 0.315079  │ 0.969064 │ 0.956188 │

julia> df[Not(end), :]
2×4 DataFrame
│ Row │ x1        │ x2       │ x3       │ x4        │
│     │ Float64   │ Float64  │ Float64  │ Float64   │
├─────┼───────────┼──────────┼──────────┼───────────┤
│ 1   │ 0.0193821 │ 0.489792 │ 0.649841 │ 0.887786  │
│ 2   │ 0.35907   │ 0.727754 │ 0.392126 │ 0.0994079 │

and also for normal vectors:

julia> x = [1,2,3,4]
4-element Array{Int64,1}:
 1
 2
 3
 4

julia> x[Not(end)]
3-element Array{Int64,1}:
 1
 2
 3

I will have a look at your code to check.

AbstractVector{Any} full of Bools does not work as an index.

Yes - this is how it should work

@bkamins
Copy link
Member

bkamins commented Dec 26, 2019

It also seems from your implementation that you do not support gd[Not(1)] nor things like gd[Not(Not(1))]. Please correct me if I am wrong.

I would rather delegate Not (except for non-standard indexing and optimizations you have) to be handled by the mechanics provided by InvertedIndices.jl and then all should work as expected.

Also - as I have commended earlier - I would recommend you to use Not not InvertedIndex in the code, as this is the style we use everywhere else in the code base (interestingly I have checked that InvertedIndices.jl mixes both of them in different methods, but I think it is better to stick to one form everywhere).

src/groupeddataframe/groupeddataframe.jl Outdated Show resolved Hide resolved
src/groupeddataframe/grouping.jl Outdated Show resolved Hide resolved
@jlumpe
Copy link
Contributor Author

jlumpe commented Dec 26, 2019

You're right about end, I'm not sure why I thought you couldn't use it in non-trivial expressions but it looks like you can do pretty much whatever you want with it in between the brackets. I have added this test.

@jlumpe
Copy link
Contributor Author

jlumpe commented Dec 26, 2019

For everything involving InvertedIndex.jl's built-in behavior and Base.to_index and related, it all seems to only really work on AbstractArrays. Is there a reason GroupedDataFrame doesn't subclass it (I think this discussion already exists but I can't find it)? If it did we'd just need definitions for getindex with integers and arrays of integers and then Base.to_index for scalar keys or arrays of them, and we'd get everything else.

@bkamins
Copy link
Member

bkamins commented Dec 26, 2019

I think this discussion already exists but I can't find it

I think the main reason is that this decision is very committing so as long as it can be postponed we try not to do it. The major issue is that Julia does not allow multiple inheritance currently.

@jlumpe jlumpe force-pushed the gdf-arrayindexing branch from 64dccb3 to 279b1c1 Compare January 2, 2020 17:18
@jlumpe
Copy link
Contributor Author

jlumpe commented Jan 2, 2020

  • Handle arrays of non-concrete type, check for homogeneity
  • Indexing with Not now uses the InvertedIndices.jl machinery. Implements to_indices(gd, (idx,)::Tuple{Not}) instead of getindex(gd, ::Not). Recurses on ints = Base.to_indices(gd, (idx.skip,)) and then gets inverted integer indices with to_indices(Base.OneTo(length(gd)), ints). This handles all cases, but one extra definition is needed to avoid a method ambiguity.
  • Tests for Not{Not}
  • Test for mixed non-Bool Integer arrays

@bkamins
Copy link
Member

bkamins commented Jan 2, 2020

Looks good. Thank you. I just left one comment (with a request for an additional test).

@jlumpe
Copy link
Contributor Author

jlumpe commented Jan 12, 2020

Remaining issues should be resolved.

Copy link
Member

@bkamins bkamins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Hopefully we have caught all corner cases 😄.

@nalimilan - can you please merge it when you are fine with this PR. Thank you!

@bkamins
Copy link
Member

bkamins commented Jan 18, 2020

@nalimilan - please have a look at it when you have time so that we can merge it if it is OK.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, I hadn't noticed this was ready to review. Looks mostly good!

src/groupeddataframe/groupeddataframe.jl Outdated Show resolved Hide resolved
src/groupeddataframe/groupeddataframe.jl Outdated Show resolved Hide resolved
src/groupeddataframe/groupeddataframe.jl Outdated Show resolved Hide resolved
src/groupeddataframe/groupeddataframe.jl Show resolved Hide resolved
src/groupeddataframe/groupeddataframe.jl Outdated Show resolved Hide resolved
src/groupeddataframe/groupeddataframe.jl Outdated Show resolved Hide resolved
src/groupeddataframe/groupeddataframe.jl Outdated Show resolved Hide resolved
src/groupeddataframe/groupeddataframe.jl Outdated Show resolved Hide resolved
src/groupeddataframe/groupeddataframe.jl Outdated Show resolved Hide resolved
@jlumpe
Copy link
Contributor Author

jlumpe commented Jan 30, 2020

Resolved rest of @nalimilan's suggestions.

@bkamins
Copy link
Member

bkamins commented Jan 31, 2020

Looks good. I just left one minimal comment (we should either narrow the condition there or special case when T is Bool).

@nalimilan - apart from this I would merge this. Thank you!

@nalimilan nalimilan merged commit 46a4ca8 into JuliaData:master Feb 1, 2020
@nalimilan
Copy link
Member

Thanks @jlumpe!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants