-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add subset #2496
Add subset #2496
Conversation
Let us confirm if the functionality is as we like it before I write the tests. Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for implementing this!
I don't think you've gotten all the missing
logic worked out yet.
julia> df = DataFrame(a = [1, 1, 1, 2, 2, 2, 2], b = [1, 1, 2, 101, 101, 102, missing]);
julia> where(df, :b => (t -> t .> mean(t)))
ERROR: MethodError: no method matching &(::Missing)
I think this only shows up when we return exactly one column that contains missing values. Proabably a missing method for (&)(missing)
This looks great! I just have two comments about missing values. I'm not a fan of introducing a new kwarg Also, I support dropping rows with missing values because that is why what R/sql/stata do. But @nalimilan was concerned about it, so maybe it would be worth having a vote somewhere to make sure others agree. |
Thank you both for looking into it. Indeed there are some rough edges - thank you for catching them (such things happen if tests are not written yet)
This is exactly the reason. I will fix this.
I think we should drop them (but clearly indicate that we behave here differently than expected).
I would assume that it would be hardly used, so it is not that problematic hopefully. On the other hand having it makes, in particular, sure that users notice more easily that we have a non-standard behavior here. In general - thank you for the feedback. I will fix the mistake, and let us wait for @nalimilan to comment when he has time (I would not put a vote on it - I think us four have enough understanding of the ecosystem to decide). |
w.r.t |
This is less efficient as it will not be able to do broadcasting fusion (you will have allocations grow linearly with the number of conditions). |
I have thought of pros and cons of dropping row vs erroring on The question is - if we error on missing how do we cleanly handle something like:
assuming both You would have to write something like:
which seems verbose (but indeed it is not the end of the world and at least it is explicit). By the way. Can you please comment what would be the recommended style of writing simple filters:
or
? The benefit of:
is that it is compiled once, while both:
would be compiled anew on every call. |
@pdeffebach + @matthieugomez : if you could spare some time it would be great if you could comment with examples similar to my above that you find typical use-cases, so that we have more hands-on material to make an informed decision what to do. Thank you! |
@nalimilan - do you have an opinion what we should do with handling of |
Sorry I've given the priority to breaking issues. I think this one could wait for post-0.22 so that we have time to find the best design. I think having a Regarding the name of that argument, |
src/abstractdataframe/where.jl
Outdated
function where(df::AbstractDataFrame, @nospecialize(args...); | ||
dropmissing::Bool=true, view::Bool=false) | ||
row_selector = _get_where_conditions(df, args, dropmissing) | ||
return view ? Base.view(df, row_selector, :) : df[row_selector, :] | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably worth using @inline
to avoid the type instability, with @noinline
in _get_where_conditions
(with inference tests to prevent regressions). Same for the GroupedDataFrame
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - I have added @inline
. I keep this comment open to remember testing against it when we finalize the design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code is type unstable even with @inline
and @noinline
in _get_where_conditions
. Do you think it can be resolved somehow? (and do you think it is so problematic given DataFrame
is type unstable anyway?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's weird. We use this kind of pattern elsewhere and it works, right? Is row_selector
correctly inferred?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nalimilan - given we add ungroup
kwarg I am not sure we can get type stability here, do we? (as now we have yet another option of the return value related to the second kwarg)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ungroup
shouldn't really make things worse as long as the function is inlined and the value of arguments know at compile time. But if that doesn't work in the first place...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I have now:
julia> @code_warntype subset(DataFrame(a=1:10), :a => ByRow(<(4)), view=false, skipmissing=false)
Variables
#unused#::Core.Compiler.Const(DataFrames.var"#subset##kw"(), false)
@_2::NamedTuple{(:view, :skipmissing),Tuple{Bool,Bool}}
@_3::Core.Compiler.Const(DataFrames.subset, false)
df::DataFrame
args::Tuple{Pair{Symbol,ByRow{Base.Fix2{typeof(<),Int64}}}}
skipmissing::Bool
view::Bool
@_8::Bool
@_9::Bool
Body::Any
1 ── %1 = Base.haskey(@_2, :skipmissing)::Core.Compiler.Const(true, false)
│ %1
│ %3 = Base.getindex(@_2, :skipmissing)::Bool
│ %4 = (%3 isa DataFrames.Bool)::Core.Compiler.Const(true, false)
│ %4
└─── goto #3
2 ── Core.Compiler.Const(:(%new(Core.TypeError, Symbol("keyword argument"), :skipmissing, DataFrames.Bool, %3)), false)
└─── Core.Compiler.Const(:(Core.throw(%7)), false)
3 ┄─ (@_8 = %3)
└─── goto #5
4 ── Core.Compiler.Const(:(@_8 = false), false)
5 ┄─ (skipmissing = @_8)
│ %13 = Base.haskey(@_2, :view)::Core.Compiler.Const(true, false)
│ %13
│ %15 = Base.getindex(@_2, :view)::Bool
│ %16 = (%15 isa DataFrames.Bool)::Core.Compiler.Const(true, false)
│ %16
└─── goto #7
6 ── Core.Compiler.Const(:(%new(Core.TypeError, Symbol("keyword argument"), :view, DataFrames.Bool, %15)), false)
└─── Core.Compiler.Const(:(Core.throw(%19)), false)
7 ┄─ (@_9 = %15)
└─── goto #9
8 ── Core.Compiler.Const(:(@_9 = false), false)
9 ┄─ (view = @_9)
│ %25 = (:skipmissing, :view)::Core.Compiler.Const((:skipmissing, :view), false)
│ %26 = Core.apply_type(Core.NamedTuple, %25)::Core.Compiler.Const(NamedTuple{(:skipmissing, :view),T} where T<:Tuple, false)
│ %27 = Base.structdiff(@_2, %26)::Core.Compiler.Const(NamedTuple(), false)
│ %28 = Base.pairs(%27)::Core.Compiler.Const(Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}(), false)
│ %29 = Base.isempty(%28)::Core.Compiler.Const(true, false)
│ %29
└─── goto #11
10 ─ Core.Compiler.Const(:(Core.tuple(@_2, @_3, df)), false)
└─── Core.Compiler.Const(:(Core._apply_iterate(Base.iterate, Base.kwerr, %32, args)), false)
11 ┄ %34 = Core.tuple(skipmissing, view, @_3, df)::Core.Compiler.PartialStruct(Tuple{Bool,Bool,typeof(subset),DataFrame}, Any[Bool, Bool, Core.Compiler.Const(DataFrames.subset, false), DataFrame])
│ %35 = Core._apply_iterate(Base.iterate, DataFrames.:(var"#subset#406"), %34, args)::Any
└─── return %35
julia> f() = subset(DataFrame(a=1:10), :a => ByRow(<(4)), view=false, skipmissing=false)
f (generic function with 1 method)
julia> @code_warntype f()
Variables
#self#::Core.Compiler.Const(f, false)
Body::Any
1 ─ %1 = (1:10)::Core.Compiler.Const(1:10, false)
│ %2 = (:a,)::Core.Compiler.Const((:a,), false)
│ %3 = Core.apply_type(Core.NamedTuple, %2)::Core.Compiler.Const(NamedTuple{(:a,),T} where T<:Tuple, false)
│ %4 = Core.tuple(%1)::Core.Compiler.Const((1:10,), false)
│ %5 = (%3)(%4)::Core.Compiler.Const((a = 1:10,), false)
│ %6 = Core.kwfunc(Main.DataFrame)::Core.Compiler.Const(Core.var"#Type##kw"(), false)
│ %7 = (%6)(%5, Main.DataFrame)::DataFrame
│ %8 = (<)(4)::Core.Compiler.Const(Base.Fix2{typeof(<),Int64}(<, 4), false)
│ %9 = Main.ByRow(%8)::Core.Compiler.Const(ByRow{Base.Fix2{typeof(<),Int64}}(Base.Fix2{typeof(<),Int64}(<, 4)), false)
│ %10 = (:a => %9)::Core.Compiler.Const(:a => ByRow{Base.Fix2{typeof(<),Int64}}(Base.Fix2{typeof(<),Int64}(<, 4)), false)
│ %11 = (:view, :skipmissing)::Core.Compiler.Const((:view, :skipmissing), false)
│ %12 = Core.apply_type(Core.NamedTuple, %11)::Core.Compiler.Const(NamedTuple{(:view, :skipmissing),T} where T<:Tuple, false)
│ %13 = Core.tuple(false, false)::Core.Compiler.Const((false, false), false)
│ %14 = (%12)(%13)::Core.Compiler.Const((view = false, skipmissing = false), false)
│ %15 = Core.kwfunc(Main.subset)::Core.Compiler.Const(DataFrames.var"#subset##kw"(), false)
│ %16 = (%15)(%14, Main.subset, %7, %10)::Any
└── return %16
(not passing kwargs does not change this)
Maybe the reason is that we have a vararg function with @nospecialize
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removing @nospecialize
did not change anything.
Sure - we can skip it for now to gather more feedback what would be best (especially concrete examples so that we can judge between both designs). |
Given the discussion in #2417 I have changed my mind and think that However, what we should discuss I think is @pdeffebach thinks that DataFramesMeta.jl should use the |
I have to admit that #2417 is a bit of a niche and tangential issue to base this decision on. I am okay with a bifurcation in DataFramesMeta. I suppose I will probably start out with mimicking DataFrames and then change it if enough people complain? This really underscores how we need to
Then it matters less what the default behavior in |
I would like to revive this PR and make a decision:
|
I'm fine with a different name for I'm okay with |
The only thing I feel strongly about is that dropmissing should be called skipmissing if it defaults to false. |
As I said somewhere before I'm not a fan of the name Regarding missings, I think we should be consistent everywhere, and since we are strict in |
OK, so I vote for:
|
OK - I will add a comment in the code though. |
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
src/abstractdataframe/subset.jl
Outdated
@inline function subset(gdf::GroupedDataFrame, @nospecialize(args...); | ||
skipmissing::Bool=false, view::Bool=false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't have to be implemented right now, but I guess at some point it will make sense to support ungroup=false
to get a GroupedDataFrame
like with select
/transform
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I had in an initial implementation. Then I thought that the problem is that ungroup=false
does not give us any benefit (as in select
/transform
as we have to run groupby
anyway) and complicates the API. But given you raised it I think it is OK to have it for consistency. I will add it back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could apply the grouping code directly to group indices, using the same path as refarray
/refpool
. It could often be faster than using the original columns. Maybe leave a TODO?
src/abstractdataframe/where.jl
Outdated
function where(df::AbstractDataFrame, @nospecialize(args...); | ||
dropmissing::Bool=true, view::Bool=false) | ||
row_selector = _get_where_conditions(df, args, dropmissing) | ||
return view ? Base.view(df, row_selector, :) : df[row_selector, :] | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's weird. We use this kind of pattern elsewhere and it works, right? Is row_selector
correctly inferred?
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
src/abstractdataframe/subset.jl
Outdated
@inline function subset(gdf::GroupedDataFrame, @nospecialize(args...); | ||
skipmissing::Bool=false, view::Bool=false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could apply the grouping code directly to group indices, using the same path as refarray
/refpool
. It could often be faster than using the original columns. Maybe leave a TODO?
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
I have applied the same fix in the other docstring. |
You mean to use
Do you think it would be worth it (I expect this code-path will not be used much). |
Yes, something like this. That's not super important, but it doesn't hurt (at least as a TODO). |
I will add it as TODO, as I do not want to overcomplicate the code now. |
I left a TODO, as:
In order to do a fully fast implementation we should treat |
OK - so regarding this PR is there anything we should discuss or it is good to be merged? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to fix the return type stability issue at some point, but this doesn't have to block the PR.
@pdeffebach, @matthieugomez - I will merge this PR later this week. Please comment if you want to add something to the discussion before merging. |
This PR looks good to me. I have a few suggestions.
|
These are good points. Thank you for raising them.
I thought @matthieugomez wanted
I will add
This is strange. I would think that it should be the same as the most compilation work should be with
Probably one would have to do it manually (which means it would be not very convenient). My thinking was that "OR" is needed much less frequently than "AND" (and if you need "OR" you probably want it on the same column, which can be expressed using an anonymous function). |
I started updating the manual, but given the discussion we had today on Slack I gave up - it was too ad hoc. The manual needs to be updated. I have opened a separate PR for this: #2595. |
IMO using |
I have investigated the performance related to compilation and I could not improve things here (the compilation time is mostly
I will wait for a feedback and if there are no more comments I am going to merge this PR soon as is now. |
Thank you! |
This follows #2314 discussion.
CC @pdeffebach @matthieugomez