Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deprecate map on GroupedDataFrame #2662

Merged
merged 7 commits into from
Mar 25, 2021
Merged

deprecate map on GroupedDataFrame #2662

merged 7 commits into from
Mar 25, 2021

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Mar 14, 2021

Fixes #2254

I will comment on the design decisions in the code.

test/data.jl Outdated
[:x1] => x -> x > 0, ["x1"] => x -> x > 0,
r"1" => x -> x > 0, AsTable(:) => x -> x.x1 > 0)
@test filter(fun, df) isa DataFrame
@inferred filter(fun, df)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strangely removing methods for filter but not touching this specific method made @inferred fail. CC @timholy as maybe you can explain this?

gdfv = groupby(dfv, :a)

for x in (df, dfv)
@test collect(x) == map(identity, x) == [v for v in x] == [x[i, :] for i in 1:3]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I did not implement custom map and collect and thus currently the following identity holds. Do we really think it is a problem?

note that e.g. map can take several iterables like map(fun, df1, df2) and the question is if we would want to intercept this also?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd take a defensive approach and make everything for which we don't have a clear use case an error for now. People may expect collect to materialize a data frame in some way, like in Query.jl or dplyr, and be confused if they get a weird vector of rows. Likewise, we could imagine doing something useful in the future with a multi-argument map.

Regarding map(identity, df), as I said at #2254, I think it should return a DataFrame, just like all cases where the function returns a row-like object.

end

for x in (gdf, gdfv)
@test collect(x) == map(identity, x) == [v for v in x] == [x[i] for i in 1:3]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how things worked (and still work) for GroupedDataFrame.


# Iteration protocol

function Base.iterate(df::AbstractDataFrame, i=1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
function Base.iterate(df::AbstractDataFrame, i=1)
function Base.iterate(df::AbstractDataFrame, i::Int=1)

src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/iteration.jl Show resolved Hide resolved
gdfv = groupby(dfv, :a)

for x in (df, dfv)
@test collect(x) == map(identity, x) == [v for v in x] == [x[i, :] for i in 1:3]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd take a defensive approach and make everything for which we don't have a clear use case an error for now. People may expect collect to materialize a data frame in some way, like in Query.jl or dplyr, and be confused if they get a weird vector of rows. Likewise, we could imagine doing something useful in the future with a multi-argument map.

Regarding map(identity, df), as I said at #2254, I think it should return a DataFrame, just like all cases where the function returns a row-like object.

@bkamins
Copy link
Member Author

bkamins commented Mar 17, 2021

Actually given the comment by @pdeffebach in #2654 (comment) (which I read as "the comment of experienced user, but not hard-core data scientist"), I am hesitant to implement this PR at all.
Maybe it is better to leave filter(:x => predicate, df) even if it is not consistent, but it will be easier to understand by non-experts.

Also not having iterate or map or collect is no problem at all for normal users.

The only thing that for sure needs fixing is eltype of DataFrameRows. Here is an explicit example of the bug we have now:

julia> using DataFrames

julia> df = view(DataFrame(a=1), :, :)
1×1 SubDataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1

julia> dfr = eachrow(df)
1×1 DataFrameRows
 Row │ a
     │ Int64
─────┼───────
   1 │     1

julia> eltype(dfr)
DataFrameRow{SubDataFrame{DataFrame, DataFrames.Index, Base.OneTo{Int64}}, DataFrames.Index}

julia> typeof(dfr[1])
DataFrameRow{DataFrame, DataFrames.Index}

@bkamins
Copy link
Member Author

bkamins commented Mar 17, 2021

I also prefer what @pdeffebach suggested as we start to loose the grip on "non-breaking" promise (I know we do not want to break anything in this PR but it is almost breaking for a very common use case 😄).

@nalimilan
Copy link
Member

I think it's OK if we don't implement iterate, map and collect right now. What matters is that we reserve this in case we want to define them in 1.x without breaking anything.

Regarding filter, it's true that deprecating it will hit many users as it's currently the recommended solution. But deprecations are disabled by default so it doesn't have a large impact. I guess it only matters to deprecate it before 1.0 if we consider that it would allow us to remove it before 2.0 if we wanted to. Otherwise, we could deprecate it later if we eventually decide that it's inconsistent with the rest of the API. I don't have a strong preference.

@bkamins
Copy link
Member Author

bkamins commented Mar 18, 2021

@pdeffebach - can you please add a "voice of the user" and "downstream package maintainer" regarding filter? Thank you!

Given the discussion I would not implement iterate et al. for AbstractDataFrame right now. Still there is a question then if we want to add custom behavior for map and collect for GroupedDataFrame (as they currently use the generic methods)

@pdeffebach
Copy link
Contributor

I am fine to deprecate filter.

I prefer consistency with transform and select, and thus use subset rather than filter. If we don't implement iteration, then I think filter makes less sense. But by that logic we would also have to deprecate first etc. I don't think it's worth it right now to expunge all "DataFrame as collection of rows" functions.

It's worth noting that unique now also operates on columns. So filter is even more the odd one out after #2652 .

I don't use it in DataFramesMeta, either, and don't have plans for a @filter macro.

Regarding map. Maybe an initial first step is to allow map(source => fun, eachrow(df)). We could make this very performant, I think, by doing something similar to @eachrow in DataFramesMeta.

I'm not sure about iterate, but we could make it easier to expose Tables.namedtupleiterator.

To reply form another thread here

Maybe in the not-too-distant future we will be able to fix this using a trick similar to what @andyferris mentioned recently (I lost the link): if we can inspect the function's code to detect which variables are used, we can pass it a small NamedTuple instead of a DataFrameRow and benefit from type stability.

I don't think this is a good idea. Since I don't think we have seen any examples yet in the wild of someone doing this kind of inspection and having things be flexible and not have too many edge cases.

@bkamins
Copy link
Member Author

bkamins commented Mar 18, 2021

So - the decision would be to deprecate filter fully for AbstractDataFrame to have a flexibility post-1.0 release to decide what to do. Are we in agreement here?

Then again we go back to the decision what to do with filter for GroupedDataFrame.

@nalimilan
Copy link
Member

unique doesn't operate on columns: it returns unique rows. It simply allows considering only some columns for the comparison. We have lots of row-oriented functions like that, so I don't think filter is really exceptional. But as I said I'm not opposed to deprecating it.

GroupedDataFrames are the most difficult part. For filter I think it's easy, as a GroupedDataFrame should always be returned. For map, what's tricky is that different return types are possible. When a table object is returned, it makes sense to return a GroupedDataFrame. When a row object is returned, it's less obvious as we could return either a GroupedDataFrame (but that doesn't make a lot of sense since there's only one row per group) or a DataFrame (but then the behavior of map would be complex). When another object is returned, we could return a simple Vector or a Dict. Maybe better throw an error for now?

@bkamins
Copy link
Member Author

bkamins commented Mar 18, 2021

I was most concerned about filter(cols => predicate, ::GroupedDataFrame) but it seems OK (it works on whole columns).

So the conclusion is:

  • deprecate filter for AbstactDataFrame fully
  • do not add iteration protocol for AbstactDataFrame for now
  • keep filter for GroupedDataFrame
  • deprecate map for grouped data frame (to avoid being breaking)

then the only question is what we do with collect on GroupedDataFrame. We can:

  1. leave it as is (producing a vector)
  2. leave it as is with a deprecation
  3. make collect produce a DataFrame collecting the GroupedDataFrame

@pdeffebach
Copy link
Contributor

I think this is fine.

I wouldn't mind bringing back filter eventually, but given unique it would probably be best to have the function act on columns and have it really just be subset.

w.r.t collect on a GroupedDataFrame. I would deprecate it, but provide DataFrame(gd::GroupedDataFrame) for a convenient way to ungroup. (or combine).

@bkamins
Copy link
Member Author

bkamins commented Mar 19, 2021

but provide DataFrame(gd::GroupedDataFrame) for a convenient way to ungroup.

DataFrame(gd::GroupedDataFrame) already does that.

The issue is that combine collect is a Julia Base function and what it does now is just a default behavior so we would have to add a custom implementation (and I am OK to add now just a deprecation falling back to a default implementation).

@nalimilan
Copy link
Member

collect is documented to return an array, so maybe it's OK to keep the default behavior. It's not super useful but a Vector{<:SubDataFrame} makes more sense than a Vector{<:DataFrameRow}. But deprecating it would also be fine if we want to be safe. At least I don't think collect(::GroupedDataFrame) should return a DataFrame, as in the input elements are groups, but in the output elements would be rows.

@bkamins
Copy link
Member Author

bkamins commented Mar 20, 2021

Ah - you are right it is documented to produce an Array, so what we have now is OK.

@bkamins
Copy link
Member Author

bkamins commented Mar 21, 2021

The next difficulty is that we agreed that subset should not allow passing a bare function argument (as it seemed not very useful). However, filter(fun, df) is this kind of syntax. Therefore we have two general options:

  • reconsider leaving filter(fun, df) as it is currently defined (maybe it is not that super bad to keep it without deprecation?)
  • add support to subset(df, predicate) (which makes little practical sense, as then predicate will get a whole df as its input, but it would be needed to maintain consistency)

What do you think?

@pdeffebach
Copy link
Contributor

I don't think subset(df, fun) makes much, sense its not applying to each element of the iterator (a row). But the benefit of not calling it filter is because we don't have to maintain all these consistencies with Base. So I'm fine to drop.

@bkamins
Copy link
Member Author

bkamins commented Mar 21, 2021

I agree this is a hard decision. Let me summarize the filter(fun, df) case and why I am raising it back again considering to keep it.

  1. deprecation: filter(fun, df) would get deprecated to subset(df, x -> fun.(eachrow(x)) which seems quite verbose (and it would require extending the subset syntax as discussed above). As an additional a shorter deprecation would be df[fun.(eachrow(x), :] so maybe this is a way to go (with a shortcoming that filter would get deprecated into two different functions depending on the call signature).
  2. do we think that filter(fun, df) would confuse anyone what it does? I think not - if you asked a random person to guess what it does most likely filtering by-row rows would be the answer
  3. filter(fun, df) was present in DataFrames.jl for a very long time so it is quite breaking to remove it - even if we plan to do it in a long term. And what we discuss here is breaking although we promised not to do breaking things.

@pdeffebach
Copy link
Contributor

To add on to this, we don't have broadcasted ||. So having a by-row version is important. (Though the user can always wrap in ByRow.

@bkamins
Copy link
Member Author

bkamins commented Mar 21, 2021

So - in conclusion, @nalimilan + @pdeffebach - are we OK to keep filter(fun, ::AbstractDataFrame) without deprecation in 1.0 release? (we can always deprecate it later).

@pdeffebach
Copy link
Contributor

If we don't deprecate filter(fun, df) then I think we should also keep filter(source => fun, df). So in connclusion... I guess we should change nothing.

@bkamins
Copy link
Member Author

bkamins commented Mar 21, 2021

So in connclusion... I guess we should change nothing.

I would be fine with this also. As commented above after thinking about this I feel that it is not that hard to learn that filter is special while it is very easy to use this way.

We just had a similar discussion in #2665 that df[row, cols] is special and I also was in favor of not changing things.

@nalimilan
Copy link
Member

OK, let's keep filter as it is. In the worst case we'll deprecate these two methods, but I agree they are not likely to confuse people in practice.

@bkamins bkamins force-pushed the bk/itration_protocol branch from c95d6d5 to e1e8e9e Compare March 21, 2021 22:12
@bkamins
Copy link
Member Author

bkamins commented Mar 21, 2021

OK - so in conclusion we only deprecate map for GroupedDataFrame and fix eltype of eachrow

@bkamins bkamins changed the title add iteration protocol to AbstractDataFrame deprecate map on GroupedDataFrame Mar 21, 2021
@bkamins
Copy link
Member Author

bkamins commented Mar 22, 2021

only coverage fails here

NEWS.md Outdated
Comment on lines 37 to 39
* applying `map` to `GroupedDataFrame` is now deprecated. The return value
of this function might change in 2.0 release
([#2662](https://github.com/JuliaData/DataFrames.jl/pull/2662))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here again, it would be annoying to have to wait for 2.0 if we want to change this. How about backporting it to 0.22?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I will change this PR in a way to allow it to be included in 0.22.6 release.

src/groupeddataframe/groupeddataframe.jl Outdated Show resolved Hide resolved
bkamins and others added 2 commits March 25, 2021 00:06
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins bkamins merged commit 777afbb into main Mar 25, 2021
@bkamins bkamins deleted the bk/itration_protocol branch March 25, 2021 10:26
@bkamins
Copy link
Member Author

bkamins commented Mar 25, 2021

Thank you!

bkamins added a commit that referenced this pull request Mar 25, 2021
@tbenst
Copy link

tbenst commented Sep 4, 2021

Sorry to bump an old thread, but I have a bunch of code from March that is now broken because map no longer works with grouped dataframes. For example,

n_times = map(x->length(unique(x.time)), groupby(df,:experiment))
# seems like this is equivalent?
n_times = combine(x->length(unique(x.time)), groupby(df,:experiment)).x1

The new version is quite arcane (combine? x1?) compared to vanilla map. Perhaps consider bringing it back? Many thanks for their package!

@nalimilan
Copy link
Member

Actually the replacement (as suggested by the deprecation warning) is much simpler: [length(unique(x.time)) for x in groupby(df, :experiment))]. This is why we deprecated it, as it seems an array comprehension isn't longer to type and is more explicit. map could potentially be used for more complex use cases, e.g. it could return a GroupedDataFrame (for most collections, map returns an object of the same type as their input).

@bkamins
Copy link
Member Author

bkamins commented Sep 4, 2021

Agreed. We removed old map to allow for making a decision how map should work in the future without being breaking. Regarding your combine expression you can also write:

combine(groupby(df, :experiment), :time => length∘unique => :len).len```

which will be more efficient than both, your combine, old map or comprehension:

julia> using DataFrames

julia> df = DataFrame(x = rand(1:10^6, 10^7));

julia> gdf = groupby(df, :x);

julia> @time [length(unique(x.x)) for x in gdf];
  2.964561 seconds (21.05 M allocations: 1.221 GiB, 3.71% gc time, 1.20% compilation time)

julia> @time combine(x->length(unique(x.x)), gdf).x1;
  2.856571 seconds (13.17 M allocations: 1016.776 MiB, 3.78% gc time, 4.04% compilation time)

julia> @time combine(gdf, :x => length∘unique => :len).len;
  1.371527 seconds (6.00 M allocations: 602.711 MiB, 4.28% gc time)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

deprecate filter(:x => f, df) Clarify position on iteration API add map for AbstractDataFrame
4 participants