deprecate map on GroupedDataFrame #2662

bkamins · 2021-03-14T20:38:15Z

I will comment on the design decisions in the code.

bkamins · 2021-03-14T20:42:07Z

test/data.jl

-                [:x1] => x -> x > 0, ["x1"] => x -> x > 0,
-                r"1" => x -> x > 0, AsTable(:) => x -> x.x1 > 0)
-        @test filter(fun, df) isa DataFrame
-        @inferred filter(fun, df)


Strangely removing methods for filter but not touching this specific method made @inferred fail. CC @timholy as maybe you can explain this?

src/abstractdataframe/iteration.jl

bkamins · 2021-03-14T20:44:35Z

test/iteration.jl

+    gdfv = groupby(dfv, :a)
+
+    for x in (df, dfv)
+        @test collect(x) == map(identity, x) == [v for v in x] == [x[i, :] for i in 1:3]


For now I did not implement custom map and collect and thus currently the following identity holds. Do we really think it is a problem?

note that e.g. map can take several iterables like map(fun, df1, df2) and the question is if we would want to intercept this also?

I'd take a defensive approach and make everything for which we don't have a clear use case an error for now. People may expect collect to materialize a data frame in some way, like in Query.jl or dplyr, and be confused if they get a weird vector of rows. Likewise, we could imagine doing something useful in the future with a multi-argument map.

Regarding map(identity, df), as I said at #2254, I think it should return a DataFrame, just like all cases where the function returns a row-like object.

bkamins · 2021-03-14T20:45:11Z

test/iteration.jl

+    end
+
+    for x in (gdf, gdfv)
+        @test collect(x) == map(identity, x) == [v for v in x] == [x[i] for i in 1:3]


This is how things worked (and still work) for GroupedDataFrame.

nalimilan · 2021-03-16T15:27:28Z

src/abstractdataframe/abstractdataframe.jl

+
+# Iteration protocol
+
+function Base.iterate(df::AbstractDataFrame, i=1)


Suggested change

function Base.iterate(df::AbstractDataFrame, i=1)

function Base.iterate(df::AbstractDataFrame, i::Int=1)

src/abstractdataframe/abstractdataframe.jl

src/abstractdataframe/iteration.jl

nalimilan · 2021-03-16T17:51:19Z

test/iteration.jl

+    gdfv = groupby(dfv, :a)
+
+    for x in (df, dfv)
+        @test collect(x) == map(identity, x) == [v for v in x] == [x[i, :] for i in 1:3]


I'd take a defensive approach and make everything for which we don't have a clear use case an error for now. People may expect collect to materialize a data frame in some way, like in Query.jl or dplyr, and be confused if they get a weird vector of rows. Likewise, we could imagine doing something useful in the future with a multi-argument map.

Regarding map(identity, df), as I said at #2254, I think it should return a DataFrame, just like all cases where the function returns a row-like object.

bkamins · 2021-03-17T19:33:57Z

Actually given the comment by @pdeffebach in #2654 (comment) (which I read as "the comment of experienced user, but not hard-core data scientist"), I am hesitant to implement this PR at all.
Maybe it is better to leave filter(:x => predicate, df) even if it is not consistent, but it will be easier to understand by non-experts.

Also not having iterate or map or collect is no problem at all for normal users.

The only thing that for sure needs fixing is eltype of DataFrameRows. Here is an explicit example of the bug we have now:

julia> using DataFrames

julia> df = view(DataFrame(a=1), :, :)
1×1 SubDataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1

julia> dfr = eachrow(df)
1×1 DataFrameRows
 Row │ a
     │ Int64
─────┼───────
   1 │     1

julia> eltype(dfr)
DataFrameRow{SubDataFrame{DataFrame, DataFrames.Index, Base.OneTo{Int64}}, DataFrames.Index}

julia> typeof(dfr[1])
DataFrameRow{DataFrame, DataFrames.Index}

bkamins · 2021-03-17T19:40:33Z

I also prefer what @pdeffebach suggested as we start to loose the grip on "non-breaking" promise (I know we do not want to break anything in this PR but it is almost breaking for a very common use case 😄).

nalimilan · 2021-03-18T17:41:56Z

I think it's OK if we don't implement iterate, map and collect right now. What matters is that we reserve this in case we want to define them in 1.x without breaking anything.

Regarding filter, it's true that deprecating it will hit many users as it's currently the recommended solution. But deprecations are disabled by default so it doesn't have a large impact. I guess it only matters to deprecate it before 1.0 if we consider that it would allow us to remove it before 2.0 if we wanted to. Otherwise, we could deprecate it later if we eventually decide that it's inconsistent with the rest of the API. I don't have a strong preference.

bkamins · 2021-03-18T17:47:41Z

@pdeffebach - can you please add a "voice of the user" and "downstream package maintainer" regarding filter? Thank you!

Given the discussion I would not implement iterate et al. for AbstractDataFrame right now. Still there is a question then if we want to add custom behavior for map and collect for GroupedDataFrame (as they currently use the generic methods)

pdeffebach · 2021-03-18T18:20:04Z

I am fine to deprecate filter.

I prefer consistency with transform and select, and thus use subset rather than filter. If we don't implement iteration, then I think filter makes less sense. But by that logic we would also have to deprecate first etc. I don't think it's worth it right now to expunge all "DataFrame as collection of rows" functions.

It's worth noting that unique now also operates on columns. So filter is even more the odd one out after #2652 .

I don't use it in DataFramesMeta, either, and don't have plans for a @filter macro.

Regarding map. Maybe an initial first step is to allow map(source => fun, eachrow(df)). We could make this very performant, I think, by doing something similar to @eachrow in DataFramesMeta.

I'm not sure about iterate, but we could make it easier to expose Tables.namedtupleiterator.

To reply form another thread here

Maybe in the not-too-distant future we will be able to fix this using a trick similar to what @andyferris mentioned recently (I lost the link): if we can inspect the function's code to detect which variables are used, we can pass it a small NamedTuple instead of a DataFrameRow and benefit from type stability.

I don't think this is a good idea. Since I don't think we have seen any examples yet in the wild of someone doing this kind of inspection and having things be flexible and not have too many edge cases.

bkamins · 2021-03-18T18:26:04Z

So - the decision would be to deprecate filter fully for AbstractDataFrame to have a flexibility post-1.0 release to decide what to do. Are we in agreement here?

Then again we go back to the decision what to do with filter for GroupedDataFrame.

nalimilan · 2021-03-18T18:39:28Z

unique doesn't operate on columns: it returns unique rows. It simply allows considering only some columns for the comparison. We have lots of row-oriented functions like that, so I don't think filter is really exceptional. But as I said I'm not opposed to deprecating it.

GroupedDataFrames are the most difficult part. For filter I think it's easy, as a GroupedDataFrame should always be returned. For map, what's tricky is that different return types are possible. When a table object is returned, it makes sense to return a GroupedDataFrame. When a row object is returned, it's less obvious as we could return either a GroupedDataFrame (but that doesn't make a lot of sense since there's only one row per group) or a DataFrame (but then the behavior of map would be complex). When another object is returned, we could return a simple Vector or a Dict. Maybe better throw an error for now?

bkamins · 2021-03-18T19:13:23Z

I was most concerned about filter(cols => predicate, ::GroupedDataFrame) but it seems OK (it works on whole columns).

So the conclusion is:

deprecate filter for AbstactDataFrame fully
do not add iteration protocol for AbstactDataFrame for now
keep filter for GroupedDataFrame
deprecate map for grouped data frame (to avoid being breaking)

then the only question is what we do with collect on GroupedDataFrame. We can:

leave it as is (producing a vector)
leave it as is with a deprecation
make collect produce a DataFrame collecting the GroupedDataFrame

pdeffebach · 2021-03-18T22:43:27Z

I think this is fine.

I wouldn't mind bringing back filter eventually, but given unique it would probably be best to have the function act on columns and have it really just be subset.

w.r.t collect on a GroupedDataFrame. I would deprecate it, but provide DataFrame(gd::GroupedDataFrame) for a convenient way to ungroup. (or combine).

bkamins · 2021-03-19T06:27:08Z

but provide DataFrame(gd::GroupedDataFrame) for a convenient way to ungroup.

DataFrame(gd::GroupedDataFrame) already does that.

The issue is that ~~combine~~ collect is a Julia Base function and what it does now is just a default behavior so we would have to add a custom implementation (and I am OK to add now just a deprecation falling back to a default implementation).

nalimilan · 2021-03-20T14:10:40Z

collect is documented to return an array, so maybe it's OK to keep the default behavior. It's not super useful but a Vector{<:SubDataFrame} makes more sense than a Vector{<:DataFrameRow}. But deprecating it would also be fine if we want to be safe. At least I don't think collect(::GroupedDataFrame) should return a DataFrame, as in the input elements are groups, but in the output elements would be rows.

bkamins · 2021-03-20T14:46:37Z

Ah - you are right it is documented to produce an Array, so what we have now is OK.

bkamins · 2021-03-21T00:04:50Z

The next difficulty is that we agreed that subset should not allow passing a bare function argument (as it seemed not very useful). However, filter(fun, df) is this kind of syntax. Therefore we have two general options:

reconsider leaving filter(fun, df) as it is currently defined (maybe it is not that super bad to keep it without deprecation?)
add support to subset(df, predicate) (which makes little practical sense, as then predicate will get a whole df as its input, but it would be needed to maintain consistency)

What do you think?

pdeffebach · 2021-03-21T02:01:22Z

I don't think subset(df, fun) makes much, sense its not applying to each element of the iterator (a row). But the benefit of not calling it filter is because we don't have to maintain all these consistencies with Base. So I'm fine to drop.

bkamins · 2021-03-21T08:02:09Z

I agree this is a hard decision. Let me summarize the filter(fun, df) case and why I am raising it back again considering to keep it.

deprecation: filter(fun, df) would get deprecated to subset(df, x -> fun.(eachrow(x)) which seems quite verbose (and it would require extending the subset syntax as discussed above). As an additional a shorter deprecation would be df[fun.(eachrow(x), :] so maybe this is a way to go (with a shortcoming that filter would get deprecated into two different functions depending on the call signature).
do we think that filter(fun, df) would confuse anyone what it does? I think not - if you asked a random person to guess what it does most likely filtering by-row rows would be the answer
filter(fun, df) was present in DataFrames.jl for a very long time so it is quite breaking to remove it - even if we plan to do it in a long term. And what we discuss here is breaking although we promised not to do breaking things.

pdeffebach · 2021-03-21T16:58:10Z

To add on to this, we don't have broadcasted ||. So having a by-row version is important. (Though the user can always wrap in ByRow.

bkamins · 2021-03-21T17:21:19Z

So - in conclusion, @nalimilan + @pdeffebach - are we OK to keep filter(fun, ::AbstractDataFrame) without deprecation in 1.0 release? (we can always deprecate it later).

pdeffebach · 2021-03-21T17:25:16Z

If we don't deprecate filter(fun, df) then I think we should also keep filter(source => fun, df). So in connclusion... I guess we should change nothing.

bkamins · 2021-03-21T17:31:03Z

So in connclusion... I guess we should change nothing.

I would be fine with this also. As commented above after thinking about this I feel that it is not that hard to learn that filter is special while it is very easy to use this way.

We just had a similar discussion in #2665 that df[row, cols] is special and I also was in favor of not changing things.

nalimilan · 2021-03-21T21:04:58Z

OK, let's keep filter as it is. In the worst case we'll deprecate these two methods, but I agree they are not likely to confuse people in practice.

bkamins · 2021-03-21T22:13:34Z

OK - so in conclusion we only deprecate map for GroupedDataFrame and fix eltype of eachrow

bkamins · 2021-03-22T17:53:13Z

only coverage fails here

nalimilan · 2021-03-24T20:17:25Z

NEWS.md

+* applying `map` to `GroupedDataFrame` is now deprecated. The return value
+  of this function might change in 2.0 release
+  ([#2662](https://github.com/JuliaData/DataFrames.jl/pull/2662))


Here again, it would be annoying to have to wait for 2.0 if we want to change this. How about backporting it to 0.22?

OK. I will change this PR in a way to allow it to be included in 0.22.6 release.

src/groupeddataframe/groupeddataframe.jl

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins · 2021-03-25T10:26:53Z

Thank you!

tbenst · 2021-09-04T09:18:02Z

Sorry to bump an old thread, but I have a bunch of code from March that is now broken because map no longer works with grouped dataframes. For example,

n_times = map(x->length(unique(x.time)), groupby(df,:experiment))
# seems like this is equivalent?
n_times = combine(x->length(unique(x.time)), groupby(df,:experiment)).x1

The new version is quite arcane (combine? x1?) compared to vanilla map. Perhaps consider bringing it back? Many thanks for their package!

nalimilan · 2021-09-04T10:55:52Z

Actually the replacement (as suggested by the deprecation warning) is much simpler: [length(unique(x.time)) for x in groupby(df, :experiment))]. This is why we deprecated it, as it seems an array comprehension isn't longer to type and is more explicit. map could potentially be used for more complex use cases, e.g. it could return a GroupedDataFrame (for most collections, map returns an object of the same type as their input).

bkamins · 2021-09-04T11:54:30Z

Agreed. We removed old map to allow for making a decision how map should work in the future without being breaking. Regarding your combine expression you can also write:

combine(groupby(df, :experiment), :time => length∘unique => :len).len```

which will be more efficient than both, your combine, old map or comprehension:

julia> using DataFrames

julia> df = DataFrame(x = rand(1:10^6, 10^7));

julia> gdf = groupby(df, :x);

julia> @time [length(unique(x.x)) for x in gdf];
  2.964561 seconds (21.05 M allocations: 1.221 GiB, 3.71% gc time, 1.20% compilation time)

julia> @time combine(x->length(unique(x.x)), gdf).x1;
  2.856571 seconds (13.17 M allocations: 1016.776 MiB, 3.78% gc time, 4.04% compilation time)

julia> @time combine(gdf, :x => length∘unique => :len).len;
  1.371527 seconds (6.00 M allocations: 602.711 MiB, 4.28% gc time)

This was linked to issues Mar 14, 2021

deprecate filter(:x => f, df) #2654

Closed

add map for AbstractDataFrame #2053

Closed

bkamins mentioned this pull request Mar 14, 2021

Clarify position on iteration API #2254

Closed

bkamins commented Mar 14, 2021

View reviewed changes

src/abstractdataframe/iteration.jl Show resolved Hide resolved

bkamins commented Mar 14, 2021

View reviewed changes

bkamins mentioned this pull request Mar 14, 2021

deprecate filter(:x => f, df) #2654

Closed

nalimilan reviewed Mar 16, 2021

View reviewed changes

deprecate map on GroupedDataFrame and fix eachrow eltype

e1e8e9e

bkamins force-pushed the bk/itration_protocol branch from c95d6d5 to e1e8e9e Compare March 21, 2021 22:12

bkamins changed the title ~~add iteration protocol to AbstractDataFrame~~ deprecate map on GroupedDataFrame Mar 21, 2021

bkamins added 3 commits March 22, 2021 00:40

fix precompile

fc87de2

fix precompilation

047fce3

Merge branch 'main' into bk/itration_protocol

dc322c3

nalimilan reviewed Mar 24, 2021

View reviewed changes

bkamins and others added 2 commits March 25, 2021 00:06

Update src/groupeddataframe/groupeddataframe.jl

02ebd08

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

change PR for 0.22.6 release

dcd8432

nalimilan approved these changes Mar 25, 2021

View reviewed changes

Merge branch 'main' into bk/itration_protocol

3a9086a

bkamins merged commit 777afbb into main Mar 25, 2021

bkamins deleted the bk/itration_protocol branch March 25, 2021 10:26

bkamins added a commit that referenced this pull request Mar 25, 2021

deprecate map on GroupedDataFrame (#2662)

059248f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

deprecate map on GroupedDataFrame #2662

deprecate map on GroupedDataFrame #2662

bkamins commented Mar 14, 2021

bkamins Mar 14, 2021

bkamins Mar 14, 2021

nalimilan Mar 16, 2021

bkamins Mar 14, 2021

nalimilan Mar 16, 2021

nalimilan Mar 16, 2021

bkamins commented Mar 17, 2021

bkamins commented Mar 17, 2021

nalimilan commented Mar 18, 2021

bkamins commented Mar 18, 2021

pdeffebach commented Mar 18, 2021

bkamins commented Mar 18, 2021

nalimilan commented Mar 18, 2021

bkamins commented Mar 18, 2021

pdeffebach commented Mar 18, 2021

bkamins commented Mar 19, 2021 •

edited

Loading

nalimilan commented Mar 20, 2021

bkamins commented Mar 20, 2021

bkamins commented Mar 21, 2021

pdeffebach commented Mar 21, 2021

bkamins commented Mar 21, 2021

pdeffebach commented Mar 21, 2021

bkamins commented Mar 21, 2021

pdeffebach commented Mar 21, 2021

bkamins commented Mar 21, 2021

nalimilan commented Mar 21, 2021

bkamins commented Mar 21, 2021

bkamins commented Mar 22, 2021

nalimilan Mar 24, 2021

bkamins Mar 24, 2021

bkamins commented Mar 25, 2021

tbenst commented Sep 4, 2021 •

edited

Loading

nalimilan commented Sep 4, 2021

bkamins commented Sep 4, 2021


		# Iteration protocol

		function Base.iterate(df::AbstractDataFrame, i=1)

deprecate map on GroupedDataFrame #2662

deprecate map on GroupedDataFrame #2662

Conversation

bkamins commented Mar 14, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bkamins commented Mar 17, 2021

bkamins commented Mar 17, 2021

nalimilan commented Mar 18, 2021

bkamins commented Mar 18, 2021

pdeffebach commented Mar 18, 2021

bkamins commented Mar 18, 2021

nalimilan commented Mar 18, 2021

bkamins commented Mar 18, 2021

pdeffebach commented Mar 18, 2021

bkamins commented Mar 19, 2021 • edited Loading

nalimilan commented Mar 20, 2021

bkamins commented Mar 20, 2021

bkamins commented Mar 21, 2021

pdeffebach commented Mar 21, 2021

bkamins commented Mar 21, 2021

pdeffebach commented Mar 21, 2021

bkamins commented Mar 21, 2021

pdeffebach commented Mar 21, 2021

bkamins commented Mar 21, 2021

nalimilan commented Mar 21, 2021

bkamins commented Mar 21, 2021

bkamins commented Mar 22, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bkamins commented Mar 25, 2021

tbenst commented Sep 4, 2021 • edited Loading

nalimilan commented Sep 4, 2021

bkamins commented Sep 4, 2021

bkamins commented Mar 19, 2021 •

edited

Loading

tbenst commented Sep 4, 2021 •

edited

Loading