Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleaner syntax #2206

Closed
matthieugomez opened this issue Apr 23, 2020 · 48 comments · Fixed by #2214
Closed

Cleaner syntax #2206

matthieugomez opened this issue Apr 23, 2020 · 48 comments · Fixed by #2214
Labels
breaking The proposed change is breaking. decision
Milestone

Comments

@matthieugomez
Copy link
Contributor

matthieugomez commented Apr 23, 2020

I've followed a bit the recent updates of this package. This is very impressive — thanks @bkamins
and @nalimilan for all your work.

I have one comment about the syntax. An operation such as computing the mean of a variable in a dataframe can be classified along two dimensions (i) whether the new dataframe as the same size as the original dataframe (ii) whether it is a by operation or not.

dplyr and stata make it very easy to alternate across these two dimensions.

in dplyr

mutate(df, mean(x)) mutate(group_by(df, id), mean(x))
summarize(d, mean(x)) summarize(group_by(df, id), mean(x))

In stata:

egen x = mean(x) egen x = mean(x), by(id)
collapse (mean) x collapse (mean) x, by(id)

This is very neat. There is a symmetry between top vs bottom, and left vs right. People can understand what these commands do just by reading them

The current syntax of Dataframes.jl is not as neat IMO. On the current master, we have:

select(df, :, :x => mean) by(df, :id, :, :x => mean)
select(df, :x => mean) by(df, :id, :x => mean)

I wish DataFrames.jl would follow the example of dplyr and stata here. For instance, a nice syntax could be the following:

transform(df, :x => mean) transform(df, :x => mean, by = :id)
aggregate(df, :x => mean) aggregate(df, :x => mean, by = :id)

In short, my suggestion would be to (i) remove the function by (ii) allow by kwarg in transform (ii) define a new function that would be the Julia equivalent of summarize (dplyr)/ collapse (stata).

@bkamins
Copy link
Member

bkamins commented Apr 24, 2020

This is essentially #2172, which will be added later (as it is non-breaking). We leave by mainly for backward compatibility reasons.

As for mutate vs summarize - why do you see that this distinction is needed? If I understand the functionality there correctly we already provide it with a single function (currently select/transform and combine/by).

EDIT

Just to expand transform(groupby(df, :col1), :col2 => mean) will keep the result a GroupedDataFrame (this is also what dplyr does), while by will transform it to DataFrame as it does now. So the API will be minimally different.

@bkamins
Copy link
Member

bkamins commented Apr 24, 2020

So - to summarize - unless you have some comment why we should distinguish "single row" vs "all rows" functionality (we currently do not and dplyr does, but I do not see a huge benefit of this distinction) then this issue can be closed and please comment in #2172 on the functionality select/transform should have for GroupedDataFrame (they key discussion is row order of the output as you will note there).

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Apr 24, 2020

Yes, I think different methods should do single rows vs all rows. The current syntax for the top right quadrant vs bottom right quadrant is not good. It’s very important to separate between the two when doing data wangling.

@matthieugomez
Copy link
Contributor Author

Also, even looking only at the top quadrant. by vs transform is not satisfying. I would rather have:
transform(df::DataFrame, :x => mean) -> Dataframe
transform(df::DataFrame, :x => mean, by = :id) -> Dataframe
transform(df::GroupedDataFrame, :x => mean) -> GroupedDataframe

@bkamins
Copy link
Member

bkamins commented Apr 24, 2020

It’s very important to separate between the two when doing data wangling.

Can you please comment on this more? Why it is so important? My understanding is that it is what the function you apply returns determines the shape of the output.

EDIT in particular this function can neither return 1 row nor that many rows as were in the original group but something completely different.

@bkamins
Copy link
Member

bkamins commented Apr 24, 2020

I would rather have:

This is doable and essentially depends on whether we want to allow by argument in transform and can be added in the future.

@matthieugomez
Copy link
Contributor Author

@bkamins
Copy link
Member

bkamins commented Apr 24, 2020

So essentially you would want to extend keepkeys kwarg functionality in the following way:

  • true: keep groping columns in the original shape (this is what we do not have now and can be easily added)
  • false: drop gruping columns
  • nothing: current behavior and the default (and currently this happens for keepkeys=true but we could change it)

Do I understand your request correctly?

@matthieugomez
Copy link
Contributor Author

I would like by to be deprecated and split into two functions aggregate(..., by) vs tranform(..., by).. I dont think I have anything to say wrt keepkeys.

@bkamins
Copy link
Member

bkamins commented Apr 24, 2020

but by(...) would not be the same as transform(..., by) but rather the same as select(..., by).

Also currently we cannot use aggregate, because we have to go through a deprecation period.

Finally then aggregate should be two functions - one pairing select and another pairing transform. What names would you use for both functionalities?

So that is why I am asking if it is OK with you if we achieve the functionality you describe using kwargs only that would be added to select and transform:

  • modifying how keepkeys works (this should be done before 0.21 release)
  • adding by keyword argument to them and support for passing GroupedDataFrame (this is for the future), also then probably combine kwarg should be added to them (with different default depending on if AbstractDataFrame or GroupedDataFrame is passed)

If we went this way then by and also combine should get deprecated. But adding by kwarg and the consequences probably will not happen in 0.21 release.

@bkamins
Copy link
Member

bkamins commented Apr 24, 2020

just to add - what you write you want is currently available by writing (disregarding some corner cases for a while not to complicate the discussion):

by(df, :key, :key, :some_col => mean)

simply we would make by(df, :key, :some_col => mean, keepkeys=true) to expand to that form.

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Apr 24, 2020

I tried to make my point without talking about whether or not you want to keep all other columns. With this added complication, the full table in dplyr is:

transmute(df, mean(x)) transmute(group_by(df, id), mean(x))
mutate(df, mean(x)) mutate(group_by(df, id), mean(x))
summarize(d, mean(x)) summarize(group_by(df, id), mean(x))

Note that there is no need for summarize that keeps all existing columns (not sure what you mean by "aggregate should be two functions...").

I am suggesting:

select(df, :x => mean) select(df, :x => mean, by:id)
transform(df, :x => mean) transform(df, :x => mean, by = :id)
aggregate(df, :x => mean) aggregate(df, :x => mean, by = :id)

So in short, I would like by to deprecated, and split between select(..., by) and aggregate(.., by) (the name aggregate is just an example, it can be any other name.). As you say, transform can get a by argument later.

@bkamins
Copy link
Member

bkamins commented Apr 24, 2020

I have to stop now, I will comment on it later, but the crucial part is select(df, :x => mean) which currently in DataFrames.jl will be the same as aggregate(df, :x => mean) not transmute(df, mean(x)).

The key thing is that we accept any length of the output (not only 1 row or nrow(df) rows). I will have to think about it and comment.

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Apr 24, 2020

Ok. Yes, sorry, I do not completely understand the current syntax in master. Hopefully, though, I have communicated the kind of syntax I am hoping for.

@bkamins
Copy link
Member

bkamins commented Apr 24, 2020

Yes - I think it is clear. Let me summarize how I now think we can handle all the cases (also from the #2172 discussion). I describe select but transform is analogous:

  • deprecate by and combine (as they will not be needed)
  • select(df::AbstractDataFrame, ...; keeprows=true) => make sure that the returned data frame has number of rows exactly equal to nrow(df)
  • select(df::AbstractDataFrame, ...; keeprows=false) => returned data frame can have any number of rows - determined by the return values ; this is the default
  • select(gdf::GroupedDataFrame, ...; combine=true, keeprows=true) => return a data frame and make sure that each group has exactly the same number of rows and in the same order as in parent(gdf); if keeprows=true then sort argument in groupby is ignored
  • select(gdf::GroupedDataFrame, ...; combine=false, keeprows=true) => return a grouped data frame and make sure that each group has exactly the same number of rows and in the same order as in parent(gdf); if keeprows=true then sort argument in groupby is ignored
  • select(gdf::GroupedDataFrame, ...; combine=true, keeprows=false) => return a data frame with row order determined by the return values of ... arguments and then sort argument in groupby is respected
  • select(gdf::GroupedDataFrame, ...; combine=false, keeprows=false) => return a grouped data frame with row order determined by the return values of ... arguments and then sort argument in groupby is respected; this is the default for GroupedDataFrame option
  • for DataFrame(::GroupedDataFramel; keeprows) also add keeprows keyword argument that will decide what the order of groups should be in the result

@nalimilan - I can quickly implement it if we go for this. I guess all needs of @matthieugomez are satisfied with this design and also the requirement of tracking of row order after producing a data frame. For 0.21 it would be a simple implementation (not the fastest possible), but we can improve performance after 0.21 release.

@pdeffebach
Copy link
Contributor

pdeffebach commented Apr 24, 2020

Thank you for your input on this. I think there are definitely changes to make here.

  • I use bysort(id) egen x = mean(x) in Stata instead of a keyword argument. As a result, I find by very readable and intuitive. I think it reads well, "by this variable, do these operations". I would be sad to see it deprecated.
  • As Bogumil noted, select will automatically collapse if it can. Think of select as having force pushing down on the data frame and it only meets resistance if it sees a column with the full length. As a result, it does what you describe aggregate to be. Consequently, the current implementation of your table looks like this
not available not available
select(df, :, :x => mean) by(df, :id, :, :x => mean)
select(df, :x => mean) by(df, :id, :x => mean)

We do not have an operation to "spread" the mean of a variable across the full data frame, but drop other columns. This is a good lapse and I thank you for pointing it out.

I like by. I think it is readable. Here is my attempt at a totally new syntax that I think would avoid all confusion. Here select does not collapse by default. Rather it "spreads" results to match the input data frame length.

select(df, :x => mean) selectby(df, :id, :x => mean)
transform(df, :x => mean) transformby(df, :id, :, :x => mean)
collapse(df, :x => mean) collapseby(df, :id, :x => mean)
  • As you mentioned, there is a whole other dimension to this when we think about putting in grouped data frames. Below is my attempt at a full list.

  • data frame, keep cols, non-grouped operations, return dataframe with same size:
    *`transform(df, :x => mean)

    • Alternatively select(df, :, :x => mean)
  • data frame, drop cols, non-grouped operations, return dataframe with same size:

    • empty
    • select(df, :x => (t -> fill(mean(t), length(t)))), which is a mouthful.
  • data frame, keep cols, grouped operations, return dataframe with same size:

    • by(df, :, :id, :x => mean)
    • Potentially transform(df, :x => mean; by = :id) (I don't like this because you have to read a lot)
    • Potentially transformby(df, :id, :x => mean) (Would work if we renamed transform to gen)
  • data frame, drop cols, grouped operations, return dataframe with collapsed size:

    • by(df, :id, :x => mean)
    • Potentially aggregate(df, :id, :x => mean)
  • data frame, drop cols, grouped operations, return dataframe with same size:

    • empty (see above)
    • Potentially select(df, :x => mean, by = :id)
    • Potentially selectby(df, :id, :x => mean)
  • grouped data frame, keep cols, grouped operations, return grouped dataframe with same size:

    • transform(gd, :x => mean)
    • Alternatively select(gd, :, :x => mean)
  • grouped data frame, drop cols, grouped operations, return grouped data frame with collapsed size:

    • select(gd, :x => mean)
  • grouped data frame, keep cols, grouped operations, return dataframe with same size:

    • combine(gd, :, :x => mean)
  • grouped data frame, drop cols, grouped operations, return dataframe with collapsed size:

    • combine(gd, :, :x => mean)

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Apr 24, 2020

I don't think I entirely agree with @bkamins's synthesis of my proposal. My original point is that it is important to have two different methods to keep rows or not (instead of different options keeprows = true or false). This makes explicit an important contract with the user: there is no loss of observations. Moreover dataframes remain aligned: the row number 10 in the previous dataframe refers to the same observation as the row number 10 in the next dataframe.

To be honest, I'm not sure what the main difference between transform and select is in your proposal?

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Apr 24, 2020

@pdeffebach I'm agnostic on whether we should have a transmute actually. It can simply be obtained by using transform (i.e. a function that keeps all rows and colums) and then select so it's fine to not have it.

For transformby, I think it's a bit inelegant to change the method name depending on whether the operation should be done by group or not. I don't get why a by kwarg does not work for you. But if people do hate this, I'm happy with just transform(x::GroupedDataFrame, ....), that returns a groupeddataframe, which can be combine'd afterwards, as long as row order remains the same.

@pdeffebach
Copy link
Contributor

My original point is that it is important to have two different methods to keep rows or not (instead of different options keeprows = true or false). This makes explicit an important contract with the user: there is no loss of observations.

I fully agree with this. I think that select shouldn't collapse by default and we should have a separate function name for collapseing. Too many keyword arguments will get complicated. I propose collapse if we don't get sued by Stata for using it.

I think it's a bit inelegant to change the method name depending on whether the operation should be done by group or not. I don't get why a by kwarg does not work for you.

My opposition to keyword arguments is

  1. by is easy to read. transformby(df, :id, ...) reads like a sentence. "by the column :id: transform the data frame...".
  2. These function calls might get very long. I don't want to read
transform(df, 
    :income => mean, 
    :income => std, 
    :educ => first, 
    :startage => first, by = :id)

only to find out the operations are performed by group at the end.

I agree that transformby is hard to type. I wouldn't mind gen being used instead of transform, i.e. gen and genby. But I don't know how much support I would get for my idea.

I'm happy with just transform(x::GroupedDataFrame, ....), that returns a groupeddataframe, and can be combine'd afterwards

I'm happy with this, too. But you have convinced me of the need for collapse and select to be separate, at the very least.

@bkamins
Copy link
Member

bkamins commented Apr 24, 2020

I had a long discussion with @nalimilan about it but also having read your comments I have updated my thinking a bit and have the following conclusion.

We should have the following functions

  • select for data frames => like current select but ensuring that number of rows in the output is nrow of input
  • transform for data frames => like current transform but ensuring that number of rows in the output is nrow of input (now it is not guaranteed)
  • collapse for data frames (I like the name as it is explicit) => this is exactly current select (i.e. it takes everything as is and collapses as much as possible, but in particular it can also produce nrow of input rows if transformations requested produce such a number of rows)
  • select and transform should not work on GrupedDataFrame - the reason is that in Julia GroupedDataFrame is not just a data frame with information about grouping columns, such an object can have dropped groups or reordered them, so the basic contract of select and transform the way you propose them cannot be kept
  • but select and transform can get a by keyword argument (or be named selectby or transformby - this can be discussed which approach is better); in this case no groups are dropped (so we do not provide sort nor skipmissing etc. functionalities) as we want to guarantee that the result still has nrow of input rows; in this case we guarantee that the output is a DataFrame that has the same order and number of rows as the input
    • map, combine and by would work as they are working now (they are versions of collapse but for GroupedDataFrame); why they are needed? - because GroupedDataFrame can be subsetted, reordered etc. - if someone does this then probably one wants to preserve the consequences of these operations in the result; we can discuss the names and in particular if by is really needed (it is just a convenience for groupby+combine combo)
    • DataFrame for grouped data frames should get a keepkeys keyword argument for consistency with combine

So in summary - the major difference from dplyr is that in DataFrames.jl groupby has a slightly different functionality than there and the consequence is that the table you have proposed would look like (assuming by kwarg just to keep one convention):

select(df, :x => mean) select(df, :x => mean, by=:id)
transform(df, :x => mean) transform(df, :x => mean, by=:id)
collapse(df, :x => mean) by(df, :id, :x => mean)

With the only caveat that collapse could produce more than one row depending on what the transformations request. Similarly by.

Then on top of this table we have combine and map on GroupedDataFrame that are an independent functionality serving a different purpose as GroupedDataFrame itself can be transformed as I noted.

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Apr 24, 2020

Well, that would be awesome...

I guess my only question is: will it complicate the package too much? I realize that you and @nalimilan are basically the only maintainers these days, so I get that we want to avoid a multiplication of methods.

On a related note, I guess I don't really understand why GroupedDataFrame is needed then, if select, transform, collapse get a by argument. Would not it dramatically simplify things to remove it or put it in a separate package?

@bkamins
Copy link
Member

bkamins commented Apr 25, 2020

GroupedDataFrame is needed because it also:

  • can be operated on (and we know people often do this, e.g. create a GroupDataFrame, subset it, work on groups using e.g. loop; like a recent discussion here: filter and broadcasting for GroupedDataFrame #2194 which was about the need for such a functionality)
  • in particular it provides a key to a data frame (so that you can quickly look-up groups by key and get a SubDataFrame matching this key) - this is something that is needed very often, e.g. JuliaDB.jl by design allows to set a key but DataFrame does not

The basic building blocks of what is proposed above is something we have now:

  1. current combine (which would stay unaltered)
  2. current select (which would be renamed to collapse)

Now (technical implementation migtht differ a bit):

  1. future select and transform without by (so not grouping) would be simply ensure that all output has nrow(df) rows which is very simple (you just need to check if the first column you produce has this number of rows, and the rest is the same as we already make this check anyway but just allowed any number of rows)
  2. future select and transform with by (so grouping) would just use by function but adding one virtual column axes(df, 1), call it :__dummy__ to a data frame that would contain axes(df, 1) values, then we would add just one operation :__dummy__ (i.e. retain __dummy__) column to by processing. In this way we ensure: a) each group may not change its length, b) after by finishes we use idx field of GrupedDataFrame to recover the original order of rows in linear time

So in summary - the change is relatively small. Fortunately none of this functionality exists in 0.20 so we can do whatever we want and the current design for 0.21 is flexible enough to cover all that you request here relatively simply.

@bkamins
Copy link
Member

bkamins commented Apr 25, 2020

An additional comment based on the discussion with @nalimilan on Slack (this is my understanding of things based on the input from @matthieugomez, @pdeffebach and @nalimilan + my own opinions):

Q1: Why we do do not allow passing GroupedDataFrame to select and transform?

A1: Because select and transform have an invariant that they always return nrow(df) rows and they are in the order of rows in df. The problem is that GroupedDataFrame has sort and skipmissing kwargs that respectively reorder and remove rows from df. Aslo GroupedDataFrame can be subsetted/reordered. Theoretically we could define what consequences reordering and subsetting should have on the result of select taking such a GroupedDataFrame but such rules would be complex, and for sure not easily grasped by the regular user of DataFrames.jl. Also - if we really find it useful in the future adding support for GroupedDataFrame passed to select can be always added as it will be non-breaking.

Q2: Why do we need collapse as a separate function.

A2: since select and transform have a very clear contract "always return nrow(df) rows and they are in the order of rows in df" adding a kwarg to them that would allow to choose if we want all rows or any number of rows would be a mental overload. It is much better to have a separate function that clearly signals that it can change number of rows. Now it is enough to have combine that is similar to select as for transform it does not really make sense to have a "collapsing" behaviour (as we want to keep existing columns which have nrow(df) rows anyway)

Q3: Why do I prefer select and tansform to have by keyword argument rather than define selectby and transformby functions.

A3. This is a tough decision. I was thinking about it and I think that it is better not to pollute the namespace with too many functions. by will be a kwarg, so it should be clear enough what the behaviour is (this is like with select in SQL, and actually better than syntax in dplyr where you have to know what was the type of the input to know how the function will work. Here we will have a clear visual signal of by keyword that we are working in groups).

Q4. Why do we keep combine and map for GroupedDataFrame.

A4. First of all - they are useful. You firs "tweak" your GroupedDataFrame (like subsetting, reordering etc.) and then call them to get a desired result. Also it is better to be backward compatible than be breaking if there is no clear benefit of being breaking. Also map and combine have a bit different API that in particular allows for do-syntax, which again is useful if we are in "collapse" mode.

Q5. Should we keep by or deprecate it and instead define collapse with by keyword argument

A5. After thinking my preference is to deprecate by. The reason is to reduce the name-space pollution, especially by such a short function name. The deprecation would go:

  • call by(arg, df, key, kwargs...) is deprecated to be combine(arg, groupby(df, key, kwargs...), kwargs...)
  • call by(df, key, arg, kwargs...) is deprecated to be combine(groupby(df, key, kwargs...), arg, kwargs...)
  • call by(df, key, args..., kwargs...) is deprecated to be collapse(df, args..., by=key, kwargs...)

In particular this will resolve the problem I have always had that by(df, :x1, :x1, :x2) is a bit hard to parse - you have to remember that the second positional argument is key, but in the context ``by(fun, df, :x1)` actually third positional argument is key.

In summary the table we discuss would be

expected result ungrouped operation grouped operation
retain number and order of rows, drop old columns select(df, :x => mean) select(df, :x => mean, by=:id)
retain number and order of rows, keep old columns transform(df, :x => mean) transform(df, :x => mean, by=:id)
any number of rows, order of rows determined by the way you perform grouping (sort and skipmissing kwargs in particular), drop old columns collapse(df, :x => mean) collapse(df, :x => mean, by=:id) + old combine and map applied to GroupedDataFrame + deprecate by

additionally we support keepkeys kwarg for all functions (determining if grouping columns should be retained or not). This kwarg should be also added to DataFrame working on GroupedDataFrame.

Now the important thing is that these definitions have a very nice feature that left column (ungrouped operation) is exactly right column (grouped operation) when by=[] (which will be the default) as by=[] essentially creates one group containing a whole data frame. I think it is a nice symmetry showing that the design is consistent.

Please up- or down- vote this proposal. I am now personally convinced to go this way so if no opposing voices will be made I will make a PR implementing this (fortunately it is relatively easy) when #2199 is merged so that we can include it in 0.21 release.

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Apr 25, 2020

Well I think it's terrific. Thanks a lot @bkamins.
(also @jmboehm may be interested).

@jmboehm
Copy link

jmboehm commented Apr 25, 2020

I have limited experience with the current interface, but for what it's worth, I think both the request and the proposal make a lot of sense.

  • I would also be in favour of a depreciation of the by function, mainly because it's not clear what it does by itself. All other functions that come to mind here are verbs. Having by as kwarg reads more naturally to me.
  • I would suggest that keepkeys be true by default, at least for those where the contract is that columns are being preserved (but my preference would be to keep them everywhere by default, it's more prudent).
  • It's a more minor thing, but something that I always found relatively unintuitive is that select does more than selecting columns. When I looked at the package for the first time, I was very confused what select(df, :a => :c) means. In my view, the most obvious interpretation when you have only high school math background is that I'm selecting something based on the logical condition "a implies b". It's a more minor thing for me because I can just use select for the things I think it should do, but my concerns would be that newcomers find the syntax hard to read. (Or perhaps I'm just generally not very fond of "=>". What does that mean? "implies", "is mapped to", or an assignment operation?)

I'm not entirely sure how we are meant to perform operations on subsets of the rows, without deleting the other rows. This is one of the issues I've been facing when implementing a Stata-like interface. But perhaps that's an issue for another day.

Thanks a lot to everyone involved!

@pdeffebach
Copy link
Contributor

@bkamins I think this is a great idea. I still maintain a dislike for keyword arguments.

I have implemented, in the crudest way possible, a non-keyword argument version of all this in a PR here. I use Stata-esque names to avoid name conflicts with existing functions. I think everyone in this thread will be able to understand what each function does.

I think that we should deprecate by with the Pairs argument. However I think

by(df, :a) do sdf
   f(sdf)

is still a very powerful and convenient syntax.

For reference, here is my list of new functions implemented in #2210

  • gen: makes new column, keeps old columns, preserves nrow
  • genby: makes new column by group, keeps old columns, preserves nrow
  • keep: makes new column, deletes old columns, preserves nrow
  • keepby: makes new column by group, deletes old columns, preserves nrow
  • collapse: makes new column, deletes old columns, nrow is not preserved (returns 1 row data frame)
  • collapseby: makes new column by group, deletes old columns, nrow is not preserved
  • agggen: takes in grouped data frame, makes new columns, keeps old columns, returns data frame with same nrow
  • aggkeep: same as agggen but deletes old collumns
  • aggcollapse: takes in grouped dataframe, collapses.

@jmboehm Can you file an issue about replicating stata's if syntax? I have some ideas, but we should keep discussion focused on this subject.

@nalimilan
Copy link
Member

Just a few comments regarding @bkamins' detailed proposal. Overall I like it, I just have a few reservations:

  • I don't like the idea of having both combine and collapse, with quite similar behaviors. Keeping the API simple and without ambiguous names is essential. I suggest we use a single function, which could be combine or use another name. The name "collapse" implies you retain a single summary statistic (like in Stata), so I don't really like it for a function which allows returning any number of rows. dplyr's summarize is better I think (you can "summarize" a group of rows into a different group of rows), though probably not ideal (but note that dplyr now allows returning multiple rows). combine has the advantage of not breaking existing code, even if it sounds a bit weird for a single DataFrame. Anyway the name can be discussed after choosing the design.
  • I think select and transform should work on GroupedDataFrame, so that you can write operations in steps (like in dplyr). It's better in general to provide composable objects and functions. We can restrict them to the case where no groups have been skipped and they are not sorted (at least as a first step).
  • While adding by keyword arguments to select and transform sounds convenient, if we allow these operations on GroupedDataFrame they are just syntactic sugar to avoid calling groupby, so I'd be inclined to leave these out for now until we have settled the rest of the API. They can be added at any time, contrary to breaking changes.

@bkamins
Copy link
Member

bkamins commented Apr 25, 2020

@jmboehm thank you for your thoughts. Here are my comments to the questions you have raised.

Or perhaps I'm just generally not very fond of "=>". What does that mean? "implies", "is mapped to", or an assignment operation?

So :a => :b is a shorthand for :a => identity => :b.
And :a => fun => :b means pass column :a to fun and store the result in :b.

how we are meant to perform operations on subsets of the rows, without deleting the other rows

use a view

EDIT - oh - I understand you want to do some operation conditionally on some other column?

something like [:a, :b] => (a, b) -> ifelse.(a .< 0, b, 0)?

@pdeffebach
Copy link
Contributor

@nalimilan thank you for this feedback. I think there is a consensus around this kind of proposal, unless there are truly insurmountable edge cases.

Do you imagine transform(gd::GroupedDataFrame) to return a grouped dataframe. Perhaps a keyword argument to all 3 functions (keep, gen, and collapse for now), which preserves grouping and does not return a data frame.

@pdeffebach
Copy link
Contributor

I have created #2211 for discussion about if syntax and subsetting.

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Apr 25, 2020

I'll just give my two cents about kwargs vs type, but, in any case, I'm happy with either proposal.

On the one hand, I like @nalimilan's proposal because it keeps everything super simple.
On the other hand, I'm worried that GroupedDataFrames is so different from DataFrame than it makes too complicated. If you end up, in a future version of DataFrames, allowing a by kwarg in transform, then the whole syntax becomes more complicated in the end.

Is it worth thinking about defining a GroupedDataFrame type that is much more similar to DataFrame, as in dplyr? The current GroupedDataFrame type could then be renamed to GroupedDataFrameIterator, with a function eachgroup(df::GroupedDataFrame) -> GroupedDataFrameIterator

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Apr 25, 2020

I guess something I don't understand in @nalimilan's proposal is whether the output of transform/select of a GroupedDataFrame is a GroupedDataFrame or a DataFrame.

@bkamins
Copy link
Member

bkamins commented Apr 25, 2020

@nalimilan - thank you for the comment. I will summarize the possible design under these circumstances in the post that follows (including @pdeffebach's comment about returning a DataFrame or a GroupedDataFrame).

Here let me add some more general comments.

@matthieugomez - I think @nalimilan wants to give an option to choose what should be the output. i also comment on this below.

Just as a note to @pdeffebach (rephrased what I commented in #2210). In DataFrames.jl we want to provide a minimal set of functions that provide the required functionality. Therefore if we end up with the design in which:

by(df, :a) do sdf
   f(sdf)
end

will be possible to achieve in a different and easy enough way we will probably remove it. I know that in the past DataFrames.jl did not always follow this rule strictly and everywhere but as we go for 1.0 this is needed:

  • we have to have possibly minimal (but complete) functionality in the package
  • users should be encouraged to add companion packages with these extra functions that make it more convenient to work with the core package

For example in #2211 I assume we will clarify what functionality is needed. If is is easily achievable when composing current functionality then probably we will not add it, but if we decide that it is very hard to achieve it without adding something to the "core" of the package then we will add it.

@bkamins bkamins added breaking The proposed change is breaking. decision labels Apr 25, 2020
@bkamins bkamins added this to the 1.0 milestone Apr 25, 2020
@bkamins
Copy link
Member

bkamins commented Apr 25, 2020

The issue about what GroupedDataFrame is a focal point here.
It is also related to #2106 as this issue should be resolved also (e.g. in theory someone might want to expand a 0-row group into something).

Given the amount of decisions that are to be made I start getting a feeling that we will not be able to resolve all issues in a way in 0.21 release, so that we ill not be breaking later. So maybe we will have to decide on allow for breaking changes between 0.21 release and 1.0 release.

What I think the consensus is is what we need for operating on a DataFrame. And this for sure can "go into" 0.21 release. If we decided to stay with this we would for the time being leave by and combine in 0.21 release as legacy and announce that after 0.21 release a redesign of split-apply-combine infrastructure will be done. But maybe (and hopefully) we will quickly settle on the functionality for split-apply-combine part. Then it can also go into 0.21 release.


So the functionality for data frame seems to require three functions, all of them take a a data frame and return a DataFrame:

  • one retains number and order of rows, drops old columns, a tentative name for it is select (and I assume there is a consensus on this)
  • one retains number and order of rows, keeps old columns, a tentative name for it is transform (and I assume there is a consensus on this)
  • any number of rows, here there is a problem how to call this function collapse, combine, summarize, apply

In this thread of the discussion I think we can concentrate on finding an appropriate name for the third function.


The functionality of the grouped data frame is more complex. The starting point is what @matthieugomez said about adding another type (a la dplyr). The good thing is that currently we do not allow to change GroupedDataFrame in-place (but only to generate a new GroupedDataFrame based on it). If we kept this approach then we could add a second parameter to a GroupedDataFrame that would signal if it in cannonical form (not reordered nor subsetted) or modified in some way (this should be enough for our purposes I think - but maybe I do not see something so please comment). The benefit of such an approach is that we would not redesign everything from scratch.

Below I assume that we will settle on combine name in the "data frame passed" case.

Then following the proposal of @nalimilan:

  • we will have three functions: select, transform and combine with functionalities as defined above (actually combine then can be left as it is implemented now).
  • select, transform (at least for now) will accept only GroupedDataFrame in cannonical form (i.e. not subsetted nor reordered), and combine will accept any GroupedDataFrame.
  • there will be no by keyword argument
  • by function will be also removed
  • all the three functions will have keepkeys keyword argument with the current meaning
  • all the three functions will have keepgrouped keyword argument (name is tentative) which if true means that they return a cannonical GroupedDataFrame backed by a freshly allocated DataFrame and otherwise they return a DataFrame
  • DataFrame will get a keepkeys keyword argument

So essentially the difference in comparison to the earlier proposal is dropping by keyword argument and instead dispatching on type and adding one extra keyword argument that will govern the type of the return value.

@pdeffebach
Copy link
Contributor

I have updated #2210 with the proposal described by Bogumil above, still with the names gen, keep, and collapse. I think it's a very good proposal! I would be very happy to work with it and would trust myself to promote it to new users!

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Apr 25, 2020

  • I don't want to complicate the issue, but at some point, I guess, there will be a ! versions of select, transform, collapse (after 1.0 ;)). Could you expand on how this would work? For instance, could we make sure that the proposal would, in the end, allow to create a new column with the mean of :x within groups defined by :id without having to duplicate the original DataFrame?

  • Tbh, I don't really see an issue with using transform/select on a non-canonical grouped DataFrame. The thing that is important, I think, is that transform/select do not sort or remove rows between the argument and the output.

  • What I was suggesting is to create a type GroupedDataFrame that inherits from AbstractDataFrame. This would retain the composability of @nalimilan's proposal without having to transform DataFrames into things that are not DataFrames when computing means by group. The current behavior of GroupedDataFrame could still be obtained by eachgroup(df::GroupedDataFrame), the same way graphemes(x::String) allows for a different kind of iteration on Strings in Base.

@bkamins
Copy link
Member

bkamins commented Apr 25, 2020

I guess, there will be a ! versions of select, transform, collapse

for a DataFrame argument yes - and this is already implemented.

For GroupeDataFrame this is problematic. Let me give a quick example of an R session:

> df <- data.frame(g=c(1,1,1,2,2),x=1:5)
> gdf <- group_by(df, g)
> summarize(gdf, mean(g), mean(x))
# A tibble: 2 x 3
      g `mean(g)` `mean(x)`
  <dbl>     <dbl>     <dbl>
1     1         1       2  
2     2         2       4.5
> gdf$g[1] <- 3
> summarize(gdf, mean(g), mean(x))
# A tibble: 2 x 3
      g `mean(g)` `mean(x)`
  <dbl>     <dbl>     <dbl>
1     1      1.67       2  
2     2      2          4.5

I do not like this design.

In DataFrames.jl we are clear that GroupedDataFrame is a view and views are by definition assuming that their parent is not mutated. Also note that we have SubDataFrames in DataFrames.jl and GroupedDataFrame could be defined on such object.

So - in short:

  • in the proposal with by argument these ! functions can be provided as we do not expose GroupedDataFrame to the end user.
  • in the proposal by @nalimilan with GroupedDataFrame passed around these functions are problematic to provide. What you would probably do is generate a 1-column data frame and in the second step add this column to the original data frame

I don't really see an issue with using transform/select on a non-canonical grouped DataFrame.

Well take this GroupedDataFrame:

julia> df = DataFrame(g=[3,1,2,3,1,2], x=1:6)
6×2 DataFrame
│ Row │ g     │ x     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 3     │ 1     │
│ 2   │ 1     │ 2     │
│ 3   │ 2     │ 3     │
│ 4   │ 3     │ 4     │
│ 5   │ 1     │ 5     │
│ 6   │ 2     │ 6     │

julia> gdf = groupby(df, :g)
GroupedDataFrame with 3 groups based on key: g
First Group (2 rows): g = 3
│ Row │ g     │ x     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 3     │ 1     │
│ 2   │ 3     │ 4     │
⋮
Last Group (2 rows): g = 2
│ Row │ g     │ x     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 2     │ 3     │
│ 2   │ 2     │ 6     │

julia> gdf2 = gdf[[3,2]]
GroupedDataFrame with 2 groups based on key: g
First Group (2 rows): g = 2
│ Row │ g     │ x     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 2     │ 3     │
│ 2   │ 2     │ 6     │
⋮
Last Group (2 rows): g = 1
│ Row │ g     │ x     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │
│ 2   │ 1     │ 5     │

what should be the result of select(gdf, :x => identity => :x) and select(gdf2, :x => identity => :x)?

What I was suggesting is to create a type GroupedDataFrame that inherits from AbstractDataFrame.

I understand this, but could you please specify exactly how this type should behave exactly (should it be a view or a copy, should it precompute groups on creation and store them or only carry over information about grouping columns but do not store them, should it have a special column type for grouping columns that make them read only, should you be allowed to add rows to it, should you be allowed to add columns to it, should you be allowed to remove columns from it, etc. - probably the list here is longer, but these are probably major questions). Maybe you have answers for these questions and then we can quickly move forward. If not, and if you find the current design (trying to slightly modify current GroupedDataFrame only without designing a new type) insufficient, then what you propose should be discussed but it is probably several months of heavy design work (which is OK, but it means that we go for the option: release 0.21 without significantly modifying split-apply-combine ecosystem, and work on the breaking changes after that release).

@matthieugomez
Copy link
Contributor Author

in the proposal by @nalimilan with GroupedDataFrame passed around these functions are problematic to provide. What you would probably do is generate a 1-column data frame and in the second step add this column to the original data frame

This does not sound good. I think it's a good reason to prefer kwargs.

@bkamins
Copy link
Member

bkamins commented Apr 26, 2020

Well technically such select! could work on a cannonical GroupedDataFrame{DataFrame} but would have to be disallowed for GroupedDataFrame{SubDataFrame}. However, I found it a bit confusing (but now that I think of it maybe this could be allowed with this distinction).

Let me summarize the pros and cons of both proposals:

Using kwarg by:

  • PRO: you always pass around only DataFrame which you are normally sure is a type that is stand-alone and "owns" its columns (you could start with SubDataFrame but this is not a problem I think as simply you will be notified that ! methods do not work for it and after the first step of transformations you have a DataFrame anyway); in particular ! methods can be provided here without a problem
  • PRO: we do not have to change the design, GroupedDataFrame can be left "as is", there is no need to add a third special type
  • PRO: there is not much to be discussed here (it is easy to implement, not much design decisions need to be made - essentially apart from function names all is clear how things should work)
  • CON: you have to remember grouping columns yourself (in the future we will probably allow it to be stored in DataFrame metadata, but this is a separate issue and it will be only a convenience you have to manually decide to use)
  • CON: you have to perform grouping each time you want to perform an operation (with GroupedDataFrame you can perform grouping once, and then do 100 combines without having to pay for grouping); of course this only applies to functions with by kwarg; you will still be able to do this with combine (and just remember that the other functions essentially will be wrappers for combine with some extra constraints)
  • CON: the workflow is less clear (as @nalimilan commented - a single function doing a single operation seems a cleaner design)
  • CON: we need one more function defined; i.e. combine works on GroupedDataFrame as it currently does, but we need some other function name that takes a data frame and by argument (currently this is our by but it should be replaced by a name that matches the same operation without by)
  • CON: relatedly - we will have two "flows" allowed, one "big flow" for DataFrames that is fully featured and one "small" flow for GroupedDataFrames
  • CON: we could add an option of processing cannonical GroupedDataFrame by non mutating functions; so we would have both select(::DataFrame, ..., by=...) and select(::GroupedDataFrame, ...) and select!(::DataFrame, ..., by=...) but no select! for GroupedDataFrame (or only for GroupedDataFrame{DataFrame} as commented above); but it would seem to be a duplication of functionality.

Not using by kwarg, but allowing passing a grouped object to functions:

  • PRO: the flow of working with this design is cleaner
  • PRO: it will be more efficient if you do grouping once and want to perform many separate transformations
  • PRO: less functions need to be defined (essentially three functions)
  • CON: it is natural to request then to have a "grouped data frame" object that is an AbstractDataFrame; I will not repeat the arguments from above - maybe using a GroupedDataFrame is enough if we introduce the distinction between "cannonical" and "non-cannonical" and differentiate the behavior between GroupedDataFrame{DataFrame} and GroupedDataFrame{SubDataFrame}. But the major CON here is that there is a lot of decisions to be made if we go this path

As I have written it down I feel I would prefer option 2 in the long term (as proposed by @nalimilan), but for it to work we need to work out the "grouped data frame" case.

You have not commented on the problems I see with select on "non-cannonical" data frames and the problems I see with the current design of the result of group_by in dplyr. It would be valuable if you gave your opinion for it. But assuming you agree with me and we stay with the current GroupedDataFrame (which is easiest implementation-wise) then we could have:

  • select and transform accept any "cannonical" GroupedDataFrame;
  • select! and transform! accept any "cannonical" GroupedDataFrame{DataFrame};
  • combine (or other name if we decide to change it) accept any GroupedDataFrame;
  • combine! (a companion method to combine mutating the parent) - would not be defined as combine changes number and order of rows so it does not make sense to mutate the parent with the output;

then select!, transform! methods would mutate the parent of the GroupedDataFrame (which would automatically be reflected in GroupedDataFrame as it always takes all its columns).

@bkamins
Copy link
Member

bkamins commented Apr 26, 2020

I was thinking about it today some more and my conclusion is the following:

  1. we should split "core" of DataFrames.jl (low level API) into a separate package (as commented in move DataFrame type definition to lightweight package #1764). The reason is that things we discuss here are convenience functionalities and there are many possible competing (and compelling) frameworks for it. In particular we reserve names like select or transform which should be free to use by extension packages.
  2. Still in DataFrames.jl we should provide a reference implementation of high-level API.

Here I write what conclusion I have for this high level API currently (but bear in mind that given what I have just written actually I think it is encouraged to design other high-level APIs that could be built on top of the "core" low level API):

We should provide 5 functions (in the scope of this issue):

  • select, select!, transform, transform! and combine.
  • all of them can be passed an AbstractDataFrame or a GroupedDataFrame
  • GroupedDataFrame can be in "cannonical" state and "non-cannonical" state; for now only combine would allow processing GroupedDataFrame in non-cannonical state (maybe in the future we can allow other functions to allow non-cannonical GroupedDataFrame but this can be decided later - even if we do I do not think it is crucially useful)

The signatures would be:

  • select(::AbstractDataFrame, args...; copycols): ensure we keep all rows, drop old columns; create a new DataFrame
  • select!(::AbstractDataFrame, args...): the same as above but modify source data frame
  • transform(::AbstractDataFrame, args...; copycols): ensure we keep all rows, keep old columns; create a new DataFrame
  • transform!(::AbstractDataFrame, args...): the same as above but modify source data frame
  • combine(::AbstractDataFrame, args...): any transformations, always copy cols, drop old columns; create a new DataFrame
  • combine(arg, ::AbstractDataFrame): the same but allow arg to be a function also (for consistency with GroupedDataFrame version)
  • select(::GroupedDataFrame, args...; copycols, keepkeys, regroup): ensure we keep all rows in order, drop old columns; accept only cannonical GroupedDataFrame
  • select!(::GroupedDataFrame{DataFrame}, args..., keepkeys, regroup): the same as above but modify parent DataFrame
  • transform(::GroupedDataFrame, args...; copycols, keepkeys, regroup): ensure we keep all rows in order, keep old columns; accept only cannonical GroupedDataFrame
  • transform!(::GroupedDataFrame{DataFrame}, args..., keepkeys, regroup): the same as above but modify parent DataFrame
  • combine(::GroupedDataFrame, args..., keepkeys, regroup): current combine, except that if one args is passed we are not flexible (we thought it is OK to be flexible and allow the same as in combine(arg, ::GroupedDataFrame, keepkeys, regroup) but it would be insonsistent with select and transform so we disallow it - still the functionality is available with arg in first position)
  • combine(arg, ::GroupedDataFrame, keepkeys, regroup): current combine

regroup can be true in which case a cannonical GroupedDataFrame is returned, if false (the default) a DataFrame is returned. We allow for this kwarg because it is more efficient to regroupimmediately (we know how to group without having to compute grouping again). If regroup=true we throw an error if keepkeys=false (it does not make much sense otherwise)

Other kwargs have current meaning.

Note that for select! and transform! for GroupedDataFrame it does not matter much what we set for regroup as both parent DataFrame and GroupedDataFrame passed will be modified.

Now why I opt for combine name for the last operation. First - it is non breaking. Second, we will read it as "combine rows", which makes some sense (none of the functions we considered were ideal).

Given this design both by and map for GroupedDataFrame would be deprecated (by duplicates functionality that is easy to get otherwise, map will be just combine with regroup=true). In this way we are free to decide what would be the use of map.

Finally DataFrame on GroupedDataFrame would get keepkeys kwarg.

Sorry that the posts are lengthly, but we are desinging a complex ecosystem and details matter; I hope I have not mixed up something in the descriptions. I hope this design is something you find acceptable and useful. If yes. After #2199 I would go forward to implement it.

Also as I have noted above after 0.21 I would go forward to decouple all this from the low-level API (that would be moved to DataFramesBase.jl) to allow other high-level APIs to be implemented (this one would be just a reference implementation - still with the aim to be useful).

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Apr 26, 2020

Honestly I don’t know enough about how people use groupeddataframes to bring substantive points to this discussion (kwarg vs gdf). That being said, the final proposal looks good to me. Thanks a lot — especially for considering making transform! work with grouped data frames.

@pdeffebach
Copy link
Contributor

+1 to all of this. I can't comment on the contract about mutating the parent of a grouped data frame, but if that's surmountable it would be great. I went over your R example and also find the behavior unintuitive. Perhaps we can disallow modifying a key column.

Thank you everyone for their detailed proposals and thoughtful comments.

@bkamins
Copy link
Member

bkamins commented Apr 27, 2020

the contract about mutating the parent of a grouped data frame

It is "surmonutable" as you say for "cannonical" GroupedDataFrame as it is 1 to 1 mapping to the parent data frame (no rows are removed/reordered) and fortunately GroupedDataFrame does not subset columns. Finally select! and transform! guarantee not to reorder/subset rows. This means that if we modify the parent the GroupedDataFrame will still be valid (this prompts me that probably keepkeys should be disallowed in select! and transform! and it should be always true (in other words - if you want to do a ! operation on GroupedDataFrame you are not allowed to remove grouping columns from the parent), as otherwise derived GroupedDataFrame would be invalidated - I guess it is not a problem and is an intuitive case).

Taken all this into consideration the only risk is that some other GroupedDataFrames would be backed by the same parent but with different grouping columns - and that other GroupedDataFrame would get invalidated, but I think we can be explicit enough in the documentation to warn users about this case.

Note that this is a different situation than the one we discuss in #2211. The problem is that if you add a column to a SubDataFrame with some name, this column potentially can exist in the parent data frame so there would be a conflict (still this also is fixable but first I would like to understand in #2211 if this is really needed to be added).

The general thinking is that view should not modify parent (if possible) as there are potentially other views based on the same parent that might get invalidated and it is easy to forget about it (but as I have said - for GroupedDataFrame I think we can allow this as this is a very specific use case - as opposed to setindex! in #2211 which is a very fundamental operation).

Perhaps we can disallow modifying a key column.

In the proposed design the problem that dplyr has does not exist (we explicitly check if grouping column has remained unchanged in combine which will be still a workhorse of the whole solution). We would have this problem if we created a new type DataFrameWithGroups (call it tentatively) that would be an AbstractDataFrame but with information about grouping columns. If we went this way (we currently do not, but some "extra" package is free to define it) then grouping columns should probably be stored as https://github.com/bkamins/ReadOnlyArrays.jl. But this design has some problems as when we make an array read only we lose its type information and Julia does not allow multiple inheritance currently and most of array types are not trait based, chiefly CategoricalArray and PooledArray would be a problem and they are important for performance reasons. Again - these cases could be worked around, but I did not want to overly complicate things on top of the current design.

Thank you all for discussing this.

@nalimilan
Copy link
Member

Sounds like a good plan! I think discussions about special cases can be handled later, like what to do with "non-canonical" GroupedDataFrame (in which groups have been dropped or reordered).

Regarding the idea of having GroupedDataFrame <: AbstractDataFrame, that was my goal originall. But I asked Hadley Wickham and he said he would have liked to change this in dplyr so that group_by returns an object which is not a data.frame. Then I realized that we don't really need GroupedDataFrame to behave like a DataFrame for most basic operations: for example, it's not very useful to have gdf.col return its parent's column since that doesn't help you to perform by-group operations. OTOH what is definitely useful is to have select and transform work like for DataFrame, but operating by groups (what we are discussing here).

@matthieugomez Do you see something in particular that having GroupedDataFrame <: AbstractDataFrame would allow that wouldn't be possible otherwise?

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Apr 27, 2020

@nalimilan I'm not sure. My initial reaction is that I find it confusing to do gdf=groupby(df), see something in the REPL that looks completely scrambled, but then have tranform!(gdf) returns the original df with an added column. But maybe I'm wrong and it is not that confusing.

@bkamins
Copy link
Member

bkamins commented Apr 27, 2020

As a side note groupby(df) is not allowed.

Now - the issue you report is printing related. The original thinking (not mine - it was implemented long before I started working on DataFrames.jl) was that it is more useful to show groups in GroupedDataFrame rather than one table just with e.g. information about grouping columns and number of groups. But we can change it. If you have some better proposal for show please open a separate issue as it is only output related.

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Apr 27, 2020

@bkamins My point about printing was really not intended to be a proposal, just an answer to @nalimilan's question. I am really not knowledgeable enough to have a well formed proposal about GroupedDataFrames.

That being said, if we are all still hesitant about how to work with GroupedDataFrames, and you want DataFrames to be 1.0 soon, maybe you could consider separating the type GroupedDataFrame in a different package (together with its methods for select/transform/combine).

@bkamins
Copy link
Member

bkamins commented Apr 27, 2020

maybe you could consider separating the type GroupedDataFrame in a different package

This is exactly the plan, these functions would not go to DataFramesBase.jl.

if we are all still hesitant

Well - I am not hesitant, though I understand that different users might find different things useful. Note however, that there is little value added of making GroupedDataFrame a subtype of AbstractDataFrame (you can always use its parent to do whatever you like). While the key benefit of GroupedDataFrame is a fast lookup of groups with a convenient interface for it:

julia> df = DataFrame(a=repeat(1:3, 4), b=repeat(1:2, 6), c = 1:12)
12×3 DataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 1     │
│ 2   │ 2     │ 2     │ 2     │
│ 3   │ 3     │ 1     │ 3     │
│ 4   │ 1     │ 2     │ 4     │
│ 5   │ 2     │ 1     │ 5     │
│ 6   │ 3     │ 2     │ 6     │
│ 7   │ 1     │ 1     │ 7     │
│ 8   │ 2     │ 2     │ 8     │
│ 9   │ 3     │ 1     │ 9     │
│ 10  │ 1     │ 2     │ 10    │
│ 11  │ 2     │ 1     │ 11    │
│ 12  │ 3     │ 2     │ 12    │

julia> gdf = groupby(df, [:a, :b])
GroupedDataFrame with 6 groups based on keys: a, b
First Group (2 rows): a = 1, b = 1
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 1     │
│ 2   │ 1     │ 1     │ 7     │
⋮
Last Group (2 rows): a = 3, b = 2
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 3     │ 2     │ 6     │
│ 2   │ 3     │ 2     │ 12    │

julia> gdf[(a=2, b=1)]
2×3 SubDataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 2     │ 1     │ 5     │
│ 2   │ 2     │ 1     │ 11    │

If we wanted GroupedDataFrame a subtype of AbstractDataFrame this interface would be problematic and a fast lookup by key is a functionality that is very useful in many contexts so it should be easily accessible.

As a note gdf[(a=2, b=1)] is not only convenient (as this is subjective) but it is also very fast (as fast as Dict lookup in Base). To be honest - I am not really clear how you can achieve it quickly (i.e. in a way that is fast) in dplyr - if you know and could comment on it it would be an interesting comparison.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking The proposed change is breaking. decision
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants