Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Grouping API consistency and improvements #1256

Closed
nalimilan opened this issue Oct 13, 2017 · 36 comments
Closed

Grouping API consistency and improvements #1256

nalimilan opened this issue Oct 13, 2017 · 36 comments
Labels
Milestone

Comments

@nalimilan
Copy link
Member

Our grouping API is currently lacking, as it does not allow performing most common operations in the most efficient way (see this discussion and this one), and because it is not very user-friendly for all cases.

  • groupby is the essential building block which returns a GroupedDataFrame object, i.e. a per-group view on the original DataFrame. Calling map on the resulting object gives a GroupApplied object, which can then be turned into a standard DataFrame using combine.
  • by is a convenience wrapper around these operations. It supports any function taking a DataFrame and returning either a single value, a vector or a DataFrame, and combines the result into a DataFrame, with as many rows per group as the function returned. It is very flexible, but also cumbersome (since you need to write df[:col] inside the function). It is also quite slow since we can make almost no assumptions about the kind of data which is going to be returned, and inference cannot really help since the input type is always DataFrame and does not include any information on the column types.
  • aggregate is a more specific function which applies an operations independently on each column. It is also quite flexible since the function can return either a single value or a vector (in which case one row is created for each entry). This makes it slower than it could be since knowing that the function returns a single value would allow for optimizations (allocating in advance and looping efficiently over each column). However, here inference could be used since the function is applied separately to each function. See Modify aggregate for efficiency and decide its future #1246.

Comparing with other software, our API is quite similar to Pandas, which provides three different functions (see also this summary):

  • groupby is similar to our function. However, the result allows accessing its "columns" via indexing, and calling a reduction function like mean on them computes it for each group.
  • apply for general transformations. This is similar to our by.
  • transform returns a data frame of the same size as the original. The passed function is applied independently to each column, and must return either a vector of the same length as the original for each column, or a single value which is recycled/broadcasted. This is the equivalent of our aggregate except that the result is guaranteed to have as many rows as the input (while in ours the number of rows is arbitrary). In terms of performance, this restriction allows allocating the final data frame in advance since the number of rows is known, avoiding the need to store copies for all groups before concatenating them.
  • aggregate returns a data frame with one row per group. It's a more restrictive variant of transform that requires the passed function to return a single value. Our aggregate is a bit more general since it also accepts vector returns, which can come with a performance impact unless we can find a way to use inference to distinguish the two cases (see above).

dplyr takes a quite different approach:

  • group_by is similar to our groupby. However, the grouped data frame it returns behaves mostly like a standard data frame (contrary to our GroupedDataFrame type) except for aggregation operations, so it does not always need to be combined back to a data frame immediately.
  • mutate applies to each group a function which can access any columns of a data frame, and returns a column with as many rows as the original (several such operations can be carried out at the same time by passing multiple arguments). It could be seen as a version of by with a restricted output type: since the number of rows must be equal to the original, operations simply add new columns to (a copy of) the original data frame. An essential difference is that a convenience syntax is provided so that one does not actually write a function accessing the data frame, but expressions like x = mean(col). In terms of performance, this means the list of columns that each operation needs is known, which allows compiling specialized code (similar to what DataFramesMeta does with @transform). Finally, note that mutate returns a grouped data frame, but since its behavior is close to that of plain data frames, this has only limited consequences for usability (comparing to what would happen with our current implementation).
  • summarise is similar to mutate, but for functions returning a single value. It can benefit from the same performance advantages as mutate, but can be even more efficient since results can be saved directly into the final DataFrame without intermediate allocations.
  • mutate_all and summarise_all are variants of the two previous functions which take a function and apply it to each column (similar to our aggregate). For the first one, the function must return either a single value, or a vector of the same length as the input. For the second one, the function must return a single value. The result is respectively a grouped data frame or a data frame.

Overall, our API is quite similar to Pandas, though we do not provide an equivalent of their aggregate for the simplest case where a function always returns a scalar. It is not clear whether we need it for performance, or whether we could use inference in our aggregate to automatically use a fast path when possible (see above).

On the contrary, our API is quite different from dplyr, which provides convenient functions to create new columns based on per-group operations (aggregation or arbitrary transformations), using a syntax similar to what Query or DataFramesMeta provide. Of course our by allows this kind of operation, but it makes it hard or impossible to do this efficiently since inference and specialization are problematic (see above). This kind of problem can only be solved using macros in order to have access the the names of columns which need to be used (and specialized on). So it is probably better left to Query and DataFramesMeta. So I'd say we should concentrate on ensuring our limited Pandas-like API is consistent and efficient, and point to other packages in our documentation for use cases which cannot be efficiently handled through it.

One big limitation we currently have compared with both Pandas and dplyr is that a simple case like computing a grouped mean for a single column is inconvenient and slow. Pandas allows accessing the pseudo-column of a grouped data frame and calling mean() on it. dplyr allows a similar operation by calling mutate(grouped_df, m = mean(col)). On the contrary, we require either using by, which is slow and not so convenient, or using aggregate, which applies the operation to all columns (and fails when some are not numeric, which should be fixed).

Ideas for possible improvements:

  • Make GroupedDataFrame behave more like a plain DataFrame when possible. In particular, using a more compact printing would make it more useful on its own (showing grouping columns like normal ones, maybe highlighted).
  • Allow computing reductions to a specific column from a GroupedDataFrame object, for example by adding a getindex method to select specific columns before calling colwise. Allowing to call mean (or any other reduction) on the resulting object could be convenient, but that would only work for the special case when there is a single column, while colwise is more general. We probably don't want to allow calling mean on a DataFrame to apply it column-wise.
  • As a complement or as an alternative, allow specifying a subset of columns when calling aggregate.
  • Make aggregate silently skip (i.e. return as-is) columns for which the passed function does not apply (according to its signature). Pandas does this by default, but since this is potentially confusing, it could be enabled via an argument.
  • Make colwise return a DataFrame so that when called on a GroupedDataFrame it would preserve grouping columns (currently colwise does not even report the names of columns to which numbers correspond).
  • Try to make aggregate more efficient by using inference (WIP: Modify aggregate for efficiency DataTables.jl#65). Unfortunately, this sounds more difficult for by, even though there may be room for improvement.
  • Improve the documentation to list the best way of performing common operations, and point to other packages where appropriate.
@xiaodaigh
Copy link
Contributor

I did some research and managed to beat data.table in some groupby cases.

Key areas to work on are string hashing and better parallelization. There may be a way to merge the work in FastGroupBy.jl into the Dataframes by rewriting the grouping backend to Radixsort based for bit types.

See post
https://www.codementor.io/zhuojiadai/an-empirical-study-of-group-by-strategies-in-julia-dagnosell

See code here
https://github.com/xiaodaigh/FastGroupBy.jl

@nalimilan
Copy link
Member Author

Interesting. Let's discuss this elsewhere as this issue was supposed to deal about APIs, not implementations.

@nalimilan
Copy link
Member Author

A good solution to make by faster would be to add an AbstractDataFrame type which would encode column type information, and which would be used to pass sub-data frames to the user-provided function. That would allow specializing the function on the types of the inputs. I have done some experimentations by wrapping a NamedTuple.

@pdeffebach
Copy link
Contributor

Bumping this. I have been playing around with aggregate recently and found something that made my computer stall

df = DataFrame(rand(50_000, 6000));
df[:id] = rand(1:25_000, 50_000); # 2 observations per group, on average
aggregate(df, :id, mean) # takes a very long time

I am not sure how contrived an example this is, but it's one we are more or less running into with our work in Stata, and Stata isn't handling it that well either. I think the error is that the GroupedDataFrame operation isn't using all the type information it could be using.

function _aggregate(d::AbstractDataFrame, fs::Vector{T}, headers::Vector{Symbol}, sort::Bool=false) where T<:Function

You can see an Any[] there.

@bkamins
Copy link
Member

bkamins commented Jan 15, 2019

@nalimilan what is left from this issue to be done?

@nalimilan
Copy link
Member Author

I think several points still need to be addressed, e.g. regarding aggregate. I'd also like to check that the API is OK before closing this.

@bkamins
Copy link
Member

bkamins commented Jan 17, 2019

👍 Thank you!

@nalimilan
Copy link
Member Author

Just leaving a note that it could make sense to replace aggregate(gd, f) with something like combine(gd, All() => f) or combine(gd, : => f), with All or : a special selection object similar to JuliaDB selectors. That would make the vocabulary much easier to grasp than using arbitrary terms like aggregate (combine and aggregate mean the same thing). dplyr uses summarize_all for this. This could be extended using other kinds of selectors like in JuliaDB.

The only question is what we also need a replacement for aggregate(df, f). I guess we could support combine(df, ...) for all variants of combine, making it work as if df was the only group in a GroupedDataFrame (and without grouping columns).

@bkamins
Copy link
Member

bkamins commented Apr 14, 2019

  1. I am in favor of removing aggregate for df. We have mapcols which is cleaner in this respect (does not cover all functionality, but for me everything above what mapcols does is too much magic).

With combine I have a mental problem that => syntax is reserved for applying f to what is on LHS.

So maybe:

  • add something like combine(gd, All(collection) => f), where collection is a list of sets of columns to apply f over to, and in particular it can be All() which means all columns
  • the only question is column naming (as now aggregate and combine use different methods for this)

@nalimilan
Copy link
Member Author

I am in favor of removing aggregate for df. We have mapcols which is cleaner in this respect (does not cover all functionality, but for me everything above what mapcols does is too much magic).

That's tempting, but both dplyr's summarize and Pandas' aggregate support data frames and grouped data frames, so I think we also need a function working on both.

With combine I have a mental problem that => syntax is reserved for applying f to what is on LHS.

Good point, so indeed : => f is not really correct. Conceptually that would be : .=> f (along the lines of #1620), which of course cannot work. combine(gd, All(collection) => f) would be OK. Though a problem is that All(collection) => f could also mean we want to pass all arguments in collection to f. That's the same thing as collection => f for All, but we could support other operations like Between and Not as in JuliaDB, in which case the meaning is less obvious. Of course it's not very likely that you would like f to take a list of variables that you cannot list one by one, but would it be OK to not even support that use case?

@bkamins
Copy link
Member

bkamins commented Apr 14, 2019

so I think we also need a function working on both.

I am OK to do the way you want to go (I just wanted to hint that mapcols does very similar thing)

By All(collection) I meant something as All([:a, :b]) or All([[:a,:b], [:c,:d]]) (to pass two columns two times).

Now I actually see that it can be even simpler like All(:a, :b) or All([:a, :b], [:c, :d]) and even (if we want to be extra fancy) All(targetcolname1=[:a, :b], targetcolname2=[:c, :d]). And then still All() could be a shorthand for all columns.

Edit by All(:a, :b) I mean apply f to column :a and to column :b producing two results.

@nalimilan
Copy link
Member Author

The last step to complete to close this issue is to deprecate aggregate. For GroupedDataFrame, it can be replaced with combine(gdf, All() .=> f) (thanks to JuliaData/DataAPI.jl#10). But there's also a DataFrame method, for which combine isn't the best term. A more natural deprecation for it would be select(gdf, All() .=> f). We could imagine supporting combine on DataFrame too, treating them as a single group, but that can be added later if it turns out it's a useful generalization.

@bkamins
Copy link
Member

bkamins commented Sep 8, 2019

Two comments:

  1. All() in combine should "drop" grouping columns, but if All is given some arguments it should work normally (this means that All() for DataFrame and GroupedDataFrame work consistently if we use the definition that All() captures all non-grouping columns, as in DataFrame there are no grouping columns)
  2. With select we should coordinate with JuliaDB.jl (as it works elementwise there AFAICT).

@piever - regarding the second point. We need two kinds of transformation operations:

  1. working elementwise (rowwise)
  2. working on whole columns

How do you handle this distinction in JuliaDB.jl (it seems that select works rowwise, then the question is how do you handle transformations that take a whole vector and transform it?)

@bkamins
Copy link
Member

bkamins commented Sep 8, 2019

@piever This is my current conclusion from he discussion with @nalimilan what would be best for DataFrames.jl. Please comment what you think of it (point 1 is probably the most relevant to JuliaDB.jl):

  1. select would work on columns of a DataFrame, and deprecating aggregate in this case
  2. I would add map for a DataFrame would work rowwise (also note that if a function works on vectors you can always work on rows with it using broadcasting)
  3. I would make combine deprecate aggregate for GroupedDataFrame

In this way map and select perform two different functions (and are naturally complementary). If they both work rowwise then they have an overlap of the functionality.

I think it would be great to settle this before DataFrames.jl 1.0 release (so hopefully soon).

@piever
Copy link

piever commented Sep 8, 2019

select would work on columns of a DataFrame, and deprecating aggregate in this case

select in IndexedTables is a very complicated beast, in that select(t, v::AbstractVector) returns v, select(t, i::Union{Int, Symbol, SpecialSelector}) selects columns, passing a column name and function maps the function row-wise and select(t, tup::Tuple) returns a table with the various selected vectors for each element in the tuple (see here).

select is often used in combination with other parts of the API, for example, if I do groupby(f, t, by, select = :a => f) it first computes select(t, :a => f) and then uses that as data for groupby. Because of this, changing select would be very breaking (a lot of API functions have a select keyword that always calls the select machinery).

Among others, transform inherits the same API: one can do transform(t, :a => v) (replace a column by a vector) or transform(t, :a => :b => f) ( replace a column by a function applied element-wise). When adding transform to DataFrames, do you plan on having it work column-wise like like select or row-wise? In my mind, it's important that the two functions are consistent.

In JuliaDB working on the columns directly is IMO not as common as in DataFrames as it doesn't work well with the distributed case. The main function to do this are groupby and summarize (which is similar to DataFrames.aggregate and is just syntactic sugar on groupby). So actually, the usecase you have in mind for colwise select is simply groupby with by = () in JuliaDB.

I suspect it'd be nicer, just like there is reduce and groupreduce to have apply and groupapply (where apply is DataFrames' colwise select) or, even better, reduce(f, group(df, key)), apply(f, group(df, key)).

Would apply, groupapply (or apply(f, group(df))) (for some name of apply, maybe this is not the best one) be a sensible choice instead of the proposed select and by for DataFrames or is it too late in the game for this kind of changes? It'd be great if we could also fix the inconsistencies with the grouping terminology between DataFrames and JuliaDB.

I would add map for a DataFrame would work rowwise (also note that if a function works on vectors you can always work on rows with it using broadcasting)

I'm not sure I'm particularly qualified to comment on this one. On the one hand, if DataFrames don't iterate rows (you need eachrow) and their size is 2D, then I'm not sure why map would work on rows, for consistency one should do map(f, eachrow(df)) I guess, esp. since otherwise map would change the shape of the container. On the other hand, this feels useful in practice and it's odd to have filter but not map.

I would make combine deprecate aggregate for GroupedDataFrame

Even though JuliaDB has the syntactic sugar summarize, I think it's not a matter of having a new function but an appropriate "special selector". The All() .=> f idea is brilliant, esp. since the one can have Number .=> mean and things like that (to only choose numeric columns). As discussed above though, I'd rather call this apply/groupapply.

@bkamins
Copy link
Member

bkamins commented Sep 8, 2019

Thank you for the extensive and insightful comment. I will add two short notes (probably @nalimilan will have more to say on this):

  • the idea of apply name is excellent (I hope @nalimilan agrees 😄); we could - for the time being leave select and map untouched (and leave the decisions what to do with them post 1.0 release) and now rename combine to apply and remove aggregate (we have not considered reduce as I think it is mostly relevant in distributed context so there is no high pressure to add it now to DataFrames.jl)
  • then apply would work on a GroupedDataFrame or a DataFrame where in the latter case we would "virtually" treat it as a GroupedDataFrame with one group and no grouping columns
  • we use the idea of applying Type to select groups in categorical and I agree that it can be naturally used also here (I have not thought about it before)

@piever
Copy link

piever commented Sep 8, 2019

(we have not considered reduce as I think it is mostly relevant in distributed context so there is no high pressure to add it now to DataFrames.jl)

One more tiny comment: I think that DataFrames actually uses something like groupreduce to optimize groupby for a set of commonly used operations, see for example here. Instead, IndexedTables accepts that groupby(:a => sum, t, :b) is slow and recommends instead groupreduce(:a => +, t, :b) (using OnlineStats for things that are a bit more complicated, like mean or higher moments). Still, I agree that this can be a post 1.0 thing.

@bkamins bkamins added this to the 1.0 milestone Sep 9, 2019
@nalimilan
Copy link
Member Author

Thanks. That's so hard! I think there are two different but related issues:

  • whether transformation functions like select or a potential mutate/transform should operate row-wise or on whole vectors
  • how we can replace aggregate

Actually the former issue has much deeper implications than the latter, so I've filed a separate issue to continue the discussion: #1952.

The discussion about aggregate is much less problematic: at worst, we can keep aggregate, it doesn't harm much even though it is largely redundant with the more general API we want to introduce.

I must say I'm not really convinced by the proposal to use apply instead of combine: I find the name too general, that doesn't even imply that grouping is dropped; and it's the name of an operation in Pandas that is equivalent to our combine(f, df) (see OP), which increases confusion. summarize is a good term in most cases, but since we also allow returning multiple rows it's not perfect either. combine is actually quite accurate, the only problem is that it doesn't make a lot of sense when there's a single DataFrame: it's only OK if you think of it as a special case of a single group. So basically I'd say it's OK if we don't deprecate aggregate. I don't have a great solution right now... :-)

Regarding the standardization of the API between DataFrames and JuliaDB, there are several issues:

  • groupby in DataFrames has no equivalent in JuliaDB
  • by in DataFrames is groupby in JuliaDB
  • then there's the potential issue with select and transform, but currently we're consistent since DtaaFrames doesn't accept functions in select, and transform doesn't even exist yet

I think it would be fine to add reduce over GroupedDataFrame in the future, and maybe even groupreduce as a shorthand. But in exchange for this and for not defining select and transform with conflicting meanings, I would kindly ask JuliaDB to rename groupby to something else (why not by?), as this is an obvious clash in meanings (and our name is consistent with precedent in dplyr).

@bkamins
Copy link
Member

bkamins commented Sep 9, 2019

Sticking to DataFrames.jl issues 😄, then I guess we stay with aggregate for now.
The discussion in #1952 can probably be concluded post 1.0.

So can we close this issue for now or there are some outstanding things to be discussed that are not covered in #1952?

@nalimilan
Copy link
Member Author

I think it would be wise to check that we have a rough idea of where we want to go before 1.0, just in case it requires deprecating something (apart from aggregate).

@bkamins
Copy link
Member

bkamins commented Sep 9, 2019

I agree, but what is the list of things to decide apart from #1952 and the newly opened #1953 (the latter does not have to go in by 1.0).

What I mean is that I would prefer to keep a list of atomic things to do for 1.0 with a milestone set for them, as then it is easier to track what is left to do.

@piever
Copy link

piever commented Sep 10, 2019

I think it would be fine to add reduce over GroupedDataFrame in the future, and maybe even groupreduce as a shorthand. But in exchange for this and for not defining select and transform with conflicting meanings, I would kindly ask JuliaDB to rename groupby to something else (why not by?), as this is an obvious clash in meanings (and our name is consistent with precedent in dplyr).

Of course there needs to be consensus with the other JuliaDB maintainers, but, as far as I'm concerned, I agree that the name groupby is unfortunate: it is inconsitent with DataFrames and with JuliaDB itself (groupreduce is reduce applied group-wise, groupby is not by applied group-wise).

To me, the only big thing that need deciding soonish to settle this is whether there is a good name for the equivalent of reduce for a function that requires the whole vector and may give a vector or scalar ("colwise select"). Basically, the function that's used internally so that :a => mean applied to a DataFrame gives the mean of column :a. The confusion seems to be that DataFrames has a concept of grouped data (which JuliaDB does not) and this function should merge groups.

The name combine makes sense here, but less so for JuliaDB (in some sense grouping there is given by primary keys, so no need to explicitly combine, the grouping will just be kept in the result: maybe this is more similar to dplyr). It also reads a bit odd in the ungrouped case.

I had proposed apply but it seems inadequate in the grouped case. Maybe apply(f, df; combine = true) (not sure what is the best default for the keyword) could replace combine and also be used as a "colwise operation dropping columns" (as select is not ideal due to name clash with JuliaDB)? Or even apply(f, df; combine = true, keepcols = false) to encompass combine, colwise select and colwise transform? This would solve the colwise issue in #1952 and make combine no longer necessary as far as I understand.

Personally, I like that apply(f, groupby(df, :a), combine = true) is clearly the group-apply-combine strategy. On top of that, apply(f, groupby(df, :a), combine = true, keepcols = true) would be a nice way to add columns computed as groupwise window functions to a DataFrame (at the moment it's not super-easy as far as I understand).

by(df, cols, args...; kwargs...) would then be exactly apply(df, groupby(df, cols), args...; kwargs...). In JuliaDB land it would make more sense to call this operation groupapply (in analogy with groupreduce), but I guess by is also a fine name.

@nalimilan
Copy link
Member Author

I'm afraid this is a very JuliaDB-esque proposal. :-)

DataFrame objects don't store any grouping information, and GroupedDataFrame objects are completely different from DataFrame. So having a combine keyword argument would be a bit weird IMHO (besides being type-unstable). Also, GroupedDataFrame is a collection of SubDataFrames representing groups, so we already have map to apply a function to each SubDataFrame. Actually, the canonical split-apply-combine in DataFrames (for which by is a shorthand) is groupby-map-combine; we support combine with functions/pairs because it's more efficient to avoid having map create an intermediate GroupedDataFrame before combining.

If we added a combine keyword argument, that would only be for interoperability with JuliaDB, but not as an idiomatic DataFrames API AFAICT. But I guess that conversely, JuliaDB could add combine for the same reason.

Also as I said I don't really like apply as it doesn't indicate what it does: how does it differ from map, which sounds like a synonym? We could as well extend mapcols for this, at least it would be explicit that it operates over columns.

On top of that, apply(f, groupby(df, :a), combine = true, keepcols = true) would be a nice way to add columns computed as groupwise window functions to a DataFrame (at the moment it's not super-easy as far as I understand).

What do you mean by "groupwise window functions"?

@piever
Copy link

piever commented Sep 10, 2019

Also as I said I don't really like apply as it doesn't indicate what it does: how does it differ from map, which sounds like a synonym? We could as well extend mapcols for this, at least it would be explicit that it operates over columns.

I'm not completely sold on mapcols as it is different in the case with just one function, but I see that maybe we just need a good concept of grouped table in JuliaDB and then one would call map on that, so by is actually more of a groupmap (at least in JuliaDB, as vectors are not flattened together by default). In which case I guess it would be fine to repurpose groubpby in JuliaDB to do the grouping and then just call map on it and have by(f, df, cols) be map(f, groupby(df, cols)).

What do you mean by "groupwise window functions"?

I simply meant things like by(iris, :Species, :SepalLengthRank => :SepalLength => ordinalrank), i.e. adding a new column of the same length of the old one in a by while keeping other columns. But maybe my point does not hold, in that the keepcols = true might as well be added to by.

@nalimilan
Copy link
Member Author

I'm not really sure mapcols is the right operation for this. I just find apply not explicit enough. Need to think about other possibilities.

I simply meant things like by(iris, :Species, :SepalLengthRank => :SepalLength => ordinalrank), i.e. adding a new column of the same length of the old one in a by while keeping other columns. But maybe my point does not hold, in that the keepcols = true might as well be added to by.

by and combine keep grouping columns by default. Actually there's no way to avoid them, but there's a PR to add an argument: JuliaLang/julia#30901

@tkf
Copy link
Contributor

tkf commented Dec 1, 2019

Comparing with other software, our API is quite similar to Pandas, which provides three different functions (see also this summary):

  • groupby is similar to our function. However, the result allows accessing its "columns" via indexing, and calling a reduction function like mean on them computes it for each group.

Let me add that pandas' groupby is slightly different from DataFrames.jl as it returns an iterable of key-group pairs rather than iterable of groups, as in DataFrames.groupby.

In [1]: import pandas
   ...:
   ...: df = pandas.DataFrame(
   ...:     {
   ...:         "a": [1, 1, 2, 2, 3, 3],
   ...:         "b": [1, 2, 3, 4, 5, 6],
   ...:     }
   ...: )
   ...:
   ...: for key, group in df.groupby("a"):
   ...:     print("key =", key)
   ...:     print("group =")
   ...:     print(group)
   ...:     print()
key = 1
group =
   a  b
0  1  1
1  1  2

key = 2
group =
   a  b
2  2  3
3  2  4

key = 3
group =
   a  b
4  3  5
5  3  6

I find this useful sometimes. Of course, with DataFrames.jl, I can pull out the key from SubDataFrame very easily if I know cols passed to groupby. But I feel it's worth considering to have some API equivalent to the iterable returned by pandas.DataFrame.groupby, if there is nothing equivalent already.

(I don't know DataFrame.jl enough to say for sure there is no function that does this out-of-the-box. I hope my comment is not adding unnecessary noise.)

@davidanthoff
Copy link
Contributor

The Query.jl way of doing this is to have a key function that returns the key of a given group. I assume that API would also work here, if key was defined for SubDataFrame?

@tkf
Copy link
Contributor

tkf commented Dec 1, 2019

if key was defined for SubDataFrame

Yes, that would be equivalent and nice to have.

@bkamins
Copy link
Member

bkamins commented Dec 1, 2019

@tkf do you mean functionality implemented in #1908 (it should be merged in a few days and included in release v0.20)

CC @jlumpe

@tkf
Copy link
Contributor

tkf commented Dec 1, 2019

Yes, exactly that. Thanks!

@xiaodaigh
Copy link
Contributor

The Query.jl way of doing this is to have a key function that returns the key of a given group. I assume that API would also work here, if key was defined for SubDataFrame?

I found Query.jl's performance on large group by problems to be slower. In fact, I would say much slower. As long as can keep the performance at a good level

@tkf
Copy link
Contributor

tkf commented Dec 1, 2019

@xiaodaigh Your work on extensive benchmarking of group-by is valuable to many people. But I believe this thread is about API, not implementation or performance. If you have a strong concern about Query.jl's key accessor function-based approach (which I don't get why), maybe #1908 is the right place to discuss?

@xiaodaigh
Copy link
Contributor

about API, not implementation or performance

I am just not sure if a certain API will lead to slower implementation e.g. Maybe an iteration based API will be slower?

@davidanthoff
Copy link
Contributor

Having a key function (or something like it) is not tied to an iterator based implementation at all.

@bkamins
Copy link
Member

bkamins commented Feb 12, 2020

Additional to-do / to-decide before 1.0 release:

  • sync by and combine syntax with coming select syntax
  • consider what should be the order of groups in `GroupedDataFrame (see API for groupwise column transformation #1727)
  • decide if select and transform should work on grouped data frame

@bkamins
Copy link
Member

bkamins commented Apr 6, 2020

I am closing this as I do not see anything not covered elsewhere (or not fitting the current design). If you feel something is missing then please open a new issue (possibly giving a link to this one for past reference).

@bkamins bkamins closed this as completed Apr 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants