-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cleaner syntax #2206
Comments
This is essentially #2172, which will be added later (as it is non-breaking). We leave As for EDIT Just to expand |
So - to summarize - unless you have some comment why we should distinguish "single row" vs "all rows" functionality (we currently do not and dplyr does, but I do not see a huge benefit of this distinction) then this issue can be closed and please comment in #2172 on the functionality |
Yes, I think different methods should do single rows vs all rows. The current syntax for the top right quadrant vs bottom right quadrant is not good. It’s very important to separate between the two when doing data wangling. |
Also, even looking only at the top quadrant. |
Can you please comment on this more? Why it is so important? My understanding is that it is what the function you apply returns determines the shape of the output. EDIT in particular this function can neither return 1 row nor that many rows as were in the original group but something completely different. |
This is doable and essentially depends on whether we want to allow |
See this discourse for instance: |
So essentially you would want to extend
Do I understand your request correctly? |
I would like |
but Also currently we cannot use Finally then So that is why I am asking if it is OK with you if we achieve the functionality you describe using kwargs only that would be added to
If we went this way then |
just to add - what you write you want is currently available by writing (disregarding some corner cases for a while not to complicate the discussion):
simply we would make |
I tried to make my point without talking about whether or not you want to keep all other columns. With this added complication, the full table in dplyr is:
Note that there is no need for summarize that keeps all existing columns (not sure what you mean by "aggregate should be two functions..."). I am suggesting:
So in short, I would like |
I have to stop now, I will comment on it later, but the crucial part is The key thing is that we accept any length of the output (not only 1 row or |
Ok. Yes, sorry, I do not completely understand the current syntax in master. Hopefully, though, I have communicated the kind of syntax I am hoping for. |
Yes - I think it is clear. Let me summarize how I now think we can handle all the cases (also from the #2172 discussion). I describe
@nalimilan - I can quickly implement it if we go for this. I guess all needs of @matthieugomez are satisfied with this design and also the requirement of tracking of row order after producing a data frame. For 0.21 it would be a simple implementation (not the fastest possible), but we can improve performance after 0.21 release. |
Thank you for your input on this. I think there are definitely changes to make here.
We do not have an operation to "spread" the mean of a variable across the full data frame, but drop other columns. This is a good lapse and I thank you for pointing it out. I like
|
I don't think I entirely agree with @bkamins's synthesis of my proposal. My original point is that it is important to have two different methods to keep rows or not (instead of different options keeprows = true or false). This makes explicit an important contract with the user: there is no loss of observations. Moreover dataframes remain aligned: the row number 10 in the previous dataframe refers to the same observation as the row number 10 in the next dataframe. To be honest, I'm not sure what the main difference between transform and select is in your proposal? |
@pdeffebach I'm agnostic on whether we should have a For |
I fully agree with this. I think that
My opposition to keyword arguments is
only to find out the operations are performed by group at the end. I agree that
I'm happy with this, too. But you have convinced me of the need for |
I had a long discussion with @nalimilan about it but also having read your comments I have updated my thinking a bit and have the following conclusion. We should have the following functions
So in summary - the major difference from
With the only caveat that Then on top of this table we have |
Well, that would be awesome... I guess my only question is: will it complicate the package too much? I realize that you and @nalimilan are basically the only maintainers these days, so I get that we want to avoid a multiplication of methods. On a related note, I guess I don't really understand why |
The basic building blocks of what is proposed above is something we have now:
Now (technical implementation migtht differ a bit):
So in summary - the change is relatively small. Fortunately none of this functionality exists in 0.20 so we can do whatever we want and the current design for 0.21 is flexible enough to cover all that you request here relatively simply. |
An additional comment based on the discussion with @nalimilan on Slack (this is my understanding of things based on the input from @matthieugomez, @pdeffebach and @nalimilan + my own opinions): Q1: Why we do do not allow passing A1: Because Q2: Why do we need A2: since Q3: Why do I prefer A3. This is a tough decision. I was thinking about it and I think that it is better not to pollute the namespace with too many functions. Q4. Why do we keep A4. First of all - they are useful. You firs "tweak" your Q5. Should we keep A5. After thinking my preference is to deprecate
In particular this will resolve the problem I have always had that In summary the table we discuss would be
additionally we support Now the important thing is that these definitions have a very nice feature that left column (ungrouped operation) is exactly right column (grouped operation) when Please up- or down- vote this proposal. I am now personally convinced to go this way so if no opposing voices will be made I will make a PR implementing this (fortunately it is relatively easy) when #2199 is merged so that we can include it in 0.21 release. |
I have limited experience with the current interface, but for what it's worth, I think both the request and the proposal make a lot of sense.
I'm not entirely sure how we are meant to perform operations on subsets of the rows, without deleting the other rows. This is one of the issues I've been facing when implementing a Stata-like interface. But perhaps that's an issue for another day. Thanks a lot to everyone involved! |
@bkamins I think this is a great idea. I still maintain a dislike for keyword arguments. I have implemented, in the crudest way possible, a non-keyword argument version of all this in a PR here. I use Stata-esque names to avoid name conflicts with existing functions. I think everyone in this thread will be able to understand what each function does. I think that we should deprecate
is still a very powerful and convenient syntax. For reference, here is my list of new functions implemented in #2210
@jmboehm Can you file an issue about replicating stata's |
Just a few comments regarding @bkamins' detailed proposal. Overall I like it, I just have a few reservations:
|
@jmboehm thank you for your thoughts. Here are my comments to the questions you have raised.
So
use a EDIT - oh - I understand you want to do some operation conditionally on some other column? something like |
@nalimilan thank you for this feedback. I think there is a consensus around this kind of proposal, unless there are truly insurmountable edge cases. Do you imagine |
I have created #2211 for discussion about |
I'll just give my two cents about kwargs vs type, but, in any case, I'm happy with either proposal. On the one hand, I like @nalimilan's proposal because it keeps everything super simple. Is it worth thinking about defining a |
I guess something I don't understand in @nalimilan's proposal is whether the output of |
@nalimilan - thank you for the comment. I will summarize the possible design under these circumstances in the post that follows (including @pdeffebach's comment about returning a Here let me add some more general comments. @matthieugomez - I think @nalimilan wants to give an option to choose what should be the output. i also comment on this below. Just as a note to @pdeffebach (rephrased what I commented in #2210). In DataFrames.jl we want to provide a minimal set of functions that provide the required functionality. Therefore if we end up with the design in which:
will be possible to achieve in a different and easy enough way we will probably remove it. I know that in the past DataFrames.jl did not always follow this rule strictly and everywhere but as we go for 1.0 this is needed:
For example in #2211 I assume we will clarify what functionality is needed. If is is easily achievable when composing current functionality then probably we will not add it, but if we decide that it is very hard to achieve it without adding something to the "core" of the package then we will add it. |
The issue about what Given the amount of decisions that are to be made I start getting a feeling that we will not be able to resolve all issues in a way in 0.21 release, so that we ill not be breaking later. So maybe we will have to decide on allow for breaking changes between 0.21 release and 1.0 release. What I think the consensus is is what we need for operating on a So the functionality for data frame seems to require three functions, all of them take a a data frame and return a
In this thread of the discussion I think we can concentrate on finding an appropriate name for the third function. The functionality of the grouped data frame is more complex. The starting point is what @matthieugomez said about adding another type (a la dplyr). The good thing is that currently we do not allow to change Below I assume that we will settle on Then following the proposal of @nalimilan:
So essentially the difference in comparison to the earlier proposal is dropping |
I have updated #2210 with the proposal described by Bogumil above, still with the names |
|
for a For
I do not like this design. In DataFrames.jl we are clear that So - in short:
Well take this
what should be the result of
I understand this, but could you please specify exactly how this type should behave exactly (should it be a view or a copy, should it precompute groups on creation and store them or only carry over information about grouping columns but do not store them, should it have a special column type for grouping columns that make them read only, should you be allowed to add rows to it, should you be allowed to add columns to it, should you be allowed to remove columns from it, etc. - probably the list here is longer, but these are probably major questions). Maybe you have answers for these questions and then we can quickly move forward. If not, and if you find the current design (trying to slightly modify current |
This does not sound good. I think it's a good reason to prefer |
Well technically such Let me summarize the pros and cons of both proposals: Using kwarg
Not using
As I have written it down I feel I would prefer option 2 in the long term (as proposed by @nalimilan), but for it to work we need to work out the "grouped data frame" case. You have not commented on the problems I see with
then |
I was thinking about it today some more and my conclusion is the following:
Here I write what conclusion I have for this high level API currently (but bear in mind that given what I have just written actually I think it is encouraged to design other high-level APIs that could be built on top of the "core" low level API): We should provide 5 functions (in the scope of this issue):
The signatures would be:
Other kwargs have current meaning. Note that for Now why I opt for Given this design both Finally Sorry that the posts are lengthly, but we are desinging a complex ecosystem and details matter; I hope I have not mixed up something in the descriptions. I hope this design is something you find acceptable and useful. If yes. After #2199 I would go forward to implement it. Also as I have noted above after 0.21 I would go forward to decouple all this from the low-level API (that would be moved to DataFramesBase.jl) to allow other high-level APIs to be implemented (this one would be just a reference implementation - still with the aim to be useful). |
Honestly I don’t know enough about how people use groupeddataframes to bring substantive points to this discussion (kwarg vs gdf). That being said, the final proposal looks good to me. Thanks a lot — especially for considering making transform! work with grouped data frames. |
+1 to all of this. I can't comment on the contract about mutating the parent of a grouped data frame, but if that's surmountable it would be great. I went over your R example and also find the behavior unintuitive. Perhaps we can disallow modifying a key column. Thank you everyone for their detailed proposals and thoughtful comments. |
It is "surmonutable" as you say for "cannonical" Taken all this into consideration the only risk is that some other Note that this is a different situation than the one we discuss in #2211. The problem is that if you add a column to a The general thinking is that
In the proposed design the problem that dplyr has does not exist (we explicitly check if grouping column has remained unchanged in Thank you all for discussing this. |
Sounds like a good plan! I think discussions about special cases can be handled later, like what to do with "non-canonical" Regarding the idea of having @matthieugomez Do you see something in particular that having |
@nalimilan I'm not sure. My initial reaction is that I find it confusing to do |
As a side note Now - the issue you report is printing related. The original thinking (not mine - it was implemented long before I started working on DataFrames.jl) was that it is more useful to show groups in |
@bkamins My point about printing was really not intended to be a proposal, just an answer to @nalimilan's question. I am really not knowledgeable enough to have a well formed proposal about GroupedDataFrames. That being said, if we are all still hesitant about how to work with |
This is exactly the plan, these functions would not go to DataFramesBase.jl.
Well - I am not hesitant, though I understand that different users might find different things useful. Note however, that there is little value added of making
If we wanted As a note |
I've followed a bit the recent updates of this package. This is very impressive — thanks @bkamins
and @nalimilan for all your work.
I have one comment about the syntax. An operation such as computing the mean of a variable in a dataframe can be classified along two dimensions (i) whether the new dataframe as the same size as the original dataframe (ii) whether it is a by operation or not.
dplyr and stata make it very easy to alternate across these two dimensions.
in dplyr
In stata:
This is very neat. There is a symmetry between top vs bottom, and left vs right. People can understand what these commands do just by reading them
The current syntax of Dataframes.jl is not as neat IMO. On the current master, we have:
I wish DataFrames.jl would follow the example of dplyr and stata here. For instance, a nice syntax could be the following:
In short, my suggestion would be to (i) remove the function
by
(ii) allowby
kwarg intransform
(ii) define a new function that would be the Julia equivalent of summarize (dplyr)/ collapse (stata).The text was updated successfully, but these errors were encountered: