f[] under groupby #2470
Comments
I would remove groupby columns only from
a must, unless there will be
Thanks for the feedback, Jan. The starting point seems to be uncontroversial: that I guess what I am trying to say is this: if you look at the original So, no matter what convention is adopted, there will be situations where people would get surprised by the unexpected behavior. Given this, we could resort to some other lines of reasoning in order to decide which convention is better:
In this case both approaches are "magical" when it comes to In terms of usefulness, I believe the proposed solution has the upper hand. After all, the reason why Lastly, the alternatives. In the proposed approach one can say
Currently we would consider neither
If your I think the least magic way is to have an extra symbol for that, so
The "magic" starts at the point where we add grouping columns to the resulting frame automatically, and this happens without an explicit request from the user. It also means that when doing a group-by, users don't have full control over the content or column order of the result.
@jangorecki I see what you mean regarding a datetime column converted to year+month representation: when you aggregate something to month level, the original exact date is no longer relevant. But perhaps sometimes it is: perhaps you want to split the data into months, but within each month see all observations, together with their dates; or maybe you need to apply some other date function, like day-of-week, or holiday indicator. I guess what I'm trying to say is that it's not so cut-and-dry, and we can't turn it into a reliable automatic rule for column exclusion. "In the face of ambiguity refuse the temptation to guess". So what this example tells us is that sometimes there will be situations where we would want to treat certain columns as excluded, and therefore we should have a syntax for that. We have the
I thought about this, and you're right: this would indeed be the least magical way. Yet I still don't feel very comfortable about it, from a practical standpoint. First of all, it's unclear what the name of the new symbol will be. We can't use
In #2460 we want to change the meaning of the `f[:]` symbol within the groupby so that it means "all columns excluding those used for grouping". There are several rationales for this change:

- `DT[:, :, by()]` already has such a meaning, so this change will make the API more self-consistent;
- with the old meaning of `f[:]`, the key columns were appearing twice in the resulting frame, since the keys are automatically appended to the result;
- it supports expressions such as `DT[:, mean(f[:]), by(...)]` (in R this would be `DT[, lapply(.SD, mean), by=...]`).

In practice, consider a dataset with columns `[A, B, C, D]`. Then grouping by 'C' should result in `f[:]` covering only `[A, B, D]`.

However, `:` is just one case of a slice: a trivial slice. We should expect consistency between this trivial slice and other slices. Thus, `DT[:, 2]` would select column `C`, whereas in the presence of a groupby (by column "C") it would already refer to column `D`. This could potentially be a source of confusion, especially if the user writes `DT[:, f[2], by(f[2])]` and `f[2]` ends up meaning different things in these two places. But I guess that's the price you pay for referring to columns by their numbers instead of names...

Speaking of names, it is absolutely necessary to be able to refer to groupby columns by their names within `j`, so that they could be used in expressions.

I'm not sure how to reconcile these divergent goals: consistency and usability...
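To make the positional shift described above concrete, here is a minimal pure-Python sketch. It does not use datatable at all: `resolve_f` is a hypothetical helper that only models which column names a positional selector would pick, assuming the proposed rule that groupby columns are hidden from `f`.

```python
# Hypothetical model of the proposed rule; not part of the datatable API.
def resolve_f(columns, selector, by_cols=()):
    """Return the column names that an int or slice selector would pick
    once the groupby columns are hidden from view."""
    visible = [c for c in columns if c not in by_cols]
    picked = visible[selector]
    # A slice yields a list, an integer yields a single name.
    return picked if isinstance(picked, list) else [picked]


columns = ["A", "B", "C", "D"]

print(resolve_f(columns, slice(None)))                 # ['A', 'B', 'C', 'D']
print(resolve_f(columns, slice(None), by_cols=["C"]))  # ['A', 'B', 'D']
print(resolve_f(columns, 2))                           # ['C']
print(resolve_f(columns, 2, by_cols=["C"]))            # ['D']
```

The last two calls show the source of confusion: the same index `2` names `C` without a groupby and `D` with one.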
One way would be to say that even though there is no column "C" in `f[:]`, the lookup `f["C"]` should still be able to resolve that column "magically".

Another way would be to say that `f.C` is no longer valid, and you should use `by.C` instead. This has the advantage that it allows us to also implement queries such as `by[:]` (all groupby columns), `by[:2]` (first 2 groupby columns), `by[-1]` (the last groupby column), etc. The disadvantage is that the user would have to learn one more syntax in addition to `f` and remember to use it. Though I guess it would also better reinforce the notion that `f[:]` is kinda different under a groupby, and the user should better be aware of that difference...

This is an open-ended discussion. Thoughts / suggestions / comments are welcome.
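The two alternatives above (name lookup that still works through `f`, versus a separate `by` namespace) can be compared with a small pure-Python model. This is only a sketch under assumptions: `FNamespace`, `ByNamespace` and the `magic_names` flag are hypothetical names invented for illustration, and nothing here reflects how datatable implements or would implement either option.

```python
# Hypothetical sketch; these classes are illustrative, not datatable classes.
class FNamespace:
    """Models f under a groupby: slices skip the key columns."""

    def __init__(self, columns, by_cols, magic_names=True):
        self.columns = list(columns)
        self.by_cols = list(by_cols)
        self.magic_names = magic_names

    def __getitem__(self, key):
        visible = [c for c in self.columns if c not in self.by_cols]
        if isinstance(key, str):
            # First alternative: a key column still resolves by name.
            if key in visible or (self.magic_names and key in self.by_cols):
                return [key]
            raise KeyError(f"column {key!r} is not visible in f")
        picked = visible[key]
        return picked if isinstance(picked, list) else [picked]


class ByNamespace:
    """Second alternative: a namespace holding only the groupby columns."""

    def __init__(self, by_cols):
        self.by_cols = list(by_cols)

    def __getitem__(self, key):
        if isinstance(key, str):
            if key not in self.by_cols:
                raise KeyError(key)
            return [key]
        picked = self.by_cols[key]
        return picked if isinstance(picked, list) else [picked]


columns = ["A", "B", "C", "D"]
f = FNamespace(columns, by_cols=["C"])
by = ByNamespace(["C", "A"])

print(f[:])     # ['A', 'B', 'D']
print(f["C"])   # ['C'], resolved by name although absent from f[:]
print(by[:])    # ['C', 'A']
print(by[-1])   # ['A']
```

With `magic_names=False` the model behaves like the second alternative: `f["C"]` raises `KeyError`, and the key column is reachable only through `by`.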