Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

f[] under groupby #2470

Open
st-pasha opened this issue Jun 5, 2020 · 5 comments
Open

f[] under groupby #2470

st-pasha opened this issue Jun 5, 2020 · 5 comments
Labels
design-doc Generic discussion / roadmap how some major new functionality can be implemented

Comments

@st-pasha
Copy link
Contributor

st-pasha commented Jun 5, 2020

In #2460 we want to change the meaning of the f[:] symbol within the groupby so that it means "all columns excluding those used for grouping". There are several rationales for this change:

  • DT[:, :, by()] already have such meaning, so this change will make the API more self-consistent;
  • during groupby the grouping columns have fundamentally different role, and therefore mixing them up with the "regular" columns is somewhat inappropriate;
  • with the old meaning of f[:] the key columns were appearing twice in the resulting frame, since the keys are automatically appended to the result;
  • queries where you need to aggregate all columns (except the group columns) are very common, and we need a convenient API for them. The obvious choice seems to be DT[:, mean(f[:]), by(...)] (in R this would be DT[, lapply(.SD, mean), by=...]).

In practice, consider a dataset with columns [A, B, C, D]. Then grouping by 'C' should result in

>>> DT[:, f[:], by(f.C)]
   |  C   A   B   D
-- + --  --  --  --

However, : is just one case of a slice: a trivial slice. We should expect consistency between this trivial slice and other slices:

>>> DT[:, f[::-1], by(f.C)]   # select all columns in reverse order
   |  C   D   B   A
-- + --  --  --  --

>>> DT[:, f[-2:], by(f.C)]    # select only the last 2 columns
   |  C   B   D
-- + --  --  --

>>> DT[:, f[2:3], by(f.C)]    # select the 3rd column (note that it's D not C)
   |  C   D
-- + --  --

>>> DT[:, f[2], by(f.C)]      # again select the 3rd column
   |  C   D
-- + --  --

>>> DT[:, 2, by(f.C)]         # also select the 3rd column
   |  C   D
-- + --  --

Thus, DT[:, 2] would select column C, whereas in the presence of a groupby (by column "C") it would already refer to column D. This could potentially be a source of confusion, especially if the user writes DT[:, f[2], by(f[2])] and f[2] ends up meaning different things in these two places. But I guess that's the price you pay for referring to columns by their numbers instead of names...

Speaking of names, it is absolutely necessary to be able to refer to groupby columns by their names within j, so that they could be used in expressions. For example:

>>> DT[:, f.A + f.C, by("C")]   # f.C is usable even though f[:] does not contain column "C"

I'm not sure how to reconcile these divergent goals: consistency and usability...

One way would be to say that even though there is no column "C" in f[:], the lookup f["C"] should still be able to resolve that column "magically".

Another way would be to say that f.C is no longer valid, and you should use by.C instead:

>>> DT[:, f.A + by.C, by("C")]

This has the advantage that it allows us to also implement queries such as by[:](all groupby columns), by[:2] (first 2 groupby columns), by[-1] (the last groupby column), etc. The disadvantage is that the user would have to learn one more syntax in addition to f and remember to use it. Though I guess it would also better reinforce the notion that f[:] is kinda different under a groupby, and the user should better be aware of that difference...


This is an open-ended discussion. Thoughts / suggestions / comments are welcome.

@jangorecki
Copy link
Contributor

jangorecki commented Jun 5, 2020

I would remove groupby columns only from f[:] and : usage. When providing columns by position, it is risky to shift that positions. I sometimes add an extra column to group by, then whole expression is going to be invalid then.
It is also tricky to well define what are groupby columns if you could group by on an expression, like DT[, .SD, by=.(grp=fun(col1, col2)]. There are cases where only col1 should be considered a grouping column. AFAIK we have an open issue about that in R dt.

One way would be to say that even though there is no column "C" in f[:], the lookup f["C"] should still be able to resolve that column "magically".

a must, unless there will be by["C"] for that, which is not bad idea, one more thing to learn but no surprised and magic.
Eventually, f could store all columns, by grouping columns, and new variable, lets say j, that stores f - by columns. Then even more to learn.

@st-pasha
Copy link
Contributor Author

st-pasha commented Jun 5, 2020

Thanks for the feedback, Jan.
I completely agree that shifting column positions seems like a very risky idea. I don't like this conclusion at all, yet this is where the logic of well-intended reasoning seems to be leading.

The starting point seem to be uncontroversial: that f[:] should refer to all columns except the groupby columns. Yet, in Python one would expect to be able to manipulate the slice and arrive at consistent results each time. For example, if I look at the sequence returned by :, then I expect that changing : into ::-1 would reverse that sequence. It's just a very common python idiom. Similarly, if I want to remove first few elements of the sequence, I'd want to write 2: for the slice expression.

I guess what I am trying to say is this: if you look at the original DT, and then write f[2:], you will get something you didn't expect (assuming the new behavior of f). HOWEVER, if you start with a grouped expression such as DT[:, smth(f[:]), by(...)] and then change f[:] into f[2:] there -- you'd also get something you didn't expect (assuming the old behavior of f).

So, no matter what convention is adopted, there will be situations were people would get surprised by the unexpected behavior.

Given this, we could resort to some other lines of reasoning in order to decide which convention is better:

  • "Muggle-ness": the solution which has less "magic", or special behaviors in them are generally preferrable;
  • Usefulness: which solution is more likely to be useful to a user;
  • Access to alternative: if one approach is adopted, is it possible to achieve the behavior of the alternative solution?

In this case both approaches are "magical" when it comes to f[:], because this expression removes the group columns auto-magically. The two approaches differ in how to interpret f[:-1], but it can be argued either way which one is more "magical" in this respect. Finally, the proposed solution would be more magical if it made f["C"] work (but not if we adopt by["C"] syntax).

In terms of usefulneuss, I believe the proposed solution has an upper hand. After all, the reason why f[:] removes the groupby columns is because they are annoying to deal with if they are mixed-in with the rest of the columns. But the same logic applies to expressions such as f[1:] or f[::-1] just as well. If these expressions were to exclude the group columns, it would be more useful for practical standpoint than if not.

Lastly, the alternatives. In the proposed approach one can say f[:].extend(by[:]), while in the alternative approach it is f[:].remove(by[:]). Though the former is slightly inferior because the order of columns is not preserved. One could also say f[DT.names].

It is also tricky to well define what are groupby columns if you could group by on an expression, like DT[, .SD, by=.(grp=fun(col1, col2)]. There are cases where only col1 should be considered a grouping column.

Currently we would consider neither col1 nor col2 to be a group column, only grp will be. Honestly, I do not see anything ambiguous here, so if you can find a link to the issue, that would be very helpful.

@jangorecki
Copy link
Contributor

If your col1 is a datetime, and you pass it to year_month function, then you generally don't want to have datetime still in your j. On the other hand there are issues like this: Rdatatable/data.table#1427

I think the least magic way is to have extra symbol for that, so j and by together will be f.

@st-pasha st-pasha added the design-doc Generic discussion / roadmap how some major new functionality can be implemented label Jun 6, 2020
@st-pasha st-pasha added this to the Release 0.11.0 milestone Jun 6, 2020
@oleksiyskononenko
Copy link
Contributor

oleksiyskononenko commented Jun 7, 2020

The "magic" starts at the point where we add grouping columns to the resulting frame automatically, and this happens without explicit request from a user. It also means that doing group by users don't have a full control over the content/column order of the result.

@st-pasha
Copy link
Contributor Author

st-pasha commented Jun 8, 2020

@jangorecki I see what you mean regarding a datetime column converted to year+month representation: when you aggregate something to month level, the original exact date is no longer relevant. But perhaps sometimes it is: perhaps you want to split the data into months, but within each month see all observations, together with their dates; or maybe you need to apply some other date function, like day-of-week, or holiday indicator. I guess what I'm trying to say is that it's not so cut-and-dry, and we can't turn it into a reliable automatic rule for column exclusion. "In the face of ambiguity refuse the temptation to guess".

So what this example tells us is that sometimes there will be situations where we would want to treat certain columns as excluded, and therefore we should have a syntax for that. We have the f[:].remove(f.annoying_column) syntax for that.

I think the least magic way is to have extra symbol for that, so j and by together will be f.

I thought about this, and you're right: this would indeed be the least magical way. Yet I still don't feel very comfortable about it, from practical standpoint. First of all, it's unclear what the name of the new symbol will be. We can't use i, or j, or s or sd -- all of these are likely to clash with user-defined variables. But suppose we came up with a good symbol, something like o. The f symbol would still continue to work, so we will have a situation where sometimes people write group-bys in term of fs, and sometimes in term of o. columns. This will be quite confusing. Plus, it's unclear what this symbol should represent when there is no groupby. Nor how o. behaves in the presence of a join: does it collect columns from all joined frames, or from the "main" frame only? Overall it seems like this approach is actually messier than the original.

@st-pasha st-pasha removed this from the Release 0.11.0 milestone Aug 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design-doc Generic discussion / roadmap how some major new functionality can be implemented
Projects
None yet
Development

No branches or pull requests

3 participants