local and global versions of .I, .N #1206

Open
MichaelChirico opened this issue Jul 3, 2015 · 14 comments
Labels
breaking-change (issues whose solution would require breaking existing behavior), feature request
Milestone

Comments

@MichaelChirico
Member

It's always been a bit confusing to me that .I is "global" in the sense that it doesn't change with by, while .N is "local" in the sense that it does.

I understand (some of) the advantages of this arrangement, but I think there are ample situations for using a local .I (see, e.g. 1, 2, 3, etc.) or a global .N (e.g., 1).
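For reference, a minimal sketch of the current behavior being described. The toy table and the column names `global_i`, `local_i`, and `n` are made up for illustration; `seq_len(.N)` stands in for the "local" index that has no symbol of its own today.

```r
library(data.table)

# Toy data: two groups of unequal size (hypothetical example)
DT <- data.table(g = c("a", "a", "b", "b", "b"), v = 1:5)

res <- DT[, .(
  global_i = .I,          # row positions in the full table, even under by
  local_i  = seq_len(.N), # positions within the current group
  n        = .N           # size of the current group, not of the table
), by = g]

print(res)
```

Under `by`, `.I` keeps counting across the whole table while `.N` resets per group, which is the asymmetry this request is about.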

I'm not sure how easy this is to build into the source code, but having .i and .n be "local" while .I and .N are "global" seems like an intuitive alternative. On the other hand, it could be painful to switch the behavior of .N given that it's so ubiquitous in data.table code.


Throwing a hat in the ring for .SD and .sd as well, since I've been tempted a few times to try .SD with the intention of getting the full table within by, specifically here.

@franknarf1
Contributor

I agree. It'd be quite a break from backward compatibility, but that notation would be useful and a lot more intuitive.

@jangorecki
Member

Quite a big change. Not sure the performance gains are big enough to justify using .i instead of the current 1:.N; has anybody measured it? The 2.0.0 release is going to have some breaking changes, so that could be the place to release such a change.

@arunsrinivasan
Member

Like the idea very much, but not sure if it's possible at this stage, as it'd break a lot of code.. Where were you when this was implemented first :P?

Marked as FR for now.

@MichaelChirico
Member Author

In my R swaddling blankets, I suppose ;)

I understand .N->.n is a big push, but I rarely need that.

.i, however, shouldn't break any code and I would use it all the time!

@arunsrinivasan
Member

Right. But seq_len(.N) is the .i you're looking for. Is that not okay? I ask because I find the intent quite clear and understandable. .i and .I can get quite confusing quickly. If it's really necessary, then maybe .seqN? Not sure. I'm always on alert when we have to add more symbols :-).

@MichaelChirico
Member Author

MichaelChirico commented Jul 7, 2015

It feels pretty natural to me, and writing, e.g., var[.i<5] is certainly more intuitive (and cleaner!) than var[seq_len(.N)<5] (or even var[.seqN<5]), but maybe that's just me. FWIW I have more of a math than a programming background, which may be why it's easy for me to compartmentalize capital vs. lowercase symbols.
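To make the comparison concrete, here is a sketch of how the within-group index can be written today. `.i` and `.seqN` are only proposals in this thread and do not exist; the table and the column names `local_i` and `local_i2` are illustrative assumptions.

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b", "b", "b"),
                 v = c(10, 20, 30, 40, 50))

# Today's spelling of the proposed var[.i < 3]: first two values per group
first_two <- DT[, .(v = v[seq_len(.N) < 3L]), by = g]

# Materialising the within-group index as a column, two ways
DT[, local_i := seq_len(.N), by = g]  # explicit, grouped
DT[, local_i2 := rowid(g)]            # data.table's rowid(), no by needed
```

`rowid()` covers the case where the index is wanted as a column, but it does not help inside a grouped `j` expression, which is where a short symbol would matter.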

I understand (and appreciate!) the aversion to over-loading data.table with arcane symbols, but <opinion>I think that anyone who can handle .I and .N can conquer .i quickly, given the tight relationship to .I.</opinion>

Just one more parallel to draw--var[.N] is redundant with var[length(var)], but .len was eschewed for the (I think) clearly superior .N; .end also would have worked but would seem more obtuse in other contexts (e.g. var:=.end).

Food for thought! Thanks for the consideration.

@franknarf1
Contributor

Or... .GRPI?

Personally, I think it would be easier for new users if the shortcuts were revamped to not only include this extra one but also to be consistent in some sense, like

  • capital/lowercase (which I also prefer, being a mathy type) or
  • .I & .N / .GRPI & .GRPN or
  • .I & .N / .GI & .GN (also making .G an alias or replacement for .GRP) or
  • .DTI & .DTN / .I & .N

Breaking compatibility like this isn't so great, and I'd settle for .i or .GRPI or .GI or even .seqN (though it strikes me as too R-ish) alone. I'd use that shortcut all the time.

@MichaelChirico
Member Author

I've added an example to the main post of when my instinct was to use .n and .N but needed to use nrow(dt) instead
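A sketch of that kind of situation, assuming a toy table: inside a grouped `j`, `.N` is the group size, so the table-wide row count has to come from `nrow()` on the table itself. The names `n_total`, `n_group`, and `share` are illustrative only.

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b", "b", "b"))

# The "global .N" must be fetched from outside the grouped expression
n_total <- nrow(DT)

shares <- DT[, .(n_group = .N,            # local: rows in this group
                 share   = .N / n_total), # global denominator
             by = g]
print(shares)
```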

@franknarf1
Contributor

franknarf1 commented Apr 25, 2017

Maybe related: it might be nice to have .NGRP for the number of groups. E.g., here it could use the condition if (.GRP != .NGRP) instead of if(.I[.N] < nrow(DT)). http://stackoverflow.com/a/43615843/

This would also be nice for easily tracking progress by throwing a print(.GRP/.NGRP) into j.
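Without such a symbol, the group count can be computed up front. The sketch below assumes a toy table and uses `uniqueN()` for the count; `n_groups`, `done`, and `is_last` are made-up names, and `.NGRP` itself is deliberately not used here.

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b", "b", "b"), v = 1:5)

# Number of groups, computed before the grouped call
n_groups <- uniqueN(DT$g)

progress <- DT[, .(
  done    = .GRP / n_groups,  # fraction of groups processed so far
  is_last = .GRP == n_groups  # stand-in for the proposed .GRP == .NGRP test
), by = g]
print(progress)
```

This works when grouping by existing columns; for a computed `by` expression the count has to be derived from the same expression, which is part of why a built-in symbol is appealing.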

@st-pasha
Contributor

How about the following scheme

| Current | New symbol | Meaning |
|---|---|---|
| .I (if no groups) | .I | row number in the resulting data.table |
| .N (if no groups) | .N | number of rows in the resulting data.table (may not be always computable) |
| ? | x.I | row number in DT |
| ? | x.N | number of rows in DT |
| ? | i.I | row number in i data.table when joining |
| ? | i.N | number of rows in i data.table when joining |
| .SD | .SD | data.table with subset of data within the current group |
| .I | .SD.I | row number within the current group |
| .N | .SD.N | number of rows within the current group |
| .BY | .BY | data.table with all groupby keys, OR current key within the current group |
| .GRP | .BY.I | group counter |
| ? | .BY.N | number of groups |

@MichaelChirico
Member Author

Not sure when symbol overload kicks in... certainly most seem intuitive (though I admit I don't immediately get .BY.I/.BY.N). Why not .GRP.I and .GRP.N?

And why wouldn't .N always be computable? Unless there's some plan for distributing data.table?

The primary concern remains the introduction of code-breaking behavior.

@st-pasha
Contributor

The idea is that ?.I is always an index within some data.table, where ? explicitly states which one (and an empty ? means the data.table which is being constructed, and hence has no name yet). Similarly, ?.N always denotes the number of rows in that same data.table.

The symbols .BY.I and .BY.N refer to the data.table .BY, which is a currently existing symbol that denotes the data.table of all unique group-by keys. By contrast, .GRP currently means the "group counter", so .GRP.I/.GRP.N would require a change in the meaning of .GRP.

I was trying to make a suggestion that is least breaking and most logically consistent. As it stands, it only changes the meaning of .I and .N, and only within the group-by context.

.N may not be computable if the j expression returns a data.table with an unpredictable number of rows. So if you have an expression like DT[, {if(.N>5) .SD else data.table()} ] then it is impossible to know how many rows there will be in the resulting data.table (which is the new meaning of .N) until you actually construct that data.table.
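A grouped variant of that expression shows the point on a toy table (the data and the threshold of 2 are assumptions for illustration): the number of rows in the result depends on which groups pass the condition, so it is only known after `j` has been evaluated.

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b", "b", "b"), v = 1:5)

# Keep only groups with more than 2 rows; group "a" contributes nothing
res <- DT[, if (.N > 2L) .SD, by = g]

# The row count of the result is known only once j has run for every group
nrow(res)
```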

@MichaelChirico
Member Author

Adding this potentially confusing syntax to this issue (not sure if worth fixing):

library(data.table)
testDT = data.table(full1 = LETTERS, full2 = letters)
# in i, .N = nrow(testDT), i.e. 26
testDT[seq(1, .N, by = 2L),
       # in j, .N = number of rows selected by i, i.e. .5 * nrow(testDT) = 13
       stagger1 := rnorm(.N)]
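The two meanings can be checked directly. This sketch reuses the same table; `n_in_i` and `n_in_j` are made-up names.

```r
library(data.table)

testDT <- data.table(full1 = LETTERS, full2 = letters)

# .N as seen in i is the full row count
n_in_i <- nrow(testDT)

# .N as seen in j is the number of rows that survived the i subset
n_in_j <- testDT[seq(1, .N, by = 2L), .N]

c(n_in_i = n_in_i, n_in_j = n_in_j)
```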

@jangorecki changed the title from "Feature Request: local and global versions of .I, .N" to "local and global versions of .I, .N" on Apr 6, 2020
@daynefiler

Just wanted to ping this as a great feature request -- I would very regularly use a version of .I and second the proposal for .GRPI.

@MichaelChirico added the breaking-change label (issues whose solution would require breaking existing behavior) on Nov 10, 2023
@jangorecki added this to the 2.0.0 milestone on Jan 7, 2024

6 participants