[R-Forge #2696] change data.table by-without-by syntax to require a "by" #371

arunsrinivasan · 2014-06-08T13:11:40Z

Submitted by: Eduard Antonyan; Assigned to: Nobody; R-Forge link

This request stems from the following SO thread:

Old-timers, and of course the author of the package, are probably so used to this behavior that the following examples may not seem unusual to them. Still, I'll do my best to show the progression of results that someone relatively new to the package would expect (I've been using it for about a month now, and love it so far), and how the current syntax breaks those expectations and forces one to go through extensive investigation to figure out what's going on.

Let's take:

d = data.table(a = c(1,1,1,2,2,3,4), b = c(1,1,3,4,5,6,7), c = 1:7, key = "a")
t = data.table(a = c(1,2), key = "a")
z = data.table(a = 3, key = "a")

# first, the set up - getting to know data.table 
# i,j,by syntax and running a few commands
d[6]
#    a b c
#1: 3 6 6

d[6, a]
# [1] 3

d[6, b]
# [1] 6

d[a <= 2]
#    a b c
#1: 1 1 1
#2: 1 1 2
#3: 1 3 3
#4: 2 4 4
#5: 2 5 5

d[a <= 2, sum(c)]
# [1] 15

d[a <= 2, sum(c), by = a]
#    a V1
#1: 1  6
#2: 2  9

# ok, so with the above set-up, let's do some merges and see what the results are (together with what I contend the results *should* be with that syntax)

d[z]
#    a b c
#1: 3 6 6

d[z, a]
#    a a
#1: 3 3
# "should" be
# [1] 3
# to get the above result, one "should" type instead d[z, a, by = a]

d[z, b]
#    a b
#1: 3 6
# "should" be
# [1] 6

d[t]
# prints same output as d[a <= 2]

d[t, sum(c)]
# prints same output as d[a <= 2, sum(c), by = a]
# "should" print same output as d[a <= 2, sum(c)]

d[t, sum(c), by = a]
# complains and prints same output as above ("should" not complain, and should silently do the by-without-by, for speed reasons, internally)

d[t, sum(c), by = b]
# no complaints and does exactly what one would expect, i.e. same as d[a <= 2, sum(c), by = b]

I can see how this may not seem obviously off to someone who's been relying on the current behavior for a while, but please believe me when I say that, for someone who's just getting to know the package, the current behavior makes no sense. Yes, it's documented in no fewer than 3 FAQ points (which seems to indicate that this syntax is a stumbling block not just for me), but that doesn't make it any less unintuitive.

The above completely breaks the reading of d[i, j, by = b] as "take d, apply i, then return j by b" and instead turns it into "take d, apply i; if there is no b, return j by key; else if b == key, complain and return j by b; else return j by b". I hope you can see how the latter interpretation of the syntax is much more complicated and needlessly taxing for the user.
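To make the simpler reading concrete, here is a minimal sketch that spells it out as separate steps by chaining. It relies only on d[t] returning the joined rows, as in the examples above; the names joined, total and per_key are introduced here purely for illustration. Chaining gives up the speed of doing the merge and the "by" in one pass, so this is meant to show the intended semantics, not to replace the fast path.

```r
library(data.table)

d = data.table(a = c(1,1,1,2,2,3,4), b = c(1,1,3,4,5,6,7), c = 1:7, key = "a")
t = data.table(a = c(1,2), key = "a")

# step 1: take d and apply i (the merge) on its own
joined = d[t]

# step 2: evaluate j over all merged rows, with no grouping
total = joined[, sum(c)]
# [1] 15

# step 3: group only when a "by" is explicitly asked for
per_key = joined[, sum(c), by = a]
#    a V1
#1: 1  6
#2: 2  9
```
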

Let me be very clear - I love data.table, and I love that it's trying to be fast when I merge and do a "by" by the key of the merge, but it really shouldn't be doing that "by" unless I ask for it specifically (and if I do, it should of course do the automagical merge and by at the same time).

arunsrinivasan closed this as completed Jun 8, 2014

arunsrinivasan modified the milestone: v1.9.4 Jun 19, 2014

arunsrinivasan added the High label Jun 20, 2014

arunsrinivasan assigned mattdowle Jun 20, 2014

arunsrinivasan mentioned this issue Jun 23, 2014

[R-Forge #5297] Speed in rolling joins #538

Closed
