Since this is something that old-timers, and of course the author of the package, are probably very used to, the following examples may not seem unusual to them. However, I'll do my best to show the progression of expected results for someone relatively new to the package (I've been using it for about a month now, and love it so far), and how the current syntax breaks expectations and forces the user to go through extensive investigation to figure out what's going on.
Let's take:
d = data.table(a = c(1,1,1,2,2,3,4), b = c(1,1,3,4,5,6,7), c = 1:7, key = "a")
t = data.table(a = c(1,2), key = "a")
z = data.table(a = 3, key = "a")
# first, the set up - getting to know data.table
# i,j,by syntax and running a few commands
d[6]
#   a b c
#3: 3 6 6
d[6, a]
# [1] 3
d[6, b]
# [1] 6
d[a <= 2]
#   a b c
#1: 1 1 1
#1: 1 1 2
#1: 1 3 3
#2: 2 4 4
#2: 2 5 5
d[a <= 2, sum(c)]
# [1] 15
d[a <= 2, sum(c), by = a]
#   a V1
#1: 1 6
#2: 2 9

# ok, so with the above set-up, let's do some merges and see what the results are
# (together with what I contend the results *should* be with that syntax)
d[z]
#   a b c
#3: 3 6 6
d[z, a]
#   a a
#1: 3 3
# "should" be
# [1] 3
# to get the above result, one "should" type instead d[z, a, by = a]
d[z, b]
#   a b
#1: 3 6
# "should" be
# [1] 6
d[t]
# prints same output as d[a <= 2]
d[t, sum(c)]
# prints same output as d[a <= 2, sum(c), by = a]
# "should" print same output as d[a <= 2, sum(c)]
d[t, sum(c), by = a]
# complains and prints same output as above ("should" not complain, and should silently do the by-without-by, for speed reasons, internally)
d[t, sum(c), by = b]
# no complaints and does exactly what one would expect, i.e. same as d[a <= 2, sum(c), by = b]
I can see how this may not seem obviously off to someone who's been relying on the current behavior for a while, but please believe me when I say that, for someone who's just getting to know the package, the current behavior makes no sense. Yes, it's documented in no fewer than 3 FAQ points (which seems to indicate that this syntax is a stumbling block not just for me), but that doesn't make it any less unintuitive.
The above completely breaks the reading of d[i,j,by=b] as "take d, apply i, then return j by b" and instead converts it to "take d, apply i; if no b, then return j by key; else if b and b == key, complain and return j by b; else return j by b". I hope you can see how the latter interpretation of the syntax is much more complicated and needlessly taxing on the user.
Let me be very clear - I love data.table, and I love that it's trying to be fast when I merge and do a "by" by the key of the merge, but it really shouldn't be doing that "by" unless I ask for it specifically (and if I do, it should of course do the automagical merge and by at the same time).
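To make the "take d, apply i, then return j by b" reading concrete, here is a minimal sketch that spells it out in separate steps, assuming only the d and t tables defined above. Because the merge and the j expression are split into two calls, no i is present when j is evaluated, so the result should not depend on any implicit grouping by the join key. The names m, total and per_key are mine and purely illustrative.

```r
library(data.table)

# Same tables as in the examples above.
d = data.table(a = c(1,1,1,2,2,3,4), b = c(1,1,3,4,5,6,7), c = 1:7, key = "a")
t = data.table(a = c(1,2), key = "a")

# Step 1: do the merge on its own; this keeps the rows of d that match t.
m = d[t]

# Step 2: apply j to the merged rows. There is no i in this call,
# so there is nothing to group by implicitly.
total = m[, sum(c)]            # one overall sum: 15

# Step 3: group only when it is asked for explicitly.
per_key = m[, sum(c), by = a]  # one sum per value of a: 6 and 9
```

This costs an intermediate copy of the merged rows, which is exactly the overhead the automatic merge-and-by avoids; the point of the request is only that the grouped form should be something the user opts into.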
Submitted by: Eduard Antonyan; Assigned to: Nobody; R-Forge link
This request stems from the following SO thread: