Add examples using .BY to documentation #1363

mattdowle · 2015-09-25T21:53:39Z

Two people expressed they'd like more examples.
http://stackoverflow.com/questions/22511301/how-to-benefit-from-by-in-data-table
http://brooksandrew.github.io/simpleblog/articles/advanced-data-table/ (see comments)

MichaelChirico · 2015-09-29T01:51:22Z

This would be nice. I've never understood .BY and still don't quite understand it after reading the links here (FWIW, I'm not on a computer with R, so I couldn't run practice code). Still, I feel it must be useful in my code somewhere.

#Edit:

Finally used this today! Exactly as suggested on Andrew Brooks' blog -- leave-one-out mean estimation to test sensitivity of a mean to certain observations.

mattdowle · 2015-09-30T22:27:12Z

Two examples:
http://stackoverflow.com/a/22694260/403310
http://stackoverflow.com/a/22512179/403310

MichaelChirico · 2015-12-02T03:18:51Z

This seems like a pretty good example, if I do say so myself ;-)

http://stackoverflow.com/questions/34033613/how-can-i-write-the-function-that-writes-multiple-excel-files-for-each-unique-id/

(I'm quite self-satisfied to have used .BY twice in two different ways in one line of useful code 😄)
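A minimal sketch of the file-per-group idea from the linked question, assuming CSV output via fwrite rather than Excel. The table `dt`, the directory `out_dir`, and the file-name pattern are hypothetical and chosen only for illustration:

```r
library(data.table)

# Hypothetical toy data: one file will be written per unique id
dt <- data.table(id = c("a", "a", "b"), value = 1:3)
out_dir <- tempdir()

# Within each group, .SD holds that group's rows (without the by column)
# and .BY$id is the group's id value, used here to build the file name
invisible(
  dt[, fwrite(.SD, file.path(out_dir, paste0("id_", .BY$id, ".csv"))), by = id]
)
```

Because the by column is not part of .SD, each file here contains only the `value` column; the id lives in the file name.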

MichaelChirico · 2016-08-04T15:30:10Z

Where do we want these to go?

Came up with an example I think shows a typical use case of .BY -- leave-one-out averaging:

set.seed(20160804)
students <- data.table(id = sample(10, 100, replace = TRUE), 
                       score = rnorm(100, mean = 75, sd = 10))

#average score in the class for students APART FROM oneself
students[ , students[id != .BY$id, mean(score)], by = id]

#more real-world version: many classes, many grades
NN <- 1e4
students <- data.table(grade = sample(12, NN, TRUE),
                       class = sample(8, NN, TRUE))
students[ , id := sample(.N/10, .N, TRUE), by = .(grade, class)]
students[ , score := rnorm(NN, mean = 75, sd = 10)]

students[ , students[grade == .BY$grade & 
                       class == .BY$class &
                       id != .BY$id,  mean(score)],
          by = .(grade, class, id)]

franknarf1 · 2016-08-04T15:53:21Z

@MichaelChirico Leave-one-out means will probably always be faster when computed like this:

students[ , 
  (sum(score)-score)/(.N-1)
, by = .(grade, class)]

Maybe medians or something would make sense.

You seem to have repeating ids within the same class and grade. Not obvious what that means here, but the extension is...

res = students[, .(v = sum(score), n = .N), by=.(grade,class,id)][ , 
  .(id, V1 = (sum(v)-v)/(sum(n)-n))
, by = .(grade, class)]

mcres = students[ , students[grade == .BY$grade & 
                   class == .BY$class &
                   id != .BY$id,  mean(score)],
      by = .(grade, class, id)]

fsetequal(res, mcres) # TRUE
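For a statistic without a subtraction shortcut, such as the median mentioned above, the .BY pattern can be sketched as follows. This is only an illustration on a hypothetical six-row table `scores`, not a benchmark or a recommendation:

```r
library(data.table)

# Hypothetical toy data: two test scores per student
scores <- data.table(id = rep(1:3, each = 2),
                     score = c(70, 80, 60, 90, 75, 85))

# For each id, take the median over all rows belonging to other ids;
# .BY$id is the id of the group currently being processed
loo <- scores[, .(loo_median = scores[id != .BY$id, median(score)]), by = id]
```

The inner subset rescans the whole table once per group, so the same performance caveat applies on large data.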

MichaelChirico · 2016-08-04T16:54:33Z

That's just multiple tests for each student (also the case in the first example).

The goal here is to illustrate a use for .BY; I think the code reads very cleanly.

franknarf1 · 2016-08-04T17:29:02Z

@MichaelChirico I think the goal should be not just to illustrate what .BY is, but also a good use-case. Scrolling up, I think your earlier example and the others are more convincing (adding .BY to file names or plot titles or merging on other tables).

The performance here is pretty abysmal (1 sec for mcres vs ~instant for res), which is why the shortcut approach is usually recommended for leave-one-out means. And, while the code is readable, it involves manually rewriting the var names several times. Regarding the latter problem, there's

res2 = students[ , students[.BY, on=setdiff(names(.BY), "id")][id != .BY$id,  mean(score)]
 , by = .(grade, class, id)]

fsetequal(res, res2) # TRUE

While this dodges the retype-every-varname-twice issue, it is even slower, as one might expect.

Anyway, we can agree to disagree.

Henrik-P · 2017-09-24T19:34:58Z

@mattdowle I see that you are pointing to How to benefit from .BY in data.table?, a nice answer which uses .BY together with paste. When plotting, it's quite common to order plots (or bars/boxes) using factor and specify levels in the desired order. Therefore, it may be worth pointing out that .BY based on a "factor by" needs to be unlisted (or am I wrong here?) when used with paste, to avoid coercion of the grouping variable to its integer representation (see the very last example below).

The help text states that ".BY is a list", but it's not immediately obvious that the "length 1 vector for each item in by" is also a list. This fact is not revealed in the seemingly simple example with paste and .BY in the link above. Perhaps the structure of the individual items of .BY could be described slightly more explicitly in the help text?

Some examples:

Non-factor grouping variable in by, just as in the linked example. No problem to paste .BY:

d <- data.table(grp = rep(c("a", "b"), each = 2))

d[ , .BY, by = grp]
#    grp BY
# 1:   a  a
# 2:   b  b

d[ , paste("Group: ", .BY), by = grp]
#    grp        V1
# 1:   a Group:  a
# 2:   b Group:  b

Factor grouping variable in by. .BY is coerced to its integer codes when using paste:

d[ , grp_fac := factor(grp, levels = c("b", "a"))]

d[ , .BY, by = grp_fac]
#    grp_fac BY
# 1:       a  a
# 2:       b  b

d[ , paste("Group: ", .BY), by = grp_fac]
#    grp_fac        V1
# 1:       a Group:  2
# 2:       b Group:  1
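One way to keep the factor labels, sketched under the assumption that extracting the element from the .BY list before pasting is acceptable (the column name `label` is hypothetical):

```r
library(data.table)

d <- data.table(grp = rep(c("a", "b"), each = 2))
d[, grp_fac := factor(grp, levels = c("b", "a"))]

# .BY is a list, so pasting it directly coerces the factor inside it to its
# integer code. Extracting the element (.BY$grp_fac or .BY[[1]]) yields the
# length-1 factor itself, and as.character() then returns its label.
labels <- d[, .(label = paste0("Group: ", as.character(.BY$grp_fac))), by = grp_fac]
```

Extracting by name also extends to several grouping variables, where each element of .BY can be formatted separately.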

mattdowle added the documentation label Sep 25, 2015

MichaelChirico mentioned this issue Feb 16, 2020

add some more .BY use cases to the documentation #4245

Merged

mattdowle added this to the 1.12.9 milestone Feb 16, 2020

mattdowle closed this as completed in #4245 Feb 17, 2020

jangorecki modified the milestones: 1.12.11, 1.12.9 May 26, 2020
