Add examples using .BY to documentation #1363

Closed
mattdowle opened this issue Sep 25, 2015 · 8 comments · Fixed by #4245

@mattdowle
Member

Two people expressed they'd like more examples.
http://stackoverflow.com/questions/22511301/how-to-benefit-from-by-in-data-table
http://brooksandrew.github.io/simpleblog/articles/advanced-data-table/ (see comments)

@MichaelChirico
Member

This would be nice. I've never understood .BY and still don't quite understand it after reading the links here (fwiw I'm not on a computer with R, so I couldn't run practice code)... still, I feel it must be useful in my code somewhere.

Edit:

Finally used this today! Exactly as suggested on Andrew Brooks' blog -- leave-one-out mean estimation to test sensitivity of a mean to certain observations.

@mattdowle
Member Author

@MichaelChirico
Member

Here's what seems like a pretty good example, if I do say so myself ;-)

http://stackoverflow.com/questions/34033613/how-can-i-write-the-function-that-writes-multiple-excel-files-for-each-unique-id/

(I'm quite self-satisfied to have used .BY twice in two different ways in one line of useful code 😄)

@MichaelChirico
Member

Where do we want these to go?

Came up with an example I think shows a typical use case of .BY -- leave-one-out averaging:

set.seed(20160804)
students <- data.table(id = sample(10, 100, replace = TRUE), 
                       score = rnorm(100, mean = 75, sd = 10))

#average score in the class for students APART FROM oneself
students[ , students[id != .BY$id, mean(score)], by = id]

#more real-world version: many classes, many grades
NN <- 1e4
students <- data.table(grade = sample(12, NN, TRUE),
                       class = sample(8, NN, TRUE))
students[ , id := sample(.N/10, .N, TRUE), by = .(grade, class)]
students[ , score := rnorm(NN, mean = 75, sd = 10)]

students[ , students[grade == .BY$grade & 
                       class == .BY$class &
                       id != .BY$id,  mean(score)],
          by = .(grade, class, id)]

@franknarf1
Contributor

@MichaelChirico Leave-one-out means will probably always be faster when computed like this:

students[ , 
  (sum(score)-score)/(.N-1)
, by = .(grade, class)]

Maybe medians or something would make sense.
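
As a rough sketch of what a median version could look like (a median cannot be recovered from group sums, so the subtract-and-divide shortcut does not carry over), assuming a small made-up `students` table; the seed, sizes, and the `loo_median` and `median_others` names are illustrative only:

```r
library(data.table)

set.seed(1)
students <- data.table(id = sample(5, 40, replace = TRUE),
                       score = rnorm(40, mean = 75, sd = 10))

# For each id, take the median score over all rows that belong to other ids.
# .BY$id is the current group's id as a length-1 vector.
loo_median <- students[ ,
  .(median_others = students[id != .BY$id, median(score)]),
  by = id]
```

This still rescans the full table once per group, so it is meant to illustrate .BY rather than to be fast.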

You seem to have repeating ids within the same class and grade. Not obvious what that means here, but the extension is...

res = students[, .(v = sum(score), n = .N), by=.(grade,class,id)][ , 
  .(id, V1 = (sum(v)-v)/(sum(n)-n))
, by = .(grade, class)]

mcres = students[ , students[grade == .BY$grade & 
                   class == .BY$class &
                   id != .BY$id,  mean(score)],
      by = .(grade, class, id)]

fsetequal(res, mcres) # TRUE

@MichaelChirico
Member

That's just multiple tests for each student (also the case in the first example).

The goal here is to illustrate a use for .BY; I think the code reads very cleanly.

@franknarf1
Contributor

@MichaelChirico I think the goal should be not just to illustrate what .BY is, but also a good use-case. Scrolling up, I think your earlier example and the others are more convincing (adding .BY to file names or plot titles or merging on other tables).
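
A minimal sketch of the file-name use case, assuming a toy `sales` table and a temporary output directory (the table, the column names, and the `sales_` file prefix are all made up for illustration):

```r
library(data.table)

sales <- data.table(region = c("north", "north", "south"),
                    amount = c(10, 20, 30))
out_dir <- tempfile("by_demo_")   # hypothetical output directory
dir.create(out_dir)

# Write one CSV per region; .BY$region supplies the group value for the
# file name. .SD holds the group's rows without the grouping column.
invisible(sales[ , {
  fwrite(.SD, file.path(out_dir, paste0("sales_", .BY$region, ".csv")))
  NULL
}, by = region])

list.files(out_dir)
```

Here j is run only for its side effect, so it returns NULL for every group and the resulting table is discarded.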

The performance here is pretty abysmal (1 sec for mcres vs ~instant for res), which is why the shortcut approach is usually recommended for leave-one-out means. And, while the code is readable, it involves manually rewriting the variable names several times. Regarding the latter problem, there's this:

res2 = students[ , students[.BY, on=setdiff(names(.BY), "id")][id != .BY$id,  mean(score)]
 , by = .(grade, class, id)]

fsetequal(res, res2) # TRUE

While this dodges the retype-every-varname-twice issue, it is even slower, as one might expect.

Anyway, we can agree to disagree.

@Henrik-P

Henrik-P commented Sep 24, 2017

@mattdowle I see that you are pointing to How to benefit from .BY in data.table?, a nice answer which uses .BY together with paste. When plotting, it's quite common to order plots (or bars/boxes) using factor and specify levels in the desired order. Therefore, it may be worth pointing out that .BY based on a "factor by" needs to be unlisted (or am I wrong here?) when used with paste, to avoid coercion of the grouping variable to its integer representation (see the very last example below).

Another point: the help text states ".BY is a list", but it's not immediately obvious that the "length 1 vector for each item in by" is also a list. This fact is not revealed in the seemingly simple example with paste and .BY in the link above. Perhaps the structure of the individual items of .BY could be described slightly more explicitly in the help text?


Some examples:

Non-factor grouping variable in by, just as in the linked example. No problem to paste .BY:

d <- data.table(grp = rep(c("a", "b"), each = 2))

d[ , .BY, by = grp]
#    grp BY
# 1:   a  a
# 2:   b  b

d[ , paste("Group: ", .BY), by = grp]
#    grp        V1
# 1:   a Group:  a
# 2:   b Group:  b

Factor grouping variable in by. .BY is coerced when using paste:

d[ , grp_fac := factor(grp, levels = c("b", "a"))]

d[ , .BY, by = grp_fac]
#    grp_fac BY
# 1:       a  a
# 2:       b  b

d[ , paste("Group: ", .BY), by = grp_fac]
#    grp_fac        V1
# 1:       a Group:  2
# 2:       b Group:  1
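
One possible workaround, sketched under the assumption that each element of .BY keeps the factor class of its grouping column: extract the element (`.BY$grp_fac` or `.BY[[1]]`) before pasting, rather than pasting the list itself. The `by_info` and `labels` names are illustrative only.

```r
library(data.table)

d <- data.table(grp = rep(c("a", "b"), each = 2))
d[ , grp_fac := factor(grp, levels = c("b", "a"))]

# .BY itself is a list with one element per grouping column.
by_info <- d[ , .(by_is_list = is.list(.BY), n_items = length(.BY)),
             by = grp_fac]

# Pasting the extracted element instead of the whole list gives the
# factor's label rather than its integer code.
labels <- d[ , .(label = paste("Group:", as.character(.BY$grp_fac))),
            by = grp_fac]
```

The explicit `as.character()` is not strictly required, since `paste` converts a factor vector to its labels, but it makes the intent visible.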
