Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proper nest/unnest functions #3672

Open
randomgambit opened this issue Jun 30, 2019 · 16 comments
Open

proper nest/unnest functions #3672

randomgambit opened this issue Jun 30, 2019 · 16 comments

Comments

@randomgambit
Copy link

Hello,

I am trying to get the equivalent of nest() / unnest() in data.table and I wonder if a new function would make more sense. Consider this little example


tibble(group = c(1,1,1,2,2,2),
      val = c('hello', 'world','hello', 'world','hello', 'world'),
      col = c(1,2,3,4,5,6)) %>% 
  group_by(group) %>% nest() %>% 
  mutate(newval = group + 1)

# A tibble: 2 x 3
  group data             newval
  <dbl> <list>            <dbl>
1     1 <tibble [3 x 2]>      2
2     2 <tibble [3 x 2]>      3

I am not sure the same can be done with data.table. Look at the mysterious output I get after this. Am I missing something?


mydf[, list(listcol=list(data.table(val, col))), by=group][
  , newval := group + 1]
   group    val newval
1:     1 <list>      3
2:     1 <list>      3
3:     1 <list>      3
4:     2 <list>      3
5:     2 <list>      3
6:     2 <list>      3
@franknarf1
Copy link
Contributor

franknarf1 commented Jun 30, 2019

Re the mysterious output, yeah that does look odd (having newval := group + 1 and yet that output, and also listcol not being a column name...). Could you provide a reproducible example, preferably not using packages that are not required? Some more guidance is here: https://github.com/Rdatatable/data.table/wiki/Support

As far as I know, nest and unnest are feasible in data.table already, though there is a proposal to add a fast unnest: #2146

(Wrapping in list as you have done is nesting. For unnest, there are several options depending on what specifically you want to do. Rep and unlist work for many cases: https://stackoverflow.com/a/38796822)

@randomgambit
Copy link
Author

sure you can simply use

mydf <- data.table(group = c(1,1,1,2,2,2),
           val = c('hello', 'world','hello', 'world','hello', 'world'),
           col = c(1,2,3,4,5,6)) 

and this seems to work


> mydf[, list(listcol=list(data.table(val, col))), by=group]
   group      listcol
1:     1 <data.table>
2:     2 <data.table>

but the rest seems to break down the list-column format. Ideally, I would like to use purrr::map() on the nest data.frame. Yes, using the best for both worlds :)

@randomgambit
Copy link
Author

wow but the nest() ticket has been open for more than 2 years!

@moodymudskipper
Copy link

moodymudskipper commented Jul 5, 2019

@randomgambit your problem is that you used assignment by reference on top of an aggregating call, it seems like data.table doesn't like it too much, this works :

library(data.table)
mydf <- data.table(group = c(1,1,1,2,2,2),
                   val = c('hello', 'world','hello', 'world','hello', 'world'),
                   col = c(1,2,3,4,5,6)) 
  
mydf <- mydf[, .(listcol=list(data.table(val, col))), by=group]
mydf [, newval := group + 1]
mydf
#>    group      listcol newval
#> 1:     1 <data.table>      2
#> 2:     2 <data.table>      3

This also works, with the strange effect that mydf needs to called twice to be printed (bug ?) :

mydf <- data.table(group = c(1,1,1,2,2,2),
                   val = c('hello', 'world','hello', 'world','hello', 'world'),
                   col = c(1,2,3,4,5,6)) 

mydf <- mydf[, .(listcol=list(data.table(val, col))), by=group][, newval := group + 1]
mydf
mydf
#>    group      listcol newval
#> 1:     1 <data.table>      2
#> 2:     2 <data.table>      3

We could create a function .nest as a shortcut :

.nest <- function(...) list(data.table(...))

mydf <- mydf[, .(listcol= .nest(val, col)), by=group]
mydf [, newval := group + 1]
mydf
#>    group      listcol newval
#> 1:     1 <data.table>      2
#> 2:     2 <data.table>      3

As this function doesn't make sense outside of [.data.table it could be defined in its scope only.

Could you be more precise about what you would like to do with purrr::map but feel you can't ?


As for unnest it seems that we can get help from tidyr without any conversion to tibble. It might be possible to get the best of both worlds after all!

library(tidyr)
unnest(mydf)
#>    group newval   val col
#> 1:     1      2 hello   1
#> 2:     1      2 world   2
#> 3:     1      2 hello   3
#> 4:     2      3 world   4
#> 5:     2      3 hello   5
#> 6:     2      3 world   6

# or
mydf[,unnest(.SD)]
#>    group newval   val col
#> 1:     1      2 hello   1
#> 2:     1      2 world   2
#> 3:     1      2 hello   3
#> 4:     2      3 world   4
#> 5:     2      3 hello   5
#> 6:     2      3 world   6

@jaapwalhout
Copy link

jaapwalhout commented Jul 5, 2019

This also works, with the strange effect that mydf needs to called twice to be printed (bug ?)

This is expected behavior since at least version 1.9.6; see here for reference.

@randomgambit
Copy link
Author

randomgambit commented Jul 5, 2019

Ah that is pretty great. As for

Could you be more precise about what you would like to do with purrr::map but feel you can't ?

Here is in my opinion what makes tibble superior to DT (for now :)) for data analysis. Look at this wonderful output:

mydf <- tibble(group = c(1,1,1,2,2,2),
                   val = c('hello', 'world','hello', 'world','hello', 'world'),
                   col = c(1,2,3,4,5,6)) 

myreg <- function(df){
  lm(col ~ I(val), data = df)
}

mycoef <- function(obj){
  obj %>% broom::tidy()
}

mydf %>% 
  group_by(group) %>% 
  nest() %>% 
  mutate(myreg = map(data, ~myreg(.x)),
         myoutput = map(myreg, ~mycoef(.x)))

# A tibble: 2 x 4
  group data             myreg    myoutput        
  <dbl> <list>           <list>   <list>          
1     1 <tibble [3 x 2]> <S3: lm> <tibble [2 x 5]>
2     2 <tibble [3 x 2]> <S3: lm> <tibble [2 x 5]>

Here, by just looking at the R console I can see that myreg is a lm object, and that the regression by group went fine because myoutput contains non-empty tibbles.

I would love being able to encapsulate DTs or tibbles in a DT and get the same functionality. But this requires proper unnest()/nest() functions (and proper printing functions ... see my other github issue #3671).

Here, it is important to be able to unnest myoutput while preserving the other variables (and possibly the other list-column myreg). I am not sure this works right now in DT.

Thanks!

@danielarantes
Copy link

The only real difference I see in the data.table approach (below what I would do...) is the print doesn't show the size of the objects in the list column.
To unnest you can use tidyr as pointed out earlier.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(purrr)
library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following object is masked from 'package:purrr':
#> 
#>     transpose
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, coalesce, first, last

mydf <- tibble(group = c(1,1,1,2,2,2),
               val = c('hello', 'world','hello', 'world','hello', 'world'),
               col = c(1,2,3,4,5,6)) 

myreg <- function(df){
  lm(col ~ I(val), data = df)
}

mycoef <- function(obj){
  obj %>% broom::tidy()
}

mydt <- setDT(mydf)

mydt[, .(data = list(.SD)), group]  %>% 
  .[, {
    myreg = lapply(data, FUN = function(x) myreg(x))
    myoutput = lapply(myreg, FUN = function(x) mycoef(x))
    list(data = data, myreg = myreg, myoutput = myoutput)
  }, group]
#>    group         data myreg myoutput
#> 1:     1 <data.table>  <lm> <tbl_df>
#> 2:     2 <data.table>  <lm> <tbl_df>
# OR this other way with more intermediary steps and unnesting at the end

mydt[, .(data = list(.SD)), group] %>% 
  .[, .(data, myreg = lapply(data, FUN = function(x) myreg(x))), group] %>% 
  .[, .(data, myreg, myoutput = lapply(myreg, FUN = function(x) mycoef(x))), group] %>% 
  unnest(myoutput)
#>    group        term     estimate std.error     statistic   p.value
#> 1:     1 (Intercept)  2.00000e+00  1.000000  2.000000e+00 0.2951672
#> 2:     1 I(val)world -5.43896e-16  1.732051 -3.140185e-16 1.0000000
#> 3:     2 (Intercept)  5.00000e+00  1.414214  3.535534e+00 0.1754797
#> 4:     2 I(val)world  8.15844e-16  1.732051  4.710277e-16 1.0000000

Created on 2019-07-05 by the reprex package (v0.3.0)

@moodymudskipper
Copy link

This is what I thought, we're so used at considering data.table and tidyverse as competing paradigms that we don't see how much they can work hand in hand, and would probably even more if more effort was invested in that direction :

.nest <- function(...) list(data.table(...))

mydt <- as.data.table(mydf)

mydt <- mydt[, .(data = .nest(.SD)),  by = group]

# the following works with lapply but I wanted to show that map works just fine,
# including the formula notation and the potential quasiquotation stuff.

mydt[, myreg := map(data, myreg)][
  , myoutput := map(myreg, mycoef)]

mydt
# group         data myreg myoutput
# 1:     1 <data.table>  <lm> <tbl_df>
# 2:     2 <data.table>  <lm> <tbl_df>

I agree that it prints less nice. And I find the need to switch between assignment by copy and reference annoying, but to be fair it's quite nice and readable.

I tried to use tibble:::print.tbl() on mydt and it failed, but I think it's something the tidyverse team could sort out, it might even be a tidyverse bug in the sense that data.tables are data.frames and it's supposed to support data.frames.

Now to unnest a specific column :

# this removes other list columns but it does the same on your mydf tibble
mydt[,unnest(.SD, myoutput)]
# group        term     estimate std.error     statistic   p.value
# 1:     1 (Intercept)  2.00000e+00  1.000000  2.000000e+00 0.2951672
# 2:     1 I(val)world -5.43896e-16  1.732051 -3.140185e-16 1.0000000
# 3:     2 (Intercept)  5.00000e+00  1.414214  3.535534e+00 0.1754797
# 4:     2 I(val)world  8.15844e-16  1.732051  4.710277e-16 1.0000000

# to keep all we can use a join
mydt[,unnest(.SD), .SDcols = c("group","myoutput")][mydt, on="group"]
#    group        term     estimate std.error     statistic   p.value         data myreg myoutput
# 1:     1 (Intercept)  2.00000e+00  1.000000  2.000000e+00 0.2951672 <data.table>  <lm> <tbl_df>
# 2:     1 I(val)world -5.43896e-16  1.732051 -3.140185e-16 1.0000000 <data.table>  <lm> <tbl_df>
# 3:     2 (Intercept)  5.00000e+00  1.414214  3.535534e+00 0.1754797 <data.table>  <lm> <tbl_df>
# 4:     2 I(val)world  8.15844e-16  1.732051  4.710277e-16 1.0000000 <data.table>  <lm> <tbl_df>

@moodymudskipper
Copy link

ah Daniel I hadn't seen your answer, well that gives more options!

@randomgambit
Copy link
Author

Very interesting. I am all if for a better integration between DT and tidyverse.
@moodymudskipper when you say:

And I find the need to switch between assignment by copy and reference annoying

Is this related to #3682?

Also, perhaps the best solution is to make tibble:::print.tbl() accept DTs. The more I think about it, the more this looks like a genius option. I can file an issue on the tibble repo if you like.

@moodymudskipper
Copy link

moodymudskipper commented Jul 5, 2019

Actually I was wrong to describe it as a bug, it's an unexported method so it's reasonable that it works on tibbles only, and actually the following works so no need to file anything (except it could support data.table as a special class, as it doesn't now) :

as_tibble(mydt)
# A tibble: 2 x 4
  group data             myreg  myoutput        
  <dbl> <list>           <list> <list>          
1     1 <df[,2] [3 x 2]> <lm>   <tibble [2 x 5]>
2     2 <df[,2] [3 x 2]> <lm>   <tibble [2 x 5]>

@moodymudskipper
Copy link

The trick to use mydt[TRUE] to make sure we're copying is really nice, and for quick exploration mydt[T] is just 3 more keys.

@randomgambit
Copy link
Author

got it.

Also this syntax is quite interesting
mydt[,unnest(.SD), .SDcols = c("group","myoutput")][mydt, on="group"]

My understanding is that this only works because mydt is copied on the first place. So if I were to use := inside the first call then I should necessarily use the mydf[TRUE] insane hack :)

@jangorecki
Copy link
Member

Calling it hack is not appropriate, each data table query that doesn't update by reference will copy. The hack about TRUE is that it does shallow copy, not really relevant here. Making copies on [ is a regular behaviour.

@MichaelChirico
Copy link
Member

Thanks Jan, I was thinking 1:.N and NULL were the same behavior... so actually DT[TRUE] and DT[1:.N]/DT[NULL] have different effects (the latter being full copies)

@jangorecki
Copy link
Member

I think we can close this issue as duplicate of #2146. We don't need fast nest function because AFAIU it is a matter of wrapping in a list. Any comments?

@jangorecki jangorecki changed the title feature request: proper nest/unnest functions proper nest/unnest functions Jul 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants