Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lead/Lag behavior with nested column #3789

Closed
TylerGrantSmith opened this issue Aug 31, 2018 · 11 comments
Closed

Lead/Lag behavior with nested column #3789

TylerGrantSmith opened this issue Aug 31, 2018 · 11 comments
Labels
feature a feature request or enhancement funs 😆

Comments

@TylerGrantSmith
Copy link

TylerGrantSmith commented Aug 31, 2018

I'm not sure if this is an intentional design, but I believe that lead and lag do not behave appropriately with nested columns.

library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 3.4.4
#> -- Attaching packages -------------------------------- tidyverse 1.2.1 --
#> v ggplot2 3.0.0     v purrr   0.2.4
#> v tibble  1.4.2     v dplyr   0.7.6
#> v tidyr   0.8.1     v stringr 1.3.1
#> v readr   1.1.1     v forcats 0.2.0
#> Warning: package 'ggplot2' was built under R version 3.4.4
#> Warning: package 'tibble' was built under R version 3.4.4
#> Warning: package 'tidyr' was built under R version 3.4.4
#> Warning: package 'purrr' was built under R version 3.4.2
#> Warning: package 'dplyr' was built under R version 3.4.4
#> Warning: package 'stringr' was built under R version 3.4.4
#> -- Conflicts ----------------------------------- tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()

dt <- 
  tibble::tribble(
  ~id, ~a, ~b,
   1L, 0L, 0L,
   2L, 0L, 1L,
   3L, 1L, 0L,
   4L, 1L, 1L
  )

# doesn't work
dt %>% 
  nest(-id) %>% 
  mutate(prior = lag(data, default = first(data)))
#> Warning: package 'bindrcpp' was built under R version 3.4.4
#> Error in mutate_impl(.data, dots): Column `prior` must be length 4 (the number of rows) or one, not 5

# works
dt %>% 
  nest(-id) %>% 
  mutate(prior = lag(data, default = list(first(data))))
#> # A tibble: 4 x 3
#>      id data             prior           
#>   <int> <list>           <list>          
#> 1     1 <tibble [1 x 2]> <tibble [1 x 2]>
#> 2     2 <tibble [1 x 2]> <tibble [1 x 2]>
#> 3     3 <tibble [1 x 2]> <tibble [1 x 2]>
#> 4     4 <tibble [1 x 2]> <tibble [1 x 2]>

Why do I need to encapsulate with list?

Edit: On second thought, perhaps this has to do with how first works?

@cderv
Copy link
Contributor

cderv commented Aug 31, 2018

library(dplyr, warn.conflicts = FALSE)
dt <- 
  tibble::tribble(
    ~id, ~a, ~b,
    1L, 0L, 0L,
    2L, 0L, 1L,
    3L, 1L, 0L,
    4L, 1L, 1L
  )

# get the nested data
dt_nested <- dt %>% 
  tidyr::nest(-id) 

first is equivalent to [[1]] and you get what is inside of the first element of the list.
So a tibble with two colums, a list of length 2.

first(dt_nested$data) %>% str()
#> Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>  $ a: int 0
#>  $ b: int 0
first(dt_nested$data) %>% length()
#> [1] 2
identical(
  first(dt_nested$data), 
  dt_nested$data[[1]]
)
#> [1] TRUE

lag gets you a list of 4 elements and NA for the first row.

lag(dt_nested$data) %>% str()
#> List of 4
#>  $ : logi NA
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>   ..$ a: int 0
#>   ..$ b: int 0
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>   ..$ a: int 0
#>   ..$ b: int 1
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>   ..$ a: int 1
#>   ..$ b: int 0

Using first, it is like putting as default a list of 2 elements at the first element.

lag(dt_nested$data, default = dt_nested$data[[1]]) %>% str()
#> List of 5
#>  $ : int 0
#>  $ : int 0
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>   ..$ a: int 0
#>   ..$ b: int 0
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>   ..$ a: int 0
#>   ..$ b: int 1
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>   ..$ a: int 1
#>   ..$ b: int 0

You have a list of length 5, more than the number of rows in your tibble. It is why mutate is throwing an error, because
you are trying to add a list column to long.
If you select the first element instead of what is inside, you’ll a get an element of length one to use as default.
putting first result inside a list works also.

lag(dt_nested$data, default = dt_nested$data[1]) %>% str()
#> List of 4
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>   ..$ a: int 0
#>   ..$ b: int 0
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>   ..$ a: int 0
#>   ..$ b: int 0
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>   ..$ a: int 0
#>   ..$ b: int 1
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>   ..$ a: int 1
#>   ..$ b: int 0

You can use base R head also instead, as it select first part of any object

head(dt_nested$data, n = 1) %>% str()
#> List of 1
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>   ..$ a: int 0
#>   ..$ b: int 0
dt %>% 
  tidyr::nest(-id) %>% 
  mutate(prior = lag(data, default = head(data, 1)))
#> # A tibble: 4 x 3
#>      id data             prior           
#>   <int> <list>           <list>          
#> 1     1 <tibble [1 x 2]> <tibble [1 x 2]>
#> 2     2 <tibble [1 x 2]> <tibble [1 x 2]>
#> 3     3 <tibble [1 x 2]> <tibble [1 x 2]>
#> 4     4 <tibble [1 x 2]> <tibble [1 x 2]>

So I think you get the error because the result of first cannot be pass as default in lag here.
Created on 2018-08-31 by the reprex package (v0.2.0).

@hadley
Copy link
Member

hadley commented Aug 31, 2018

This does feel surprising to me. It's possible that I've forgotten some reason that this can't work, but I suspect we've forgotten to think fully about lists in either lag() or first()

@krlmlr
Copy link
Member

krlmlr commented Aug 31, 2018

Is this correct behavior?

dplyr::first(as.list(1:3))
#> [1] 1

Created on 2018-08-31 by the reprex package (v0.2.0).

@romainfrancois
Copy link
Member

I think the problem is with first, or actually with nth

> nth
function(x, n, order_by = NULL, default = default_missing(x)) {
  stopifnot(length(n) == 1, is.numeric(n))
  n <- trunc(n)

  if (n == 0 || n > length(x) || n < -length(x)) {
    return(default)
  }

  # Negative values index from RHS
  if (n < 0) {
    n <- length(x) + n + 1
  }

  if (is.null(order_by)) {
    x[[n]]
  } else {
    x[[ order(order_by)[[n]] ]]
  }
}
<environment: namespace:dplyr>

which does not return something of the same type as the input, it should return a length-1 list rather than its content.

@romainfrancois
Copy link
Member

Things appear slightly more consistent in the hybrid case

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

d <- tibble(x = list(1L, 1:2))

# hybrid
d %>% summarise(first = first(x), last = last(x))
#> # A tibble: 1 x 2
#>   first     last     
#>   <list>    <list>   
#> 1 <int [1]> <int [2]>

# standard
d %>% summarise(first = (first(x)))
#> # A tibble: 1 x 1
#>   first
#>   <int>
#> 1     1

# not working because last element of length 2
d %>% summarise(last = (last(x)))
#> Error: Column `last` must be length 1 (a summary value), not 2

Created on 2018-09-05 by the reprex package (v0.2.0).

@krlmlr
Copy link
Member

krlmlr commented Sep 15, 2018

This means that dplyr::first(as.list(1:3)) should return list(1L) . I'll take a look.

@krlmlr krlmlr self-assigned this Sep 15, 2018
@hadley
Copy link
Member

hadley commented Sep 15, 2018

Please do not change the behaviour of first(). I would like to fully understand the interaction of lag() and first() before making any changes.

@krlmlr
Copy link
Member

krlmlr commented Jun 27, 2019

I feel pretty confident that first() shouldn't unpack (because it should return a vector of length 1 so that it can be used in a summarize()) and lead() and lag() should insert a NULL instead of NA:

library(tidyverse)

x <- as.list(1:3)

# Shouldn't unpack
first(x)
#> [1] 1
nth(x, 2)
#> [1] 2

# Should insert NULL aka vctrs::vec_na(list())
lead(x)
#> [[1]]
#> [1] 2
#> 
#> [[2]]
#> [1] 3
#> 
#> [[3]]
#> [1] NA
lag(x)
#> [[1]]
#> [1] NA
#> 
#> [[2]]
#> [1] 1
#> 
#> [[3]]
#> [1] 2

Created on 2019-06-27 by the reprex package (v0.3.0)

@hadley hadley added feature a feature request or enhancement funs 😆 and removed vctrs ↗️ labels Dec 10, 2019
@hadley
Copy link
Member

hadley commented Dec 11, 2019

We'll resolve this in tidyverse/funs#35

@hadley hadley closed this as completed Dec 11, 2019
@lock
Copy link

lock bot commented Jun 24, 2020

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

1 similar comment
@lock
Copy link

lock bot commented Jun 24, 2020

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jun 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature a feature request or enhancement funs 😆
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants