Lead/Lag behavior with nested column #3789

TylerGrantSmith · 2018-08-31T14:20:49Z

I'm not sure if this is an intentional design, but I believe that lead and lag do not behave appropriately with nested columns.

library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 3.4.4
#> -- Attaching packages -------------------------------- tidyverse 1.2.1 --
#> v ggplot2 3.0.0     v purrr   0.2.4
#> v tibble  1.4.2     v dplyr   0.7.6
#> v tidyr   0.8.1     v stringr 1.3.1
#> v readr   1.1.1     v forcats 0.2.0
#> Warning: package 'ggplot2' was built under R version 3.4.4
#> Warning: package 'tibble' was built under R version 3.4.4
#> Warning: package 'tidyr' was built under R version 3.4.4
#> Warning: package 'purrr' was built under R version 3.4.2
#> Warning: package 'dplyr' was built under R version 3.4.4
#> Warning: package 'stringr' was built under R version 3.4.4
#> -- Conflicts ----------------------------------- tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()

dt <- 
  tibble::tribble(
  ~id, ~a, ~b,
   1L, 0L, 0L,
   2L, 0L, 1L,
   3L, 1L, 0L,
   4L, 1L, 1L
  )

# doesn't work
dt %>% 
  nest(-id) %>% 
  mutate(prior = lag(data, default = first(data)))
#> Warning: package 'bindrcpp' was built under R version 3.4.4
#> Error in mutate_impl(.data, dots): Column `prior` must be length 4 (the number of rows) or one, not 5

# works
dt %>% 
  nest(-id) %>% 
  mutate(prior = lag(data, default = list(first(data))))
#> # A tibble: 4 x 3
#>      id data             prior           
#>   <int> <list>           <list>          
#> 1     1 <tibble [1 x 2]> <tibble [1 x 2]>
#> 2     2 <tibble [1 x 2]> <tibble [1 x 2]>
#> 3     3 <tibble [1 x 2]> <tibble [1 x 2]>
#> 4     4 <tibble [1 x 2]> <tibble [1 x 2]>

Why do I need to encapsulate with list?

Edit: On second thought, perhaps this has to do with how first works?

The text was updated successfully, but these errors were encountered:

cderv · 2018-08-31T16:39:24Z

library(dplyr, warn.conflicts = FALSE)
dt <- 
  tibble::tribble(
    ~id, ~a, ~b,
    1L, 0L, 0L,
    2L, 0L, 1L,
    3L, 1L, 0L,
    4L, 1L, 1L
  )

# get the nested data
dt_nested <- dt %>% 
  tidyr::nest(-id)

first is equivalent to [[1]] and you get what is inside of the first element of the list.
So a tibble with two colums, a list of length 2.

first(dt_nested$data) %>% str()
#> Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>  $ a: int 0
#>  $ b: int 0
first(dt_nested$data) %>% length()
#> [1] 2
identical(
  first(dt_nested$data), 
  dt_nested$data[[1]]
)
#> [1] TRUE

lag gets you a list of 4 elements and NA for the first row.

lag(dt_nested$data) %>% str()
#> List of 4
#>  $ : logi NA
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>   ..$ a: int 0
#>   ..$ b: int 0
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>   ..$ a: int 0
#>   ..$ b: int 1
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>   ..$ a: int 1
#>   ..$ b: int 0

Using first, it is like putting as default a list of 2 elements at the first element.

lag(dt_nested$data, default = dt_nested$data[[1]]) %>% str()
#> List of 5
#>  $ : int 0
#>  $ : int 0
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>   ..$ a: int 0
#>   ..$ b: int 0
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>   ..$ a: int 0
#>   ..$ b: int 1
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>   ..$ a: int 1
#>   ..$ b: int 0

You have a list of length 5, more than the number of rows in your tibble. It is why mutate is throwing an error, because
you are trying to add a list column to long.
If you select the first element instead of what is inside, you’ll a get an element of length one to use as default.
putting first result inside a list works also.

lag(dt_nested$data, default = dt_nested$data[1]) %>% str()
#> List of 4
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>   ..$ a: int 0
#>   ..$ b: int 0
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>   ..$ a: int 0
#>   ..$ b: int 0
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>   ..$ a: int 0
#>   ..$ b: int 1
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>   ..$ a: int 1
#>   ..$ b: int 0

You can use base R head also instead, as it select first part of any object

head(dt_nested$data, n = 1) %>% str()
#> List of 1
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  2 variables:
#>   ..$ a: int 0
#>   ..$ b: int 0
dt %>% 
  tidyr::nest(-id) %>% 
  mutate(prior = lag(data, default = head(data, 1)))
#> # A tibble: 4 x 3
#>      id data             prior           
#>   <int> <list>           <list>          
#> 1     1 <tibble [1 x 2]> <tibble [1 x 2]>
#> 2     2 <tibble [1 x 2]> <tibble [1 x 2]>
#> 3     3 <tibble [1 x 2]> <tibble [1 x 2]>
#> 4     4 <tibble [1 x 2]> <tibble [1 x 2]>

So I think you get the error because the result of first cannot be pass as default in lag here.
Created on 2018-08-31 by the reprex package (v0.2.0).

hadley · 2018-08-31T18:50:56Z

This does feel surprising to me. It's possible that I've forgotten some reason that this can't work, but I suspect we've forgotten to think fully about lists in either lag() or first()

krlmlr · 2018-08-31T19:26:40Z

Is this correct behavior?

dplyr::first(as.list(1:3))
#> [1] 1

Created on 2018-08-31 by the reprex package (v0.2.0).

romainfrancois · 2018-09-05T08:37:28Z

I think the problem is with first, or actually with nth

> nth
function(x, n, order_by = NULL, default = default_missing(x)) {
  stopifnot(length(n) == 1, is.numeric(n))
  n <- trunc(n)

  if (n == 0 || n > length(x) || n < -length(x)) {
    return(default)
  }

  # Negative values index from RHS
  if (n < 0) {
    n <- length(x) + n + 1
  }

  if (is.null(order_by)) {
    x[[n]]
  } else {
    x[[ order(order_by)[[n]] ]]
  }
}
<environment: namespace:dplyr>

which does not return something of the same type as the input, it should return a length-1 list rather than its content.

romainfrancois · 2018-09-05T08:48:23Z

Things appear slightly more consistent in the hybrid case

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

d <- tibble(x = list(1L, 1:2))

# hybrid
d %>% summarise(first = first(x), last = last(x))
#> # A tibble: 1 x 2
#>   first     last     
#>   <list>    <list>   
#> 1 <int [1]> <int [2]>

# standard
d %>% summarise(first = (first(x)))
#> # A tibble: 1 x 1
#>   first
#>   <int>
#> 1     1

# not working because last element of length 2
d %>% summarise(last = (last(x)))
#> Error: Column `last` must be length 1 (a summary value), not 2

Created on 2018-09-05 by the reprex package (v0.2.0).

krlmlr · 2018-09-15T08:58:20Z

This means that dplyr::first(as.list(1:3)) should return list(1L) . I'll take a look.

hadley · 2018-09-15T15:16:11Z

Please do not change the behaviour of first(). I would like to fully understand the interaction of lag() and first() before making any changes.

krlmlr · 2019-06-27T08:13:03Z

I feel pretty confident that first() shouldn't unpack (because it should return a vector of length 1 so that it can be used in a summarize()) and lead() and lag() should insert a NULL instead of NA:

library(tidyverse)

x <- as.list(1:3)

# Shouldn't unpack
first(x)
#> [1] 1
nth(x, 2)
#> [1] 2

# Should insert NULL aka vctrs::vec_na(list())
lead(x)
#> [[1]]
#> [1] 2
#> 
#> [[2]]
#> [1] 3
#> 
#> [[3]]
#> [1] NA
lag(x)
#> [[1]]
#> [1] NA
#> 
#> [[2]]
#> [1] 1
#> 
#> [[3]]
#> [1] 2

^{Created on 2019-06-27 by the reprex package (v0.3.0)}

hadley · 2019-12-11T22:36:22Z

We'll resolve this in tidyverse/funs#35

lock · 2020-06-24T23:10:56Z

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

lock · 2020-06-24T23:13:11Z

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

krlmlr self-assigned this Sep 15, 2018

krlmlr removed their assignment Oct 5, 2018

romainfrancois mentioned this issue Nov 26, 2018

vec_get() r-lib/vctrs#141

Open

romainfrancois added the vctrs ↗️ label Jan 29, 2019

romainfrancois mentioned this issue Nov 27, 2019

Rework nth() on top of vctrs #4636

Closed

hadley added feature a feature request or enhancement funs 😆 and removed vctrs ↗️ labels Dec 10, 2019

hadley closed this as completed Dec 11, 2019

lock bot locked and limited conversation to collaborators Jun 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lead/Lag behavior with nested column #3789

Lead/Lag behavior with nested column #3789

TylerGrantSmith commented Aug 31, 2018 •

edited

Loading

cderv commented Aug 31, 2018

hadley commented Aug 31, 2018

krlmlr commented Aug 31, 2018

romainfrancois commented Sep 5, 2018

romainfrancois commented Sep 5, 2018

krlmlr commented Sep 15, 2018

hadley commented Sep 15, 2018

krlmlr commented Jun 27, 2019

hadley commented Dec 11, 2019

lock bot commented Jun 24, 2020

lock bot commented Jun 24, 2020

Lead/Lag behavior with nested column #3789

Lead/Lag behavior with nested column #3789

Comments

TylerGrantSmith commented Aug 31, 2018 • edited Loading

cderv commented Aug 31, 2018

hadley commented Aug 31, 2018

krlmlr commented Aug 31, 2018

romainfrancois commented Sep 5, 2018

romainfrancois commented Sep 5, 2018

krlmlr commented Sep 15, 2018

hadley commented Sep 15, 2018

krlmlr commented Jun 27, 2019

hadley commented Dec 11, 2019

lock bot commented Jun 24, 2020

lock bot commented Jun 24, 2020

TylerGrantSmith commented Aug 31, 2018 •

edited

Loading