Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use vctrs for flattening into atomic vectors #785

Closed
wants to merge 6 commits into from

Conversation

lionel-
Copy link
Member

@lionel- lionel- commented Aug 7, 2020

I added a bit of theory in ?flatten to explain the current flattening semantics.

The new implementation supports vctrs coercions:

# Before
flatten_chr(list("foo", factor("bar")))
#> [1] "foo" "1"

# After
flatten_chr(list("foo", factor("bar")))
#> [1] "foo" "bar"

The historical numeric-to-character coercion is still supported (when vctrs coercion fails) but deprecated:

flatten_chr(list("foo", 1))
#> [1] "foo"      "1.000000"
#> Warning message:
#> Numeric to character coercion is deprecated as of purrr 0.4.0.
#> This warning is displayed once per session.

The historical behaviour with data frames is preserved:

flatten_dbl(data.frame(x = 1:2, y = 3:4))
#> [1] 1 2 3 4

But I wonder if this should be deprecated? In a way this is consistent with how map() supports data frames though.

Improves performance. Interestingly vctrs has gotten faster than unlist()?

flatten_int_old <- function(.x) {
  .Call(vflatten_impl, .x, "integer")
}
flatten_int_rlang <- rlang::flatten_int

x <- list(1L, 2:10 + 0L, 11:12 + 0L)
bench::mark(
  base = unlist(x, recursive = FALSE),
  vctrs = flatten_int(x),
  old = flatten_int_old(x),
  rlang = flatten_int_rlang(x)
)[1:8]
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#> 1 base          949ns   1.21µs   727077.        0B      0   10000     0
#> 2 vctrs        3.71µs   4.77µs   195917.        0B     19.6  9999     1
#> 3 old           863ns   1.18µs   801488.        0B      0   10000     0
#> 4 rlang         988ns   1.27µs   726620.        0B      0   10000     0


y <- list(1L, 2:10 + 0L, 11:1e6 + 0L)
bench::mark(
  base = unlist(y, recursive = FALSE),
  vctrs = flatten_int(y),
  old = flatten_int_old(y),
  rlang = flatten_int_rlang(y)
)[1:8]
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#> 1 base         1.63ms   1.85ms     532.     3.81MB     81.5   137    21
#> 2 vctrs      828.31µs   1.05ms     941.     3.81MB    134.    189    27
#> 3 old         10.97ms   11.6ms      84.0   11.44MB     25.2    20     6
#> 4 rlang       459.1µs  654.3µs    1464.     3.81MB    121.    291    24

@lionel- lionel- added the advice label Aug 7, 2020
@lionel- lionel- force-pushed the vctrs-typed-flatten branch from 160f07f to 8783f11 Compare August 7, 2020 16:19
@DavisVaughan
Copy link
Member

DavisVaughan commented Aug 7, 2020

It might be worth revisiting this if vec_unchop() has gotten faster
https://github.com/tidyverse/dplyr/blob/dc3e0e6ccc54b049ecd89ca83455cdb101e6583d/R/join-rows.R#L54

Hmm, maybe not faster in some odd cases. This was from that example

library(vctrs)

index_flatten <- function(x) {
  unlist(x, recursive = FALSE, use.names = FALSE)
}

x <- rep_len(list(1L), 1e6)

bench::mark(
  vec_unchop(x, ptype = integer()),
  index_flatten(x),
  iterations = 10
)
#> # A tibble: 2 x 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                       <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 vec_unchop(x, ptype = integer())  138.3ms 149.3ms      6.62   15.27MB    15.5 
#> 2 index_flatten(x)                   10.9ms  11.5ms     74.1     3.81MB     8.23

Created on 2020-08-07 by the reprex package (v0.3.0)

@lionel-
Copy link
Member Author

lionel- commented Aug 7, 2020

oh so the crucial test case is long lists rather than heavy lists.

@lionel-
Copy link
Member Author

lionel- commented Aug 7, 2020

which makes total sense since the likely bottleneck is the common type determination? Though in this case we pass a prototype so that can't be it, maybe the casting (which should essentially be a no-op here).

@DavisVaughan
Copy link
Member

I think most of it comes from the casting and the assignment. I feel like we could probably make the assignment faster for at least the atomic cases, but that's just a guess.

Screen Shot 2020-08-07 at 4 32 59 PM

@lionel-
Copy link
Member Author

lionel- commented Aug 10, 2020

I added a name_spec argument with a backward-compatible default "{inner}". This ignores the outer names, as in CRAN purrr. There is a behaviour change (improvement) in case of a single outer name for a scalar inner vector:

# With CRAN purrr
flatten_int(list(x = c(foo = 1L, bar = 2L), baz = 3L))
#> foo bar
#>   1   2   3

# With this PR
flatten_int(list(x = c(foo = 1L, bar = 2L), baz = 3L))
#> foo bar baz
#>   1   2   3

@lionel- lionel- force-pushed the vctrs-typed-flatten branch from 6b35f74 to cd88237 Compare August 10, 2020 08:46
@lionel-
Copy link
Member Author

lionel- commented Aug 10, 2020

I now wonder if "{inner}" should be our general default in vec-c / list-unchop:

  • No transformation of data.
  • Outer names are ignored (lost data), but the more specific (and thus probably more relevant) inner names are kept.
  • No errors which might be annoying during data analysis.

@lionel- lionel- force-pushed the vctrs-typed-flatten branch from cd88237 to 861c2f8 Compare August 10, 2020 11:19
@lionel- lionel- force-pushed the vctrs-typed-flatten branch from 861c2f8 to a1d204c Compare August 10, 2020 11:26
@lionel-
Copy link
Member Author

lionel- commented Aug 10, 2020

Actually "{inner}" doesn't work as a default because inner might be numeric if the elements are unnamed.

I fixed it but there is currently no way to zap names from a name-spec so this is a behaviour change:

# CRAN
names(purrr::flatten_int(list(x = c(1L, 2L), 3L)))
NULL

# This PR
names(flatten_int(list(x = c(1L, 2L), 3L)))
#> [1] "" "" ""

Feature request at r-lib/vctrs#1215.

@hadley
Copy link
Member

hadley commented Aug 26, 2022

We should consider renaming to list_flatten() (with a .ptype argument) and deprecating flatten_*, if this will affect much existing code.

@hadley
Copy link
Member

hadley commented Sep 4, 2022

Superseded by #912.

@hadley hadley closed this Sep 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants