Use vctrs for flattening into atomic vectors #785

Closed
wants to merge 6 commits

Conversation

lionel- (Member) commented Aug 7, 2020

I added a bit of theory in ?flatten to explain the current flattening semantics.

The new implementation supports vctrs coercions:

# Before
flatten_chr(list("foo", factor("bar")))
#> [1] "foo" "1"

# After
flatten_chr(list("foo", factor("bar")))
#> [1] "foo" "bar"

The historical numeric-to-character coercion is still supported (when vctrs coercion fails) but deprecated:

flatten_chr(list("foo", 1))
#> [1] "foo"      "1.000000"
#> Warning message:
#> Numeric to character coercion is deprecated as of purrr 0.4.0.
#> This warning is displayed once per session.
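
The fallback could be sketched like this (hypothetical function; the condition class and formatting details are assumptions, and the real implementation lives in C):

coerce_chr <- function(x) {
  tryCatch(
    vctrs::vec_cast(x, character()),
    vctrs_error_incompatible_type = function(cnd) {
      # Historical coercion path, warning once per session
      rlang::warn(
        "Numeric to character coercion is deprecated as of purrr 0.4.0.",
        .frequency = "once",
        .frequency_id = "purrr_flatten_chr"
      )
      sprintf("%f", x)  # reproduces the historical "1.000000" formatting
    }
  )
}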

The historical behaviour with data frames is preserved:

flatten_dbl(data.frame(x = 1:2, y = 3:4))
#> [1] 1 2 3 4

But I wonder whether this should be deprecated too? Then again, it is consistent with how map() supports data frames, as shown below.
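
For reference, map() already treats a data frame as a list of its columns:

map_dbl(data.frame(x = 1:2, y = 3:4), mean)
#>   x   y
#> 1.5 3.5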

The new implementation also improves performance. Interestingly, vctrs seems to have gotten faster than unlist()?

# Old C implementation for comparison; vflatten_impl is purrr's
# internal C routine, so this must be evaluated inside purrr's namespace
flatten_int_old <- function(.x) {
  .Call(vflatten_impl, .x, "integer")
}
flatten_int_rlang <- rlang::flatten_int

# `+ 0L` materialises the ALTREP compact sequences
x <- list(1L, 2:10 + 0L, 11:12 + 0L)
bench::mark(
  base = unlist(x, recursive = FALSE),
  vctrs = flatten_int(x),
  old = flatten_int_old(x),
  rlang = flatten_int_rlang(x)
)[1:8]
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#> 1 base          949ns   1.21µs   727077.        0B      0   10000     0
#> 2 vctrs        3.71µs   4.77µs   195917.        0B     19.6  9999     1
#> 3 old           863ns   1.18µs   801488.        0B      0   10000     0
#> 4 rlang         988ns   1.27µs   726620.        0B      0   10000     0


y <- list(1L, 2:10 + 0L, 11:1e6 + 0L)
bench::mark(
  base = unlist(y, recursive = FALSE),
  vctrs = flatten_int(y),
  old = flatten_int_old(y),
  rlang = flatten_int_rlang(y)
)[1:8]
#>   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#> 1 base         1.63ms   1.85ms     532.     3.81MB     81.5   137    21
#> 2 vctrs      828.31µs   1.05ms     941.     3.81MB    134.    189    27
#> 3 old         10.97ms   11.6ms      84.0   11.44MB     25.2    20     6
#> 4 rlang       459.1µs  654.3µs    1464.     3.81MB    121.    291    24

DavisVaughan (Member) commented Aug 7, 2020

It might be worth revisiting this dplyr code if vec_unchop() has gotten faster:
https://github.com/tidyverse/dplyr/blob/dc3e0e6ccc54b049ecd89ca83455cdb101e6583d/R/join-rows.R#L54

Hmm, maybe not faster in some odd cases. This benchmark is based on that example:

library(vctrs)

index_flatten <- function(x) {
  unlist(x, recursive = FALSE, use.names = FALSE)
}

x <- rep_len(list(1L), 1e6)

bench::mark(
  vec_unchop(x, ptype = integer()),
  index_flatten(x),
  iterations = 10
)
#> # A tibble: 2 x 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                       <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 vec_unchop(x, ptype = integer())  138.3ms 149.3ms      6.62   15.27MB    15.5 
#> 2 index_flatten(x)                   10.9ms  11.5ms     74.1     3.81MB     8.23

Created on 2020-08-07 by the reprex package (v0.3.0)

lionel- (Member, Author) commented Aug 7, 2020

Oh, so the crucial test case is long lists (many elements) rather than heavy lists (large elements).

lionel- (Member, Author) commented Aug 7, 2020

Which makes total sense if the likely bottleneck is common type determination? Though in this case we pass a prototype, so that can't be it. Maybe it's the casting (which should essentially be a no-op here).

DavisVaughan (Member)

I think most of it comes from the casting and the assignment. I feel like we could probably make the assignment faster for at least the atomic cases, but that's just a guess.

[Screenshot: profiling output, 2020-08-07]
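
A profile of this kind can be reproduced with something along these lines (a sketch; the exact profiling setup behind the screenshot is an assumption):

profvis::profvis(
  vctrs::vec_unchop(rep_len(list(1L), 1e6), ptype = integer())
)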

lionel- (Member, Author) commented Aug 10, 2020

I added a name_spec argument with a backward-compatible default of "{inner}". This ignores the outer names, as in CRAN purrr. There is a behaviour change (an improvement) in the case of a single outer name on a scalar inner vector:

# With CRAN purrr
flatten_int(list(x = c(foo = 1L, bar = 2L), baz = 3L))
#> foo bar
#>   1   2   3

# With this PR
flatten_int(list(x = c(foo = 1L, bar = 2L), baz = 3L))
#> foo bar baz
#>   1   2   3
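
This name handling can be reproduced with vctrs directly (a sketch; it relies on vctrs using an outer name as-is for unnamed size-1 elements, which is where "baz" comes from):

vctrs::vec_unchop(
  list(x = c(foo = 1L, bar = 2L), baz = 3L),
  name_spec = "{inner}"
)
#> foo bar baz
#>   1   2   3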

lionel- (Member, Author) commented Aug 10, 2020

I now wonder if "{inner}" should be our general default in vec-c / list-unchop:

  • No transformation of data.
  • Outer names are ignored (lost data), but the more specific (and thus probably more relevant) inner names are kept.
  • No errors which might be annoying during data analysis.

lionel- (Member, Author) commented Aug 10, 2020

Actually "{inner}" doesn't work as a default because {inner} resolves to a numeric position when the elements are unnamed.

I fixed it, but there is currently no way to zap names from a name spec, so this is a behaviour change:

# CRAN
names(purrr::flatten_int(list(x = c(1L, 2L), 3L)))
#> NULL

# This PR
names(flatten_int(list(x = c(1L, 2L), 3L)))
#> [1] "" "" ""

Feature request at r-lib/vctrs#1215.
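
If that request is implemented, the old behaviour could be recovered with something like this (a sketch, assuming zap() becomes a valid name spec):

names(vctrs::vec_c(x = c(1L, 2L), 3L, .name_spec = rlang::zap()))
#> NULL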

hadley (Member) commented Aug 26, 2022

We should consider renaming to list_flatten() (with a .ptype argument) and deprecating flatten_*, if this will affect much existing code.

hadley (Member) commented Sep 4, 2022

Superseded by #912.

hadley closed this Sep 4, 2022