Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement map_vec() #683

Closed
wants to merge 4 commits into from
Closed

Implement map_vec() #683

wants to merge 4 commits into from

Conversation

DavisVaughan
Copy link
Member

@DavisVaughan DavisVaughan commented Jun 18, 2019

This PR implements map_vec(), a vctrs powered map() that attempts to simplify the output automatically by computing the common type of the result elements. If a common type can be found, the result will be simplified to that type. If not, a list is returned.

This is a draft PR for now as I'm not currently proposing that it be merged. I'd like to get feedback and see what others think about it.

  • Add tests
  • Add examples

This function maintains the invariant that:

vec_size(.x) == vec_size(map_vec(.x))

After glancing at the issues, it seems this PR touches: #679, #633, #472, #376

For now, here are a number of examples that demonstrate the features of map_vec():

library(vctrs)
library(purrr)
library(tibble)
options(rlang_backtrace_on_error = "reminder")

# Auto simplified to an integer vector
map_vec(1:2, ~ .x)
#> [1] 1 2

# Prevent simplification with `.ptype = list()`
map_vec(1:2, ~ .x, .ptype = list())
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 2

# If a `.ptype` is specified, and casting to that type
# is not possible, an error is raised
map_vec(1:2, ~ .x, .ptype = factor())
#> Can't cast <integer> to <factor<>>

# If simplification is possible, all elements must have size 1,
# otherwise an error is raised
map_vec(1:2, ~ if (.x == 1L) 1:2 else 3)
#> Result 1 must be a single double, not an integer vector of length 2

# If you use `.ptype = list()`, this is pseudo-relaxed
# (The size of the output is still the same as the 
# size of the input (2), but each inner element can vary in size)
map_vec(1:2, ~ if (.x == 1L) 1:2 else 3, .ptype = list())
#> [[1]]
#> [1] 1 2
#> 
#> [[2]]
#> [1] 3

# The best thing about `map_vec()` is its flexibility with other
# non-atomic types, for example, simplifying to a date vector
map_vec(1:2, ~ Sys.Date() + .x)
#> [1] "2019-06-19" "2019-06-20"

# If a common type cannot be determined, a list is returned
map_vec(list(1, "x"), ~ .x)
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] "x"

# Note that just because a common type isn't found doesn't mean you
# can't still forcibly cast to a certain type.
map_vec(list(1, "x"), ~ .x, .ptype = character())
#> [1] "1" "x"

# The underlying `vec_c()` engine knows how to combine 
# data frame output, but this isn't a `map_dfr()` replacement
map_vec(1:2, ~ tibble(x = .x))
#> # A tibble: 2 x 1
#>       x
#>   <int>
#> 1     1
#> 2     2

# You can enforce the structure of the data frame output with a ptype.
# This has the same result as before but coerces the integers to characters
ptype <- tibble(x = character())
map_vec(1:2, ~ tibble(x = .x), .ptype = ptype)
#> # A tibble: 2 x 1
#>   x    
#>   <chr>
#> 1 1    
#> 2 2

# And this errors
map_vec(1:2, ~ tibble(x = .x, y = 1), .ptype = ptype)
#> Lossy cast from <tbl_df<
#>   x: integer
#>   y: double
#> >> to <tbl_df<x:character>>.
#> Dropped variables: `y`

# Or you can enforce a partial structure with a partial_frame()
# This ensures that a `y` column exists
partial_ptype <- partial_frame(y = numeric())
map_vec(1:2, ~ tibble(x = .x), .ptype = partial_ptype)
#> # A tibble: 2 x 2
#>       x     y
#>   <int> <dbl>
#> 1     1    NA
#> 2     2    NA

# Turning on strict mode forces the user to provide a .ptype
withr::with_options(
  list(vctrs.no_guessing = TRUE),
  map_vec(1:5, ~ .x)
)
#> Error: strict mode is activated; you must supply complete `.ptype`.

@DavisVaughan
Copy link
Member Author

A few questions:

  • Should a .name_repair argument be passed to vec_c()?
  • If a .ptype is specified and vec_c() fails because it cannot cast to the .ptype, we currently get a vctrs cast error. Should that be more informative? Can it be (we don't have the index where the cast failure occured)?

@DavisVaughan
Copy link
Member Author

DavisVaughan commented Jun 18, 2019

Other things vctrs can help with:

  • Using vec_rbind() in map_dfr()
  • Using vec_cbind() in map_dfc()
  • Possibly powering map_dbl() and friends by just doing map_vec(.ptype = double())?
  • Altering the current recycling rules in purrr. Current thinking suggests that this should produce an error purrr::map2(1:2, integer(), ~ .x)

@lionel-
Copy link
Member

lionel- commented Jun 19, 2019

Last time we talked about this with Hadley, the idea was to have funs::s() to simplify an output. Then our sapply would be s(map(...)). This approach seems better than multiplying the number of functions in purrr / other packages as it is more composable.

Similarly, map_int(...) could be int(map(...)). However the typed variants are more efficient because they can omit the intermediate representation as a list. In general, performance is a reason to provide variants.

Does this suggest we should have map_vec() in addition to s(map()) if it can help performance? The prototype should be computed in advance and then we'd build the vector and vec_slice into it. This would require exporting some of the C API from vctrs.

@DavisVaughan
Copy link
Member Author

DavisVaughan commented Jun 21, 2019

I see that mention of funs::s() here. I guess it would be like s(x, ptype = NULL)?

I am a bit worried about performance here. If purrr was being implemented from scratch today, I feel like a C version of map_vec() would still exist just for its performance gains. I think even map_int() and some of the more common variants might even still exist because it is so much easier to type than map_vec(x, fun, .ptype = int()) when you want to guarantee that the result is an integer + have the performance.

I do think generally funs::s() will be a nice solution to not multiply the number of these kinds of variants in other packages, but purrr feels like an exception.

@DavisVaughan
Copy link
Member Author

Thought: How can we compute the prototype of the output in advance if we need all elements of the output to determine the common type? Unless, .ptype is supplied, we can't? Which means 2 allocations are needed, one for a list to contain the n results, and then one for the common type container of size n.

@hadley
Copy link
Member

hadley commented Aug 27, 2022

Superseded by #894

@hadley hadley closed this Aug 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants