Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Basic map implementation #495

Closed
hadley opened this issue Jul 22, 2019 · 15 comments
Closed

Basic map implementation #495

hadley opened this issue Jul 22, 2019 · 15 comments
Labels
feature a feature request or enhancement

Comments

@hadley
Copy link
Member

hadley commented Jul 22, 2019

Need to benchmark the two strategies: in the first we just put in a list and then vec_c() at the end; in the second we assume that the type is mostly stable so we can create the space for output in the beginning, and only occasionally change.

vec_map <- function(.x, .f, ..., .ptype = NULL) {
  .x <- lapply(.x, .f, ...)
  vec_c(!!!.x, .ptype = .ptype)
}


vec_map <- function(.x, .f, ..., .ptype = NULL) {
  first <- .x[[1]]
  first_out <- .f(.x, ...) 
  
  out <- vec_na(first_out, vec_size(.x)) 
  vec_slice(out, 1) <- first_out

  for (i in seq2(2, vec_size(.x))) {
    outi <- .f(vec_slice(x, i), ...)
    if (vec_is(outi, out)) {
      # change type
    }
    vec_slice(out, i) <- outi
  }
  
  out
}
@hadley
Copy link
Member Author

hadley commented Jul 22, 2019

My implementation isn't quite right because vec_slice() is like [, but we need something like [[.

@DavisVaughan
Copy link
Member

Size 1 constraint too?

library(vctrs)

vec_map <- function(.x, .f, ..., .ptype = NULL) {
  .x <- lapply(.x, .f, ...)
  vec_c(!!!.x, .ptype = .ptype)
}

vec_map(list(1, 1:2), function(x) {x})
#> [1] 1 1 2

Created on 2019-07-22 by the reprex package (v0.2.1)

@hadley
Copy link
Member Author

hadley commented Jul 22, 2019

Oh yeah, good point.

@DavisVaughan
Copy link
Member

Related / my attempt: tidyverse/purrr#683

@lionel-
Copy link
Member

lionel- commented Jul 22, 2019

Shouldn't we expect vec_map() to pass scalar rows to .fn?

@lionel-
Copy link
Member

lionel- commented Jul 22, 2019

I would expect the [[ extraction with map_vec() rather than vec_map().

@DavisVaughan
Copy link
Member

DavisVaughan commented Jul 22, 2019

i.e. .f(vec_slice(.x, i), ...) rather than .f(.x[[i]], ...). If this is to replace compat-purrr.R then maybe it should do [[ to be less surprising. If it should be vctrs-y then it seems like passing scalar rows to .f would be correct. But then again we don't have a [[ variant, so maybe if we had that then we could say we were "always iterating over the elements returned by vec_rip()".

@lionel-
Copy link
Member

lionel- commented Jul 22, 2019

Since there is no hurry for replacing compat-purrr.R, getting the semantics vctrs-y seems more important.

Unlike slide() which always returns a list, vec_map() would have to wrap in a list or data frame to return ragged elements, otherwise we get a size error.

@lionel-
Copy link
Member

lionel- commented Jul 22, 2019

Also purrr is now about as lightweight as vctrs in terms of dependencies, so purrr itself can replace compat-purrr.R

@DavisVaughan
Copy link
Member

DavisVaughan commented Jul 23, 2019

@lionel- and i talked about this a bit more, based on our notes I have come up with this for vec_map(), which is subtly different from map() when working with lists:

library(vctrs)

vec_map <- function(.x, .f, ..., .ptype = list()) {
  if (is.null(.ptype)) {
    .ptype <- list()
    simplify <- TRUE
  } else {
    simplify <- FALSE
  }
  
  .f <- as_function(.f)
  
  out <- vec_map_impl(.x, .f, ..., .ptype = .ptype)
  
  if (simplify) {
    out <- vec_c(!!!out)
  }
  
  out
}

vec_map_impl <- function(.x, .f, ..., .ptype) {
  out <- vec_init(.ptype, vec_size(.x)) 
  
  for (i in seq2(1, vec_size(.x))) {
    elt <- .f(vec_slice(.x, i), ...)
    elt <- vec_cast(elt, .ptype)
    out <- vec_assign(out, i, elt)
  }
  
  out
}
# each element is castable to a length 1 list
vec_map(1:3, ~.x)
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 2
#> 
#> [[3]]
#> [1] 3

# cannot do this, unlike map(). c(.x, 1) is cast to a length 2 list, which
# cannot be assigned to a single location in the output
vec_map(1:3, ~c(.x, 1))
#> Error: Incompatible lengths: 2, 1

# must wrap in list for it to work
vec_map(1:3, ~list(c(.x, 1)))
#> [[1]]
#> [1] 1 1
#> 
#> [[2]]
#> [1] 2 1
#> 
#> [[3]]
#> [1] 3 1

vec_map(1:3, ~.x, .ptype = double())
#> [1] 1 2 3

vec_map(1:3, ~.x, .ptype = NULL)
#> [1] 1 2 3

vec_map(1:3, ~if (.x == 1) {"ugh"} else {1}, .ptype = NULL)
#> No common type for `..1` <character> and `..2` <double>.

@hadley
Copy link
Member Author

hadley commented Jul 23, 2019

That assumes that .ptype is supplied; I'm mostly interested in the case where it's not and we want to work as efficiently as possible — if we assume that the function is type-stable, we can figure out the type based on the first result, and use it, only widening the type if it becomes necessary later on. I want to understand the performance characteristics of the two approaches so we can figure out which approach to take inside of dplyr.

@lionel-
Copy link
Member

lionel- commented Jul 23, 2019

The simplify step seems suspect to me. Could you init a ptype vector, then cast each result to it and assign-splice the results outside the loop in out? The results would be generated by map().

@DavisVaughan
Copy link
Member

Yea i think the simplify part is wrong.

cars <- mtcars[1:3,]

# this is wrong
vec_map(1:3, ~list(cars), .ptype = NULL)
#>    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> 4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> 5 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> 6 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> 7 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> 8 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> 9 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

# ^ should return what this does
vec_map(1:3, ~list(list(cars)), .ptype = NULL)
#> [[1]]
#>                mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> 
#> [[2]]
#>                mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> 
#> [[3]]
#>                mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

@DavisVaughan
Copy link
Member

(This is not related to the performance of the two methods mentioned in the original comment, but this seems relevant to the conversation anyways)

A few more notes that came out of some slack conversation that might be useful to have documented:

purrr::map() can be thought of as a special function that preserves the type and size of each of the outputs from .f, without any casting.

This makes it useful as a building block for flat-map techniques, with map_dfr() being a good example as just map() + bind_rows() (which can return an object with a size != vec_size(.x)).

map_int() and the other suffix functions are different, and don't have the flat-map semantics. Instead they force the output from .f to conform to the required type (integer) and have size 1.

vec_map() should be considered more like map_int() than map(). In fact, if there was a map_list() then that would be equivalent to the defaults of vec_map() (.ptype = list()). vec_map() will force the output from .f to have type list, and size 1.

This results in the following difference between vec_map() and map():

library(purrr)
library(vctrs)
library(rlang, warn.conflicts = FALSE)

vec_map <- function(.x, .f, ..., .ptype = list()) {
  .f <- as_function(.f)
  out <- lapply(.x, .f, ...)
  
  if (vec_size_common(!!! out) != 1L) {
    abort("All results must have size 1.")
  }
  
  vec_c(!!! out, .ptype = .ptype)
}

# maintain type/size of `.f` result, no matter what that may be
map(1:2, ~c(.x, 1))
#> [[1]]
#> [1] 1 1
#> 
#> [[2]]
#> [1] 2 1

# force result of `.f` to be a list of size 1
vec_map(1:2, ~c(.x, 1))
#> All results must have size 1.

# fails because of how this works (which is expected and good).
# result of `.f` is cast to list, but becomes length 2
vec_cast(c(2, 1), list())
#> [[1]]
#> [1] 2
#> 
#> [[2]]
#> [1] 1

# must wrap in a list of size 1
vec_map(1:2, ~list(c(.x, 1)))
#> [[1]]
#> [1] 1 1
#> 
#> [[2]]
#> [1] 2 1

I think this means:

  • map() should not ever get a .ptype argument
  • In theory map_int() could use vec_map(.ptype = int())?
  • slide() should work similarly, and not get a .ptype argument
  • But there could be vec_slide() which works like vec_map(), and then slide_int() could be built from that

@lionel- lionel- added the feature a feature request or enhancement label Apr 20, 2020
@lionel-
Copy link
Member

lionel- commented Sep 12, 2022

Closing in favour of tidyverse/purrr#894

@lionel- lionel- closed this as completed Sep 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

3 participants