Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjusting to different behaviour between case_match and recode #6856

Closed
manhnguyen48 opened this issue May 23, 2023 · 5 comments
Closed

Adjusting to different behaviour between case_match and recode #6856

manhnguyen48 opened this issue May 23, 2023 · 5 comments

Comments

@manhnguyen48
Copy link

manhnguyen48 commented May 23, 2023

Since recode is being superseded, I'm looking at changing to case_match. I usually use it to reverse code an integer vector (e.g. 1 -> 5, 5 -> 1, etc.) and have found named list/vector useful but case_match doesn't accept the same.
The named list approach is useful to include recode in other functions as named list are more straightforward to produce.
How should I adapt here?

library(dplyr, warn.conflicts = FALSE)

x <- rep(1:5, 5)
#This still works
dplyr::recode(x, !!!setNames(1:5, 5:1))
#>  [1] 5 4 3 2 1 5 4 3 2 1 5 4 3 2 1 5 4 3 2 1 5 4 3 2 1
#This does not work
dplyr::case_match(x, !!!setNames(1:5, 5:1))
#> Error in `dplyr::case_match()`:
#> ! Case 1 (`!!!setNames(1:5, 5:1)`) must be a two-sided formula, not the
#>   number 1.
#> Backtrace:
#>     ▆
#>  1. └─dplyr::case_match(x, !!!setNames(1:5, 5:1))
#>  2.   └─dplyr:::case_formula_evaluate(...)
#>  3.     └─dplyr:::map2(...)
#>  4.       └─base::mapply(.f, .x, .y, MoreArgs = list(...), SIMPLIFY = FALSE)
#>  5.         └─dplyr (local) `<fn>`(dots[[1L]][[1L]], dots[[2L]][[1L]])
#>  6.           └─dplyr:::validate_and_split_formula(...)
#>  7.             └─rlang::abort(message, call = error_call)
#Instead we have to do this
dplyr::case_match(x, 1~5, 2~4, 3~3, 4~2, 5~1)
#>  [1] 5 4 3 2 1 5 4 3 2 1 5 4 3 2 1 5 4 3 2 1 5 4 3 2 1

Created on 2023-05-23 with reprex v2.0.2

@manhnguyen48
Copy link
Author

Ok, I've figured you just need an extra step to convert the named list to a formula list and we're good. There seems to be some performance benefit too.

library(dplyr, warn.conflicts = FALSE)

x <- rep(1:5, 5)

recode_list <- setNames(1:5, 5:1)

formula_list <- paste0(names(recode_list), "~" ,recode_list) |> 
  lapply(as.formula)

bench::mark(
  recode = dplyr::recode(x, !!!recode_list), 
  case_match = dplyr::case_match(x, !!!formula_list) 
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 recode       1.99ms   2.04ms      479.     1.1MB     14.8
#> 2 case_match  199.3µs 211.15µs     4515.   297.7KB     25.6

Created on 2023-06-20 with reprex v2.0.2

@TimTeaFan
Copy link

TimTeaFan commented Jun 20, 2023

While the above is possible, I think we should at least have one recoding function that allows us the splice in arguments. After stuying the life cycle stages here it seems like "superseded" functions will not be deprecated in the future and basically stay "forever". If that is really the case then this issue can be closed, because there is not need to make adjustments to case_match(). If however dplyr::recode() is scheduled to become deprecated in one of the upcoming major releases, then we should think about how to make splicing work with case_match().

Here are a few suggestions:

  1. dplyr, or rather the larger tidyverse, should provide a reliable helper function to convert named vectors to a list of formulas, lets say rlang::as_formulas().
    I think this is the least that the tidyverse should offer in terms of usability. Especially since I’m not sure about how reliable custom functions like the one above are, given that as.formula has an env argument which might be problematic in more complex setups.

  2. case_match() should not only support formulas, but also named vectors (this is, of course, only possible for simple cases like c(a = 1, b = 2))
    I played around with this and I think its easily implemented. The question is, whether it is worth the conceptual overhead. In this case_match() wouldn’t throw an error if a named vector is supplied to the ellipsis argument. Would the documentation need to explain that the arguments in the dots can also be named vectors? Or would it be enough that the docs mention that argument splicing of named vectors is supported (and omit the fact that the formula notation for simple cases is not necessary)?

  3. case_match() should have an additional argument .splice = FALSE, which if set to TRUE converts a named vector input in the ellipsis argument into a list of formulas
    In this case, we wouldn't need rlang's tripple bang operator !!!, which would break with the tidyverse splicing style and rather resemble the behavior of bquote(). Compared to 3. this behavior would be easier to document.

  4. case_match() should have an additional optional argument .list = NULL that takes a named vector, transforms it into a list of formulas and adds it to the arguments in the ellipsis.
    This is an approach which is often used in shiny packages to allow functions to use programmatic inputs. But I’m not sure if it’s a good fit for the tidyverse. I do not recall to have seen this kind of functionality in the tidyverse.

  5. there should be a sister function, case_match2(), which accepts named vectors instead of a list of formulas.
    This style would probably best fit to the tidyverse, but the downside would be that the number of functions and thereby the complexity of dplyr increases.

  6. rlang should introduce a new splicing operator, for example !~!, which splices named vectors as a list of formulas.
    While being another option to solve the problem, I don’t think that this is a good idea. At the moment there is only case_when and case_match where this operator would be useful and splicing named vectors as arguments is much more common practice with classical recoding as seen in case_match and less needed with the more general case_when.

A remaining question is: if splicing is an important feature for recoding functions, which I think it is, then is recode() really "superseded" in the meaning of "there is a better alternative"? For interactive use case_match() is admittedly better, but not when used programmatically with a data dictionary.

@psychelzh
Copy link

See #6623, Hopefully, there might be better one.

@DavisVaughan
Copy link
Member

First - recode() being superseded means that it will never disappear.

My goal with case_match() (and also case_when() and possibly if_else()) is to move their guts to vctrs and reexport them there with a programmatic friendly interface. That already kind of exists, it's unexported from dplyr:

needles <- rep(1:5, 5)

haystacks <- as.list(1:5)
values <- as.list(5:1)

dplyr:::vec_case_match(
  needles = needles,
  haystacks = haystacks,
  values = values
)
#>  [1] 5 4 3 2 1 5 4 3 2 1 5 4 3 2 1 5 4 3 2 1 5 4 3 2 1

I think case_match() is still quite nice for a lot of ad hoc interactive usage, but vec_case_match() will be nice for programmatic usage when you can procedurally generate the things to look for and values to replace with.

@avhz
Copy link

avhz commented Sep 24, 2024

Is there any plan to allow spliced arguments in case_match/case_when (without having to convert to a list of formulas) ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants