Adjusting to different behaviour between `case_match` and `recode` #6856

manhnguyen48 · 2023-05-23T16:33:23Z

Since recode is being superseded, I'm looking at changing to case_match. I usually use it to reverse code an integer vector (e.g. 1 -> 5, 5 -> 1, etc.) and have found named list/vector useful but case_match doesn't accept the same.
The named list approach is useful to include recode in other functions as named list are more straightforward to produce.
How should I adapt here?

library(dplyr, warn.conflicts = FALSE)

x <- rep(1:5, 5)
#This still works
dplyr::recode(x, !!!setNames(1:5, 5:1))
#>  [1] 5 4 3 2 1 5 4 3 2 1 5 4 3 2 1 5 4 3 2 1 5 4 3 2 1
#This does not work
dplyr::case_match(x, !!!setNames(1:5, 5:1))
#> Error in `dplyr::case_match()`:
#> ! Case 1 (`!!!setNames(1:5, 5:1)`) must be a two-sided formula, not the
#>   number 1.
#> Backtrace:
#>     ▆
#>  1. └─dplyr::case_match(x, !!!setNames(1:5, 5:1))
#>  2.   └─dplyr:::case_formula_evaluate(...)
#>  3.     └─dplyr:::map2(...)
#>  4.       └─base::mapply(.f, .x, .y, MoreArgs = list(...), SIMPLIFY = FALSE)
#>  5.         └─dplyr (local) `<fn>`(dots[[1L]][[1L]], dots[[2L]][[1L]])
#>  6.           └─dplyr:::validate_and_split_formula(...)
#>  7.             └─rlang::abort(message, call = error_call)
#Instead we have to do this
dplyr::case_match(x, 1~5, 2~4, 3~3, 4~2, 5~1)
#>  [1] 5 4 3 2 1 5 4 3 2 1 5 4 3 2 1 5 4 3 2 1 5 4 3 2 1

^{Created on 2023-05-23 with reprex v2.0.2}

The text was updated successfully, but these errors were encountered:

manhnguyen48 · 2023-06-20T11:54:24Z

Ok, I've figured you just need an extra step to convert the named list to a formula list and we're good. There seems to be some performance benefit too.

library(dplyr, warn.conflicts = FALSE)

x <- rep(1:5, 5)

recode_list <- setNames(1:5, 5:1)

formula_list <- paste0(names(recode_list), "~" ,recode_list) |> 
  lapply(as.formula)

bench::mark(
  recode = dplyr::recode(x, !!!recode_list), 
  case_match = dplyr::case_match(x, !!!formula_list) 
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 recode       1.99ms   2.04ms      479.     1.1MB     14.8
#> 2 case_match  199.3µs 211.15µs     4515.   297.7KB     25.6

^{Created on 2023-06-20 with reprex v2.0.2}

TimTeaFan · 2023-06-20T17:53:29Z

While the above is possible, I think we should at least have one recoding function that allows us the splice in arguments. After stuying the life cycle stages here it seems like "superseded" functions will not be deprecated in the future and basically stay "forever". If that is really the case then this issue can be closed, because there is not need to make adjustments to case_match(). If however dplyr::recode() is scheduled to become deprecated in one of the upcoming major releases, then we should think about how to make splicing work with case_match().

Here are a few suggestions:

dplyr, or rather the larger tidyverse, should provide a reliable helper function to convert named vectors to a list of formulas, lets say rlang::as_formulas().
I think this is the least that the tidyverse should offer in terms of usability. Especially since I’m not sure about how reliable custom functions like the one above are, given that as.formula has an env argument which might be problematic in more complex setups.
case_match() should not only support formulas, but also named vectors (this is, of course, only possible for simple cases like c(a = 1, b = 2))
I played around with this and I think its easily implemented. The question is, whether it is worth the conceptual overhead. In this case_match() wouldn’t throw an error if a named vector is supplied to the ellipsis argument. Would the documentation need to explain that the arguments in the dots can also be named vectors? Or would it be enough that the docs mention that argument splicing of named vectors is supported (and omit the fact that the formula notation for simple cases is not necessary)?
case_match() should have an additional argument .splice = FALSE, which if set to TRUE converts a named vector input in the ellipsis argument into a list of formulas
In this case, we wouldn't need rlang's tripple bang operator !!!, which would break with the tidyverse splicing style and rather resemble the behavior of bquote(). Compared to 3. this behavior would be easier to document.
case_match() should have an additional optional argument .list = NULL that takes a named vector, transforms it into a list of formulas and adds it to the arguments in the ellipsis.
This is an approach which is often used in shiny packages to allow functions to use programmatic inputs. But I’m not sure if it’s a good fit for the tidyverse. I do not recall to have seen this kind of functionality in the tidyverse.
there should be a sister function, case_match2(), which accepts named vectors instead of a list of formulas.
This style would probably best fit to the tidyverse, but the downside would be that the number of functions and thereby the complexity of dplyr increases.
rlang should introduce a new splicing operator, for example !~!, which splices named vectors as a list of formulas.
While being another option to solve the problem, I don’t think that this is a good idea. At the moment there is only case_when and case_match where this operator would be useful and splicing named vectors as arguments is much more common practice with classical recoding as seen in case_match and less needed with the more general case_when.

A remaining question is: if splicing is an important feature for recoding functions, which I think it is, then is recode() really "superseded" in the meaning of "there is a better alternative"? For interactive use case_match() is admittedly better, but not when used programmatically with a data dictionary.

psychelzh · 2023-07-10T16:15:40Z

See #6623, Hopefully, there might be better one.

DavisVaughan · 2023-11-03T17:47:25Z

First - recode() being superseded means that it will never disappear.

My goal with case_match() (and also case_when() and possibly if_else()) is to move their guts to vctrs and reexport them there with a programmatic friendly interface. That already kind of exists, it's unexported from dplyr:

needles <- rep(1:5, 5)

haystacks <- as.list(1:5)
values <- as.list(5:1)

dplyr:::vec_case_match(
  needles = needles,
  haystacks = haystacks,
  values = values
)
#>  [1] 5 4 3 2 1 5 4 3 2 1 5 4 3 2 1 5 4 3 2 1 5 4 3 2 1

I think case_match() is still quite nice for a lot of ad hoc interactive usage, but vec_case_match() will be nice for programmatic usage when you can procedurally generate the things to look for and values to replace with.

avhz · 2024-09-24T11:31:46Z

Is there any plan to allow spliced arguments in case_match/case_when (without having to convert to a list of formulas) ?

DavisVaughan closed this as completed Nov 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adjusting to different behaviour between `case_match` and `recode` #6856

Adjusting to different behaviour between `case_match` and `recode` #6856

manhnguyen48 commented May 23, 2023 •

edited

Loading

manhnguyen48 commented Jun 20, 2023

TimTeaFan commented Jun 20, 2023 •

edited

Loading

psychelzh commented Jul 10, 2023

DavisVaughan commented Nov 3, 2023

avhz commented Sep 24, 2024

Adjusting to different behaviour between case_match and recode #6856

Adjusting to different behaviour between case_match and recode #6856

Comments

manhnguyen48 commented May 23, 2023 • edited Loading

manhnguyen48 commented Jun 20, 2023

TimTeaFan commented Jun 20, 2023 • edited Loading

psychelzh commented Jul 10, 2023

DavisVaughan commented Nov 3, 2023

avhz commented Sep 24, 2024

Adjusting to different behaviour between `case_match` and `recode` #6856

Adjusting to different behaviour between `case_match` and `recode` #6856

manhnguyen48 commented May 23, 2023 •

edited

Loading

TimTeaFan commented Jun 20, 2023 •

edited

Loading