Add `.ptype` and `.size` arguments, adjust `NULL` and empty handling #80

DavisVaughan · 2022-05-11T18:27:10Z

Closes #48
Closes #64

A few changes to coalesce() as I prep to port this over to dplyr.

For #48, over slack we decided that we'd rather keep coalesce() simple and always use vctrs invariants, so when given data frames as input it should only ever update entirely missing rows (that is what vec_equal_na() thinks is "missing"). If you want to coalesce by column then you should really just map2() or pmap() over the data frames, calling coalesce() as the function to use on each column. I have added a few tests to reflect this expectation.

For #64, I had suggested we switch to coalesce(x, ...) and always cast to the type of x and recycle to the size of x. I did some research, and the SQL standards state that coalesce() should cast to the common type of all of its inputs. See this link for one example where that is documented https://docs.microsoft.com/en-us/sql/t-sql/language-elements/coalesce-transact-sql?view=sql-server-ver15#return-types. So I think we should keep that behavior by default. It sort of makes sense if you have >=2 full vectors and you don't really have a "primary" input, like coalesce(x, y, z). My original use case was something like coalesce(x, 0), where you'd want to retain the type of x.

To optionally enforce type and size stability in cases like coalesce(x, 0), I've added .ptype and .size arguments. I think this is good enough for what I wanted.
I've added list_check_all_vectors() to check that all inputs are vectors. This disallows NULL values in .... I think this is a good idea because coalesce(NULL, .size = 5) would need to return something of size 5 and there is no way to do that without typed inputs. An alternative to this is to drop all NULL inputs at the beginning, but I have a feeling a NULL input would be a user error in this function, so I prefer an error. We can change to this approach in the future if this proves to be too strict (better to start strict, I think?).
I've required at least 1 input, rather than returning NULL. This is the current dplyr behavior and makes sense because coalesce(.size = 5) would have to return something of size 5, and there is no way to do that if no inputs are given.

Only entirely missing rows are updated, which is inline with `vec_equal_na()`. It is entirely possible to update a missing row with a partially missing row.

hadley · 2022-05-11T18:48:37Z

If coalesce() errors with 0 inputs, wouldn't you expect coalesce(NULL) to error? I think our general philosophy is that foo(x, NULL,NULL) should be equivalent to foo(x, ,), which is generally equivalent to foo(). I'd prefer to stick to that principle here, if we can.

R/coalesce.R

DavisVaughan · 2022-05-11T18:55:51Z

If coalesce() errors with 0 inputs, wouldn't you expect coalesce(NULL) to error?

They both error, just with different errors:

coalesce()
#> Error in `coalesce()`:
#> ! `...` must contain at least one input.

coalesce(NULL)
#> Error in `coalesce()`:
#> ! `..1` must be a vector, not NULL.

I think our general philosophy is that foo(x, NULL,NULL) should be equivalent to foo(x, ,)

Yea this is what I was alluding to at the end of the second bullet point: "An alternative to this is to drop all NULL inputs at the beginning". I can do that, that would make coalesce(NULL) fail "automatically" because there would be 0 inputs after removing the nulls.

Are you ok with coalesce(NULL, 1:5) returning 1:5? I guess that is the main question here. (Seems consistent with the idea that NULL inputs are basically equivalent to not providing anything, as you suggested)

This is more in line with our standard assumption that `fun(x, NULL, NULL)` is treated like `fun(x, , )`, which is essentially `fun(x)`.

DavisVaughan · 2022-05-11T19:20:24Z

Okay we now stick to this principle:

philosophy is that foo(x, NULL,NULL) should be equivalent to foo(x, ,), which is generally equivalent to foo(x)

Giving us:

coalesce()
#> Error in `coalesce()`:
#> ! `...` must contain at least one input.

coalesce(NULL)
#> Error in `coalesce()`:
#> ! `...` must contain at least one input.

coalesce(c(1, NA), NULL, 2)
#> [1] 1 2

I am fairly certain this is what you wanted, so I'll merge this. Feel free to follow up if not.

hadley · 2022-05-12T20:35:38Z

Yeah, that's what I was thinking, thanks!

But reading through #64 again, I do find your arguments there compelling, even if it would make coalesce() less consistent with SQL. It's hard to imagine a case where you'd want to recycle the first argument to match the length of the second, and then you might as well apply that principle to the types as well. (And I think the only place where that makes a significant difference is whether coalesce(c(NA, 1L), 1)) returns an integer or double)

DavisVaughan · 2022-05-12T20:50:28Z

It does mean you can't splice in all of the inputs at once anymore

coalesce2 <- function(x, ...) {
  funs::coalesce(x, ...)
}

vecs <- list(
  c(1, 2, NA, NA, 5),
  c(NA, NA, 3, 4, 5)
)

coalesce2(!!!vecs)
#> Error in !vecs: invalid argument type

funs::coalesce(!!!vecs)
#> [1] 1 2 3 4 5

I think the main question is deciding which of these two is the main use case:

coalesce(x, 0) (i.e. you have a primary first input)
coalesce(x, y, z) (i.e. all inputs are the same to you)

The first favors a signature of coalesce(x, ...) and the second favors coalesce(...).

The other way it is described by most of the SQL docs I found is as syntactic sugar for case_when(), which is why it takes the common type in SQL:

COALESCE(expression1, expression2, ..., expressionN)

CASE  
WHEN (expression1 IS NOT NULL) THEN expression1  
WHEN (expression2 IS NOT NULL) THEN expression2  
...  
ELSE expressionN  
END

https://docs.microsoft.com/en-us/sql/t-sql/language-elements/coalesce-transact-sql?view=sql-server-ver15#comparing-coalesce-and-case

DavisVaughan · 2022-05-12T21:02:04Z

Assuming you have GitHub code search:
https://cs.github.com/?scopeName=All+repos&scope=&q=language%3Ar+coalesce%28%21%21%21

Approximately 20-ish distinct uses of coalesce(!!!<stuff>) that would break with a coalesce(x, ...) signature

hadley · 2022-05-12T21:20:38Z

Oh I forgot about the !!! case 😞. We do have a hack from aes() that we could apply here, but I don't think it's worth it.

FWIW I do think in coalesce(x, y, z) x is still primary, since that's the vector that you're filling in.

I guess we could still make the first argument of ... primary, even if we don't give it its own argument. But that starts to maybe feel maybe inconsistent? What other functions do we have that are like coalesce()? pmin() and pmax() are similar, which connects us to the set of functions suggested in #5. I guess it's also related to case_when() and if_else(). if_else() uses the size of the first argument, but the common type of all four.

DavisVaughan · 2022-05-12T22:16:47Z

Base R's pmin() and pmax() pull attributes from the first input, but I don't think we should let that behavior sway us.

I'd argue that ideally pmin(), pmax(), case_when(), and if_else() should all take the common type by default. I do think all 4 should have optional ptype and size arguments to override the default common type/size computation. That was an improvement I was planning on doing when I rewrite case_when() and if_else() with vctrs.

The main difference between those 4 and coalesce() is that coalesce() seems to have a primary input, and those don't.

However, if we stick with computing the common type for the 4 functions mention above, then this comment would indeed feel inconsistent, as you suggested:

make the first argument of ... primary

hadley · 2022-05-12T23:49:27Z

Ok, that reasoning sounds solid to me.

One more thought: we should probably add a check_dots_unnamed() here.

DavisVaughan · 2022-05-13T12:58:40Z

Re: check_dots_unnamed()

There are also a number of cases that code search brought up where people are splicing in data frames. Like they used select() to tidyselect some columns and then spliced those in. So this would break that :/

Here is one example:
https://cs.github.com/CorrelAid/projectutils/blob/a2984cc11ca1c88aeb3436c8b661f16356b1903d/R/surveymonkey.R#L127

hadley · 2022-05-13T18:27:50Z

Oh bummer 😞

* Port tidyverse/funs#80 to `coalesce()` * NEWS bullet * Use `vec_case_when()` infrastructure in `coalesce()` * Backtick `NULL` * Tweak parameter documentation one more time

DavisVaughan added 2 commits May 11, 2022 13:44

Add .ptype and .size arguments, adjust NULL and empty handling

b77c566

Add tests for our expectations about data frame / array input

fe6d947

Only entirely missing rows are updated, which is inline with `vec_equal_na()`. It is entirely possible to update a missing row with a partially missing row.

DavisVaughan requested a review from hadley May 11, 2022 18:39

hadley approved these changes May 11, 2022

View reviewed changes

R/coalesce.R Outdated Show resolved Hide resolved

R/coalesce.R Outdated Show resolved Hide resolved

R/coalesce.R Outdated Show resolved Hide resolved

R/coalesce.R Outdated Show resolved Hide resolved

DavisVaughan added 2 commits May 11, 2022 15:14

Discard NULL values up front

a7a5d09

This is more in line with our standard assumption that `fun(x, NULL, NULL)` is treated like `fun(x, , )`, which is essentially `fun(x)`.

Refine documentation using review feedback

77c0505

DavisVaughan merged commit 5c6070e into tidyverse:main May 11, 2022

DavisVaughan deleted the feature/tweak-coalesce branch May 11, 2022 20:04

DavisVaughan added a commit to DavisVaughan/dplyr that referenced this pull request May 11, 2022

Port tidyverse/funs#80 to coalesce()

b2a40ec

DavisVaughan mentioned this pull request May 11, 2022

Update coalesce() to use vec_case_when() tidyverse/dplyr#6265

Merged

DavisVaughan added a commit to DavisVaughan/dplyr that referenced this pull request Jul 5, 2022

Port tidyverse/funs#80 to coalesce()

bd40c9b

DavisVaughan added a commit to DavisVaughan/dplyr that referenced this pull request Jul 11, 2022

Port tidyverse/funs#80 to coalesce()

00f1a94

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `.ptype` and `.size` arguments, adjust `NULL` and empty handling #80

Add `.ptype` and `.size` arguments, adjust `NULL` and empty handling #80

DavisVaughan commented May 11, 2022

hadley commented May 11, 2022 •

edited

Loading

DavisVaughan commented May 11, 2022 •

edited

Loading

DavisVaughan commented May 11, 2022

hadley commented May 12, 2022

DavisVaughan commented May 12, 2022

DavisVaughan commented May 12, 2022 •

edited

Loading

hadley commented May 12, 2022

DavisVaughan commented May 12, 2022

hadley commented May 12, 2022

DavisVaughan commented May 13, 2022 •

edited

Loading

hadley commented May 13, 2022

Add .ptype and .size arguments, adjust NULL and empty handling #80

Add .ptype and .size arguments, adjust NULL and empty handling #80

Conversation

DavisVaughan commented May 11, 2022

hadley commented May 11, 2022 • edited Loading

DavisVaughan commented May 11, 2022 • edited Loading

DavisVaughan commented May 11, 2022

hadley commented May 12, 2022

DavisVaughan commented May 12, 2022

DavisVaughan commented May 12, 2022 • edited Loading

hadley commented May 12, 2022

DavisVaughan commented May 12, 2022

hadley commented May 12, 2022

DavisVaughan commented May 13, 2022 • edited Loading

hadley commented May 13, 2022

Add `.ptype` and `.size` arguments, adjust `NULL` and empty handling #80

Add `.ptype` and `.size` arguments, adjust `NULL` and empty handling #80

hadley commented May 11, 2022 •

edited

Loading

DavisVaughan commented May 11, 2022 •

edited

Loading

DavisVaughan commented May 12, 2022 •

edited

Loading

DavisVaughan commented May 13, 2022 •

edited

Loading