Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add .ptype and .size arguments, adjust NULL and empty handling #80

Merged
merged 4 commits into from
May 11, 2022

Conversation

DavisVaughan
Copy link
Member

Closes #48
Closes #64

A few changes to coalesce() as I prep to port this over to dplyr.

For #48, over slack we decided that we'd rather keep coalesce() simple and always use vctrs invariants, so when given data frames as input it should only ever update entirely missing rows (that is what vec_equal_na() thinks is "missing"). If you want to coalesce by column then you should really just map2() or pmap() over the data frames, calling coalesce() as the function to use on each column. I have added a few tests to reflect this expectation.

For #64, I had suggested we switch to coalesce(x, ...) and always cast to the type of x and recycle to the size of x. I did some research, and the SQL standards state that coalesce() should cast to the common type of all of its inputs. See this link for one example where that is documented https://docs.microsoft.com/en-us/sql/t-sql/language-elements/coalesce-transact-sql?view=sql-server-ver15#return-types. So I think we should keep that behavior by default. It sort of makes sense if you have >=2 full vectors and you don't really have a "primary" input, like coalesce(x, y, z). My original use case was something like coalesce(x, 0), where you'd want to retain the type of x.

  • To optionally enforce type and size stability in cases like coalesce(x, 0), I've added .ptype and .size arguments. I think this is good enough for what I wanted.

  • I've added list_check_all_vectors() to check that all inputs are vectors. This disallows NULL values in .... I think this is a good idea because coalesce(NULL, .size = 5) would need to return something of size 5 and there is no way to do that without typed inputs. An alternative to this is to drop all NULL inputs at the beginning, but I have a feeling a NULL input would be a user error in this function, so I prefer an error. We can change to this approach in the future if this proves to be too strict (better to start strict, I think?).

  • I've required at least 1 input, rather than returning NULL. This is the current dplyr behavior and makes sense because coalesce(.size = 5) would have to return something of size 5, and there is no way to do that if no inputs are given.

Only entirely missing rows are updated, which is inline with `vec_equal_na()`. It is entirely possible to update a missing row with a partially missing row.
@DavisVaughan DavisVaughan requested a review from hadley May 11, 2022 18:39
@hadley
Copy link
Member

hadley commented May 11, 2022

If coalesce() errors with 0 inputs, wouldn't you expect coalesce(NULL) to error? I think our general philosophy is that foo(x, NULL,NULL) should be equivalent to foo(x, ,), which is generally equivalent to foo(). I'd prefer to stick to that principle here, if we can.

R/coalesce.R Outdated Show resolved Hide resolved
R/coalesce.R Outdated Show resolved Hide resolved
R/coalesce.R Outdated Show resolved Hide resolved
R/coalesce.R Outdated Show resolved Hide resolved
@DavisVaughan
Copy link
Member Author

DavisVaughan commented May 11, 2022

If coalesce() errors with 0 inputs, wouldn't you expect coalesce(NULL) to error?

They both error, just with different errors:

coalesce()
#> Error in `coalesce()`:
#> ! `...` must contain at least one input.

coalesce(NULL)
#> Error in `coalesce()`:
#> ! `..1` must be a vector, not NULL.

I think our general philosophy is that foo(x, NULL,NULL) should be equivalent to foo(x, ,)

Yea this is what I was alluding to at the end of the second bullet point: "An alternative to this is to drop all NULL inputs at the beginning". I can do that, that would make coalesce(NULL) fail "automatically" because there would be 0 inputs after removing the nulls.

Are you ok with coalesce(NULL, 1:5) returning 1:5? I guess that is the main question here. (Seems consistent with the idea that NULL inputs are basically equivalent to not providing anything, as you suggested)

This is more in line with our standard assumption that `fun(x, NULL, NULL)` is treated like `fun(x, , )`, which is essentially `fun(x)`.
@DavisVaughan
Copy link
Member Author

Okay we now stick to this principle:

philosophy is that foo(x, NULL,NULL) should be equivalent to foo(x, ,), which is generally equivalent to foo(x)

Giving us:

coalesce()
#> Error in `coalesce()`:
#> ! `...` must contain at least one input.

coalesce(NULL)
#> Error in `coalesce()`:
#> ! `...` must contain at least one input.

coalesce(c(1, NA), NULL, 2)
#> [1] 1 2

I am fairly certain this is what you wanted, so I'll merge this. Feel free to follow up if not.

@DavisVaughan DavisVaughan merged commit 5c6070e into tidyverse:main May 11, 2022
@DavisVaughan DavisVaughan deleted the feature/tweak-coalesce branch May 11, 2022 20:04
DavisVaughan added a commit to DavisVaughan/dplyr that referenced this pull request May 11, 2022
@hadley
Copy link
Member

hadley commented May 12, 2022

Yeah, that's what I was thinking, thanks!

But reading through #64 again, I do find your arguments there compelling, even if it would make coalesce() less consistent with SQL. It's hard to imagine a case where you'd want to recycle the first argument to match the length of the second, and then you might as well apply that principle to the types as well. (And I think the only place where that makes a significant difference is whether coalesce(c(NA, 1L), 1)) returns an integer or double)

@DavisVaughan
Copy link
Member Author

It does mean you can't splice in all of the inputs at once anymore

coalesce2 <- function(x, ...) {
  funs::coalesce(x, ...)
}

vecs <- list(
  c(1, 2, NA, NA, 5),
  c(NA, NA, 3, 4, 5)
)

coalesce2(!!!vecs)
#> Error in !vecs: invalid argument type

funs::coalesce(!!!vecs)
#> [1] 1 2 3 4 5

I think the main question is deciding which of these two is the main use case:

  • coalesce(x, 0) (i.e. you have a primary first input)
  • coalesce(x, y, z) (i.e. all inputs are the same to you)

The first favors a signature of coalesce(x, ...) and the second favors coalesce(...).


The other way it is described by most of the SQL docs I found is as syntactic sugar for case_when(), which is why it takes the common type in SQL:

COALESCE(expression1, expression2, ..., expressionN)
CASE  
WHEN (expression1 IS NOT NULL) THEN expression1  
WHEN (expression2 IS NOT NULL) THEN expression2  
...  
ELSE expressionN  
END  

https://docs.microsoft.com/en-us/sql/t-sql/language-elements/coalesce-transact-sql?view=sql-server-ver15#comparing-coalesce-and-case

@DavisVaughan
Copy link
Member Author

DavisVaughan commented May 12, 2022

Assuming you have GitHub code search:
https://cs.github.com/?scopeName=All+repos&scope=&q=language%3Ar+coalesce%28%21%21%21

Approximately 20-ish distinct uses of coalesce(!!!<stuff>) that would break with a coalesce(x, ...) signature

@hadley
Copy link
Member

hadley commented May 12, 2022

Oh I forgot about the !!! case 😞. We do have a hack from aes() that we could apply here, but I don't think it's worth it.

FWIW I do think in coalesce(x, y, z) x is still primary, since that's the vector that you're filling in.

I guess we could still make the first argument of ... primary, even if we don't give it its own argument. But that starts to maybe feel maybe inconsistent? What other functions do we have that are like coalesce()? pmin() and pmax() are similar, which connects us to the set of functions suggested in #5. I guess it's also related to case_when() and if_else(). if_else() uses the size of the first argument, but the common type of all four.

@DavisVaughan
Copy link
Member Author

Base R's pmin() and pmax() pull attributes from the first input, but I don't think we should let that behavior sway us.

I'd argue that ideally pmin(), pmax(), case_when(), and if_else() should all take the common type by default. I do think all 4 should have optional ptype and size arguments to override the default common type/size computation. That was an improvement I was planning on doing when I rewrite case_when() and if_else() with vctrs.

The main difference between those 4 and coalesce() is that coalesce() seems to have a primary input, and those don't.

However, if we stick with computing the common type for the 4 functions mention above, then this comment would indeed feel inconsistent, as you suggested:

make the first argument of ... primary

@hadley
Copy link
Member

hadley commented May 12, 2022

Ok, that reasoning sounds solid to me.

One more thought: we should probably add a check_dots_unnamed() here.

@DavisVaughan
Copy link
Member Author

DavisVaughan commented May 13, 2022

Re: check_dots_unnamed()

There are also a number of cases that code search brought up where people are splicing in data frames. Like they used select() to tidyselect some columns and then spliced those in. So this would break that :/

Here is one example:
https://cs.github.com/CorrelAid/projectutils/blob/a2984cc11ca1c88aeb3436c8b661f16356b1903d/R/surveymonkey.R#L127

@hadley
Copy link
Member

hadley commented May 13, 2022

Oh bummer 😞

DavisVaughan added a commit to DavisVaughan/dplyr that referenced this pull request Jul 5, 2022
DavisVaughan added a commit to DavisVaughan/dplyr that referenced this pull request Jul 11, 2022
DavisVaughan added a commit to tidyverse/dplyr that referenced this pull request Jul 11, 2022
* Port tidyverse/funs#80 to `coalesce()`

* NEWS bullet

* Use `vec_case_when()` infrastructure in `coalesce()`

* Backtick `NULL`

* Tweak parameter documentation one more time
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

coalesce() invariants Option to coalesce by column with data frames?
2 participants