`mutate(.by_row =)`, `reframe(.by_row =)`, and possibly `filter(.by_row =)` #6660

DavisVaughan · 2023-01-25T15:31:55Z

Related to #4723

With the introduction of .by, it seems reasonable to once again reconsider rowwise() as well. I think we are convinced that the idea of rowwise is useful, but the implementation could possibly be improved. A few pain points:

rowwise() is a form of persistent grouping, but you rarely want it on for more than 1 operation
ungroup() is an odd verb for turning off rowwise behavior
It still sucks that you need summarise(model = list(lm(...))), i.e. the list() wrapping is manual
Maintaining the rowwise_df class is difficult and error prone for us
There are very few times where rowwise behavior is actually useful. I think the two cases are mutate() and reframe().

With that in mind, I'd like to suggest a two-part replacement for rowwise():

Two per-operation rowwise verbs, mutate_row() and reframe_row(). These become the only two places in dplyr where rowwise behavior is applicable.
Give mutate(), summarise(), reframe(), mutate_row(), and reframe_row() the ability to automatically wrap scalars in a list. i.e. if vec_is(elt) is FALSE, wrap automatically into a list. This means that value could never exist in a data frame column as is, so there is no ambiguity about wrapping and it is fairly easy to explain.

Those two proposals result in the following new patterns:

# dplyr 1.1.0
iris %>%
  tidyr::nest(.by = Species) %>%
  rowwise(Species) %>%
  mutate(model = list(lm(Petal.Length ~ Sepal.Length, data = data))) %>%
  reframe(broom::tidy(model))
   
# New 1:
# (note the lack of list(), and no persistent rowwise-ness)
# (note how we carry Species along in the reframe_row() call)
iris %>%
  tidyr::nest(.by = Species) %>%
  mutate_row(model = lm(Petal.Length ~ Sepal.Length, data = data)) %>%
  reframe_row(Species, broom::tidy(model))

# New 2:
# (note that even summarise() doesn't need manual list() wrapping)
iris %>%
  summarise(
    model = lm(Petal.Length ~ Sepal.Length, data = pick(everything())),
    .by = Species
  ) %>%
  reframe_row(Species, broom::tidy(model))

# All result in:

#> # A tibble: 6 × 6
#>   Species    term         estimate std.error statistic  p.value
#>   <fct>      <chr>           <dbl>     <dbl>     <dbl>    <dbl>
#> 1 setosa     (Intercept)     0.803    0.344      2.34  2.38e- 2
#> 2 setosa     Sepal.Length    0.132    0.0685     1.92  6.07e- 2
#> 3 versicolor (Intercept)     0.185    0.514      0.360 7.20e- 1
#> 4 versicolor Sepal.Length    0.686    0.0863     7.95  2.59e-10
#> 5 virginica  (Intercept)     0.610    0.417      1.46  1.50e- 1
#> 6 virginica  Sepal.Length    0.750    0.0630    11.9   6.30e-16

This two part proposal has the very nice property that the difference between mutate() and mutate_row() becomes purely about column access:

mutate() accesses columns using vec_slice() / [
mutate_row() accesses columns using vec_slice2() / [[

In other words, rowwise has nothing to do with the output type of each column expression, and you still get useful results.
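As a rough base R illustration of the `[` versus `[[` distinction (the `nested` data frame below is made up for the example and is not taken from dplyr's internals):

```r
# A small data frame with a list-column, similar in shape to tidyr::nest() output.
nested <- data.frame(id = 1:2)
nested$data <- list(c(1, 2, 3), c(10, 20))

# `[`-style slicing keeps the list wrapper, even for a single row.
chunk <- nested$data[1]

# `[[`-style access extracts the element itself, which is what a
# rowwise expression would get to work with.
element <- nested$data[[1]]

str(chunk)    # a list of length 1 holding a numeric vector
str(element)  # the numeric vector itself
```

With `[`, an expression such as `lm(..., data = data)` would receive a list; with `[[` it receives the nested data itself.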

In terms of other invariants, there is one related to vec_size():

mutate_row() requires each expression to return an element of vec_size() == 1
reframe_row() allows each expression to return an element of any size
(the size invariant is enforced after list wrapping)
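The same size rules can already be observed with today's `rowwise()`, which the proposed verbs would presumably mirror. A minimal sketch using dplyr 1.1.0 or later, with a made-up `df`:

```r
library(dplyr)

df <- tibble(id = 1:2, n = c(2L, 3L))

# reframe() on a rowwise data frame: each row may expand to any number of rows.
expanded <- df %>%
  rowwise(id) %>%
  reframe(value = seq_len(n))

# mutate() on a rowwise data frame: a per-row result that is not size 1
# is rejected, so the error is captured here instead of stopping the script.
size_error <- tryCatch(
  df %>% rowwise() %>% mutate(value = seq_len(n)),
  error = function(e) e
)

nrow(expanded)
inherits(size_error, "error")
```

Here `expanded` has one row per generated value (two for the first `id`, three for the second), while the `mutate()` call fails because its per-row results are larger than size 1.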

Other niceties:

It becomes very clear when you are doing a rowwise operation, because it is in the name of the verb (similar to .by being in the verb)
Somewhat obvious, but it means rowwise behavior isn't persistent. You always have bare tibble in, bare tibble out, which greatly simplifies things.

Extra notes:

Somewhat obvious, but mutate_row() and reframe_row() won't get .by because they operate "by row"
We don't want to teach .by about rowwise behavior, like .by = .row or something. We want .by to be pure tidyselect. Plus this special behavior would only apply for mutate() and reframe() and that would be very confusing.
We do not need summarise_row(). This would have the exact same semantics as mutate_row(), but would just drop unused columns (which can mostly be done with .keep in mutate_row()). In particular summarise_row() and mutate_row() would both have to have the vec_size() == 1 invariant from above, so we really don't need both.
There is no need for filter_row(). The only useful thing I can think of is something like filter_row(!is.null(model)) for filtering out NULL list elements. But you can do that way more efficiently with an ungrouped call to filter(!funs::is_na(model)).

mutate_row() and reframe_row() mostly have the semantics of the wrappers below, but this doesn't do the automatic list-wrapping of scalars:

mutate_row <- function(.data, ...) {
  .data <- rowwise(.data)
  .data <- mutate(.data, ...)
  ungroup(.data)
}

reframe_row <- function(.data, ...) {
  .data <- rowwise(.data)
  reframe(.data, ...)
}
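The automatic list-wrapping that these wrappers leave out could look roughly like the helper below. This is only a sketch: `wrap_if_scalar()` is a hypothetical name, not a dplyr or vctrs function, and it assumes the rule stated earlier (wrap whenever `vec_is()` is `FALSE`).

```r
library(vctrs)

# Hypothetical helper: wrap anything vctrs does not consider a vector in a list.
wrap_if_scalar <- function(x) {
  if (vec_is(x)) x else list(x)
}

fit <- lm(Petal.Length ~ Sepal.Length, data = iris)

# An lm object is not a vctrs vector, so it is wrapped in a length-1 list.
wrapped <- wrap_if_scalar(fit)

# An integer vector is already a vector and is returned unchanged.
untouched <- wrap_if_scalar(1:3)
```

Because a non-vector value can never be stored in a data frame column directly, wrapping it is the only way to keep it, which is why the rule is unambiguous.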


hadley · 2023-01-25T18:45:22Z

Sounds good!

romainfrancois · 2023-01-27T09:21:58Z

I like this a lot. Reading the first part, I thought about a .by = row() or something, but the extra notes convinced me otherwise.

So, now wandering in another direction, which I know is a bit silly, but what if %>% mutate_by(<tidy select ... >)(...)

lionel- · 2023-02-01T16:20:01Z

I like the idea of automatically wrapping scalars in a list. This is the sort of thing that vctrs makes possible in a predictable and consistent manner.

However, I feel like we should commit to the argument syntax of .by, even if it ends up being a different argument for the reasons that you mention. I find it not very consistent to modify the semantics of execution with two completely different syntaxes. It also increases the API surface (one more thing to know about, less discoverable than an argument).

So in this case I'd like us to consider using an argument. It could be a simple boolean:

df |> mutate(foo(bar), .by = baz)         # By group
df |> mutate(foo(bar), .by_rows = TRUE)   # By row

We could also add a variant of .by that is data-masked instead of a selection. It would create a grouping variable on the fly that is not retained in the data frame (doesn't change the shape, would be automatically named). It's occasionally useful in interactive analysis to create a variable on the fly to group with and if we supersede group_by() I think we'll be missing an easy way to group by a temporary variable:

# Like `.by_row` but `[` subsetting
df |> mutate(foo(bar), .by_vector = 1:n())

df |> summarise(foo(bar), .by_vector = cut(baz, 3))

In this case we'd end up with a trio of complementary arguments that change the semantics of evaluation: .by = (groups, tidyselection), .by_vector = (groups, data-masked), .by_rows = (rows, boolean).

I think using modifiers instead of variants fits the general evolution of the dplyr API, e.g. we've removed the suffixed variants of the verbs in favour of across().

DavisVaughan · 2023-02-01T19:40:39Z

I'd be open to .by_row as a boolean argument to mutate() and reframe(). It does feel better than re-adding suffixed variants since we worked so hard to back away from those in 1.0.0. It would be much less confusing than .by = .row because only the verbs that have rowwise support would get that argument.

I'm also slightly more empathetic to the idea of also adding this to filter(), since Hadley had some old R4DS example that did something equivalent to filter(df, is.numeric(list_col), .by_row = TRUE)

torfason · 2024-05-14T18:13:10Z

I see that my suggestion for allowing .by=row_number() was closed as a duplicate of this issue. I agree that the key thing is to have this work in a good way, but I would suggest that there is a considerable benefit with regard to discoverability in having .by=row_number(). I think that this is different from .by = .row, since .by=row_number() would be a legal argument to all functions that take .by (meaning the same as dat |> mutate(rowid=row_number()) |> mutate(..., .by=rowid)); it is just that the result would be more or less applicable for different functions.
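For reference, the two-step equivalent mentioned above already runs in dplyr 1.1.0 or later. A small sketch with a made-up `dat`, compared against `rowwise()`:

```r
library(dplyr)

dat <- tibble(a = c(1, 5), b = c(4, 2))

# Materialise a row id first, then group by it with .by.
by_rowid <- dat |>
  mutate(rowid = row_number()) |>
  mutate(largest = max(a, b), .by = rowid)

# The rowwise() version of the same computation.
by_rowwise <- dat |>
  rowwise() |>
  mutate(largest = max(a, b)) |>
  ungroup()
```

Both give the per-row maximum in `largest`; the first version leaves an extra `rowid` column behind, which is part of what a dedicated row-wise option would avoid.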

Anyway, just wanted to voice this. In the end I trust your judgment and will hold my peace regarding this issue forevermore. Thanks for the ongoing dedication to and improvement of dplyr and friends!

ggrothendieck · 2024-10-27T13:09:14Z

Maybe row_number() or some synonym of it could be added to select expression syntax so that any use of a select expression could access that pseudo column for greater consistency rather than having special verbs. I get the impression that the reason not to consider that is more implementation related but the consistency of the user interface should be the primary consideration.

Currently, if there is a column which is unique I will use that, or if I am sure that there are no duplicate rows then .by=everything(), but the latter is still not ideal.

twhitehead · 2025-02-09T05:13:46Z

Had been considering opening a feature request to add .by = 'row' to get rowwise() behaviour (only have a superficial understanding, but figured it could maybe work as 'row' is quoted while tidy select statements are bare?) and found this.

Now that I have read through @DavisVaughan's proposal, I like that better. While I would be happy with however it got done, I don't think adding a .by_row = TRUE argument is as clean as it adds the ability to set both .by and .by_row, which makes no sense. Within reason, it is better to not be able to specify nonsense than have it blow up at runtime.

I would agree that creating new functions would not be good if it meant every function that took .by would now need a new _row variant. That would suggest it probably should be tied to .by itself. However, after reading @DavisVaughan's argument that it is only a mutate and a reframe, I think maybe it is cleaner to have them by themselves.

I believe both mutate and summarize are essentially restricted versions of reframe. Presumably there is some sort of balance in these choices. For what it is worth, if suffixes are unpopular, in my mind mutate_byrow is actually just map where you don't have to specify the bindings and reframe_byrow is likewise mapconcat. 😁

ggrothendieck · 2025-02-09T14:15:40Z

"We don't want to teach .by about rowwise behavior, like .by = .row or something. We want .by to be pure tidyselect.
Plus this special behavior would only apply for mutate() and reframe() and that would be very confusing."

Yes but the suggestion was to add row_number() to tidy select so it would be available wherever tidy select is available so one would not be teaching .by= separately.

twhitehead · 2025-02-09T15:46:13Z

What I was trying to get at is that other tidy select operators (e.g., starts_with(), everything(), etc.) are all about selecting and make sense everywhere you want to select.

Adding row_number() breaks this pattern as it is about grouping and not selecting. For example, what would these mean

select(row_number())
summarize(across(row_number(), sum))
mutate(y = sin(x), .by = !row_number())

etc.

ggrothendieck · 2025-02-10T14:31:29Z

row_number() is a pseudo column with name row_number and it acts just like a normal column. There really is no difference. The first two examples return data.frame(row_number = row_number()) just as if row_number were a column and the last is really no different than asking what BOD %>% mutate(y = sin(demand), .by = !Time) would return.

To me it makes more sense to extend tidy-select than to introduce a bunch of special case functions. The extension can have uses in other contexts too as these examples show.

DanChaltiel · 2025-02-10T14:58:46Z

I don't think adding a .by_row = TRUE argument is as clean as it adds the ability to set both .by and .by_row, which makes no sense. Within reason, it is better to not be able to specify nonsense than have it blow up at runtime.

IMHO, this is the best argument.

There will be defensive programming either way, throwing an error either if .by and .by_row are used at the same time or if .by is used in a non-logical place like in summarize(). The latter error would feel much more natural to me.

As using row_number() in a mutate call is the usual way for many people (like me) to replace rowwise(), using .by=row_number() is the most intuitive solution, one that I've written many times wondering why it didn't work.
Note that group_by(row_number()) works as expected.

twhitehead · 2025-02-10T15:11:00Z

You have to be a bit careful with row_number() (the integer ranking function) as it doesn't actually give you the row number. Rather it seems to be the rank in an order maintaining sort. That is

> x = c(1,6,3,8)
> row_number(x)
[1] 1 3 2 4
> x[row_number(x)]
[1] 1 3 6 8
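For comparison, a small sketch of the two call forms side by side (the tibble is made up for the example): with an argument, `row_number(x)` ranks the values of `x`, while the argument-free form inside a data-masking verb gives the position of each row.

```r
library(dplyr)

ranks <- tibble(x = c(1, 6, 3, 8)) |>
  mutate(
    rank_of_x = row_number(x),  # rank of each value of x
    position  = row_number()    # 1, 2, ..., n() within the current group
  )

ranks
```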

twhitehead · 2025-02-10T17:12:09Z

@ggrothendieck I understand the idea and the appeal, but I think there may be some devil in the details in figuring out

how it would work through the tidy select specification, and
what functions that use it should do if any pseudo columns wind up selected.

Specifically, adding pseudo columns means you have to deal with the fact that

you can have stuff selected that isn't actually in the data set,
you can have stuff selected that doesn't actually have a name.

For example, let's take the algebra portion. How do these operators work?

!(pseudo columns, regular columns)
(pseudo columns, regular columns) & (pseudo columns, regular columns)
(pseudo columns, regular columns) | (pseudo columns, regular columns)

It might be tempting to say it is just like a regular column. Doing this would then mean something like

mutate(across(starts_with(!time), cumsum))

is now also computing the cumulative sum of the pseudo row number column (as !time selects it because it is just like any other column).

And once it has computed this value, what is it supposed to do with it? It can't write it back to the pseudo column because it doesn't actually exist. Does it create a new column named row_number? And what happens if that name already exists? Do you overwrite it? Do you create a new name. All seem to have issues.

And then what about things like everything(). Does this include pseudo columns? Or what about where()? And, whatever the answer, you have to consider the fact it will get negated too, and what would happen when you feed it into any of the many places tidy selection is used.

twhitehead · 2025-02-10T20:03:34Z

I guess the other question about row_number() (the suggested tidy select pseudo column) is, assuming suitable semantics could be nailed down, would it be useful outside of mutate/reframe?

An answer of no might suggest that really it is more about mutate/reframe and should probably live there, while an answer of yes might suggest that tidy select might be the most sensible place for it.

ggrothendieck · 2025-02-11T16:17:29Z

The examples already given show examples of row_number() in tidy-select context outside of mutate. e.g. select(BOD, row_number(), Time) gives row_number and Time columns. In absence of row_number() it would require a mutate and a select rather than just a select.

Also, Dan's post points out that group_by(row_number()) works now, which suggests thinking at the same time about what else in data masking could usefully be supported by tidy-select as well.

twhitehead · 2025-02-11T16:41:15Z

group_by doesn't use tidy select. Rather it is variables or computations to group by. Compare

> t = tibble(x=c(0,0,0,0,1,1,1,1),y=c(0,0,1,1,0,0,1,1),z=c(0,1,0,1,0,1,0,1))

> t |> group_by(!x) |> summarize(across(everything(),max))
# A tibble: 2 × 4
  `!x`      x     y     z
  <lgl> <dbl> <dbl> <dbl>
1 FALSE     1     1     1
2 TRUE      0     1     1

to

> t |> summarize(across(everything(),max),.by=!x)
# A tibble: 4 × 3
      y     z     x
  <dbl> <dbl> <dbl>
1     0     0     1
2     0     1     1
3     1     0     1
4     1     1     1

ggrothendieck · 2025-02-11T16:58:13Z

@twhitehead, I know that. The point of my comment was that since row_number() works in data masking that adding it to tidy-select can be regarded as having tidy-select support a data masking construct so the natural question is what else in data masking could be supported in tidy-select.

twhitehead · 2025-02-14T14:52:52Z

Was working on some rowwise() code for some highly nested data with a lot of n-dimensional coordinate calculations, and, in addition to a lot of list(...) wrappers, I am seeing a lot of matrix(..., 1) wrappers. As in

rowwise(data) |>
mutate(var1 = list(...),
       var2 = matrix(..., 1),
       var3 = matrix(..., 1),
       var4 = list(...),
       ...) |>
ungroup()

This makes me think, with regard to @DavisVaughan's second point about turning something that would be an error into something sensible

Give mutate(), summarise(), reframe(), mutate_row(), and reframe_row() the ability to automatically wrap scalars in a list. i.e. if vec_is(elt) is FALSE, wrap automatically into a list. This means that value could never exist in a data frame column as is, so there is no ambiguity about wrapping and it is fairly easy to explain

maybe it would also make sense for the ones that have to return single rows (the *_row() variants and summarize()) to consider any returned vectors to be legal row vectors instead of illegal column vectors. That would make the above just

mutate_row(data,
           var1 = ...,
           var2 = ...,
           var3 = ...,
           var4 = ...,
           ...)

DavisVaughan added the feature a feature request or enhancement label Jan 26, 2023

DavisVaughan changed the title ~~mutate_row() and reframe_row()~~ mutate(.by_row =), reframe(.by_row =), and possibly filter(.by_row =) Feb 8, 2023

DavisVaughan mentioned this issue May 14, 2024

Allow .by=row_number() in mutate statements #7009

Closed

DavisVaughan mentioned this issue May 16, 2024

feat: Reexport non-deprecated dplyr functions tidyverse/duckplyr#163

Merged

DavisVaughan mentioned this issue Nov 18, 2024

Feature request: A function to check if a set of variables form a unique ID in a dataframe #7098

Closed
