[R] Speed up nrow() on filtered dataset #43659

Open
nealrichardson opened this issue Aug 12, 2024 · 1 comment

@nealrichardson
Member

Describe the enhancement requested

From the Arrow workshop at posit::conf 2024: several participants reported that

ds |>
  summarize(sum(some_filter_expression)) |>
  collect()

was faster than

ds |>
  filter(some_filter_expression) |>
  nrow()

nrow() uses Scanner$CountRows: https://github.com/apache/arrow/blob/main/r/R/dplyr.R#L186

We could replace that with something that runs an ExecPlan instead, as the comment above that line suggests, and perhaps that is more performant.
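
For illustration only, here is a rough sketch of what counting through the query engine could look like from the user side. It pushes the count into the query with summarize(n = n()), so it is evaluated with the rest of the plan rather than by nrow(). The mtcars dataset, the mpg > 20 filter, and the names n_scanner and n_plan are placeholders, not anything from the workshop, and this is not a proposal for the actual implementation inside nrow().

```r
library(arrow)
library(dplyr)

# Write a small throwaway dataset so the example is self-contained
# (mtcars is a stand-in for a real, much larger dataset).
path <- tempfile()
write_dataset(mtcars, path)
ds <- open_dataset(path)

# Current path: nrow() on the filtered query
n_scanner <- ds |>
  filter(mpg > 20) |>
  nrow()

# Possible alternative: count inside the query itself
n_plan <- ds |>
  filter(mpg > 20) |>
  summarize(n = n()) |>
  collect() |>
  pull(n)
```

Both should give the same count; whether the second form is actually faster would need benchmarking on a realistically sized dataset.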

cc @thisisnic @steph

Component(s)

R

@nealrichardson
Member Author

FWIW, I'm not seeing this, at least not on this query using a smaller sample of nyc_taxi:

bench::mark(
  old = nyc_taxi |> filter(total_amount > 100) |> nrow(), 
  new = nyc_taxi |> summarize(sum(total_amount > 100)) |> collect() |> pull()
)

# A tibble: 2 × 13
  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 old          16.3ms 17.3ms      57.1    69.1KB     11.9    24     5      421ms
2 new          20.4ms 21.3ms      46.5   129.5KB     16.4    17     6      366ms
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>

That said, the first time I did it, it was slower, but on subsequent tries it was faster. Sounds like disk caching or something?
