test parsnip "overhead" #1071
Conversation
tests/testthat/test_fit_interfaces.R

```r
time_parsnip_form <-
  timing(fit(parsnip::linear_reg(), mpg ~ ., mtcars))
time_parsnip_xy <-
  timing(fit_xy(parsnip::linear_reg(), mtcars[2:11], mtcars[1]))
```
The bracket subsetting does add some additional compute that's not "our fault," though it's a fraction of `time_engine`.

ubuntu-latest (devel) failure is due to unrelated #1074.
I feel a little weird trying to test these, but I agree with the sentiment. I totally agree about the skip on CRAN! Firstly, how variable are the ratios when you test them locally?
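For context, the kind of test under discussion might look like the following sketch. The threshold value here is hypothetical and deliberately permissive, and `skip_on_cran()` reflects the skip-on-CRAN suggestion above:

```r
library(testthat)
library(parsnip)

test_that("fitting via parsnip adds limited overhead", {
  skip_on_cran()

  # elapsed time for 100 evaluations of `expr`
  timing <- function(expr) {
    expr <- substitute(expr)
    system.time(replicate(100, eval(expr)))[["elapsed"]]
  }

  time_engine       <- timing(lm(mpg ~ ., mtcars))
  time_parsnip_form <- timing(fit(linear_reg(), mpg ~ ., mtcars))

  # hypothetical cutoff: flag the PR if parsnip is >10x slower than lm()
  expect_lt(time_parsnip_form / time_engine, 10)
})
```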
Heard. Here are the distributions of those ratios locally:

```r
library(parsnip)

timing <- function(expr) {
  expr <- substitute(expr)
  system.time(replicate(100, eval(expr)))[["elapsed"]]
}

time_engine <-
  replicate(100, timing(lm(mpg ~ ., mtcars)))
time_parsnip_form <-
  replicate(100, timing(fit(linear_reg(), mpg ~ ., mtcars)))
time_parsnip_xy <-
  replicate(100, timing(fit_xy(linear_reg(), mtcars[2:11], mtcars[1])))

hist(time_parsnip_form / time_engine)
hist(time_parsnip_xy / time_engine)
```

Created on 2024-02-26 with reprex v2.1.0
Given the current timings, we could even use 1000 replicates for each and this test would only take a couple seconds, if we wanted to cut down on that variability further. I've set the thresholds to be quite permissive generally, though.
That is not a bad idea. Up to you 😄
Maybe instead of summing the elapsed times for 100 reps, take the median of the reps (or mean) and do a ratio of those.
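That suggestion might be sketched as follows (the helper name is hypothetical). One caveat: at `system.time()`'s resolution, a single fast fit can report an elapsed time of zero, so each rep may still need to time a batch of fits:

```r
library(parsnip)

# One elapsed-time measurement per rep; `rep_timings` is a hypothetical helper
rep_timings <- function(expr, reps = 100) {
  expr <- substitute(expr)
  replicate(reps, system.time(eval(expr))[["elapsed"]])
}

time_engine       <- rep_timings(lm(mpg ~ ., mtcars))
time_parsnip_form <- rep_timings(fit(linear_reg(), mpg ~ ., mtcars))

# ratio of medians rather than a ratio of summed elapsed times
median(time_parsnip_form) / median(time_engine)
```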
I like the idea of using the median timing! The only issue is that individual fits go very fast, and …
You gotta bump them numbers up
The idea to use medians was spot on. Prioritizing the consistency of these tests (i.e. avoiding false failures), I propose we add a Suggests for bench in favor of …

```r
library(ggplot2)
library(parsnip)

bm <- function() {
  res <- bench::mark(
    time_engine = lm(mpg ~ ., mtcars),
    time_parsnip_form = fit(linear_reg(), mpg ~ ., mtcars),
    time_parsnip_xy = fit_xy(linear_reg(), mtcars[2:11], mtcars[1]),
    relative = TRUE,
    check = FALSE
  )

  c(form = res$median[2], xy = res$median[3])
}

ratios <- replicate(100, bm())
ratios <- data.frame(t(ratios))

ggplot(ratios) +
  aes(x = form) +
  geom_histogram() +
  geom_vline(xintercept = 3)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(ratios) +
  aes(x = xy) +
  geom_histogram() +
  geom_vline(xintercept = 3.5)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

Created on 2024-02-27 with reprex v2.1.0

The alternative is to test this in extratests to avoid the Suggests, but these aren't really integration tests, and moving them brings them farther away from our development cycle.
The decreased ratio cutoffs in the above comment are due to speedups merged upstream from PRs in the last two days!
Looks great with bench. Not much dependency overhead either.
Some minimal testing to alert us when we've allowed additional "overhead" to creep in. A truer test of overhead would compare differences rather than ratios, but those numbers are largely system-dependent.
My intent here is not that we'd automatically reject a PR that causes this test to fail, but to let us know when a PR changes these ratios and give us a chance to consider whether the slowdown is worth the benefit or ought to be addressed.
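Putting the pieces of the thread together, the bench-based test might look like this sketch. The cutoffs of 3 and 3.5 come from the vertical lines in the histograms above, but the final thresholds in the merged test are not stated here, so treat them as illustrative:

```r
library(testthat)
library(parsnip)

test_that("fitting via parsnip adds limited overhead", {
  skip_on_cran()
  skip_if_not_installed("bench")

  # relative = TRUE scales medians so time_engine's median is 1
  res <- bench::mark(
    time_engine = lm(mpg ~ ., mtcars),
    time_parsnip_form = fit(linear_reg(), mpg ~ ., mtcars),
    time_parsnip_xy = fit_xy(linear_reg(), mtcars[2:11], mtcars[1]),
    relative = TRUE,
    check = FALSE
  )

  # illustrative cutoffs taken from the histogram reference lines
  expect_lt(res$median[2], 3)
  expect_lt(res$median[3], 3.5)
})
```

A failure here isn't meant to auto-reject a PR, only to surface the ratio change for discussion.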