Don't coerce sparse data to non-sparse during predict() #950

EmilHvitfeldt · 2023-04-12T17:09:30Z

This bug was first reported here: https://community.rstudio.com/t/predict-not-working-with-ranger-model-when-using-sparse-data/163352/7

When you call predict(), you eventually run prepare_data(), which didn't know about the allow_sparse_x encoding, so it would try to convert the data to a matrix or data.frame. This was bad for two reasons. First, we lose some performance. Second, the ranger method wants the data as a data.frame, which sparse data can't be converted to, yielding the error seen below.
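
The pass-through idea can be illustrated with a minimal sketch. The helper `prepare_new_data()` and its `interface` and `allow_sparse_x` arguments are hypothetical and only mirror the behaviour described above; this is not parsnip's actual prepare_data() code.

```r
library(Matrix)

# Hypothetical stand-in for the coercion step; all names here are assumptions.
prepare_new_data <- function(new_data, interface = "data.frame", allow_sparse_x = FALSE) {
  # Engines that declare sparse support get the sparse matrix back untouched.
  if (allow_sparse_x && methods::is(new_data, "sparseMatrix")) {
    return(new_data)
  }
  # Otherwise fall back to the dense representation the engine expects.
  dense <- as.matrix(new_data)
  if (identical(interface, "data.frame")) as.data.frame(dense) else dense
}

# Small 3 x 3 sparse matrix with three non-zero entries.
x <- sparseMatrix(i = c(1, 3, 2), j = c(1, 2, 3), x = c(1, 2, 3), dims = c(3, 3))

kept  <- prepare_new_data(x, allow_sparse_x = TRUE)   # stays sparse
dense <- prepare_new_data(x, allow_sparse_x = FALSE)  # becomes a data.frame
```

The early return is the whole point of the sketch: when the engine can take sparse input, nothing downstream has to densify it.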

Main

library(parsnip)

data(agaricus.train, package = 'xgboost')
train <- agaricus.train

rf_model <- parsnip::rand_forest(trees = 100) %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

rf_fit <- fit_xy(rf_model, x = train$data, y = factor(train$label))

predict(rf_fit, train$data)
#> Error in as.data.frame.default(new_data): cannot coerce class 'structure("dgCMatrix", package = "Matrix")' to a data.frame

This PR

library(parsnip)

data(agaricus.train, package = 'xgboost')
train <- agaricus.train

rf_model <- parsnip::rand_forest(trees = 100) %>% 
  set_engine("ranger") %>% 
  set_mode("classification")

rf_fit <- fit_xy(rf_model, x = train$data, y = factor(train$label))

predict(rf_fit, train$data)
#> # A tibble: 6,513 × 1
#>    .pred_class
#>    <fct>      
#>  1 1          
#>  2 0          
#>  3 0          
#>  4 1          
#>  5 0          
#>  6 0          
#>  7 0          
#>  8 1          
#>  9 0          
#> 10 0          
#> # ℹ 6,503 more rows

EmilHvitfeldt · 2023-04-12T17:14:33Z

Small speedup benchmark. (This is likely to vary considerably across data sets, but it should almost always be in the right direction.)

Code used for the following reprexes

library(parsnip)

data(agaricus.train, package = 'xgboost')
train <- agaricus.train

xg_model <- parsnip::boost_tree(trees = 100) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")

xg_fit <- fit_xy(xg_model, x = train$data, y = factor(train$label))

Main

bench::mark(
  old = predict(xg_fit, train$data)
)
#> # A tibble: 1 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 old          8.96ms   9.55ms      105.     7.2MB     60.4

This PR

bench::mark(
  new = predict(xg_fit, train$data)
)
#> # A tibble: 1 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 new          6.24ms   6.65ms      150.    1.13MB     4.17
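
Part of the mem_alloc difference presumably comes from skipping the dense copy. As a rough illustration on simulated data (not the agaricus data, and not a measurement of parsnip itself), the following sketch compares the size of a sparse matrix with its dense equivalent. The dimensions and density are arbitrary assumptions.

```r
library(Matrix)

set.seed(1)
# Simulated 1000 x 100 predictor matrix with about 1% non-zero entries.
sparse_x <- rsparsematrix(nrow = 1000, ncol = 100, density = 0.01)
dense_x  <- as.matrix(sparse_x)

# Size in bytes of each representation.
sparse_bytes <- as.numeric(object.size(sparse_x))
dense_bytes  <- as.numeric(object.size(dense_x))

c(sparse = sparse_bytes, dense = dense_bytes)
```

The dense matrix stores every cell, so its size grows with rows times columns, while the sparse one grows mainly with the number of non-zero entries.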

simonpcouch

Love it. :)

Closes #694. Does this also address #690?

Could we also clarify in these docs that allow_sparse_x now applies to predict()ion, too? (I think?) Are there any engines that would allow sparsity at fit() but not predict() time?

parsnip/R/aaa_models.R

Lines 553 to 555 in 51b0cd7

    
    #' Finally, `allow_sparse_x` specifies whether the model function can natively
    #'  accommodate a sparse matrix representation for predictors during fitting
    #'  and tuning.
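
One way to see which engines declare sparse support is to inspect the registered encodings. This is a small sketch, assuming parsnip's exported get_encoding() helper returns a tibble with an allow_sparse_x column for each engine and mode. The flag is a single setting per engine and mode, so it does not by itself distinguish fit() from predict() behaviour.

```r
library(parsnip)

# Encoding options registered for every rand_forest engine/mode pair.
enc <- get_encoding("rand_forest")

# Keep only the columns relevant to the sparsity question.
enc[, c("engine", "mode", "allow_sparse_x")]
```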

github-actions · 2023-05-31T01:04:55Z

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

EmilHvitfeldt added 2 commits April 12, 2023 09:59

pass sparse data through prepare_data()

5fd716b

add news

fcd2aaf

EmilHvitfeldt added 2 commits April 12, 2023 10:24

Merged origin/main into sparse-predict

ff5e7c4

move news bullet to devel

61710f4

EmilHvitfeldt requested a review from simonpcouch April 12, 2023 17:40

simonpcouch approved these changes Apr 12, 2023

Merge branch 'main' into sparse-predict

12ec6ac

topepo merged commit 145bac2 into main May 17, 2023

topepo deleted the sparse-predict branch May 17, 2023 01:00

github-actions bot locked and limited conversation to collaborators May 31, 2023
