Things to keep in mind when saving lightgbm models? #44
Thanks for the issue!
You'll indeed need to keep an eye out for the lack of native serialization support for lightgbm in bonsai. We're actively working on better infrastructure for supporting native serialization methods. That experimental work currently lives at rstudio/bundle [edit: changed URL] if you'd like to follow our development, but we hope to integrate this functionality under the hood in objects output by tidymodels / vetiver soon. I'd anticipate this work to reach our CRAN packages before the end of the year.
I'm not sure I'd put forth a "recommended approach" for now—the somewhat hacky approach of
works, but is quite painful. At the same time, our approach with bundle
works but is still experimental/unstable, and may just happen under the hood here soon. Whichever feels better for you is fine for now, though we hope to confidently recommend the latter soon. :) Related to tidymodels/butcher#147, tidymodels/parsnip#779, tidymodels/stacks#145. Again, thanks for bringing this up.🏄♀️
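For readers following along, here is a hedged sketch of what the manual approach could look like for a plain parsnip fit (no workflow). It assumes lightgbm 3.3.x, where `saveRDS.lgb.Booster()` and `readRDS.lgb.Booster()` are available. The object names (`pf`, `pf_read`), the predictor subset, and the temporary file paths are made up for illustration, and this is not necessarily the exact code that was originally posted.

```r
library(tidymodels)
library(bonsai)
library(lightgbm)

data(ames, package = "modeldata")

# Fit a plain parsnip model (no workflow), so the lgb.Booster sits at `pf$fit`.
pf <-
  boost_tree() |>
  set_engine("lightgbm") |>
  set_mode("regression") |>
  fit(Sale_Price ~ Gr_Liv_Area + Year_Built, data = ames)

pf_path <- tempfile(fileext = ".rds")
engine_path <- tempfile(fileext = ".rds")

# Save the parsnip wrapper and the booster separately.
# ASSUMPTION: lightgbm 3.3.x, where these two helpers are exported.
saveRDS(pf, pf_path)
saveRDS.lgb.Booster(extract_fit_engine(pf), engine_path)

# Read both back and put the restored booster into the wrapper.
pf_read <- readRDS(pf_path)
pf_read$fit <- readRDS.lgb.Booster(engine_path)

pred_orig <- predict(pf, ames[1, ])
pred_read <- predict(pf_read, ames[1, ])
```

The two-file bookkeeping is what makes this approach painful: the wrapper and the booster have to be saved, shipped, and recombined together.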
Thank you very much @simonpcouch! This looks great already.
I tried to create a reprex with the two approaches and manually saving the lgbm object ("painful" approach) does not work for me, i.e. it throws an error on predict():
library(tidymodels, warn.conflicts = FALSE)
library(lightgbm)
#> Loading required package: R6
#>
#> Attaching package: 'lightgbm'
#> The following object is masked from 'package:dplyr':
#>
#> slice
library(bonsai)
data(ames)
## build model
ames <-
  ames |>
  select(
    Sale_Price,
    Neighborhood,
    Gr_Liv_Area,
    Year_Built,
    Bldg_Type,
    Latitude,
    Longitude
  ) |>
  mutate(Sale_Price = log10(Sale_Price))
spec <-
  boost_tree() |>
  set_engine("lightgbm") |>
  set_mode("regression")
rec <-
  recipe(Sale_Price ~ ., data = ames) |>
  step_dummy(all_nominal_predictors())
wf <-
  workflow() |>
  add_model(spec) |>
  add_recipe(rec)
ft <- fit(wf, ames)
## predicting fitted workflow works fine.
predict(ft, ames[1,])
#> # A tibble: 1 × 1
#> .pred
#> <dbl>
#> 1 5.24
## saving and reading the lgb separately throws an error on predict()
saveRDS(ft, "ft.rds")
saveRDS.lgb.Booster(extract_fit_engine(ft), "ft_engine.rds")
ft_read <- readRDS("ft.rds")
ft_read$fit <- readRDS.lgb.Booster("ft_engine.rds")
predict(ft_read, ames[1,])
#> Error in `extract_fit_parsnip()`:
#> ! Can't extract a model fit from an untrained workflow.
#> ℹ Do you need to call `fit()`?
## using {bundle} works
library(bundle)
ft |> bundle() |> saveRDS("ft_bndl.rds")
ft_bndl_read <- readRDS("ft_bndl.rds") |> unbundle()
predict(ft_bndl_read, ames[1,])
#> # A tibble: 1 × 1
#> .pred
#> <dbl>
#> 1 5.24
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.1 (2022-06-23 ucrt)
#> os Windows 10 x64 (build 19044)
#> system x86_64, mingw32
#> ui RTerm
#> language en
#> collate German_Germany.utf8
#> ctype German_Germany.utf8
#> tz Europe/Berlin
#> date 2022-08-08
#> pandoc 2.18 @ C:/Program Files/RStudio/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0)
#> backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.0)
#> bonsai * 0.1.0 2022-06-23 [1] CRAN (R 4.2.1)
#> broom * 1.0.0 2022-07-01 [1] CRAN (R 4.2.1)
#> bundle * 0.0.0.9200 2022-08-08 [1] Github (simonpcouch/bundle@77d630c)
#> class 7.3-20 2022-01-16 [2] CRAN (R 4.2.1)
#> cli 3.3.0 2022-04-25 [1] CRAN (R 4.2.0)
#> codetools 0.2-18 2020-11-04 [2] CRAN (R 4.2.1)
#> colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.0)
#> crayon 1.5.1 2022-03-26 [1] CRAN (R 4.2.0)
#> data.table 1.14.2 2021-09-27 [1] CRAN (R 4.2.0)
#> DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.0)
#> dials * 1.0.0 2022-06-14 [1] CRAN (R 4.2.0)
#> DiceDesign 1.9 2021-02-13 [1] CRAN (R 4.2.0)
#> digest 0.6.29 2021-12-01 [1] CRAN (R 4.2.0)
#> dplyr * 1.0.9 2022-04-28 [1] CRAN (R 4.2.0)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
#> evaluate 0.15 2022-02-18 [1] CRAN (R 4.2.0)
#> fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.0)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
#> foreach 1.5.2 2022-02-02 [1] CRAN (R 4.2.0)
#> fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.0)
#> furrr 0.3.0 2022-05-04 [1] CRAN (R 4.2.0)
#> future 1.27.0 2022-07-22 [1] CRAN (R 4.2.1)
#> future.apply 1.9.0 2022-04-25 [1] CRAN (R 4.2.0)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.1)
#> ggplot2 * 3.3.6 2022-05-03 [1] CRAN (R 4.2.0)
#> globals 0.15.1 2022-06-24 [1] CRAN (R 4.2.1)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
#> gower 1.0.0 2022-02-03 [1] CRAN (R 4.2.0)
#> GPfit 1.0-8 2019-02-08 [1] CRAN (R 4.2.0)
#> gtable 0.3.0 2019-03-25 [1] CRAN (R 4.2.0)
#> hardhat 1.2.0 2022-06-30 [1] CRAN (R 4.2.1)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.2.0)
#> htmltools 0.5.3 2022-07-18 [1] CRAN (R 4.2.1)
#> infer * 1.0.2 2022-06-10 [1] CRAN (R 4.2.0)
#> ipred 0.9-13 2022-06-02 [1] CRAN (R 4.2.0)
#> iterators 1.0.14 2022-02-05 [1] CRAN (R 4.2.0)
#> jsonlite 1.8.0 2022-02-22 [1] CRAN (R 4.2.0)
#> knitr 1.39 2022-04-26 [1] CRAN (R 4.2.0)
#> lattice 0.20-45 2021-09-22 [2] CRAN (R 4.2.1)
#> lava 1.6.10 2021-09-02 [1] CRAN (R 4.2.0)
#> lhs 1.1.5 2022-03-22 [1] CRAN (R 4.2.0)
#> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.2.0)
#> lightgbm * 3.3.2 2022-01-14 [1] CRAN (R 4.2.1)
#> listenv 0.8.0 2019-12-05 [1] CRAN (R 4.2.0)
#> lubridate 1.8.0 2021-10-07 [1] CRAN (R 4.2.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
#> MASS 7.3-58.1 2022-08-03 [1] CRAN (R 4.2.1)
#> Matrix 1.4-1 2022-03-23 [2] CRAN (R 4.2.1)
#> modeldata * 1.0.0 2022-07-01 [1] CRAN (R 4.2.1)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.0)
#> nnet 7.3-17 2022-01-16 [2] CRAN (R 4.2.1)
#> parallelly 1.32.1 2022-07-21 [1] CRAN (R 4.2.1)
#> parsnip * 1.0.0 2022-06-16 [1] CRAN (R 4.2.0)
#> pillar 1.8.0 2022-07-18 [1] CRAN (R 4.2.1)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
#> prodlim 2019.11.13 2019-11-17 [1] CRAN (R 4.2.0)
#> purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.2.0)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.2.1)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.2.0)
#> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.2.0)
#> R.utils 2.12.0 2022-06-28 [1] CRAN (R 4.2.1)
#> R6 * 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
#> Rcpp 1.0.9 2022-07-08 [1] CRAN (R 4.2.1)
#> recipes * 1.0.1 2022-07-07 [1] CRAN (R 4.2.1)
#> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.2.0)
#> rlang 1.0.4 2022-07-12 [1] CRAN (R 4.2.1)
#> rmarkdown 2.14 2022-04-25 [1] CRAN (R 4.2.0)
#> rpart 4.1.16 2022-01-24 [2] CRAN (R 4.2.1)
#> rsample * 1.0.0 2022-06-24 [1] CRAN (R 4.2.1)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.2.0)
#> scales * 1.2.0 2022-04-13 [1] CRAN (R 4.2.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
#> stringi 1.7.8 2022-07-11 [1] CRAN (R 4.2.1)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.2.0)
#> styler 1.7.0 2022-03-13 [1] CRAN (R 4.2.0)
#> survival 3.3-1 2022-03-03 [2] CRAN (R 4.2.1)
#> tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.1)
#> tidymodels * 1.0.0 2022-07-13 [1] CRAN (R 4.2.1)
#> tidyr * 1.2.0 2022-02-01 [1] CRAN (R 4.2.0)
#> tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.2.0)
#> timeDate 4021.104 2022-07-19 [1] CRAN (R 4.2.1)
#> tune * 1.0.0 2022-07-07 [1] CRAN (R 4.2.1)
#> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.0)
#> vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.2.0)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
#> workflows * 1.0.0 2022-07-05 [1] CRAN (R 4.2.1)
#> workflowsets * 1.0.0 2022-07-12 [1] CRAN (R 4.2.1)
#> xfun 0.31 2022-05-10 [1] CRAN (R 4.2.0)
#> yaml 2.3.5 2022-02-21 [1] CRAN (R 4.2.0)
#> yardstick * 1.0.0 2022-06-06 [1] CRAN (R 4.2.0)
#>
#> [1] C:/Users/Daniel.AK-HAMBURG/AppData/Local/R/win-library/4.2
#> [2] C:/Program Files/R/R-4.2.1/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
I am a bit worried about bundle still being experimental, so ideally, I'd like the more verbose but stable way to work as well.
Sure thing! Thanks for the reprex. Since you're fitting with a workflow rather than a plain parsnip model spec, that original
library(tidymodels, warn.conflicts = FALSE)
library(lightgbm)
#> Loading required package: R6
#>
#> Attaching package: 'lightgbm'
#> The following object is masked from 'package:dplyr':
#>
#> slice
library(bonsai)
data(ames)
## build model
ames <-
  ames |>
  select(
    Sale_Price,
    Neighborhood,
    Gr_Liv_Area,
    Year_Built,
    Bldg_Type,
    Latitude,
    Longitude
  ) |>
  mutate(Sale_Price = log10(Sale_Price))
spec <-
  boost_tree() |>
  set_engine("lightgbm") |>
  set_mode("regression")
rec <-
  recipe(Sale_Price ~ ., data = ames) |>
  step_dummy(all_nominal_predictors())
wf <-
  workflow() |>
  add_model(spec) |>
  add_recipe(rec)
ft <- fit(wf, ames)
## predicting fitted workflow works fine.
predict(ft, ames[1,])
#> # A tibble: 1 × 1
#> .pred
#> <dbl>
#> 1 5.24
## saving and reading the lgb separately works once the booster is assigned to ft_read$fit$fit$fit
saveRDS(ft, "ft.rds")
saveRDS.lgb.Booster(extract_fit_engine(ft), "ft_engine.rds")
ft_read <- readRDS("ft.rds")
ft_read$fit$fit$fit <- readRDS.lgb.Booster("ft_engine.rds")
predict(ft_read, ames[1,])
#> # A tibble: 1 × 1
#> .pred
#> <dbl>
#> 1 5.24
Created on 2022-08-08 by the reprex package (v2.0.1)
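To see why three levels of `$fit` are needed, it can help to look at how a fitted workflow is nested. The sketch below uses a plain `lm` engine on `mtcars` as an illustrative stand-in (not the lightgbm model above) so that it runs without lightgbm installed. It relies on workflows internals as I understand them, so treat the layout as an observation rather than a documented contract.

```r
library(tidymodels)

# Illustrative stand-in model: a linear regression inside a workflow.
wf_fit <-
  workflow() |>
  add_model(linear_reg()) |>
  add_formula(mpg ~ wt) |>
  fit(mtcars)

# Level 1: wf_fit$fit is the workflow's "fit" stage.
# Level 2: wf_fit$fit$fit is the parsnip model_fit.
is_parsnip_fit <- inherits(wf_fit$fit$fit, "model_fit")

# Level 3: wf_fit$fit$fit$fit is the engine object itself,
# i.e. what extract_fit_engine() returns.
same_engine <- identical(wf_fit$fit$fit$fit, extract_fit_engine(wf_fit))

is_parsnip_fit
same_engine
```

Assigning to `ft_read$fit` replaces the whole fit stage, so the workflow no longer finds a parsnip fit at `$fit$fit`, which would explain the "untrained workflow" error in the earlier reprex.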
Just want to add to this conversation that since December 2021, Sorry that that hasn't made it into a CRAN release yet. You can subscribe to microsoft/LightGBM#5153 to be notified when that happens. Just mentioning it because if using a development version of
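A minimal sketch of what that could look like, assuming the change referred to above is lightgbm's support for plain `saveRDS()` / `readRDS()` on boosters, and that a lightgbm version including it (4.0.0 or later, as far as I know) is installed. The predictor subset, the object names, and the file path are illustrative.

```r
library(tidymodels)
library(bonsai)

# ASSUMPTION: a plain saveRDS()/readRDS() round trip needs lightgbm >= 4.0.0.
stopifnot(packageVersion("lightgbm") >= "4.0.0")

data(ames, package = "modeldata")
ames_small <-
  ames |>
  select(Sale_Price, Gr_Liv_Area, Year_Built, Latitude, Longitude) |>
  mutate(Sale_Price = log10(Sale_Price))

ft <-
  workflow() |>
  add_model(boost_tree() |> set_engine("lightgbm") |> set_mode("regression")) |>
  add_formula(Sale_Price ~ .) |>
  fit(ames_small)

# Save and restore the whole fitted workflow in one step,
# with no separate handling of the booster.
path <- tempfile(fileext = ".rds")
saveRDS(ft, path)
ft_read <- readRDS(path)

# Predictions from the restored workflow should match the original.
pred_before <- predict(ft, ames_small[1:5, ])
pred_after <- predict(ft_read, ames_small[1:5, ])
all.equal(pred_before, pred_after)
```

If this holds on your setup, neither the manual `$fit$fit$fit` patching nor a bundle step should be needed for lightgbm workflows.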
Thanks for the note here, @jameslamb! Hadn't noticed that PR. Will consider that when figuring out our approach here / in bundle.
haha, I tried I think I might just go with the dev/4.0 version of {lightgbm}. 😉
An update from the bundle side: We've opted to remove the lightgbm bundle method in light of that upcoming feature in lightgbm. This should "just work" in good time. :)
I've had some issues with saving and {butcher}ing (reducing file size of saved model) xgboost models via tidymodels some months ago. What this came down to (IIUC) is that tidymodels does not support native serialization of those models at the moment.
Is that something I would have to worry about when working with {bonsai}/{lightgbm} as well? Or more generally, what is the recommended approach for saving lightgbm models and reading them back in for prediction / using the saved models in a "prediction package" later? Is it okay to "just" saveRDS.lgb.Booster() and readRDS.lgb.Booster() them?