Things to keep in mind when saving lightgbm models? #44

Closed
dpprdan opened this issue Aug 5, 2022 · 9 comments

@dpprdan

dpprdan commented Aug 5, 2022

I've had some issues with saving and {butcher}ing (reducing file size of saved model) xgboost models via tidymodels some months ago. What this came down to (IIUC) is that tidymodels does not support native serialization of those models at the moment.

Is that something I would have to worry about when working with {bonsai}/{lightgbm} as well? Or more generally, what is the recommended approach for saving lightgbm models and reading them back in for prediction / using the saved models in a "prediction package" later? Is it okay to "just" saveRDS.lgb.Booster() and readRDS.lgb.Booster() them?

@simonpcouch
Contributor

simonpcouch commented Aug 5, 2022

Thanks for the issue!

Is that something I would have to worry about when working with {bonsai}/{lightgbm} as well?

You'll indeed need to keep an eye out for lack of native serialization support for lightgbm in bonsai.

We're actively working on better infrastructure for supporting native serialization methods. That experimental work currently lives at rstudio/bundle [edit: changed URL] if you'd like to follow our development, but we hope to integrate this functionality under the hood in objects output by tidymodels / vetiver soon. I'd anticipate this work reaching our CRAN packages before the end of the year.

Or more generally, what is the recommended approach for saving lightgbm models and reading them back in for prediction / using the saved models in a "prediction package" later? Is it okay to "just" saveRDS.lgb.Booster() and readRDS.lgb.Booster() them?

I'm not sure I'd put forth a "recommended approach" for now—the somewhat hacky approach of

saveRDS(bonsai_fit, path1)
saveRDS.lgb.Booster(extract_fit_engine(bonsai_fit), path2)
bonsai_fit_read <- readRDS(path1)
bonsai_fit_engine_read <- readRDS.lgb.Booster(path2)
bonsai_fit_read$fit <- bonsai_fit_engine_read

works, but is quite painful. At the same time, our approach with bundle

bonsai_fit_bundled <- bundle(bonsai_fit)
saveRDS(bonsai_fit_bundled, path1)

bonsai_fit_read <- readRDS(path1)
bonsai_fit_new <- unbundle(bonsai_fit_read)

works but is still experimental/unstable, and may just happen under the hood here soon. Whichever feels better for you is fine for now, though we hope to confidently recommend the latter soon. :)

Related to tidymodels/butcher#147, tidymodels/parsnip#779, tidymodels/stacks#145.

Again, thanks for bringing this up.🏄‍♀️

@dpprdan
Author

dpprdan commented Aug 8, 2022

Thank you very much @simonpcouch! This looks great already.
For now I think I'll go with the first "painful" approach, since that looks like it could still work even if {bundle} is in effect under the hood. 🤔

@dpprdan
Author

dpprdan commented Aug 8, 2022

I tried to create a reprex with the two approaches, and manually saving the lgbm object (the "painful" approach) does not work for me: predict() throws an error saying that the workflow has not yet been fit().

library(tidymodels, warn.conflicts = FALSE)
library(lightgbm)
#> Loading required package: R6
#> 
#> Attaching package: 'lightgbm'
#> The following object is masked from 'package:dplyr':
#> 
#>     slice
library(bonsai)
data(ames)

## build model

ames <-
  ames |>
  select(
    Sale_Price,
    Neighborhood,
    Gr_Liv_Area,
    Year_Built,
    Bldg_Type,
    Latitude,
    Longitude
  ) |> 
  mutate(Sale_Price = log10(Sale_Price))

spec <- 
  boost_tree() |> 
  set_engine("lightgbm") |> 
  set_mode("regression")

rec <- 
  recipe(Sale_Price ~ ., data = ames) |> 
  step_dummy(all_nominal_predictors()) 

wf <- 
  workflow() |> 
  add_model(spec) |> 
  add_recipe(rec)

ft <- fit(wf, ames)

## predicting fitted workflow works fine.
predict(ft, ames[1,])
#> # A tibble: 1 × 1
#>   .pred
#>   <dbl>
#> 1  5.24

## saving and reading the lgb separately throws an error on predict()
saveRDS(ft, "ft.rds")
saveRDS.lgb.Booster(extract_fit_engine(ft), "ft_engine.rds")
ft_read <- readRDS("ft.rds")
ft_read$fit <- readRDS.lgb.Booster("ft_engine.rds")

predict(ft_read, ames[1,])
#> Error in `extract_fit_parsnip()`:
#> ! Can't extract a model fit from an untrained workflow.
#> ℹ Do you need to call `fit()`?

## using {bundle} works
library(bundle)
ft |> bundle() |> saveRDS("ft_bndl.rds")
ft_bndl_read <- readRDS("ft_bndl.rds") |> unbundle()
predict(ft_bndl_read, ames[1,])
#> # A tibble: 1 × 1
#>   .pred
#>   <dbl>
#> 1  5.24
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language en
#>  collate  German_Germany.utf8
#>  ctype    German_Germany.utf8
#>  tz       Europe/Berlin
#>  date     2022-08-08
#>  pandoc   2.18 @ C:/Program Files/RStudio/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version    date (UTC) lib source
#>  assertthat     0.2.1      2019-03-21 [1] CRAN (R 4.2.0)
#>  backports      1.4.1      2021-12-13 [1] CRAN (R 4.2.0)
#>  bonsai       * 0.1.0      2022-06-23 [1] CRAN (R 4.2.1)
#>  broom        * 1.0.0      2022-07-01 [1] CRAN (R 4.2.1)
#>  bundle       * 0.0.0.9200 2022-08-08 [1] Github (simonpcouch/bundle@77d630c)
#>  class          7.3-20     2022-01-16 [2] CRAN (R 4.2.1)
#>  cli            3.3.0      2022-04-25 [1] CRAN (R 4.2.0)
#>  codetools      0.2-18     2020-11-04 [2] CRAN (R 4.2.1)
#>  colorspace     2.0-3      2022-02-21 [1] CRAN (R 4.2.0)
#>  crayon         1.5.1      2022-03-26 [1] CRAN (R 4.2.0)
#>  data.table     1.14.2     2021-09-27 [1] CRAN (R 4.2.0)
#>  DBI            1.1.3      2022-06-18 [1] CRAN (R 4.2.0)
#>  dials        * 1.0.0      2022-06-14 [1] CRAN (R 4.2.0)
#>  DiceDesign     1.9        2021-02-13 [1] CRAN (R 4.2.0)
#>  digest         0.6.29     2021-12-01 [1] CRAN (R 4.2.0)
#>  dplyr        * 1.0.9      2022-04-28 [1] CRAN (R 4.2.0)
#>  ellipsis       0.3.2      2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate       0.15       2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi          1.0.3      2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap        1.1.0      2021-01-25 [1] CRAN (R 4.2.0)
#>  foreach        1.5.2      2022-02-02 [1] CRAN (R 4.2.0)
#>  fs             1.5.2      2021-12-08 [1] CRAN (R 4.2.0)
#>  furrr          0.3.0      2022-05-04 [1] CRAN (R 4.2.0)
#>  future         1.27.0     2022-07-22 [1] CRAN (R 4.2.1)
#>  future.apply   1.9.0      2022-04-25 [1] CRAN (R 4.2.0)
#>  generics       0.1.3      2022-07-05 [1] CRAN (R 4.2.1)
#>  ggplot2      * 3.3.6      2022-05-03 [1] CRAN (R 4.2.0)
#>  globals        0.15.1     2022-06-24 [1] CRAN (R 4.2.1)
#>  glue           1.6.2      2022-02-24 [1] CRAN (R 4.2.0)
#>  gower          1.0.0      2022-02-03 [1] CRAN (R 4.2.0)
#>  GPfit          1.0-8      2019-02-08 [1] CRAN (R 4.2.0)
#>  gtable         0.3.0      2019-03-25 [1] CRAN (R 4.2.0)
#>  hardhat        1.2.0      2022-06-30 [1] CRAN (R 4.2.1)
#>  highr          0.9        2021-04-16 [1] CRAN (R 4.2.0)
#>  htmltools      0.5.3      2022-07-18 [1] CRAN (R 4.2.1)
#>  infer        * 1.0.2      2022-06-10 [1] CRAN (R 4.2.0)
#>  ipred          0.9-13     2022-06-02 [1] CRAN (R 4.2.0)
#>  iterators      1.0.14     2022-02-05 [1] CRAN (R 4.2.0)
#>  jsonlite       1.8.0      2022-02-22 [1] CRAN (R 4.2.0)
#>  knitr          1.39       2022-04-26 [1] CRAN (R 4.2.0)
#>  lattice        0.20-45    2021-09-22 [2] CRAN (R 4.2.1)
#>  lava           1.6.10     2021-09-02 [1] CRAN (R 4.2.0)
#>  lhs            1.1.5      2022-03-22 [1] CRAN (R 4.2.0)
#>  lifecycle      1.0.1      2021-09-24 [1] CRAN (R 4.2.0)
#>  lightgbm     * 3.3.2      2022-01-14 [1] CRAN (R 4.2.1)
#>  listenv        0.8.0      2019-12-05 [1] CRAN (R 4.2.0)
#>  lubridate      1.8.0      2021-10-07 [1] CRAN (R 4.2.0)
#>  magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.2.0)
#>  MASS           7.3-58.1   2022-08-03 [1] CRAN (R 4.2.1)
#>  Matrix         1.4-1      2022-03-23 [2] CRAN (R 4.2.1)
#>  modeldata    * 1.0.0      2022-07-01 [1] CRAN (R 4.2.1)
#>  munsell        0.5.0      2018-06-12 [1] CRAN (R 4.2.0)
#>  nnet           7.3-17     2022-01-16 [2] CRAN (R 4.2.1)
#>  parallelly     1.32.1     2022-07-21 [1] CRAN (R 4.2.1)
#>  parsnip      * 1.0.0      2022-06-16 [1] CRAN (R 4.2.0)
#>  pillar         1.8.0      2022-07-18 [1] CRAN (R 4.2.1)
#>  pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.2.0)
#>  prodlim        2019.11.13 2019-11-17 [1] CRAN (R 4.2.0)
#>  purrr        * 0.3.4      2020-04-17 [1] CRAN (R 4.2.0)
#>  R.cache        0.16.0     2022-07-21 [1] CRAN (R 4.2.1)
#>  R.methodsS3    1.8.2      2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo           1.25.0     2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils        2.12.0     2022-06-28 [1] CRAN (R 4.2.1)
#>  R6           * 2.5.1      2021-08-19 [1] CRAN (R 4.2.0)
#>  Rcpp           1.0.9      2022-07-08 [1] CRAN (R 4.2.1)
#>  recipes      * 1.0.1      2022-07-07 [1] CRAN (R 4.2.1)
#>  reprex         2.0.1      2021-08-05 [1] CRAN (R 4.2.0)
#>  rlang          1.0.4      2022-07-12 [1] CRAN (R 4.2.1)
#>  rmarkdown      2.14       2022-04-25 [1] CRAN (R 4.2.0)
#>  rpart          4.1.16     2022-01-24 [2] CRAN (R 4.2.1)
#>  rsample      * 1.0.0      2022-06-24 [1] CRAN (R 4.2.1)
#>  rstudioapi     0.13       2020-11-12 [1] CRAN (R 4.2.0)
#>  scales       * 1.2.0      2022-04-13 [1] CRAN (R 4.2.0)
#>  sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi        1.7.8      2022-07-11 [1] CRAN (R 4.2.1)
#>  stringr        1.4.0      2019-02-10 [1] CRAN (R 4.2.0)
#>  styler         1.7.0      2022-03-13 [1] CRAN (R 4.2.0)
#>  survival       3.3-1      2022-03-03 [2] CRAN (R 4.2.1)
#>  tibble       * 3.1.8      2022-07-22 [1] CRAN (R 4.2.1)
#>  tidymodels   * 1.0.0      2022-07-13 [1] CRAN (R 4.2.1)
#>  tidyr        * 1.2.0      2022-02-01 [1] CRAN (R 4.2.0)
#>  tidyselect     1.1.2      2022-02-21 [1] CRAN (R 4.2.0)
#>  timeDate       4021.104   2022-07-19 [1] CRAN (R 4.2.1)
#>  tune         * 1.0.0      2022-07-07 [1] CRAN (R 4.2.1)
#>  utf8           1.2.2      2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs          0.4.1      2022-04-13 [1] CRAN (R 4.2.0)
#>  withr          2.5.0      2022-03-03 [1] CRAN (R 4.2.0)
#>  workflows    * 1.0.0      2022-07-05 [1] CRAN (R 4.2.1)
#>  workflowsets * 1.0.0      2022-07-12 [1] CRAN (R 4.2.1)
#>  xfun           0.31       2022-05-10 [1] CRAN (R 4.2.0)
#>  yaml           2.3.5      2022-02-21 [1] CRAN (R 4.2.0)
#>  yardstick    * 1.0.0      2022-06-06 [1] CRAN (R 4.2.0)
#> 
#>  [1] C:/Users/Daniel.AK-HAMBURG/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.1/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

I am a bit worried about bundle still being experimental, so ideally, I'd like the more verbose but stable way to work as well.

@simonpcouch
Contributor

Sure thing! Thanks for the reprex.

Since you're fitting with a workflow rather than a plain parsnip model spec, that original lgb.Booster fit object lives in the $fit$fit$fit slot rather than $fit. With your reprex:

library(tidymodels, warn.conflicts = FALSE)
library(lightgbm)
#> Loading required package: R6
#> 
#> Attaching package: 'lightgbm'
#> The following object is masked from 'package:dplyr':
#> 
#>     slice
library(bonsai)
data(ames)

## build model

ames <-
  ames |>
  select(
    Sale_Price,
    Neighborhood,
    Gr_Liv_Area,
    Year_Built,
    Bldg_Type,
    Latitude,
    Longitude
  ) |> 
  mutate(Sale_Price = log10(Sale_Price))

spec <- 
  boost_tree() |> 
  set_engine("lightgbm") |> 
  set_mode("regression")

rec <- 
  recipe(Sale_Price ~ ., data = ames) |> 
  step_dummy(all_nominal_predictors()) 

wf <- 
  workflow() |> 
  add_model(spec) |> 
  add_recipe(rec)

ft <- fit(wf, ames)

## predicting fitted workflow works fine.
predict(ft, ames[1,])
#> # A tibble: 1 × 1
#>   .pred
#>   <dbl>
#> 1  5.24

## saving and reading the lgb.Booster separately works once it is restored into $fit$fit$fit
saveRDS(ft, "ft.rds")
saveRDS.lgb.Booster(extract_fit_engine(ft), "ft_engine.rds")
ft_read <- readRDS("ft.rds")
ft_read$fit$fit$fit <- readRDS.lgb.Booster("ft_engine.rds")

predict(ft_read, ames[1,])
#> # A tibble: 1 × 1
#>   .pred
#>   <dbl>
#> 1  5.24

Created on 2022-08-08 by the reprex package (v2.0.1)
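
For anyone who wants to reuse the workaround above, here is a minimal sketch that wraps it in two helpers. The names save_workflow_parts() and read_workflow_parts() are hypothetical and not part of bonsai, workflows, or lightgbm. The defaults assume the saveRDS.lgb.Booster() / readRDS.lgb.Booster() helpers from lightgbm 3.3.x used in this thread. The writer and reader are arguments so that the slot handling can be tried on a plain nested list without lightgbm installed, which is what the demonstration at the end does.

```r
# Hypothetical helpers wrapping the two-file workaround for a fitted workflow.
# The default writer/reader are the lightgbm 3.3.x functions used in the thread;
# they are only looked up if the caller does not supply their own.
save_workflow_parts <- function(wf_fit, wf_path, engine_path,
                                save_engine = lightgbm::saveRDS.lgb.Booster) {
  # Save the whole workflow (recipe, spec, metadata) as usual.
  saveRDS(wf_fit, wf_path)
  # For a fitted workflow, the engine object sits at $fit$fit$fit.
  save_engine(wf_fit$fit$fit$fit, engine_path)
  invisible(wf_fit)
}

read_workflow_parts <- function(wf_path, engine_path,
                                read_engine = lightgbm::readRDS.lgb.Booster) {
  wf_fit <- readRDS(wf_path)
  # Put the separately stored engine object back into the same slot.
  wf_fit$fit$fit$fit <- read_engine(engine_path)
  wf_fit
}

# Demonstration on a mock object (a plain list, not a real workflow),
# using base saveRDS()/readRDS() in place of the lightgbm functions.
mock_fit <- list(pre = "recipe", fit = list(fit = list(fit = "booster")))
wf_path <- tempfile(fileext = ".rds")
engine_path <- tempfile(fileext = ".rds")

save_workflow_parts(mock_fit, wf_path, engine_path, save_engine = saveRDS)
restored <- read_workflow_parts(wf_path, engine_path, read_engine = readRDS)
```

With a real fitted workflow and lightgbm 3.3.x, the same two calls without the save_engine / read_engine arguments should correspond to the four lines in the reprex above.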

@jameslamb
Contributor

Just want to add to this conversation that since December 2021, {lightgbm}'s development version has supported using readRDS() / saveRDS() directly for {lightgbm} models: microsoft/LightGBM#4685

Sorry that that hasn't made it into a CRAN release yet. You can subscribe to microsoft/LightGBM#5153 to be notified when that happens.

Just mentioning it because if using a development version of {lightgbm} built from source is an option (which I do understand is kind of painful), it might remove the need for other workarounds.
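
A small sketch of how code could branch on the installed version. The helper name needs_lgb_workaround() is hypothetical, and the "4.0.0" cutoff is an assumption based on the dev/4.0 version mentioned in this thread, not something confirmed here. Check the {lightgbm} release notes before relying on it.

```r
# Hypothetical helper: TRUE if the separate-file workaround is still needed.
# Assumes direct saveRDS()/readRDS() support first shipped in lightgbm 4.0.0.
needs_lgb_workaround <- function(version = utils::packageVersion("lightgbm")) {
  # compareVersion() returns -1 when the first version is older than the second.
  utils::compareVersion(as.character(version), "4.0.0") < 0
}

old_release <- needs_lgb_workaround("3.3.2")  # TRUE
new_release <- needs_lgb_workaround("4.0.0")  # FALSE
```

Called without arguments, the helper looks up the installed {lightgbm} version, so it only works that way when the package is installed.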

@simonpcouch
Contributor

Thanks for the note here, @jameslamb! Hadn't noticed that PR. Will consider that when figuring out our approach here / in bundle.

@dpprdan
Author

dpprdan commented Aug 8, 2022

that original lgb.Booster fit object lives in the $fit$fit$fit slot rather than $fit

haha, I tried $fit$fit before but not $fit$fit$fit. 😂

I think I might just go with the dev/4.0 version of {lightgbm}. 😉

@simonpcouch
Contributor

An update from the bundle side:

We've opted to remove the lightgbm bundle method in light of that upcoming feature in lightgbm. This should "just work" in good time. :)

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Jan 11, 2023