Things to keep in mind when saving lightgbm models? #44

dpprdan · 2022-08-05T17:46:57Z

I've had some issues with saving and {butcher}ing (reducing file size of saved model) xgboost models via tidymodels some months ago. What this came down to (IIUC) is that tidymodels does not support native serialization of those models at the moment.

Is that something I would have to worry about when working with {bonsai}/{lightgbm} as well? Or more generally, what is the recommended approach for saving lightgbm models and reading them back in for prediction / using the saved models in a "prediction package" later? Is it okay to "just" saveRDS.lgb.Booster() and readRDS.lgb.Booster() them?

simonpcouch · 2022-08-05T20:07:46Z

Thanks for the issue!

Is that something I would have to worry about when working with {bonsai}/{lightgbm} as well?

You'll indeed need to keep an eye out for lack of native serialization support for lightgbm in bonsai.

We're actively working on better infrastructure for supporting native serialization methods. That experimental work currently lives at rstudio/bundle [edit: changed URL] if you'd like to follow our development, but we hope to integrate this functionality under the hood in objects output by tidymodels / vetiver soon. I'd anticipate this work reaching our CRAN packages before the end of the year. 👍

Or more generally, what is the recommended approach for saving lightgbm models and reading them back in for prediction / using the saved models in a "prediction package" later? Is it okay to "just" saveRDS.lgb.Booster() and readRDS.lgb.Booster() them?

I'm not sure I'd put forth a "recommended approach" for now—the somewhat hacky approach of

saveRDS(bonsai_fit, path1)
saveRDS.lgb.Booster(extract_fit_engine(bonsai_fit), path2)
bonsai_fit_read <- readRDS(path1)
bonsai_fit_engine_read <- readRDS.lgb.Booster(path2)
bonsai_fit_read$fit <- bonsai_fit_engine_read

works, but is quite painful. At the same time, our approach with bundle

bonsai_fit_bundled <- bundle(bonsai_fit)
saveRDS(bonsai_fit_bundled, path1)

bonsai_fit_read <- readRDS(path1)
bonsai_fit_new <- unbundle(bonsai_fit_read)

works but is still experimental/unstable, and may just happen under the hood here soon. Whichever feels better for you is fine for now, though we hope to confidently recommend the latter soon. :)

Related to tidymodels/butcher#147, tidymodels/parsnip#779, tidymodels/stacks#145.

Again, thanks for bringing this up.🏄‍♀️

dpprdan · 2022-08-08T08:31:59Z

Thank you very much @simonpcouch! This looks great already.
For now I think I'll go with the first "painful" approach, since that looks like it could still work even if {bundle} is in effect under the hood. 🤔

dpprdan · 2022-08-08T12:38:49Z

I created a reprex with both approaches. Manually saving the lgb.Booster object (the "painful" approach) does not work for me: predict() throws an error saying the workflow has not been fit() yet.

library(tidymodels, warn.conflicts = FALSE)
library(lightgbm)
#> Loading required package: R6
#> 
#> Attaching package: 'lightgbm'
#> The following object is masked from 'package:dplyr':
#> 
#>     slice
library(bonsai)
data(ames)

## build model

ames <-
  ames |>
  select(
    Sale_Price,
    Neighborhood,
    Gr_Liv_Area,
    Year_Built,
    Bldg_Type,
    Latitude,
    Longitude
  ) |> 
  mutate(Sale_Price = log10(Sale_Price))

spec <- 
  boost_tree() |> 
  set_engine("lightgbm") |> 
  set_mode("regression")

rec <- 
  recipe(Sale_Price ~ ., data = ames) |> 
  step_dummy(all_nominal_predictors()) 

wf <- 
  workflow() |> 
  add_model(spec) |> 
  add_recipe(rec)

ft <- fit(wf, ames)

## predicting fitted workflow works fine.
predict(ft, ames[1,])
#> # A tibble: 1 × 1
#>   .pred
#>   <dbl>
#> 1  5.24

## saving and reading the lgb separately throws an error on predict()
saveRDS(ft, "ft.rds")
saveRDS.lgb.Booster(extract_fit_engine(ft), "ft_engine.rds")
ft_read <- readRDS("ft.rds")
ft_read$fit <- readRDS.lgb.Booster("ft_engine.rds")

predict(ft_read, ames[1,])
#> Error in `extract_fit_parsnip()`:
#> ! Can't extract a model fit from an untrained workflow.
#> ℹ Do you need to call `fit()`?

## using {bundle} works
library(bundle)
ft |> bundle() |> saveRDS("ft_bndl.rds")
ft_bndl_read <- readRDS("ft_bndl.rds") |> unbundle()
predict(ft_bndl_read, ames[1,])
#> # A tibble: 1 × 1
#>   .pred
#>   <dbl>
#> 1  5.24

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language en
#>  collate  German_Germany.utf8
#>  ctype    German_Germany.utf8
#>  tz       Europe/Berlin
#>  date     2022-08-08
#>  pandoc   2.18 @ C:/Program Files/RStudio/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version    date (UTC) lib source
#>  assertthat     0.2.1      2019-03-21 [1] CRAN (R 4.2.0)
#>  backports      1.4.1      2021-12-13 [1] CRAN (R 4.2.0)
#>  bonsai       * 0.1.0      2022-06-23 [1] CRAN (R 4.2.1)
#>  broom        * 1.0.0      2022-07-01 [1] CRAN (R 4.2.1)
#>  bundle       * 0.0.0.9200 2022-08-08 [1] Github (simonpcouch/bundle@77d630c)
#>  class          7.3-20     2022-01-16 [2] CRAN (R 4.2.1)
#>  cli            3.3.0      2022-04-25 [1] CRAN (R 4.2.0)
#>  codetools      0.2-18     2020-11-04 [2] CRAN (R 4.2.1)
#>  colorspace     2.0-3      2022-02-21 [1] CRAN (R 4.2.0)
#>  crayon         1.5.1      2022-03-26 [1] CRAN (R 4.2.0)
#>  data.table     1.14.2     2021-09-27 [1] CRAN (R 4.2.0)
#>  DBI            1.1.3      2022-06-18 [1] CRAN (R 4.2.0)
#>  dials        * 1.0.0      2022-06-14 [1] CRAN (R 4.2.0)
#>  DiceDesign     1.9        2021-02-13 [1] CRAN (R 4.2.0)
#>  digest         0.6.29     2021-12-01 [1] CRAN (R 4.2.0)
#>  dplyr        * 1.0.9      2022-04-28 [1] CRAN (R 4.2.0)
#>  ellipsis       0.3.2      2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate       0.15       2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi          1.0.3      2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap        1.1.0      2021-01-25 [1] CRAN (R 4.2.0)
#>  foreach        1.5.2      2022-02-02 [1] CRAN (R 4.2.0)
#>  fs             1.5.2      2021-12-08 [1] CRAN (R 4.2.0)
#>  furrr          0.3.0      2022-05-04 [1] CRAN (R 4.2.0)
#>  future         1.27.0     2022-07-22 [1] CRAN (R 4.2.1)
#>  future.apply   1.9.0      2022-04-25 [1] CRAN (R 4.2.0)
#>  generics       0.1.3      2022-07-05 [1] CRAN (R 4.2.1)
#>  ggplot2      * 3.3.6      2022-05-03 [1] CRAN (R 4.2.0)
#>  globals        0.15.1     2022-06-24 [1] CRAN (R 4.2.1)
#>  glue           1.6.2      2022-02-24 [1] CRAN (R 4.2.0)
#>  gower          1.0.0      2022-02-03 [1] CRAN (R 4.2.0)
#>  GPfit          1.0-8      2019-02-08 [1] CRAN (R 4.2.0)
#>  gtable         0.3.0      2019-03-25 [1] CRAN (R 4.2.0)
#>  hardhat        1.2.0      2022-06-30 [1] CRAN (R 4.2.1)
#>  highr          0.9        2021-04-16 [1] CRAN (R 4.2.0)
#>  htmltools      0.5.3      2022-07-18 [1] CRAN (R 4.2.1)
#>  infer        * 1.0.2      2022-06-10 [1] CRAN (R 4.2.0)
#>  ipred          0.9-13     2022-06-02 [1] CRAN (R 4.2.0)
#>  iterators      1.0.14     2022-02-05 [1] CRAN (R 4.2.0)
#>  jsonlite       1.8.0      2022-02-22 [1] CRAN (R 4.2.0)
#>  knitr          1.39       2022-04-26 [1] CRAN (R 4.2.0)
#>  lattice        0.20-45    2021-09-22 [2] CRAN (R 4.2.1)
#>  lava           1.6.10     2021-09-02 [1] CRAN (R 4.2.0)
#>  lhs            1.1.5      2022-03-22 [1] CRAN (R 4.2.0)
#>  lifecycle      1.0.1      2021-09-24 [1] CRAN (R 4.2.0)
#>  lightgbm     * 3.3.2      2022-01-14 [1] CRAN (R 4.2.1)
#>  listenv        0.8.0      2019-12-05 [1] CRAN (R 4.2.0)
#>  lubridate      1.8.0      2021-10-07 [1] CRAN (R 4.2.0)
#>  magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.2.0)
#>  MASS           7.3-58.1   2022-08-03 [1] CRAN (R 4.2.1)
#>  Matrix         1.4-1      2022-03-23 [2] CRAN (R 4.2.1)
#>  modeldata    * 1.0.0      2022-07-01 [1] CRAN (R 4.2.1)
#>  munsell        0.5.0      2018-06-12 [1] CRAN (R 4.2.0)
#>  nnet           7.3-17     2022-01-16 [2] CRAN (R 4.2.1)
#>  parallelly     1.32.1     2022-07-21 [1] CRAN (R 4.2.1)
#>  parsnip      * 1.0.0      2022-06-16 [1] CRAN (R 4.2.0)
#>  pillar         1.8.0      2022-07-18 [1] CRAN (R 4.2.1)
#>  pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.2.0)
#>  prodlim        2019.11.13 2019-11-17 [1] CRAN (R 4.2.0)
#>  purrr        * 0.3.4      2020-04-17 [1] CRAN (R 4.2.0)
#>  R.cache        0.16.0     2022-07-21 [1] CRAN (R 4.2.1)
#>  R.methodsS3    1.8.2      2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo           1.25.0     2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils        2.12.0     2022-06-28 [1] CRAN (R 4.2.1)
#>  R6           * 2.5.1      2021-08-19 [1] CRAN (R 4.2.0)
#>  Rcpp           1.0.9      2022-07-08 [1] CRAN (R 4.2.1)
#>  recipes      * 1.0.1      2022-07-07 [1] CRAN (R 4.2.1)
#>  reprex         2.0.1      2021-08-05 [1] CRAN (R 4.2.0)
#>  rlang          1.0.4      2022-07-12 [1] CRAN (R 4.2.1)
#>  rmarkdown      2.14       2022-04-25 [1] CRAN (R 4.2.0)
#>  rpart          4.1.16     2022-01-24 [2] CRAN (R 4.2.1)
#>  rsample      * 1.0.0      2022-06-24 [1] CRAN (R 4.2.1)
#>  rstudioapi     0.13       2020-11-12 [1] CRAN (R 4.2.0)
#>  scales       * 1.2.0      2022-04-13 [1] CRAN (R 4.2.0)
#>  sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi        1.7.8      2022-07-11 [1] CRAN (R 4.2.1)
#>  stringr        1.4.0      2019-02-10 [1] CRAN (R 4.2.0)
#>  styler         1.7.0      2022-03-13 [1] CRAN (R 4.2.0)
#>  survival       3.3-1      2022-03-03 [2] CRAN (R 4.2.1)
#>  tibble       * 3.1.8      2022-07-22 [1] CRAN (R 4.2.1)
#>  tidymodels   * 1.0.0      2022-07-13 [1] CRAN (R 4.2.1)
#>  tidyr        * 1.2.0      2022-02-01 [1] CRAN (R 4.2.0)
#>  tidyselect     1.1.2      2022-02-21 [1] CRAN (R 4.2.0)
#>  timeDate       4021.104   2022-07-19 [1] CRAN (R 4.2.1)
#>  tune         * 1.0.0      2022-07-07 [1] CRAN (R 4.2.1)
#>  utf8           1.2.2      2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs          0.4.1      2022-04-13 [1] CRAN (R 4.2.0)
#>  withr          2.5.0      2022-03-03 [1] CRAN (R 4.2.0)
#>  workflows    * 1.0.0      2022-07-05 [1] CRAN (R 4.2.1)
#>  workflowsets * 1.0.0      2022-07-12 [1] CRAN (R 4.2.1)
#>  xfun           0.31       2022-05-10 [1] CRAN (R 4.2.0)
#>  yaml           2.3.5      2022-02-21 [1] CRAN (R 4.2.0)
#>  yardstick    * 1.0.0      2022-06-06 [1] CRAN (R 4.2.0)
#> 
#>  [1] C:/Users/Daniel.AK-HAMBURG/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.1/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

I am a bit worried about bundle still being experimental, so ideally, I'd like the more verbose but stable way to work as well.

simonpcouch · 2022-08-08T18:28:07Z

Sure thing! Thanks for the reprex.

Since you're fitting with a workflow rather than a plain parsnip model spec, that original lgb.Booster fit object lives in the $fit$fit$fit slot rather than $fit. With your reprex:

library(tidymodels, warn.conflicts = FALSE)
library(lightgbm)
#> Loading required package: R6
#> 
#> Attaching package: 'lightgbm'
#> The following object is masked from 'package:dplyr':
#> 
#>     slice
library(bonsai)
data(ames)

## build model

ames <-
  ames |>
  select(
    Sale_Price,
    Neighborhood,
    Gr_Liv_Area,
    Year_Built,
    Bldg_Type,
    Latitude,
    Longitude
  ) |> 
  mutate(Sale_Price = log10(Sale_Price))

spec <- 
  boost_tree() |> 
  set_engine("lightgbm") |> 
  set_mode("regression")

rec <- 
  recipe(Sale_Price ~ ., data = ames) |> 
  step_dummy(all_nominal_predictors()) 

wf <- 
  workflow() |> 
  add_model(spec) |> 
  add_recipe(rec)

ft <- fit(wf, ames)

## predicting fitted workflow works fine.
predict(ft, ames[1,])
#> # A tibble: 1 × 1
#>   .pred
#>   <dbl>
#> 1  5.24

## saving and reading the lgb separately works once it is assigned to $fit$fit$fit
saveRDS(ft, "ft.rds")
saveRDS.lgb.Booster(extract_fit_engine(ft), "ft_engine.rds")
ft_read <- readRDS("ft.rds")
ft_read$fit$fit$fit <- readRDS.lgb.Booster("ft_engine.rds")

predict(ft_read, ames[1,])
#> # A tibble: 1 × 1
#>   .pred
#>   <dbl>
#> 1  5.24

Created on 2022-08-08 by the reprex package (v2.0.1)
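
The nesting behind the $fit versus $fit$fit$fit distinction can be illustrated without any modeling packages. The following is only a sketch using plain lists as stand-ins: `mock_workflow`, `restored_engine`, and the element names are hypothetical and are not the real workflows, parsnip, or lightgbm classes. It shows that overwriting the outer slot throws away the intermediate layers, while assigning at the innermost level swaps only the engine object.

```r
# Plain lists standing in for the real objects; all names are illustrative only.
restored_engine <- structure(
  list(name = "restored booster"),
  class = "mock_booster"
)

mock_workflow <- list(
  pre = list(),
  fit = list(                 # workflow-level "fit" stage
    actions = list(),
    fit = list(               # stand-in for the parsnip model fit
      spec = "boost_tree",
      fit = structure(        # stand-in for the engine object
        list(name = "stale booster"),
        class = "mock_booster"
      )
    )
  )
)

# Overwriting the outer slot discards the nested layers, so nothing that
# looks like a model fit is left to extract.
wrong <- mock_workflow
wrong$fit <- restored_engine
wrong_has_model_fit <- !is.null(wrong$fit$fit)

# Assigning at the innermost level replaces only the engine object and
# leaves the surrounding layers untouched.
right <- mock_workflow
right$fit$fit$fit <- restored_engine
right_engine_name <- right$fit$fit$fit$name
right_spec_kept <- identical(right$fit$fit$spec, "boost_tree")
```

In the mock, `wrong_has_model_fit` is FALSE, which mirrors why predict() reported an untrained workflow in the earlier reprex.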

jameslamb · 2022-08-08T18:54:47Z

Just want to add to this conversation that since December 2021, {lightgbm}'s development version has supported using readRDS() / saveRDS() directly for {lightgbm} models: microsoft/LightGBM#4685

Sorry that that hasn't made it into a CRAN release yet. You can subscribe to microsoft/LightGBM#5153 to be notified when that happens.

Just mentioning it because if using a development version of {lightgbm} built from source is an option (which I do understand is kind of painful), it might remove the need for other workarounds.
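
A minimal sketch of what that direct round trip might look like, assuming an installed {lightgbm} that includes the change from microsoft/LightGBM#4685 (this is not expected to work with the 3.3.2 CRAN release shown in the session info above). The mtcars data and the parameter values are arbitrary choices for illustration only.

```r
library(lightgbm)

# Small built-in dataset, used purely for illustration.
X <- as.matrix(mtcars[, c("wt", "hp", "disp")])
y <- mtcars$mpg

dtrain <- lgb.Dataset(data = X, label = y)

# Tiny model; min_data_in_leaf is lowered only because the data has 32 rows.
booster <- lgb.train(
  params = list(objective = "regression", min_data_in_leaf = 2L),
  data = dtrain,
  nrounds = 5L,
  verbose = -1L
)

# Assumption: the installed {lightgbm} supports plain saveRDS() / readRDS().
path <- tempfile(fileext = ".rds")
saveRDS(booster, path)
booster_read <- readRDS(path)

# Predictions from the restored model should match the original ones.
pred_before <- predict(booster, X)
pred_after <- predict(booster_read, X)
```

If the two prediction vectors agree, the separate saveRDS.lgb.Booster() / readRDS.lgb.Booster() step is not needed for that version.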

simonpcouch · 2022-08-08T19:16:36Z

Thanks for the note here, @jameslamb! Hadn't noticed that PR. Will consider that when figuring out our approach here / in bundle.

dpprdan · 2022-08-08T20:21:39Z

that original lgb.Booster fit object lives in the $fit$fit$fit slot rather than $fit

haha, I tried $fit$fit before but not $fit$fit$fit. 😂

I think I might just go with the dev/4.0 version of {lightgbm}. 😉

simonpcouch · 2022-08-15T21:08:28Z

An update from the bundle side:

We've opted to remove the lightgbm bundle method in light of that upcoming feature in lightgbm. This should "just work" in good time. :)

github-actions · 2023-01-11T01:42:57Z

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

simonpcouch mentioned this issue Aug 8, 2022

remove lightgbm support? rstudio/bundle#24

Closed

simonpcouch closed this as completed Aug 15, 2022

simonpcouch mentioned this issue Aug 16, 2022

saveRDS / readRDS issue with 'lightgbm' engine tidymodels/stacks#145

Closed

github-actions bot locked and limited conversation to collaborators Jan 11, 2023
