CppMethod error when applying prepped UMAP recipe after saving/reading as .rds #84

Closed
juliasilge opened this issue Aug 2, 2021 · 7 comments · Fixed by #142
Labels
bug an unexpected problem or unintended behavior

Comments

@juliasilge
Member

juliasilge commented Aug 2, 2021

There seems to be a bug 🐛 in step_umap() when saving a prepped recipe as .rds and reading it back in to apply it to new data.

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(tidyverse)
library(embed)

split <- seq.int(1, 150, by = 9)
tr <- iris[-split, ]
te <- iris[ split, ]

set.seed(11)
supervised <- 
   recipe(Species ~ ., data = tr) %>%
   step_center(all_predictors()) %>% 
   step_scale(all_predictors()) %>% 
   step_umap(all_predictors(), outcome = vars(Species), num_comp = 2) %>% 
   prep(training = tr)

write_rds(supervised, here::here(tempdir(), "umap.rds"))
saved_rec <- read_rds(here::here(tempdir(), "umap.rds"))
saved_rec %>% bake(new_data = te)
#> Error in .External(structure(list(name = "CppMethod__invoke_notvoid", : NULL value passed as symbol address

Created on 2021-08-02 by the reprex package (v2.0.0)

I'm sure this is not us (i.e. not the embed package) but I wonder if there is anything we can do about this.

The recipe is fine if you don't save as .rds and then read it back.

@jlmelville

I am very late to discovering this, but yes this is almost certainly because of the underlying UMAP package (uwot), which uses RcppAnnoy, which itself wraps the C++ library Annoy to find approximate nearest neighbors. The RcppAnnoy objects have save and load methods that must be called and just using saveRDS with them won't work (at least I couldn't get it to work). In turn uwot needs to provide special functions to save and load its state but it's all very unsatisfactory. Sorry about that. I was unable to think of a workaround.

I do intend to fix this, but my current solution involves writing an entirely new approximate nearest neighbors package. As that and maintaining uwot exist entirely as a spare-time endeavor, it's taking quite a long time (3 years and counting for the nearest neighbor package). I'll get there in the end. Probably.
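
For reference, the uwot-specific save and load route mentioned above can be sketched roughly as follows. This is a minimal illustration rather than a fix for the recipe itself: it assumes uwot's `save_uwot()` and `load_uwot()` behave as documented, and the data subset, the file name, and the explicit `nn_method = "annoy"` setting are choices made only for this example.

```r
library(uwot)

set.seed(11)

# Fit a UMAP model on the numeric iris columns and keep the model object.
# nn_method = "annoy" is set explicitly here (an assumption for illustration)
# so that the model holds an RcppAnnoy index, the part that saveRDS() cannot
# serialize on its own.
model <- umap(iris[, 1:4], nn_method = "annoy", ret_model = TRUE)

# Use uwot's own save function instead of saveRDS().
model_file <- tempfile(fileext = ".uwot")
save_uwot(model, file = model_file)

# Later (for example in a fresh session), restore the model and embed new rows.
restored <- load_uwot(model_file)
new_coords <- umap_transform(iris[1:5, 1:4], restored)
dim(new_coords)
```

This only covers a bare uwot model. A prepped recipe wraps the model inside a step object, so this route does not apply directly to the recipe saved with write_rds() in the original report.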

@juliasilge
Member Author

Thanks for the message @jlmelville and for your work on uwot! 🙌 We also are thinking about serialization for trained model objects like xgboost, torch, etc, that have native methods for saving/loading. Definitely an area that needs some attention from all of us!

@juliasilge
Member Author

This has now been solved with the new bundle package:

library(tidymodels)
library(tidyverse)
library(embed)

split <- seq.int(1, 150, by = 9)
tr <- iris[-split, ]
te <- iris[ split, ]

set.seed(11)
supervised <- 
  recipe(Species ~ ., data = tr) %>%
  step_center(all_predictors()) %>% 
  step_scale(all_predictors()) %>% 
  step_umap(all_predictors(), outcome = vars(Species), num_comp = 2) %>% 
  prep(training = tr)

library(bundle)
temp_file <- fs::file_temp(pattern = "umap", ext = "rds")
bundle(supervised) %>% write_rds(temp_file)

saved_rec <- read_rds(temp_file)
unbundle(saved_rec) %>% bake(new_data = te)
#> # A tibble: 17 × 3
#>    Species     UMAP1  UMAP2
#>    <fct>       <dbl>  <dbl>
#>  1 setosa      13.3    2.93
#>  2 setosa      12.0    4.69
#>  3 setosa      14.5    3.12
#>  4 setosa      13.5    3.07
#>  5 setosa      13.4    2.99
#>  6 setosa      12.0    4.86
#>  7 versicolor -10.1    8.80
#>  8 versicolor  -9.79   8.28
#>  9 versicolor  -4.91 -11.6 
#> 10 versicolor  -9.66   6.12
#> 11 versicolor -10.1    6.61
#> 12 versicolor -10.3    6.98
#> 13 virginica   -4.14 -11.6 
#> 14 virginica   -2.69 -12.1 
#> 15 virginica   -4.06 -10.3 
#> 16 virginica   -1.73 -11.5 
#> 17 virginica   -2.33 -10.9

Created on 2022-09-16 with reprex v2.0.2

We should document somewhere that this step needs to be bundled for use in a new session. How do you all want to do that?

@jlmelville

Looks like I need to get in on this bundle thing...

@EmilHvitfeldt
Member

I think we should document it as a section, like we do with Tidying and Case weights. That way it will be easier to link to the documentation when the question pops up.

@topepo
Member

topepo commented Sep 22, 2022

Agreed. We just did this for the parsnip engine docs.

@github-actions

github-actions bot commented Oct 8, 2022

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Oct 8, 2022