Question regarding recipes #140

Closed
nipnipj opened this issue May 3, 2023 · 8 comments
Labels
reprex needs a minimal reproducible example

Comments

@nipnipj

nipnipj commented May 3, 2023

Hello!
How can we correctly use recipes with spatial data? I'm getting the error "The number of roles should be the same as the number of variables" with:

library(sf)     # st_as_sf()
library(dplyr)  # mutate() and %>%
data("ames", package = "modeldata")
data_raw <- st_as_sf(ames, coords = c("Longitude", "Latitude")) %>%
  mutate(Sale_Price = log(Sale_Price))
@mikemahoney218
Member

Hi @nipnipj ! Can you please provide a reprex that shows the error you're getting? The code you provided should run perfectly fine, and without seeing what code you're running to trigger that error I'm not able to guess what's going on here.

data("ames", package = "modeldata")
(data_raw <- sf::st_as_sf(ames, coords = c("Longitude", "Latitude"))  |> 
  dplyr::mutate(Sale_Price = log(Sale_Price)))
#> Simple feature collection with 2930 features and 72 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -93.69315 ymin: 41.9865 xmax: -93.57743 ymax: 42.06339
#> CRS:           NA
#> # A tibble: 2,930 × 73
#>    MS_SubClass            MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
#>  * <fct>                  <fct>            <dbl>    <int> <fct>  <fct> <fct>    
#>  1 One_Story_1946_and_Ne… Resident…          141    31770 Pave   No_A… Slightly…
#>  2 One_Story_1946_and_Ne… Resident…           80    11622 Pave   No_A… Regular  
#>  3 One_Story_1946_and_Ne… Resident…           81    14267 Pave   No_A… Slightly…
#>  4 One_Story_1946_and_Ne… Resident…           93    11160 Pave   No_A… Regular  
#>  5 Two_Story_1946_and_Ne… Resident…           74    13830 Pave   No_A… Slightly…
#>  6 Two_Story_1946_and_Ne… Resident…           78     9978 Pave   No_A… Slightly…
#>  7 One_Story_PUD_1946_an… Resident…           41     4920 Pave   No_A… Regular  
#>  8 One_Story_PUD_1946_an… Resident…           43     5005 Pave   No_A… Slightly…
#>  9 One_Story_PUD_1946_an… Resident…           39     5389 Pave   No_A… Slightly…
#> 10 Two_Story_1946_and_Ne… Resident…           60     7500 Pave   No_A… Regular  
#> # ℹ 2,920 more rows
#> # ℹ 66 more variables: Land_Contour <fct>, Utilities <fct>, Lot_Config <fct>,
#> #   Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>, Condition_2 <fct>,
#> #   Bldg_Type <fct>, House_Style <fct>, Overall_Cond <fct>, Year_Built <int>,
#> #   Year_Remod_Add <int>, Roof_Style <fct>, Roof_Matl <fct>,
#> #   Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>,
#> #   Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>, Bsmt_Cond <fct>, …

Created on 2023-05-03 with reprex v2.0.2

@mikemahoney218 mikemahoney218 added the reprex needs a minimal reproducible example label May 3, 2023
@EmilHvitfeldt
Member

If this is really a {recipes} question, then we are aware that it doesn't work. We have ideas of how to make it work but it isn't scheduled to happen in the near or medium future.

library(recipes)
library(sf)

data("ames", package = "modeldata")
data_raw <- st_as_sf(ames, coords = c("Longitude", "Latitude"))  %>%  
  mutate(Sale_Price = log(Sale_Price))

recipe(~., data = data_raw)
#> Error in model.frame.default(formula, data[1, ]): invalid type (list) for variable 'geometry'
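
The error above is raised by base R's model.frame(), which rejects list columns, and an sf geometry column is a list-column of class sfc. A minimal sketch on a toy two-point object (the `pts` data is made up for illustration and does not come from this thread) shows the column type:

```r
library(sf)

# Hypothetical toy data, used only for illustration
pts <- st_as_sf(
  data.frame(x = c(0, 1), y = c(0, 1), value = c(10, 20)),
  coords = c("x", "y")
)

geom <- st_geometry(pts)
inherits(geom, "sfc")  # the geometry column has class "sfc"
typeof(geom)           # and is stored as a list under the hood
```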

@nipnipj
Author

nipnipj commented May 3, 2023

Yes, I forgot to add

rec <- data_raw %>% 
  recipe(Sale_Price ~ Year_Built + Gr_Liv_Area + Bldg_Type)

I see, thank you both for answering!

@mikemahoney218
Member

mikemahoney218 commented May 3, 2023

Now here's a question for @EmilHvitfeldt (thanks for stepping in 😄 ) -- any reason to expect the below to error or cause problems? Specifically, dropping the spatial information for the recipe specification, but fitting to resamples from spatialsample?

data("ames", package = "modeldata")
ames_sf <- sf::st_as_sf(ames, coords = c("Longitude", "Latitude"), crs = 4326)

# Drop the spatial information for the recipe:
recipe <- recipes::recipe(Sale_Price ~ Year_Built, data = sf::st_drop_geometry(ames_sf)) |> 
  recipes::step_log(recipes::all_outcomes())

workflows::workflow(recipe, parsnip::linear_reg()) |> 
  # but keep it when assigning resamples
  tune::fit_resamples(spatialsample::spatial_clustering_cv(ames_sf))
#> # Resampling results
#> # 10-fold spatial cross-validation 
#> # A tibble: 10 × 4
#>    splits             id     .metrics         .notes          
#>    <list>             <chr>  <list>           <list>          
#>  1 <split [2559/371]> Fold01 <tibble [2 × 4]> <tibble [0 × 3]>
#>  2 <split [2740/190]> Fold02 <tibble [2 × 4]> <tibble [0 × 3]>
#>  3 <split [2685/245]> Fold03 <tibble [2 × 4]> <tibble [0 × 3]>
#>  4 <split [2777/153]> Fold04 <tibble [2 × 4]> <tibble [0 × 3]>
#>  5 <split [2656/274]> Fold05 <tibble [2 × 4]> <tibble [0 × 3]>
#>  6 <split [2668/262]> Fold06 <tibble [2 × 4]> <tibble [0 × 3]>
#>  7 <split [2496/434]> Fold07 <tibble [2 × 4]> <tibble [0 × 3]>
#>  8 <split [2570/360]> Fold08 <tibble [2 × 4]> <tibble [0 × 3]>
#>  9 <split [2709/221]> Fold09 <tibble [2 × 4]> <tibble [0 × 3]>
#> 10 <split [2510/420]> Fold10 <tibble [2 × 4]> <tibble [0 × 3]>

Created on 2023-05-03 with reprex v2.0.2

@EmilHvitfeldt
Member

It might break in the future, before it gets official support 😬 We are being bitten by non-tibble-tibbles, so we are starting to force data.frames to be bare data.frames internally in some places while we wait for potential future native sf support.

See r-spatial/sf#2131 for some of the struggles.
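
One concrete way an sf object behaves differently from a plain tibble is that its geometry column is "sticky": subsetting columns keeps the geometry unless it is dropped explicitly. The sketch below uses a made-up toy object (`pts` is hypothetical) and only illustrates that behavior; it does not pin down the exact cause of the error reported at the top of this thread.

```r
library(sf)

# Hypothetical toy data, used only for illustration
pts <- st_as_sf(
  data.frame(x = c(0, 1), y = c(0, 1), value = c(10, 20)),
  coords = c("x", "y")
)

# Selecting one attribute column from an sf object keeps the geometry too
value_only <- pts["value"]
names(value_only)

# After dropping the geometry, the same subset behaves like a plain data frame
plain <- st_drop_geometry(pts)
names(plain["value"])
```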

@mikemahoney218
Member

I think that wouldn't cause any problems (and makes a lot of sense for tidymodels to do, if you're accepting inputs of any subclass and expecting them to not have any different methods or behaviors from tibbles). To be clear, that sf::st_drop_geometry() in the recipe() call is already casting the sf object to a tibble:

data("ames", package = "modeldata")
ames_sf <- sf::st_as_sf(ames, coords = c("Longitude", "Latitude"), crs = 4326)
sf::st_drop_geometry(ames_sf)
#> # A tibble: 2,930 × 72
#>    MS_SubClass            MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
#>  * <fct>                  <fct>            <dbl>    <int> <fct>  <fct> <fct>    
#>  1 One_Story_1946_and_Ne… Resident…          141    31770 Pave   No_A… Slightly…
#>  2 One_Story_1946_and_Ne… Resident…           80    11622 Pave   No_A… Regular  
#>  3 One_Story_1946_and_Ne… Resident…           81    14267 Pave   No_A… Slightly…
#>  4 One_Story_1946_and_Ne… Resident…           93    11160 Pave   No_A… Regular  
#>  5 Two_Story_1946_and_Ne… Resident…           74    13830 Pave   No_A… Slightly…
#>  6 Two_Story_1946_and_Ne… Resident…           78     9978 Pave   No_A… Slightly…
#>  7 One_Story_PUD_1946_an… Resident…           41     4920 Pave   No_A… Regular  
#>  8 One_Story_PUD_1946_an… Resident…           43     5005 Pave   No_A… Slightly…
#>  9 One_Story_PUD_1946_an… Resident…           39     5389 Pave   No_A… Slightly…
#> 10 Two_Story_1946_and_Ne… Resident…           60     7500 Pave   No_A… Regular  
#> # ℹ 2,920 more rows
#> # ℹ 65 more variables: Land_Contour <fct>, Utilities <fct>, Lot_Config <fct>,
#> #   Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>, Condition_2 <fct>,
#> #   Bldg_Type <fct>, House_Style <fct>, Overall_Cond <fct>, Year_Built <int>,
#> #   Year_Remod_Add <int>, Roof_Style <fct>, Roof_Matl <fct>,
#> #   Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>,
#> #   Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>, Bsmt_Cond <fct>, …

Created on 2023-05-04 with reprex v2.0.2

The data in the resamples from spatial_clustering_cv() is still an sf object, but the recipe isn't looking for the geometry column, so casting that to a tibble should be fine. I think that means this should be decently future-proof. It does mean you can't directly use geometry columns as predictors or in recipe steps, but tidymodels doesn't support that anyway, so I think this workaround will work for most use cases.
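
If location itself is needed as a predictor, one possible approach (an assumption, not something proposed in this thread) is to copy the point coordinates into ordinary numeric columns before dropping the geometry. The sketch uses a hypothetical toy object and the made-up column names `lon` and `lat`:

```r
library(sf)

# Hypothetical toy points, used only for illustration
pts <- st_as_sf(
  data.frame(x = c(-93.6, -93.7), y = c(42.0, 42.05), value = c(10, 20)),
  coords = c("x", "y"),
  crs = 4326
)

# For point geometries, st_coordinates() returns a matrix with columns "X" and "Y"
xy <- st_coordinates(pts)
pts$lon <- xy[, "X"]  # `lon` and `lat` are illustrative names
pts$lat <- xy[, "Y"]

# The geometry-free copy now carries the coordinates as plain numeric columns
predictors <- st_drop_geometry(pts)
names(predictors)
```

A recipe specified on `predictors` could then treat `lon` and `lat` like any other numeric column, while the sf object that carries the same columns is still the one handed to spatialsample.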

@mikemahoney218
Member

Going to go ahead and close this issue now, as it sounds like there might be better venues ( https://github.com/tidymodels/planning/ , maybe?) for discussions of how and whether tidymodels wants to support spatial data moving forward. I believe the workaround I shared will be pretty robust going forward: only spatialsample functions actually need an sf object, so fit_resamples() or recipe() casting to data frames shouldn't matter. Future readers of this thread should be aware that the situation may have changed!

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators May 19, 2023