Unexpectedly different behavior for factors/dummy variables between parsnip and workflows #326

Closed
@juliasilge

When training what seems like the same model (same model specification, same formula, same data) with parsnip and with workflows, it is surprising to get different results. I found this quite unexpected, especially the workflows result: it reports a coefficient for every level of Species, so the last one is NA and the intercept changes.

One way to reduce user surprise 😮 would be to make this behavior clearer in the functions themselves, in parsnip, in workflows, or in both.

lm(Sepal.Length ~ ., iris)
#> 
#> Call:
#> lm(formula = Sepal.Length ~ ., data = iris)
#> 
#> Coefficients:
#>       (Intercept)        Sepal.Width       Petal.Length        Petal.Width  
#>            2.1713             0.4959             0.8292            -0.3152  
#> Speciesversicolor   Speciesvirginica  
#>           -0.7236            -1.0235

library(parsnip)
lm_spec <- linear_reg() %>%
  set_engine(engine = "lm") 

## parsnip version looks the same as lm
lm_spec %>%
  fit(Sepal.Length ~ ., data = iris)
#> parsnip model object
#> 
#> Fit time:  2ms 
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#>       (Intercept)        Sepal.Width       Petal.Length        Petal.Width  
#>            2.1713             0.4959             0.8292            -0.3152  
#> Speciesversicolor   Speciesvirginica  
#>           -0.7236            -1.0235

## workflows version has made a different choice about dummy variables
library(workflows)
workflow() %>%
  add_model(lm_spec) %>%
  add_formula(Sepal.Length ~ .) %>%
  fit(data = iris)
#> ══ Workflow [trained] ═══════════════════════════════════════════════════════════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: linear_reg()
#> 
#> ── Preprocessor ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> Sepal.Length ~ .
#> 
#> ── Model ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#>       (Intercept)        Sepal.Width       Petal.Length        Petal.Width  
#>            1.1478             0.4959             0.8292            -0.3152  
#>     Speciessetosa  Speciesversicolor   Speciesvirginica  
#>            1.0235             0.2999                 NA

Created on 2020-02-06 by the reprex package (v0.3.0)
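
For readers trying to make sense of the third set of coefficients, here is a minimal base R sketch that reproduces the same pattern. It rests on an assumption this issue does not confirm: that the formula is expanded into a predictor matrix without an intercept column (so the factor keeps an indicator for every level), and that lm then adds its own intercept. The names x, expanded, fit_direct, and fit_expanded are illustrative only and are not part of parsnip or workflows.

```r
# Direct fit: the intercept absorbs the first level of Species (setosa).
fit_direct <- lm(Sepal.Length ~ ., data = iris)

# ASSUMPTION: predictors are expanded without an intercept column, which
# makes model.matrix() emit an indicator for all three Species levels.
x <- model.matrix(Sepal.Length ~ . + 0, data = iris)
colnames(x)

# Refit on the expanded columns; lm() adds an intercept again, so the three
# Species indicators are collinear with it and the last one is dropped (NA).
expanded <- data.frame(Sepal.Length = iris$Sepal.Length, x, check.names = FALSE)
fit_expanded <- lm(Sepal.Length ~ ., data = expanded)
coef(fit_expanded)

# Both fits describe the same model, just with a different reference level.
all.equal(unname(fitted(fit_direct)), unname(fitted(fit_expanded)))
```

If this is indeed what happens, the two fits differ only in parameterization: the reference level moves from setosa to virginica, and the fitted values should agree.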

Labels: feature (a feature request or enhancement)
