Inconsistency with the glmnet model? #1124

Closed
ZWael opened this issue May 24, 2024 · 4 comments


ZWael commented May 24, 2024

I was trying to use glmnet for feature selection.

I used the code below, inspired by Julia Silge's blog post (many thanks to her): https://juliasilge.com/blog/lasso-the-office/
I was selecting variables with two different values of lambda.
If I am not mistaken, the variables selected with the higher penalty should be included among the variables selected with the lower penalty. That isn't the case in my data, and it is reproduced in the toy data below.

Shouldn't the final model from finalize_workflow() no longer vary?

library(dplyr)
library(tidymodels)
library(vip)
library(glmnet) # for the toy dataset
data(QuickStartExample) # glmnet toy data
set.seed(1234)
# we add some vars and noise to simulate data with p>>>n
x <- replicate(8, {
  QuickStartExample$x + rnorm(2000)
})
QuickStartExample$x <- cbind(QuickStartExample$x,
                             x[, , 1], x[, , 2], x[, , 3], x[, , 4],
                             x[, , 5], x[, , 6], x[, , 7], x[, , 8])
data <- data.frame(y = QuickStartExample$y, QuickStartExample$x)

rm(QuickStartExample)
set.seed(1234)
split_init <- initial_split(data, strata = y)
train_d <- training(split_init)
test_d <- testing(split_init)

cv <- vfold_cv(train_d, v = 10, repeats = 3)
rec <- recipe(y ~ ., data = train_d)

lasso_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(lasso_spec)
res <- tune_grid(
  wf,
  resamples = cv,
  grid = 100
)
autoplot(res)

L1 <- res %>%
  select_by_pct_loss(metric = "rmse", limit = 10, desc(penalty))
L2 <- res %>%
  select_by_one_std_err(metric = "rmse", desc(penalty))
L1$penalty
#> [1] 0.2018086
L2$penalty
#> [1] 0.1004087
final_mod1 <- finalize_workflow(wf, L1) %>%
  fit(train_d)
final_mod2 <- finalize_workflow(wf, L2) %>%
  fit(train_d)
vars1 <- vip::vi(final_mod1 %>% extract_fit_parsnip(), lambda = L1$penalty)
vars2 <- vip::vi(final_mod2 %>% extract_fit_parsnip(), lambda = L2$penalty)
L1$penalty > L2$penalty  ## lambda L2 is less stringent, so we expect the vars selected by L1 to be included among those selected by L2
#> [1] TRUE
v1 <- vars1 %>% filter(Importance != 0) %>% pull(Variable)
v2 <- vars2 %>% filter(Importance != 0) %>% pull(Variable)

table(v1 %in% v2)
#> 
#> FALSE  TRUE 
#>     2    25

v1
#>  [1] "X1"   "X14"  "X5"   "X6"   "X3"   "X40"  "X20"  "X143" "X160" "X140"
#> [11] "X111" "X26"  "X8"   "X75"  "X76"  "X163" "X120" "X65"  "X72"  "X146"
#> [21] "X171" "X29"  "X60"  "X148" "X74"  "X165" "X35"
v2
#>  [1] "X1"   "X14"  "X5"   "X40"  "X6"   "X3"   "X143" "X140" "X26"  "X111"
#> [11] "X8"   "X160" "X75"  "X163" "X65"  "X72"  "X120" "X76"  "X134" "X66" 
#> [21] "X35"  "X165" "X60"  "X177" "X56"  "X73"  "X59"  "X29"  "X146" "X24" 
#> [31] "X170" "X64"  "X159" "X148" "X141" "X54"  "X74"  "X67"

Created on 2024-05-24 with reprex v2.1.0
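
To pinpoint the discrepancy, a quick check (a sketch reusing v1 and v2 from the reprex above) lists the variables selected at the stronger penalty but not at the weaker one:

# Variables active at L1's (higher) penalty but inactive at L2's (lower) penalty
setdiff(v1, v2)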


ZWael commented May 24, 2024

Same result with a code modification intended to fit the model only once:

final_mod <- finalize_workflow(wf, L1) %>%
  fit(train_d)
vars1 <- vip::vi(final_mod %>% extract_fit_parsnip(), lambda = L1$penalty)
vars2 <- vip::vi(final_mod %>% extract_fit_parsnip(), lambda = L2$penalty)

v1 <- vars1 %>% filter(Importance != 0) %>% pull(Variable)
v2 <- vars2 %>% filter(Importance != 0) %>% pull(Variable)

table(v1 %in% v2)
#> 
#> FALSE  TRUE 
#>     2    25

Created on 2024-05-24 with reprex v2.1.0
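
This works because the parsnip fit stores the entire glmnet path, so coefficients at any penalty can be read from a single fitted workflow; a minimal sketch, assuming the objects from the reprex above:

# The engine object is the full glmnet path; coef() interpolates to any penalty
path_fit <- extract_fit_engine(final_mod)
coef(path_fit, s = L2$penalty)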

simonpcouch (Contributor) commented May 24, 2024

> If I am not mistaken, the variables selected with the higher penalty should be included among the variables selected with the lower penalty.

This is a reasonable hypothesis, but it turns out to be untrue! The lasso path is not strictly nested: a variable included at a higher penalty is not guaranteed to persist at a lower one. The active set of non-zero coefficients changes with the interplay of penalty strength, correlation among predictors, and the optimization algorithm.

Here's an example with glmnet itself, no tidymodels involved:

library(dplyr)
library(vip)
library(glmnet) # for the toy dataset
#> Loading required package: Matrix
#> 
#> Attaching package: 'Matrix'
#> The following objects are masked from 'package:tidyr':
#> 
#>     expand, pack, unpack
#> Loaded glmnet 4.1-8
data(QuickStartExample) # glmnet toy data
set.seed(1234)
# we add some vars and noise to simulate data with p>>>n
x <- replicate(8, {
  QuickStartExample$x + rnorm(2000)
})
QuickStartExample$x <- cbind(QuickStartExample$x,
                             x[, , 1], x[, , 2], x[, , 3], x[, , 4],
                             x[, , 5], x[, , 6], x[, , 7], x[, , 8])


glmnet_fit <- glmnet(QuickStartExample$x, QuickStartExample$y, alpha = 1)

vars1 <- vip::vi(glmnet_fit, lambda = 0.2) %>% filter(Importance != 0) %>% pull(Variable)
vars2 <- vip::vi(glmnet_fit, lambda = 0.1) %>% filter(Importance != 0) %>% pull(Variable)

table(vars1 %in% vars2)
#> 
#> FALSE  TRUE 
#>     2    15

Created on 2024-05-24 with reprex v2.1.0
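
To see exactly which predictors drop out of the active set between the two penalties, the fitted path can be queried directly; a minimal sketch, assuming glmnet_fit from the example above (coef() interpolates along the fitted path and returns one column per requested value of s):

# Compare the active sets at the two penalties
betas <- coef(glmnet_fit, s = c(0.2, 0.1))
active_high <- rownames(betas)[betas[, 1] != 0]  # non-zero at lambda = 0.2
active_low  <- rownames(betas)[betas[, 2] != 0]  # non-zero at lambda = 0.1
setdiff(active_high, active_low)  # selected at 0.2 but dropped at 0.1
plot(glmnet_fit, xvar = "lambda")  # coefficient paths; some cross zero on the way down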


ZWael commented May 25, 2024

Thank you for the feedback.


github-actions bot commented Jun 9, 2024

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

github-actions bot locked and limited conversation to collaborators Jun 9, 2024