Hi,
I thought it was equivalent to pass the input data to train() either as a formula (y ~ x, data = ...) or as y = y, x = x, but I just found a difference when retrieving importances for the rf method.
If train() is called with a formula, the importances are reported per level of the categorical variable.
If train() is NOT called with a formula, the importances are reported for the whole factor variable, as expected.
The randomForest package, on which train() is based for this method, was not given a formula here and only gives the overall importances.
Where does this difference come from?
library(tidyverse)
data("chickwts")
# Add a numeric predictor in the data
dt <- data.frame(chickwts, xx = rnorm(length(chickwts$feed)))
rfCARET <- caret::train(
  y = dt$weight,
  x = dt %>% select(-weight),
  method = "rf",
  importance = TRUE
)
caret::varImp(rfCARET$finalModel)
#>         Overall
#> feed 46.4034574
#> xx   -0.4728728

rfCARET_form <- caret::train(
  weight ~ .,
  data = dt,
  method = "rf",
  importance = TRUE
)
caret::varImp(rfCARET_form$finalModel)
#>                  Overall
#> feedhorsebean 29.3780596
#> feedlinseed   18.8974110
#> feedmeatmeal   1.8305809
#> feedsoybean   11.5654634
#> feedsunflower 13.5071325
#> xx            -0.6113386

## Try with the randomForest package.
library(randomForest)
rf <- randomForest(x = dt %>% select(-weight),
                   y = dt$weight)
importance(rf)
#>      IncNodePurity
#> feed      213583.0
#> xx        129131.1
If train() is called with a formula, the importances are reported per level of the categorical variable.
If train() is NOT called with a formula, the importances are reported for the whole factor variable, as expected.
Your expectation is pretty reasonable. 99.9% of the time, a formula method will generate indicator variables for qualitative predictors. train is consistent with the majority of functions that use formulas.
However, there are a variety of package functions whose models do not require that all of the predictors be encoded as numbers. Trees, rule-based models, naive Bayes, and others fall into this bucket.
So, if you want to keep factors as factors, use the non-formula method for train.
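For illustration, here is a minimal sketch (reusing the dt built in the reprex above, not code from the original reply) of the dummy coding that a formula interface applies, which is why the importances come back per factor level:

## Sketch: a formula method dummy-codes factors before fitting, so each
## level of `feed` (except the reference level, casein) becomes its own
## indicator column.
colnames(model.matrix(weight ~ ., data = dt))
## expected columns: "(Intercept)", "feedhorsebean", "feedlinseed",
## "feedmeatmeal", "feedsoybean", "feedsunflower", "xx"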
People can still benefit from the flexibility of formulas while keeping factors as factors by using the model.frame() function. See the workaround in #803.
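For reference, a minimal sketch of that style of workaround (not copied verbatim from #803), again reusing dt from the reprex above:

## Build the data via a formula, then pass x and y to train() so that
## `feed` stays a single factor column instead of being dummy-coded.
mf <- model.frame(weight ~ ., data = dt)
rf_keep_factors <- caret::train(
  x = mf[, -1, drop = FALSE],   # predictors, with `feed` still a factor
  y = mf[[1]],                  # the response, weight
  method = "rf",
  importance = TRUE
)
caret::varImp(rf_keep_factors$finalModel)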