
Formula vs non-formula interface with train() #803

Closed
ghost opened this issue Dec 10, 2017 · 5 comments

ghost commented Dec 10, 2017

Hello,

First, I cannot thank you enough for all your tremendous contributions with packages and books/seminars/webinars/courses. I am new to caret and learning a lot every day. I have the following issue, which I reported elsewhere, and I can see that others have had similar problems too, but I haven't found a solution yet. The issue is that I can get glmnet, ranger and xgbTree working with the formula interface for both classification and regression problems, but they all fail with the non-formula interface. Judging by the consistency with which I am facing this issue, it could be either a feature of the models/methods that I am yet to understand, or perhaps something wrong with my own setup. The only thing I know (?) is that the formula interface causes train() to convert each categorical variable into indicator variables, but I am not sure whether that could be the source of this difference. All my datasets have multiple categorical predictors, so I could not test with one that does not.

I could use my own code/datasets here, but I thought a better illustration would be one of your example codes, pasted below (taken from a DataCamp course). The formula/non-formula interface and the method can be chosen by uncommenting the appropriate lines. The error I get with this code is different from the one I get with my own code, but the pattern of failure seems similar. I am using RStudio v1.1.382, R v3.4.3 x64, caret v6.0-78 (devtools version).

Thanks again,
Manojit

library(caret)
library(C50)
library(glmnet)
library(mlbench)
library(xgboost)

data(churn)

set.seed(42)
myFolds <- createFolds(churnTrain$churn, k = 5)

myControl <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  savePredictions = TRUE,
  index = myFolds,
  allowParallel = FALSE
)

churnTrain$churn <- factor(churnTrain$churn, levels = c("no", "yes"))
X <- churnTrain[, !(names(churnTrain) %in% "churn")]
Y <- churnTrain$churn
fit <- train(
#  x = X, y = Y,                 # non-formula
  churn ~ ., churnTrain,         # formula
  metric = "ROC",
#  method = "glmnet",
  method = "ranger",
#  method = "xgbTree",
  trControl = myControl
)

print(plot(fit))

coforfe commented Jan 26, 2018

Hello,

Review this presentation for the explanation (slide 16):

https://www.slideshare.net/Work-Bench/i-dont-want-to-be-a-dummy-encoding-predictors-for-trees

Regards,
Carlos.

topepo (Owner) commented Jan 30, 2018

Most[*] models require numeric representations of the data, so you would have to convert the factors to dummy variables before using the non-formula method, or use the formula or recipe interfaces to train.

[*] 99.X% of them; trees, rule-based models, and a few others (such as naive Bayes) generally do not. xgboost is an exception among tree-based models; it requires dummy variables.
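For anyone following along with the churn example above, here is a minimal sketch of the conversion topepo describes, using caret's dummyVars() (variable names reuse the example; tuning details are omitted):

```r
library(caret)
library(C50)

data(churn)  # provides churnTrain

# Build a dummy-variable encoder from the predictors (the outcome is excluded
# by the formula, so it is not dummy-encoded)
dummies <- dummyVars(churn ~ ., data = churnTrain)

# Apply the encoder to get an all-numeric predictor data frame
X_dummy <- as.data.frame(predict(dummies, newdata = churnTrain))
Y <- churnTrain$churn

# X_dummy / Y can now be passed to the non-formula interface, e.g.
# train(x = X_dummy, y = Y, method = "glmnet", trControl = myControl, ...)
```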

topepo closed this as completed Jan 30, 2018
ghost (Author) commented Jan 30, 2018

I have figured much of this out by now. Thanks to both of you for the comments.

yimingli commented Feb 25, 2018

For caret newbies like me, here is another caveat when using the non-formula method train(x, y, ...): y should be a vector, not a single-column data frame. If you supply y as a single-column data frame, the error message is likely to be "Error: nrow(x) == n is not TRUE". This is in the train() help file, but it took me a while to debug.
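A small illustration of this caveat, reusing churnTrain from the example above (the variable names here are just for demonstration):

```r
library(C50)
data(churn)

y_df  <- churnTrain["churn"]   # single-column data frame: train() rejects this
                               # with "Error: nrow(x) == n is not TRUE"
y_vec <- churnTrain$churn      # atomic (factor) vector: what train() expects

# Equivalent extractions that also return a vector:
#   churnTrain[["churn"]]
#   churnTrain[, "churn"]
```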

@ciberger

There is a workaround for people who want to make use of the flexibility of formulas when modelling with catboost. Define a formula as train.formula <- formula(y ~ x1 + x2 + ...) and then access the response variable and the covariates as follows:

train(
  y = model.frame(train.formula, df.train)[, 1],  # response variable
  x = model.frame(train.formula, df.train)[, -1], # covariates
  .....
)

where df.train is your training set.
