
Formula vs non-formula interface with train() #803

Closed
ghost opened this issue Dec 10, 2017 · 5 comments

ghost commented Dec 10, 2017

Hello,

First, I cannot thank you enough for all your tremendous contributions with packages and books/seminars/webinars/courses. I am new to caret and learning a lot every day. I have the following issue, which I reported elsewhere, and I can see that others have had similar problems too, but I haven't found a solution yet. The issue is that I can get glmnet, ranger and xgbTree working with the formula interface for both classification and regression problems, but they all fail with the non-formula interface. Judging by the consistency with which I am facing this issue, it could be either a feature of the models/methods that I am yet to understand, or perhaps something wrong with my own setup. The only thing I know (?) is that the formula interface causes train() to convert each categorical variable into indicator variables, but I am not sure whether that could be the source of this difference. All my datasets have multiple categorical predictors, so I could not test with one that does not.

I could use my own code/datasets here, but I thought a better illustration would be one of your example codes, pasted below (taken from a DataCamp course). The formula/non-formula interface and the method can be chosen by uncommenting the appropriate lines. The error I get with this code is different from the one I get with my own code, but the pattern of failure seems similar. I am using RStudio v1.1.382, R v3.4.3 x64, caret v6.0-78 (devtools version).

Thanks again,
Manojit

library(caret)
library(C50)
library(glmnet)
library(mlbench)
library(xgboost)

data(churn)

set.seed(42)
myFolds <- createFolds(churnTrain$churn, k = 5)

myControl <- trainControl(
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  savePredictions = TRUE,
  index = myFolds,
  allowParallel = FALSE
)

churnTrain$churn <- factor(churnTrain$churn, levels = c("no", "yes"))
X <- churnTrain[, !(names(churnTrain) %in% "churn")]
Y <- churnTrain$churn
fit <- train(
#  x = X, y = Y,                 # non-formula
  churn ~ ., churnTrain,         # formula
  metric = "ROC",
#  method = "glmnet",
  method = "ranger",
#  method = "xgbTree",
  trControl = myControl
)

print(plot(fit))

coforfe commented Jan 26, 2018

Hello,

Review this presentation for the explanation (slide 16):

https://www.slideshare.net/Work-Bench/i-dont-want-to-be-a-dummy-encoding-predictors-for-trees

Regards,
Carlos.

topepo (Owner) commented Jan 30, 2018

Most[*] models require numeric representations of the data, so you would have to convert the factors to dummy variables before using the non-formula method, or use the formula or recipe interfaces to train.

[*] 99.X% of them; trees, rule-based models, and a few others (such as naive Bayes) generally do not. xgboost is an exception among tree-based models; it requires dummy variables.
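For anyone following along with the churn example above, here is a minimal sketch of the conversion topepo describes, using caret's dummyVars() (variable names reuse the example; tuning details are omitted):

```r
library(caret)
library(C50)

data(churn)  # provides churnTrain

# Build a dummy-variable encoder from the predictors (the outcome is excluded
# by the formula, so it is not dummy-encoded)
dummies <- dummyVars(churn ~ ., data = churnTrain)

# Apply the encoder to get an all-numeric predictor data frame
X_dummy <- as.data.frame(predict(dummies, newdata = churnTrain))
Y <- churnTrain$churn

# X_dummy / Y can now be passed to the non-formula interface, e.g.
# train(x = X_dummy, y = Y, method = "glmnet", trControl = myControl, ...)
```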

topepo closed this as completed Jan 30, 2018
ghost (Author) commented Jan 30, 2018

I have figured much of this out by now. Thanks to both of you for the comments.

yimingli commented Feb 25, 2018

For caret newbies like me, here is another caveat when using the non-formula method train(x, y, ...): y should be a vector, not a single-column data frame. If you supply y as a single-column data frame, the error message is likely to be "Error: nrow(x) == n is not TRUE". This is in the train() help file, but it took me a while to debug.
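A small illustration of this caveat, reusing churnTrain from the example above (the variable names here are just for demonstration):

```r
library(C50)
data(churn)

y_df  <- churnTrain["churn"]   # single-column data frame: train() rejects this
                               # with "Error: nrow(x) == n is not TRUE"
y_vec <- churnTrain$churn      # atomic (factor) vector: what train() expects

# Equivalent extractions that also return a vector:
#   churnTrain[["churn"]]
#   churnTrain[, "churn"]
```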

@ciberger

There is a workaround for people who want to make use of the flexibility of formulas when modelling with catboost. Define a formula as train.formula <- formula(y ~ x1 + x2 + ...) and then access the response variable and the covariates as follows:

train(
  y = model.frame(train.formula, df.train)[, 1],  # response variable
  x = model.frame(train.formula, df.train)[, -1], # covariates
  .....
)

where df.train is your training set.
