-
Notifications
You must be signed in to change notification settings - Fork 45
/
Copy pathAdvanced_Feature_Selection.Rmd
162 lines (130 loc) · 5.87 KB
/
Advanced_Feature_Selection.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
---
title: "Advanced Feature Selection"
output: html_document
---
**Author**: Rajiv Shah
**Contributors**: Emily Webber
**Label**: Modeling Options
### Scope
The scope of this notebook is to provide instructions on how to do advanced feature selection using all of the models created during a DataRobot autopilot.
### Background
This is the procedure we are going to follow:
* Calculate the feature importance for each trained model
* Get the feature ranking for each trained model
* Get the ranking distribution for each feature across models
* Sort by mean rank and visualize
### Requirements
- R version 3.6.2
- DataRobot API version 2.2.0.
Small adjustments might be needed depending on the R version and DataRobot API version you are using.
Full documentation of the R package can be found here: https://cran.r-project.org/web/packages/datarobot/index.html
It is assumed you already have a DataRobot <code>Project</code> object and a DataRobot <code>Model </code> object.
#### Import Packages
```{r results = 'hide', warning=FALSE, message=FALSE}
library(datarobot)
library(dplyr)
library(stringr)
library(ggplot2)
library(purrr)
```
I do not want to pick all of the models. I will ignore Blender, Auto-Tuned, and models trained with a small percentage of data. Lastly, I only care about models that were trained on 'Informative Features' but this can change depending on your needs. I also limited my selection to models that used between 64% and 80% of the data in training.
```{r echo=FALSE, results = 'hide', warning=FALSE, message=FALSE}
#This piece of code will not show
ConnectToDataRobot(endpoint = "YOUR_DATAROBOT_HOSTNAME",
token = "YOUR_API_KEY")
project <- GetProject("YOUR_PROJECT_ID")
allModels <- ListModels(project)
modelFrame <- as.data.frame(allModels)
metric <- modelFrame$validationMetric
if (project$metric %in% c('AUC', 'Gini Norm')) {
bestIndex <- which.max(metric)
} else {
bestIndex <- which.min(metric)
}
model <- allModels[[bestIndex]]
model$modelType
```
```{r}
models <- ListModels(project)
bestmodels <- Filter(function(m) m$featurelistName == "Informative Features" & m$samplePct >= 64 & m$samplePct <= 80 & !str_detect(m$modelType, 'Blender') & !str_detect(m$modelType, 'Auto-Tuned') , models)
bestmodels <- Filter(function(m) m$samplePct >= 63, models)
```
Then, we will Collect all the feature impact info for the best models.
```{r}
all_impact<- NULL
for(i in 1:length(bestmodels)) {
featureImpact <- GetFeatureImpact(bestmodels[[i]])
featureImpact$modelname <- bestmodels[[i]]$modelType
all_impact <- rbind(all_impact,featureImpact)
}
```
Now we can plot these features.
```{r}
all_impact <- all_impact %>% mutate(finalnorm = impactUnnormalized/max(impactUnnormalized))
p <- ggplot(all_impact, aes(x=reorder(featureName, finalnorm, FUN=median), y=finalnorm))
p + geom_boxplot(fill= "#2D8FE2") + coord_flip() + theme(axis.text=element_text(size=16),
axis.title=element_text(size=12,face="bold"))
```
Finally, create a new feature list with the top features and re-run DataRobot's Autopilot
```{r eval=FALSE}
## Process Feature impact
ranked_impact <- all_impact %>% group_by(featureName) %>%
summarise(impact = mean(finalnorm)) %>%
arrange(desc(impact))
#New Feature List
topfeatures <- pull(ranked_impact,featureName)
No_of_features_to_select <- 10
listname = paste0("TopFI_", No_of_features_to_select)
Feature_id_percent_rank = CreateFeaturelist(project, listName= listname , featureNames = topfeatures[1:No_of_features_to_select])$featurelistId
# Run AutoPilot on new Top Features
StartNewAutoPilot(project,featurelistId = Feature_id_percent_rank)
WaitForAutopilot(project)
```
### Determine Ideal Number of Features
In addition to running autopilot on a single new feature list, you can do multiple runs of autopilot and plot the performance for each feature list length. This can help you decide how many features to include in a model. The code below creates a loop that does an autopilot run using the top 9, 6, 3 and 1 feature.
```{r eval=FALSE}
No_of_features_to_loop <- c(9, 6, 3, 1)
for(i in 1:length(No_of_features_to_loop)) {
listname = paste0("TopFI_", No_of_features_to_loop[i])
Feature_id_percent_rank = CreateFeaturelist(project, listName= listname , featureNames = topfeatures[1:i])$featurelistId
StartNewAutoPilot(project,featurelistId = Feature_id_percent_rank)
WaitForAutopilot(project)
}
```
The next step is to extract the best AUC for each feature list. The next code block uses a custom function to pull this data from an updated model list, and then graphs it using GGPLOT2.
```{r eval=FALSE}
#Create a list of models and select the ones that used the new feature lists.
models <- ListModels(project)
FI_list <- c("TOPFI_1", "TOPFI_3", "TOPFI_6", "TOPFI_9")
unpack_func <- function(i) {
unpack <- models[[i]]
modelid <- unpack$modelId
auc <- unpack$metrics$AUC$validation
fname <- unpack$featurelistName
output <- c(modelid, auc, fname)
return(output)
}
# map over function and convert to dataframe
fi_models <- map(1:length(models), unpack_func)
output <- as.data.frame(t(sapply(fi_models, function(x) x[1:3])))
output$V2 <- as.numeric(as.character(output$V2))
#find the max score for each feature list.
result <- output %>%
group_by(V3) %>%
filter(V2 == max(V2)) %>%
filter(!duplicated(V3))
#Filter only the ones we want to plot
graph <- filter(result, V3 == "TopFI_1" | V3 == "TopFI_3" | V3 == "TopFI_6" | V3 == "TopFI_9")
#Create numeric for the number of features in each list.
graph$number_of_features <- c(1, 3, 6, 9)
```
```{r eval=FALSE}
ggplot(data=graph, aes(x=graph$number_of_features, y=graph$V2) ) +
geom_line(color = "#2D8FE2")+
geom_point(color = "#2D8FE2" )+
theme(axis.text=element_text(size=16), axis.title=element_text(size=12,face="bold")) +
labs(title= "AUC Validation Scores") +
xlab("Number of Features") +
ylab("AUC") +
ylim(0, 0.8)
```