---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
# treezy
[Travis CI](https://travis-ci.org/njtierney/treezy) | [AppVeyor](https://ci.appveyor.com/project/njtierney/treezy) | [Codecov](https://codecov.io/github/njtierney/treezy?branch=master) | [Lifecycle: experimental](https://www.tidyverse.org/lifecycle/#experimental)
Makes handling output from decision trees easy. Treezy.
Decision trees are a commonly used tool in statistics and data science, but getting information out of them can be tricky, which makes other operations in a pipeline difficult.
`treezy` makes it easy to:
* Get variable importance information
* Visualise variable importance
* Visualise partial dependence
The data structures created in `treezy`, such as `importance_table`, are making their way over to the [`broomstick`](https://github.com/njtierney/broomstick) package, a member of the broom family that focuses specifically on decision trees, whose output differs from that of the (many!) [packages/analyses that broom deals with](https://github.com/tidyverse/broom#available-tidiers).
I am interested in feedback, so please feel free to [file an issue](https://github.com/njtierney/treezy/issues/new) if you have any problems!
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-"
)
```
# Installation
```{r eval = FALSE}
# install.packages("remotes")
remotes::install_github("njtierney/treezy")
```
# Example usage
## Explore variable importance with `importance_table` and `importance_plot`
### rpart
```{r rpart-run}
library(treezy)
library(rpart)
fit_rpart_kyp <- rpart(Kyphosis ~ ., data = kyphosis)
```
```{r rpart-example}
# default method: variable importance is stored on the fitted rpart object
fit_rpart_kyp$variable.importance
# with treezy
importance_table(fit_rpart_kyp)
importance_plot(fit_rpart_kyp)
# extend and modify
library(ggplot2)
importance_plot(fit_rpart_kyp) +
  theme_bw() +
  labs(title = "My Importance Scores",
       subtitle = "For a CART Model")
```
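To give a feel for what a tidy importance table looks like, here is a minimal sketch that builds one by hand from the named vector that `rpart` stores on the fitted object. It is not necessarily how `importance_table()` is implemented; the helper name `tidy_importance_sketch` and its column names are hypothetical.

```r
library(rpart)

# Hypothetical helper (not part of treezy): reshape rpart's named
# importance vector into a two-column data frame.
tidy_importance_sketch <- function(fit) {
  vi <- fit$variable.importance
  data.frame(
    variable = names(vi),
    importance = unname(vi),
    stringsAsFactors = FALSE
  )
}

fit <- rpart(Kyphosis ~ ., data = kyphosis)
imp <- tidy_importance_sketch(fit)
imp
```

Having the scores in a data frame, rather than a named vector, is what makes them easy to pass along a pipeline or hand to `ggplot2`.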
### randomForest
```{r randomForest}
library(randomForest)
set.seed(131)
fit_rf_ozone <- randomForest(Ozone ~ .,
                             data = airquality,
                             mtry = 3,
                             importance = TRUE,
                             na.action = na.omit)
fit_rf_ozone
## Show "importance" of variables: higher values mean more important
# randomForest has a better importance method than rpart
importance(fit_rf_ozone)
## use importance_table
importance_table(fit_rf_ozone)
# now plot it
importance_plot(fit_rf_ozone)
```
## Calculate residual sums of squares for rpart and randomForest
```{r rss}
# CART
rss(fit_rpart_kyp)
# randomForest
rss(fit_rf_ozone)
```
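For reference, the residual sum of squares of a regression tree can be computed by hand from the fitted values. The sketch below uses a separate `rpart` fit on `mtcars` (chosen here only because it has a numeric response and no missing values) and is not necessarily how `rss()` is implemented in `treezy`.

```r
library(rpart)

# Regression tree on a built-in data set with no missing values.
fit <- rpart(mpg ~ ., data = mtcars)

# Residual sum of squares: squared differences between the observed
# and fitted responses, summed over the training data.
fitted_mpg <- predict(fit)
rss_by_hand <- sum((mtcars$mpg - fitted_mpg)^2)
rss_by_hand
```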
## Plot partial effects
### Using gbm.step from the dismo package
```{r plot-partial-effects, echo = TRUE, message = FALSE, warning = FALSE, results = FALSE}
# using gbm.step from the dismo package
library(gbm)
library(dismo)
# load data
data(Anguilla_train)
anguilla_train <- Anguilla_train[1:200,]
# fit model
angaus_tc_5_lr_01 <- gbm.step(data = anguilla_train,
                              gbm.x = 3:14,
                              gbm.y = 2,
                              family = "bernoulli",
                              tree.complexity = 5,
                              learning.rate = 0.01,
                              bag.fraction = 0.5)
```
```{r gg-partial-plot}
gg_partial_plot(angaus_tc_5_lr_01,
                var = c("SegSumT",
                        "SegTSeas"))
```
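As background, a partial dependence curve shows the average model prediction as one variable is varied while the others are left as observed. The sketch below computes one by brute force for an `rpart` regression tree on `mtcars`. It is only an illustration of the idea, not necessarily how `gg_partial_plot()` or `gbm` compute their values, and the helper name `partial_dependence_sketch` is hypothetical.

```r
library(rpart)

fit <- rpart(mpg ~ ., data = mtcars)

# Hypothetical helper (not part of treezy): for each grid value, set
# `var` to that value in every row of `data`, predict, and average.
partial_dependence_sketch <- function(fit, data, var, n_grid = 20) {
  grid <- seq(min(data[[var]]), max(data[[var]]), length.out = n_grid)
  avg_pred <- vapply(grid, function(value) {
    data_mod <- data
    data_mod[[var]] <- value
    mean(predict(fit, newdata = data_mod))
  }, numeric(1))
  data.frame(value = grid, prediction = avg_pred)
}

pd_wt <- partial_dependence_sketch(fit, mtcars, "wt")
head(pd_wt)
```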
# Known issues
- The functions **have not been made compatible with Gradient Boosted Machines**, but this is on the cards. The package was initially written for some old code that used `gbm.step`.
- The partial dependence plots have not been tested, and were initially intended for use with `gbm.step`, as in the [Elith et al. paper](https://cran.r-project.org/web/packages/dismo/vignettes/brt.pdf).
# Future work
- Extend to other kinds of decision trees (`gbm`, `tree`, `ranger`, `xgboost`, and more).
- Provide tools for extracting other decision tree information (decision tree rules, surrogate splits, burling).
- Provide a method to extract decision trees from randomForest and BRT so that they can be visualised with `rpart.plot`.
- Provide tidy summary information of the decision trees, potentially in the format of `broom`'s `augment`, `tidy`, and `glance` functions. For example, `rpart_fit$splits`.
- Think about a way to store the data structure of a decision tree as a nested dataframe.
- Provide functions to allow for plotting of a prediction grid over two variables.
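The prediction grid idea in the last item can be sketched with base R and `rpart` alone. This is an illustration of the general approach rather than a planned `treezy` interface; the column name `prob_present` and the 50 x 50 grid size are arbitrary choices made for the example.

```r
library(rpart)

# Classification tree on two predictors so the grid is two-dimensional.
fit <- rpart(Kyphosis ~ Age + Start, data = kyphosis)

# Every combination of 50 evenly spaced values of each predictor.
grid <- expand.grid(
  Age = seq(min(kyphosis$Age), max(kyphosis$Age), length.out = 50),
  Start = seq(min(kyphosis$Start), max(kyphosis$Start), length.out = 50)
)

# Predicted probability of the "present" class at each grid point.
grid$prob_present <- predict(fit, newdata = grid, type = "prob")[, "present"]
head(grid)
```

The resulting data frame has one row per grid point, so it could be passed to something like `ggplot2::geom_tile()` with `prob_present` mapped to fill.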
# Acknowledgements
Credit for the name, "treezy", goes to @MilesMcBain. Thanks, Miles!