README.Rmd

---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->

# treezy
[![Travis-CI Build Status](https://travis-ci.org/njtierney/treezy.svg?branch=master)](https://travis-ci.org/njtierney/treezy)[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/njtierney/treezy?branch=master&svg=true)](https://ci.appveyor.com/project/njtierney/treezy)[![Coverage Status](https://img.shields.io/codecov/c/github/njtierney/treezy/master.svg)](https://codecov.io/github/njtierney/treezy?branch=master)[![lifecycle](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)

Makes handling output from decision trees easy. Treezy.

Decision trees are a commonly used tool in statistics and data science, but sometimes getting the information out of them can be a bit tricky, and can make other operations in a pipeline difficult.  

`treezy` makes it easy to:

* Get varaible importance information
* Visualise variable importance
* Visualise partial dependence

The data structures created in `treezy` -  `importance_table` are making their way over to the [`broomstick`](www.github.com/njtierney/broomstick) package - a member of the broom family specifically focussing on decision trees, which gives different output to many of the (many!) [packages/analyses that broom deals with](https://github.com/tidyverse/broom#available-tidiers).
I am interested in feedback, so please feel free to [file an issue](github.com/njtierney/treezy/issues/new) if you have any problems!


```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-"
)
```

# Installation

```{r eval = FALSE}

# install.packages("remotes")
remotes::install_github("njtierney/treezy")

```

# Example usage

## Explore variable importance with `importance_table` and `importance_plot`

### rpart

```{r rpart-run}

library(treezy)
library(rpart)

fit_rpart_kyp <- rpart(Kyphosis ~ ., data = kyphosis)

```

```{r rpart-example}

# default method for looking at importance

# variable importance
fit_rpart_kyp$variable.importance

# with treezy

importance_table(fit_rpart_kyp)

importance_plot(fit_rpart_kyp)

# extend and modify
library(ggplot2)
importance_plot(fit_rpart_kyp) + 
    theme_bw() + 
    labs(title = "My Importance Scores",
         subtitle = "For a CART Model")

```


### randomForest

```{r randomForest}
library(randomForest)
set.seed(131)
fit_rf_ozone <- randomForest(Ozone ~ ., 
                             data = airquality, 
                             mtry=3,
                             importance=TRUE, 
                             na.action=na.omit)
  
fit_rf_ozone

## Show "importance" of variables: higher value mean more important:

# randomForest has a better importance method than rpart
importance(fit_rf_ozone)

## use importance_table
importance_table(fit_rf_ozone)

# now plot it
importance_plot(fit_rf_ozone)

```

## Calculate residual sums of squares for rpart and randomForest

```{r rss}

# CART
rss(fit_rpart_kyp)

# randomForest
rss(fit_rf_ozone)

```


## plot partial effects

## Using gbm.step from dismo package

```{r plot-partial-effects, echo = TRUE, message = FALSE, warning = FALSE, results = FALSE}
# using gbm.step from the dismo package
library(gbm)
library(dismo)
# load data
data(Anguilla_train)

anguilla_train <- Anguilla_train[1:200,]

# fit model
angaus_tc_5_lr_01 <- gbm.step(data = anguilla_train,
                              gbm.x = 3:14,
                              gbm.y = 2,
                              family = "bernoulli",
                              tree.complexity = 5,
                              learning.rate = 0.01,
                              bag.fraction = 0.5)

```

```{r gg-partial-plot}

gg_partial_plot(angaus_tc_5_lr_01,
                var = c("SegSumT",
                        "SegTSeas"))
```

# Known issues

- The functions **have not been made compatible with Gradient Boosted Machines**, but this is on the cards. This was initially written for some old code which used gbm.step
- The partial dependence plots have not been tested, and were initially intended for use with gbm.step, as in the [elith et al. paper](https://cran.r-project.org/web/packages/dismo/vignettes/brt.pdf)

# Future work

- Extend to other kinds of decision trees (`gbm`, `tree`, `ranger`, `xgboost`, and more)
- Provide tools for extracting out other decision tree information (decision tree rules, surrogate splits, burling).
- Provide a method to extract out decision trees from randomForest and BRT so that they can be visualised with rpart.plot, 
- Provide tidy summary information of the decision trees, potentially in the format of `broom`'s `augment`, `tidy`, and `glance` functions. For example, `rpart_fit$splits`
- Think about a way to store the data structure of a decision tree as a nested dataframe
- Functions to allow for plotting of a prediction grid over two variables

# Acknowledgements

Credit for the name, "treezy", goes to @MilesMcBain, thanks Miles!