AZtrees vignette
tdhock committed May 2, 2024
1 parent 4e27ae3 commit 479c6be
Showing 2 changed files with 282 additions and 0 deletions.
Binary file modified data/AZtrees.RData
282 changes: 282 additions & 0 deletions vignettes/AZtrees.Rmd
---
title: "Arizona trees data"
author: "Toby Dylan Hocking"
vignette: >
%\VignetteIndexEntry{Arizona trees data}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

<style type="text/css">
.main-container {
max-width: 1200px !important;
margin: auto;
}
</style>

```{r setup, include = FALSE}
knitr::opts_chunk$set(
fig.width=6,
fig.height=6)
data.table::setDTthreads(1)
## The rmarkdown::html_vignette output format creates html where figures are limited to 700px wide.
## The CSS above is from https://stackoverflow.com/questions/34906002/increase-width-of-entire-html-rmarkdown-output
## (.main-container is for html_document, body is for html_vignette).
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```

The goal of this vignette is to explain the differences between
various column roles:

* `group` designates observations which should stay together when
  splitting. In other words, two rows in the same `group` should never
  appear in different sets (see the toy sketch after this list).
* `subset` designates a column whose values are each treated as a test
  set (the train data come from Same/Other/All subsets).
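
To make the `group` constraint concrete, below is a toy sketch (not
part of the AZtrees analysis; it uses a made-up table) showing one way
to assign folds to whole groups, so that rows which share a `group`
value always land in the same fold:

```{r}
## Toy example: assign each group to a fold (here 3 groups, 3 folds),
## then join the fold back to the rows, so rows in the same group
## always share a fold.
library(data.table)
toy.dt <- data.table(row=1:6, group=c(1,1,2,2,3,3))
set.seed(1)
group.folds <- data.table(group=unique(toy.dt$group), fold=sample(1:3))
toy.dt[group.folds, fold := i.fold, on="group"]
toy.dt
```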

## What is a group?

Below we load the data set.

```{r}
data(AZtrees,package="mlr3resampling")
library(data.table)
AZdt <- data.table(AZtrees)
AZdt[1]
```

Above we see one row of data.
Below we see a scatterplot of the data:

* Every row is a labeled pixel.
* Every dot is plotted at its xcoord/ycoord (longitude/latitude)
  position on a map around Flagstaff, AZ.

```{r}
library(ggplot2)
x.center <- -111.72
y.center <- 35.272
rect.size <- 0.01/2
x.min.max <- x.center+c(-1, 1)*rect.size
y.min.max <- y.center+c(-1, 1)*rect.size
rect.dt <- data.table(
xmin=x.min.max[1], xmax=x.min.max[2],
ymin=y.min.max[1], ymax=y.min.max[2])
tree.fill.scale <- scale_fill_manual(
values=c(Tree="black", "Not tree"="white"))
ggplot()+
theme_bw()+
tree.fill.scale+
geom_rect(aes(
xmin=xmin, xmax=xmax, ymin=ymin,ymax=ymax),
data=rect.dt,
fill="red",
linewidth=3,
color="red")+
geom_point(aes(
xcoord, ycoord, fill=y),
shape=21,
data=AZdt)+
coord_equal()
```

Note the red square in the plot above. Below we zoom into that square.

```{r}
ggplot()+
theme_bw()+
tree.fill.scale+
geom_point(aes(
xcoord, ycoord, fill=y),
shape=21,
data=AZdt)+
coord_equal()+
scale_x_continuous(
limits=x.min.max)+
scale_y_continuous(
limits=y.min.max)+
directlabels::geom_dl(aes(
xcoord, ycoord, label=polygon),
data=AZdt,
method="smart.grid")
```

In the plot above, we see several groups of points, each with a black
number. Each group of points comes from a single polygon (a labeled
region drawn in GIS software), and the black number is the polygon ID.
So each polygon represents one label, either tree or not tree, and
there are one or more points/pixels with that label inside each
polygon.

A polygon is an example of a group. Each polygon results in one or
more rows of training data (pixels), but since pixels in a given group
were all labeled together, we would like to keep them together when
splitting the data.
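
As a quick check of this structure, the sketch below counts the pixels
in each polygon and verifies that each polygon carries a single label
(if the description above holds, `n.labels` should always be 1):

```{r}
## Pixel count and number of distinct labels per polygon.
poly.dt <- AZdt[, .(
  pixels=.N,
  n.labels=length(unique(y))
), by=polygon]
poly.dt[, table(n.labels)]
quantile(poly.dt$pixels)
```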

## What is a subset?

Below we plot the same data, but this time colored by region.

```{r}
##dput(RColorBrewer::brewer.pal(3,"Dark2"))
region.colors <- c(NW="#1B9E77", NE="#D95F02", S="#7570B3")
ggplot()+
theme_bw()+
tree.fill.scale+
scale_color_manual(
values=region.colors)+
geom_point(aes(
xcoord, ycoord, color=region3, fill=y),
shape=21,
data=AZdt)+
coord_equal()
```

We can see in the plot above that there are three values in the
`region3` column: NE, NW, and S (well-separated geographical regions
on the map). We would like to know if it is possible to train on one
region, and then accurately predict on another region.
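
Before setting up cross-validation, it may help to summarize how many
labeled pixels and polygons fall in each region (a small sketch
re-using the `AZdt` columns shown above):

```{r}
## Number of labeled pixels, number of polygons, and proportion of
## Tree labels, for each region.
AZdt[, .(
  pixels=.N,
  polygons=length(unique(polygon)),
  prop.tree=mean(y=="Tree")
), by=region3]
```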

## Cross-validation

First we create a task:

```{r}
ctask <- mlr3::TaskClassif$new(
"AZtrees", AZdt, target="y")
ctask$col_roles$subset <- "region3"
ctask$col_roles$group <- "polygon"
ctask$col_roles$stratum <- "y"
ctask$col_roles$feature <- grep("SAMPLE",names(AZdt),value=TRUE)
str(ctask$col_roles)
```

Then we can instantiate the CV to see how it works (usually you do not
need to instantiate it yourself; if you are using `benchmark`, it does
that for you).

```{r}
same.other.cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
same.other.cv$param_set$values$folds <- 3
same.other.cv$instantiate(ctask)
same.other.cv$instance$iteration.dt[, .(
train.subsets, test.fold, test.subset, n.train.groups,
train.rows=sapply(train, length))]
```

The table above has one row per train/test split for which
error/accuracy metrics will be computed. The `n.train.groups` column
gives the number of polygons used in the train set, which is defined
as the intersection of the train subsets and the train folds. To
double-check, below we compute the total number of groups/polygons per
subset/region, and the expected number of train groups/polygons.

```{r}
AZdt[, .(
polygons=length(unique(polygon))
), by=region3][
, train.polygons := polygons*with(same.other.cv$param_set$values, (folds-1)/folds)
][]
```

The counts in the `train.polygons` column above match the
`n.train.groups` column of the previous table. To determine the number
of rows of train data, we can look at the `train.rows` column in the
previous table.
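
To further verify that the `group` and `subset` roles were respected,
the sketch below uses the standard `train_set`/`test_set` methods of
the instantiated resampling (assuming task row ids correspond to rows
of `AZdt`, which is the default when a task is created from a data
table): no polygon should appear in both sets of a split, and all test
rows should come from a single region.

```{r}
## Check the first train/test split: no polygon is shared between
## train and test, and the test rows all come from one region (the
## test subset).
train.ids <- same.other.cv$train_set(1)
test.ids <- same.other.cv$test_set(1)
intersect(AZdt$polygon[train.ids], AZdt$polygon[test.ids])
table(AZdt$region3[test.ids])
```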

## Benchmark and test error computation

Below we define the benchmark experiment.

```{r}
same.other.cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
(learner.list <- list(
mlr3::LearnerClassifFeatureless$new()))
if(requireNamespace("rpart")){
learner.list$rpart <- mlr3::LearnerClassifRpart$new()
}
for(learner.i in seq_along(learner.list)){
learner.list[[learner.i]]$predict_type <- "prob"
}
(bench.grid <- mlr3::benchmark_grid(ctask, learner.list, same.other.cv))
```

Above we see one row per combination of task, learner, and resampling.
Below we compute the benchmark result and test accuracy.

```{r}
bench.result <- mlr3::benchmark(bench.grid)
measure.list <- mlr3::msrs(c("classif.acc","classif.auc"))
score.dt <- mlr3resampling::score(bench.result, measure.list)
score.dt[1]
```

Above we see one row of the result, for one train/test split.
Below we plot the accuracy results.

```{r}
score.long <- melt(
score.dt,
measure.vars=measure(variable, pattern="classif.(acc|auc)"))
ggplot()+
geom_point(aes(
value, train.subsets, color=algorithm),
data=score.long)+
facet_grid(test.subset ~ variable, labeller=label_both, scales="free")
```

Above we show one dot per train/test split, and below we take the
mean/SD over folds.

```{r}
score.wide <- dcast(
score.long,
algorithm + test.subset + train.subsets + variable ~ .,
list(mean, sd),
value.var="value")
ggplot()+
geom_point(aes(
value_mean, train.subsets, color=algorithm),
size=3,
fill="white",
shape=21,
data=score.wide)+
geom_segment(aes(
value_mean+value_sd, train.subsets,
color=algorithm,
linewidth=algorithm,
xend=value_mean-value_sd, yend=train.subsets),
data=score.wide)+
scale_linewidth_manual(values=c(featureless=2, rpart=1))+
facet_grid(test.subset ~ variable, labeller=label_both, scales="free")+
scale_x_continuous(
"Mean +/- SD of test accuracy/AUC over folds/splits")
```

The plot above shows an interesting pattern:

* For test subsets NE and NW, training on other subsets is less
  accurate than training on the same subset, and training on all
  subsets is no more accurate than training on the same subset. These
  results suggest that the learnable patterns in the other subsets are
  too different to be beneficial for predicting on these subsets.
* For test subset S, training on other subsets is slightly more
  accurate than training on the same subset, and training on all
  subsets is slightly more accurate still. These results suggest that
  the learnable pattern is similar enough in the other subsets to be
  beneficial for prediction in subset S (see the numeric summary after
  this list).
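
To read those differences directly from the numbers, the sketch below
re-uses `score.long` from above to tabulate the mean over folds, with
one column per value of `train.subsets`:

```{r}
## Mean over folds of each metric, for each combination of test
## subset, metric (variable), and algorithm, with one column per
## train.subsets value.
dcast(
  score.long,
  test.subset + variable + algorithm ~ train.subsets,
  fun.aggregate=mean,
  value.var="value")
```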

## Conclusion

Column roles `group`, `stratum`, and `subset` may be used together, in
the same task, in order to perform a cross-validation experiment which
captures the structure in the data.

## Session info

```{r}
sessionInfo()
```
