diff --git a/data/AZtrees.RData b/data/AZtrees.RData index a15cf1a..f8d371e 100644 Binary files a/data/AZtrees.RData and b/data/AZtrees.RData differ diff --git a/vignettes/AZtrees.Rmd b/vignettes/AZtrees.Rmd new file mode 100644 index 0000000..45447f7 --- /dev/null +++ b/vignettes/AZtrees.Rmd @@ -0,0 +1,282 @@ +--- +title: "Arizona trees data" +author: "Toby Dylan Hocking" +vignette: > + %\VignetteIndexEntry{Arizona trees data} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + + + +```{r setup, include = FALSE} +knitr::opts_chunk$set( + fig.width=6, + fig.height=6) +data.table::setDTthreads(1) +## output: rmarkdown::html_vignette above creates html where figures are limited to 700px wide. +## Above CSS from https://stackoverflow.com/questions/34906002/increase-width-of-entire-html-rmarkdown-output main-container is for html_document, body is for html_vignette +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + +The goal of this vignette is explain the differences between various +column roles: + +* `group` is used to designate observations which should stay together + when splitting. In other words, two rows in the same `group` should + never appear in different sets. +* `subset` designates a column whose values are each treated as a test + set (the train data come from Same/Other/All subsets). + +## What is a group? + +Below we load the data set. + +```{r} +data(AZtrees,package="mlr3resampling") +library(data.table) +AZdt <- data.table(AZtrees) +AZdt[1] +``` + +Above we see one row of data. +Below we see a scatterplot of the data: + +* Every row is a labeled pixel. +* Every dot is plotted at the xcoord/ycoord (lat/long) position on a + map around Flagstaff, AZ. + +```{r} +library(ggplot2) +x.center <- -111.72 +y.center <- 35.272 +rect.size <- 0.01/2 +x.min.max <- x.center+c(-1, 1)*rect.size +y.min.max <- y.center+c(-1, 1)*rect.size +rect.dt <- data.table( + xmin=x.min.max[1], xmax=x.min.max[2], + ymin=y.min.max[1], ymax=y.min.max[2]) +tree.fill.scale <- scale_fill_manual( + values=c(Tree="black", "Not tree"="white")) +ggplot()+ + theme_bw()+ + tree.fill.scale+ + geom_rect(aes( + xmin=xmin, xmax=xmax, ymin=ymin,ymax=ymax), + data=rect.dt, + fill="red", + linewidth=3, + color="red")+ + geom_point(aes( + xcoord, ycoord, fill=y), + shape=21, + data=AZdt)+ + coord_equal() +``` + +Note the red square in the plot above. Below we zoom into that square. + +```{r} +ggplot()+ + theme_bw()+ + tree.fill.scale+ + geom_point(aes( + xcoord, ycoord, fill=y), + shape=21, + data=AZdt)+ + coord_equal()+ + scale_x_continuous( + limits=x.min.max)+ + scale_y_continuous( + limits=y.min.max)+ + directlabels::geom_dl(aes( + xcoord, ycoord, label=polygon), + data=AZdt, + method="smart.grid") +``` + +In the plot above, we see that there are several groups of points, +each with a black number. Each group of points comes from a single +polygon (label drawn in GIS software), and the black number is the +polygon ID number. So each polygon represents one label, either tree +or not, and there are one or more points/pixels with that label inside +each polygon. + +A polygon is an example of a group. Each polygon results in one or +more rows of training data (pixels), but since pixels in a given group +were all labeled together, we would like to keep them together when +splitting the data. + +## What is a subset? + +Below we plot the same data, but this time colored by region. + +```{r} +##dput(RColorBrewer::brewer.pal(3,"Dark2")) +region.colors <- c(NW="#1B9E77", NE="#D95F02", S="#7570B3") +ggplot()+ + theme_bw()+ + tree.fill.scale+ + scale_color_manual( + values=region.colors)+ + geom_point(aes( + xcoord, ycoord, color=region3, fill=y), + shape=21, + data=AZdt)+ + coord_equal() +``` + +We can see in the plot above that there are three values in the +`region3` column: NE, NW, and S (different geographical regions on the +map which are well-separated). We would like to know if it is possible +to train on one region, and then accurately predict on another region. + +## Cross-validation + +First we create a task: + +```{r} +ctask <- mlr3::TaskClassif$new( + "AZtrees", AZdt, target="y") +ctask$col_roles$subset <- "region3" +ctask$col_roles$group <- "polygon" +ctask$col_roles$stratum <- "y" +ctask$col_roles$feature <- grep("SAMPLE",names(AZdt),value=TRUE) +str(ctask$col_roles) +``` + +Then we can instantiate the CV to see how it works (but usually you do +not need to instantiate, if you are using `benchmark` it does it for +you). + +```{r} +same.other.cv <- mlr3resampling::ResamplingSameOtherSizesCV$new() +same.other.cv$param_set$values$folds <- 3 +same.other.cv$instantiate(ctask) +same.other.cv$instance$iteration.dt[, .( + train.subsets, test.fold, test.subset, n.train.groups, + train.rows=sapply(train, length))] +``` + +The table above has one row per train/test split for which +error/accuracy metrics will be computed. The `n.train.groups` column +is the number of polygons which are used in the train set, which is +defined as the intersection of the train subsets and the train folds. +To double check, below we compute the total number of groups/polygons per +subset/region, and the expected number of train groups/polygons. + +```{r} +AZdt[, .( + polygons=length(unique(polygon)) +), by=region3][ +, train.polygons := polygons*with(same.other.cv$param_set$values, (folds-1)/folds) +][] +``` + +It is clear that the counts in the `train.polygons` column above match +the numbers in the previous table column `n.train.groups`. To +determine the number of rows of train data, we can look at the +`train.rows` column in the previous table. + +## Benchmark and test error computation + +Below we define the benchmark experiment. + +```{r} +same.other.cv <- mlr3resampling::ResamplingSameOtherSizesCV$new() +(learner.list <- list( + mlr3::LearnerClassifFeatureless$new())) +if(requireNamespace("rpart")){ + learner.list$rpart <- mlr3::LearnerClassifRpart$new() +} +for(learner.i in seq_along(learner.list)){ + learner.list[[learner.i]]$predict_type <- "prob" +} +(bench.grid <- mlr3::benchmark_grid(ctask, learner.list, same.other.cv)) +``` + +Above we see one row per combination of task, learner, and resampling. +Below we compute the benchmark result and test accuracy. + +```{r} +bench.result <- mlr3::benchmark(bench.grid) +measure.list <- mlr3::msrs(c("classif.acc","classif.auc")) +score.dt <- mlr3resampling::score(bench.result, measure.list) +score.dt[1] +``` + +Above we see one row of the result, for one train/test split. +Below we plot the accuracy results. + +```{r} +score.long <- melt( + score.dt, + measure.vars=measure(variable, pattern="classif.(acc|auc)")) +ggplot()+ + geom_point(aes( + value, train.subsets, color=algorithm), + data=score.long)+ + facet_grid(test.subset ~ variable, labeller=label_both, scales="free") +``` + +Above we show one dot per train/test split, and below we take the +mean/SD over folds. + +```{r} +score.wide <- dcast( + score.long, + algorithm + test.subset + train.subsets + variable ~ ., + list(mean, sd), + value.var="value") +ggplot()+ + geom_point(aes( + value_mean, train.subsets, color=algorithm), + size=3, + fill="white", + shape=21, + data=score.wide)+ + geom_segment(aes( + value_mean+value_sd, train.subsets, + color=algorithm, + linewidth=algorithm, + xend=value_mean-value_sd, yend=train.subsets), + data=score.wide)+ + scale_linewidth_manual(values=c(featureless=2, rpart=1))+ + facet_grid(test.subset ~ variable, labeller=label_both, scales="free")+ + scale_x_continuous( + "Mean +/- SD of test accuracy/AUC over folds/splits") +``` + +The plot above shows an interesting pattern: + +* For test subsets NE and NW, training on other subsets is less + accurate than training on the same subset. Training on All subsets + is no more accurate than training on the same subset. These results + suggest that learnable patterns in other subsets are too different + to be beneficial for predicting on these subsets. +* For test subset S, training on other subsets is slightly more + accurate than training on the same subset, and training on all + subsets is slightly more accurate still. These results suggest that + the learnable pattern is similar enough in the other subsets so as + to be beneficial for prediction in subset S. + +## Conclusion + +Column roles `group`, `stratum`, and `subset` may be used together, in +the same task, in order to perform a cross-validation experiment which +captures the structure in the data. + +## Session info + +```{r} +sessionInfo() +``` +