diff --git a/12-spatial-cv.Rmd b/12-spatial-cv.Rmd
index e17bf8d4a..98f0374b5 100644
--- a/12-spatial-cv.Rmd
+++ b/12-spatial-cv.Rmd
@@ -10,11 +10,11 @@ knitr::opts_chunk$set(cache = FALSE)
 ## Prerequisites {-}
-This chapter assumes proficiency with geographic data analysis\index{geographic data analysis}, for example gained by studying the contents and working-through the exercises in Chapters \@ref(spatial-class) to \@ref(reproj-geo-data).
-A familiarity with generalized linear models (GLM)\index{GLM} and machine learning\index{machine learning} is highly recommended [for example from @zuur_mixed_2009;@james_introduction_2013].
+This chapter assumes proficiency with geographic data analysis\index{geographic data analysis}, for example gained by studying the contents and working through the exercises in Chapters \@ref(spatial-class) to \@ref(reproj-geo-data).
+A familiarity with Generalized Linear Models (GLM)\index{GLM} and machine learning\index{machine learning} is highly recommended [for example @zuur_mixed_2009 and @james_introduction_2013].
 The chapter uses the following packages:^[
-Packages **GGally**, **lgr**, **kernlab**, **mlr3measures**, **paradox**, **pROC**, **progressr** and **spDataLarge** must also be installed although these do not need to be attached.
+Packages **GGally**, **lgr**, **kernlab**, **mlr3measures**, **paradox**, **pROC**, **progressr** and **spDataLarge** must also be installed, although these do not need to be attached.
 ]
 ```{r 12-spatial-cv-1, message=FALSE}
@@ -27,7 +27,7 @@ library(mlr3) # unified interface to machine learning algorithms
 library(mlr3learners) # most important machine learning algorithms
 library(mlr3extralearners) # access to even more learning algorithms
 library(mlr3proba) # make probabilistic predictions, here only needed for mlr3extralearners::list_learners()
-library(mlr3spatiotempcv) # spatio-temporal resampling strategies
+library(mlr3spatiotempcv) # spatiotemporal resampling strategies
 library(mlr3tuning) # hyperparameter tuning
 library(mlr3viz) # plotting functions for mlr3 objects
 library(progressr) # report progress updates
@@ -55,13 +55,13 @@ Machine learning is conducive to tasks such as the prediction of future customer
 This chapter is based on a case study: modeling the occurrence of landslides.
 This application links to the applied nature of geocomputation, defined in Chapter \@ref(intro), and illustrates how machine learning\index{machine learning} borrows from the field of statistics\index{statistics} when the sole aim is prediction.
-Therefore, this chapter first introduces modeling and cross-validation\index{cross-validation} concepts with the help of a Generalized Linear Model \index{GLM} [@zuur_mixed_2009].
+Therefore, this chapter first introduces modeling and cross-validation\index{cross-validation} concepts with the help of a GLM \index{GLM} [@zuur_mixed_2009].
 Building on this, the chapter implements a more typical machine learning\index{machine learning} algorithm\index{algorithm}, namely a Support Vector Machine (SVM)\index{SVM}.
 The models' **predictive performance** will be assessed using spatial cross-validation (CV)\index{cross-validation!spatial CV}, which accounts for the fact that geographic data is special.
 CV\index{cross-validation} determines a model's ability to generalize to new data, by splitting a dataset (repeatedly) into training and test sets.
-It uses the training data to fit the model, and checks its performance when predicting against the test data.
-CV helps to detect overfitting\index{overfitting} since models that predict the training data too closely (noise) will tend to perform poorly on the test data.
+It uses the training data to fit the model and checks its performance when predicting against the test data.
+CV helps to detect overfitting\index{overfitting}, since models that predict the training data too closely (noise) will tend to perform poorly on the test data.
 Randomly splitting spatial data can lead to training points that are neighbors in space with test points.
 Due to spatial autocorrelation\index{autocorrelation!spatial}, test and training datasets would not be independent in this scenario, with the consequence that CV\index{cross-validation} fails to detect a possible overfitting\index{overfitting}.
@@ -131,13 +131,13 @@ Since terrain attributes are frequently associated with landsliding [@muenchow_g
 - `cplan`: plan curvature (rad m^−1^) expressing the convergence or divergence of a slope and thus water flow
 - `cprof`: profile curvature (rad m^-1^) as a measure of flow acceleration, also known as downslope change in slope angle
 - `elev`: elevation (m a.s.l.) as the representation of different altitudinal zones of vegetation and precipitation in the study area
-- `log10_carea`: the decadic logarithm of the catchment area (log10 m^2^) representing the amount of water flowing towards a location
+- `log10_carea`: the decadic logarithm of the catchment area (log10 m^2^) representing the amount of water flowing toward a location
 It might be a worthwhile exercise to compute the terrain attributes with the help of R-GIS bridges (see Chapter \@ref(gis)) and extract them to the landslide points (see Exercise section at the end of this chapter).
 ## Conventional modeling approach in R {#conventional-model}
-Before introducing the **mlr3**\index{mlr3 (package)} package, an umbrella-package providing a unified interface to dozens of learning algorithms (Section \@ref(spatial-cv-with-mlr3)), it is worth taking a look at the conventional modeling interface in R\index{R}.
+Before introducing the **mlr3**\index{mlr3 (package)} package, an umbrella package providing a unified interface to dozens of learning algorithms (Section \@ref(spatial-cv-with-mlr3)), it is worth taking a look at the conventional modeling interface in R\index{R}.
 This introduction to supervised statistical learning\index{statistical learning} provides the basis for doing spatial CV\index{cross-validation!spatial CV}, and contributes to a better grasp on the **mlr3**\index{mlr3 (package)} approach presented subsequently.
 Supervised learning involves predicting a response variable as a function of predictors (Section \@ref(intro-cv)).
@@ -209,7 +209,7 @@ pred = terra::predict(ta, model = fit, type = "response")
 knitr::include_graphics("images/lsl-susc-1.png")
 ```
-Here, when making predictions we neglect spatial autocorrelation\index{autocorrelation!spatial} since we assume that on average the predictive accuracy remains the same with or without spatial autocorrelation structures.
+Here, when making predictions, we neglect spatial autocorrelation\index{autocorrelation!spatial} since we assume that on average the predictive accuracy remains the same with or without spatial autocorrelation structures.
 However, it is possible to include spatial autocorrelation\index{autocorrelation!spatial} structures into models as well as into predictions.
 Though, this is beyond the scope of this book, we give the interested reader some pointers where to look it up:
@@ -255,7 +255,7 @@ As we will see in Chapter \@ref(transport), the 'first law' of geography states
 This means these points are not statistically independent because training and test points in conventional CV\index{cross-validation} are often too close to each other (see first row of Figure \@ref(fig:partitioning)).
 'Training' observations near the 'test' observations can provide a kind of 'sneak preview': information that should be unavailable to the training dataset.
-To alleviate this problem 'spatial partitioning' is used to split the observations into spatially disjointed subsets (using the observations' coordinates in a *k*-means clustering\index{clustering!kmeans}; @brenning_spatial_2012; second row of Figure \@ref(fig:partitioning)).
+To alleviate this problem, 'spatial partitioning' is used to split the observations into spatially disjoint subsets (using the observations' coordinates in a *k*-means clustering\index{clustering!kmeans}; @brenning_spatial_2012; second row of Figure \@ref(fig:partitioning)).
 This partitioning strategy is the **only** difference between spatial and conventional CV.
 As a result, spatial CV leads to a bias-reduced assessment of a model's predictive performance, and hence helps to avoid overfitting\index{overfitting}.
@@ -274,7 +274,7 @@ It acts as a 'meta-package', providing a unified interface to popular supervised
 The standardized **mlr3** interface is based on eight 'building blocks'.
 As illustrated in Figure \@ref(fig:building-blocks), these have a clear order.
-(ref:building-blocks) Basic building blocks of the mlr3 package. Source: @bischl_applied_2024. (Permission to reuse this figure was kindly granted.)
+(ref:building-blocks) Basic building blocks of the mlr3 package [@bischl_applied_2024]. Permission to reuse this figure was kindly granted.
 ```{r building-blocks, echo=FALSE, fig.height=4, fig.width=4, fig.cap="(ref:building-blocks)", fig.scap="Basic building blocks of the mlr3 package."}
 knitr::include_graphics("images/12_ml_abstraction_crop.png")
 ```
@@ -288,7 +288,7 @@ Third, the **resampling** approach assesses the predictive performance of the mo
 ### Generalized linear model {#glm}
 To use a GLM\index{GLM} in **mlr3**\index{mlr3 (package)}, we must create a **task** containing the landslide data.
-Since the response is binary (two-category variable) and has a spatial dimension, we create a classification\index{classification} task with `as_task_classif_st()` of the **mlr3spatiotempcv** package [@schratz_mlr3spatiotempcv_2021, for non-spatial tasks, use `mlr3::as_task_classif()` or `mlr3::as_task_regr()` for regression\index{regression} tasks, see `?Task` for other task types].^[The **mlr3** ecosystem makes heavily use of **data.table** and **R6** classes. And though you might use **mlr3** without knowing the specifics of **data.table** or **R6**, it might be rather helpful. To learn more about **data.table**, please refer to https://rdatatable.gitlab.io/data.table/. To learn more about **R6**, we recommend [Chapter 14](https://adv-r.hadley.nz/fp.html) of the Advanced R book [@wickham_advanced_2019].]
+Since the response is binary (two-category variable) and has a spatial dimension, we create a classification\index{classification} task with `as_task_classif_st()` of the **mlr3spatiotempcv** package [@schratz_mlr3spatiotempcv_2021, for non-spatial tasks, use `mlr3::as_task_classif()` or `mlr3::as_task_regr()` for regression\index{regression} tasks, see `?Task` for other task types].^[The **mlr3** ecosystem makes heavy use of **data.table** and **R6** classes. Although you can use **mlr3** without knowing the specifics of **data.table** or **R6**, learning them can be rather helpful. To learn more about **data.table**, please refer to https://rdatatable.gitlab.io/data.table/. To learn more about **R6**, we recommend [Chapter 14](https://adv-r.hadley.nz/fp.html) of the *Advanced R* book [@wickham_advanced_2019].]
 The first essential argument of these `as_task_` functions is `x`.
 `x` expects that the input data includes the response and predictor variables.
 The `target` argument indicates the name of a response variable (in our case this is `lslpts`) and `positive` determines which of the two factor levels of the response variable indicate the landslide initiation point (in our case this is `TRUE`).
@@ -399,7 +399,7 @@ We will use a 100-repeated 5-fold spatial CV\index{cross-validation!spatial CV}:
 [^13]: Note that package **sperrorest** initially implemented spatial cross-validation in R [@brenning_spatial_2012].
- In the meantime, its functionality was integrated into the **mlr3** ecosystem which is the reason why we are using **mlr3** [@schratz_hyperparameter_2019]. The **tidymodels** framework is another umbrella-package for streamlined modeling in R; however, it only recently integrated support for spatial cross validation via **spatialsample** which so far only supports one spatial resampling method.
+ In the meantime, its functionality was integrated into the **mlr3** ecosystem, which is the reason why we are using **mlr3** [@schratz_hyperparameter_2019]. The **tidymodels** framework is another umbrella package for streamlined modeling in R; however, it only recently integrated support for spatial cross-validation via **spatialsample**, which so far only supports one spatial resampling method.
 ```{r 12-spatial-cv-18, eval=TRUE}
@@ -409,7 +409,7 @@ resampling = mlr3::rsmp("repeated_spcv_coords", folds = 5, repeats = 100)
 To execute the spatial resampling, we run `resample()` using the previously specified task, learner, and resampling strategy.
 This takes some time (around 15 seconds on a modern laptop) because it computes 500 resampling partitions and 500 models.
-As performance measure, we again choose the AUROC.
+Again, we choose the AUROC as the performance measure.
 To retrieve it, we use the `score()` method of the resampling result output object (`score_spcv_glm`).
 This returns a `data.table` object with 500 rows -- one for each model.
@@ -471,14 +471,14 @@ To recap, we adhere to the following definition of machine learning by [Jason Br
 In applied machine learning we will borrow, reuse and steal algorithms from many different fields, including statistics and use them towards these ends.
 In Section \@ref(glm) a GLM was used to predict landslide susceptibility.
-This section introduces support vector machines (SVM)\index{SVM} for the same purpose.
+This section introduces support vector machines (SVMs)\index{SVM} for the same purpose.
 Random forest\index{random forest} models might be more popular than SVMs; however, the positive effect of tuning hyperparameters\index{hyperparameter} on model performance is much more pronounced in the case of SVMs [@probst_hyperparameters_2018].
 Since (spatial) hyperparameter tuning is the major aim of this section, we will use an SVM.
 For those wishing to apply a random forest model, we recommend to read this chapter, and then proceed to Chapter \@ref(eco) in which we will apply the currently covered concepts and techniques to make spatial distribution maps based on a random forest model.
 SVMs\index{SVM} search for the best possible 'hyperplanes' to separate classes (in a classification\index{classification} case) and estimate 'kernels' with specific hyperparameters\index{hyperparameter} to create non-linear boundaries between classes [@james_introduction_2013].
 Machine learning algorithms often feature hyperparameters\index{hyperparameter} and parameters.
-Parameters can be estimated from the data while hyperparameters\index{hyperparameter} are set before the learning begins (see also the [machine mastery blog](https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/) and the [hyperparameter optimization chapter](https://mlr3book.mlr-org.com/chapters/chapter4/hyperparameter_optimization.html) of the mlr3 book).
+Parameters can be estimated from the data, while hyperparameters\index{hyperparameter} are set before the learning begins (see also the [Machine Learning Mastery blog](https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/) and the [hyperparameter optimization chapter](https://mlr3book.mlr-org.com/chapters/chapter4/hyperparameter_optimization.html) of the mlr3 book).
 The optimal hyperparameter\index{hyperparameter} configuration is usually found within a specific search space and determined with the help of cross-validation methods.
 This is called hyperparameter\index{hyperparameter} tuning and the main topic of this section.
@@ -486,7 +486,7 @@ Some SVM implementations such as that provided by **kernlab** allow hyperparamet
 This works for non-spatial data but is of less use for spatial data where 'spatial tuning' should be undertaken.
 Before defining spatial tuning, we will set up the **mlr3**\index{mlr3 (package)} building blocks, introduced in Section \@ref(glm), for the SVM.
-The classification\index{classification} task remains the same, hence we can simply reuse the `task` object created in Section \@ref(glm).
+The classification\index{classification} task remains the same; hence, we can simply reuse the `task` object created in Section \@ref(glm).
 Learners implementing SVM can be found using the `list_mlr3learners()` command of the **mlr3extralearners**.
 ```{r 12-spatial-cv-23, eval=TRUE, echo=TRUE}
@@ -568,7 +568,7 @@ To make the performance estimation processing chain even clearer, let us write d
 1. Performance level (upper left part of Figure \@ref(fig:inner-outer)) - split the dataset into five spatially disjoint (outer) subfolds
 1. Tuning level (lower left part of Figure \@ref(fig:inner-outer)) - use the first fold of the performance level and split it again spatially into five (inner) subfolds for the hyperparameter tuning. Use the 50 randomly selected hyperparameters\index{hyperparameter} in each of these inner subfolds, i.e., fit 250 models
-1. Performance estimation - Use the best hyperparameter combination from the previous step (tuning level) and apply it to the first outer fold in the performance level to estimate the performance (AUROC\index{AUROC})
+1. Performance estimation: use the best hyperparameter combination from the previous step (tuning level) and apply it to the first outer fold in the performance level to estimate the performance (AUROC\index{AUROC})
 1. Repeat steps 2 and 3 for the remaining four outer folds
 1. Repeat steps 2 to 4, 100 times
@@ -658,7 +658,7 @@ Machine learning algorithms often require hyperparameter\index{hyperparameter} i
 Machine learning overall, and its use to understand spatial data, is a large field and this chapter has provided the basics, but there is more to learn.
 We recommend the following resources in this direction:
-- The **mlr3 book** (@bischl_applied_2024; https://mlr3book.mlr-org.com/) and especially the [chapter on the handling of spatiotemporal data](https://mlr3book.mlr-org.com/chapters/chapter13/beyond_regression_and_classification.html#sec-spatiotemporal)
+- The **mlr3 book** (@bischl_applied_2024; https://mlr3book.mlr-org.com/) and especially the [chapter on the handling of spatiotemporal data](https://mlr3book.mlr-org.com/chapters/chapter13/beyond_regression_and_classification.html#spatiotemp-cv)
 - An academic paper on hyperparameter\index{hyperparameter} tuning [@schratz_hyperparameter_2019]
 - An academic paper on how to use **mlr3spatiotempcv** [@schratz_mlr3spatiotempcv_2021]
 - In case of spatiotemporal data, one should account for spatial\index{autocorrelation!spatial} and temporal\index{autocorrelation!temporal} autocorrelation when doing CV\index{cross-validation} [@meyer_improving_2018]