# Model checking and evaluation {#svalidation}
```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE}
load("cache-CakeMap.RData")
map_pack <- c("rgdal", "maptools", "rgeos", "dplyr", "tidyr", "gridExtra", "knitr")
lapply(map_pack, library, character.only = T)
# Todo: add figure comparing various goodness-of-fit measures from JASSS paper
```
In food safety, openness about mistakes is
a vital ingredient for high standards.^[This seems to be
because
hiding or being ashamed of inevitable mistakes allows bad practice to
continue unnoticed (Powell et al. 2011).]
The same concept applies to modelling.
Transparency in model evaluation --- the process of deciding whether the
model is appropriate and identifying *how good* the results are ---
is vital in spatial microsimulation for
similar reasons. Openness of code and method, as demonstrated and advocated
throughout this book, is easy using command-line open source software such as R
[(Wickham 2014)](http://adv-r.had.co.nz/Reproducibility.html).
Reproducibility is especially important during model checking and
evaluation, allowing you and others not only to *believe* that the model
is working, but to *affirm* that the results are as expected.
This chapter is about specific methods to check and evaluate the outputs of
spatial microsimulation. The aims are simple: to ensure that the models
1) make sense given the input data (*model checking*) and
2) coincide with external reality (*model evaluation*).
These strategies are described below:
1. *Model checking* --- also called *internal validation*
(Edwards et al. 2010) ---
is comparing model results against a priori knowledge of how they *should*
be. This level of model checking usually takes place only at the aggregate
level of the constraint variables (\@ref(Internal)).
2. *Model evaluation* --- also known as *external validation* --- is the
process of comparing model results with external data. This approach to
verification relies on good 'ground-truth data' and can take place at either
the individual level (if geo-coded survey data are available) or (more
commonly) at the aggregate level (\@ref(External)).
The chapter develops these two possible validations and
explains the problem of the zero cells (\@ref(empty-cells)).
Internal validation is the most common form of model evaluation. In some cases
this type of validation is the only available test of the model's output,
because datasets for external validation are unavailable.
A common motivation for using spatial microsimulation is lack of
data on a specific variable. In such cases internal validation, combined with proxy variables for which external
datasets are available, may be the best approach to model evaluation.
This is the case with the CakeMap example explored in the previous chapter.
There are no readily available datasets on the geographic distribution of cake consumption,
so external validation of the dependent variable (frequency with which cake is
eaten) is deemed impossible in this case. However, new sources of data such as
number of confectionery shops, consumer surveys and even social media could be
explored to provide 'sanity checks' on the results. Sometimes you may need
to be creative to find data for external validation.
What is important to note about these two kinds of validation is that they
test the model at different levels. *Internal validation* tests
the model quality at the aggregate level, assuming the input data is
relevant to the research question,
accurate and representative. If this validation fails, you may
have a problem with the input microdata (e.g. an excess of 'empty cells' or an
unrepresentative sample of the population),
implementation of the population synthesis algorithm, or
contradictory constraints. Internal validation highlights problems of
method: if internal validation results are poor, the cause of the
problem should be diagnosed (e.g. is it poor data or poor implementation?)
and fixed.
By contrast, *external
validation* compares the model results with data that is *external* to the
model. External validation is more rigorous as it relates
simultaneously to the model's performance and whether the input data are
suitable for answering the research questions explored by spatial microsimulation.
Poor external validation results can arise from many sources, so they are harder to fix
(internal validation can at least rule out faulty methods). Thus internal and
external validation complement each other.
This chapter explains how to undertake routine checks on spatial microsimulation
procedures, how to
identify outlying variables and zones which are simply not performing well
(internal validation) and
how to undertake external validation. Even in cases where there is a paucity of
data on the target variable, as with cake consumption, some tests of the model's
performance against external data can usually be
undertaken. As we will see with the CakeMap example (where income is used as a
proxy variable for external validation purposes) this can involve the creation of
new target variables, purely for the purposes of validation.
## Internal validation {#Internal}
Internal validation is the process of comparing the model's output against data that is internal to the model itself. In practice this means converting the synthetic spatial microdata into a form that is commensurate with the constraint variables and comparing the two geographically aggregated datasets: the observed vs simulated values. Every spatial microsimulation model will have access to the data needed for this comparison. Internal validation should therefore be seen as **the bare minimum** in terms of model evaluation, to be conducted as a standard procedure on all spatial microsimulation runs. When authors refer to this procedure as "result validation" they are being misleading. Internal validation tells us simply that the results are internally consistent; it should always be conducted. The two main causes of
poor model fit in terms of internal validation are:
1. Incorrectly specified constraint variables. For example the total number
of people in each zone according to one variable (e.g. age) may be different
from that according to another (e.g. employment status). This could be because
each variable uses a different *population base* (see Glossary).
2. *Empty cells*. These represent 'missing people' in the input microdata who
have a combination of variables that are needed for good model fit. If in
the input microdata for SimpleWorld there were no older males, for example,
the model would clearly perform much worse.
Other sources of poor fit between simulated and observed frequencies for
categories in the linking variables include simple mistakes in the code
defining the model, incorrect use of available algorithms for population
synthesis and, to a lesser extent, integerisation (Lovelace et al. 2015).
Because internal validation is so widely used in the literature
there are a number of established measures of internal fit. Yet there is little consistency in the measures that are used. This makes it difficult to assess which models are performing best across different studies, a major problem in spatial microsimulation research. If one study reports only *r* values, whereas another reports only *TAE* (each measure will be described shortly), there is no way to assess which is performing better. There is a need for more consistency in reporting internal validation. Hopefully this chapter, which describes each of the commonly used and recommended measures of *goodness-of-fit* and provides guidance on which to use, is a step in the right direction.
Several metrics of model fit exist. We will look at some commonly used measures and define them
mathematically before coding them in R and, in the subsequent section, implementing them to evaluate the CakeMap spatial microsimulation model. The measures developed in this section are:
- Pearson's correlation (*r*), a formula to quantify the linear correlation between the observed and final counts in each of the categories for every zone.
- Total absolute error (TAE), also known as the sum of absolute error (SAE), simply the sum of absolute (positive) differences between the observed and final counts in each of the categories for every zone.
- Relative error (RE), the TAE divided by the total concerned population.
- Mean relative error (MRE), the mean of the RE values calculated per variable and per zone.
- Root mean squared error (RMSE), the square root of the mean of all squared errors. This metric emphasises the relative importance of a few large errors over the build-up of many small errors.
- Chi-squared, a statistical hypothesis test of whether the final counts are consistent with the observed counts. The null hypothesis is that the final and the observed counts follow the same distribution; the alternative hypothesis is that they do not.
Often, to illustrate the internal quality of the model, visual representations such as maps or graphs are added.
We will produce such representations for the CakeMap example.
### Pearson's *r*
Pearson's coefficient of correlation ($r$) is the most commonly used measure of aggregate level model fit for internal validation. *r* is popular because it provides a fast and simple insight
into the fit between the simulated data and the constraints at an aggregate level. In most cases
$r$ values greater than 0.9 should be sought in spatial microsimulation and
in many cases $r$ values exceeding 0.99 are possible, even after integerisation.
$r$ is a measure of the linear correlation between two vectors or matrices. In spatial microsimulation,
if the model works, the observed and final counts in each of the categories for every zone are equal.
This means that a plot of the observed counts against the final counts, for each category, lies on
a perfect line where abscissa and ordinate are equal.
A measure of *linear* correlation is therefore the one needed.
The formula to calculate the Pearson's correlation between the vectors (or matrices) $x$ and $y$ is:
$$ r=\frac{s_{XY}}{s_X s_Y}=\frac{\frac{1}{n}\displaystyle\sum_{i=1}^n x_iy_i -\bar{x}\bar{y}}{\sqrt{\frac{1}{n}\displaystyle\sum_{i=1}^n x_i^2-\bar{x}^2}\sqrt{\frac{1}{n}\displaystyle\sum_{i=1}^n y_i^2-\bar{y}^2}}$$
This corresponds to the covariance divided by the product of the standard deviations of the two vectors. This
can sound complicated, but it is just a standardised covariance. If the fit is perfect,
both vectors (simulated and constraint) have the same values, the covariance is
equal to the product of the standard deviations and $r$ equals 1.
Note that this measure is strongly influenced by outliers: if even a single
category has a poor fit, the $r$ value can be substantially reduced.
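The formula is easy to verify in R. The minimal sketch below, using hypothetical toy vectors rather than CakeMap data, computes $r$ from the covariance and standard deviations and checks the result against the built-in `cor()` function used later in this chapter.
```{r}
# Minimal sketch with hypothetical toy vectors (not CakeMap data):
# compute r from the covariance formula and compare with cor()
obs_toy <- c(10, 20, 30, 40) # hypothetical observed counts
sim_toy <- c(12, 18, 33, 37) # hypothetical simulated counts
cov(obs_toy, sim_toy) / (sd(obs_toy) * sd(sim_toy)) # r 'by hand'
cor(obs_toy, sim_toy) # the same value from the built-in function
```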
### Absolute error measures
TAE and RE are crude yet effective measures of overall
model fit. TAE has the additional advantage of
being very easily understood as simply the sum of errors:
$$
e_{ij} = obs_{ij} - sim_{ij}
$$
$$
TAE = \sum\limits_{ij} | e_{ij} |
$$
where $e$ is error, $obs$ and $sim$ are the observed and simulated values for each constraint category ($j$) and each area ($i$), respectively.
Note the vertical lines $|$ mean we take the absolute value of
the error. This means that an error of -5 has the same impact
as an error of +5. This avoids positive and negative errors cancelling out, which could
give an error of 0 when `obs` is bigger for one category and smaller for another.
TAE therefore measures the total magnitude of the differences.
```{r, echo=FALSE}
# : $|-1| = 1$ and $|1| = 1$.
```
RE is the TAE divided by the total
population of the study area. This means that
if we compare the results for all variables simultaneously,
we divide the TAE by the total population multiplied by
the number of variables^[This is very intuitive, since
by considering all constraint variables together, we have taken
the whole population once for each variable.]. Thus, TAE is
sensitive to the number of people and of categories
within the model, while RE is not. RE can be interpreted as
a percentage of error: it is simply the standardised version of TAE.
$$
RE = TAE / (total\_pop * n\_var)
$$
To calculate the mean relative error (MRE), RE is first calculated per variable and
per zone, and the resulting values are then averaged:
$$
MRE = \frac{1}{n\_var \times n\_zones}\sum_{i=1}^{n\_var}\sum_{j=1}^{n\_zones}RE(zone=j; var=i)
$$
Before seeing how these metrics can easily be implemented in code, we will
define the remaining metrics listed in the bullet points above. Of the three
'absolute error' measures, we recommend reporting RE or MRE, as these scale with the
population of the study area.
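As a rough sketch of how these measures could be computed, the code below applies TAE, RE and MRE to small hypothetical zone-by-category matrices (placeholders, not CakeMap data; a `tae()` function for CakeMap is defined later in this chapter). For simplicity the MRE here averages RE over zones only.
```{r}
# Sketch with hypothetical 2-zone, 2-category matrices (not CakeMap data)
obs_mat <- matrix(c(50, 50, 30, 70), nrow = 2, byrow = TRUE)
sim_mat <- matrix(c(48, 52, 35, 65), nrow = 2, byrow = TRUE)
tae_sketch <- sum(abs(obs_mat - sim_mat)) # TAE
re_sketch <- tae_sketch / sum(obs_mat)    # RE (TAE / total of observed cells)
# RE per zone, then averaged: a simplified MRE (over zones only here)
re_zone <- rowSums(abs(obs_mat - sim_mat)) / rowSums(obs_mat)
c(TAE = tae_sketch, RE = re_sketch, MRE = mean(re_zone))
```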
### Root mean squared error
RMSE is similar to the absolute error metrics, but uses the *sum of squares* of the error. Recent work suggests that RMSE is preferable to absolute measures of error when the errors approximate a normal distribution
[@Chai2014].
Errors in spatial microsimulation tend to have a 'normal-ish' distribution, with many very small errors around the mean of zero and comparatively few larger errors. RMSE is defined as follows:
$$
RMSE = \sqrt{\frac{1}{n} \sum_i^n e^2_i}
$$
RMSE is an interesting measure of the error, since TAE and RE would be the same whether the errors are $(1,1,1,1)$ or
$(0,0,0,4)$. However, we consider the fit globally better if it contains several small errors than if it is perfect for 3 zones with a larger error for the fourth; RMSE detects this difference.
For the first case, RMSE is $\sqrt{\frac{4}{4}}=1$. For the second case, RMSE equals $\sqrt{\frac{4^2}{4}}=2$.
As with TAE, there is also a standardised version of RMSE, the normalised root mean squared error (NRMSE). This is calculated by dividing RMSE by the range of the observed values:
$$
NRMSE = \frac{RMSE}{max(obs) - min(obs)}
$$
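RMSE and NRMSE can be written as short R functions in the same style as the `tae()` function defined later in this chapter; `rmse()` and `nrmse()` below are our own helper names, not functions from a package.
```{r}
# Sketch of RMSE and NRMSE helpers (our own function names, not from a package)
rmse <- function(observed, simulated){
  obs_vec <- as.numeric(observed)
  sim_vec <- as.numeric(simulated)
  sqrt(mean((obs_vec - sim_vec)^2))
}
nrmse <- function(observed, simulated){
  obs_vec <- as.numeric(observed)
  rmse(observed, simulated) / (max(obs_vec) - min(obs_vec))
}
```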
### Chi-squared
```{r,echo=FALSE}
# (MD, 2015)
# Are you sure of the relevance of a chi squared in this case?
# No I'm not sure! (RL)
# Define the hypothesis
# Wedo not agree with the pvalue definition
```
Chi-squared is a commonly used test of the fit between absolute counts of categorical variables. It has the advantage of providing a *p value*, which represents the probability of observing a discrepancy at least as large as the one found if the simulated and observed counts in fact followed the same distribution. It is primarily used to test for relationships between categorical variables (e.g. socio-economic class and smoking) but has been used frequently in the spatial microsimulation literature (Voas and Williamson 2001; Wu et al. 2008).
The chi-squared statistic is defined as the sum of the square of the errors divided by the observed values [@Diez2012].
Suppose we have a simulated matrix $sim$ (for example, simulated counts) and an observed matrix $obs$; the
chi-squared statistic is then calculated as:
$$
\chi^2= \sum_{i=1}^{n\_line}\sum_{j=1}^{n\_column}\frac{(sim_{ij} - obs_{ij})^2}{obs_{ij}}
$$
The *p value* of the chi-squared test is the probability of obtaining the calculated $\chi^2$ value or a worse one
(in terms of validating models, worse means a bigger difference), given the number of *degrees of freedom*
(related to the number of categories) in the test.
An advantage of chi-squared is that it can compare vectors as well as matrices. As with all metrics presented in this section, it can also calculate fit for subsets of the data. A disadvantage is that chi-squared does not perform well when expected counts for cells are below 5. If this is the case it is recommended to use a subset of the aggregate level data for the test [@Diez2012].
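As a sketch of how this could be computed directly, the code below calculates the chi-squared statistic and an approximate *p value* with `pchisq()` for hypothetical count vectors; using the number of cells minus one as the degrees of freedom is a simplifying assumption.
```{r}
# Sketch with hypothetical observed/simulated counts (not CakeMap data)
obs_counts <- c(40, 60, 25, 75)
sim_counts <- c(42, 58, 28, 72)
chi2 <- sum((sim_counts - obs_counts)^2 / obs_counts)
# p value: probability of a chi-squared value at least this large under the null
# (degrees of freedom = number of cells - 1 is a simplifying assumption)
pchisq(chi2, df = length(obs_counts) - 1, lower.tail = FALSE)
```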
### Which test to use?
```{r, echo=FALSE}
# TODO: add results from JASSS paper
```
The aforementioned tests are just some of the most commonly
used and most useful *goodness of fit* measures for internal validation in spatial microsimulation. The differences
between the measures are quite subtle.
@Voas2001 investigated the matter and found no consensus
on the measures that are appropriate for different situations.
Ten years later, we are no nearer consensus.
Such measures, that compare aggregate
count datasets, are *not* sufficient to ensure that the results of
spatial microsimulation are reliable; they are methods of *internal validation*.
They simply show that the individual level dataset has
been reweighted to fit with a handful of constraint variables: i.e. that the
process has worked under its own terms.
\pagebreak
Our view is that all the measures
outlined above are useful and roughly analogous (a perfect fit
will mean that measures of error evaluate to zero
and that $r = 1$). However, some are better than others.
Following @Chai2014, we recommend using *r* as a simple test of fit and reporting *RMSE*,
as it is a standard test used across the sciences. *RMSE* is robust to the number of observations and, using *NRMSE*, to the average size of zones also. Chi-squared is also a good option as it is very mature, provides *p values* and is well known. However, chi-squared is a more complex measure of fit and does not perform well when the table contains cells with fewer than 5 observations, as is common in spatial microsimulation models with small areas and many constraint categories.
We recommend reporting more than one metric, while focusing
on measures that you and your colleagues
understand well. Comparing the results
with one or more alternative measures will add robustness.
However, a more important
issue is external validation: how well our individual level
results correspond with the real world.
### Internal validation of CakeMap
Following the 'learning by doing' ethic, let us now
implement what we have learned about internal validation. As a very basic test, we will calculate the correlation between the constraint table
cells and the corresponding simulated cell values for the CakeMap example:^[Data frames will not
work in this function and must be converted to numeric vectors with `as.numeric`.]
```{r}
cor(as.numeric(cons), as.numeric(ind_agg))
```
We have just calculated our first goodness-of-fit measure for a
spatial microsimulation model and the results are encouraging.
The high correlation suggests that the model is working:
it has internal consistency and could be described as 'internally valid'.
Note that we have calculated the correlation
before integerisation here. With a perfect fit, we would have a linear correlation of exactly 1.
In microsimulation, the whole population, with all the characteristics of each individual, is only available after the simulation.
For this reason, we have to aggregate the simulated population to obtain a matrix comparable with the constraints.
There are two ways to proceed. First, we can make the comparison variable by variable, so that the total number of individuals is the constrained number of people in the area. Secondly, we can take all variables together, giving a matrix that includes the whole population once for each variable; the sum of all cells then equals the
number of people in the area multiplied by the number of variables. Our choice here is the second alternative.
If we need more detail on the fit in one zone, we can then proceed to an analysis per variable for that specific case.
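A quick sanity check of this second approach is that the grand totals of the constraint table and the aggregated simulated population should be almost exactly equal, both corresponding to the population of the study area multiplied by the number of constraint variables:
```{r}
# Grand totals should match: population x number of constraint variables
sum(cons)
sum(ind_agg)
```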
We can also calculate the correlation between these two matrices zone by zone. In this way,
we can identify the zones for which our simulation is less representative.
A vector of the correlation per zone, called `CorVec`, is calculated:
```{r}
# initialize the vector of correlations
CorVec <- rep(0, nrow(cons))
# calculate the correlation for each zone
for (i in 1:nrow(cons)) {
  num_cons <- as.numeric(cons[i, ])
  num_ind_agg <- as.numeric(ind_agg[i, ])
  CorVec[i] <- cor(num_cons, num_ind_agg)
}
```
We can then proceed to a statistical analysis of the correlations and identify the worst zone.
In the code below, a summary of the vector of correlations is produced. The minimum value is
0.9451, corresponding to zone 84. This value is below the overall correlation,
but still close to 1. We can also observe that the first quartile is already 1, meaning
that for more than 75% of the zones the correlation is perfect (at least to
4 decimal places). Moreover, by identifying the second worst zone, we can see that its correlation
is around 0.9816.
```{r}
# summary of the correlations per zone
summary(CorVec)
# identify the zone with the worst fit
which.min(CorVec)
# top 3 worst zones
head(order(CorVec), n = 3)
```
```{r, echo=FALSE}
# WARNING: THIS MAY generate the wrong result! (because the length of the vector changes)
# See here http://stackoverflow.com/questions/2453326/fastest-way-to-find-second-third-highest-lowest-value-in-vector-or-column
# Second minimal value
# which.min(CorVec[-84], ) # [-84] means
```
This ends our analysis of correlation. Next we can calculate
total absolute error (TAE), which is easily defined as a function in R:
```{r}
tae <- function(observed, simulated){
obs_vec <- as.numeric(observed)
sim_vec <- as.numeric(simulated)
sum(abs(obs_vec - sim_vec))
}
```
By applying this function to CakeMap, we find a TAE of 26445.57, as calculated below. This may
seem like a large number, but remember that this measure depends on the scale of the problem:
26,445 is small compared with the total population multiplied by
the number of constraints, 4,871,397. For this reason,
the relative error RE (also called the standardised absolute error) is often preferable. We observe an RE of
0.54%. Note that RE is simply TAE divided by the total of all observed cell
values (that is, the total population of the study area multiplied by the number of
constraints).
```{r}
# Calculate TAE
tae(cons, ind_agg)
# Total of all observed cell values (population x number of constraints)
sum(cons)
# RE
tae(cons, ind_agg) / sum(cons)
```
As with all tests of goodness of fit, we can perform the analyses zone by zone.
For this example, the vectors of TAE and RE per zone are called `TAEVec`
and `REVec` respectively.
```{r}
# Initialize the vectors
TAEVec <- rep(0, nrow(cons))
REVec <- rep(0, nrow(cons))
# calculate the TAE and the RE for each zone
for (i in 1:nrow(cons)) {
  TAEVec[i] <- tae(cons[i, ], ind_agg[i, ])
  REVec[i] <- TAEVec[i] / sum(cons[i, ])
}
```
The next step is to interpret these results. The summary of each vector will help us.
Note that for the best fit the correlation is high, while the RE and TAE are small.
The zone with the highest error is again zone 84, which has a TAE of 14710 (in units of individuals times variables)
and an RE of 21.3%. The simulation for this zone is therefore somewhat distant from the
constraints. Looking at the second and third worst zones, their REs are
around 12.5% and 7.0% respectively. The third quartile is of the order of $10^{-5}$ ($10^{-3}$%), which is very
close to 0, so 75% of the zones have an RE at or below this very small value. The worst zones aside,
it appears that for the majority of the zones the RE is small.
```{r}
# Summary of the TAE per zone
summary(TAEVec)
# Summary of the RE per zone
summary(REVec)
# Identify the worst zone
which.max(TAEVec)
which.max(REVec)
# Top 3 worst zones
tail(order(TAEVec), n = 3)
tail(order(REVec), n = 3)
```
Similar analyses can be applied for the other tests of goodness of fit.
In all cases, it is very important to have an idea of the internal validity
of your model. For example, if we want to analyse cake consumption using
the synthetic population created here, we have to be aware that for
zone 84 the model does not fit the constraints well.
Knowing that zone 84 is problematic, the next stage is to ask "how problematic?".
If a single zone is responsible for the majority of error, this would suggest
that action needs to be taken (e.g. by removing the offending zone
or by identifying which variable is causing the error).
To answer the previous question numerically, we can rephrase it in technical
terms: "Which proportion of error in the model arises from the worst zone?"
This is a question we can answer with a simple R query:
```{r}
worst_zone <- which.max(TAEVec)
TAEVec[worst_zone] / sum(TAEVec)
```
The result of the above code demonstrates that more than half (56%) of the error originates
from a single zone: 84. Therefore zone 84 certainly is anomalous and worthy of further
investigation. An early strategy to characterise this zone and compare it to the others is to visualise it.
To this end, Figure 8.1 places the TAE and RE values calculated previously on maps --- see the book's
[online source code](https://github.com/Robinlovelace/spatial-microsim-book/blob/master/validation.Rmd) to see how. Zone 84 is clearly visible in these maps as a ward
just above Leeds city centre. This does not immediately solve
the problem, but it confirms that only a few zones have large errors.
```{r, echo=FALSE, fig.cap="Geographical distribution of Total Absolute Error (TAE) and Relative Error (RE). Note the zones of high error are clustered in university areas such as near the University of Leeds, where there is a high non-resident population.", message=FALSE, warning=FALSE, results=F, fig.pos='!h',fig.height=6, fig.width=5}
cons_codes <- read.csv("data/CakeMap/cars-raw.csv", stringsAsFactors = FALSE)
cons_codes <- as.character(cons_codes$GEO_CODE)[3:126]
cons_codes <- data.frame(CODE = cons_codes, TAEVec)
cons_codes$CODE = as.character(cons_codes$CODE)
# head(cons_codes)
# load the geographic data
load("data/CakeMap/wards.RData")
# summary(wards) # look at waht we've loaded - NB the coordinates
# head(wards@data) # take a look at the data
wards <- spTransform(wards, CRSobj=CRS("+init=epsg:4326")) # transform CRS for plotting
wards$CODE <- as.character(wards$CODE)
# wards@data$CODE[1:10]
# cons_codes[1:10]
# summary(cons_codes %in% wards$CODE) # check codes match
# summary(wards$CODE %in% cons_codes)
# Merge the two
# head(cons_codes)
# head(wards@data[1:3])
cons_codes <- inner_join(wards@data[1:3], cons_codes, by = "CODE")
wards@data <- cons_codes
library(tmap)
library(tmaptools)
wards$`Sqrt(TAE)` <- sqrt(wards$TAEVec)
# p1 <- tm_shape(wards) +
# tm_fill("Sqrt(TAE)", breaks = c(0, 50, 100, 150)) +
# tm_borders()
cons_codes <- read.csv("data/CakeMap/cars-raw.csv")
cons_codes <- as.character(cons_codes$GEO_CODE)[3:126]
cons_codes <- data.frame(CODE = cons_codes, REVec)
cons_codes$CODE = as.character(cons_codes$CODE)
cons_codes <- inner_join(wards@data[1:3], cons_codes, by = "CODE")
wards@data <- cbind(wards@data, cons_codes)
wards$`Sqrt(RE)` <- sqrt(wards$REVec)
# p2 <- tm_shape(wards) +
# tm_fill("Sqrt(RE)", n = 3) +
# tm_borders()
# osm_tiles <- read_osm(bbox(wards))
# tm_shape(osm_tiles) +
# tm_raster() +
tm_shape(wards) +
tm_fill(c("Sqrt(TAE)", "Sqrt(RE)"), n = 3, alpha = 0.5) +
tm_borders()
# Add RE map - commented out due to error with scale_transform
# bmap <- ggmap(get_map(bbbig, maptype = "terrain"))
# bmap2 <- ggplot()
# p2 <- bmap2 +
# geom_polygon(data = fwards, aes(long, lat, group = group, fill = REVec)
# , alpha = 0.4
# ) +
# geom_path(data = fwards, aes(long, lat, group = group), color = "white", size = 0.1) +
# coord_map(xlim = c(bb[1], bb[3]), ylim = c(bb[2], bb[4])) +
# scale_fill_gradient(low = "green", high = "red", trans = "sqrt")
# gridExtra::grid.arrange(p1, p2)
# REVec_standard <- REVec * (mean(REVec) / mean(TAEVec))
# cons_codes <- cbind(cons_codes, REVec = REVec_standard)
# cons_codes_molten <- gather(cons_codes, variable, value, -CODE, -OLDCODE, -NAME)
# head(cons_codes_molten)
# fwards <- inner_join(fwards, cons_codes_molten, by = "CODE")
# plot the result
# ggmap(get_map(bbbig, maptype = "terrain")) +
# geom_polygon(data = fwards, aes(long, lat, group = group, fill = value)
# , alpha = 0.4
# ) +
# geom_path(data = fwards, aes(long, lat, group = group), color = "white", size = 0.1) +
# coord_cartesian(xlim = c(bb[1], bb[3]), ylim = c(bb[2], bb[4])) +
# facet_wrap(~ variable, scales = "free") +
# scale_fill_gradient(low = "green", high = "red", trans = "sqrt")
# Plot REVec
# fwards <- fortify(wards, region = "CODE")
# fwards <- rename(fwards, CODE = id)
# # head(fwards)
# tmp_df <- data.frame(CODE = cons_codes$CODE, REVec)
# fwards <- inner_join(fwards, tmp_df)
#
# ggmap(get_map(bbbig, maptype = "terrain")) +
# geom_polygon(data = fwards, aes(long, lat, group = group, fill = REVec)
# , alpha = 0.4
# ) +
# geom_path(data = fwards, aes(long, lat, group = group), color = "white", size = 0.1) +
# coord_cartesian(xlim = c(bb[1], bb[3]), ylim = c(bb[2], bb[4])) +
# scale_fill_gradient(low = "green", high = "red", trans = "sqrt")
```
\pagebreak
Note that the maps presented in Figure 8.1 look identical for the TAE and RE values
except for the scale; the similarity of these measures of fit can be
verified using a simple correlation:
```{r}
cor(TAEVec, REVec) # the measures are highly correlated
```
In this case, the two measures are highly correlated. However, when the zones differ greatly
in terms of total population,
the two maps can diverge more. Indeed, with the same
TAE, if zone 84 had contained
a total population twice as big, its RE would have been halved, since the denominator
(the zone population multiplied by the number of variables) would have doubled.
Having identified a zone that is particularly problematic (zone 84), we will look at the responsible
variables. We focus on zone 84 and calculate the differences between the constraint
and the simulation for each category:
```{r, echo=FALSE}
options(scipen=100,digits=3) #to avoid scientific notation
```
```{r}
RudeDiff <- cons[84,] - ind_agg[84,] # differences for zone 84
diff <- round( abs(RudeDiff) ) # interesting differences
diff[diff > 1000] # printing the differences bigger than 1000
```
The responsible variable seems to be car ownership. We have made
a similar check for the three worst zones and this variable is always the one with the
largest difference. To investigate the reasons for this, we print the constraints
for this variable in the three worst zones and the marginals of the
observed individuals:
```{r}
worst <- tail(order(REVec), n = 3)
cons[worst, c("Car", "NoCar")] # constraint for 3 worst zones
table( ind[,2] ) # individuals to weight (1 = Car ; 2= NoCar)
```
Only a few of the observed individuals did not own a car. Thus, for zones requiring many people
without a car, only the weights of 178 individuals out of 916 can be adapted.
The chance that these individuals cover the whole range of possible characteristics
is correspondingly lower. The individuals without a car are stored in the `NoCar` object. The contingency
table of cake consumption against age for these people shows that nobody aged 55-64
eats more than 6 cakes.
```{r}
# individuals not owning a car
NoCar <- ind[ind$Car == 2, ]
# Cross table
table(NoCar$NCakes, NoCar$ageband4)
```
The three zones with the worst simulations required many people without a car. By contrast,
below we print the car ownership constraint for the three best zones: they required fewer people in this category.
This is the risk of generating a population of 1,623,800 inhabitants from a survey including
only 916 people.
```{r}
best <- head(order(REVec), n = 3)
# constraint for 3 best zones
cons[best, c("Car", "NoCar")]
```
In conclusion, the simulation runs well for all zones except a few. This is due to the individuals
present in the sample, which may reflect a survey that was not uniformly distributed across
the different zones or that did not include enough people.
```{r, echo=FALSE}
# (MD, 2015)
# check that somewhere, the problem of category non represented or to little sample is somewhere explained
```
## Empty cells {#empty-cells}
\index{empty cells}
Roughly speaking, 'empty cells' refer to individuals who are absent from
the input microdata. More specifically, empty cells represent
individuals with a combination of attributes in the constraint variables
that are likely to be present in the real spatial microdata but are known not
to exist in the individual level data. Empty cells are easiest to envision
when the 'seed' is represented as a contingency table. Imagine, for example,
if the microdata from SimpleWorld contained no young males. The associated
individual data could be Table 8.1, leading to the cross table shown in Table 8.2.
We can clearly see that there are no young males. Applying reweighting methods
to this kind of input data results in an unrealistic final population, without
any young males.
```{r, include=FALSE}
# source("code/SimpleWorld.R")
indsw <-
structure(list(id = 1:5, age = structure(c(2L, 2L, 1L, 2L, 1L
), .Label = c("a0_49", "a50+"), class = "factor"), sex = structure(c(2L,
2L, 2L, 1L, 1L), .Label = c("f", "m"), class = "factor")), .Names = c("id",
"age", "sex"), row.names = c(NA, -5L), class = "data.frame")
```
```{r, results='hide', echo=FALSE}
ind_nym <- indsw[-3,2:3]
ind_nym
table(ind_nym)
```
```{r, echo=FALSE, results='asis', message=FALSE}
# # the kable way
# kable(ind_nym)
# kable(table(ind_nym))
# xtable way
# library(xtable)
# xt <- xtable(ind_nym, caption = "Individual level data from SimpleWorld with empty cells. Note there are no young males.")
# print(xt, caption.placement = "top", type = "latex", comment = FALSE)
# xt <- xtable(table(ind_nym), "Contingency table of the SimpleWorld microdata with no young males. Not the zero: this is the empty cell.")
# print(xt, caption.placement = "top", comment = FALSE, type = "latex", booktabs = TRUE)
```
Table: Individual level data from SimpleWorld with empty cells. Note there are no young males.
| |age |sex |
|:--|:-----|:---|
|1 |a50+ |m |
|2 |a50+ |m |
|4 |a50+ |f |
|5 |a0_49 |f |
Table: Contingency table of the SimpleWorld microdata with no young males. Note the zero: this is the empty cell.
| | f| m|
|:-----|--:|--:|
|a0_49 | 1| 0|
|a50+ | 1| 2|
The importance of empty cells and methods
for identifying whether or not they exist in the
individual level data are explained in a recent paper (@Lovelace-ipfinr).
The number of different constraint variable permutations ($Nperm$)
increases rapidly with the number of constraints (see
equation \ref{eqempty} below),
where $n.cons$ is the total number of constraints and
$n.cat_i$ is the number of categories
within constraint $i$:
\begin{equation}
\displaystyle Nperm = \prod_{i = 1}^{n.cons} n.cat_{i}
\label{eqempty}
\end{equation}
To exemplify this equation, the number of permutations of constraints in the
SimpleWorld microdata example is 4:
2 categories in the sex variable multiplied by
2 categories in the age variable. Clearly, $Nperm$ depends on how continuous variables
are binned, the number of constraints and diversity within each constraint.
Once we know the number of unique individuals (in terms of the constraint variables)
in the survey ($Nuniq$), the test to check a dataset for empty cells is straightforward,
based on equation \ref{eqempty}:
\begin{equation}
is.complete =
\left\{
\begin{array}{ll}
TRUE & \mbox{if } Nuniq = Nperm \\
FALSE & \mbox{if } Nuniq < Nperm
\end{array}
\right\}
\end{equation}
Once the presence of empty cells is determined, the next
stage is to identify which types of individuals
are missing from the individual level input
dataset ($Ind$).
```{r, echo=FALSE}
# TODO: link this to discussion with paper (MD)
# TODO: say when it's better to use a sample-free method
```
The 'missing' individuals, which need to be added to make $Ind$ complete, can be defined
by the following equation:
$$
Ind_{missing} = \{x | x \in complete \cap x \not \in Ind \}
$$
This means simply that the missing cells are defined as individuals with constraint categories
that are present in the complete dataset but absent from the input data.
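A sketch of how this check could be coded is shown below, using the SimpleWorld microdata without young males (the data shown in Table 8.1). Note that, in practice, the number of categories per constraint should be taken from the constraint tables rather than from the microdata, in case an entire category is missing from the sample.
```{r}
# Sketch of the empty-cell check on the SimpleWorld data of Table 8.1
ind_check <- data.frame(
  age = c("a50+", "a50+", "a50+", "a0_49"),
  sex = c("m", "m", "f", "f")
)
n_perm <- prod(sapply(ind_check, function(x) length(unique(x)))) # Nperm
n_uniq <- nrow(unique(ind_check)) # Nuniq
n_uniq == n_perm # FALSE: at least one empty cell exists
table(ind_check) # the zero identifies the missing individual type
```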
## External validation {#External}
Beyond mistakes in the code, more fundamental
questions should be asked of results based on spatial microsimulation. The validity of the
underlying assumptions affects the confidence one should have in the
results, and to assess it we need external datasets. External validation is
therefore a tricky topic which is highly dependent on the available data
[@Clarke2010-valid].
Geocoded survey data, *real* spatial microdata, is the
'gold standard' when it comes to official data.
Small representative samples of the population for small areas
can be used as a basis for individual level validation.
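In practice, aggregate level external validation can reuse the same metrics as internal validation, but with the external estimates in place of the constraints. A generic sketch with hypothetical per-zone estimates of a proxy variable (for instance mean income from an external source versus the value aggregated from the synthetic population) might look like this:
```{r}
# Hypothetical per-zone estimates of a proxy variable (illustration only)
ext_income <- c(21000, 24500, 19800, 30200) # external estimates
sim_income <- c(20500, 25100, 20300, 29000) # aggregated from synthetic population
cor(ext_income, sim_income) # linear association with external data
sum(abs(ext_income - sim_income)) / sum(ext_income) # relative error
```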
## Chapter summary
This chapter explored methods for checking the results of
spatial microsimulation, building on the CakeMap example presented in the
previous chapter. This primarily involved checking
that the results are internally consistent: that the output
spatial microdata correspond with the geographical constraints.
This process, generally referred to as 'internal validation',
is important because it ensures that the model is internally consistent
and contains no obvious errors.
However, the term 'validation' can be misleading as it
implies that the model is in some way 'valid'.
A model is only as good as its underlying assumptions,
which may involve some degree of subjectivity.
We therefore advocate talking about this phase as 'evaluation'
or simply 'model checking', if all we are doing is internal
validation.
In the example of CakeMap, no datasets are available to check whether the
simulated rate of cake consumption is comparable with estimates from
other sources. In spatial microsimulation,
external validation is often difficult because
the available datasets are usually already used in the simulation itself.
This helps explain why internal validation is far more common in spatial
microsimulation studies than external validation, although the latter is
generally more important.