# Data preparation {#data-prep}
```{r, echo=FALSE}
library(png)
library(grid)
# Applications of spatial microsimulation - to be completed!
## Updating cross-tabulated census data
## Economic forecasting
## Small area estimation
## Transport modelling
## Dynamic spatial microsimulation
## An input into agent based models
```
```{r, echo=FALSE}
# With the foundations built in the previous chapters now (hopefully) firmly in-place,
# we progress in this chapter to actually *run* a spatial microsimulation model
# This is designed to form the foundation of a spatial microsimulation course.
```
```{r, echo=FALSE}
# This next chapter is where people get their hands dirty for the first time -
# could be the beginning of part 2 if the book's divided into parts.
```
Correctly loading, manipulating and assessing aggregate and individual level
datasets is critical for effectively modelling real-world data.
Getting the data into the right shape has the potential
to make your models run more quickly and to make them easier to modify,
for example to include newly available
input data. R is an accomplished tool for data reformatting,
so it can be used as an integrated solution
for both preparing the data and running the model.
In addition to providing a primer on data manipulation,
the objects loaded in this chapter also
provide the basis for Chapter 5, which covers the
process of population synthesis in detail.
Of course, the datasets you use in real applications will be much larger than
those in SimpleWorld, but the concepts and commands will be largely the same.
Each project is different and data can be very
diverse. For more on data cleaning and 'tidying', see @tidy-data.
However, in most cases, the input datasets will be similar to the ones presented here:
an individual level dataset consisting of categorical
variables and aggregate level constraints containing categorical counts.
The process of loading, checking and preparing the input datasets for spatial
microsimulation is generally a linear process, encapsulating the following
stages, corresponding to the sections of this chapter:
- *Accessing the input data* (\@ref(accessing)) provides advice on
how to collect and choose input data.
- *Target and constraint variables* (\@ref(Selecting)) explains the
different types of data you can have and how to define the targets
and the constraints.
- *Loading input data* (\@ref(Loading)) contains the R code to load
the input data.
- *Subsetting to remove excess information* (\@ref(subsetting-prep))
shows how to remove variables that are not needed for the analysis.
- *Re-categorising individual level variables* (\@ref(re-categorise))
demonstrates how to convert individual level variables into categories
that match the constraints.
- *Matching individual and aggregate level data names* (\@ref(matching))
ensures that category names correspond between datasets, avoiding
problems later in the spatial microsimulation.
- *'Flattening' the individual level data* (\@ref(flattening)) explains
how to transform the individual level data into a Boolean matrix.
## Accessing the input data {#accessing}
\index{target variables}
Before loading the spatial microdata, you must decide on the data
needed for your research. Focusing on "target variables" related to the research
question will help you decide on the constraint variables, as we will see
below (\@ref(Selecting)).
Selecting only variables of interest will ensure you do not
waste time thinking about and processing
additional variables that will be of little use in the future.
Regardless of the research question it is clear that the methodology depends
on the availability of data. If the problem relates to variables that are simply
unavailable at any level (e.g. hair length, if your research question related
to the hairdressing industry), spatial microsimulation may be unsuitable.
If, on the other hand, all the data is available at the geographical level
of interest, spatial microsimulation may not be necessary.^[To
explore geographical variability at a low spatial resolution,
for example, the necessary data may be already available, as surveys often
state which region each individual inhabits. Spatial microsimulation would
only be necessary in this case if higher spatial resolution were needed.]
Spatial microsimulation is useful when you
have an intermediate amount of data available: geographically aggregated counts
and a non-spatial survey.
In any case, you should have a clear idea about the range of available
data on the topic of interest before embarking on spatial microsimulation.
\index{input data}
\index{example data}
As with most spatial microsimulation models, the input data in SimpleWorld consists of
microdata --- a non-geographical individual level dataset --- and a
constraint table which represents aggregate counts for a series of geographical zones. In some cases,
you may have geographical information in your microdata, but not at the
required level of detail.
For example, you may have a variable on the province of each individual
but need its municipality, a lower geographical level.
The input data can be accessed from the RStudio book project, as described
in \@ref(SimpleWorld).
\index{reproducibility}
To ease reproducibility of the analysis when working with real data, it is
recommended that the process begins with a copy of the *raw* dataset on your
hard disc. Rather than modifying this file, modified ('cleaned') versions
should be saved as separate files. This ensures that after any mistakes, one can
always recover information that otherwise could have been lost and makes the
project fully reproducible. In this chapter, a relatively clean and
very tiny dataset from SimpleWorld is used, but we still create a backup of the
original dataset. We will see in chapter
\ref{CakeMap} how to deal with larger and messier data, where being able to
refer to the original dataset is more important. Here the focus is on the principles.
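A minimal sketch of this workflow is shown below; the file paths and the cleaning step are hypothetical, for illustration only:
```{r, eval=FALSE}
# Hypothetical workflow: read the raw file, derive a cleaned version and save it
# separately, leaving the raw data untouched
ind_raw <- read.csv("data/raw/ind-full.csv")  # the raw data: never overwritten
ind_clean <- ind_raw[!is.na(ind_raw$age), ]   # an illustrative cleaning step
write.csv(ind_clean, "data/clean/ind-clean.csv", row.names = FALSE)
```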
```{r, echo=FALSE}
# It sounds trivial, but the *precise* origin of the input data
# should be described. Comments in code that loads the data (and resulting publications),
# allows you or others to recall the raw information. # going on a little -> rm
# Show directory structure plot from Gillespie here
# 2. Remove excess information
```
\index{data cleaning}
'Stripping down' the datasets so that they only contain the bare essential
information will enable focus on the information that really matters. The input
datasets in the example used for this chapter
are already bare, but in real world surveys there may be literally hundreds
of variables clogging up the analysis and your mind.
It is good practice to remove
excess data at the outset. Provided you have a good workflow,
keeping all original data files unmodified for future
reference, it will be easy to go back and add extra variables
at a later stage. Following Occam's razor
which favours simple solutions (Blumer et al. 1987),
it is often best to start with a minimum
of data and add complexity subsequently
rather than vice versa.
```{r, echo=FALSE}
# Reference for occam's razor!
```
\index{linking variables}
Spatial microsimulation involves combining
individual and aggregate level data. Each level should
be given equal consideration when preparing the inputs for the
modelling process. In many cases, the limiting factor for model fit will
be the number of *linking variables*,
variables shared between the individual and aggregate level
datasets which are used to calculate the
weights allocated to the individuals for each zone (see Glossary).
These are also referred to as *constraint variables*, as they
*constrain* the individual weights per zone (Ballas et al. 2005).
If there are no shared variables between the aggregate and
individual level data, generating an accurate synthetic population is impossible.
In this case, your only alternative is to consider the
marginal distribution of each variable and make a random selection,
using the distribution as a set of probabilities. However, this implicitly assumes
that there are no correlations between the available variables.
If there are only a few shared variables (e.g. age, sex,
employment status), your options are limited. Increasingly,
however, there are many linking variables to choose from
as questionnaires become broader and more available. In this
case the choice of constraints becomes important: which and
how many to use,
their order and their relationship to the target variable
should all be considered at the outset when choosing the
input data and constraints.
## Target and constraint variables {#Selecting}
\index{target variables}
The geographically aggregated data should be the
first consideration when deciding on the
input data. This is because the geographical data
is essential for making the model spatial: without
good geographical data, there is no way to
allocate the individuals to different geographical zones.
The first consideration for the constraint data
is coverage: ideally close to 100% of each zone's total
population should be included in the counts. Also, the
survey must have been completed by a number of residents proportional to the
total number of inhabitants in every geographical zone under consideration. It is important
to recognise that many geographically aggregated datasets
may be unsuitable for spatial microsimulation due to sample
size: geographical information based on a small sample of
a zone's population is not a good basis for
spatial microsimulation.
To take one example,
estimates of the *proportion* of people who do not
get sufficient sleep in the Australian
state of Victoria, from the 2011 VicHealth Indicators
[Survey](http://data.aurin.org.au/dataset?q=sleep),
are not suitable constraint variables.
Constraint variables should be integer counts,
and each should contain a sufficient number of
categories. Ideally, each relevant
category and cross-tabulation should be represented.
It is unusual to have
a baby with a high degree qualification, but it may be useful to know
there is an absence of such individuals. The sleep dataset
meets neither of these criteria. It is based on
a small sample of individuals (on average 300 per
geographic zone, less than 1% of the total population).
Also it is a binary variable, dividing people into
'adequate' and 'inadequate' sleep categories. Instead,
geographical datasets from the 2011 Census should be
used. These may contain fewer variables, but they will
provide counts for different categories and have almost
100% coverage of the population.
To continue with the Australian example, imagine we are
interested only in Census data at the SA2 level,
the second smallest unit for the release of Census data.
A browse of the available datasets on an
[online portal](http://data.aurin.org.au/dataset?q=2011-08+SA2)
reveals that information is available on a wide range of
topics, including industry of employment, education level
and total personal weekly income (divided into 10 bands from
$1 - $199 to $2000+). Which of these dozens of variables do we
select for the analysis? It depends on the research question and
the target variable or variables of interest.
If we are interested in geographic variability in economic
inequality, for example, the income data would be first on the
list to select. Additional variables would likely include
level of education, type of employment and age, to discover
the wider context of the problem. If the research question
related to energy and sustainability, to take another example,
variables related to distance travelled to work and housing
would be of greater relevance. In each of these examples
the *target variable* determines the selection of constraints.
The target variable is the thing that we would like spatial
microdata on, as illustrated by the following two research
questions.
*How does income inequality vary from place to place?*
To answer this question it is not sufficient to have
aggregate level statistics (e.g. average household income).
So income per person must be simulated for each zone, using
spatial microsimulation. Sampling from a national population
provides a more realistic distribution than simply assuming
every person of a given income band earns a certain amount;
this is clearly not the case as income distributions are
not flat (they tend to have positive skew).
*How does energy use vary over geographical space?*
This question is more complicated as there is no single
variable on 'energy use' that is collected by statistical
agencies at the aggregate, let alone individual, level.
Rather, energy use is a *composite target variable* that
is composed of energy use in transport, household heating
and cooling and other things. For each type of energy use,
the question must eventually be simplified to coincide with
survey questions that are actually asked in Census surveys
(e.g. 'where do you work?', from which distance travelled
to work may be ascertained). Such considerations should guide
the selection of aggregate level (generally Census-based)
geographic count data. Of course, available datasets
vary greatly from country to country so selection of
appropriate datasets is highly context dependent.
The following considerations should inform the selection of
individual level microdata:
- Linking variables: are there enough
variables in the individual level data that are
also present in the geographically aggregated constraints?
Even when many linking variables are available, it is
important to determine the fit between the categories in each.
If age is reported in five-year bands in the aggregate level
data, but only in 20-year bands in the individual level data,
for example,
this could be problematic for highly
age-dependent research.
- Representativeness: is the sample population representative of the areas under investigation?
- Sample size and diversity: the input dataset must be sufficiently large and diverse to mitigate the *empty cell* problem.
- Target variables: are these, or proxies for them, included in the dataset?
In addition to these essential characteristics of the
individual level dataset, there are qualities that are desirable but not required. These include:
- Continuity of the survey: will the survey be repeated on
a regular basis into the future? If so, the analysis can form
the basis of ongoing work to track change over time, perhaps
as part of a *dynamic microsimulation model*.
- Geographic variables: although it is the purpose of spatial
microsimulation to allocate geographic variables to
individual level data, it is still useful to have some
geographic information. In many national UK surveys, for
example, the region (the coarsest geographical level)
of the respondent is reported. This information can be useful in
identifying the extent to which the difference between the
geographic extent of the survey and microsimulation study area
affects the results. This is an understudied area of knowledge
where more work is needed.
In terms of the *number* of constraints that is appropriate,
there is no 'magic number' that is correct in every case.
It is also important to note that the number of variables
is not a particularly good measure of how well constrained
the data will be. The total number of *categories* used to
constrain the weights is in fact more important.
For example, a model constrained by two variables,
each containing 10 categories (20 constraint categories
in total), will be better constrained
than a model constrained by 5 binary variables such as
male/female, young/old etc. That is not to say that the
former set of constraints is *better* (as emphasised above,
that depends on the research question), simply that the
weights will be more finely constrained.
A special type of
constraint variable is the **cross-tabulated** constraint.
This involves subdividing the categories in one variable
by the categories of another. Cross-tabulated
constraint variables provide a more detailed picture of not only the
prevalence of particular categories, but the relationships between them.
For the purposes of spatial microsimulation,
cross-tabulated variables can be seen as a single variable.
```{r, echo=FALSE}
# (Dumont, 2014). Are you sure about that?
# Because in one way, individual level data can be consider as a big
# cross tabulated data set, but, I don't know... no so convinced
```
If there are 5 age categories and 2 sex categories, this
can be seen as a single constraint variable (age/sex)
with 10 categories. Clearly in terms of retaining
detailed local information (e.g. all young males moving out
of a zone), cross-tabulated variables are preferable.
This raises the question: what happens when a single
variable (e.g. age) is used in multiple cross-tabulated
constraints (e.g. age/sex and age/income)? In this case
spatial microsimulation can still be used,
but the result may not converge
to a single answer because the gender distribution may
be disrupted by the age/income constraint. Further work
is needed to test this but from theory we can be sure:
a 3-way cross-tabulated constraint (age/sex/income)
would be ideal in this case because it
provides more information than 2-way marginal distributions.
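To illustrate how a cross-tabulation can be treated as a single constraint variable, the sketch below builds a combined age/sex factor from two made-up vectors (no project data has been loaded at this point):
```{r}
# Made-up vectors, for illustration only
age_band <- c("a0_49", "a50+", "a0_49")
sex <- c("m", "f", "f")
age_sex <- interaction(age_band, sex, sep = "/") # one factor covering all combinations
table(age_sex) # counts per age/sex category, including empty combinations
```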
```{r, echo=FALSE}
# Todo: define/write-about marginal distributions
# (Dumont, 2014). Pay attention, for the moment it is the first
# time that IPF is mentionned, so the reader do not know what it is
# Done 2014/12/22 RL
```
The level of detail within the linking variables is an important determinant of
the fidelity of the resulting synthetic population. This can be measured
in terms of the number of linking variables (there are 2 in SimpleWorld: age and sex)
or, more usefully, the total number of categories across all linking variables
(there are 4 in SimpleWorld: male, female, young and old). This latter measure
is preferable as it provides a closer indication of the 'bin size' used to
categorise individuals during population synthesis. Still, the nature of both
linking variables and their categories should be considered: an age variable
containing 5 categories may look good on paper. However, if those categories are
defined by the breaks `c(0, 5, 10, 15, 20, 90)`,
the linking variable will not be effective at accurately recreating age
distributions. More evenly distributed categories (such as
`c(0, 20, 40, 60, 80, 110)` to continue the age example) are
preferable.
\index{breaks}
\index{bins}
(As an important aside for R learners, we
use R syntax here to define the *breaks* instead of the more common
'0 to 5', '6 to 10' age category notation because this is how
numerical variables are best categorised in R. To see how this works,
try entering the following into R's command line:
`cut(x = c(20, 21, 90, 35), breaks = c(0, 20, 40, 60, 80, 110))`. The result
is a `factor` variable
as follows: `(0,20] (20,40] (80,110] (20,40] `. This output uses standard
notation for defining bins of continuous data: the `(0, 20]` term means, in
English, "any value from *greater than* zero, up to *and including* 20".
In classic mathematical notation this reads
$0 < x \leq 20$. Note the `]` symbol means 'including'.^[This `[a,b)` category notation follows the International Organization for Standardization (ISO) 80000-2:2009 standard for mathematical notation: Square brackets indicate that the endpoint is included in the set, curved brackets indicate that the endpoint is not included.
])
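To see the effect of unevenly distributed breaks, the sketch below cuts a made-up sample of ages with both sets of breaks discussed above and tabulates the results:
```{r}
set.seed(42) # made-up ages, for illustration only
ages <- sample(1:90, size = 100, replace = TRUE)
table(cut(ages, breaks = c(0, 5, 10, 15, 20, 90)))   # most people land in one wide bin
table(cut(ages, breaks = c(0, 20, 40, 60, 80, 110))) # counts are spread more evenly
```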
\index{cut command}
\index{number of constraints}
Even when measuring constraint variables by categories rather
than the cruder measure of the number of variables, there is still
no consensus about the most desirable number.
@Norman1999a advocates
including the "maximum amount of
information in the constraints", implying that
the more constraint categories the merrier.
@harland2012
warns that over-constraining the model can cause
problems. In any case, there is a clear link between
the quality and quantity of constraint variables used
in the model and the fidelity of the output microdata.
```{r, echo=FALSE}
# TODO: link to the location of a discussion of the empty cell problem
```
\index{base population}
If constraint variables come from different sources, check the coherence
of these datasets. In some cases the total number of individuals will not
be consistent between constraint variables and the procedure will
not work properly. This can happen if constraint variables are measured
using different *base populations* or at different levels. Number of cars
per household, for example, is usually collected at the household level
so will contain lower counts than variables on individuals (e.g. age).
A solution is to set all population totals equal to those of the most
reliable constraint variable by multiplying values in each category by a fixed
amount. However, caution should be taken when using this approach because
it assumes that the relationship between categories is the same
across all levels or base populations. In the case of car ownership, where
larger households are likely to own more cars, this assumption clearly
would not hold; inferring individual level marginals from household level
data would in this case lead to an underestimate of car availability.
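The sketch below illustrates this rescaling. `con_cars` is a hypothetical household level constraint (it is not part of the SimpleWorld data), included only to show the mechanics and the assumption involved:
```{r, eval=FALSE}
# Hypothetical: con_cars holds household counts by car ownership for each zone.
# Rescale each zone so its row total matches the person level total, assuming
# (questionably) the same relationship between categories at both levels
scale_factor <- rowSums(con_age) / rowSums(con_cars)
con_cars_scaled <- sweep(con_cars, 1, scale_factor, "*")
rowSums(con_cars_scaled) == rowSums(con_age) # should now be TRUE for every zone
```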
```{r, echo=FALSE}
# TODO: link to a section where this 'population balancing' happens
```
After suitable constraint variables have been chosen ---
and remember that the constraints can be changed at a later
stage to improve the model or for testing ---
the next stage is to load the data.
Following the SimpleWorld example, we
load the individual level dataset first. This
is because individual level data from surveys
is often
more difficult
to format. Individual level datasets are often larger
and more complex than the constraint data, which
are simply integer counts of different categories.
Of course, it is possible that the
data you have are not suitable for
spatial microsimulation because they lack linking variables.
We assume that you have already checked this. The
checking process for the datasets used in this chapter is simple: both aggregate
and individual level tables contain age (in continuous form
in the microdata, as two categories in the aggregate data)
and sex, so they can be combined.
Loading the data involves transferring information from a
local hard disc into R's *environment*, where
it is available in memory.
\index{R environment}
## Loading input data {#Loading}
\index{loading data}
Real-world individual level data are provided in many formats.
These ultimately need to be loaded into R as a `data.frame` object. Note
that there are many ways to load data into R, including `read.csv()` and
`read.table()` from base R. Useful commands for loading proprietary data formats
include `read_excel()` and `read.spss()` from the **readxl** and **foreign**
packages, respectively.
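For example, assuming hypothetical file names (these files are not part of the SimpleWorld data), proprietary formats could be read in as follows:
```{r, eval=FALSE}
library(readxl)  # for Excel spreadsheets
ind_xl <- read_excel("data/ind.xlsx")
library(foreign) # for SPSS, Stata and other formats
ind_spss <- read.spss("data/ind.sav", to.data.frame = TRUE)
```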
Cleaning the data is more time consuming, and there are also many ways to do this.
In the following example we present
steps needed to load the data underlying the SimpleWorld
example into R.^[All
data and code to replicate the procedures outlined in the explicative chapters of the book are available
publicly from the spatial-microsimulation-book GitHub repository, as described above.
We recommend loading the input data and playing
around with it.]
Some parts of the following explanation are specific to the example data and the
use of Iterative Proportional Fitting (IPF) as a reweighting procedure
(described in the next chapter). Combinatorial optimisation approaches, for
example (as well as careful use of IPF methods, e.g. using the **mipfp** package),
are resilient to differences in the number of individuals according to each
constraint. However, the approach,
functions and principles demonstrated will apply to the majority of real
input datasets for spatial
microsimulation. This section is quite specific to data preparation for
spatial microsimulation; for a more general introduction to data formatting and
cleaning methodology see Wickham (2014). To summarise this information,
the last section of this chapter provides
a check-list of items to ensure that the data has been adequately cleaned
ready for the next phase.
In the SimpleWorld example, the individual level dataset is loaded
from a 'plain text' (human readable) `.csv` file:
```{r}
# Load the individual level data
ind <- read.csv("data/SimpleWorld/ind-full.csv")
class(ind) # verify the data type of the object
ind # print the individual level data
```
```{r, echo=FALSE}
### Loading and checking aggregate level data
```
Constraint data are usually made available one variable at a time,
so these are read in one file at a time:
```{r}
con_age <- read.csv("data/SimpleWorld/age.csv")
con_sex <- read.csv("data/SimpleWorld/sex.csv")
```
We have loaded the aggregate constraints. As with the individual level data, it is
worth inspecting each object to ensure that they make sense before continuing.
Taking a look at `con_age`, we can see that this dataset consists of 2
categories for 3 zones:
```{r}
con_age
```
This tells us that there are 12, 10 and 11 individuals in zones 1, 2 and 3,
respectively, with different proportions of young and old people. Zone 2, for
example, is heavily dominated by older people: there are 8 people aged 50 or over whilst
there are only 2 young people (aged 49 or under) in the zone.
\index{areal data}
Even at this stage there is a potential for errors to be introduced. A classic
mistake with areal (geographically aggregated) data is that the order in which zones are loaded can change from
one table to the next. The constraint data should therefore come with some kind of *zone
id*, an identifying code. This usually consists of a unique
character string or integer that allows the order of different datasets to be
verified and for data linkage using attribute joins. Moreover,
keeping the code associated with each administrative zone
will subsequently allow attribute data to be
combined with polygon shapes and visualised using GIS software.
\index{GIS software}
```{r, echo=FALSE}
# Make the constraint data contain an 'id' column, possibly scrambled
```
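The SimpleWorld constraint files do not contain such an id column, but if they did, the ordering could be verified and, if necessary, fixed along the following lines (a sketch assuming a hypothetical `zone_id` column):
```{r, eval=FALSE}
# Hypothetical check: do both constraint tables list the zones in the same order?
identical(con_age$zone_id, con_sex$zone_id)
# If not, re-order one table to match the other before proceeding
con_sex <- con_sex[match(con_age$zone_id, con_sex$zone_id), ]
```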
If we're sure that the row numbers match between the age and sex tables (we are
sure in this case), the next important test is to check the total
populations of the constraint variables. Ideally both the *total*
study area populations and *row totals* should match. If the *row totals* match,
this is a very good sign that not only confirms that the zones are listed in the
same order, but also that each variable is sampling from the same *population
base*. These tests are conducted in the following lines of code:
```{r}
sum(con_age)
sum(con_sex)
rowSums(con_age)
rowSums(con_sex)
rowSums(con_age) == rowSums(con_sex)
```
The results of the previous operations are encouraging. The total population is
the same for each constraint overall and for each area (row) for both
constraints. If the total populations between constraint variables do not match
(e.g. because the *population bases*^[see
the Glossary for a description of the population base.] are different) this is problematic.
Appropriate steps to normalise the errant constraint variables are described in
the CakeMap chapter (\ref{CakeMap}). This involves scaling the category counts
so that the zone totals are equal across all constraints.
## Subsetting to remove excess information {#subsetting-prep}
\index{subsetting data}
In the above code, `data.frame` objects containing precisely the information
required for the next stage were loaded. More often, superfluous information
will need to be removed from the data and subsets taken. It is worth removing
superfluous variables early, to avoid over-complicating and slowing-down the
analysis. For example, if `ind` had 100 variables of which only the 1st, 3rd and 19th were of
interest, the following command could be used to update the object. Note that
only the relevant variables, corresponding to the first, third and nineteenth
columns, are
retained:^[An
alternative way to remove excess variables is to use `NULL` assignment to remove columns. `ind$age <- NULL`, for example,
would remove the age
variable. The minus sign can also be used to remove specific rows or columns,
as illustrated with the syntax `[, -4]` below.]
```{r, eval=FALSE}
ind <- ind[, c(1, 3, 19)]
```
In the SimpleWorld dataset, only the `age` and `sex` variables are useful
for reweighting: we can remove the others for the purposes of allocating
individuals to zones. Note that it is important to keep track of individual IDs,
to ensure individuals do not get mixed-up by a function that changes their
order (`merge()`, for example, is a function that can cause havoc by changing
the order of rows).
Before removing the superfluous `income` variable, we will create a backup of
`ind` that can be referred back to if necessary^[For example, once a
spatial microdataset replicating the individuals has been created, income
can be re-assigned to each generated individual.]:
```{r}
ind_orig <- ind # store the original ind dataset
ind <- ind[, -4] # remove income variable
```
Although `ind` in this case is small, it will behave in the same way as
larger datasets. Starting small is sensible, providing
opportunities for testing subsetting syntax in R. It
is common, for example, to take a subset of the working *population base*: those
aged between 16 and 74 in full-time employment. Methods for doing this are provided in
the Appendix
(\@ref(subsetting)).
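As a minimal illustration with the SimpleWorld data, which has no employment variable, we can subset on age alone, using the backup object that retains age in continuous form:
```{r}
# Keep only individuals aged 16 to 74, a common working age population base
ind_working <- ind_orig[ind_orig$age >= 16 & ind_orig$age <= 74, ]
nrow(ind_working) # number of individuals remaining
```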
## Re-categorising individual level variables {#re-categorise}
\index{recategorising data}
Before transforming the individual level dataset `ind` into a form that can be
compared with the aggregate level constraints, we must ensure that each dataset
contains the same information. It can be more challenging to re-categorise
individual level variables than to re-name or combine aggregate level variables,
so the former should usually be tackled first. An obvious difference between the
individual and aggregate versions of the `age` variable is that the former is of
type `integer` whereas the latter is composed of discrete bins: 0 to 49 and 50+.
We can categorise the variable into these bins using
`cut()`:
```{r, echo=FALSE}
# ^[The
# combination of curved and square brackets in the output from the `cut` function
# may seem strange but this is
# an International Standard: curved brackets mean 'excluding' and square brackets
# mean 'including'. The output `(49, 120]`, for example, means 'from
# 49 (excluding 49, but including 49.001) to 120 (including the exact value 120)'.
# See http://en.wikipedia.org/wiki/ISO_31-11 for more
# on international standards on mathematical notation.]
```
```{r}
# Test binning the age variable
brks <- c(0, 49, 120) # set break points from 0 to 120 years
cut(ind$age, breaks = brks) # bin the age variable
```
Note that the output of the above `cut()` command is correct,
with individuals binned into one of two bins, but that the labels
are rather
strange.
To change these category labels to something more readable
for people who do not read ISO standards for mathematical notation
(most people!),
we add another argument, `labels`, to the `cut()` function:
```{r}
# Convert age into a categorical variable
labs <- c("a0_49", "a50+") # create the labels
cut(ind$age, breaks = brks, labels = labs)
```
The factor generated now has satisfactory labels: they match the column headings
of the age constraint, so we will save the result. (Note, we are not
losing any information at this stage because we have saved the original
`ind` object as `ind_orig` for future reference.)
```{r}
# Overwrite the age variable with categorical age bands
ind$age <- cut(ind$age, breaks = brks, labels = labs)
```
Users should beware that `cut` results in a vector of class *factor*.
This can cause problems in subsequent steps if the order of constraint
column headings is different from the order of the factor labels, as
we will see in the next section.
```{r, echo=FALSE}
# TODO (RL): add reference to ipf performance paper
```
## Matching individual and aggregate level data names {#matching}
\index{matching variables}
Before combining the newly re-categorised individual level data with the
aggregate constraints, it is useful for the category labels to match up.
This may seem trivial, but will save time in the long run. Here is the problem:
```{r}
levels(ind$age)
names(con_age)
```
Note that the names are subtly different. To solve this issue, we can
simply change the names of the constraint variable, after verifying they
are in the correct order:
```{r}
names(con_age) <- levels(ind$age) # rename aggregate variables
```
With both the age and sex constraint variable names now matching the
category labels of the individual level data, we can proceed to create a
single constraint object we label `cons`. We do this with `cbind()`:
```{r}
cons <- cbind(con_age, con_sex)
cons[1:2, ] # display the constraints for the first two zones
```
## 'Flattening' the individual level data {#flattening}
\index{flattening}
We have made progress towards combining the individual and aggregate datasets and
now only need to deal with 2 objects (`ind` and `cons`), which share
category and variable names.
However, these datasets cannot yet be compared directly because they
measure very different things. The `ind` dataset records
the value that each individual takes for a range of variables, whereas
`cons` counts the number of individuals in different groups at
the geographical level. These datasets have different dimensions:
```{r}
dim(ind)
dim(cons)
```
The above code confirms this: we have one individual level dataset comprising 5
individuals with 3 variables (2 of which are constraint variables and the other an ID) and one
aggregate level constraint table called `cons`, representing 3 zones
with count data for 4 categories across 2 variables.
The dimensions of at least one of these objects must change
before they can be easily compared. To do this
we 'flatten' the individual level dataset.
This means increasing its width so each column becomes a category
name. This allows the individual data to be matched
to the geographical constraint data. In this new dataset (which we
label `ind_cat`, short for 'categorical'),
each category becomes a column of Boolean values
(either 1 or 0, representing whether the individual belongs to that
category or not).
Note that each row in `ind_cat` must contain a one for each constraint variable;
the sum of every row in `ind_cat` should be equal to the number of constraints
(this can be verified with `rowSums(ind_cat)`).
To undertake this 'flattening' process the
`model.matrix()` function is used to expand each variable in turn.
The result for each variable is a new matrix with the same number of columns as
there are categories in the variable. Note that the order of columns is usually
alphabetical: this can cause problems if the columns in the constraint tables
are not ordered in this way. @knoblauch2012modeling provide a lengthier description of this
flattening process.
The second stage is to use the `colSums()` function
to take the sum of each column.^[As we shall see in Section \ref{ipfp},
only the former of these is needed if we use the
**ipfp** package for re-weighting the data, but both are presented to enable
a better understanding of how IPF works.]
```{r}
cat_age <- model.matrix(~ ind$age - 1)
cat_sex <- model.matrix(~ ind$sex - 1)[, c(2, 1)]
# Combine age and sex category columns into single data frame
(ind_cat <- cbind(cat_age, cat_sex)) # brackets -> print result
```
Note that the second call to `model.matrix` is suffixed with `[, c(2, 1)]`.
This is to swap the order of the columns: the column order produced
by `model.matrix` is alphabetical, whereas the order in which the variables
have been saved in the constraints object `cons` is `male` then `female`.
Such subtleties can be hard to notice yet completely change one's results,
so be warned: the output from `model.matrix` will not always be compatible
with the constraint variables.
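As a quick check of the point made earlier, each row of `ind_cat` should sum to the number of constraint variables, and the matrix should have as many columns as `cons`:
```{r}
rowSums(ind_cat)            # each individual is counted once per constraint variable
ncol(ind_cat) == ncol(cons) # same number of category columns as the constraints
```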
To check that the code worked properly,
let's count the number of individuals
represented in the new `ind_cat` variable, using `colSums`:
```{r}
colSums(ind_cat) # view the aggregated version of ind
ind_agg <- colSums(ind_cat) # save the result
```
The sum of both age and sex variables is 5
(the total number of individuals): it worked!
Looking at `ind_agg`, it is also clear that the object has the same 'width',
or number of columns, as
`cons`. This means that the individual level data can now be compared with
the aggregate level data. We can check this by inspecting
each object (e.g. via `View(ind_agg)`). A more rigorous test is to see
if `cons` can be combined with `ind_agg`, using `rbind`:
```{r}
rbind(cons[1,], ind_agg) # test compatibility of ind_agg and cons
```
If no error message is displayed on your computer, the answer is yes.
This shows us a direct comparison between the number of people in each
category of the constraint variables in zone 1 and in the individual level
dataset overall. Clearly, this is a very small example
with only 5 individuals
in total existing in `ind_agg` (the total for each constraint) and 12
in zone 1. We can measure the size of this difference using measures of
*goodness of fit*. A simple measure is total absolute error (TAE), calculated in this
case as `sum(abs(cons[1,] - ind_agg))`: the sum of the absolute differences
between cell values in the individual and aggregate level data.
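For zone 1 this can be computed directly:
```{r}
# Total absolute error between zone 1's constraints and the aggregated individuals
sum(abs(cons[1, ] - ind_agg))
```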
The purpose of the *reweighting* procedure in spatial microsimulation is
to minimise this difference (as measured in TAE above)
by assigning high weights to the most representative individuals.
## Chapter summary
To summarise, we have learned about and implemented
methods for loading and preparing input data for spatial microsimulation
in this chapter.
The following checklist outlines the main features of the input datasets
to ensure they are ready for reweighting, covered in the next chapter:
1. Both the individual level (target) and constraint datasets are loaded in R:
in the former, rows correspond to
individuals; in the latter, rows correspond to
spatial zones.
2. The categories of the constraint variables in the individual level dataset
are identical to the column names of the constraint
variables.
(An example of the process needed to arrive at this state is the
conversion of a continuous age variable in
the individual level dataset into a categorical variable to match the
constraint data.)
3. The structure of the individual and aggregate level datasets must
eventually be the same: the column names of `cons` and `ind_cat` in the above
example are the categories
of the constraint variables. `ind_cat` is the
binary (or Boolean, containing 0s and 1s) version of the individual level dataset.
This allows creation of an aggregated version of the individual level
data that has the same dimensions as (and is comparable with)
the constraint data for each zone.
4. The total population of each zone is represented as the sum of the counts for
each constraint variable.
Note that in much research requiring the analysis of complex data,
the collection, cleaning and loading of the data can consume the majority of the
project's time. This applies as much to
spatial microsimulation as to any other data analysis task and is an essential
stage before moving on to the more exciting analysis and modelling.
In the next chapter we
progress to allocate weights to each individual in our sample, continuing
with the example of SimpleWorld.
```{r, echo=FALSE}
# save.image("cache-data-prep.RData") # save data for next chapter
```