forked from geocompx/geocompr
-
Notifications
You must be signed in to change notification settings - Fork 0
/
03-attribute-operations.Rmd
781 lines (615 loc) · 43.3 KB
/
03-attribute-operations.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
# Attribute data operations {#attr}
## Prerequisites {-}
- This chapter requires the following packages to be installed and attached:
```{r 03-attribute-operations-1, message=FALSE}
library(sf) # vector data package introduced in Chapter 2
library(terra) # raster data package introduced in Chapter 2
library(dplyr) # tidyverse package for data frame manipulation
```
- It relies on **spData**, which loads datasets used in the code examples of this chapter:
```{r 03-attribute-operations-2, results='hide'}
library(spData) # spatial data package introduced in Chapter 2
```
- Also ensure you have installed the **tidyr** package, or the **tidyverse** of which it is a part, if you want to run data 'tidying' operations in Section \@ref(vec-attr-creation):
## Introduction
Attribute data is non-spatial information associated with geographic (geometry) data.
A bus stop provides a simple example: its position would typically be represented by latitude and longitude coordinates (geometry data), in addition to its name.
The [Elephant & Castle / New Kent Road](https://www.openstreetmap.org/relation/6610626) stop in London, for example has coordinates of -0.098 degrees longitude and 51.495 degrees latitude which can be represented as `POINT (-0.098 51.495)` in the `sfc` representation described in Chapter \@ref(spatial-class).
Attributes such as the name *attribute*\index{attribute} of the POINT feature (to use Simple Features terminology) are the topic of this chapter.
```{r, eval=FALSE, echo=FALSE}
# Aim: find a bus stop in central London
library(osmdata)
london_coords = c(-0.1, 51.5)
london_bb = c(-0.11, 51.49, -0.09, 51.51)
bb = tmaptools::bb(london_bb)
osm_data = opq(bbox = london_bb) %>%
add_osm_feature(key = "highway", value = "bus_stop") %>%
osmdata_sf()
osm_data_points = osm_data$osm_points
osm_data_points[4, ]
point_vector = round(sf::st_coordinates(osm_data_points[4, ]), 3)
point_df = data.frame(name = "London bus stop", point_vector)
point_sf = sf::st_as_sf(point_df, coords = c("X", "Y"))
```
Another example is the elevation value (attribute) for a specific grid cell in raster data.
Unlike the vector data model, the raster data model stores the coordinate of the grid cell indirectly, meaning the distinction between attribute and spatial information is less clear.
To illustrate the point, think of a pixel in the 3^rd^ row and the 4^th^ column of a raster matrix.
Its spatial location is defined by its index in the matrix: move from the origin four cells in the x direction (typically east and right on maps) and three cells in the y direction (typically south and down).
The raster's *resolution* defines the distance for each x- and y-step which is specified in a *header*.
The header is a vital component of raster datasets which specifies how pixels relate to geographic coordinates (see also Chapter \@ref(spatial-operations)).
This teaches how to manipulate geographic objects based on attributes such as the names of bus stops in a vector dataset and elevations of pixels in a raster dataset.
For vector data, this means techniques such as subsetting and aggregation (see Sections \@ref(vector-attribute-subsetting) and \@ref(vector-attribute-aggregation)).
Sections \@ref(vector-attribute-joining) and \@ref(vec-attr-creation) demonstrate how to join data onto simple feature objects using a shared ID and how to create new variables, respectively.
Each of these operations has a spatial equivalent:
the `[` operator in base R, for example, works equally for subsetting objects based on their attribute and spatial objects; you can also join attributes in two geographic datasets using spatial joins.
This is good news: skills developed in this chapter are cross-transferable.
Chapter \@ref(spatial-operations) extends the methods presented here to the spatial world.
After a deep dive into various types of *vector* attribute operations in the next section, *raster* attribute data operations are covered in Section \@ref(manipulating-raster-objects), which demonstrates how to create raster layers containing continuous and categorical attributes and extracting cell values from one or more layer (raster subsetting).
Section \@ref(summarizing-raster-objects) provides an overview of 'global' raster operations which can be used to summarize entire raster datasets.
## Vector attribute manipulation
Geographic vector datasets are well supported in R thanks to the `sf` class, which extends base R's `data.frame`.
Like data frames, `sf` objects have one column per attribute variable (such as 'name') and one row per observation or *feature* (e.g., per bus station).
`sf` objects differ from basic data frames because they have a `geometry` column of class `sfc` which can contain a range of geographic entities (single and 'multi' point, line, and polygon features) per row.
This was described in Chapter \@ref(spatial-class), which demonstrated how *generic methods* such as `plot()` and `summary()` work with `sf` objects.
**sf** also provides generics that allow `sf` objects to behave like regular data frames, as shown by printing the class's methods:
```{r 03-attribute-operations-3, eval=FALSE}
methods(class = "sf") # methods for sf objects, first 12 shown
```
```{r 03-attribute-operations-4}
#> [1] aggregate cbind coerce
#> [4] initialize merge plot
#> [7] print rbind [
#> [10] [[<- $<- show
```
```{r 03-attribute-operations-5, eval=FALSE, echo=FALSE}
# Another way to show sf methods:
attributes(methods(class = "sf"))$info %>%
dplyr::filter(!visible)
```
Many of these (`aggregate()`, `cbind()`, `merge()`, `rbind()` and `[`) are for manipulating data frames.
`rbind()`, for example, is binds rows two data frames together, one 'on top' of the other.
`$<-` creates new columns.
A key feature of `sf` objects is that they store spatial and non-spatial data in the same way, as columns in a `data.frame`.
```{block2 03-attribute-operations-6, type = 'rmdnote'}
The geometry column of `sf` objects is typically called `geometry` or `geom` but any name can be used.
The following command, for example, creates a geometry column named g:
`st_sf(data.frame(n = world$name_long), g = world$geom)`
This enables geometries imported from spatial databases to have a variety of names such as `wkb_geometry` and `the_geom`.
```
`sf` objects can also extend the `tidyverse` classes for data frames, `tibble` and `tbl`.
\index{tidyverse (package)}.
Thus **sf** enables the full power of R's data analysis capabilities to be unleashed on geographic data, whether you use base R or tidyverse functions for data analysis.
\index{tibble}
**sf** objects can also be used with the high-performance data processing package **data.table** although, as documented in the issue [`Rdatatable/data.table#2273`](https://github.com/Rdatatable/data.table/issues/2273), is not fully [compatible](https://github.com/Rdatatable/data.table/issues/5352) with `sf` objects.
Before using these capabilities it is worth re-capping how to discover the basic properties of vector data objects.
Let's start by using base R functions to learn about the `world` dataset from the **spData** package:
```{r 03-attribute-operations-7}
class(world) # it's an sf object and a (tidy) data frame
dim(world) # it is a 2 dimensional object, with 177 rows and 11 columns
```
`world` contains ten non-geographic columns (and one geometry list column) with almost 200 rows representing the world's countries.
The function `st_drop_geometry()` keeps only the attributes data of an `sf` object, in other words removing its geometry:
```{r 03-attribute-operations-8}
world_df = st_drop_geometry(world)
class(world_df)
ncol(world_df)
```
Dropping the geometry column before working with attribute data can be useful; data manipulation processes can run faster when they work only on the attribute data and geometry columns are not always needed.
For most cases, however, it makes sense to keep the geometry column, explaining why the column is 'sticky' (it remains after most attribute operations unless specifically dropped).
Non-spatial data operations on `sf` objects only change an object's geometry when appropriate (e.g., by dissolving borders between adjacent polygons following aggregation).
Becoming skilled at geographic attribute data manipulation means becoming skilled at manipulating data frames.
For many applications, the tidyverse\index{tidyverse (package)} package **dplyr** offers an effective approach for working with data frames.
Tidyverse compatibility is an advantage of **sf** over its predecessor **sp**, but there are some pitfalls to avoid (see the supplementary `tidyverse-pitfalls` vignette at [geocompr.github.io](https://geocompr.github.io/geocompkg/articles/tidyverse-pitfalls.html) for details).
### Vector attribute subsetting
Base R subsetting methods include the operator `[` and the function `subset()`.
The key **dplyr** subsetting functions are `filter()` and `slice()` for subsetting rows, and `select()` for subsetting columns.
Both approaches preserve the spatial components of attribute data in `sf` objects, while using the operator `$` or the **dplyr** function `pull()` to return a single attribute column as a vector will lose the attribute data, as we will see.
\index{attribute!subsetting}
This section focuses on subsetting `sf` data frames; for further details on subsetting vectors and non-geographic data frames we recommend reading section section [2.7](https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Index-vectors) of An Introduction to R [@rcoreteam_introduction_2021] and Chapter [4](https://adv-r.hadley.nz/subsetting.html) of Advanced R Programming [@wickham_advanced_2019], respectively.
The `[` operator can subset both rows and columns.
Indices placed inside square brackets placed directly after a data frame object name specify the elements to keep.
The command `object[i, j]` means 'return the rows represented by `i` and the columns represented by `j`, where `i` and `j` typically contain integers or `TRUE`s and `FALSE`s (indices can also be character strings, indicating row or column names).
`object[5, 1:3]`, for example, means 'return data containing the 5th row and columns 1 to 3: the result should be a data frame with only 1 row and 3 columns, and a fourth geometry column if it's an `sf` object.
Leaving `i` or `j` empty returns all rows or columns, so `world[1:5, ]` returns the first five rows and all 11 columns.
The examples below demonstrate subsetting with base R.
Guess the number of rows and columns in the `sf` data frames returned by each command and check the results on your own computer (see the end of the chapter for more exercises):
```{r 03-attribute-operations-9, eval=FALSE}
world[1:6, ] # subset rows by position
world[, 1:3] # subset columns by position
world[1:6, 1:3] # subset rows and columns by position
world[, c("name_long", "pop")] # columns by name
world[, c(T, T, F, F, F, F, F, T, T, F, F)] # by logical indices
world[, 888] # an index representing a non-existent column
```
```{r, eval=FALSE, echo=FALSE}
# these fail
world[c(1, 5), c(T, T)]
world[c(1, 5), c(T, T, F, F, F, F, F, T, T, F, F, F)]
```
A demonstration of the utility of using `logical` vectors for subsetting is shown in the code chunk below.
This creates a new object, `small_countries`, containing nations whose surface area is smaller than 10,000 km^2^:
```{r 03-attribute-operations-10}
i_small = world$area_km2 < 10000
summary(i_small) # a logical vector
small_countries = world[i_small, ]
```
The intermediary `i_small` (short for index representing small countries) is a logical vector that can be used to subset the seven smallest countries in the `world` by surface area.
A more concise command, which omits the intermediary object, generates the same result:
```{r 03-attribute-operations-11}
small_countries = world[world$area_km2 < 10000, ]
```
The base R function `subset()` provides another way to achieve the same result:
```{r 03-attribute-operations-12, eval=FALSE}
small_countries = subset(world, area_km2 < 10000)
```
Base R functions are mature, stable and widely used, making them a rock solid choice, especially in contexts where reproducibility and reliability are key.
**dplyr** functions enable 'tidy' workflows which some people (the authors of this book included) find intuitive and productive for interactive data analysis, especially when combined with code editors such as RStudio that enable [auto-completion](https://support.rstudio.com/hc/en-us/articles/205273297-Code-Completion-in-the-RStudio-IDE) of column names.
Key functions for subsetting data frames (including `sf` data frames) with **dplyr** functions are demonstrated below.
<!-- The sentence below seems to be untrue based on the benchmark below. -->
<!-- `dplyr` is also faster than base R for some operations, due to its C++\index{C++} backend. -->
<!-- Something on dbplyr? I've never seen anyone use it regularly for spatial data 'in the wild' so leaving out the bit on integration with dbs for now (RL 2021-10) -->
<!-- The main **dplyr** subsetting functions are `select()`, `slice()`, `filter()` and `pull()`. -->
```{r, echo=FALSE, eval=FALSE}
# Aim: benchmark base vs dplyr subsetting
# Could move elsewhere?
i = sample(nrow(world), size = 10)
benchmark_subset = bench::mark(
world[i, ],
world %>% slice(i)
)
benchmark_subset[c("expression", "itr/sec", "mem_alloc")]
# # October 2021 on laptop with CRAN version of dplyr:
# # A tibble: 2 × 3
# expression `itr/sec` mem_alloc
# <bch:expr> <dbl> <bch:byt>
# 1 world[i, ] 1744. 5.55KB
# 2 world %>% slice(i) 671. 4.45KB
```
`select()` selects columns by name or position.
For example, you could select only two columns, `name_long` and `pop`, with the following command:
```{r 03-attribute-operations-14}
world1 = dplyr::select(world, name_long, pop)
names(world1)
```
Note: as with the equivalent command in base R (`world[, c("name_long", "pop")]`), the sticky `geom` column remains.
`select()` also allows selecting a range of columns with the help of the `:` operator:
```{r 03-attribute-operations-15}
# all columns between name_long and pop (inclusive)
world2 = dplyr::select(world, name_long:pop)
```
You can remove specific columns with the `-` operator:
```{r 03-attribute-operations-16}
# all columns except subregion and area_km2 (inclusive)
world3 = dplyr::select(world, -subregion, -area_km2)
```
Subset and rename columns at the same time with the `new_name = old_name` syntax:
```{r 03-attribute-operations-17}
world4 = dplyr::select(world, name_long, population = pop)
```
It is worth noting that the command above is more concise than base R equivalent, which requires two lines of code:
```{r 03-attribute-operations-18, eval=FALSE}
world5 = world[, c("name_long", "pop")] # subset columns by name
names(world5)[names(world5) == "pop"] = "population" # rename column manually
```
`select()` also works with 'helper functions' for more advanced subsetting operations, including `contains()`, `starts_with()` and `num_range()` (see the help page with `?select` for details).
Most **dplyr** verbs return a data frame, but you can extract a single column as a vector with `pull()`.
<!-- Note: I have commented out the statement below because it is not true for `sf` objects, it's a bit confusing that the behaviour differs between data frames and `sf` objects. -->
<!-- The subsetting operator in base R (see `?[`), by contrast, tries to return objects in the lowest possible dimension. -->
<!-- This means selecting a single column returns a vector in base R as demonstrated in code chunk below which returns a numeric vector representing the population of countries in the `world`: -->
You can get the same result in base R with the list subsetting operators `$` and `[[`, the three following commands return the same numeric vector:
```{r 03-attribute-operations-21, eval = FALSE}
pull(world, pop)
world$pop
world[["pop"]]
```
<!-- Commenting out the following because it's confusing and covered better in other places (RL, 2021-10) -->
<!-- To turn off this behavior, set the `drop` argument to `FALSE`, -->
```{r 03-attribute-operations-19, eval=FALSE, echo=FALSE}
# create throw-away data frame
d = data.frame(pop = 1:10, area = 1:10)
# return data frame object when selecting a single column
d[, "pop", drop = FALSE] # equivalent to d["pop"]
select(d, pop)
# return a vector when selecting a single column
d[, "pop"]
pull(d, pop)
```
```{r 03-attribute-operations-20, echo=FALSE, eval=FALSE}
x1 = d[, "pop", drop = FALSE] # equivalent to d["pop"]
x2 = d["pop"]
identical(x1, x2)
```
`slice()` is the row-equivalent of `select()`.
The following code chunk, for example, selects rows 1 to 6:
```{r 03-attribute-operations-22, eval=FALSE}
slice(world, 1:6)
```
`filter()` is **dplyr**'s equivalent of base R's `subset()` function.
It keeps only rows matching given criteria, e.g., only countries with and area below a certain threshold, or with a high average of life expectancy, as shown in the following examples:
```{r 03-attribute-operations-23, eval=FALSE}
world7 = filter(world ,area_km2 < 10000) # countries with a small area
world7 = filter(world, lifeExp > 82) # with high life expectancy
```
The standard set of comparison operators can be used in the `filter()` function, as illustrated in Table \@ref(tab:operators):
```{r operators0, echo=FALSE}
if (knitr::is_html_output()){
operators = c("`==`", "`!=`", "`>`, `<`", "`>=`, `<=`", "`&`, <code>|</code>, `!`")
} else {
operators = c("==", "!=", ">, <", ">=, <=", "&, |, !")
}
```
```{r operators, echo=FALSE}
operators_exp = c("Equal to", "Not equal to", "Greater/Less than",
"Greater/Less than or equal",
"Logical operators: And, Or, Not")
knitr::kable(tibble(Symbol = operators, Name = operators_exp),
caption = paste("Comparison operators that return Booleans",
"(TRUE/FALSE)."),
caption.short = "Comparison operators that return Booleans.",
booktabs = TRUE)
```
### Chaining commands with pipes
Key to workflows using **dplyr** functions is the ['pipe'](http://r4ds.had.co.nz/pipes.html) operator `%>%` (or since R `4.1.0` the native pipe `|>`), which takes its name from the Unix pipe `|` [@grolemund_r_2016].
Pipes enable expressive code: the output of a previous function becomes the first argument of the next function, enabling *chaining*.
This is illustrated below, in which only countries from Asia are filtered from the `world` dataset, next the object is subset by columns (`name_long` and `continent`) and the first five rows (result not shown).
```{r 03-attribute-operations-24}
world7 = world %>%
filter(continent == "Asia") %>%
dplyr::select(name_long, continent) %>%
slice(1:5)
```
The above chunk shows how the pipe operator allows commands to be written in a clear order:
the above run from top to bottom (line-by-line) and left to right.
An alternative to piped operations is nested function calls, which are harder to read:
```{r 03-attribute-operations-25}
world8 = slice(
dplyr::select(
filter(world, continent == "Asia"),
name_long, continent),
1:5)
```
Another alternative is to split the operations into multiple self-contained lines, which is recommended when developing new R packages, an approach which has the advantage of saving intermediate results with distinct names which can be later inspected for debugging purposes (an approach which has disadvantages of being verbose and cluttering the global environment when undertaking interactive analysis):
```{r 03-attribute-operations-25-2}
world9_filtered = filter(world, continent == "Asia")
world9_selected = dplyr::select(world9_filtered, continent)
world9 = slice(world9_selected, 1:5)
```
Each approach has advantages and disadvantages, the importance of which depend on your programming style and applications.
For interactive data analysis, the focus of this chapter, we find piped operations fast and intuitive, especially when combined with [RStudio](https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts-in-the-RStudio-IDE)/[VSCode](https://github.com/REditorSupport/vscode-R/wiki/Keyboard-shortcuts) shortcuts for creating pipes and [auto-completing](https://support.rstudio.com/hc/en-us/articles/205273297-Code-Completion-in-the-RStudio-IDE) variable names.
### Vector attribute aggregation
\index{attribute!aggregation}
\index{aggregation}
Aggregation involves summarizing data with one or more 'grouping variables', typically from columns in the data frame to be aggregated (geographic aggregation is covered in the next chapter).
An example of attribute aggregation is calculating the number of people per continent based on country-level data (one row per country).
The `world` dataset contains the necessary ingredients: the columns `pop` and `continent`, the population and the grouping variable, respectively.
The aim is to find the `sum()` of country populations for each continent, resulting in a smaller data frame (aggregation is a form of data reduction and can be a useful early step when working with large datasets).
This can be done with the base R function `aggregate()` as follows:
```{r 03-attribute-operations-26}
world_agg1 = aggregate(pop ~ continent, FUN = sum, data = world,
na.rm = TRUE)
class(world_agg1)
```
The result is a non-spatial data frame with six rows, one per continent, and two columns reporting the name and population of each continent (see Table \@ref(tab:continents) with results for the top 3 most populous continents).
`aggregate()` is a [generic function](https://adv-r.hadley.nz/s3.html#s3-methods) which means that it behaves differently depending on its inputs.
**sf** provides the method `aggregate.sf()` which is activated automatically when `x` is an `sf` object and a `by` argument is provided:
```{r 03-attribute-operations-27}
world_agg2 = aggregate(world["pop"], list(world$continent), FUN = sum,
na.rm = TRUE)
class(world_agg2)
nrow(world_agg2)
```
The resulting `world_agg2` object is a spatial object containing 8 features representing the continents of the world (and the open ocean).
`group_by() %>% summarize()` is the **dplyr** equivalent of `aggregate()`, with the variable name provided in the `group_by()` function specifying the grouping variable and information on what is to be summarized passed to the `summarize()` function, as shown below:
```{r 03-attribute-operations-28}
world_agg3 = world %>%
group_by(continent) %>%
summarize(pop = sum(pop, na.rm = TRUE))
```
The approach may seem more complex but it has benefits: flexibility, readability, and control over the new column names.
This flexibility is illustrated in the command below, which calculates not only the population but also the area and number of countries in each continent:
```{r 03-attribute-operations-29}
world_agg4 = world %>%
group_by(continent) %>%
summarize(pop = sum(pop, na.rm = TRUE), `area_sqkm` = sum(area_km2), n = n())
```
In the previous code chunk `pop`, `area_sqkm` and `n` are column names in the result, and `sum()` and `n()` were the aggregating functions.
These aggregating functions return `sf` objects with rows representing continents and geometries containing the multiple polygons representing each land mass and associated islands (this works thanks to the geometric operation 'union', as explained in Section \@ref(geometry-unions)).
Let's combine what we have learned so far about **dplyr** functions, by chaining multiple commands to summarize attribute data about countries worldwide by continent.
The following command calculates population density (with `mutate()`), arranges continents by the number countries they contain (with `dplyr::arrange()`), and keeps only the 3 most populous continents (with `top_n()`), the result of which is presented in Table \@ref(tab:continents)):
```{r 03-attribute-operations-30}
world_agg5 = world %>%
st_drop_geometry() %>% # drop the geometry for speed
dplyr::select(pop, continent, area_km2) %>% # subset the columns of interest
group_by(continent) %>% # group by continent and summarize:
summarize(Pop = sum(pop, na.rm = TRUE), Area = sum(area_km2), N = n()) %>%
mutate(Density = round(Pop / Area)) %>% # calculate population density
top_n(n = 3, wt = Pop) %>% # keep only the top 3
arrange(desc(N)) # arrange in order of n. countries
```
```{r continents, echo=FALSE}
options(scipen = 999)
knitr::kable(
world_agg5,
caption = "The top 3 most populous continents ordered by number of countries.",
caption.short = "Top 3 most populous continents.",
booktabs = TRUE
)
```
```{block2 03-attribute-operations-31, type='rmdnote'}
More details are provided in the help pages (which can be accessed via `?summarize` and `vignette(package = "dplyr")` and Chapter 5 of [R for Data Science](http://r4ds.had.co.nz/transform.html#grouped-summaries-with-summarize).
```
### Vector attribute joining
Combining data from different sources is a common task in data preparation.
Joins do this by combining tables based on a shared 'key' variable.
**dplyr** has multiple join functions including `left_join()` and `inner_join()` --- see `vignette("two-table")` for a full list.
These function names follow conventions used in the database language [SQL](http://r4ds.had.co.nz/relational-data.html) [@grolemund_r_2016, Chapter 13]; using them to join non-spatial datasets to `sf` objects is the focus of this section.
**dplyr** join functions work the same on data frames and `sf` objects, the only important difference being the `geometry` list column.
The result of data joins can be either an `sf` or `data.frame` object.
The most common type of attribute join on spatial data takes an `sf` object as the first argument and adds columns to it from a `data.frame` specified as the second argument.
\index{join}
\index{attribute!join}
To demonstrate joins, we will combine data on coffee production with the `world` dataset.
The coffee data is in a data frame called `coffee_data` from the **spData** package (see `?coffee_data` for details).
It has 3 columns:
`name_long` names major coffee-producing nations and `coffee_production_2016` and `coffee_production_2017` contain estimated values for coffee production in units of 60-kg bags in each year.
A 'left join', which preserves the first dataset, merges `world` with `coffee_data`:
```{r 03-attribute-operations-32, warning=FALSE}
world_coffee = left_join(world, coffee_data)
class(world_coffee)
```
Because the input datasets share a 'key variable' (`name_long`) the join worked without using the `by` argument (see `?left_join` for details).
The result is an `sf` object identical to the original `world` object but with two new variables (with column indices 11 and 12) on coffee production.
This can be plotted as a map, as illustrated in Figure \@ref(fig:coffeemap), generated with the `plot()` function below:
```{r coffeemap, fig.cap="World coffee production (thousand 60-kg bags) by country, 2017. Source: International Coffee Organization.", fig.scap="World coffee production by country."}
names(world_coffee)
plot(world_coffee["coffee_production_2017"])
```
For joining to work, a 'key variable' must be supplied in both datasets.
By default **dplyr** uses all variables with matching names.
In this case, both `world_coffee` and `world` objects contained a variable called `name_long`, explaining the message `Joining, by = "name_long"`.
In the majority of cases where variable names are not the same, you have two options:
1. Rename the key variable in one of the objects so they match.
2. Use the `by` argument to specify the joining variables.
The latter approach is demonstrated below on a renamed version of `coffee_data`:
```{r 03-attribute-operations-33, warning=FALSE}
coffee_renamed = rename(coffee_data, nm = name_long)
world_coffee2 = left_join(world, coffee_renamed, by = c(name_long = "nm"))
```
```{r 03-attribute-operations-34, eval=FALSE, echo=FALSE}
identical(world_coffee, world_coffee2)
nrow(world)
nrow(world_coffee)
```
Note that the name in the original object is kept, meaning that `world_coffee` and the new object `world_coffee2` are identical.
Another feature of the result is that it has the same number of rows as the original dataset.
Although there are only 47 rows of data in `coffee_data`, all 177 country records are kept intact in `world_coffee` and `world_coffee2`:
rows in the original dataset with no match are assigned `NA` values for the new coffee production variables.
What if we only want to keep countries that have a match in the key variable?
In that case an inner join can be used:
```{r 03-attribute-operations-35, warning=FALSE}
world_coffee_inner = inner_join(world, coffee_data)
nrow(world_coffee_inner)
```
Note that the result of `inner_join()` has only 45 rows compared with 47 in `coffee_data`.
What happened to the remaining rows?
We can identify the rows that did not match using the `setdiff()` function as follows:
```{r 03-attribute-operations-36}
setdiff(coffee_data$name_long, world$name_long)
```
The result shows that `Others` accounts for one row not present in the `world` dataset and that the name of the `Democratic Republic of the Congo` accounts for the other:
it has been abbreviated, causing the join to miss it.
The following command uses a string matching (regex) function from the **stringr** package to confirm what `Congo, Dem. Rep. of` should be:
```{r 03-attribute-operations-37}
(drc = stringr::str_subset(world$name_long, "Dem*.+Congo"))
```
```{r, echo=FALSE, eval=FALSE}
world$name_long[grepl(pattern = "Dem*.+Congo", world$name_long)] # base R
```
```{r 03-attribute-operations-38, eval=FALSE, echo=FALSE}
# aim: test names in coffee_data and world objects
str_subset(coffee_data$name_long, "Ivo|Congo,")
.Last.value %in% str_subset(world$name_long, "Ivo|Dem*.+Congo")
```
To fix this issue, we will create a new version of `coffee_data` and update the name.
`inner_join()`ing the updated data frame returns a result with all 46 coffee-producing nations:
```{r 03-attribute-operations-39, warning=FALSE}
coffee_data$name_long[grepl("Congo,", coffee_data$name_long)] = drc
world_coffee_match = inner_join(world, coffee_data)
nrow(world_coffee_match)
```
It is also possible to join in the other direction: starting with a non-spatial dataset and adding variables from a simple features object.
This is demonstrated below, which starts with the `coffee_data` object and adds variables from the original `world` dataset.
In contrast with the previous joins, the result is *not* another simple feature object, but a data frame in the form of a **tidyverse** tibble:
the output of a join tends to match its first argument:
```{r 03-attribute-operations-40, warning=FALSE}
coffee_world = left_join(coffee_data, world)
class(coffee_world)
```
```{block2 03-attribute-operations-41, type='rmdnote'}
In most cases, the geometry column is only useful in an `sf` object.
The geometry column can only be used for creating maps and spatial operations if R 'knows' it is a spatial object, defined by a spatial package such as **sf**.
Fortunately, non-spatial data frames with a geometry list column (like `coffee_world`) can be coerced into an `sf` object as follows: `st_as_sf(coffee_world)`.
```
This section covers the majority of joining use cases.
For more information, we recommend reading the chapter [Relational data](https://r4ds.had.co.nz/relational-data.html?q=join#relational-data) in @grolemund_r_2016, the [join vignette](https://geocompr.github.io/geocompkg/articles/join.html) in the **geocompkg** package that accompanies this book, and [documentation](https://asardaes.github.io/table.express/articles/joins.html) describing joins with **data.table** and other packages.
Spatial joins are covered in the next chapter (Section \@ref(spatial-joining)).
### Creating attributes and removing spatial information {#vec-attr-creation}
Often, we would like to create a new column based on already existing columns.
For example, we want to calculate population density for each country.
For this we need to divide a population column, here `pop`, by an area column, here `area_km2` with unit area in square kilometers.
Using base R, we can type:
```{r 03-attribute-operations-42}
world_new = world # do not overwrite our original data
world_new$pop_dens = world_new$pop / world_new$area_km2
```
Alternatively, we can use one of **dplyr** functions - `mutate()` or `transmute()`.
`mutate()` adds new columns at the penultimate position in the `sf` object (the last one is reserved for the geometry):
```{r 03-attribute-operations-43, eval=FALSE}
world %>%
mutate(pop_dens = pop / area_km2)
```
The difference between `mutate()` and `transmute()` is that the latter drops all other existing columns (except for the sticky geometry column):
```{r 03-attribute-operations-44, eval=FALSE}
world %>%
transmute(pop_dens = pop / area_km2)
```
`unite()` from the **tidyr** package (which provides many useful functions for reshaping datasets, including `pivot_longer()`) pastes together existing columns.
For example, we want to combine the `continent` and `region_un` columns into a new column named `con_reg`.
Additionally, we can define a separator (here: a colon `:`) which defines how the values of the input columns should be joined, and if the original columns should be removed (here: `TRUE`):
```{r 03-attribute-operations-45, eval=FALSE}
world_unite = world %>%
tidyr::unite("con_reg", continent:region_un, sep = ":", remove = TRUE)
```
The resulting `sf` object has a new column called `con_reg` representing the continent and region of each country, e.g. `South America:Americas` for Argentina and other South America countries.
**tidyr**'s `separate()` function does the opposite of `unite()`: it splits one column into multiple columns using either a regular expression or character positions.
```{r 03-attribute-operations-46, eval=FALSE}
world_separate = world_unite %>%
tidyr::separate(con_reg, c("continent", "region_un"), sep = ":")
```
```{r 03-attribute-operations-47, echo=FALSE, eval=FALSE}
identical(world, world_separate)
```
The **dplyr** function `rename()` and the base R function `setNames()` are useful for renaming columns.
The first replaces an old name with a new one.
The following command, for example, renames the lengthy `name_long` column to simply `name`:
```{r 03-attribute-operations-48, eval=FALSE}
world %>%
rename(name = name_long)
```
`setNames()` changes all column names at once, and requires a character vector with a name matching each column.
This is illustrated below, which outputs the same `world` object, but with very short names:
```{r 03-attribute-operations-49, eval=FALSE, echo=FALSE}
abbreviate(names(world), minlength = 1) %>% dput()
```
```{r 03-attribute-operations-50, eval=FALSE}
new_names = c("i", "n", "c", "r", "s", "t", "a", "p", "l", "gP", "geom")
world_new_names = world %>%
setNames(new_names)
```
Each of these attribute data operations preserve the geometry of the simple features.
Sometimes it makes sense to remove the geometry, for example to speed-up aggregation.
Do this with `st_drop_geometry()`, **not** manually with commands such as `select(world, -geom)`, as shown below.^[
`st_geometry(world_st) = NULL` also works to remove the geometry from `world`, but overwrites the original object.
]
```{r 03-attribute-operations-51}
world_data = world %>% st_drop_geometry()
class(world_data)
```
## Manipulating raster objects
<!--jn-->
In contrast to the vector data model underlying simple features (which represents points, lines and polygons as discrete entities in space), raster data represent continuous surfaces.
This section shows how raster objects work by creating them *from scratch*, building on Section \@ref(an-introduction-to-terra).
Because of their unique structure, subsetting and other operations on raster datasets work in a different way, as demonstrated in Section \@ref(raster-subsetting).
\index{raster!manipulation}
The following code recreates the raster dataset used in Section \@ref(raster-classes), the result of which is illustrated in Figure \@ref(fig:cont-raster).
This demonstrates how the `rast()` function works to create an example raster named `elev` (representing elevations).
```{r 03-attribute-operations-52, message=FALSE, eval = FALSE}
elev = rast(nrows = 6, ncols = 6, resolution = 0.5,
xmin = -1.5, xmax = 1.5, ymin = -1.5, ymax = 1.5,
vals = 1:36)
```
The result is a raster object with 6 rows and 6 columns (specified by the `nrow` and `ncol` arguments), and a minimum and maximum spatial extent in x and y direction (`xmin`, `xmax`, `ymin`, `ymax`).
The `vals` argument sets the values that each cell contains: numeric data ranging from 1 to 36 in this case.
Raster objects can also contain categorical values of class `logical` or `factor` variables in R.
The following code creates the raster datasets shown in Figure \@ref(fig:cont-raster):
```{r 03-attribute-operations-53, eval = FALSE}
grain_order = c("clay", "silt", "sand")
grain_char = sample(grain_order, 36, replace = TRUE)
grain_fact = factor(grain_char, levels = grain_order)
grain = rast(nrows = 6, ncols = 6, resolution = 0.5,
xmin = -1.5, xmax = 1.5, ymin = -1.5, ymax = 1.5,
vals = grain_fact)
```
```{r 03-attribute-operations-54, include = FALSE}
elev = rast(system.file("raster/elev.tif", package = "spData"))
grain = rast(system.file("raster/grain.tif", package = "spData"))
```
The raster object stores the corresponding look-up table or "Raster Attribute Table" (RAT) as a list of data frames, which can be viewed with `cats(grain)` (see `?cats()` for more information).
Each element of this list is a layer of the raster.
It is also possible to use the function `levels()` for retrieving and adding new or replacing existing factor levels:
```{r 03-attribute-operations-56}
levels(grain)[[1]] = c(levels(grain)[[1]], wetness = c("wet", "moist", "dry"))
levels(grain)
```
```{r cont-raster, echo = FALSE, message = FALSE, fig.asp=0.5, fig.cap = "Raster datasets with numeric (left) and categorical values (right).", fig.scap="Raster datasets with numeric and categorical values.", warning=FALSE}
# knitr::include_graphics("https://user-images.githubusercontent.com/1825120/146617366-7308535b-30f6-4c87-83f7-21702c7d993b.png")
source("https://github.com/Robinlovelace/geocompr/raw/main/code/03-cont-raster-plot.R", print.eval = TRUE)
```
```{block2 coltab, type='rmdnote'}
Categorical raster objects can also store information about the colors associated with each value using a color table.
The color table is a data frame with three (red, green, blue) or four (alpha) columns, where each row relates to one value.
Color tables in **terra** can be viewed or set with the `coltab()` function (see `?coltab`).
Importantly, saving a raster object with a color table to a file (e.g., GeoTIFF) will also save the color information.
```
### Raster subsetting
Raster subsetting is done with the base R operator `[`, which accepts a variety of inputs:
\index{raster!subsetting}
- Row-column indexing
- Cell IDs
- Coordinates (see Section \@ref(spatial-raster-subsetting))
- Another spatial object (see Section \@ref(spatial-raster-subsetting))
Here, we only show the first two options since these can be considered non-spatial operations.
If we need a spatial object to subset another or the output is a spatial object, we refer to this as spatial subsetting.
Therefore, the latter two options will be shown in the next chapter (see Section \@ref(spatial-raster-subsetting)).
The first two subsetting options are demonstrated in the commands below ---
both return the value of the top left pixel in the raster object `elev` (results not shown):
```{r 03-attribute-operations-58, eval = FALSE}
# row 1, column 1
elev[1, 1]
# cell ID 1
elev[1]
```
Subsetting of multi-layered raster objects will return the cell value(s) for each layer.
For example, `two_layers = c(grain, elev); two_layers[1]` returns a data frame with one row and two columns --- one for each layer.
To extract all values or complete rows, you can also use `values()`.
Cell values can be modified by overwriting existing values in conjunction with a subsetting operation.
The following code chunk, for example, sets the upper left cell of `elev` to 0 (results not shown):
```{r 03-attribute-operations-60, results='hide'}
elev[1, 1] = 0
elev[]
```
Leaving the square brackets empty is a shortcut version of `values()` for retrieving all values of a raster.
Multiple cells can also be modified in this way:
```{r 03-attribute-operations-61}
elev[1, c(1, 2)] = 0
```
Replacing values of multilayered rasters can be done with a matrix with as many columns as layers and rows as replaceable cells (results not shown):
```{r 03-attribute-operations-61b, eval=FALSE}
two_layers = c(grain, elev)
two_layers[1] = cbind(c(1), c(4))
two_layers[]
```
### Summarizing raster objects
**terra** contains functions for extracting descriptive statistics\index{statistics} for entire rasters.
Printing a raster object to the console by typing its name returns minimum and maximum values of a raster.
`summary()` provides common descriptive statistics\index{statistics} -- minimum, maximum, quartiles and number of `NA`s for continuous rasters and a number of cells of each class for categorical rasters.
Further summary operations such as the standard deviation (see below) or custom summary statistics can be calculated with `global()`.
\index{raster!summarizing}
```{r 03-attribute-operations-62, eval = FALSE}
global(elev, sd)
```
```{block2 03-attribute-operations-63, type='rmdnote'}
If you provide the `summary()` and `global()` functions with a multi-layered raster object, they will summarize each layer separately, as can be illustrated by running: `summary(c(elev, grain))`.
```
Additionally, the `freq()` function allows to get the frequency table of categorical values.
Raster value statistics can be visualized in a variety of ways.
Specific functions such as `boxplot()`, `density()`, `hist()` and `pairs()` work also with raster objects, as demonstrated in the histogram created with the command below (not shown):
```{r 03-attribute-operations-64, eval=FALSE}
hist(elev)
```
In case the desired visualization function does not work with raster objects, one can extract the raster data to be plotted with the help of `values()` (Section \@ref(raster-subsetting)).
\index{raster!values}
Descriptive raster statistics belong to the so-called global raster operations.
These and other typical raster processing operations are part of the map algebra scheme, which are covered in the next chapter (Section \@ref(map-algebra)).
```{block 03-attribute-operations-65, type='rmdnote'}
Some function names clash between packages (e.g., a function with the name `extract()` exist in both **terra** and **tidyr** packages).
In addition to not loading packages by referring to functions verbosely (e.g., `tidyr::extract()`), another way to prevent function names clashes is by unloading the offending package with `detach()`.
The following command, for example, unloads the **terra** package (this can also be done in the *package* tab which resides by default in the right-bottom pane in RStudio): `detach("package:terra", unload = TRUE, force = TRUE)`.
The `force` argument makes sure that the package will be detached even if other packages depend on it.
This, however, may lead to a restricted usability of packages depending on the detached package, and is therefore not recommended.
```
## Exercises
```{r, echo=FALSE, results='asis'}
res = knitr::knit_child('_03-ex.Rmd', quiet = TRUE, options = list(include = FALSE, eval = FALSE))
cat(res, sep = '\n')
```