-
Notifications
You must be signed in to change notification settings - Fork 6
/
Copy path125_nhpi_walkthrough.Rmd
571 lines (362 loc) · 15.9 KB
/
125_nhpi_walkthrough.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
<!-- This file by Martin Monkman
is licensed under a Creative Commons Attribution 4.0 International License
https://creativecommons.org/licenses/by/4.0/ -->
# An example of a data analytics project {#walkthrough}
## The data science process
This walkthrough follows the data science process, as described by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund in [_R for Data Science_, 2nd ed.](https://r4ds.hadley.nz/intro).
![_The data science process_](static/img/data-science.png)
***
## New Housing Price Index
This set of scripts creates summary tables and graphs plotting the New Housing Price Index (NHPI) data collected and reported by Statistics Canada.
Our end point will be this chart:
![New Housing Price Index, B.C. and Canada](static/img/NHPI_plot.png){width=100%}
This plot was created in early 2020, when the most recent data available was for the month of November 2019.
***
### Setup
This chunk of R code loads the packages that we will be using.
```{r setup_125}
# tidyverse
library(tidyverse)
library(lubridate) # date functions
library(scales) # extending {ggplot2}
library(glue) # for gluing strings together
#
# utilities
library(cansim) # data extract
library(janitor) # for `clean_names()`
library(knitr) # for publication - includes `kable()` for tables
library(kableExtra) # - format kable tables
library(flextable) # another table formatting package
```
First, we have some code for two custom functions that calculate month-over-month and year-over-year statistics. These will be called later in the scripts. Functions give us a shorthand way to do the same thing multiple times.
```{r functions}
get_mom_stats <- function(df) {
df |>
arrange(ref_date) |>
mutate(mom_val = lag(value),
mom_pct = ((value / lag(value, n = 1)) - 1) * 100,
mom_chg = (value - lag(value, n = 1)))
}
get_yoy_stats <- function(df) {
df |>
arrange(ref_date) |>
mutate(yoy_val = lag(value, n = 12),
yoy_pct = ((value / lag(value, n = 12)) - 1) * 100,
yoy_chg = (value - lag(value, n = 12)))
}
```
We then create a chart theme that will also be called later. This theme will be applied to all of our charts, and ensures that every one of our charts have exactly the same formatting. Like functions, this streamlines our code and helps reduce the chance of error.
For this chart theme, more than 10 lines of code gets reduced to a simple, memorable name: `bida_chart_theme`.
```{r bida_chart_theme}
bida_chart_theme <-
theme_bw() +
theme(
panel.border = element_rect(colour="white"),
plot.title = element_text(face="bold"),
legend.position=c(1,0),
legend.justification=c(1,0),
legend.title = element_text(size=12),
legend.text = element_text(size=11),
axis.line = element_line(colour="black"),
axis.title = element_text(size=12),
axis.text = element_text(size=12)
)
```
---
## IMPORT
The first step is to import the data. For this, we will access Statistics Canada's database CANSIM, using an R package [{cansim}](https://github.com/mountainMath/cansim). The file has 18 variables and over 56,000 rows (as of February 2020) and growing every month.
**data source**
[The Daily, 2019-10-10](https://www150.statcan.gc.ca/n1/daily-quotidien/191010/dq191010a-eng.htm)
Table: 18-10-0205-01 (formerly CANSIM 327-0056)
https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1810020501
**Note:** "The index base period, for which the New Housing Price Index (NHPI) equals 100, is December 2016."
This version (updated `r Sys.Date()`) uses the `cansim` package to pull the data from CANSIM, and assign it to an object named "thedata".
```{r data_import_cansim}
table_id <- "18-10-0205-01"
thedata <- get_cansim(table_id)
```
***
## TIDY
### understanding the data
It's a good practice before starting any data analysis to understand the structure of your data, by interrogating the variables and the values in those variables.
{cansim} (like many other data-focused packages) provides some utility functions that let us quickly understand the data frame. Here we use the function `get_cansim_table_overview` to generate an overview of the table we downloaded.
```{r interogate1}
# from the {cansim} package
get_cansim_table_overview(table_id)
```
We can also create some summaries with the names of the variables. One way to do this is with the `ls.str()` function:
```{r}
# list the variables in the data
ls.str(thedata)
```
We can explore the individual variables as well—let's generate a small table counting the instances of the `GEO` and `GeoUID` variables:
```{r interogate2}
thedata |>
group_by(GEO, GeoUID) |>
tally()
```
The `tail()` function shows the last 6 rows of a data table.
```{r}
tail(thedata)
```
Based on this interrogation, we can see that the chart is made up of the following variables:
* "REF_DATE", starting in January 2007
* "GEO", with British Columbia & Canada
* "New housing price indexes", but total (not house and land separately)
### cleaning
Cleaning up our data table before we start working is often the next step. In this case, we use the [{janitor}](http://sfirke.github.io/janitor/) package to clean the variable names.
The values in the date field is also problematic—they are stored as character strings. We will clean the values by specifying them as a "date" format.
```{r cleanup}
thedata <- janitor::clean_names(thedata)
thedata <- thedata |>
mutate(ref_date = ymd(ref_date, truncated = 2))
head(thedata)
```
**Important Note**
The `clean_names()` function changes the names of the three variables that are shown in the chart:
| Variable: Old Name | New Name |
| --- | --- |
| REF_DATE | ref_date |
| GEO | geo |
| New housing price indexes | new_housing_price_indexes |
### filtering: BC & Canada data
For our chart, we don't need all the geographic regions—just B.C. and Canada. And for our purposes, we need the total value and want the chart to start in 2007, not 1981.
To do that, we use two functions from the {dplyr} package: `filter` (to pick the rows we want, based on the values of interest) and `select` (to pick the columns).
```{r filter}
startdate <- as.Date("2007-01-01")
# filter to have BC and Canada
thedata_BC_Can <- thedata |>
filter(ref_date >= startdate) |>
filter(geo %in% c("British Columbia", "Canada"),
new_housing_price_indexes == "Total (house and land)") |>
select(ref_date, geo, new_housing_price_indexes, value)
head(thedata_BC_Can)
```
---
## UNDERSTAND
### Visualize - tables
One way to understand your data is to create a summary table that is more easily scanned by the human eye.
In this case, we will put the rows as each year, and the columns as the months January through December.
```{r table1}
NHPI_table <- thedata_BC_Can |>
filter(geo == "British Columbia") |>
mutate(year = year(ref_date),
month = month(ref_date, label = TRUE)) |>
select(year, month, value) |>
pivot_wider(names_from = month, values_from = value) |>
arrange(desc(year))
# display the table
head(NHPI_table)
```
For the next piece, we will calculate an annual average for each year, and add that column to the table.
This is a good time to say that there are often multiple ways to get to the same result—they are not "right" or "wrong", just different. Here are three different ways to create a column with the annual average (and there are more):
```{r table2}
# how to add annual average
# Julie's genius solution
NHPI_table2 <- NHPI_table |>
mutate(annual_avg = rowMeans(NHPI_table[-1], na.rm = TRUE))
head(NHPI_table2)
```
```{r table3}
# Stephanie's genius solution
# starts with the raw data table
NHPI_table3 <- thedata_BC_Can |>
filter(geo == "British Columbia") |>
mutate(year = year(ref_date),
month = month(ref_date, label = TRUE)) |>
select(year, month, value) |>
group_by(year) |>
mutate(annual_avg = mean(value, na.rm = TRUE)) |>
pivot_wider(names_from = month, values_from = value) |>
arrange(desc(year))
head(NHPI_table3)
```
This is also a good example of how R is continuously evolving. When this course was first taught in autumn 2019, this is where the section on adding the annual average column ended.
But that ongoing evolution has given us a new {dplyr} function, `rowwise()`. This function allow us to calculate the average of the specified rows as above. What is required is telling R to treat the functions `rowwise()`, and then specify what we want using the same sort of functions and syntax that would be required for a column:
```{r table4a}
NHPI_table4 <- NHPI_table |>
rowwise() |>
mutate(annual_avg = mean(c(Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec), na.rm = TRUE))
head(NHPI_table4)
```
For more information about row-wise operations with {dplyr}, see the following resources:
* https://dplyr.tidyverse.org/articles/rowwise.html
* https://community.rstudio.com/t/dplyr-alternatives-to-rowwise/8071/45
* https://speakerdeck.com/jennybc/row-oriented-workflows-in-r-with-the-tidyverse
- https://github.com/jennybc/row-oriented-workflows
- https://github.com/jennybc/row-oriented-workflows/blob/master/ex09_row-summaries.md
* https://github.com/tidyverse/dplyr/issues/4767
***
## COMMUNICATE - table
One way we often communicate the data is through a summary table.
The tables above have the data we are interested in, but are not appealing to the eye. Using the {flextable} package, we can apply formatting to the tables above. We can also add things like headers and footnotes.
```{r nhpi-flextable}
# print table with {flextable} formatting
NHPI_table4 |>
flextable() |>
# add_footer(glue::glue("Source: Statistics Canada, Table {table_id}"))
add_footer_row(values = glue::glue("Source: Statistics Canada, Table {table_id}"),
colwidths = 14)
```
### table: monthly summary
This code produces a monthly summary table with the most recent 13 months of B.C. data, which allows for year-over-year comparisons with the same month in the previous year.
The second part of the code then formats the table ready for publication.
```{r}
# create summary table
tbl_month_BC <-
thedata_BC_Can |>
filter(geo == "British Columbia") |>
arrange(ref_date) |>
# calculate percent change stats
get_mom_stats() |>
get_yoy_stats() |>
# pull year and month
mutate(year = year(ref_date),
month = month(ref_date, label = TRUE)) |>
# select relevant columns, rename as necessary
select(year, month, value,
"from previous month" = mom_chg,
"from same month, previous year" = yoy_chg) |>
arrange(desc(year), desc(month)) |>
# just print rows 1 to 13
slice(1:13)
# print table with {kableExtra} formatting
tbl_month_BC |>
kable(caption = "NHPI, British Columbia", digits = 1) |>
kable_styling(bootstrap_options = "striped") |>
row_spec(0, bold = T, font_size = 14) |>
row_spec(1, bold = T) |>
add_header_above(c(" " = 3, "index point change" = 2), font_size = 14)
```
### text
pull B.C. values for latest month
```{r}
# filter a B.C.-only data frame
thedata_BC <- thedata_BC_Can |>
filter(geo == "British Columbia") |>
arrange(ref_date) |>
# calculate percent change stats
get_mom_stats() |>
get_yoy_stats()
# determine most recent month
latest_ref_date <- max(thedata_BC$ref_date)
this_month <- month(latest_ref_date, label = TRUE)
this_year <- year(latest_ref_date)
this_month_nhpi <- thedata_BC |>
filter(ref_date == latest_ref_date) |>
pull(value)
this_month_mom <- thedata_BC |>
filter(ref_date == latest_ref_date) |>
pull(mom_chg)
this_month_yoy <- thedata_BC |>
filter(ref_date == latest_ref_date) |>
pull(yoy_chg)
```
Our text:
> In `r this_month` `r this_year`, the New Housing Price Index for British Columbia was `r this_month_nhpi`.
> This was a change of `r this_month_mom` compared to the previous month, and `r this_month_yoy` year-over-year.
---
## UNDERSTAND
### Visualize - plot
Making a plot of your data is often a good way to understand what the data can tell us, and helps us think about the story in the data. In this case, we will make a basic plot with just a few lines of code.
```{r nhpi-basic-plot}
# PLOT!
# basic
ggplot(thedata_BC_Can, aes(x=ref_date, y=value, group=geo)) +
geom_line()
```
We can make it a little more presentable with some additional formatting
```{r}
#
# with formatting applied
dataplot <- ggplot(thedata_BC_Can, aes(x=ref_date, y=value, colour=geo)) +
geom_line(size=1.5)
dataplot
```
And with even more formatting it's starting to look like something we could publish.
```{r}
dataplot2 <- dataplot +
scale_x_date(date_breaks = "2 years", labels = year) +
scale_y_continuous(labels = comma, limits = c(80, 125)) +
scale_colour_manual(name=NULL,
breaks=c("Canada", "British Columbia"),
labels=c("Canada", "British Columbia"),
values=c("#325A80", "#CCB550")) +
bida_chart_theme
```
Just a reminder: all of the code below is what's in the `bida_chart_theme`--the last line in the code chunk above.
If we were to make more than one copy of the plot, we would have to repeat this every time. And if we made more than one copy of the plot, we would have to make the same change in every place.
By assigning all of this code to the object `bida_chart_theme`, we can make a change in one place, and it will carry through all of the times we need it.
```{r}
# theme_bw() +
# theme(
# panel.border = element_rect(colour="white"),
# plot.title = element_text(face="bold"),
# legend.position=c(1,0),
# legend.justification=c(1,0),
# legend.title = element_text(size=12),
# legend.text = element_text(size=11),
# axis.line = element_line(colour="black"),
# axis.title = element_text(size=12),
# axis.text = element_text(size=12)
# )
#
dataplot2
```
## COMMUNICATE - plot
Let's finish our formatting...
```{r}
# experiments with ggplot2's new subtitle and caption options
NHPI_title <- as.character("New Housing Price Index, Canada & B.C.")
NHPI_subtitle <- as.character("December 2016 = 100")
NHPI_caption <- as.character("Source: Statistics Canada, CANSIM table 18-10-0205-01")
# add titles / X-Y axis labels
NHPI_plot <- dataplot2 +
labs(title = NHPI_title,
subtitle = NHPI_subtitle,
caption = NHPI_caption,
x = NULL, y = "NHPI (Dec. 2016 = 100)")
NHPI_plot
ggsave(filename = "NHPI_plot.jpg", plot = NHPI_plot,
width = 8, height = 6)
```
***
## REPEAT WITH DIFFERENT VALUES
### Vancouver house and land version
Now we've set up the B.C. comparison to Canada, we can quickly adapt our previous work to look at another region--in this case, a deeper dive into Vancouver, comparing the changes in the "land" and "house" components of the NHPI.
```{r}
startdate <- as.Date("2007-01-01")
# filter the original data to have Vancouver
thedata_YVR <- thedata |>
filter(ref_date >= startdate) |>
filter(geo %in% c("Vancouver, British Columbia"),
new_housing_price_indexes %in% c("House only", "Land only")) |>
arrange(desc(ref_date))
thedata_YVR
```
the plot
```{r NHPI_Vancouver_plot}
NHPI_title <- as.character("New Housing Price Index, Vancouver: house and land")
NHPI_subtitle <- as.character("December 2016 = 100")
NHPI_caption <- as.character("Source: Statistics Canada, CANSIM table 18-10-0205-01")
# with formatting applied
NHPI_Vancouver_plot <-
ggplot(thedata_YVR, aes(x=ref_date, y=value,
colour=new_housing_price_indexes)) +
geom_line(size=1.5) +
#
scale_x_date(date_breaks = "2 years", labels = year) +
scale_y_continuous(labels = comma, limits = c(80, 125)) +
scale_colour_manual(name=NULL,
breaks=c("House only", "Land only"),
labels=c("House only", "Land only"),
values=c("#325A80", "#CCB550")) +
bida_chart_theme +
# set chart titles and labels
labs(title = NHPI_title,
subtitle = NHPI_subtitle,
caption = NHPI_caption,
x = NULL, y = "NHPI (Dec. 2016 = 100)")
NHPI_Vancouver_plot
```
-30-