-
Notifications
You must be signed in to change notification settings - Fork 1
/
assignment4solution.Rmd
172 lines (140 loc) · 6.17 KB
/
assignment4solution.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
---
title: "Statistical assignment 4"
author: "[add your name here] [add your candidate number here - mandatory]"
date: "[add date here]"
output: github_document
---
```{r setup, include=FALSE}
# Please note these options.
# This tells R Markdown that we want to show code in the output document.
knitr::opts_chunk$set(echo = TRUE)
# Switching off messages in the output document.
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning = FALSE)
# Switching on caching to make things faster (don't commit cache files on Github).
knitr::opts_chunk$set(cache = TRUE)
```
In this assignment you will need to reproduce 5 ggplot graphs. I supply graphs as images; you need to write the ggplot2 code to reproduce them and knit and submit a Markdown document with the reproduced graphs (as well as your .Rmd file).
First we will need to open and recode the data. I supply the code for this; you only need to change the file paths.
```{r}
library(tidyverse)
Data8 <- read_tsv("C:\\Users\\ab789\\datan3_2019\\data\\UKDA-6614-tab\\tab\\ukhls_w8\\h_indresp.tab")
Data8 <- Data8 %>%
select(pidp, h_age_dv, h_payn_dv, h_gor_dv)
Stable <- read_tsv("C:\\Users\\ab789\\datan3_2019\\data\\UKDA-6614-tab\\tab\\ukhls_wx\\xwavedat.tab")
Stable <- Stable %>%
select(pidp, sex_dv, ukborn, plbornc)
Data <- Data8 %>% left_join(Stable, "pidp")
rm(Data8, Stable)
Data <- Data %>%
mutate(sex_dv = ifelse(sex_dv == 1, "male",
ifelse(sex_dv == 2, "female", NA))) %>%
mutate(h_payn_dv = ifelse(h_payn_dv < 0, NA, h_payn_dv)) %>%
mutate(h_gor_dv = recode(h_gor_dv,
`-9` = NA_character_,
`1` = "North East",
`2` = "North West",
`3` = "Yorkshire",
`4` = "East Midlands",
`5` = "West Midlands",
`6` = "East of England",
`7` = "London",
`8` = "South East",
`9` = "South West",
`10` = "Wales",
`11` = "Scotland",
`12` = "Northern Ireland")) %>%
mutate(placeBorn = case_when(
ukborn == -9 ~ NA_character_,
ukborn < 5 ~ "UK",
plbornc == 5 ~ "Ireland",
plbornc == 18 ~ "India",
plbornc == 19 ~ "Pakistan",
plbornc == 20 ~ "Bangladesh",
plbornc == 10 ~ "Poland",
plbornc == 27 ~ "Jamaica",
plbornc == 24 ~ "Nigeria",
TRUE ~ "other")
)
```
Reproduce the following graphs as close as you can. For each graph, write two sentences (not more!) describing its main message.
1. Histogram (20 points)
```{r}
Data %>%
ggplot(aes(x = h_age_dv)) +
geom_histogram(binwidth = 1) +
xlab("Age") +
ylab("Number of respondents")
```
2. Scatter plot (20 points). The red line shows a linear fit; the blue line shows a quadratic fit. Note the size and position of points.
```{r}
Data %>%
ggplot(aes(x = h_age_dv, y = h_payn_dv)) +
geom_point(size = 0.1, position = "jitter") +
geom_smooth(method = "lm", colour = "red") +
geom_smooth(method = "lm", formula = y ~ x + I(x^2), colour = "blue") +
xlim(16, 65) +
xlab("Age") +
ylab("Monthly earnings")
```
3. Faceted density chart (20 points).
```{r}
Data %>%
filter(!is.na(placeBorn)) %>%
ggplot(aes(x = h_age_dv)) +
geom_density(fill = "black") +
xlab("Age") +
facet_wrap(~ placeBorn)
```
4. Ordered bar chart of summary statistics (20 points).
```{r}
Data %>%
filter(!is.na(placeBorn)) %>%
filter(!is.na(sex_dv)) %>%
group_by(placeBorn, sex_dv) %>%
summarise(
medianIncome = median(h_payn_dv, na.rm = TRUE)
) %>%
ggplot(aes(x = reorder(placeBorn, medianIncome), y = medianIncome, fill = sex_dv)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = c("darkred", "darkblue")) +
coord_flip() +
ylab("Median net monthly earnings") +
xlab("Country of birth") +
theme(legend.position="top") +
guides(fill=guide_legend(reverse=TRUE)) +
labs(fill="")
```
5. Map (20 points). This is the most difficult problem in this set. You will need to use the NUTS Level 1 shape file (available here -- https://data.gov.uk/dataset/2aa6727d-c5f0-462a-a367-904c750bbb34/nuts-level-1-january-2018-full-clipped-boundaries-in-the-united-kingdom) and a number of packages for producing maps from shape files. You will need to google additional information; there are multiple webpages with the code that produces similar maps.
```{r}
library(rgdal)
library(ggmap)
library(mapproj)
library(rgeos)
shapefile <- readOGR(dsn = getwd(), layer = "NUTS_Level_1_January_2018_Full_Clipped_Boundaries_in_the_United_Kingdom", verbose = FALSE)
mapdata <- broom::tidy(shapefile, region="nuts118nm")
mapdata <- mapdata %>%
mutate(id = recode(id,
"East Midlands (England)" = "East Midlands",
"North East (England)" = "North East",
"North West (England)" = "North West",
"South East (England)" = "South East",
"South West (England)" = "South West",
"West Midlands (England)" = "West Midlands",
"Yorkshire and The Humber" = "Yorkshire"))
medianIncome <- Data %>%
filter(!is.na(sex_dv)) %>%
group_by(h_gor_dv) %>%
summarise(
medianIncome = median(h_payn_dv, na.rm = TRUE)
)
mapdata2 <- mapdata %>%
left_join(medianIncome, by = c("id" = "h_gor_dv"))
ggplot() +
geom_polygon(data = mapdata2, aes(x = long, y = lat, group = group, fill = medianIncome)) +
scale_fill_gradient(trans = "reverse") +
labs(fill="") +
theme_void() +
coord_map() +
ggtitle("Median earnings by region (£)")
```