---
title: "LAU and NUTS"
description: |
  Can they be friends?
author:
  - name: Giorgio Comai
    url: https://giorgiocomai.eu
    affiliation: OBCT/EDJNet
    affiliation_url: https://www.europeandatajournalism.eu/
date: "`r Sys.Date()`"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library("tidyverse", quietly = TRUE)
library("sf", quietly = TRUE)
library("latlon2map")
options(timeout = 60000)
cache_folder <- fs::path(fs::path_home_r(),
"R",
"ll_data")
fs::dir_create(cache_folder)
ll_set_folder(path = cache_folder)
```
Combining LAU and NUTS using publicly available datasets is not as straightforward as it may seem. This is due to small inconsistencies in the data, the [cumbersome and internally inconsistent format chosen by GISCO to distribute the concordance tables](https://ec.europa.eu/eurostat/web/nuts/local-administrative-units), and, above all, the puzzling choice not to include a reference to NUTS in the datasets with local administrative units to begin with.

There does not seem to be any combination of concordance tables that fully matches the LAU dataset for the relevant year. Mixing and matching between different years for individual countries may eventually bring us closer to the desired outcome, but may introduce inconsistencies of its own (see some details about the process in the page [botched attempts at combining LAU and NUTS](lau_nuts_botched.html)).

Find below more details about the proposed tables, which offer *full* matching with the LAU of the relevant year, i.e. 100% of the LAUs are paired to a NUTS region.

To be clear, full matching inevitably leads to some artifacts: if a LAU changed its borders between 2016 and 2021 (when NUTS regions were updated) and it straddles the boundary between two NUTS regions, there is no *right* matching. These issues, however, involve a relatively small number of LAUs: if your goal is to use this for data visualisations, then it's probably fine. If you are using statistics from other sources and you expect totals to add up, then even the odd LAU out of place may be something to be concerned about (this is likely part of the reason why the official concordance tables are not complete).

This repository includes concordance datasets [based on matching by geometry](https://github.com/EDJNet/lau_centres/tree/main/lau_nuts_concordance_by_geo), as well as a [combination](https://github.com/EDJNet/lau_centres/tree/main/lau_nuts_concordance_combo) that relies on the concordance tables as distributed by GISCO and falls back to the geometry-based ones for missing LAUs. The combined datasets are probably [the ones you want to use](https://github.com/EDJNet/lau_centres/tree/main/lau_nuts_concordance_combo).
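The fallback logic behind the combined datasets can be illustrated with a minimal sketch. This is not the code used to build the published files: the identifiers and NUTS codes are invented, and base R is used only so that the example runs without additional packages.

```r
# Toy data: all identifiers and codes below are made up for illustration.
official <- data.frame(
  gisco_id = c("XX_001", "XX_002"),
  nuts_3 = c("XX011", "XX012"),
  stringsAsFactors = FALSE
)
by_geo <- data.frame(
  gisco_id = c("XX_001", "XX_002", "XX_003"),
  nuts_3 = c("XX011", "XX013", "XX012"),
  stringsAsFactors = FALSE
)

# Keep every LAU found in the geometry-based table and attach
# the official code where one exists (NA otherwise)
combo <- merge(by_geo, official,
               by = "gisco_id", all.x = TRUE,
               suffixes = c("_by_geo", "_official"))

# Prefer the official code; fall back to geometry only when it is missing
combo$source <- ifelse(is.na(combo$nuts_3_official), "by_geo", "official")
combo$nuts_3 <- ifelse(is.na(combo$nuts_3_official),
                       combo$nuts_3_by_geo,
                       combo$nuts_3_official)

combo[, c("gisco_id", "nuts_3", "source")]
```

Keeping a `source` column (a hypothetical name, not necessarily present in the published files) makes it easy to see which rows rely on the geometry-based fallback.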
# LAU 2020 / NUTS 2021, based on geometry
Until GISCO finally distributes a consistent dataset (hopefully LAU 2021 / NUTS 2021), we decided to assign each LAU to a NUTS region based on the most recently distributed geographic datasets: LAU 2020 / NUTS 2021. This may still lead to some inconsistencies for LAUs with recently changed borders along the administrative boundary of a NUTS region but, all things considered, it is expected to offer an accurate dataset, possibly bar a handful of cases.
```{r}
nuts_year <- 2021
lau_year <- 2020
```
```{r}
nuts_countries_df <- ll_get_nuts_eu(year = nuts_year,
level = 3) %>%
sf::st_drop_geometry() %>%
dplyr::distinct(CNTR_CODE) %>%
dplyr::arrange(CNTR_CODE)
lau_countries_df <- ll_get_lau_eu(year = lau_year) %>%
sf::st_drop_geometry() %>%
dplyr::distinct(CNTR_CODE) %>%
dplyr::arrange(CNTR_CODE)
countries <- dplyr::semi_join(x = nuts_countries_df,
y = lau_countries_df,
by = "CNTR_CODE") %>%
dplyr::arrange(CNTR_CODE)
# dplyr::anti_join(x = lau_countries_df,
# y = nuts_countries_df,
# by = "CNTR_CODE")
#
# dplyr::anti_join(x = nuts_countries_df,
# y = lau_countries_df,
# by = "CNTR_CODE")
```
The area covered by both datasets (LAU 2020 / NUTS 2021) includes `r nrow(countries)` countries: `r countries %>% dplyr::summarise(country = stringr::str_c(CNTR_CODE, collapse = ", "))%>% dplyr::pull(country)`.
```{r}
# see `_process_lau_nuts_area.Rmd` for details
```
## How reliable is the matching?
```{r}
base_folder <- fs::path("lau_nuts_area",
stringr::str_c(c("lau", lau_year, "nuts", nuts_year, "area"),
collapse = "_"))
all_lau_nuts_area_df <- purrr::map_dfr(
.x = fs::dir_ls(path = base_folder),
.f = function(current_file) {
readr::read_csv(file = current_file,
col_types = cols(
gisco_id = col_character(),
nuts_3 = col_character(),
area = col_double(),
area_share = col_double()
))
})
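# Keep, for each LAU, the NUTS 3 region that contains the largest part of its area.
# `with_ties = FALSE` guarantees exactly one row per LAU even if two areas are equal.
all_lau_nuts_best_match_df <- all_lau_nuts_area_df %>%
  dplyr::group_by(gisco_id) %>%
  dplyr::slice_max(area, n = 1, with_ties = FALSE) %>%
  dplyr::ungroup()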
```
Given some inconsistencies and the different resolution of the two datasets, we do not expect LAUs located along a NUTS boundary to match NUTS regions exactly. A difference of up to 10, perhaps 20 percent is likely attributable to mismatches between the geographic datasets, not to actual changes on the ground.
As a consequence, there are probably just a handful of cases where the matching is not perfectly accurate.
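The rule applied here (attribute each LAU to the NUTS region that contains the largest part of its area, then flag weak matches) can be sketched on a toy table. All figures are invented, and the sketch assumes that `area_share` is the overlapping area divided by the total overlapping area of the LAU; the published files may compute it differently.

```r
# Toy table: one row per LAU/NUTS overlap, with the overlapping area.
# All identifiers and figures are invented for illustration.
overlaps <- data.frame(
  gisco_id = c("XX_001", "XX_001", "XX_002", "XX_003", "XX_003"),
  nuts_3   = c("XX011", "XX012", "XX011", "XX012", "XX013"),
  area     = c(95, 5, 40, 60, 40),
  stringsAsFactors = FALSE
)

# Share of each LAU's total overlapping area that falls in each NUTS region
# (assumption: this is how `area_share` is defined)
overlaps$area_share <- overlaps$area /
  ave(overlaps$area, overlaps$gisco_id, FUN = sum)

# Keep, for each LAU, the row with the largest area (first row wins on ties)
ordered <- overlaps[order(overlaps$gisco_id, -overlaps$area), ]
best_match <- ordered[!duplicated(ordered$gisco_id), ]

# LAUs whose best match covers less than 90% of their area deserve a closer look
uncertain <- best_match[best_match$area_share < 0.9, ]
uncertain
```

In this toy example only `XX_003` is flagged, since its best match covers 60% of its area.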
Here is a full list of LAUs which, according to available datasets, have less than 90% of their area within a single NUTS region:
```{r}
lau_less_than_90_df <- all_lau_nuts_best_match_df %>%
dplyr::filter(area_share<0.9) %>%
dplyr::arrange(dplyr::desc(area_share))
lau_less_than_90_df %>%
knitr::kable()
```
## How consistent is the matching?
Do *all* LAUs have their own NUTS? Here is a complete table of LAUs that are not paired to any NUTS:
```{r}
ll_get_lau_eu(year = lau_year) %>%
sf::st_drop_geometry() %>%
dplyr::select(gisco_id = GISCO_ID) %>%
dplyr::left_join(y = all_lau_nuts_best_match_df %>%
dplyr::select(gisco_id, nuts_3),
by = "gisco_id") %>%
dplyr::filter(is.na(nuts_3)) %>%
knitr::kable()
```
Yes. Full match. 😍
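A full match has two sides: no LAU is left without a NUTS region, and no LAU is paired to more than one. A minimal sketch of both checks, on invented data and in base R, could look like this:

```r
# Toy data: identifiers and codes are invented for illustration.
all_lau <- data.frame(
  gisco_id = c("XX_001", "XX_002", "XX_003"),
  stringsAsFactors = FALSE
)
best_match <- data.frame(
  gisco_id = c("XX_001", "XX_002", "XX_003"),
  nuts_3 = c("XX011", "XX011", "XX012"),
  stringsAsFactors = FALSE
)

# LAUs that have no NUTS 3 code attached
matched_ids <- best_match$gisco_id[!is.na(best_match$nuts_3)]
unmatched <- setdiff(all_lau$gisco_id, matched_ids)

# LAUs paired to more than one NUTS 3 region
duplicated_ids <- unique(best_match$gisco_id[duplicated(best_match$gisco_id)])

# TRUE only if every LAU is paired to exactly one NUTS 3 region
length(unmatched) == 0 && length(duplicated_ids) == 0
```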
## How different is this from the official concordance tables?
If we only consider those LAUs that are actually present in the official concordance table, how many LAUs would be miscategorised relying on the geometries as described above?
In the case of LAU 2020 / NUTS 2021 (for which the official concordance tables are still provisional), only about 100 LAUs are misplaced: one in Greece, one in the Netherlands, and all others in France. France has by far the highest number of LAUs of any country in Europe: more than a third of LAUs in Europe are located there, and changes are frequent.
```{r}
concordance_check_df <- ll_get_lau_nuts_concordance(lau_year = 2020, nuts_year = 2021) %>%
transmute(country, nuts_2_official = nuts_2, nuts_3_official = nuts_3, gisco_id) %>%
dplyr::left_join(y = readr::read_csv(file = "lau_nuts_concordance_by_geo/lau_2020_nuts_2021_concordance_by_geo.csv", col_types = cols(
gisco_id = col_character(),
country = col_character(),
nuts_2 = col_character(),
nuts_3 = col_character(),
lau_id = col_character(),
lau_name = col_character(),
population = col_double(),
area_km2 = col_double(),
year = col_double(),
fid = col_character()
)) %>%
transmute(nuts_2_by_geo = nuts_2, nuts_3_by_geo = nuts_3, gisco_id),
by = "gisco_id") %>%
mutate(nuts_2_different = nuts_2_official != nuts_2_by_geo,
nuts_3_different = nuts_3_official != nuts_3_by_geo)
# Compare only LAUs that have a NUTS 2 code in both the official and the geometry-based table
concordance_check_df %>%
  filter(!is.na(nuts_2_official), !is.na(nuts_2_by_geo)) %>%
group_by(country) %>%
summarise(total_nuts_2_different = sum(nuts_2_different),
total_nuts_3_different = sum(nuts_3_different),
total = n()) %>%
knitr::kable()
```
The situation is quite similar for LAU 2019 / NUTS 2016 (the most recent combination for which there is a validated concordance table): the records for Albania differ only in format (the official concordance tables have e.g. "AL11" rather than "AL011"). Almost all other miscategorised LAUs are in France.
```{r}
concordance_check_df <- ll_get_lau_nuts_concordance(lau_year = 2019, nuts_year = 2016) %>%
transmute(country, nuts_2_official = nuts_2, nuts_3_official = nuts_3, gisco_id) %>%
dplyr::left_join(y = readr::read_csv(file = "lau_nuts_concordance_by_geo/lau_2019_nuts_2016_concordance_by_geo.csv", col_types = cols(
gisco_id = col_character(),
country = col_character(),
nuts_2 = col_character(),
nuts_3 = col_character(),
lau_id = col_character(),
lau_name = col_character(),
population = col_double(),
area_km2 = col_double(),
year = col_double(),
fid = col_character()
)) %>%
transmute(nuts_2_by_geo = nuts_2, nuts_3_by_geo = nuts_3, gisco_id),
by = "gisco_id") %>%
mutate(nuts_2_different = nuts_2_official != nuts_2_by_geo,
nuts_3_different = nuts_3_official != nuts_3_by_geo)
# Compare only LAUs that have a NUTS 2 code in both the official and the geometry-based table
concordance_check_df %>%
  filter(!is.na(nuts_2_official), !is.na(nuts_2_by_geo)) %>%
group_by(country) %>%
summarise(total_nuts_2_different = sum(nuts_2_different),
total_nuts_3_different = sum(nuts_3_different),
total = n()) %>%
knitr::kable()
```
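The comparison between official and geometry-based codes boils down to counting, per country, the LAUs for which the two sources disagree, after leaving out those that are missing from either table. A minimal base R sketch on invented data (country codes and NUTS codes are placeholders):

```r
# Toy data: one row per LAU, with the NUTS 3 code from each source.
# All codes are invented for illustration.
check <- data.frame(
  country = c("AA", "AA", "AA", "BB", "BB"),
  nuts_3_official = c("AA011", "AA012", NA, "BB011", "BB012"),
  nuts_3_by_geo = c("AA011", "AA013", "AA012", "BB011", "BB012"),
  stringsAsFactors = FALSE
)

# Only LAUs present in both sources can be compared
comparable <- check[!is.na(check$nuts_3_official) &
                      !is.na(check$nuts_3_by_geo), ]
comparable$different <- comparable$nuts_3_official != comparable$nuts_3_by_geo

# Number of LAUs with diverging codes, by country
summary_df <- aggregate(different ~ country, data = comparable, FUN = sum)
summary_df
```

Dropping rows with missing codes before comparing matters: comparing a code with `NA` yields `NA`, which would otherwise propagate to the country totals.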
## Accessing the dataset
The following datasets generated with this approach (i.e. attributing each LAU to the NUTS region where the largest part of its area is located according to available geometries) are currently available.
N.B. Unless you are looking for a specific combination of LAU and NUTS, or you really want them matched by geometry, you probably want to download the [main datasets](datasets.html), which use the official concordance tables and fall back on these datasets only when the relevant data are missing.
- [LAU 2020/NUTS 2021](lau_nuts_concordance_by_geo/lau_2020_nuts_2021_concordance_by_geo.csv) for all of Europe, or [by country](lau_nuts_concordance_by_geo/lau_2020_nuts_2021_concordance_by_geo)
- [LAU 2020/NUTS 2016](lau_nuts_concordance_by_geo/lau_2020_nuts_2016_concordance_by_geo.csv) for all of Europe, or [by country](lau_nuts_concordance_by_geo/lau_2020_nuts_2016_concordance_by_geo)
- [LAU 2019/NUTS 2016](lau_nuts_concordance_by_geo/lau_2019_nuts_2016_concordance_by_geo.csv) for all of Europe, or [by country](lau_nuts_concordance_by_geo/lau_2019_nuts_2016_concordance_by_geo)
- [LAU 2018/NUTS 2016](lau_nuts_concordance_by_geo/lau_2018_nuts_2016_concordance_by_geo.csv) for all of Europe, or [by country](lau_nuts_concordance_by_geo/lau_2018_nuts_2016_concordance_by_geo)
- [LAU 2017/NUTS 2016](lau_nuts_concordance_by_geo/lau_2017_nuts_2016_concordance_by_geo.csv) for all of Europe, or [by country](lau_nuts_concordance_by_geo/lau_2017_nuts_2016_concordance_by_geo)
- [LAU 2016/NUTS 2016](lau_nuts_concordance_by_geo/lau_2016_nuts_2016_concordance_by_geo.csv) for all of Europe, or [by country](lau_nuts_concordance_by_geo/lau_2016_nuts_2016_concordance_by_geo)
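When reading these files, it is worth declaring identifier columns as character, since numeric parsing would drop leading zeros from codes such as `lau_id`. The sketch below writes a one-row stand-in file to a temporary path and reads it back with base R. The column names are the ones used when these files are read elsewhere in this document; the column order and the values in the row are invented.

```r
# Stand-in for a downloaded file: the row below is invented for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "gisco_id,country,nuts_2,nuts_3,lau_id,lau_name,population,area_km2,year,fid",
  "XX_00101,XX,XX01,XX011,00101,Example town,1200,15.4,2020,XX_00101"
), csv_path)

# Declare identifiers as character so that codes such as "00101"
# are not parsed as the number 101
concordance <- read.csv(
  csv_path,
  colClasses = c(
    gisco_id = "character",
    country = "character",
    nuts_2 = "character",
    nuts_3 = "character",
    lau_id = "character",
    lau_name = "character",
    population = "numeric",
    area_km2 = "numeric",
    year = "numeric",
    fid = "character"
  ),
  encoding = "UTF-8"
)

str(concordance)
```

The same column types can be passed to `readr::read_csv()` via `col_types`, which is what the code chunks in this document do.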
Datasets with the area of each LAU that falls within each NUTS region are available in the [lau_nuts_area](https://github.com/EDJNet/lau_centres/tree/main/lau_nuts_area) folder. They are likely most useful for pre-caching when processing large amounts of data.