-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
257 lines (170 loc) · 10.3 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
---
title: "Untitled"
author: "Paul Oldham"
date: "12/06/2018"
output: html_document
---
[![Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.](http://www.repostatus.org/badges/latest/wip.svg)](http://www.repostatus.org/#wip)
[![Build Status](https://travis-ci.org/poldham/places.svg?branch=master)](https://travis-ci.org/poldham/places)
---
[![minimal R version](https://img.shields.io/badge/R%3E%3D-3.4.0-6666ff.svg)](https://cran.r-project.org/)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/places)](https://cran.r-project.org/package=places)
[![packageversion](https://img.shields.io/badge/Package%20version-0.0.0.9000-orange.svg?style=flat-square)](commits/master)
---
[![Last-changedate](https://img.shields.io/badge/last%20change-`r gsub('-', '--', Sys.Date())`-yellowgreen.svg)](/commits/master)
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
library(tidyverse)
```
## Introduction
The `places` package provides access to the daily data files from the [geonames data dump](http://download.geonames.org/export/dump/). The aim of the package is to make the geonames dump consisting of over 11 million place names with geographic coordinates available for large scale mapping or text mining projects involving multiple countries or world regions.
`places` is intended to complement the [rOpenSci `geonames()` package](https://github.com/ropensci/geonames) by Barry Rowlingson. The geonames package provides access to the geonames API and is recommended for smaller scale projects. The `places` package is better suited for larger scale projects involving text mining and mapping place names from literature or other projects involving federating different kinds of data with geoinformation at scale. The package also adds additional tables that make it easier to select subsets of data for geographic regions or subregions or economic status using United Nations and World Bank datasets.
The places package was developed with support from the Research Council of Norway (RCN project number 257631/E10) as part of the Biospolar Project.
The purpose of this walk through is to take you through how to access the data, what is available, and decisions on issues such as tidying the data.
## Installing
places is not on CRAN but can be installed from Github with devtools:
```{r eval=FALSE}
devtools::install_github("poldham/places")
```
## The geonames data dump
The geonames dump is a set of daily dump files available from the export home page [http://download.geonames.org/export/dump/](http://download.geonames.org/export/dump/). This page provides .zip files for individual country data along with other data files (such as alternate names) in text files and links to other directories. The mix of files makes it messy to work with.
```{r echo=FALSE, eval=FALSE}
knitr::include_graphics("https://github.com/poldham/places/blob/master/inst/images/genames_part1.png")
```
```{r echo=FALSE, eval=FALSE}
knitr::include_graphics("https://github.com/poldham/places/blob/master/inst/images/geonames_part2.png")
```
The `places_table()` function downloads and parses the page into a tibble to make it easier to work with.
```{r}
library(places)
places_table()
```
The original [table](http://download.geonames.org/export/dump/) contained the following columns:
- name
- last_modified,
- size
- description (empty)
To make the table easier to work with `places_table()` separates out the files into `iso` (for country files) and `other` for other files. `File_name`, `file_type` and `url` for file paths are added.
## Looking Up Countries
The geonames data mainly works on two letter country codes (variously called iso and iso2c). Country names can be expressed in all kinds of different ways. If you don't know the two letter country code you can look it up with `places_lookup()`.
```{r}
places::places_lookup("Kenya")
```
We can also look up ambiguous names:
```{r}
places::places_lookup(c("Peoples republic of china", "Viet Nam", "Vietnam", "Lao", "Laos"))
```
Behind the scenes the name matching is handled by the [countrycode](https://github.com/vincentarelbundock/countrycode) package by Vincent Arul Bundock and collaborators. Country names can be expressed in all kinds of different ways and the countrycode package does a good job of recognising them. At present the places package only implements country names in English.
We use the countrycode package for lookup because of its flexibility. However, Geonames produces its own [countryinfo table](http://download.geonames.org/export/dump/countryInfo.txt) that can be imported as follows.
```{r}
countryinfo <- places::places_countryinfo()
countryinfo
```
The advantage of this table is that it includes information such as the capital city, the area in square kilometres, the population of the capital city, continent and so on. Some of this data is included in the `regions` table (below). However, for general lookup of country names `places_lookup()` will be more flexible and forgiving.
## Individual Country data
You can download the latest raw data for an individual country using `places_download()` and import it as a data frame with tidy column names using `places_import()`. A `download_date` column is added by default to assist with keeping track of the file history.
Download and import and presently handled in two steps to avoid assigning to the global environment. The pipe `%>%` is built into places.
```{r}
KE <- places_download(code = "KE") %>%
places_import()
KE
```
`places_download` uses `places_lookup` internally meaning that you can simply enter a country name using the `country =` argument.
```{r}
anguilla <- places_download(country = "anguilla") %>%
places_import()
anguilla
```
`places_download` can handle different cases and will fail fast on common errors.
```{r eval=FALSE}
places_download(code = "Kenya")
```
or
```{r eval=FALSE}
places_download(country = "KE")
```
If you try and download more than one file at a time things will quickly go wrong.
```{r eval=FALSE}
places_download(code = c("AI", "GB"))
```
`places_download()` is not vectorised and only handles one country file at a time. For multiple countries or regions it is easier to work with the `allcountries` table.
### All Countries
Geonames produces a daily file containing the data for all countries. The `allcountries` daily file contains over 11 million place names in a +330MB compressed file that is 1.4Gb when uncompressed.
For many purposes you may be happy with a highly compressed archive of the allcountries file. The archive was created on the 1st of January 2018 as a .rda file and is a 257MB compressed .rda file that can be called with:
```{r eval=FALSE}
places_archive()
load("allcountries.rda")
```
This will take a few minutes to download and then to load... so maybe take a break for a cup of tea.
If you would like to retrieve the latest data file, use:
```{r eval=FALSE}
allcountries <- places_download(country = "allcountries") %>% places_import()
```
You may want to have another cup of tea while waiting for this.
### Place names
The geonames tables contains three name fields. The asciiname is the most useful for text mining and you may want to take a look at the alternatenames field for known variants. The built in Kenya dataset (KE) can be useful for getting to grips with the data.
```{r}
library(tidyverse)
places::KE %>%
separate_rows(alternatenames, sep = ",")
```
### Feature Codes
Geonames uses [feature codes](http://www.geonames.org/export/codes.html) to describe the georeferenced data. Feature codes are divided between classes and codes. The classes are as follows:
- A Administrative Boundary Features
- H Hydrographic Features
- L Area Features
- P Populated Place Features
- R Road / Railroad Features
- S Spot Features
- T Hypsographic Features
- U Undersea Features
- V Vegetation Features
An example of a corresponding code is:
- H.ANCH anchorage
- R.OILP oil pipeline
Note that the original featurecode table concatenated the class and feature code as in the examples above, but in the actual country and `allcountries` files they are separated into `feature_class` and `feature_code`. This makes joining awkward. To solve this a new field called `feature_full` is created at import while dropping the feature_class and feature code fields to prevent duplicates when joining. That is simpler than it sounds.
To view the featurecodes use:
```{r}
featurecodes
```
```{r echo=FALSE}
load("data/KE.rda")
load("data/featurecodes.rda")
```
To join the featurecodes table onto a dataset you could use:
```{r}
KE <- dplyr::left_join(KE, places::featurecodes, by = "feature_full")
KE %>% dplyr::select(feature_full, feature_name, name, longitude, latitude)
```
The `featurecodes` table, includes the feature code table divided into four columns that were parsed from the original file.
- feature_full (for joining)
- feature name (the short description)
- feature detail (a longer description)
- feature_clean (basic cleaning on the feature name to remove plurals such as forest(s))
- MULTI A logical field indicating whether the feature_clean fields contains multi word phrases (TRUE)
You may wish to join this table to the main table or to investigate the codes to use as filters. The feature code for a mountain is "MT" but to get all mountainous features you may need to look up other codes (e.g. MTS and MTU)
```{r}
KE %>% filter(feature_code == "MT" | feature_code == "MTS")
```
### Regions, Sub-Regions, intermediate regions and continents
To aid with multicountry analysis the package includes two add on tables.
1. The [United Nations regions (M49)](https://unstats.un.org/unsd/methodology/m49/overview/) from the United Nations Statistical Division
2. World Bank regional divisions (through the World Development Indicators ([WDI](https://github.com/vincentarelbundock/WDI)) package).
You can call these table directly:
```{r}
unregions
```
For the World Bank WDI indicators from the WDI package:
```{r}
wdicountry
```
Additional regional information is available from the countrycode package and may be incorporated into places in future.
The two regional tables are combined with selected fields from the geonames countryinfo table.
The regions file can be called as follows:
```{r}
regions
```
To join the regions table with a set of results try:
```{r eval=FALSE}
df <- dplyr::left_join(df, regions, by = "iso")
```