forked from aaronrudkin/autumn
-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
182 lines (132 loc) · 9.17 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE, include=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
library(autumn)
library(bench)
library(tidyverse)
```
# autumn: Fast, Modern, and Tidy Raking <img src="man/figures/autumn.png" align="right" width="120" />
> *"And as to me, I know nothing else but miracles"*
> - Walt Whitman, probably talking about this package.
[![Travis-CI build status](https://travis-ci.org/r-lib/pkgdown.svg?branch=master)](https://travis-ci.org/aaronrudkin/autumn)
[![Coverage Status](https://coveralls.io/repos/github/aaronrudkin/autumn/badge.svg?branch=master)](https://coveralls.io/github/aaronrudkin/autumn?branch=master)
Iterative proportional fitting (raking) is a straightforward and fast way to generate weights which ensure a dataset reflects known target marginal distributions: put simply, survey professionals use raking to ensure that samples represent the population they are drawn from.
Existing `R` implementations of raking are frustrating to use, have antiquated syntax, require external dependencies or compilation, have inadequate documentation, generate difficult to understand errors, run slowly, and don't support "tidy" workflows. **autumn** is a modern package built from the ground up to fix these problems.
## Installation
**autumn** will be submitted to CRAN in January 2020. In the meantime, you can install it using the following command:
```{r, eval = FALSE}
# Install GitHub version:
devtools::install_github("aaronrudkin/autumn")
```
## Usage
The workhorse function of **autumn** is `harvest()`, which takes at minimum two arguments: 1) a data.frame (or tibble) containing data; 2) target proportions. At its simplest, a call to `harvest()` works as follows:
```{r eval=FALSE}
# Standard R function call
harvest(respondent_data, ns_target)
# Using `magrittr`'s pipe operator
respondent_data %>% harvest(ns_target)
```
It just works! This function call will iteratively weight observations to match the target proportions and add a column `weights` to the data frame (it is also possible to rename the column or return the weights as a vector). Default parameters are helpful and sane: weights are guaranteed mean 1 and maximum 5.
### Specifying a Target
The main challenge when running `harvest()` is to correctly specify target proportions. Two formats are supported: 1) a list of named vectors; 2) a data.frame or tibble.
When supplying targets as a list of named vectors, it looks like this:
```{r, eval=FALSE}
list(
gender = c(Male = 0.4829, Female = 0.5171),
region = c(Midwest = 0.2086,
Northeast = 0.1764,
South = 0.3775,
West = 0.2374)
)
```
Each list element should match the name of a single variable in the data, and each vector name should match a value the variable can take. The numeric values should be positive and sum to 1 within each variable.
When supplying data as a data.frame or tibble, the data.frame should have three columns (by default `harvest()` looks for columns named "variable", "level", and "proportion" -- although these names can be overridden):
```{r echo=FALSE}
target_tbl = as_tibble(autumn:::list_targets_to_df(ns_target[c("gender", "region")])) %>% mutate(proportion = round(proportion, 4))
```
```{r}
target_tbl
```
### Advanced Usage
**autumn** supports a variety of advanced features including:
- Supplying starting weights
- Adjusting maximum weights
- Adjusting convergence and iteration criteria
- Adjusting variable selection and error calculation criteria
- Handling missing data appropriately
- Calculating design effects for produced weights
- Summarizing raking results
Interested in doing something fancy? Check out our R vignettes for more details:
TODO VIGNETTES GO HERE
## Speed `r emo::ji("rocket")`
How fast is **autumn**? Fast.
Below, we present results of three different benchmark scenarios, each using real data (the first two benchmarks use the `respondent_data` and `ns_target` datasets included with **autumn**). All of these benchmarks use identical data and default parameterizations, and were run on a low power 2016-vintage personal computer. The larger the the dataset and the more complicated the rake, the more you benefit from using **autumn**. Customizing convergence criteria to allow for earlier termination can result in further speed improvements over existing software.
*Note:*
### Small scale
This benchmark generates weights for a dataset of 6,691 observations, raking on 10 variables. Compared with the implementation in **anesrake**, **autumn** is about *67% faster* and allocates one third less memory. Compared with the implementation in **survey**, **autumn** is about *4X as fast* and allocates 20% more memory.
```{r echo=FALSE}
# Pre-compute the result above on real data and store as a pre-made expression
# so that we can
result_small_benchmark = structure(list(expression = c("autumn", "anesrake", "survey"),
min = structure(c(1.34557561800001, 2.462163475, 4.56870103200001
), class = c("bench_time", "numeric")), median = structure(c(1.73368290000001,
2.892148122, 6.756519067), class = c("bench_time", "numeric"
)), `itr/sec` = c(0.572472241007729, 0.314504693882058, 0.147574009494806
), mem_alloc = structure(c(774216136, 1189511664, 644706568
), class = c("bench_bytes", "numeric")), `gc/sec` = c(3.00547926529058,
2.42483118983067, 0.866259435734511), n_itr = c(100L, 100L,
100L)), class = c("bench_mark", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -3L))
result_small_benchmark
```
### Medium scale
Consider a raking task that is more difficult to converge: the same dataset (6,691 observations) raked on 17 variables. The extra variables involve interactions which greatly complicate convergence. **autumn** is *three times as fast* as **anesrake** and uses almost two thirds less memory (**survey** will not complete the rake):
```{r echo=FALSE}
result_big_benchmark = structure(list(expression = c("autumn", "anesrake"), min = structure(c(2.29740926100001,
8.409736236), class = c("bench_time", "numeric")), median = structure(c(3.0174227935,
9.993015223), class = c("bench_time", "numeric")), `itr/sec` = c(0.327217747698054,
0.0945252613310439), mem_alloc = structure(c(1305062264, 3338550552
), class = c("bench_bytes", "numeric")), `gc/sec` = c(6.57707672873089,
4.76974468676447), n_itr = c(100L, 100L), n_gc = c(2010, 5046
)), class = c("bench_mark", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-2L))
result_big_benchmark
```
### Large scale
Finally, consider an extremely resource intensive problem: raking a much larger dataset of 108,660 observations on 17 variables. In this scenario, **autumn** is 11 times faster and uses 92% less memory. (This benchmark is limited to 10 iterations):
```{r echo=FALSE}
# result_mega_benchmark = bench::mark(
# "autumn" = { autumn::harvest(result_mega, ns_target) },
# "anesrake" = { anesrake::anesrake(ns_target,
# junk_data_mega,
# junk_data_mega$ResponseID) },
# check = FALSE,
# min_iterations = 10
# ) %>% select(1:8) %>%
# mutate(expression = c("autumn", "anesrake"))
result_mega_benchmark = structure(list(expression = c("autumn", "anesrake"), min = structure(c(47.098076077,
489.156242585), class = c("bench_time", "numeric")), median = structure(c(48.7915496325,
527.8651931685), class = c("bench_time", "numeric")), `itr/sec` = c(0.0200139846259755,
0.00189092458918188), mem_alloc = structure(c(22328143016, 256220489208
), class = c("bench_bytes", "numeric")), `gc/sec` = c(2.43570192898122,
2.52211521705079), n_itr = c(10L, 10L), n_gc = c(1217, 13338)), class = c("bench_mark",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -2L))
result_mega_benchmark
```
## Why is the package called "autumn"?
<p align="center">
<img src="man/figures/raking_leaves.jpg" align="center" width="480" />
</p>
## Authorship and Funding
**autumn** is written and maintained by [Aaron Rudkin](https://github.com/aaronrudkin/). Target proportions in the included `ns_target` data were developed by [Alex Rossell-Hayes](https://github.com/rossellhayes).
If you have any comments, issues, or concerns, please [open a GitHub issue](https://github.com/aaronrudkin/autumn/issues). Contributions are welcome. Please see our [Contributor Code of Conduct](.github/CODE_OF_CONDUCT.md) for details.
**autumn** was developed in conjunction with [Democracy Fund + UCLA Nationscape](https://www.voterstudygroup.org/nationscape), one of the largest public opinion surveys ever conducted. UCLA's Nationscape team are: [Tyler Reny](http://tylerreny.github.io/), [Alex Rossell-Hayes](http://alexander.rossellhayes.com/), [Aaron Rudkin](https://github.com/aaronrudkin/), [Chris Tausanovitch](http://www.ctausanovitch.com/), and [Lynn Vavreck](https://www.lynnvavreck.com/). Funding for this project was provided by [Democracy Fund](https://www.democracyfund.org/), part of the [Omidyar Group](http://omidyargroup.com/).
![UCLA + Democracy Fund](man/figures/logo_ucla_demfund.png "UCLA + Democracy Fund")
Package hex logo adapted from art by [Freepik](https://www.flaticon.com/authors/Freepik) from [flaticon.com](https://www.flaticon.com/)