A collection of scripts and notes used for The Accountability Project.
Files are typically organized by either state or federal agency. This repository serves as an R project and many files are written in R, but SQL and Python files are also kept here and should run normally.
To begin working on the project, clone the master branch of this repository.
git clone git@github.com:irworkshop/accountability_datacleaning.git
If you're submitting documentation on a new collection of data, you can create a new branch, make all your changes and additions, and then submit a pull request that somebody on the TAP team can review and approve.
git branch md_contribs
git checkout md_contribs
When working in R with RStudio, open the tap.Rproj file in RStudio.
This will allow you to create and edit data documentation using the proper file
hierarchy. File paths then resolve relative to the project root, so your files
should run as written and save output to the right location.
Data is organized by state at the top level of the R project, with each state's files organized into sub-directories by data type (e.g., contributions, expenditures, lobbyists, voters, salaries).
In each data type directory, files are then typically organized by their file type:
- docs/ for code, diaries, and keys
- data/raw/ for immutable raw data
- data/clean/ for processed data
- plots/ for exploratory graphics
md/contribs/
├── data
│   ├── clean
│   │   └── md_contribs_clean.csv
│   ├── dupes.csv.xz
│   ├── fix_file.txt
│   └── raw
│       ├── ContributionsList-2019.csv
│       ├── ContributionsList-2020.csv
│       └── ContributionsList-2021.csv
├── docs
│   ├── md_contribs_diary.Rmd
│   └── md_contribs_diary.md
└── plots
    ├── amount_histogram.png
    └── year_bar.png
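If you are starting a new state and data type, the layout above can be sketched in base R as follows. The make_skeleton() helper is hypothetical and not part of this repository or of campfin; the example builds the folders under a temporary directory rather than the real project.

```r
# Hypothetical helper: create the sub-directories for one state and data type.
make_skeleton <- function(root, st, type) {
  base <- file.path(root, st, type)
  subs <- c(
    "docs",
    file.path("data", "raw"),
    file.path("data", "clean"),
    "plots"
  )
  dirs <- file.path(base, subs)
  for (d in dirs) {
    # recursive = TRUE also creates any missing parent directories
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(dirs)
}

# Build the layout in a temporary location, not in the real project.
root <- tempfile("tap_")
dirs <- make_skeleton(root, st = "md", type = "contribs")

# List what was created, directories included.
list.files(file.path(root, "md", "contribs"), recursive = TRUE, include.dirs = TRUE)
```

To use it inside the project, you would pass the project root instead of a temporary path.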
Data is collected from the individual states or agencies. Data is typically public record, but not all data is easily accessible from the internet. Some states provide data as bulk downloads from a website, while others deliver it in hard copy for a nominal fee in response to a records request.
We are standardizing public data on a few key fields by thinking of each dataset row as a transaction or registration. For each row there should typically be:
- All parties to a transaction
- The date of the transaction
- Any amount of money involved
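As a rough illustration of the three key fields above, the sketch below checks a toy table for them. The data and every column name (contributor, recipient, date, amount) are made up for the example; real sources will use their own names.

```r
# Toy contribution records; all names and values here are hypothetical.
contribs <- data.frame(
  contributor = c("Jane Doe", "John Roe", NA),
  recipient   = c("Friends of Smith", "Friends of Smith", "Jones for Senate"),
  date        = as.Date(c("2020-01-15", "2020-02-01", "2020-03-10")),
  amount      = c(250, 100, 50),
  stringsAsFactors = FALSE
)

# The parties, the date, and the amount we expect on every row.
key_fields <- c("contributor", "recipient", "date", "amount")

# Which key fields are absent from the table entirely?
missing_cols <- setdiff(key_fields, names(contribs))

# Which rows have at least one key field empty?
incomplete <- !complete.cases(contribs[, key_fields])
contribs[incomplete, ]
```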
Data manipulation follows the IRW data cleaning guide to achieve the following objectives:
- How many records are in the database?
- Check for duplicate records, if that might be a problem.
- Check numeric and date ranges. Anything too high or too low?
- Is there anything blank or missing?
- Is there information in the wrong field?
- Check for consistency issues - particularly on city, state and ZIP.
- Create a five-digit zip code variable if one does not exist.
- Create a four-digit year field from the transaction date.
- Make sure there is both a donor and recipient for transactions.
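Several of the checks above can be sketched in a few lines. This is a minimal illustration on made-up data, not the code used in our diaries: it uses base R only so that it runs without extra packages, and the column names are hypothetical. Treating a ZIP with fewer than five digits as missing is an assumption of this sketch; you may prefer to pad or repair such values.

```r
# Toy records; column names and values are hypothetical.
df <- data.frame(
  donor     = c("Jane Doe", "Jane Doe", "John Roe", NA),
  recipient = c("Friends of Smith", "Friends of Smith",
                "Jones for Senate", "Jones for Senate"),
  amount    = c(250, 250, -100, 5e6),
  date      = as.Date(c("2020-01-15", "2020-01-15", "2020-02-01", "2202-03-10")),
  zip       = c("21201-1234", "21201-1234", "2120", "20901"),
  stringsAsFactors = FALSE
)

# How many records are in the database?
n_records <- nrow(df)

# Flag rows that exactly repeat an earlier row.
df$dupe_flag <- duplicated(df)

# Check numeric and date ranges: anything too high or too low?
amount_range <- range(df$amount)
future_dates <- df$date > Sys.Date()

# Is there anything blank or missing? Count NA values per column.
na_count <- colSums(is.na(df))

# Five-digit ZIP: strip non-digits, keep the first five, else NA.
zip_digits <- gsub("[^0-9]", "", df$zip)
df$zip5 <- ifelse(nchar(zip_digits) >= 5, substr(zip_digits, 1, 5), NA_character_)

# Four-digit year taken from the transaction date.
df$year <- as.integer(format(df$date, "%Y"))
```

Flagging problem rows with new columns, rather than deleting them, keeps the original records intact for anyone reviewing the work later.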
The documents in each state’s docs/ folder record the entire process to
promote reproducibility and transparency. The use_diary() function from our own
campfin package creates a simple template diary. This diary contains the steps
we typically take to prepare data:
- Describe
- Import
- Explore
- Wrangle
- Export
campfin::use_diary(
st = "MD",
type = "contribs",
author = "Kiernan Nicholls",
auto = TRUE
)
# ✓ ~/states/md/contribs/docs/md_contribs_diary.Rmd was created
This template should approximate the workflow, but please tweak each section according to your data source and structure (e.g., replace the template column names with the actual names).
The R markdown diary should run/knit from start to finish without errors,
ending with a saved comma-delimited file from readr::write_csv().
The processed CSV file is then uploaded to the Workshop's AWS server where it can be searched from the Accountability Project website.
aws.s3::put_object(
file = "md/contribs/data/clean/md_contribs_2008-2022.csv",
object = "csv/md_contribs_2008-2022.csv",
bucket = "publicaccountability",
acl = "public-read"
)
Knitting the diary should produce a .md markdown file alongside your .Rmd
diary. This markdown file is rendered on GitHub as an HTML page for others to
view your work.
All software used is free and open source. R can be downloaded from CRAN.
The campfin R package has been written by IRW to facilitate exploration and wrangling of campaign finance data. The stable version is on CRAN but the latest version lives on GitHub.
# install.packages("remotes")
remotes::install_github("irworkshop/campfin")
Most cleaning is done using the tidyverse, an opinionated collection of R packages for data manipulation.
install.packages("tidyverse")
If you know of a dataset that you think belongs here, suggest it for inclusion. We’re especially interested in the data that agencies have hidden behind “search portals” or state legislative exemptions. Have you scraped a gnarly records site? Share it with us and we’ll credit you. And more importantly, other people may benefit from your hard work.