#NADC data

Scripts to process data from the Nebraska Accountability and Disclosure Commission (NADC) into something our campaign finance app can import. Run pip install -r requirements.txt to install the dependencies.

The NADC now offers weekly updates of campaign finance data. Every Monday, our script fetches new data, parses it and returns a set of clean files that our app can ingest. The update process:

  1. Run ./parser.sh
  2. If there are new, invalid dates that we haven't encountered in the past, you'll be prompted to add those to canonical/canonical.py

##Overview

A standard data dump from the NADC yields 61 pipe-delimited text files (data dictionary at: nadc_tables.rtf). We extract data from 24 of them:

  • Form A1: Lookup table for most committees
  • Form A1CAND: Candidates tied to campaign committees
  • Form B1: Campaign statements for candidate or ballot question committees
  • Form B1AB: Main table of individual/other contributions
  • Form B1C: Loans to campaign committees
  • Form B1D: Expenditures by campaign committees
  • Form B2: Campaign statements for political party committees
  • Form B2A: Contributions to political party committees
  • Form B2B: Expenditures by political party committees
  • Form B4: Campaign statements for independent committees
  • Form B4A: Contributions to independent committees
  • Form B4B1: Expenditures by independent committees
  • Form B4B2: Federal and out-of-state disbursements
  • Form B4B3: Administrative/operating disbursements
  • Form B5: Late contributions
  • Form B6: Reports of an independent expenditure or donation made by people or entities that are not registered as committees
  • Form B6EXPEND: Expenditures made on behalf of committees by people who do not have an ID
  • Form B7: Registration of corporations, unions and other associations
  • Form B72: Donations by corporations, unions and other associations
  • Form B73: Indirect contributions by corporations, unions and other associations
  • Form B9: Out-of-state expenditures/donations
  • Form B11: Report of late independent expenditure
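As a rough illustration of what reading one of these pipe-delimited files can look like, here is a minimal sketch using Python's standard csv module. The column names and the sample row are made up for the example (the real layouts are in nadc_tables.rtf), and read_pipe_delimited is a hypothetical helper, not a function from this repo.

```python
import csv
import io

# Made-up sample standing in for one NADC export file. The real column
# names are documented in nadc_tables.rtf and are not reproduced here.
SAMPLE = "Committee ID|Committee Name\n99CAC00001| Example Committee \n"


def read_pipe_delimited(handle):
    """Yield each row of a pipe-delimited file as a dict keyed by header."""
    reader = csv.DictReader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Drop overflow fields (key None) and strip stray whitespace.
        yield {
            key: (value or "").strip()
            for key, value in row.items()
            if key is not None
        }


rows = list(read_pipe_delimited(io.StringIO(SAMPLE)))
print(rows)
```

Passing quoting=csv.QUOTE_NONE treats quote characters as ordinary text, which tends to be the safer choice when names in the data may contain unbalanced quotation marks.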

A shell script, parser.sh, makes backups of the raw data and boils down these 24 files (which contain some duplicate transactions) into five data tables:

  • toupload/entity.txt: Any group, committee, donor, lender or other entity that has been assigned a unique ID by the NADC. This is the parent table.
  • toupload/donations.txt: Money, inkind donations and pledges.
  • toupload/candidates.txt: Candidates tied to campaign committees and ballot questions.
  • toupload/loans.txt: Lending.
  • toupload/expenditures.txt: Expenditures.

Then, the script bakes out some SQL to upload to our live server.

##Handling duplication

###Names

NADC has unique identifiers for each donor, but they identify only the address and exact name. If "Matt Wynn" at 1314 Douglas St. gave money, and "Matthew A Wynn" at 1300 Douglas St. gave money, they're considered two different donors.

This is wrong.

We can't deduplicate every donor, so our solution was to create a lookup dictionary (the "CANON" variable in canonical/canonical.py) for any "large" donors, measured by either total donations or number of donations. Super fellow Daniel Wheaton assigned new, truly unique identifiers to the top 100 donors by each measure. Those lists overlapped a bit, so we wound up deduplicating around 70 donors on the first pass. This is why each entity has two IDs, with canonical_id representing our assignment of an identity.

The NADC ID is copied to canonical_id for records that are not deduplicated. Same with the name.
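The lookup-with-fallback behavior described above might be sketched like this. The shape of CANON here is an assumption for illustration only (the real variable lives in canonical/canonical.py and may be structured differently), and the IDs and the canonicalize helper are hypothetical.

```python
# Hypothetical structure and IDs: the real CANON variable in
# canonical/canonical.py may be shaped differently.
CANON = {
    "ID-0001": {"canonical_id": "CANON-1", "canonical_name": "Matt Wynn"},
    "ID-0002": {"canonical_id": "CANON-1", "canonical_name": "Matt Wynn"},
}


def canonicalize(nadc_id, name):
    """Return (canonical_id, canonical_name) for an entity."""
    match = CANON.get(nadc_id)
    if match is None:
        # Not deduplicated: copy the NADC ID and name across unchanged.
        return nadc_id, name
    return match["canonical_id"], match["canonical_name"]


print(canonicalize("ID-0002", "Matthew A Wynn"))
print(canonicalize("ID-9999", "Some Donor"))
```

Both made-up NADC IDs for the same person resolve to one canonical identity, while an ID that is absent from the lookup passes through untouched.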

###Donations

Some donations are recorded in several places. A late donation, for instance, may also show up as a normal donation record in B1AB.

Donations can also be duplicated within a table, inaccurately. For example, a 1999 ballot committee reported each of its donations twice, leading to a vastly inflated fundraising report.

Susan Lorenz at the Nebraska Accountability and Disclosure Commission said donations of the same amount given on the same day to the same organization are duplicates. Therefore, we deduplicated using those three values.
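A minimal sketch of that three-value rule, in plain Python. The field names (amount, date, recipient_id, source) and the sample records are assumptions for illustration, and dedupe_donations is a hypothetical helper rather than the function our script uses.

```python
def dedupe_donations(donations):
    """Keep the first record for each (amount, date, recipient) combination."""
    seen = set()
    unique = []
    for donation in donations:
        # Field names are assumed for this example.
        key = (donation["amount"], donation["date"], donation["recipient_id"])
        if key not in seen:
            seen.add(key)
            unique.append(donation)
    return unique


# Made-up records: the first two match on all three values.
donations = [
    {"amount": "500.00", "date": "2014-10-01", "recipient_id": "C1", "source": "B1AB"},
    {"amount": "500.00", "date": "2014-10-01", "recipient_id": "C1", "source": "B5"},
    {"amount": "500.00", "date": "2014-10-02", "recipient_id": "C1", "source": "B1AB"},
]
unique_donations = dedupe_donations(donations)
print(len(unique_donations))
```

Taken literally, a key built from only these three values would also collapse two different donors who gave the same amount to the same organization on the same day. Adding the donor ID to the key would avoid that, but that is a variation on the rule as stated above.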

##Known problems

###Dates

Some records have invalid dates. We added these to a Python dictionary in nadc/data/canonical/canonical.py and they get fixed on import.

We can't predict the ways dates will be screwed up in the future. Any invalid dates that don't already exist in our lookup get printed to the screen mid-script for you to fix later.
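The fix-on-import idea could look roughly like the sketch below. The MM/DD/YYYY input format, the DATE_FIXES entries and the clean_date helper are all assumptions made for the example; the actual lookup in canonical.py may be keyed and formatted differently.

```python
from datetime import datetime

# Made-up entry: maps a known-bad raw date to its correction.
DATE_FIXES = {"02/30/2010": "2010-02-28"}


def clean_date(raw, fixes=DATE_FIXES, unknown=None):
    """Return an ISO date string, or None if the date is bad and unmapped."""
    try:
        # Assumes raw dates arrive as MM/DD/YYYY.
        return datetime.strptime(raw, "%m/%d/%Y").date().isoformat()
    except ValueError:
        pass
    if raw in fixes:
        return fixes[raw]
    # New kind of bad date: report it so it can be added to the lookup.
    print("Invalid date not in lookup:", raw)
    if unknown is not None:
        unknown.append(raw)
    return None


needs_review = []
print(clean_date("01/04/2016", unknown=needs_review))
print(clean_date("02/30/2010", unknown=needs_review))
print(clean_date("13/45/2011", unknown=needs_review))
```

Collecting the unmapped values in a list, in addition to printing them, makes it easier to paste them into the lookup once the run finishes.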

###Purposeful duplication

To get an exhaustive list of ID'd entities, the script slurps up a half-million records and reduces them to a file about a tenth that size. We use pandas' drop_duplicates function, grep and Python to return a set of deduplicated records with the most recent, comprehensive data.
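To illustrate the "keep the most recent record per ID" idea without depending on pandas, here is a small plain-Python sketch. The field names, the sample records and the latest_per_entity helper are hypothetical; the real script relies on pandas' drop_duplicates along with grep and Python.

```python
def latest_per_entity(records):
    """Keep one record per entity ID, preferring the most recent date."""
    latest = {}
    for record in records:
        current = latest.get(record["id"])
        # ISO date strings sort chronologically, so plain comparison works.
        if current is None or record["date"] > current["date"]:
            latest[record["id"]] = record
    return list(latest.values())


# Made-up records: entity "A" appears twice with different dates.
records = [
    {"id": "A", "name": "Old Name Committee", "date": "2012-03-01"},
    {"id": "B", "name": "Another Committee", "date": "2010-06-15"},
    {"id": "A", "name": "New Name Committee", "date": "2015-09-30"},
]
entities = latest_per_entity(records)
print(entities)
```

In pandas, a comparable result could come from sorting by date and then calling drop_duplicates with subset set to the ID column and keep="last".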

##Data excluded

  • Pre-1999 records, which the NADC does not consider especially reliable.
  • Unpaid pledges, which are recorded as a cash donation if they're actually paid.
  • Late expenditures in Form B11, which sometimes show up in other tables but cannot reliably be deduplicated. As of Jan. 4, 2016, the universe of late expenditures was 475 records spanning more than a decade and totaling $1.35 million.
  • Any transaction involving roughly 250 committees and PACs that lack identifying information because of a "glitch" in the NADC's database.
  • The small number of detail records in Form B6CONT, part of a group of cleanup files for expenditures that aren't recorded elsewhere. It is more trouble than it's worth to pull in.
  • Out-of-state expenditures and donations listed in Form B9A and Form B9B, which are problematically duplicated in other tables.
  • Records associated with candidate committee No. 99CAC00167, Friends of Doug Kristensen, which was related to Mr. Kristensen's candidacy for a delegate slot at the 1996 Republican National Convention. As far as we can tell, this is the only time a candidacy for party convention delegate shows up in the data.