This repository contains data, analytic code, and findings that support portions of the BuzzFeed News article, “She Paid A Lawyer Thousands Of Dollars To Apply For A Green Card. Then She Got A Deportation Order Instead.,” published September 29, 2018. Please read that article, which contains important context and details, before proceeding.
This analysis uses three spreadsheets containing complaints filed with the Federal Trade Commission (FTC) in relation to "immigration services." The spreadsheets come from two sources:
- The FTC provided redacted records from July 2015 through May 2018 in response to a BuzzFeed News FOIA request. The two files provided are:
  - consumer_compls_redacted_2018.xlsx: Raw data of the complaints filed between Jan. 1, 2018 and May 7, 2018
  - consumer_compls_redacted_2015_2017.xls: Raw data of the complaints filed between July 1, 2015 and Dec. 31, 2017
- Matthew Blaisdell of the American Immigration Lawyers Association (AILA) provided a spreadsheet of FTC complaints from 2006 through 2015 (the spreadsheet also contained one entry from 2000):
  - consumer_compls_redacted_2006_2015.xlsx: Raw data of the complaints filed between 2006 and July 15, 2015 (as well as two pivot tables, one tallying complaints by month and another by geolocation)
Each of the spreadsheets contains, among others, the following columns relevant to the analysis:
- Created Date: The date the complaint was received.
- Reference Number: A unique reference number associated with each complaint.
- Complaint Info Comments: Prose associated with the complaint (redacted, however, in consumer_compls_redacted_2015_2017.xls).
The notebook 2018-09-ftc-analysis.ipynb performs two analyses:

Part 1:
- Merge the three spreadsheets received from the FTC and AILA
- Calculate the monthly count of complaints for each spreadsheet
- Calculate a yearly tally of complaints based on these counts

Part 2:
- Filter the data down to rows that contain prose in the Complaint Info Comments column. (Some are redacted.)
- Create a dataframe containing one row per word in each complaint. Each row contains a complaint ID, the word, and the word's lemma.
- For each lemmatized word, count the total number of times it appeared, how many distinct complaints it appeared in, and the lemma's different variations in the complaints.
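The complaint-counting steps above (monthly counts rolled up into a yearly tally) can be illustrated with a minimal sketch. This is not the notebook's code: it uses only the Python standard library rather than dataframes, and the sample rows, the function names monthly_counts and yearly_counts, and the reference numbers are all invented for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical sample rows standing in for the merged spreadsheets:
# (Reference Number, Created Date). These values are invented.
complaints = [
    ("A-001", date(2017, 11, 3)),
    ("A-002", date(2017, 11, 20)),
    ("A-003", date(2017, 12, 1)),
    ("A-004", date(2018, 1, 15)),
]


def monthly_counts(rows):
    """Count complaints per (year, month) of their created date."""
    return Counter((created.year, created.month) for _, created in rows)


def yearly_counts(monthly):
    """Roll the monthly counts up into a per-year tally."""
    yearly = Counter()
    for (year, _month), n in monthly.items():
        yearly[year] += n
    return yearly


monthly = monthly_counts(complaints)
yearly = yearly_counts(monthly)

print(sorted(monthly.items()))
print(sorted(yearly.items()))
```

Deriving the yearly tally from the monthly counts, rather than recounting the raw rows, mirrors the order of the steps listed above.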
The results of "Part 2" above are saved as output/word_counts.csv.
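As a rough sketch of the aggregation behind "Part 2", the example below takes rows of (complaint ID, word, lemma) and computes, per lemma, the total count, the number of distinct complaints, and the word variations seen. It is an assumption-laden illustration, not the notebook's implementation: the rows are invented, the lemmas are hard-coded instead of coming from spaCy, the function name summarize_lemmas is hypothetical, and the output fields are not necessarily the column names used in output/word_counts.csv.

```python
from collections import defaultdict

# Hypothetical rows: (complaint ID, word, lemma). In the real analysis the
# lemma would come from spaCy; here it is hard-coded for illustration.
rows = [
    ("A-001", "paid", "pay"),
    ("A-001", "pays", "pay"),
    ("A-002", "paying", "pay"),
    ("A-002", "lawyer", "lawyer"),
    ("A-003", "lawyers", "lawyer"),
]


def summarize_lemmas(rows):
    """Aggregate per-lemma totals, distinct complaints, and variations."""
    stats = defaultdict(
        lambda: {"total": 0, "complaints": set(), "variations": set()}
    )
    for complaint_id, word, lemma in rows:
        entry = stats[lemma]
        entry["total"] += 1                   # every appearance counts
        entry["complaints"].add(complaint_id)  # distinct complaints only
        entry["variations"].add(word.lower())  # surface forms of the lemma
    return {
        lemma: {
            "total": e["total"],
            "complaints": len(e["complaints"]),
            "variations": sorted(e["variations"]),
        }
        for lemma, e in stats.items()
    }


summary = summarize_lemmas(rows)
for lemma, info in sorted(summary.items()):
    print(lemma, info)
```

Tracking complaint IDs in a set keeps the "distinct complaints" figure separate from the raw total, so a word repeated many times in one complaint does not inflate how widespread it appears.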
You can run the analysis yourself. To do so, you'll need the following installed on your computer:
- Python 3
- The Python libraries specified in requirements.txt
Additionally, before running the analysis, you'll need to install spaCy's English language model. You can do so by running the following command in your terminal:

python -m spacy download en

(The analysis uses spaCy, a natural language processing library for Python, to tokenize and lemmatize the text of the complaints.)
All code in this repository is available under the MIT License. The data file in the output/ directory is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. All files in the data/ directory are released into the public domain.
Contact Lam Thuy Vo at lam.vo@buzzfeed.com. Looking for more from BuzzFeed News? Click here for a list of our open-sourced projects, data, and code.