Intercept-Federal-Prosecutors

A Nationwide Investigation for Federal Prosecutors

Code for BU Spark! Summer 2021

The code for this project relies on the Google Cloud Vision API. To install it and get started, follow steps 1-6 of this codelab: https://codelabs.developers.google.com/codelabs/cloud-vision-api-python#0. Then use this link to set up the API authentication key and save it to your computer: https://cloud.google.com/vision/docs/libraries. Be sure to set the path to the key as an environment variable using a .env file.
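
As a rough illustration of the .env step, the sketch below reads KEY=VALUE lines from a .env file into the process environment using only the standard library. It assumes the file contains a line such as `GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json` (the variable name the Google client libraries look for); the helper name `load_env_file` is hypothetical and is not taken from this repository.

```python
import os


def load_env_file(path=".env"):
    """Read KEY=VALUE lines from a .env file into os.environ.

    Hypothetical helper for illustration; the project scripts may load
    the file differently.
    """
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop surrounding whitespace and optional quotes around the value.
            os.environ[key.strip()] = value.strip().strip('"').strip("'")
    # Return the credentials path if the file defined it, else None.
    return os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
```

Calling `load_env_file()` before creating a Vision client lets the client library find the key file. The `python-dotenv` package offers `load_dotenv()` for the same purpose; whether this project uses it is not stated here.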

The Google Cloud Storage bucket contains folders for PDF files and text files. Each of these folders is organized by circuit and then by case ID. Each case's text folder holds the output JSON files containing the raw text of the case, plus a .txt file containing the full text only if the case mentions "prosecutorial misconduct". Because the text can be extracted from the JSON files, the text of a case without a .txt file can still be recovered later if desired.
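
A minimal sketch of recovering the text from one output JSON file follows. It assumes the file has the shape the Vision API writes for asynchronous PDF OCR, namely a top-level `responses` list whose items carry `fullTextAnnotation.text`, one item per page. The function names are hypothetical, and the phrase check is only an approximation of whatever FilterDecisions.py actually does.

```python
import json

PHRASE = "prosecutorial misconduct"


def text_from_vision_output(json_str):
    """Join the per-page text found in one Vision output JSON document."""
    data = json.loads(json_str)
    pages = []
    for response in data.get("responses", []):
        # Pages with no detected text may lack fullTextAnnotation entirely.
        annotation = response.get("fullTextAnnotation") or {}
        pages.append(annotation.get("text", ""))
    return "\n".join(pages)


def mentions_phrase(text, phrase=PHRASE):
    """Case-insensitive check that tolerates line breaks inside the phrase."""
    normalized = " ".join(text.lower().split())
    return phrase in normalized
```

Normalizing whitespace before the check matters because OCR output often breaks lines in the middle of a phrase.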

How to Run Scraper (Date last scraped: July 14):

Bash Script:

1. Inside ScrapeDecisions.py, change the date to the date of the last scrape
2. Inside the project directory, open a terminal and run: sh pipeline.bash (this can take several hours to finish)
3. The result will be three spreadsheets: newCases.csv (lists all newly identified cases), newCasesWithMentions.csv (lists the cases that mention prosecutorial misconduct) and scores_added.csv (the final output: cases plus scores based on keywords)

To run each part of the scraper individually, do so in this order:

1. ScrapeDecisions.py: Submits GET requests to the API to find all new cases on the appellate court website and saves their PDF links in a spreadsheet called NewCases.csv (the date must first be changed to the last scraped date)
2. download_pdfs_v2.py: Downloads all PDFs listed in NewCases.csv and uploads them to GCS (~50 minutes for 6,000 cases)
3. FilterDecisions.py: Extracts all text from the PDFs using the Vision API, saves the text if and only if it mentions prosecutorial misconduct, and saves a copy of the NewCases spreadsheet containing only the cases that had mentions. Returns newCasesWithMentions.csv. (14 hours for ~6,000 cases)
4. FindDecisionKeywords.py: Checks the text of every case listed in a spreadsheet for certain keywords and adds score columns based on how often those keywords appear. Returns scores_added.csv.
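
The frequency-based scoring in step 4 can be sketched roughly as below. This is not the code in FindDecisionKeywords.py: the function name and the example keywords are hypothetical, and the real keyword lists and scoring rules may differ.

```python
import re


def keyword_score(text, keywords):
    """Count whole-word, case-insensitive occurrences of each keyword."""
    lowered = text.lower()
    counts = {}
    for keyword in keywords:
        # \b keeps "violation" from matching inside "violations".
        pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
        counts[keyword] = len(re.findall(pattern, lowered))
    return counts


# Hypothetical keywords, for illustration only.
example = keyword_score(
    "The Brady violation was a violation. Violations?",
    ["violation", "brady violation"],
)
total_score = sum(example.values())
```

Each count could then be written to its own column next to the case row, with the total as an overall score.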

Repository files:

.gitignore
FilterDecisions.py
FindDecisionKeywords.py
README.md
ScrapeDecisions.py
download_pdfs_v2.py
misconduct_words.json
no_misconduct.json
pipeline.bash
requirements.txt
