Skip to content

Utilities and applications for the FlatGov project by Demand Progress

License

Notifications You must be signed in to change notification settings

unitedstates/BillMap

Repository files navigation

BillMap: A Demand Progress Project

A live demo of the application is here.

The application version is shown in the top right of the page; it is set by the project’s latest git tag or, if that is not available, by the version string set in _version.py.

This documentation are also available there, at https://BillMap.linkedlegislation.com/static/docs/README.html When the documentation has been updated in the git repository, it can be converted to html and copied to the application directory with the script scripts/docs_generator.sh (requires installation of asciidoctor).

This repository contains:

  • A web application showing information for a given bill (Django/Python)

  • Utilities to scrape and process bill data (Python)

Both components are described below.

A separate repository (github.com/aih/bills) contains tools in Go to process bill data. This repository now uses some of those tools instead of the Python ones.

Web Application Quickstart (development)

The BillMap web application is built using the Django/Python web application framework. The application is contained in the server_py directory of this repository. It makes use of data that is processed using the scrapers and scripts described in the DATA_BACKGROUND.

Below are instructions to set up a local development environment. For production deployment instructions, see DEPLOYMENT.

Clone this repository

$ git clone https://github.com/aih/BillMap.git
$ cd BillMap

Install Python dependencies

Create a new Python virtual environment. You can use venv, virtualenv or preferably pyenv virtualenv, which requires installing pyenv first).

If you don’t have pyenv, try installing with homebrew

$ brew update
$ brew install pyenv

If you don’t have pyenv-virtualenv, try installing with homebrew

$ brew install pyenv-virtualenv

Note: you may have to manually update ~/.bashrc for virtual env commands to work

Create the environment (with pyenv virtualenv):

$ pyenv install 3.7
$ pyenv virtualenv 3.7 BillMap

Note: you may have to specify the patch version e.g. 3.7.9

Activate the environment

$ pyenv activate BillMap

Then load the requirements.txt into the virtual environment:

$ cd /path/to/server_py
$ pip install -r requirements.txt

Installing pypy as virtualenv

The application has been tested and works with pypy on ubuntu:

  1. Install pypy as a pyenv virtualenv, for example

pyenv install pypy3.7-7.3.4
pyenv virtualenv pypy3.7-7.3.4 pypy37flat
pyenv activate pypy37flat
  1. Upgrade pip, if appropriate

/home/ubuntu/.pyenv/versions/pypy3.7-7.3.4/envs/pypy37flat/bin/pypy3 -m pip install --upgrade pip

  1. It may be necessary to install C libraries to build lxml

sudo apt-get install libxml2-dev libxslt-dev python-dev

  1. Install requirements

cd /path/to/server_py
pip install -r requirements.txt

Create .env file

Copy server_py/flatgov/.env-sample to server_py/flatgov/.env, and change the SECRET_KEY defined in that file.

Also, obtain an API key from ProPublica and add it as PROPUBLICA_CONGRESS_API_KEY. This is used to get Press Statement data from the ProPublica API.

Database set-up

Use Django manage.py commands to download the data and populate the database (see DATABASE).

Data structure: the Bill model

The core of the application is a bill. This is described in BILL_MODEL, and the model itself is set up in Django in server_py/flatgov/bills/models.py. We model bills at the level of the billnumber, e.g. 116hr1500 is a bill in the 116th Congress, in the House of Representatives, bill number 1500. This bill may have many versions, which may differ significantly from each other (e.g the Introduced version may have just a few sections, while the Reported in House version has an entirely new thousand section bill substituted in its place). Where there are differences, we attempt to process the latest version of any bill (e.g to calculate bill similarity).

Data: download bills with the unitedstates/congress scraper

To download and process data from earlier congresses, see details in DATA_BACKGROUND. There are ~50Gb of data, total for Congresses 110-117, including processed json files, and `DATA_BACKGROUND`describes options for downloading and processing this data. For a 'quick start', you can use data from only the most recent Congress:

Download data from the most recent Congress

cd /path/to/uscongress
./run govinfo --bulkdata=BILLSTATUS --congress=117`
./run bills
Note
You may need to separately clone the unitedstates/congress repository, run the command from there, and link the data directory to a directory congress/data in this repository.

Celery task to update bill downloads and data

Updates to the data are done through the Celery taskrunner (see https://docs.celeryproject.org/en/stable/getting-started/introduction.html). Details of the tasks in BillMap are in CELERY.

To run the Celery worker

$ pyenv activate BillMap
$ cd ~/.../server_py/flatgov
$ celery worker -Q bill -A flatgov.celery:app -n flatgov.%%h --loglevel=info

Set up the Celery schedule

celery beat -S redbeat.RedBeatScheduler -A flatgov.celery:app --loglevel=info

Run the Django application

Run the application from server_py/flatgov (within the Python virtual environment you created above):

$ cd server_py/flatgov
$ python manage.py runserver

This will serve the application on localhost:8000. Pages for individual bills follow the form: http://localhost:8000/bills/116hr1500

Bill-to-bill data pages are at: /bills/compare/115s211/115hr604/

Deployment

Deployment instructions

Deployment instructions are in DEPLOYMENT. The application is served on a Linux server (currently Ubuntu Ubuntu 18.04.5 LTS on AWS).

System components

The components of the system are:

  • Linux server on AWS (Ubuntu 18.04.5 LTS)

  • Nginx web server

  • Postgresql server (see DATABASE)

  • Elasticsearch server for search and bill similarity processing (see ES_SIMILARITY)

  • Python/Django application (this repository)

  • uwsgi Python server running the Django application, proxied by Nginx above

  • Bill metadata and xml, downloaded using scrapers from unitedstates/congress

  • Scrapers: other data scraped from public sources, including:

    -Statements of Administration Policy
    -Press statements
    -Congressional Budget Office reports
    -Congressional Research Service reports
    -Calendar information from various congressional sources

These are described in more detail in SCRAPING.

Bills that are related to each other are identified in three ways:

  1. Metadata (in billstatus XML) from the Congressional Research Service identifies bills as identical or related (e.g through a Committee process). We show these in the Related Bills table of the application.

  2. Same or similar titles. Two bills are considered related if they have exactly the same title, or differ only in the year (e.g. 'The Very Important Information Act of 2022' and 'The Very Important Information Act of 2023').

  3. Calculation of text similarity between bills. We calculate similarity between bills using the bill_similarity module (see below).

Bill similarity — text similarity

Overview

  • Bill-to-bill comparison is impractical

Calculating the text similarity between two bills can be relatively straightforward: we can find the percentage of overlapping text between the two bills, or use an existing text similarity algorithm (e.g. Levenshtein distance).

However, for a database of the size of this one, calculating the similarity of all bills is impractical, particularly if we want to update the data. The calculation requires approximately n2 comparisons, where n is the number of bills. For the ~80k bills in our corpus, this would be 6.4 billion comparisons.

  • Search-based comparison

To improve performance, we use search. In particular, we search each section of the latest version of abill against an index of all bills, and combine the results of all of the section-wise searches to get a total score. We then have to filter results to remove duplicates (due to the different versions of all bills).

This approach is imperfect, since many individual sections may share language with unrelated bills (e.g. an Effective Date provision). Smaller bills may not have enough text to reliably find the most relevant 'similar' bills. On the other hand, large bills may match many similar bills on a subset of sections.

This application sets up the basic mechanisms for similarity measurements (described further in ES_SIMILARITY), which are open to many refinements (e.g. with the similarity metric that is used in the comparison).

Finding Similar Bills

As shown below, the application has three main views to explore bill similarity:

  1. A list of similar bills, in order of similarity.

  2. A section-by-section analysis of which other bills have similar sections.

  3. A bill-to-bill comparision showing matching sections between two bills.

Note that small sections with common language will not show as matches using our methodology. We will only show sections that use distinct language, where that language is shared between sections of the two bills.

SimilarBills
Figure 1: Similar Bills
SectionBySectionBills
Figure 2: Section-by-section List
BilltoBillSimilarity
Figure 3: Bill-to-bill Similarity
TextToBillSimilarity
Figure 4: Text-to-bill Similarity

Relevant Committee Documents

To load Relevant Committee Documents data use the following instructions:

  1. After installing the requirements under scrapers directory, run crec_scrape_urls.py file under scrapers directory.

  2. Go to the crec_scrapy folder and run “scrapy crawl crec” command. It will take about an hour to scrape all the data in crec_scrapy/data/crec_data.json file.

  3. Copy scraped data from crec_scrapy/data/crec_data.json to django base directory. First delete old data under django base directory or replace it.

  4. Run django command “./manage.py load_crec” command to populate the data to the database.