Leonard Bronner edited this page Nov 7, 2018 · 11 revisions

Overview

Bill text is sourced from the U.S. Government Publishing Office (GPO), which is the legislative-branch agency tasked with publishing documents.

For every bill, there may be zero or more prints (or versions) of the bill. A print is in general a snapshot of the text of the bill at a given time. Each print is assigned a status code from a kind-of-but-not-really fixed set of status codes. When a House bill or resolution is introduced, its first print is for the "ih" (Introduced in House) status code.

Prints typically occur when a bill is introduced, after votes on passage, and when a bill is sent to the president (enrollment). Prints are always at least a day behind, and may sometimes not be published for days or weeks after the activity occurred. And because prints only occur after major action, there is typically no new bill text published as amendments are adopted. See more in the documentation on FDSys.

Data Formats

Bill text comes in several formats:

  • PDF. The Government Publishing Office publishes all bills as PDF. These are available in the GPO FDSys Congressional Bills (BILLS) collection starting with the 103rd Congress. There is no comprehensive bill text available before the 103rd Congress.

  • Plain Text. A plain text version of each bill can also be found alongside any PDF. The plain text from GPO is pretty good. It omits line numbers, which are hard to ignore in the PDF text layer, and it doesn't hyphenate words that happen to be broken across lines in the print form, which is very handy for search indexing. So when using plain text, use the plain text from GPO and not the text layer of the PDF. Note that GPO calls these files HTML, but they're HTML wrappers around plain text.

  • XML. Starting with approximately the 111th Congress, bills have been drafted in XML. The XML drafting process began a few Congresses earlier, but it wasn't initially comprehensive. The documentation for the XML format is at xml.house.gov. These XML files can be found on GPO FDSys in the Congressional Bills (BILLS) collection and also in undocumented directories such as http://thomas.loc.gov/home/gpoxmlc113/. GPO also has a "bulk data" page for bill XML, but it's entirely redundant with the other sources.

  • HTML. Prior to the 111th Congress, the only comprehensive source of semi-structured bill text data was the HTML rendition of bill text as it appeared on THOMAS. (GovTrack screen-scraped this starting with the 101st Congress, the earliest that this bill text is available.)

  • In addition, a "MODS" file (which is an XML file) is available for all bills on GPO FDSys. These files contain fairly detailed metadata about the bill. One interesting component is a list of citations found in the bill.
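
Since GPO's "HTML" files are just wrappers around plain text, getting the text back out is simple. Here is a minimal sketch, assuming the bill text sits inside a single <pre> element and that characters like & are entity-escaped. The function name unwrap_gpo_text is ours for illustration and is not part of the scrapers.

```python
import html
import re


def unwrap_gpo_text(page: str) -> str:
    """Return the plain text inside a GPO 'HTML' bill file.

    Assumption: the bill text is inside one <pre> element. If no <pre>
    is found, the input is treated as already being plain text.
    """
    match = re.search(r"<pre[^>]*>(.*?)</pre>", page, flags=re.DOTALL | re.IGNORECASE)
    body = match.group(1) if match else page
    # Undo entity escaping (e.g. &amp; -> &) and trim blank edge lines.
    return html.unescape(body).strip("\n")


sample = (
    "<html><body><pre>\n"
    "SECTION 1. SHORT TITLE.\n"
    "  Smith &amp; Jones Act\n"
    "</pre></body></html>"
)
print(unwrap_gpo_text(sample))
```

Leading indentation inside the text is kept on purpose, since bill text uses it to show structure.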

File layout

Our text fetching scripts store bill text like this:

data/[congress]/bills/[bill_type]/[bill_type][number]/text-versions/[status_code]/document.[format]

For instance, the directory:

113/bills/hr/hr1237/text-versions/ih

is for bill text information for the ih status of H.R. 1237 in the 113th Congress. This directory may contain:

document.txt (plain text version; UTF-8 encoded)
document.xml (XML version)
mods.xml (MODS metadata file)
data.json (metadata extracted from the MODS file in a more convenient JSON format)
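
The layout above can be turned into a small path helper. This is only a sketch of the naming scheme described on this page; the function text_version_dir and its data_root parameter are hypothetical names, not something the scripts export.

```python
from pathlib import Path


def text_version_dir(congress: int, bill_type: str, number: int,
                     status_code: str, data_root: str = "data") -> Path:
    """Build data/[congress]/bills/[bill_type]/[bill_type][number]/text-versions/[status_code]."""
    bill_id = f"{bill_type}{number}"
    return (Path(data_root) / str(congress) / "bills" / bill_type
            / bill_id / "text-versions" / status_code)


directory = text_version_dir(113, "hr", 1237, "ih")
print(directory.as_posix())
print((directory / "document.txt").as_posix())
```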

Here's data.json:

{
  "bill_version_id": "hr1237-113-ih",
  "issued_on": "2013-03-18",
  "urls": {
    "html": "http://www.gpo.gov/fdsys/pkg/BILLS-113hr1237ih/html/BILLS-113hr1237ih.htm",
    "pdf": "http://www.gpo.gov/fdsys/pkg/BILLS-113hr1237ih/pdf/BILLS-113hr1237ih.pdf",
    "unknown": "http://www.gpo.gov/fdsys/pkg/BILLS-113hr1237ih/content-detail.html",
    "xml": "http://www.gpo.gov/fdsys/pkg/BILLS-113hr1237ih/xml/BILLS-113hr1237ih.xml"
  },
  "version_code": "ih"
}

In general, the way to find the current or most recent text of a bill is to look through all of the text-versions directories (for each status code), read data.json, and select the status directory that has the most recent issued_on date. The status codes do not reliably have an order. (That's not to say there isn't a pattern to it, but it's hard or impossible to always know from a status code which is most recent because legislative actions can occur in many orders.)
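
The selection rule just described could look something like this. It is a minimal sketch, assuming every status directory has a data.json with an issued_on field in YYYY-MM-DD form as in the example above. The function name latest_text_version is ours, and if two prints share a date the sketch keeps whichever it happens to read first.

```python
import json
from pathlib import Path
from typing import Optional


def latest_text_version(text_versions_dir: str) -> Optional[dict]:
    """Return the data.json contents of the most recently issued print, or None."""
    newest = None
    for data_file in Path(text_versions_dir).glob("*/data.json"):
        with open(data_file, encoding="utf-8") as f:
            meta = json.load(f)
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        if newest is None or meta["issued_on"] > newest["issued_on"]:
            newest = meta
    return newest


# Usage (the path is an example from this page):
# meta = latest_text_version("data/113/bills/hr/hr1237/text-versions")
# if meta is not None:
#     print(meta["version_code"], meta["issued_on"])
```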

Scripts

We have two scripts that produce these bill text files:

  • Bill text via GPO's "Federal Digital System" aka FDSys: govinfo.py (formerly fdsys.py) for the GPO BILLS collection. This gathers GPO's actual bill text data, from the 103rd Congress (1993) to the present. It's smart about updating only changed files.

  • Bill text via the Statutes at Large: statutes.py for the GPO STATUTE collection. This extracts bill information from GPO's Statutes at Large collection, which covers enacted bills and agreed-to concurrent resolutions from 1951 to the present. Since there is better bill data starting in 1993, this scraper should only be used up to 1992. (The scraper also extracts bill metadata. As noted on the bill documentation page, better bill metadata comes from THOMAS starting in 1973.)

Bill text via FDSys

The govinfo.py (formerly fdsys.py) script fetches content from FDSys. FDSys contains documents of many types besides bill text, and it can be used to fetch any document collection on FDSys, including the bulk data collections.

To download all bill text for the 112th Congress, run:

./run govinfo --collections=BILLS --congress=112 --store=pdf,mods,xml,text --bulkdata=False

Running the command again will smartly update changed files by scanning through FDSys's sitemaps for changed sitemaps and changed files.

All arguments are optional. Without --store, the script fetches all document formats that are available. The --bulkdata=False option is necessary here because there is a bulk data collection also named BILLS that contains bill text in XML (where available). Without --bulkdata=False, both collections would be fetched, which would be redundant.

Using fdsys.py for other collections

This script can be used to smartly fetch any collection in FDSys. The stored files for other collections (besides bills) are organized in a more generic way: in data/fdsys/COLLECTION/YEAR/PKGID. The PKGID is the package identifier for the file on FDSys. For instance:

./run fdsys --collections=STATUTE --year=1982 --store=mods
data/fdsys/STATUTE/1982/STATUTE-96/mods.xml

The collections argument can take a comma-separated list of collections. To get a list of collection names, use:

./run fdsys --list

Bill text via the Statutes at Large

The U.S. Statutes at Large is the final compilation of enacted bills and agreed-to concurrent resolutions that is published after the end of each Congress. GPO has digitized and OCR'd the Statutes at Large from 1951 forward and has separated the whole-volume PDFs into smaller PDFs, one for each bill, with corresponding MODS metadata files.

The statutes.py scraper extracts bill metadata from these MODS files and bill text from the text layer of the PDFs. The OCR quality is not very good, but it's what we have. We can use the Statutes at Large to fill in bill text for enacted bills and agreed-to concurrent resolutions from 1951 to 1992, since there is no other comprehensive source of bill text for that period. See the documentation on Bills for how to run this scraper to generate bill metadata.

First download the Statutes at Large from GPO:

./run fdsys --collections=STATUTE --store=mods,pdf --granules

Then run this scraper:

./run statutes --volumes=65-106 --textversions --extracttext

This processes all downloaded statutes files in the period of time for which normal bill text is not available (1951 to 1992, 82nd-102nd Congresses), and it saves bill text files, e.g.:

data/82/bills/hr/hr1/text-versions/enr/data.json
data/82/bills/hr/hr1/text-versions/enr/document.txt

Of course, we only get bill text for the enr (Enrolled) status of bills. Note that without the --textversions flag, this would possibly overwrite bill metadata for the 93rd-102nd Congresses created by the bills scraper, so be careful about that.

When --extracttext is given, each PDF is converted to text using "pdftotext -layout" and the result is stored in the usual place for bill text in plain text format (as indicated above). The text file is UTF-8 encoded (like normal) and has form-feed characters marking page breaks.
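
Because page breaks are marked with form feeds, the extracted text can be split back into pages. This is a small sketch, not part of the scraper; split_pages is a name made up here, and it assumes the file may end with a trailing form feed that would otherwise produce an empty last page.

```python
from typing import List


def split_pages(text: str) -> List[str]:
    """Split pdftotext output into pages on form-feed (\\f) characters."""
    pages = text.split("\f")
    # Drop the empty chunk left behind by a trailing form feed, if any.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


sample = "Page one text\n\fPage two text\n\f"
for number, page in enumerate(split_pages(sample), start=1):
    print(number, repr(page))

# For a real file, read it as UTF-8 first, e.g.:
# text = open("data/82/bills/hr/hr1/text-versions/enr/document.txt", encoding="utf-8").read()
```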

If --linkpdf is given, then hard links are created from where the PDF should be for bill text to where the PDF has been downloaded in the fdsys directory, i.e.:

data/82/bills/hr/hr1/text-versions/enr/document.pdf
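
A hard link makes the same file appear at a second path without copying it. The sketch below shows the idea with os.link; it is an illustration only and not the scraper's actual code. The helper link_pdf and both of its arguments are hypothetical, and the location of the downloaded PDF is left to the caller.

```python
import os


def link_pdf(downloaded_pdf: str, text_version_dir: str) -> str:
    """Hard-link a downloaded PDF into a text-versions directory as document.pdf."""
    os.makedirs(text_version_dir, exist_ok=True)
    target = os.path.join(text_version_dir, "document.pdf")
    if os.path.exists(target):
        os.remove(target)  # replace a stale link or copy
    os.link(downloaded_pdf, target)  # both paths now refer to the same file
    return target
```

Hard links only work within a single filesystem, so this assumes the downloads and the bill text directories live on the same volume.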

You can also use --volume=65, --volumes=65-86, --year=1951, or --years=1951-1972 to limit which files are created.
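
The volume and year options line up in a simple way across this range. The sketch below is inferred from the pairs that appear on this page (volume 65 and 1951, 86 and 1972, 96 and 1982, 106 and 1992) rather than taken from the scraper, so treat the offset as an assumption that is only meant for volumes 65-106. The Congress formula likewise just reproduces the 82nd-102nd and 103rd (1993) figures given above.

```python
def volume_to_year(volume: int) -> int:
    """Map a Statutes at Large volume to its year (assumed valid for volumes 65-106)."""
    return volume + 1886


def year_to_congress(year: int) -> int:
    """Map a calendar year to the Congress meeting that year (two years per Congress)."""
    return (year - 1789) // 2 + 1


for volume in (65, 86, 106):
    year = volume_to_year(volume)
    print(volume, year, year_to_congress(year))
```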
