CapitolQuery is converting a static archive of the Congressional Record into a research-ready database.
The project, launched in D-Lab's CTAWG with support from the Social Science Research Council, will run from June 15th to September 15th. The estimated 200 hours of work will be spread across three phases:
Phase I: Acquiring and Cleaning Data
Phase II: Structuring, Chunking, and Tagging the Text
Phase III: Packaging the Data for Researchers and Archives
Each phase will produce deliverables that demonstrate progress toward, and culminate in, a research-ready pilot database. The deliverables also include scripts and educational notebooks (Jupyter or RMarkdown) that teach researchers and memory organizations (i.e., archives and libraries) how to prepare textual data for computational text analysis projects. Significant elements of Phases I and II are already complete, and we can expect some consultation help for Phase III.
Ideal contributors will be adept at coding in Python and/or R and at writing technical curricula for a beginner-to-intermediate audience. Expertise in XML (for the creation of the Phase III database) is a plus. The team will be managed (as lightly as possible) by GoodlyLabs Conductor Nick Adams, and contributors can expect authorship credit on all their products.
To begin contributing:
- Read the Statement of Work document: 'SSRC_Goodly_SoW_Funded.pdf'
- Email Nick to let him know what you would like to work on: nickbadams [at] gmail.com
- Request access to our files at: https://drive.google.com/drive/folders/0B7dPnKIP7WrQdUtLeFlFenpsZDA