This project extracts UK House of Lords judgements from 1996 to 2009: https://publications.parliament.uk/pa/ld/ldjudgmt.htm
The HTML files are scraped for the text of the cases, and the text is cleaned up for the purpose of annotating the majority judgement. We select 231 of those cases and merge them with the HOLJ corpus to create the 300-case HOLJ+ corpus.
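As a rough illustration of the scrape-and-clean step described above, the sketch below strips the markup from one judgement page using only the Python standard library. It is not the code in scrape.py or extract.py: the names TextExtractor and html_to_text are hypothetical, and the real pages may well need more cleaning than this.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect visible text, skipping script/style content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a <script> or <style> element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one non-empty line per source line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = (re.sub(r"\s+", " ", line).strip() for line in parser.text().splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    page = "<html><body><p>LORD  BINGHAM</p>\n<p>My Lords,</p></body></html>"
    print(html_to_text(page))
```

The standard-library parser keeps the sketch free of dependencies; a project that already relies on a third-party HTML library could do the same job with that instead.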
To get the full corpus used in our research, simply run "holjplus.py". This should give you ~750 House of Lords judgements in plain-text format (HOLJ+). To get the 300-case corpus we use for majority opinion research, we then merge the existing HOLJ corpus with the HOLJ+ corpus using "merge.py".
"merge.py" can also be used to extend and combine our corpus, or any other .txt corpus. See "merge.py" for details.
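To give an idea of what combining .txt corpora can look like, here is a minimal sketch that copies the .txt files from several corpus directories into one output directory. It is only an illustration, not the logic of "merge.py": the function name merge_corpora is hypothetical, and skipping files whose name already exists in the output is an assumption.

```python
import shutil
from pathlib import Path


def merge_corpora(source_dirs, dest_dir):
    """Copy every .txt file from source_dirs into dest_dir.

    Assumption: two files with the same name are the same case, so the
    first copy wins and later ones are skipped. Returns the names of the
    files that were copied, in the order they were copied.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in source_dirs:
        for path in sorted(Path(source).glob("*.txt")):
            target = dest / path.name
            if target.exists():
                continue  # already merged from an earlier corpus
            shutil.copyfile(path, target)
            copied.append(path.name)
    return copied
```

Called as merge_corpora(["HOLJ", "HOLJ+"], "merged"), with those directory names being placeholders, it would leave one copy of each case file in "merged".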
To run the built-in tests, run format.py, extract.py and scrape.py.
scrape.py - functions adapted from a realpython.com tutorial
- Josef Valvoda
This project is licensed under the MIT License; see LICENSE.md.