This repository contains the scripts to build the database and datasets from the European Court of Human Rights OpenData (ECHR-OD) project. The repository serves several purposes:
- Reproducibility: anyone can rebuild the entire database from scratch.
- Extensibility: any new version of the database must be created from an updated version of these scripts.
- Revision: all cases are processed automatically, and there are many corner cases. The repository lets anyone inspect the intermediate files to check whether the results are correct and to locate the root cause of parsing errors.
- Official website: ECHR-OD project
- Original paper: paper, code, supplementary material
- Creation process: https://github.com/echr-od/ECHR-OD_process
- Website sources: https://github.com/echr-od/ECHR-OD_website
If you are using the project, please consider citing:
@article{Quemy2019_ECHROD,
title={European Court of Human Right Open Data project},
author={Alexandre Quemy},
journal={CoRR},
year={2019},
volume={abs/1810.03115}
}
The build chain starts from scratch and consists of the following steps:
- get_cases_info.py: retrieve the list of cases and their basic information from HUDOC
- filter_cases.py: remove inconsistent, ambiguous or difficult-to-process cases
- preprocess_documents.py: analyse the raw judgments to construct a nested JSON structure representing the paragraphs
- process_documents.py: normalize the documents and generate Bag-of-Words and TF-IDF representations
- generate_datasets.py: combine all the information to generate several datasets
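To make the Bag-of-Words and TF-IDF representations mentioned in the steps above more concrete, here is a minimal sketch in plain Python. It is an illustration only, not the code used by process_documents.py: the function names, the toy documents and the exact weighting (relative term frequency multiplied by the natural logarithm of inverse document frequency) are assumptions, and the project may use different tokenization, n-grams or smoothing.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Count how often each token occurs in one document."""
    return Counter(tokens)


def tf_idf(documents):
    """Compute TF-IDF weights for a list of tokenized documents.

    Assumed weighting (hypothetical, for illustration):
        tf  = count of the token / number of tokens in the document
        idf = ln(number of documents / number of documents containing the token)
    Returns one dict (token -> weight) per document.
    """
    n_docs = len(documents)
    bows = [bag_of_words(doc) for doc in documents]

    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter()
    for bow in bows:
        doc_freq.update(bow.keys())

    weights = []
    for bow in bows:
        total = sum(bow.values())
        weights.append({
            token: (count / total) * math.log(n_docs / doc_freq[token])
            for token, count in bow.items()
        })
    return weights


# Toy, already-normalized documents (not real ECHR judgments).
docs = [
    ["court", "article", "court"],
    ["article", "violation"],
]
bows = [bag_of_words(d) for d in docs]
weights = tf_idf(docs)
```

With this weighting, a token that appears in every document (here "article") gets a weight of zero, while tokens specific to one document are weighted up.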
In order to parse and normalize the documents, the following packages from nltk have to be installed: stopwords, averaged_perceptron_tagger and wordnet. To install them, run bin/download-nltk:
python bin/download-nltk
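For readers who prefer to fetch the resources by hand, the following sketch shows roughly what such a download step could look like. It is not the content of bin/download-nltk: the function name and the injectable downloader parameter are hypothetical, and only the three package names come from the text above. It relies on nltk.download, which returns a boolean indicating success.

```python
# Package names taken from the text above.
NLTK_PACKAGES = ("stopwords", "averaged_perceptron_tagger", "wordnet")


def download_nltk_packages(packages=NLTK_PACKAGES, downloader=None):
    """Download each NLTK resource and report which ones succeeded.

    `downloader` is a hypothetical hook (useful for testing); by default
    nltk.download is used and nltk is imported only when needed.
    """
    if downloader is None:
        import nltk  # lazy import: requires nltk to be installed
        downloader = nltk.download
    return {name: bool(downloader(name)) for name in packages}


# Usage (needs network access and nltk installed):
# status = download_nltk_packages()
# missing = [name for name, ok in status.items() if not ok]
```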
Selenium is installed as a dependency in order to automatically retrieve the number of documents available on HUDOC. For Selenium to work, a webdriver is required and must be installed manually. See the Selenium documentation for help.
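Before running the build, it can be handy to check that a webdriver executable is reachable. The sketch below only looks for a driver binary on the PATH using the standard library. The function name is hypothetical, and the two candidates (geckodriver for Firefox, chromedriver for Chrome) are an assumption about which browsers might be used, since this README does not say which driver the scripts expect.

```python
import shutil

# Common driver executables; which one the scripts need is an assumption.
KNOWN_DRIVERS = ("geckodriver", "chromedriver")


def find_webdriver(candidates=KNOWN_DRIVERS, which=shutil.which):
    """Return (name, path) of the first driver found on the PATH, else None.

    `which` is injectable so the lookup can be tested without a real driver.
    """
    for name in candidates:
        path = which(name)
        if path:
            return name, path
    return None


# Usage:
# found = find_webdriver()
# print(found if found else "No webdriver found on PATH")
```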
- version 2.0.0: Changelogs
- version 1.0.2: Changelogs
- version 1.0.1: Changelogs
- Alexandre Quemy aquemy@pl.ibm.com