This repository contains:
- The Belgian Statutory Article Retrieval Dataset (BSARD) v1.0.
- A web application to visualize insightful statistics about BSARD.
- Code for training and evaluating strong IR models on BSARD.
This repository is tested on Python 3.8+. First, create and activate a virtual environment:
python3 -m venv .venv/bsard
source .venv/bsard/bin/activate
Then, you can install all dependencies:
pip install -r requirements.txt
Additionally, install spaCy's fr_core_news_md pipeline, which is needed for text processing:
python3 -m spacy download fr_core_news_md
We provide access to BSARD on 🤗 Datasets. To load the dataset, you simply need to run:
from datasets import load_dataset
repo = "maastrichtlawtech/bsard"
# Load corpus of statutory articles.
articles = load_dataset(repo, name="corpus")
# Load training questions.
train_questions = load_dataset(repo, name="questions", split="train")
train_negatives = load_dataset(repo, name="negatives", split="train")
# Optional: load synthetic questions for extra training samples.
synthetic_questions = load_dataset(repo, name="questions", split="synthetic")
synthetic_negatives = load_dataset(repo, name="negatives", split="synthetic")
# Load testing questions.
test_questions = load_dataset(repo, name="questions", split="test")
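Once loaded, the corpus can be turned into an id-to-text lookup for retrieval. A minimal sketch below uses toy rows instead of the real dataset (no download needed); the field names "id", "article", and "article_ids" are assumptions about the BSARD schema, so check them against the actual columns:

```python
# Sketch only: field names ("id", "article", "article_ids") are assumed,
# not guaranteed by the loader -- verify against the real dataset columns.
def build_article_index(articles):
    """Map article id -> article text for fast lookup during retrieval."""
    return {row["id"]: row["article"] for row in articles}

# Toy rows mimicking the assumed schema.
toy_articles = [
    {"id": 1, "article": "Le juge peut ordonner..."},
    {"id": 2, "article": "Toute personne a droit..."},
]
toy_questions = [{"question": "Quels sont mes droits ?", "article_ids": [2]}]

index = build_article_index(toy_articles)
relevant = [index[i] for i in toy_questions[0]["article_ids"]]
print(relevant)
```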
As a way to document our dataset, we provide the dataset nutrition labels (Holland et al., 2018).
We provide a Dash web application that shows insightful visualizations about BSARD.
To explore the visualizations on your local machine, run:
python scripts/eda/visualise.py
To evaluate the TF-IDF and BM25 models, run:
python scripts/experiments/run_zeroshot_evaluation.py \
--articles_path </path/to/articles.csv> \
--questions_path </path/to/questions_test.csv> \
--retriever_model {tfidf, bm25} \
--lem \
--output_dir </path/to/output>
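For intuition about what the lexical baselines compute, here is a self-contained Okapi BM25 sketch over pre-tokenized (e.g. lemmatized) documents. It is a generic textbook formulation, not the repository's exact implementation:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each document against the query with Okapi BM25."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    # Document frequency of each term across the corpus.
    df = Counter()
    for d in docs_tokens:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Toy lemmatized corpus of three "articles".
docs = [["droit", "travail", "contrat"], ["bail", "logement"], ["droit", "famille"]]
scores = bm25_scores(["droit", "travail"], docs)
print(scores)
```

The first document matches both query terms and scores highest; the second matches none and scores zero.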
First, download the pre-trained French fastText and word2vec embeddings:
bash scripts/experiments/utils/download_embeddings.sh
Then, you can evaluate the bi-encoder models in a zero-shot setup:
python scripts/experiments/run_zeroshot_evaluation.py \
--articles_path </path/to/articles.csv> \
--questions_path </path/to/questions_test.csv> \
--retriever_model {word2vec, fasttext, camembert} \
--lem \
--output_dir </path/to/output>
Note that the --lem flag, which lemmatizes both articles and questions as pre-processing, only applies to word2vec and fastText.
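In the zero-shot word-embedding setup, each question and article is typically represented by the average of its word vectors and ranked by cosine similarity. The sketch below uses made-up 3-d vectors in place of real fastText/word2vec embeddings:

```python
# Sketch of averaged-word-vector retrieval; the toy 3-d embeddings stand in
# for real pre-trained French fastText/word2vec vectors.
import math

toy_vectors = {
    "droit":   [1.0, 0.2, 0.0],
    "travail": [0.8, 0.5, 0.1],
    "bail":    [0.0, 0.1, 1.0],
}

def embed(tokens):
    """Average the word vectors of the known tokens (zeros if none known)."""
    vecs = [toy_vectors[t] for t in tokens if t in toy_vectors]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

query = embed(["droit", "travail"])
articles = {"labour": embed(["travail", "droit"]), "housing": embed(["bail"])}
ranked = sorted(articles, key=lambda k: cosine(query, articles[k]), reverse=True)
print(ranked)
```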
To train a bi-encoder model, update the model and training hyperparameters in scripts/experiments/train_biencoder.py, then run:
python scripts/experiments/train_biencoder.py
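The exact objective is set inside train_biencoder.py; a common choice for bi-encoders (an assumption here, not a statement about this script) is cross-entropy over in-batch negatives, where each question's positive article is scored against the other articles in the batch:

```python
import math

def in_batch_negatives_loss(sim_matrix):
    """Cross-entropy over in-batch negatives: row i's positive is column i."""
    loss = 0.0
    for i, row in enumerate(sim_matrix):
        denom = sum(math.exp(s) for s in row)
        loss += -math.log(math.exp(row[i]) / denom)
    return loss / len(sim_matrix)

# Toy 2x2 question-article similarity matrix; diagonal entries are positives.
sims = [[5.0, 0.0],
        [0.1, 4.0]]
loss = in_batch_negatives_loss(sims)
print(loss)
```

Because both diagonal similarities dominate their rows, the loss is close to zero.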
To evaluate a trained bi-encoder model, update the checkpoint path in scripts/experiments/test_biencoder.py and run:
python scripts/experiments/test_biencoder.py
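Retrieval quality is usually reported as recall@k: the fraction of a question's relevant articles found among the top-k retrieved ones. A minimal sketch (not the repository's exact evaluation code):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant article ids found in the top-k retrieved list."""
    hits = sum(1 for a in relevant if a in retrieved[:k])
    return hits / len(relevant)

# Toy run: articles 7 and 3 are relevant; the model ranked 7 first, 3 fourth.
retrieved = [7, 12, 5, 3, 9]
print(recall_at_k(retrieved, {7, 3}, k=2), recall_at_k(retrieved, {7, 3}, k=5))
```

Here recall@2 is 0.5 (only article 7 is in the top 2) and recall@5 is 1.0.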