This repository contains:
- The Belgian Statutory Article Retrieval Dataset (BSARD) v1.0.
- A web application to visualize insightful statistics about BSARD.
- Code for training and evaluating strong IR models on BSARD.
This repository is tested on Python 3.8+. First, create and activate a virtual environment:
python3 -m venv .venv/bsard
source .venv/bsard/bin/activate
Then, you can install all dependencies:
pip install -r requirements.txt
Additionally, install spaCy's fr_core_news_md pipeline, which is needed for text processing:
python3 -m spacy download fr_core_news_md
We provide access to BSARD on 🤗 Datasets. To load the dataset, you simply need to run:
from datasets import load_dataset
repo = "maastrichtlawtech/bsard"
# Load corpus of statutory articles.
articles = load_dataset(repo, name="corpus")
# Load training questions.
train_questions = load_dataset(repo, name="questions", split="train")
train_negatives = load_dataset(repo, name="negatives", split="train")
# Optional: load synthetic questions for extra training samples.
synthetic_questions = load_dataset(repo, name="questions", split="synthetic")
synthetic_negatives = load_dataset(repo, name="negatives", split="synthetic")
# Load testing questions.
test_questions = load_dataset(repo, name="questions", split="test")
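Once loaded, the corpus can be turned into an id-to-text lookup for retrieval. A minimal sketch below uses toy rows instead of the real dataset (no download needed); the field names "id", "article", and "article_ids" are assumptions about the BSARD schema, so check them against the actual columns:

```python
# Sketch only: field names ("id", "article", "article_ids") are assumed,
# not guaranteed by the loader -- verify against the real dataset columns.
def build_article_index(articles):
    """Map article id -> article text for fast lookup during retrieval."""
    return {row["id"]: row["article"] for row in articles}

# Toy rows mimicking the assumed schema.
toy_articles = [
    {"id": 1, "article": "Le juge peut ordonner..."},
    {"id": 2, "article": "Toute personne a droit..."},
]
toy_questions = [{"question": "Quels sont mes droits ?", "article_ids": [2]}]

index = build_article_index(toy_articles)
relevant = [index[i] for i in toy_questions[0]["article_ids"]]
print(relevant)
```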
As a way to document our dataset, we provide the dataset nutrition labels (Holland et al., 2018).
We provide a Dash web application that shows insightful visualizations about BSARD.
To explore the visualizations on your local machine, run:
python scripts/eda/visualise.py
To evaluate the TF-IDF and BM25 models, run:
python scripts/experiments/run_zeroshot_evaluation.py \
--articles_path </path/to/articles.csv> \
--questions_path </path/to/questions_test.csv> \
--retriever_model {tfidf, bm25} \
--lem \
--output_dir </path/to/output>
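For intuition about what the lexical baselines compute, here is a self-contained Okapi BM25 sketch over pre-tokenized (e.g. lemmatized) documents. It is a generic textbook formulation, not the repository's exact implementation:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each document against the query with Okapi BM25."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    # Document frequency of each term across the corpus.
    df = Counter()
    for d in docs_tokens:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Toy lemmatized corpus of three "articles".
docs = [["droit", "travail", "contrat"], ["bail", "logement"], ["droit", "famille"]]
scores = bm25_scores(["droit", "travail"], docs)
print(scores)
```

The first document matches both query terms and scores highest; the second matches none and scores zero.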
First, download the pre-trained French fastText and word2vec embeddings:
bash scripts/experiments/utils/download_embeddings.sh
Then, you can evaluate the bi-encoder models in a zero-shot setup:
python scripts/experiments/run_zeroshot_evaluation.py \
--articles_path </path/to/articles.csv> \
--questions_path </path/to/questions_test.csv> \
--retriever_model {word2vec, fasttext, camembert} \
--lem \
--output_dir </path/to/output>
Note that the --lem flag, which lemmatizes both articles and questions as pre-processing, only applies to word2vec and fastText.
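In the zero-shot word-embedding setup, each question and article is typically represented by the average of its word vectors and ranked by cosine similarity. The sketch below uses made-up 3-d vectors in place of real fastText/word2vec embeddings:

```python
# Sketch of averaged-word-vector retrieval; the toy 3-d embeddings stand in
# for real pre-trained French fastText/word2vec vectors.
import math

toy_vectors = {
    "droit":   [1.0, 0.2, 0.0],
    "travail": [0.8, 0.5, 0.1],
    "bail":    [0.0, 0.1, 1.0],
}

def embed(tokens):
    """Average the word vectors of the known tokens (zeros if none known)."""
    vecs = [toy_vectors[t] for t in tokens if t in toy_vectors]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

query = embed(["droit", "travail"])
articles = {"labour": embed(["travail", "droit"]), "housing": embed(["bail"])}
ranked = sorted(articles, key=lambda k: cosine(query, articles[k]), reverse=True)
print(ranked)
```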
To train a bi-encoder model, update the model and training hyperparameters in scripts/experiments/train_biencoder.py, then run:
python scripts/experiments/train_biencoder.py
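The exact objective is set inside train_biencoder.py; a common choice for bi-encoders (an assumption here, not a statement about this script) is cross-entropy over in-batch negatives, where each question's positive article is scored against the other articles in the batch:

```python
import math

def in_batch_negatives_loss(sim_matrix):
    """Cross-entropy over in-batch negatives: row i's positive is column i."""
    loss = 0.0
    for i, row in enumerate(sim_matrix):
        denom = sum(math.exp(s) for s in row)
        loss += -math.log(math.exp(row[i]) / denom)
    return loss / len(sim_matrix)

# Toy 2x2 question-article similarity matrix; diagonal entries are positives.
sims = [[5.0, 0.0],
        [0.1, 4.0]]
loss = in_batch_negatives_loss(sims)
print(loss)
```

Because both diagonal similarities dominate their rows, the loss is close to zero.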
To evaluate a trained bi-encoder model, update the checkpoint path in scripts/experiments/test_biencoder.py and run:
python scripts/experiments/test_biencoder.py
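Retrieval quality is usually reported as recall@k: the fraction of a question's relevant articles found among the top-k retrieved ones. A minimal sketch (not the repository's exact evaluation code):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant article ids found in the top-k retrieved list."""
    hits = sum(1 for a in relevant if a in retrieved[:k])
    return hits / len(relevant)

# Toy run: articles 7 and 3 are relevant; the model ranked 7 first, 3 fourth.
retrieved = [7, 12, 5, 3, 9]
print(recall_at_k(retrieved, {7, 3}, k=2), recall_at_k(retrieved, {7, 3}, k=5))
```

Here recall@2 is 0.5 (only article 7 is in the top 2) and recall@5 is 1.0.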