
# Documentation

This repository contains:

- The Belgian Statutory Article Retrieval Dataset (BSARD) v1.0.
- A web application to visualize insightful statistics about BSARD.
- Code for training and evaluating strong IR models on BSARD.

## Setup

This repository is tested on Python 3.8+. First, create and activate a virtual environment:

```bash
python3 -m venv .venv/bsard
source .venv/bsard/bin/activate
```

Then, you can install all dependencies:

```bash
pip install -r requirements.txt
```

Additionally, install spaCy's `fr_core_news_md` pipeline (needed for text processing):

```bash
python3 -m spacy download fr_core_news_md
```
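
To check that the pipeline is installed correctly, you can load it and lemmatize a short French sentence (a minimal sanity check; the example sentence is arbitrary):

```python
import spacy

# Load the French pipeline installed above.
nlp = spacy.load("fr_core_news_md")

# Print (token, lemma) pairs for a short example sentence.
doc = nlp("Le locataire doit payer le loyer chaque mois.")
print([(token.text, token.lemma_) for token in doc])
```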

## BSARD: The Belgian Statutory Article Retrieval Dataset

### Access

We provide access to BSARD on 🤗 Datasets. To load the dataset, you simply need to run:

```python
from datasets import load_dataset

repo = "maastrichtlawtech/bsard"

# Load the corpus of statutory articles.
articles = load_dataset(repo, name="corpus")

# Load the training questions.
train_questions = load_dataset(repo, name="questions", split="train")
train_negatives = load_dataset(repo, name="negatives", split="train")

# Optional: load synthetic questions for extra training samples.
synthetic_questions = load_dataset(repo, name="questions", split="synthetic")
synthetic_negatives = load_dataset(repo, name="negatives", split="synthetic")

# Load the test questions.
test_questions = load_dataset(repo, name="questions", split="test")
```
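
Each call returns a 🤗 `Dataset` (or a `DatasetDict` when no split is given), so you can inspect the data before using it. A minimal sketch; the exact field names depend on the configuration:

```python
print(articles)                      # splits, features, and number of rows
print(train_questions.column_names)  # fields available for each question
print(train_questions[0])            # first training question as a dict
```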

### Documentation

As a way to document our dataset, we provide the dataset nutrition labels (Holland et al., 2018).

### Visualization

We provide a Dash web application that shows insightful visualizations about BSARD.

To explore the visualizations on your local machine, run:

```bash
python scripts/eda/visualise.py
```
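
Unless the script overrides Dash's defaults, the application is served at http://127.0.0.1:8050, which you can open in a browser once the script is running.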

## Experiments

### Lexical Models

To evaluate the TF-IDF and BM25 models, run:

```bash
python scripts/experiments/run_zeroshot_evaluation.py \
    --articles_path </path/to/articles.csv> \
    --questions_path </path/to/questions_test.csv> \
    --retriever_model {tfidf, bm25} \
    --lem \
    --output_dir </path/to/output>
```

The `--lem` flag lemmatizes both articles and questions as pre-processing.
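
As a quick illustration of what the BM25 retriever computes, here is a toy ranking example. This is not the repository's implementation; it assumes the third-party `rank_bm25` package (`pip install rank-bm25`):

```python
from rank_bm25 import BM25Okapi

# Toy corpus of (whitespace-tokenized) article texts.
corpus = [
    "le locataire doit payer le loyer",
    "le bail prend fin au terme convenu",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

# Score every article against a tokenized query.
query = "payer le loyer".split()
print(bm25.get_scores(query))  # one relevance score per article
```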

### Dense Models

#### Zero-Shot Evaluation

First, download the pre-trained French fastText and word2vec embeddings:

```bash
bash scripts/experiments/utils/download_embeddings.sh
```

Then, you can evaluate the bi-encoder models in a zero-shot setup:

```bash
python scripts/experiments/run_zeroshot_evaluation.py \
    --articles_path </path/to/articles.csv> \
    --questions_path </path/to/questions_test.csv> \
    --retriever_model {word2vec, fasttext, camembert} \
    --lem \
    --output_dir </path/to/output>
```

Here, `--lem` applies only to the word2vec and fastText retrievers, where it lemmatizes both articles and questions as pre-processing.
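
For word2vec and fastText, zero-shot retrieval boils down to mean-pooling word vectors and ranking articles by cosine similarity to the question. A minimal sketch, assuming `vectors` is a token-to-vector mapping built from the downloaded embeddings and an (assumed) 300-dimensional space:

```python
import numpy as np

EMB_DIM = 300  # assumed dimensionality of the pre-trained embeddings

def embed(tokens, vectors):
    """Mean-pool the vectors of all in-vocabulary tokens."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(EMB_DIM)

def cosine(a, b):
    """Cosine similarity between two pooled embeddings."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
```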

#### Training

To train a bi-encoder model, update the model and training hyperparameters in `scripts/experiments/train_biencoder.py`, then run:

```bash
python scripts/experiments/train_biencoder.py
```
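
For orientation, a generic bi-encoder training loop with in-batch negatives looks like the sketch below, here using `sentence-transformers`. This is not the repository's `train_biencoder.py`; the base model, example pairs, and hyperparameters are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder French base model; the repo's script defines its own setup.
model = SentenceTransformer("camembert-base")

# Toy (question, relevant article) pairs; in practice these come from BSARD.
train_examples = [
    InputExample(texts=["Comment rompre un bail ?", "Art. 1736 ..."]),
    InputExample(texts=["Que risque le locataire ?", "Art. 1728 ..."]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives loss, a standard objective for bi-encoder retrieval.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```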

To evaluate a trained bi-encoder model, update the checkpoint path in `scripts/experiments/test_biencoder.py` and run:

```bash
python scripts/experiments/test_biencoder.py
```
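
At test time, a bi-encoder embeds questions and articles separately and ranks articles by cosine similarity. A minimal sketch with `sentence-transformers`; the checkpoint path and texts are placeholders:

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder path; point this at your trained checkpoint directory.
model = SentenceTransformer("/path/to/checkpoint")

q_emb = model.encode(["Comment rompre un bail ?"], convert_to_tensor=True)
a_emb = model.encode(["Art. 1736 ...", "Art. 1728 ..."], convert_to_tensor=True)

print(util.cos_sim(q_emb, a_emb))  # question-by-article similarity matrix
```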