This repository contains the Mapping Prejudice project pipeline for identifying racially restrictive language in historical deed documents. It provides two core features:
- Entity Identification: Uses ModernBERT to recognize and extract racially restrictive terms and phrases based on their context within the document
- Document Classification: Uses BERT-base with the TARS framework to determine whether any sentence in the document contains racially restrictive language
Install the required dependencies:
pip install -r requirements.txt
Poetry is recommended for better dependency and environment management. To install Poetry and then the project dependencies:
pip install -r poet.txt
poetry install --no-root
The pipeline accepts two types of input: a JSON-formatted string or a path to a local directory. Reading files from S3 buckets is also supported via boto3.
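As a rough illustration of the JSON-string input mode, the sketch below builds the argument list for a `--json_input` run from Python. The helper name `build_json_command` and the payload keys (`document_id`, `text`) are hypothetical and only illustrate the mechanics; the schema the pipeline actually expects is not described here, so check it before use.

```python
import json
import shlex


def build_json_command(script, payload):
    """Return the argument list for a --json_input run of the given script."""
    # json.dumps produces a single-line string, passed as one argv element,
    # so no manual shell quoting is needed when it goes to subprocess.
    return ["python3", script, "--json_input", json.dumps(payload)]


# Hypothetical payload: these key names are illustrative assumptions,
# not the schema the pipeline actually expects.
example_payload = {"document_id": "example-001", "text": "Sample deed text."}

command = build_json_command("run_identification.py", example_payload)

# Print a shell-safe rendering of the command for inspection.
print(shlex.join(command))
```

To actually launch the run, the same list can be handed to `subprocess.run(command, check=True)` from the repository root. Passing a list rather than a single shell string avoids quoting problems with the embedded JSON.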
This module extracts racially restrictive terms from deed text using contextual entity recognition.
# pip, JSON input
python3 run_identification.py --json_input <input_json>
# Poetry, JSON input
poetry run python3 run_identification.py --json_input <input_json>
# pip, local input
python3 run_identification.py --local --path_data <path_to_local_data> --output_dir <output_directory> --output_file <output_file_name>
# Poetry, local input
poetry run python3 run_identification.py --local --path_data <path_to_local_data> --output_dir <output_directory> --output_file <output_file_name>
This module classifies whether any sentence in a document contains racially restrictive language.
# pip, JSON input
python3 run_classification.py --json_input <input_json>
# Poetry, JSON input
poetry run python3 run_classification.py --json_input <input_json>
# pip, local input
python3 run_classification.py --local --path_data <path_to_local_data> --output_dir <output_directory> --output_file <output_file_name>
# Poetry, local input
poetry run python3 run_classification.py --local --path_data <path_to_local_data> --output_dir <output_directory> --output_file <output_file_name>
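Because both scripts take the same local-mode flags, the two stages can be driven from a small wrapper. The following is a minimal sketch, not part of the repository: the helpers `build_local_command` and `run_both_stages` and the output file names are hypothetical, and whether `--output_file` expects a file extension is an assumption left to the reader to verify.

```python
import subprocess
from pathlib import Path


def build_local_command(script, data_dir, output_dir, output_file):
    """Return the argument list for a --local run of the given script."""
    return [
        "python3", script,
        "--local",
        "--path_data", str(data_dir),
        "--output_dir", str(output_dir),
        "--output_file", output_file,
    ]


def run_both_stages(data_dir, output_dir):
    """Run identification, then classification, on one local directory."""
    # Hypothetical output names, chosen only for this example.
    stages = (
        ("run_identification.py", "identification_output"),
        ("run_classification.py", "classification_output"),
    )
    for script, output_file in stages:
        command = build_local_command(script, data_dir, output_dir, output_file)
        # check=True raises CalledProcessError if a stage exits non-zero,
        # so a failed identification run stops before classification.
        subprocess.run(command, check=True)


# Show the command for one stage without executing anything.
preview = build_local_command(
    "run_classification.py", Path("data/deeds"), Path("results"), "classification_output"
)
print(preview)
```

Calling `run_both_stages(Path("data/deeds"), Path("results"))` from the repository root would execute both commands in sequence; under Poetry, prefix the argument list with `"poetry", "run"`.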