Importing Harvard Data
Mike Lissner edited this page Nov 6, 2024
This documentation covers the process of importing and updating case data from the Harvard Case Law Access Project (CAP) into CourtListener.
The import process consists of two main steps:
- Generating crosswalk files that map CAP cases to CourtListener cases
- Using these crosswalk files to update CourtListener's case data with the latest CAP content
The generate_cap_crosswalk command generates mapping files between CAP and CourtListener cases.
manage.py generate_cap_crosswalk --output-dir cl/search/crosswalks [options]
- --output-dir: (Required) Directory to save crosswalk files
- --reporter: Process only a specific reporter (e.g., 'U.S.')
- --volume: Process only a specific volume
- --updated-after: Only process cases updated after this date (YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS+00:00)
- --start-from-reporter: Start processing from a specific reporter slug
- --dry-run: Run without saving crosswalk files
- --verbose: Increase output verbosity
The update_cap_cases command updates CourtListener cases with the latest CAP HTML content using the generated crosswalk files.
manage.py update_cap_cases --crosswalk-dir cl/search/crosswalks [options]
- --crosswalk-dir: (Required) Directory containing crosswalk files
- --reporter: Update only a specific reporter
- --max-workers: Maximum number of worker threads (default: 4, max: 16)
- --verbose: Enable verbose output
- Environment Setup
Set the following required environment variables. The CAP DevOps team provides the values:
- CAP_R2_ENDPOINT_URL="your_endpoint"
- CAP_R2_ACCESS_KEY_ID="your_key"
- CAP_R2_SECRET_ACCESS_KEY="your_secret"
- CAP_R2_BUCKET_NAME="your_bucket"
- Generate Crosswalk Files
Generate for all reporters:
manage.py generate_cap_crosswalk --output-dir cl/search/crosswalks
Or generate for a specific reporter with a date filter:
manage.py generate_cap_crosswalk --output-dir cl/search/crosswalks --reporter U.S. --updated-after 2024-03-19
- Update CourtListener Data
Update all reporters:
manage.py update_cap_cases --crosswalk-dir cl/search/crosswalks
Or update a specific reporter:
manage.py update_cap_cases --crosswalk-dir cl/search/crosswalks --reporter U_S
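Before running either command, it can help to confirm that the environment setup step is complete. The following is a minimal sketch, not part of CourtListener: the variable names come from the list above, while the helper name `missing_cap_vars` is hypothetical.

```python
import os

# Variable names are taken from the environment setup step above.
REQUIRED_CAP_VARS = (
    "CAP_R2_ENDPOINT_URL",
    "CAP_R2_ACCESS_KEY_ID",
    "CAP_R2_SECRET_ACCESS_KEY",
    "CAP_R2_BUCKET_NAME",
)


def missing_cap_vars(environ=None):
    """Return the required CAP R2 variable names that are unset or empty.

    Hypothetical helper for illustration; pass a dict to check something
    other than the current process environment.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_CAP_VARS if not env.get(name)]
```

Calling `missing_cap_vars()` with no arguments checks the current process environment; an empty list means all four variables are set.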
- generate_cap_crosswalk:
  - Fetches reporter metadata from CAP
  - For each reporter/volume:
    - Retrieves case metadata
    - Matches CAP cases with CourtListener cases
    - Generates crosswalk JSON files
- update_cap_cases:
  - Reads crosswalk files
  - For each mapping:
    - Fetches CAP HTML content
    - Updates CourtListener XML data
    - Updates cluster headmatter if needed
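The update loop described above, combined with the documented worker limits (default 4, max 16), can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual command: `fetch_html` and `update_cluster` are hypothetical callables supplied by the caller, and clamping an out-of-range worker count (rather than rejecting it) is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

DEFAULT_WORKERS = 4   # documented default for --max-workers
MAX_WORKERS = 16      # documented maximum for --max-workers


def clamp_workers(requested=None):
    """Return a worker count between 1 and MAX_WORKERS (assumed behavior)."""
    count = DEFAULT_WORKERS if requested is None else requested
    return max(1, min(count, MAX_WORKERS))


def update_all(entries, fetch_html, update_cluster, max_workers=None):
    """Fetch CAP HTML for each mapping concurrently and apply each update.

    fetch_html(cap_path) and update_cluster(cl_cluster_id, html) are
    hypothetical callables. Fetches run in worker threads; updates are
    applied in the calling thread as each fetch completes.
    Returns (updated_count, failures).
    """
    updated, failures = 0, []
    with ThreadPoolExecutor(max_workers=clamp_workers(max_workers)) as pool:
        futures = {pool.submit(fetch_html, e["cap_path"]): e for e in entries}
        for future in as_completed(futures):
            entry = futures[future]
            try:
                update_cluster(entry["cl_cluster_id"], future.result())
                updated += 1
            except Exception as exc:
                # Record the failure and keep processing the other mappings.
                failures.append((entry["cap_case_id"], str(exc)))
    return updated, failures
```

Collecting failures instead of stopping at the first error lets a long run finish and report which cases need a retry.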
[
{ "cap_case_id": 3, "cl_cluster_id": 1, "cap_path": "/reporter/volume/cases/case-id.json" }
]
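Entries in this shape can be read with a few lines of Python. The sketch below assumes each crosswalk file is a JSON list of objects with the three keys shown above; the function name `load_crosswalk` is hypothetical.

```python
import json

# Keys taken from the sample crosswalk entry above.
EXPECTED_KEYS = {"cap_case_id", "cl_cluster_id", "cap_path"}


def load_crosswalk(text):
    """Parse crosswalk JSON text and check that every entry has the expected keys."""
    entries = json.loads(text)
    if not isinstance(entries, list):
        raise ValueError("crosswalk must be a JSON list")
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {index} is not a JSON object")
        missing = EXPECTED_KEYS - set(entry)
        if missing:
            raise ValueError(f"entry {index} is missing keys: {sorted(missing)}")
    return entries


sample = (
    '[{"cap_case_id": 3, "cl_cluster_id": 1, '
    '"cap_path": "/reporter/volume/cases/case-id.json"}]'
)
entries = load_crosswalk(sample)
# Map each CourtListener cluster ID to its CAP case ID.
cluster_to_cap = {e["cl_cluster_id"]: e["cap_case_id"] for e in entries}
```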
- Missing Environment Variables
  - Ensure all CAP R2 environment variables are set
  - Check access permissions to the CAP bucket
- Date Filtering
  - Supports both ISO format (YYYY-MM-DDTHH:MM:SS+00:00) and short format (YYYY-MM-DD)
  - Short format dates are converted to midnight UTC
- Performance Considerations
  - Use --max-workers to adjust concurrent processing
  - Consider using --reporter for targeted updates
  - Large updates may require significant processing time
  - Filtering by last-updated date with --updated-after speeds up the process significantly
- Use --verbose for debugging
- Consider using --dry-run before large updates
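The date handling described under Date Filtering can be sketched as follows. This is an illustration of the stated behavior (both formats accepted, short dates treated as midnight UTC), not the command's actual parser; the function name `parse_updated_after` is hypothetical.

```python
from datetime import datetime, timezone


def parse_updated_after(value):
    """Parse YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS+00:00 into an aware datetime."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # No offset was given, so assume UTC. For a date-only value the
        # time component is already 00:00:00, giving midnight UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed
```

For example, `parse_updated_after("2024-03-19")` and `parse_updated_after("2024-03-19T00:00:00+00:00")` refer to the same instant.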