Crawl a website and search its content.
This project consists of two components: a crawler application that crawls a website and stores its content in a database, and a viewer web application that allows that crawled content to be searched.
Both components require Python 3.12 to run and are built using the Django web application framework. The crawler piece is built on top of the Archive Team's ludios_wpull web crawler.
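At a high level, the viewer's search can be pictured as a Django ORM query over crawled page content. The sketch below is purely illustrative; the Page model and field names are assumptions, not the project's actual schema:

from crawler.models import Page  # hypothetical model name

def search_pages(query):
    # Return URLs of crawled pages whose stored text contains the query.
    return Page.objects.filter(text__icontains=query).values_list("url", flat=True)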
This project can be run using Docker or a local Python virtual environment.
To build the Docker image:
docker build -t website-indexer:main .
To then run the viewer application using sample data:
docker run -it \
-p 8000:8000 \
website-indexer:main
The web application using sample data will be accessible at http://localhost:8000/.
To crawl a website using the Docker image, storing the result in a local SQLite database named crawl.sqlite3, first create the database file:
docker run -it \
-v `pwd`:/data \
-e DATABASE_URL=sqlite:////data/crawl.sqlite3 \
website-indexer:main \
python manage.py migrate
and then run the crawl, storing results into that database file:
docker run -it \
-v `pwd`:/data \
-e DATABASE_URL=sqlite:////data/crawl.sqlite3 \
website-indexer:main \
python manage.py crawl https://www.consumerfinance.gov
To then run the viewer web application to view that crawler database:
docker run -it \
-p 8000:8000 \
-v `pwd`:/data \
-e DATABASE_URL=sqlite:////data/crawl.sqlite3 \
website-indexer:main
The web application with the crawl results will be accessible at http://localhost:8000/.
Create a Python virtual environment and install required packages:
python3.12 -m venv venv
source venv/bin/activate
pip install -r requirements/base.txt
From the repo's root, compile frontend assets:
yarn
yarn build
Alternatively, to continuously watch the frontend assets and rebuild as necessary:
yarn
yarn watch
Run the viewer application using sample data:
./manage.py runserver
The web application using sample data will be accessible at http://localhost:8000/.
To crawl a website and store the result in a local SQLite database named crawl.sqlite3:
DATABASE_URL=sqlite:///crawl.sqlite3 ./manage.py crawl https://www.consumerfinance.gov
To then run the viewer web application to view that crawler database:
DATABASE_URL=sqlite:///crawl.sqlite3 ./manage.py runserver
The web application with the crawl results will be accessible at http://localhost:8000/.
The ./manage.py manage_crawls command can be used to list, delete, and clean up old crawls (assuming DATABASE_URL is set appropriately). Crawls in the database have a status field which can be one of Started, Finished, or Failed.
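In Django terms, a status field like this is often modeled with enumeration choices. The following is a hypothetical sketch, not this project's actual model definition:

from django.db import models

class Crawl(models.Model):
    # Hypothetical modeling of the status values described above.
    class Status(models.TextChoices):
        STARTED = "Started"
        FINISHED = "Finished"
        FAILED = "Failed"

    status = models.CharField(max_length=8, choices=Status.choices)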
To list crawls in the database:
./manage.py manage_crawls list
This will list crawls in the database, including each crawl's unique ID.
To delete an existing crawl, for example one with ID 123:
./manage.py manage_crawls delete 123
--dry-run can be added to the delete command to preview its output without modifying the database.
To clean old crawls, leaving behind one crawl of each status:
./manage.py manage_crawls clean
To modify the number of crawls left behind, for example leaving behind two of each status:
./manage.py manage_crawls clean --keep=2
--dry-run can also be added to the clean command to preview its output without modifying the database.
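Conceptually, the clean operation keeps the most recent crawl(s) of each status and deletes the rest. A minimal sketch of that logic, assuming a hypothetical Crawl model and using primary key order as a recency proxy:

from crawler.models import Crawl  # model name assumed for illustration

def clean_crawls(keep=1, dry_run=False):
    for status in ("Started", "Finished", "Failed"):
        # Collect all but the `keep` most recent crawls of this status.
        stale_ids = list(
            Crawl.objects.filter(status=status)
            .order_by("-pk")
            .values_list("pk", flat=True)[keep:]
        )
        print(f"{status}: {len(stale_ids)} crawl(s) to delete")
        if not dry_run:
            Crawl.objects.filter(pk__in=stale_ids).delete()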
The DATABASE_URL environment variable can be used to specify the database used for crawl results by the viewer application. This project makes use of the dj-database-url project to convert that variable into a Django database specification.
For example, to use a SQLite file at /path/to/db.sqlite:
export DATABASE_URL=sqlite:////path/to/db.sqlite
(Note the use of four slashes when referring to an absolute path; only three are needed when referring to a relative path.)
To point to a PostgreSQL database instead:
export DATABASE_URL=postgres://username:password@localhost/dbname
Please see the dj-database-url documentation for additional examples.
If the DATABASE_URL environment variable is left unset, the sample SQLite database file will be used.
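In Django settings, this conversion typically looks something like the following sketch; the project's actual settings code may differ, and the fallback path shown mirrors the sample database behavior described below:

import dj_database_url

DATABASES = {
    # Parse DATABASE_URL, falling back to the bundled sample database
    # when the variable is unset (fallback value assumed for illustration).
    "default": dj_database_url.config(default="sqlite:///sample/sample.sqlite3"),
}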
This repository includes a sample database file for testing purposes at sample/sample.sqlite3.
The sample database file is used by the viewer application when no other crawl database file has been specified.
The source website content used to generate this file is included in this repository under the sample/src subdirectory.
To regenerate the same database file, first delete it:
rm ./sample/sample.sqlite3
Then, start a Python webserver to serve the sample website locally:
cd ./sample/src && python -m http.server
This starts the sample website running at http://localhost:8000.
Then, in another terminal, recreate the database file:
./manage.py migrate
Finally, perform the crawl against the locally running site:
./manage.py crawl http://localhost:8000/
These commands assume use of a local Python virtual environment; alternatively consider using Docker.
This command will recreate the sample database file sample/sample.sqlite3 with a fresh crawl. To write to a different database, use the DATABASE_URL environment variable.
For consistency, the Python test fixture should be updated at the same time as the sample database.
To run Python unit tests, first install the test dependencies in your virtual environment:
pip install -r requirements/test.txt
To run the tests:
pytest
The Python tests make use of a test fixture generated from the sample database.
To recreate this test fixture:
./manage.py dumpdata --indent=4 crawler > crawler/fixtures/sample.json
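A test consuming that fixture might look like this hedged sketch (the Crawl model name is an assumption; the project's actual tests may differ):

from django.test import TestCase
from crawler.models import Crawl  # model name assumed for illustration

class SampleFixtureTests(TestCase):
    # Load the fixture regenerated by the dumpdata command above.
    fixtures = ["sample.json"]

    def test_sample_crawl_data_is_present(self):
        self.assertTrue(Crawl.objects.exists())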
This project uses Black as a Python code formatter.
To check if your changes to project code match the desired coding style:
black . --check
You can fix any problems by running:
black .
This project uses Prettier as a code formatter for JavaScript, CSS, and HTML templates.
To check if your changes to project code match the desired coding style:
yarn prettier
You can fix any problems by running:
yarn prettier:fix
For information on how this project is deployed at the CFPB, employees and contractors should refer to the internal CFGOV/crawler-deploy 🔒 repository.
This repository includes a Fabric script that can be used to configure a RHEL8 Linux server to run this project and to deploy both the crawler and the viewer application to that server.
To install Fabric in your virtual environment:
pip install -r requirements/deploy.txt
To configure a remote RHEL8 server with the appropriate system requirements, you'll need to use some variation of this command:
fab configure
You'll need to provide some additional connection information (for example, hostname and user) depending on the specific server you're targeting.
See the Fabric documentation for possible options; for example, to connect using a host configuration defined as crawler in your ~/.ssh/config, you might run:
fab configure -H crawler
The configure command:
- Installs Node and Git
- Installs Python 3.12
To run the deployment, you'll need to use some variation of this command:
fab deploy
The deploy command:
- Pulls down the latest version of the source code from GitHub
- Installs the latest dependencies
- Runs the frontend build script
- Configures the crawler to run nightly
- Sets up webserver logging and log rotation
- Serves the viewer application on port 8000
See fabfile.py for additional detail.
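For orientation, tasks of this shape might be defined as in the following minimal Fabric sketch; fabfile.py in this repository is the authoritative version, and the package names and paths below are assumptions:

from fabric import task

@task
def configure(c):
    # Install system dependencies on the RHEL8 host (illustrative packages).
    c.sudo("dnf install -y git nodejs")
    c.sudo("dnf install -y python3.12")

@task
def deploy(c):
    # Pull the latest source and rebuild (path and steps are assumptions).
    with c.cd("/opt/website-indexer"):
        c.run("git pull")
        c.run("pip install -r requirements/base.txt")
        c.run("yarn && yarn build")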