A wrapper around Heritrix for harvesting web content as part of Social Feed Manager.
As of SFM 1.12, sfm-web-harvester is deprecated.
git clone https://github.com/gwu-libraries/sfm-web-harvester.git
cd sfm-web-harvester
pip install -r requirements/requirements.txt
Note that requirements/requirements.txt references the latest release of sfm-utils. If you are doing development on the interaction between sfm-utils and sfm-web-harvester, use requirements/dev.txt. This uses a local copy of sfm-utils (../sfm-utils) in editable mode.
The web harvester acts on harvest start messages received from a queue. To run as a service:
python web_harvester.py service <mq host> <mq username> <mq password> <heritrix url> <heritrix username> <heritrix password> <contact url>
Web harvester can process harvest start files. The format of a harvest start file is the same as a harvest start message. To run:
python web_harvester.py seed <path to file> <heritrix url> <heritrix username> <heritrix password> <contact url>
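When the harvester is launched from another script, the two invocations above can be assembled as argument lists. The following is a minimal sketch and not part of sfm-web-harvester: the function names `service_command` and `seed_command` and every example value are hypothetical, and it assumes the script is web_harvester.py in both modes. Only the argument order is taken from the commands above.

```python
def service_command(mq_host, mq_username, mq_password,
                    heritrix_url, heritrix_username, heritrix_password,
                    contact_url, script="web_harvester.py"):
    """Build the argument list for running the harvester as a service.

    Hypothetical helper: the argument order mirrors the README command.
    """
    return ["python", script, "service",
            mq_host, mq_username, mq_password,
            heritrix_url, heritrix_username, heritrix_password,
            contact_url]


def seed_command(seed_path, heritrix_url, heritrix_username,
                 heritrix_password, contact_url, script="web_harvester.py"):
    """Build the argument list for processing a single harvest start file.

    Hypothetical helper: the argument order mirrors the README command.
    """
    return ["python", script, "seed", seed_path,
            heritrix_url, heritrix_username, heritrix_password,
            contact_url]
```

A list built this way can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems when a password or URL contains spaces or shell metacharacters.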
- Install Docker and Docker-Compose.
- Start up the containers.
    docker-compose -f docker/dev.docker-compose.yml up -d
- Run the tests.
    docker exec docker_sfmwebharvester_1 python -m unittest discover
- Shut down the containers.
    docker-compose -f docker/dev.docker-compose.yml kill
    docker-compose -f docker/dev.docker-compose.yml rm -v --force
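The test steps above can be scripted so that the containers are always cleaned up, even when the tests fail. This is a rough sketch under stated assumptions, not a script shipped with the project: the names `cycle_commands` and `run_test_cycle` are hypothetical, and the compose file path and container name are copied from the commands above and may differ on your machine.

```python
import subprocess

# Values copied from the README commands; adjust if your setup differs.
COMPOSE_FILE = "docker/dev.docker-compose.yml"
CONTAINER = "docker_sfmwebharvester_1"


def cycle_commands(compose_file=COMPOSE_FILE, container=CONTAINER):
    """Return the start command, the test command, and the cleanup commands."""
    compose = ["docker-compose", "-f", compose_file]
    start = compose + ["up", "-d"]
    test = ["docker", "exec", container, "python", "-m", "unittest", "discover"]
    cleanup = [compose + ["kill"], compose + ["rm", "-v", "--force"]]
    return start, test, cleanup


def run_test_cycle(run=subprocess.run):
    """Start the containers, run the tests, and always clean up afterwards.

    `run` is injectable so the sequence can be checked without Docker.
    Returns the exit code of the test command.
    """
    start, test, cleanup = cycle_commands()
    try:
        run(start, check=True)       # raises if the containers fail to start
        return run(test).returncode  # non-zero when tests fail
    finally:
        for cmd in cleanup:          # runs on success, failure, or exception
            run(cmd)
```

Calling `run_test_cycle()` with no arguments executes the real commands and requires Docker and Docker-Compose to be installed.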
See the messaging specification for how to construct a harvest start message.