This script needs:
- one Hyphe server, and more precisely:
  - one Hyphe core instance
  - one Hyphe web-pages Mongo instance
- one SOLR node ready to index web-pages
What it does:
- get web entities info from Hyphe core filtering by status (see configuration)
- put the web entities not already processed (see logs) into WEB_ENTITY_PILE
- start nb_process (see configuration) processes to work on the web entities retrieved:
  - get a web entity from WEB_ENTITY_PILE
  - get the web pages list from Hyphe core
  - retrieve the mongo documents for all URLs (filtering on mimetype, see configuration)
  - prepare documents to be indexed by creating a text version of the HTML code (see html2text.py)
  - index documents
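The worker loop described above can be sketched as follows. This is only an illustration of the "pile plus nb_process workers" pattern, not the actual code of index_hyphe_web_pages.py: the function names (index_web_entity, worker, run) are hypothetical, the per-entity work is replaced by a placeholder, and a None sentinel is assumed as the way to tell each worker that the pile is empty.

```python
import multiprocessing


def index_web_entity(web_entity_id):
    """Placeholder (hypothetical) for the real per-entity work:
    fetch the page list, read the mongo documents, convert to text, index."""
    return "indexed %s" % web_entity_id


def worker(pile, results):
    """Take web entity ids from the pile until a None sentinel is read."""
    for web_entity_id in iter(pile.get, None):
        results.put(index_web_entity(web_entity_id))


def run(web_entity_ids, nb_process):
    """Fill the pile, start nb_process workers and collect their results."""
    pile = multiprocessing.Queue()
    results = multiprocessing.Queue()
    for web_entity_id in web_entity_ids:
        pile.put(web_entity_id)
    # One sentinel per worker so that every process stops cleanly.
    for _ in range(nb_process):
        pile.put(None)
    processes = [
        multiprocessing.Process(target=worker, args=(pile, results))
        for _ in range(nb_process)
    ]
    for process in processes:
        process.start()
    # Read the results before joining so that no worker blocks on a full queue.
    collected = [results.get() for _ in range(len(web_entity_ids))]
    for process in processes:
        process.join()
    return collected
```

Calling run(["we-1", "we-2"], nb_process=2) would return one result per web entity, in whatever order the workers finished.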
This script relies on a running Hyphe server. See https://github.com/medialab/Hypertext-Corpus-Initiative
This script relies on a running solr server. See https://lucene.apache.org/solr/
Python dependencies (see requirements.txt): sunburnt, lxml, httplib2, pymongo, jsonrpclib, argparse (argparse only for python<2.7)
You need a hyphe and a solr server running.
git clone this repository
Then simply execute (ideally in a virtualenv):
pip install -r requirements.txt
Use the solr node example provided in the solr_hyphe_core directory. The script deploy_solr_core.sh might help you. Change the solr core path and the tomcat user/service (which depend on your install) in the script before using it. BEWARE: it will erase any hyphe core already present in the solr core path.
You should review the script before using it.
Copy config.json.default into config.json and edit the parameters:
- hyphe2core :
  - nb_process: number of concurrent processes to start
  - web_entity_status_filter: filter selecting the web entities to index, based on their Hyphe status
- host/port of Hyphe core
- host/port/db/collection of mongo hyphe db
- host/port/path of solr node
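As an illustration, the two indexer parameters could be read and checked like this. This is a sketch under assumptions: it supposes that config.json is a JSON object in which nb_process and web_entity_status_filter are nested under the section named in the list above, and the function name load_indexer_settings is hypothetical. Check config.json.default for the real layout.

```python
import json


def load_indexer_settings(path="config.json", section="hyphe2core"):
    """Return (nb_process, web_entity_status_filter) read from config.json.

    The nesting under `section` is an assumption about the file layout.
    """
    with open(path) as config_file:
        config = json.load(config_file)
    settings = config[section]
    nb_process = int(settings["nb_process"])
    if nb_process < 1:
        raise ValueError("nb_process must be at least 1")
    return nb_process, settings["web_entity_status_filter"]
```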
Hyphe2solr lets you filter out web pages whose mimetype is not compatible with solr indexing (our schema doesn't use Tika). The script generate_content_filter.py outputs from the mongodb (version >2.1 only) a CSV listing the content-types ordered by the number of pages found in the mongo. From this CSV you have to write the content_type_whitelist.txt file. This file must contain one mimetype (to be indexed) per line. An example is provided: content_type_whitelist.txt.default
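A whitelist file in this format (one mimetype per line) can be applied roughly as shown below. This is a minimal sketch, not the script's own filtering code: the function names are hypothetical, and it assumes that a page's content-type may carry parameters such as "; charset=utf-8" which should be ignored when comparing.

```python
def load_whitelist(path="content_type_whitelist.txt"):
    """Read one mimetype per line, ignoring blank lines."""
    with open(path) as whitelist_file:
        return {line.strip().lower() for line in whitelist_file if line.strip()}


def is_indexable(content_type, whitelist):
    """Tell whether a page with this content-type should be indexed."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    mimetype = content_type.split(";", 1)[0].strip().lower()
    return mimetype in whitelist
```

With a whitelist containing only text/html, is_indexable("text/html; charset=utf-8", whitelist) is true and is_indexable("image/png", whitelist) is false.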
Once you have prepared the configuration, simply run:
$ python index_hyphe_web_pages.py
There is only one option, which deletes the existing index before (re)indexing:
$ python index_hyphe_web_pages.py -h
usage: index_hyphe_web_pages.py [-h] [-d]
optional arguments:
-h, --help show this help message and exit
-d, --delete_index delete solr index before (re)indexing. WARNING all
previous indexing work will be lost.
If index_hyphe_web_pages.py is called multiple times without the -d|--delete_index option, the indexing process skips the web entities listed by id in logs/we_id_done.log. The default behaviour is thus to resume any previous unfinished indexing.
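The resume behaviour can be pictured with the sketch below. It assumes that logs/we_id_done.log holds one web entity id per line, which is not confirmed here, and the helper names (load_done_ids, pending, mark_done) are hypothetical rather than taken from the script.

```python
import os


def load_done_ids(path="logs/we_id_done.log"):
    """Return the set of web entity ids already indexed (empty on first run)."""
    if not os.path.exists(path):
        return set()
    with open(path) as log_file:
        return {line.strip() for line in log_file if line.strip()}


def pending(web_entity_ids, done_ids):
    """Keep only the web entities that still have to be indexed."""
    return [we_id for we_id in web_entity_ids if we_id not in done_ids]


def mark_done(web_entity_id, path="logs/we_id_done.log"):
    """Append an id to the log once its web entity is fully indexed."""
    with open(path, "a") as log_file:
        log_file.write(web_entity_id + "\n")
```

Appending an id only after its web entity is completely indexed means that an interrupted entity is simply processed again on the next run.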
Hyphe2solr logs into 3 log directories:
- ./logs/by_pid/: one log file per process
- ./logs/by_web_entity/: one log file per web entity indexed
- ./logs/errors_solr_document/: logs the documents the script couldn't index in Solr
Hyphe2solr outputs the ids of indexed web entities in:
- ./logs/we_id_done.log: this file is used to resume indexing operations from where they stopped
When the -d or --delete_index option is used, the script clears all the logs.