principles

This script needs:

- one Hyphe server, and more precisely:
  - one Hyphe core instance
  - one Hyphe web-pages Mongo instance
- one SOLR node ready to index web pages

What it does:

- get web entities info from Hyphe core, filtering by status (see configuration)
- put the web entities not already processed (see logs) into WEB_ENTITY_PILE
- start nb_process (see configuration) processes to work on the retrieved web entities:
  - get a web entity from WEB_ENTITY_PILE
  - get its web pages list from Hyphe core
  - retrieve the mongo document for all URLs (filtering on mimetype, see configuration)
  - prepare documents to be indexed by creating a text version of the HTML code (see html2text.py)
  - index documents
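
The loop above can be sketched as follows. This is a minimal single-process illustration, not the script's actual code: build_pile, process_pile, fetch_pages and index_documents are hypothetical names, the "id" key on a web entity is an assumption, and the Hyphe, Mongo and Solr calls are replaced by plain callables supplied by the caller. The real script runs nb_process workers concurrently.

```python
import queue


def build_pile(web_entities, done_ids):
    """Queue the web entities whose id is not already marked as done."""
    pile = queue.Queue()
    for web_entity in web_entities:
        if web_entity["id"] not in done_ids:
            pile.put(web_entity)
    return pile


def process_pile(pile, fetch_pages, index_documents):
    """Drain the pile: fetch each entity's pages, then index them.

    fetch_pages and index_documents are stand-ins (assumptions) for the
    Hyphe/Mongo lookups and the Solr indexing call.
    """
    processed = []
    while True:
        try:
            web_entity = pile.get_nowait()
        except queue.Empty:
            break
        pages = fetch_pages(web_entity)
        index_documents(web_entity, pages)
        processed.append(web_entity["id"])
    return processed
```

Each worker process would run something like process_pile against a shared pile until it is empty.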

dependencies

HYPHE

This script relies on a running Hyphe server. See https://github.com/medialab/Hypertext-Corpus-Initiative

SOLR

This script relies on a running Solr server. See https://lucene.apache.org/solr/

python requirements

sunburnt
lxml
httplib2
pymongo
jsonrpclib
argparse #for python<2.7

INSTALL

You need a hyphe and a solr server running.

git clone this repository

Then simply execute (ideally in a virtualenv):

pip install -r requirements.txt

CONFIGURE

hyphe SOLR schema

Use the Solr node example provided in the solr_hyphe_core directory. The script deploy_solr_core.sh may help you. Before using it, change the Solr core path and the Tomcat user/service in the script (they depend on your install). BEWARE: it will erase any hyphe core already present in the Solr core path.

You should review the script before using it.

connection to data sources

Copy config.json.default to config.json and edit the parameters:

- hyphe2core:
  - nb_process: number of concurrent processes to start
  - web_entity_status_filter: filter selecting which web entities to index, based on their Hyphe status
- host/port of Hyphe core
- host/port/db/collection of the Hyphe Mongo database
- host/port/path of the Solr node
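
As an illustration, a small sanity check can be run on the edited file before starting a long indexing job. This is only a sketch: check_indexer_settings is a hypothetical helper, the example values are made up, and the exact layout and value types of config.json are assumptions (only the names hyphe2core, nb_process and web_entity_status_filter come from this README). Refer to config.json.default for the real structure.

```python
import json

# Hypothetical config content: only nb_process and web_entity_status_filter
# are named in this README; the values and their types are assumptions.
EXAMPLE_CONFIG = """
{
  "hyphe2core": {
    "nb_process": 4,
    "web_entity_status_filter": ["IN"]
  }
}
"""


def check_indexer_settings(config):
    """Return the indexer settings after basic sanity checks."""
    settings = config["hyphe2core"]
    nb_process = settings["nb_process"]
    if not isinstance(nb_process, int) or nb_process < 1:
        raise ValueError("nb_process must be a positive integer")
    if not settings["web_entity_status_filter"]:
        raise ValueError("web_entity_status_filter must not be empty")
    return settings


settings = check_indexer_settings(json.loads(EXAMPLE_CONFIG))
```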

Mime-type filter

Hyphe2solr lets you filter out web pages whose mimetype is not compatible with Solr indexing (our schema doesn't use Tika). The script generate_content_type_filter.py reads the MongoDB (version >2.1 only) and outputs a CSV listing the content types, ordered by the number of pages found in Mongo. From this CSV you have to write the content_type_whitelist.txt file. This file must contain one mimetype (to be indexed) per line. An example is provided: content_type_whitelist.txt.default
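
To make the whitelist format concrete, here is a minimal sketch of how a one-mimetype-per-line file could be applied to a page's content type. It is not the script's actual code: load_whitelist and is_indexable are hypothetical names, and ignoring parameters such as the charset is an assumption of this sketch.

```python
def load_whitelist(lines):
    """Build a set of mimetypes from whitelist lines, one mimetype per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def is_indexable(content_type, whitelist):
    """True if the page's mimetype is whitelisted.

    Any parameters after ';' (e.g. a charset) are ignored; that
    normalisation is an assumption of this sketch.
    """
    mimetype = content_type.split(";", 1)[0].strip().lower()
    return mimetype in whitelist


# In practice the lines would come from open("content_type_whitelist.txt").
whitelist = load_whitelist(["text/html\n", "\n", "application/xhtml+xml\n"])
```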

usage

Once you have prepared the configuration, simply run:

$ python index_hyphe_web_pages.py

There is only one option, which deletes the existing index before (re)indexing:

$ python index_hyphe_web_pages.py -h
usage: index_hyphe_web_pages.py [-h] [-d]

optional arguments:
  -h, --help          show this help message and exit
  -d, --delete_index  delete solr index before (re)indexing. WARNING all
                      previous indexing work will be lost.

If index_hyphe_web_pages.py is called multiple times without the -d|--delete_index option, the indexing process skips the web entities listed by id in logs/we_id_done.log. The default behaviour is thus to resume any previous unfinished indexing.
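
The resume mechanism can be pictured with the sketch below. It is an illustration under assumptions, not the script's code: read_done_ids and mark_done are hypothetical names, and the log is assumed to hold one web entity id per line.

```python
import os


def read_done_ids(log_path):
    """Return the set of web entity ids already logged as indexed."""
    if not os.path.exists(log_path):
        return set()
    with open(log_path) as log_file:
        return {line.strip() for line in log_file if line.strip()}


def mark_done(log_path, web_entity_id):
    """Append one web entity id to the log once it has been indexed."""
    with open(log_path, "a") as log_file:
        log_file.write("%s\n" % web_entity_id)
```

On startup, ids returned by read_done_ids("logs/we_id_done.log") would be excluded from the work to do; deleting the log, as -d does, makes every web entity eligible again.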

logs

Hyphe2solr writes its logs into 3 directories:

./logs/by_pid/ : one log file per process
./logs/by_web_entity/ : one log file per indexed web entity
./logs/errors_solr_document/ : logs the documents the script couldn't index in Solr

Hyphe2solr outputs the ids of indexed web entities in:

./logs/we_id_done.log : this file is used to resume indexing operations from where they stopped

When the -d or --delete_index option is used, the script clears all the logs.
