GitHub - IllDepence/Canvas-Indexer: A flask web application that crawls Activity Streams for IIIF Canvases and offers a search API.

A Flask web application that crawls Activity Streams for IIIF Canvases and offers a search API.

Project state

Canvas Indexer is being developed as part of the CODH's IIIF Curation Platform, but can also be used as a general IIIF tool. Integration into the IIIF Curation Platform means that there is a focus on cr:Curation^[1] type documents.^[2] Nevertheless, all development is done with generality in mind.^[3]

[1] http://codh.rois.ac.jp/iiif/curation/1#Curation
[2] The crawler currently only looks for canvases within cr:Curation documents (and not, for example, within sc:Manifests), and the search API offers dedicated parameters for them.
[3] The crawling process implements the IIIF Change Discovery API 0.1. Extending the indexing mechanism and search API to support IIIF documents within Activity Streams in general (or at least sc:Manifests as a first step) should be straightforward.

Setup

create virtual environment: $ python3 -m venv venv
activate virtual environment: $ source venv/bin/activate
install requirements: $ pip install -r requirements.txt

Config

section	key	default	explanation
shared	db_uri	sqlite:////tmp/ci_tmp.db	a SQLAlchemy database URI (file system paths have to be absolute)
crawler	as_sources	[]	comma separated list of links to Activity Streams in the form of OrderedCollections
	interval	3600	crawl interval in seconds (a value <= 0 deactivates automatic crawling)
	log_file	/tmp/ci_crawl_log.txt	file system path where the crawling details should be logged
	allow_orphan_canvases	false	whether Canvases that are no longer associated with any parent element in the index should still appear in search results
api	server_url	http://localhost:5005	URL under which Canvas Indexer can be accessed (used to set the `@id` attribute of curation format search results (see API section) and when using tagging bots (see Bot integration section))
	api_path	api	specifies the endpoint for API access (e.g. `search` → `http://indexcanvases.com/search` or `http://sirtetris.com/canvasindexer/search`)
	bot_urls	[]	comma separated list of URLs to bots (only needed when using bots; see Bot integration section)
	facet_label_sort_top	[]	comma separated list defining the beginning of the list returned for the `/facets` endpoint
	facet_label_sort_bottom	[]	comma separated list defining the end of the list returned for the `/facets` endpoint
	facet_value_sort_frequency	[]	comma separated list of facets to be sorted by frequency
	facet_value_sort_alphanum	[]	comma separated list of facets to be sorted alphanumerically
	facet_label_hide	[]	comma separated list of facet labels to hide from API output
facet_value_sort_custom_<name>	label		facet label for which a custom order is defined
	sort_top		comma separated list defining the beginning of the custom order
	sort_bottom		comma separated list defining the end of the custom order
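
To make the table above more concrete, here is a minimal sketch of how such a configuration could be read with Python's standard configparser module. The section and key names are taken from the table; the Activity Stream URLs, the helper split_list and the exact file syntax are assumptions for illustration, so check config.ini.example for the format Canvas Indexer actually expects.

```python
import configparser

# Hypothetical configuration text. Section and key names follow the table
# above; the Activity Stream URLs are placeholders.
SAMPLE_CONFIG = """
[shared]
db_uri = sqlite:////tmp/ci_tmp.db

[crawler]
as_sources = https://example.org/as/collection.json, https://example.com/as/collection.json
interval = 3600
log_file = /tmp/ci_crawl_log.txt
allow_orphan_canvases = false

[api]
server_url = http://localhost:5005
api_path = api
"""


def split_list(value):
    """Turn a comma separated config value into a list of stripped strings."""
    return [item.strip() for item in value.split(",") if item.strip()]


config = configparser.ConfigParser()
config.read_string(SAMPLE_CONFIG)

sources = split_list(config["crawler"]["as_sources"])
interval = config["crawler"].getint("interval")
allow_orphans = config["crawler"].getboolean("allow_orphan_canvases")

# An interval <= 0 deactivates automatic crawling (see table above).
automatic_crawling = interval > 0
```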

Run

Directly through Flask

$ source venv/bin/activate
$ python3 run.py [debug]

Using gunicorn

$ source venv/bin/activate
$ pip install gunicorn
$ ./venv/bin/gunicorn 'canvasindexer:create_app()'

Note that gunicorn by default times out requests after 30 seconds, which can interfere with long crawling procedures (e.g. the first crawl of a large Activity Stream). The timeout can be changed by creating a file gunicorn_config.py containing a line such as timeout = 3600 (for a timeout of one hour) or timeout = 0 to deactivate timeouts altogether. To start Canvas Indexer using this config, run:

$ ./venv/bin/gunicorn -c gunicorn_config.py 'canvasindexer:create_app()'
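
For reference, a gunicorn_config.py along the lines described above could look like the following sketch. Gunicorn config files are plain Python; only the timeout setting comes from the text above, and the one hour value is just the example figure used there.

```python
# gunicorn_config.py (minimal sketch)

# Worker timeout in seconds. 3600 allows a request, such as a long first
# crawl, to run for up to one hour. Use 0 to deactivate timeouts altogether.
timeout = 3600
```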

API

path: {base_url}/{api_path} (by default {base_url}/api)
arguments:

arg	default	explanation
select	`curation`	set the type of search results to be returned to either `canvas` or `curation`
from	`curation,canvas`	set the type of metadata the search results should be based on to `canvas`, `curation` or a comma separated list of both
where		search keyword
where_metadata_label		used to search by a property+value pair; requires where_metadata_value
where_metadata_value		used to search by a property+value pair; requires where_metadata_label
where_agent	`human,machine`	set the type of metadata creator to `human`, `machine` or a comma separated list of both
start	`0`	0-based index, within the list of all results, of the first result to return
limit	`null` (no limit)	maximum number of results to return
output		if set to `curation` and `select=canvas`, search results will be returned as a curation

example: {base_url}/api?select=canvas&from=canvas,curation&where=face
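
As an illustration, a search URL like the one above can be assembled with Python's standard library. This is a minimal sketch: the helper build_search_url is hypothetical, and the base URL is simply the default server_url from the Config section.

```python
from urllib.parse import urlencode


def build_search_url(base_url, params, api_path="api"):
    """Build a search URL from a dict of API arguments (hypothetical helper)."""
    # Leave out arguments that are not set so the server-side defaults apply.
    query = {key: value for key, value in params.items() if value is not None}
    # safe="," keeps comma separated lists readable instead of encoding them.
    return f"{base_url.rstrip('/')}/{api_path}?{urlencode(query, safe=',')}"


search_url = build_search_url(
    "http://localhost:5005",
    {"select": "canvas", "from": "canvas,curation", "where": "face", "limit": None},
)
print(search_url)
# http://localhost:5005/api?select=canvas&from=canvas,curation&where=face
```

The resulting URL can then be requested with any HTTP client.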

path: {base_url}/parents
returns the list of curations that contain a given canvas or canvas area

arguments:

arg	default	explanation
canvas	`null`	URL encoded canvas ID
xywh	`null`	optional xywh fragment (needs to match exactly)
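
Since the canvas ID has to be URL encoded, a request URL for this endpoint could be built as in the following sketch. The helper build_parents_url and the canvas ID are made up for illustration, and leaving the commas of the xywh fragment unencoded is an assumption.

```python
from urllib.parse import quote


def build_parents_url(base_url, canvas_id, xywh=None):
    """Build a /parents URL for a canvas or canvas area (hypothetical helper)."""
    # safe="" also encodes "/" and ":", so the whole ID becomes one query value.
    url = f"{base_url.rstrip('/')}/parents?canvas={quote(canvas_id, safe='')}"
    if xywh is not None:
        # The fragment needs to match the indexed one exactly.
        url += f"&xywh={quote(xywh, safe=',')}"
    return url


parents_url = build_parents_url(
    "http://localhost:5005",
    "https://example.org/iiif/book1/canvas/p1",  # placeholder canvas ID
    xywh="100,200,300,400",
)
```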

path: {base_url}/facets
returns a pregenerated overview of the indexed metadata facets

Crawler

The crawler can be configured to run periodically (see Config) or triggered manually by accessing {base_url}/crawl.
On its first run, the crawler will go through an Activity Stream in its entirety; subsequent runs will only consider Activities that occurred after the previous run.
In its current state, the crawler indexes only the label/value pairs given in an IIIF resource's metadata property.
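
The incremental behaviour described above can be sketched as follows. This is not Canvas Indexer's actual implementation, only an illustration of the idea: it assumes each Activity carries an endTime timestamp, as in the IIIF Change Discovery API, and the function names and sample data are hypothetical.

```python
from datetime import datetime, timezone


def parse_time(value):
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def activities_to_process(activities, last_crawl=None):
    """Select the Activities a crawl run has to look at.

    On the first run (last_crawl is None) every Activity is returned;
    afterwards only those that ended after the previous run.
    """
    if last_crawl is None:
        return list(activities)
    return [a for a in activities if parse_time(a["endTime"]) > last_crawl]


# Made-up sample Activities.
stream = [
    {"type": "Create", "endTime": "2019-01-10T08:00:00Z"},
    {"type": "Update", "endTime": "2019-02-01T12:30:00Z"},
]

first_run = activities_to_process(stream)
later_run = activities_to_process(
    stream, last_crawl=datetime(2019, 1, 15, tzinfo=timezone.utc)
)
```

Here first_run contains both Activities, while later_run contains only the Update from February.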

Bot integration

Canvas Indexer can be set up to send image URLs of the canvases it indexes to bots that return tags. These tags are then integrated in the index. Example code of a bot can be found in the folder bot_example.

Logo

The Canvas Indexer logo uses image content from 絵本花葛蘿 in the 日本古典籍データセット（国文研所蔵） provided by the Center for Open Data in the Humanities, used under CC-BY-SA 4.0. The Canvas Indexer logo is licensed under CC-BY-SA 4.0 by Tarek Saier. A high resolution version (4456×2326 px) can be downloaded here.

Support

Sponsored by the National Institute of Informatics.
Supported by the Center for Open Data in the Humanities, Joint Support-Center for Data Science Research, Research Organization of Information and Systems.

