Sheer is a tool for loading arbitrary content into Elasticsearch and serving that content on the web using Jinja2 templates.
If you're not familiar with Elasticsearch, it is highly recommended that you read the Elasticsearch Definitive Guide's Finding Your Feet.
Sheer is a Python application that requires:
Recommended for installing and running Sheer:
Running tests requires:
To run Sheer you will first need to install Elasticsearch. This can be accomplished in a number of ways, many of which are detailed in the Elasticsearch documentation. On Mac OS X it can be installed using Homebrew:
brew install elasticsearch
There are also Elasticsearch apt and Yum repositories.
Before running Sheer, you will also need to ensure that Elasticsearch is running. When Elasticsearch is installed on Mac OS X via Homebrew, Homebrew will provide some guidance like:
To have launchd start elasticsearch at login:
ln -sfv <homebrew location>/elasticsearch/*.plist ~/Library/LaunchAgents
Then to load elasticsearch now:
launchctl load ~/Library/LaunchAgents/homebrew.mxcl.elasticsearch.plist
To install Sheer itself, it is recommended to create a virtualenv using virtualenvwrapper.
mkvirtualenv sheer
workon sheer
Then you can clone the Sheer repository and install the Python requirements using pip:
git clone https://github.com/cfpb/sheer
pip install -r sheer/requirements.txt
You can then install Sheer with pip. The -e flag installs Sheer in "editable" mode, which means it runs from the path where you've cloned it, and any changes you git pull from upstream won't have to be installed again.
pip install -e sheer
The sheer command takes the following general arguments:
- -h: Show help message and exit.
- --debug: Print debugging output to the console.
- --location: The directory you want to operate on. You can also set the SHEER_LOCATION environment variable.
- --elasticsearch ELASTICSEARCH, -e ELASTICSEARCH: Elasticsearch host:port pairs. Separate hosts with commas. Default is localhost:9200. You can also set the SHEER_ELASTICSEARCH_HOSTS environment variable.
- --index INDEX, -i INDEX: Elasticsearch index name. Default is content. You can also set the SHEER_ELASTICSEARCH_INDEX environment variable.
The sheer command also takes one of two positional arguments:
- index: Load content into Elasticsearch.
- serve: Serve content from Elasticsearch using configuration and templates at location.
These are covered in more detail below.
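The way command-line flags, environment variables, and defaults relate to each other can be sketched with argparse. This is only an illustration of the documented behavior, not Sheer's actual parser; the build_parser helper and its env parameter are hypothetical names introduced here.

```python
import argparse
import os


def build_parser(env=None):
    """Build a parser mirroring the documented arguments (hypothetical helper)."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="sheer")
    parser.add_argument("--debug", action="store_true",
                        help="Print debugging output to the console.")
    # Environment variables supply the value when the flag is not given.
    parser.add_argument("--location", default=env.get("SHEER_LOCATION"))
    parser.add_argument("--elasticsearch", "-e",
                        default=env.get("SHEER_ELASTICSEARCH_HOSTS",
                                        "localhost:9200"))
    parser.add_argument("--index", "-i",
                        default=env.get("SHEER_ELASTICSEARCH_INDEX", "content"))
    parser.add_argument("command", choices=["index", "serve"])
    return parser


# An explicit flag wins over the environment, which wins over the default.
args = build_parser(env={"SHEER_ELASTICSEARCH_INDEX": "staging"}).parse_args(
    ["--elasticsearch", "es1:9200,es2:9200", "serve"])
hosts = args.elasticsearch.split(",")  # hosts are separated with commas
```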
To run the Sheer tests, you'll need the Python packages nose and mock installed. Both can be installed via pip:
pip install nose mock
Both are also installed by the Sheer requirements.txt file.
To run the tests, simply run:
nosetests sheer
This quick start assumes you have an existing Sheer site you want to load content for and serve.
cd path/to/my/sheer/site
Index the site's content in Elasticsearch:
sheer index
Serve the site at http://localhost:7000:
sheer serve
The site can also be served in "debug" mode:
sheer serve --debug
sheer index
Sheer indexing allows configurable loading of content into Elasticsearch.
sheer index takes the following arguments:
- --reindex, -r: Recreate the index and reindex all content.
- --processors [PROCESSORS [PROCESSORS ...]], -p [PROCESSORS [PROCESSORS ...]]: Content processors to index.
These are covered in more detail below.
Sheer does not index:
_settings/
_layouts/
_queries/
_defaults/
_lib/
_tests/
These are hard-coded in indexer.py.
The basic indexing process:
- Creates the index with the given settings if it does not exist
- Creates the mappings for each content processor if they do not exist
- Enumerates the documents to be loaded into Elasticsearch that are yielded by the content processor's documents() function. If the documents already exist they are updated.
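The three steps above can be sketched against an in-memory stand-in. This is not Sheer's indexer: the run_index helper is hypothetical, a plain dict takes the place of Elasticsearch so the sketch runs without a server, and documents are assumed to carry an "id" field.

```python
def run_index(store, index_name, settings, processors):
    """Mirror the documented steps using a dict in place of Elasticsearch."""
    # 1. Create the index with the given settings if it does not exist.
    index = store.setdefault(
        index_name, {"settings": settings, "mappings": {}, "docs": {}})
    for name, (mapping, documents) in processors.items():
        # 2. Create the mapping for each content processor if it is missing.
        index["mappings"].setdefault(name, mapping)
        # 3. Load every yielded document; an existing id is overwritten.
        for doc in documents(name):
            index["docs"][(name, doc["id"])] = doc
    return index


def posts(name, **kwargs):
    """A toy content processor yielding two documents."""
    yield {"id": 1, "title": "First"}
    yield {"id": 2, "title": "Second"}


store = {}
processors = {"posts": ({"properties": {}}, posts)}
run_index(store, "content", {}, processors)
run_index(store, "content", {}, processors)  # second run updates, no duplicates
```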
sheer index --reindex
Destroys the index in Elasticsearch and recreates it, recreating the mappings and reloading all documents.
sheer index --processors posts
Reindex only content provided by the given content processors. The documents provided by the given processor will be updated in Elasticsearch.
sheer index --processors posts --reindex
This will destroy the mappings for the given content processor and recreate them, then load the documents provided by the given processor into Elasticsearch.
Sheer reads settings from _settings/settings.json. These settings are passed as a document containing index settings to Elasticsearch.create. Existing Sheer sites use this file to configure Elasticsearch analyzers.
Analyzers tokenize both document fields during indexing and query strings during searching, so that the terms being searched for are consistent with the terms indexed.
For example:
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "my_edge_ngram_analyzer" : {
          "tokenizer" : "my_edge_ngram_tokenizer"
        }
      },
      "tokenizer" : {
        "my_edge_ngram_tokenizer" : {
          "type" : "edgeNGram",
          "min_gram" : "2",
          "max_gram" : "5",
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  }
}
This will use Elasticsearch's edgeNGram tokenizer (which only keeps n-grams, sequences of characters, that start at the beginning of a token) to build an analyzer.
See Elasticsearch's Configuring Analyzers for more information about configuration.
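To make the effect of the settings above concrete, the following sketch approximates what an edge n-gram tokenizer with min_gram 2, max_gram 5, and letter and digit token characters produces. It is a rough Python approximation for illustration only, not Elasticsearch's implementation; edge_ngrams is a hypothetical name.

```python
import re


def edge_ngrams(text, min_gram=2, max_gram=5):
    """Approximate an edge n-gram tokenizer limited to letters and digits."""
    grams = []
    # Anything that is not a letter or digit separates tokens.
    for token in re.findall(r"[^\W_]+", text):
        # Keep only prefixes of the token, from min_gram to max_gram long.
        for size in range(min_gram, min(max_gram, len(token)) + 1):
            grams.append(token[:size])
    return grams


print(edge_ngrams("Sheer 2015"))
# ['Sh', 'She', 'Shee', 'Sheer', '20', '201', '2015']
```

Because the same analyzer is applied to query strings, a partial query such as "She" would match a document containing "Sheer".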
Sheer reads Content Processors from _settings/processors.json.
Content Processors are configured with a unique name and a JSON object including at least the following fields:
- processor: The name of a Python module within the Sheer site's _lib directory.
- mappings (optional): The mappings file (within the Sheer site directory) for this content.
The processor module must provide a documents generator which takes the processor's name and the remaining keyword arguments from the JSON object and yields documents suitable for indexing in Elasticsearch.
def documents(name, **kwargs):
    ...
    yield document
For example, a content processor might be configured like this in processors.json:
{
  "posts": {
    "url": "$WORDPRESS/api/get_posts/",
    "processor": "wordpress_post_processor",
    "mappings": "_settings/posts_mappings.json"
  }
}
And may have the following corresponding _lib/wordpress_post_processor.py:
def posts_at_url(url):
    """
    Yield WordPress posts from the given URL
    """
    ...

def process_post(post):
    """
    Process a post for indexing in Elasticsearch
    """
    ...

def documents(name, url, **kwargs):
    """
    Yield a document for indexing in Elasticsearch
    """
    for post in posts_at_url(url):
        yield process_post(post)
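How a processors.json entry might be turned into a documents() call can be sketched as follows. This is an assumption-laden illustration, not Sheer's code: call_processor is a hypothetical helper, the modules dict stands in for importing from _lib, and dropping mappings from the keyword arguments is a guess about which keys are "remaining".

```python
import json
from types import SimpleNamespace

PROCESSORS_JSON = """
{
  "posts": {
    "url": "$WORDPRESS/api/get_posts/",
    "processor": "wordpress_post_processor",
    "mappings": "_settings/posts_mappings.json"
  }
}
"""


def call_processor(name, config, modules):
    """Pass the processor's name and its remaining settings to documents()."""
    kwargs = dict(config)
    module = modules[kwargs.pop("processor")]
    kwargs.pop("mappings", None)  # assumed to be consumed by the indexer
    return module.documents(name, **kwargs)


def fake_documents(name, url, **kwargs):
    """Stand-in for a real processor module's documents() generator."""
    yield {"type": name, "source": url}


# A real site would import the module from _lib instead of using this dict.
modules = {"wordpress_post_processor": SimpleNamespace(documents=fake_documents)}

docs = []
for name, config in json.loads(PROCESSORS_JSON).items():
    docs.extend(call_processor(name, config, modules))
```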
Mappings are described in great detail in the Elasticsearch mappings documentation. A mapping defines the searchable characteristics of a document, such as which fields are searchable and how they're tokenized.
Each content processor provides a mappings file path relative to the Sheer site directory. This file contains the properties JSON object for the mapping (as described in Elasticsearch's PUT mapping API reference). This is passed directly to Elasticsearch.
For example:
{
  "properties" : {
    "title" : {"type" : "string", "store" : "yes"},
    "text" : {"type" : "string", "store" : "yes"},
    "date" : {"type" : "date", "store": "yes"},
    "category" : {"type":"string", "index": "not_analyzed"},
    "author" : {"type":"string", "index":"not_analyzed"},
    "tags" : {"type":"string", "store": "yes", "index":"not_analyzed"},
    "excerpt" : {"type":"string", "store": "yes"},
    "custom_fields": {
      "properties": {
        "display_in_newsroom": {"type":"string", "index":"not_analyzed"}
      }
    }
  }
}
A default _settings/mappings.json, if it exists, is also passed to Elasticsearch.
sheer serve
Sheer can serve the content it indexes in Elasticsearch from the command line in the foreground or via WSGI. Sheer serves its content using a Flask application.
sheer serve takes the following arguments:
- --port PORT, -p PORT: Port to run the web server on.
- --addr ADDR, -a ADDR: Address to run the web server on.
Sheer does not serve any paths beginning with an underscore. They are considered private.
Sheer will serve the following content in order:
- index.html template from a directory containing it
- An Elasticsearch document from a lookup URL
- A Flask Blueprint
Sheer will serve the index.html template from any directory under the site root not beginning with an underscore. So, given a <site root>/blog/index.html, Sheer will serve the template at /blog/ and /blog/index.html. Sheer will also redirect /blog to /blog/.
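The index.html rules above can be sketched as a small resolver. This is an illustration rather than Sheer's routing code: resolve is a hypothetical function, the templates set stands in for the filesystem, and treating any path component that starts with an underscore as private is an assumption.

```python
def resolve(path, templates):
    """Return ("template", name), ("redirect", url), or ("not_found", None)."""
    parts = [p for p in path.split("/") if p]
    # Paths beginning with an underscore are considered private.
    if any(p.startswith("_") for p in parts):
        return ("not_found", None)
    if parts and parts[-1] == "index.html":
        candidate = "/".join(parts)
    elif path.endswith("/"):
        candidate = "/".join(parts + ["index.html"])
    elif "/".join(parts + ["index.html"]) in templates:
        # /blog redirects to /blog/ when blog/index.html exists.
        return ("redirect", path + "/")
    else:
        return ("not_found", None)
    if candidate in templates:
        return ("template", candidate)
    return ("not_found", None)


templates = {"blog/index.html"}
results = [resolve(p, templates)
           for p in ("/blog/", "/blog/index.html", "/blog", "/_layouts/")]
```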
Sheer always adds two search paths beyond the request directory for Jinja2 templates:
_layouts
_includes
There are no specific rules within Sheer that dictate what templates go in either location, but they provide for some logical separation. These search paths include parent directories up to the root of the site.
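One way to picture the resulting search path is the sketch below. It is not taken from Sheer: search_paths is a hypothetical function, paths are relative to the site root, and the ordering (nearest directory first) is an assumption.

```python
import posixpath


def search_paths(request_dir):
    """List template directories for a request directory under the site root."""
    paths = [request_dir or "."]
    current = request_dir
    while True:
        # _layouts and _includes are added at every level up to the root.
        for extra in ("_layouts", "_includes"):
            paths.append(posixpath.join(current, extra))
        if not current:
            break
        current = posixpath.dirname(current)
    return paths


paths = search_paths("blog/2015")
```

For a request under blog/2015, this yields the request directory itself, then the _layouts and _includes directories in blog/2015, in blog, and finally at the site root.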
Sheer provides convenient access to query tools for use in templates using the following context variables:
Returns the selected filter values contained in the request query string for a given fieldname.
Returns whether or not the given filter value is selected in the request query string for the given fieldname.
A QueryFinder object for lookup of pre-defined Elasticsearch queries stored as JSON files in <site_root>/_queries/<query_name>.json. This context variable exposes the Sheer query API to templates.
For example, within a Jinja2 template, one might do the following:
{% set query = queries.posts %}
{% set posts = query.search(size=10) %}
{%- for post in posts %}
...
{% endfor %}
Performs an Elasticsearch "more like this" (mlt) search for documents that are "like" the document described by the given QueryHit object.
Returns a QueryResult object.
Optionally takes additional keyword arguments corresponding to the "mlt" parameters described in the Elasticsearch documentation.
{% set query = queries.posts %}
{% set posts = query.search(size=10) %}
{%- for post in posts %}
    ...
    {% for similar in more_like_this(post) %}
        ...
    {% endfor %}
    ...
{% endfor %}
Performs an Elasticsearch "get" for a document of the given doctype with the given docid. Returns a single QueryHit object.
Lookup URLs in Sheer are URLs at which Elasticsearch lookups will be performed. Lookup URLs are defined in _settings/lookups.json. For example:
{
  "post": {
    "url": "/blog/<id>/",
    "type": "posts",
    "permalink": true
  }
}
This will add a URL pattern where the <id> in the URL pattern is the Elasticsearch id of a document of the given type. These documents are then templated using either an index.html file at their full path (including <id>) or the first _single.html template that is found in the search path.
In this case, documents in Elasticsearch with the type posts will be served at the URL /blog/<id> and templated with one of the following:
- <site root>/blog/<id>/index.html
- <site root>/blog/_single.html
- <site root>/_single.html
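The template candidates for a lookup URL can be sketched like this. It is an illustration under assumptions, not Sheer's lookup code: template_candidates is a hypothetical function and the paths it returns are relative to the site root.

```python
import posixpath


def template_candidates(url_pattern, doc_id):
    """List templates to try, in order, for a document at a lookup URL."""
    url = url_pattern.replace("<id>", str(doc_id)).strip("/")
    # First choice: an index.html at the document's full path.
    candidates = [posixpath.join(url, "index.html")]
    # Then the nearest _single.html, walking up to the site root.
    directory = posixpath.dirname(url)
    while True:
        candidates.append(posixpath.join(directory, "_single.html"))
        if not directory:
            break
        directory = posixpath.dirname(directory)
    return candidates


candidates = template_candidates("/blog/<id>/", 12345)
```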
Sheer loads Flask Blueprints from _settings/blueprints.json. This is a JSON file that includes the Python package which contains each blueprint, and the name of the blueprint itself as module.
For example, the following blueprint:
myblueprint = Blueprint("myblueprint", __name__, url_prefix="")
Defined in the Python package and module ablueprint.controllers would be configured like this:
{
  "myblueprint": {
    "package": "ablueprint.controllers",
    "module": "myblueprint"
  }
}
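Resolving such a configuration amounts to importing the package and fetching the named attribute, which might look roughly like the sketch below. load_blueprints is a hypothetical helper, not Sheer's loader, and a stand-in module is registered in place of a real ablueprint.controllers so the example runs without Flask.

```python
import importlib
import sys
import types


def load_blueprints(config):
    """Resolve each configured blueprint by package and attribute name."""
    blueprints = {}
    for name, entry in config.items():
        package = importlib.import_module(entry["package"])
        blueprints[name] = getattr(package, entry["module"])
    return blueprints


# Stand-in for the real package; a plain object replaces the Blueprint.
parent = types.ModuleType("ablueprint")
parent.__path__ = []
controllers = types.ModuleType("ablueprint.controllers")
controllers.myblueprint = object()
sys.modules["ablueprint"] = parent
sys.modules["ablueprint.controllers"] = controllers

loaded = load_blueprints({
    "myblueprint": {"package": "ablueprint.controllers",
                    "module": "myblueprint"},
})
```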
Sheer includes some wrappers around Elasticsearch queries that allow for queries to be pre-defined in JSON files in <site_root>/_queries/<query_name>.json, run, and results of those queries to be easily accessed.
QueryFinder provides a simple attribute-lookup of Elasticsearch JSON queries defined in <site_root>/_queries/<query_name>.json and will return a Query object.
For example, given the following query defined in <site root>/_queries/posts.json:
{
  "name": "Blog Posts",
  "query": {
    "doc_type": "posts",
    "size": 10,
    "sort": "date:desc"
  }
}
From within the Sheer Flask application:
>>> queries = QueryFinder()
>>> posts_query = queries.posts
A Query object for the posts query is available at queries.posts.
File lookups are done on the fly and a new Query instance is created every time the posts attribute is accessed.
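The attribute lookup described above can be imitated with __getattr__. The classes below are hypothetical stand-ins written for illustration (Sheer's own QueryFinder takes no arguments and its internals may differ); they only show how a file lookup on each attribute access yields a fresh object.

```python
import json
import os
import tempfile


class SketchQuery(object):
    """Holds the parsed contents of one query file (hypothetical class)."""

    def __init__(self, filename):
        with open(filename) as f:
            self.definition = json.load(f)


class SketchQueryFinder(object):
    """Looks up <queries_dir>/<name>.json on attribute access."""

    def __init__(self, queries_dir):
        self.queries_dir = queries_dir

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, e.g. finder.posts.
        path = os.path.join(self.queries_dir, name + ".json")
        if not os.path.exists(path):
            raise AttributeError(name)
        return SketchQuery(path)


queries_dir = tempfile.mkdtemp()
with open(os.path.join(queries_dir, "posts.json"), "w") as f:
    json.dump({"name": "Blog Posts",
               "query": {"doc_type": "posts", "size": 10,
                         "sort": "date:desc"}}, f)

finder = SketchQueryFinder(queries_dir)
first, second = finder.posts, finder.posts  # two lookups, two instances
```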
Query wraps an Elasticsearch search fetched via QueryFinder.
>>> queries = QueryFinder()
>>> posts_query = queries.posts
>>> posts_results = queries.posts.search(size=10)
Perform the search with the given keyword arguments, returning a QueryResult object.
Keyword arguments should be Elasticsearch request body parameters.
If aggregations are given, an Elasticsearch terms aggregation is used to return counts and possible values for the given fields.
Return possible values for the field. This performs a search using Elasticsearch terms aggregation. Keyword arguments are Elasticsearch request body parameters.
For example, possible_values_for('category', doc_type='posts') would return the counts and existing values of the field category on all posts documents.
QueryResult objects wrap Elasticsearch search results. QueryResult objects are iterables that yield QueryHit objects for each result.
>>> queries = QueryFinder()
>>> posts_query = queries.posts
>>> posts_results = posts_query.search(size=10)
>>> for result_hit in posts_results:
...     ...
QueryResult objects provide several properties to aid in pagination of results:
- total: the total number of results
- size: the number of results included in this query
- from_: the number of results skipped in this query
- pages: the total number of pages given the above size
- current_page: the current page given the above size and from_ values
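The relationship between these properties can be shown with a little arithmetic. This sketch assumes page numbers start at 1 and that a partial last page counts as a page; the pagination function is hypothetical and is not how QueryResult is implemented.

```python
def pagination(total, size, from_):
    """Derive pages and current_page from total, size, and from_."""
    # Round up so that a partially filled last page still counts.
    pages = (total + size - 1) // size
    # Skipping from_ results means from_ // size full pages precede this one.
    current_page = from_ // size + 1
    return {"pages": pages, "current_page": current_page}


info = pagination(total=45, size=10, from_=20)
# 45 results at 10 per page span 5 pages; skipping 20 lands on page 3.
```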
Returns the terms aggregation dictionary that resulted from the query for the given fieldname.
Returns a JSON-compatible (but not JSON-encoded) dictionary of the query results, including the above properties and the resulting hits.
Returns a URL for the given page number within the query.
A QueryHit object is the result of an Elasticsearch query. QueryHit objects provide the query result's fields as attributes. Given the following blog post document stored in Elasticsearch:
{
  "id": 12345,
  "title": "An Example Blog Post",
  "date": "2015-01-11T09:34:40Z",
  "slug": "an-example-blog-post",
  "author": [
    "Hugh Man"
  ],
  "content": "...",
  "excerpt": "...",
  "category": [
    "Announcements"
  ]
}
Each of the JSON object's properties would be accessible as a QueryHit object's attributes:
>>> queries = QueryFinder()
>>> posts_query = queries.posts
>>> posts_results = posts_query.search(size=10)
>>> for result_hit in posts_results:
...     print result_hit.title, result_hit.author
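Exposing a document's fields as attributes can be done with __getattr__, as in the sketch below. SketchHit is a hypothetical class written for illustration; it shows the general technique and makes no claim about how QueryHit itself is built.

```python
class SketchHit(object):
    """Expose the keys of a document dictionary as attributes."""

    def __init__(self, source):
        self._source = source

    def __getattr__(self, name):
        # Reached only for names that are not regular attributes.
        try:
            return self._source[name]
        except KeyError:
            raise AttributeError(name)


hit = SketchHit({
    "id": 12345,
    "title": "An Example Blog Post",
    "author": ["Hugh Man"],
})
summary = (hit.title, hit.author)
```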
Returns the permanent link to this Elasticsearch document if the type of this query hit (for our blog example, "posts") corresponds to one of the lookup URLs configured in a Sheer site.
Returns a JSON-compatible (but not JSON-encoded) dictionary of a query hit.
Public Domain/CC0 1.0