A collection of facets and facet-related tools for ElasticSearch.
Since GitHub disabled its download feature (http://www.elasticsearch.org/blog/2012/12/17/new-download-service.html), the way plugins are installed with ElasticSearch has changed. I'm working on an alternative. Until then, you can use the following workaround. Be aware that this installs a master build, not an official release.
To install a master build, run:
    ./bin/plugin -url http://bit.ly/elasticfacets-master -install elasticfacets
ElasticFacets    ElasticSearch
master           0.20.6 -> 0.20.x
0.2              0.19.9 -> 0.19.11
0.1              0.19.8
ElasticSearch comes with a powerful built-in facet called Date Histogram. Using the date histogram facet, you can get a statistical analysis of a field over different time intervals (week by week, hour by hour, etc.).
    {
      "query": { "match_all": {} },
      "facets": {
        "pub_per_week": {
          "date_histogram": {
            "key_field": "published",
            "value_field": "copies",
            "interval": "week"
          }
        }
      }
    }
While this is very powerful, it is limited to numerical fields.
The Faceted Date Histogram combines the power of Date Histogram with *any* facet in ElasticSearch. With it, you can replace the value_field parameter with a complete facet definition of your choice. For example:
    {
      "query": { "match_all": {} },
      "facets": {
        "pub_per_week": {
          "faceted_date_histogram": {
            "field": "published",
            "interval": "week",
            "facet": {
              "terms": { "field": "username" }
            }
          }
        }
      }
    }
This snippet uses the Terms facet to return the top usernames on a week by week basis.
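Because the inner facet is just a nested definition, the request body is easy to assemble programmatically. The following is a minimal sketch in Python; the helper name `faceted_date_histogram` and its parameters are hypothetical and only mirror the JSON structure shown above. Sending the body to a cluster is left out.

```python
import json


def faceted_date_histogram(name, field, interval, inner_facet):
    """Build a search body that wraps `inner_facet` in a faceted_date_histogram.

    Hypothetical helper: it only reproduces the JSON layout from the README.
    """
    return {
        "query": {"match_all": {}},
        "facets": {
            name: {
                "faceted_date_histogram": {
                    "field": field,
                    "interval": interval,
                    "facet": inner_facet,
                }
            }
        },
    }


# Same request as above: top usernames, week by week.
body = faceted_date_histogram(
    "pub_per_week", "published", "week", {"terms": {"field": "username"}}
)
print(json.dumps(body, indent=2))
```

Any other facet definition can be passed as `inner_facet` in the same way.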
** STILL UNDER DEVELOPMENT. Available in master only **
A drop-in replacement [1] for the standard terms facet, but with a lower memory footprint. Useful when you want to facet on a string field with a very large number of distinct values. Normally, all of these strings are loaded into memory, which takes a lot of space. The Hashed Strings Facet loads only the hashes of these strings, which considerably reduces its memory footprint. The strings needed to actually return a response are loaded on demand. Of course, nothing comes for free. The price you pay is slightly more IO and potential inaccuracies due to hash collisions. If a collision occurs, you will see an appropriate WARN log message.
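The trade-off described above can be illustrated with a small Python sketch. It is not the plugin's implementation: the function names are hypothetical, CRC32 is only a stand-in hash (the hash the plugin actually uses is not stated here), and a second pass over an in-memory list stands in for the extra IO.

```python
import zlib
from collections import Counter


def crc32_hash(term):
    # Stand-in hash function (assumption); the plugin's real hash is not shown here.
    return zlib.crc32(term.encode("utf-8"))


def top_terms_by_hash(values, size, hash_fn=crc32_hash):
    """Count terms by hash only, then resolve the winning hashes back to strings."""
    counts = Counter(hash_fn(v) for v in values)  # no strings are kept here
    top = counts.most_common(size)
    wanted = {h for h, _ in top}
    # A second pass stands in for the extra IO: fetch one string per winning hash.
    resolved = {}
    for v in values:
        h = hash_fn(v)
        if h in wanted and h not in resolved:
            resolved[h] = v
    return [(resolved[h], count) for h, count in top]


values = ["alice", "bob", "alice", "carol", "alice", "bob"]
print(top_terms_by_hash(values, 2))
# A deliberately weak hash (string length) forces "alice" and "carol" to collide,
# so their counts are merged under a single term.
print(top_terms_by_hash(values, 2, hash_fn=len))
```

With the weak hash, "alice" is reported with a count of 4 instead of 3, which is the kind of inaccuracy a collision introduces.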
After installing the plugin you can call it as follows:
    {
      "query": { "match_all": {} },
      "facets": {
        "facet_name": {
          "hashed_terms": {
            "field": "username",
            "size": 10
          }
        }
      }
    }
This will return the top 10 usernames in your index in exactly the same format the standard terms facet does.
[1] As the original string is no longer available at faceting time, these features of the standard terms facet are not supported:
In addition to the features offered by the terms facet, the Hashed Strings facet has a few extra tricks:
    {
      "query": { "match_all": {} },
      "facets": {
        "facet_name": {
          "hashed_terms": {
            "field": "username",
            "size": 10,
            "fetch_size": 20,   # control over the number of terms returned by every shard before aggregation.
            "output_script":    # modify what is output via a script.
              "_source.username+' on '+_source.website"
          }
        }
      }
    }
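To see why asking every shard for more terms than the final `size` can matter, here is a small Python simulation. It assumes that per-shard results are merged by summing counts per term; the function name and the shard data are made up for illustration, and this is not the plugin's code.

```python
from collections import Counter


def merged_top_terms(shards, size, fetch_size):
    """Merge each shard's top `fetch_size` terms and return the global top `size`."""
    merged = Counter()
    for shard_terms in shards:
        # Each shard reports only its own top `fetch_size` terms.
        for term, count in Counter(shard_terms).most_common(fetch_size):
            merged[term] += count
    return merged.most_common(size)


# Hypothetical data: "b" is never the top term on a single shard,
# but it is the most frequent term overall (4 + 4 = 8).
shards = [
    ["a"] * 6 + ["b"] * 4 + ["c"] * 3,
    ["d"] * 5 + ["b"] * 4 + ["c"] * 3,
]

print(merged_top_terms(shards, size=1, fetch_size=1))  # misses "b"
print(merged_top_terms(shards, size=1, fetch_size=2))  # finds "b"
```

A larger `fetch_size` costs more data transferred per shard in exchange for a more accurate merged result.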
** STILL UNDER DEVELOPMENT. Available in master only **
Facets in ElasticSearch are powered by the FieldCache, a component that loads values into memory so they can be counted. This can lead to high memory usage. ElasticSearch comes with a cache statistics endpoint from which you can get the current *total* cache size. This plugin's endpoint reports the cache size for each field stored in the cache, so you can find the source of the problem.
Usage:
    curl -XGET 'http://localhost:9200/_cluster/nodes/cache/fields/stats'
    curl -XGET 'http://localhost:9200/_cluster/nodes/nodeId1,nodeId2/cache/fields/stats'
    # simplified
    curl -XGET 'http://localhost:9200/_nodes/cache/fields/stats'
Response:
    {
      "cluster_name": "BoazMBP.local_buzzcapture_1.0",
      "nodes": {
        "node_id": {
          "timestamp": 1353134666971,
          "name": "Frost, Deacon",
          "transport_address": "inet[/192.168.1.107:9300]",
          "hostname": "something.com",
          "fields": {
            "publish_date": { "size": 180 },
            "copies": { "size": 180 }
          }
        }
      }
    }
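Once the response is parsed, finding the fields that use the most cache is a matter of flattening and sorting. The sketch below assumes the response shape shown above; the function name `largest_fields` is hypothetical and the sizes in the sample are invented so the ordering is visible.

```python
def largest_fields(stats, top=5):
    """Return (node_name, field, size) tuples sorted by cache size, largest first."""
    rows = []
    for node in stats.get("nodes", {}).values():
        for field, info in node.get("fields", {}).items():
            rows.append((node.get("name"), field, info["size"]))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:top]


# Sample in the same shape as the response above, with made-up sizes.
sample = {
    "cluster_name": "example",
    "nodes": {
        "node_id": {
            "name": "Frost, Deacon",
            "fields": {
                "publish_date": {"size": 180},
                "copies": {"size": 400},
            },
        }
    },
}

for node_name, field, size in largest_fields(sample):
    print(node_name, field, size)
```

In practice `stats` would be the decoded JSON body returned by one of the curl calls above.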