Path hierarchy aggregation #8896

clintongormley · 2014-12-11T11:35:51Z

A few users have used nested terms aggregations to try to visualise each level in a tree, such as a file system, eg:

{
  "aggs": {
    "first_level": {
      "terms": {
        "field": "first_level"
      },
      "aggs": {
        "second_level": {
          "terms": {
            "field": "second_level"
          },
          "aggs": {
            "third_level": {
              "terms": {
                "field": "third_level"
              }
            }
          }
        }
      }
    }
  }
}

This is very costly as it results in combinatorial explosion. However, because this is a tree, it would be more efficient to store first_level+second_level+third_level in a single field, and to do a single pass over these "leaf buckets". Once we have the most popular leaves, we can backfill the branches (ie first_level+second_level and first_level).

The results would obviously be different to the nested terms agg: instead of having the most popular first_levels, then the most popular second_levels in the most popular first_levels (etc), you'd have the most popular leaves, plus the first_level and second_level that those leaves belong to.
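
To make the "single pass, then backfill" idea concrete, here is a minimal client-side sketch in Python. It is not how Elasticsearch would implement it: the function name `top_leaves_with_branches`, the `size` parameter, and the sample paths are all made up for illustration. It also assumes that a backfilled branch count only sums the leaves that were kept, which is one possible reading of "backfill".

```python
from collections import Counter

def top_leaves_with_branches(paths, size, separator="/"):
    """Count full paths in one pass, keep the `size` most popular leaves,
    then backfill counts for every ancestor of those leaves.

    Hypothetical helper: branch counts only cover the retained leaves.
    """
    leaf_counts = Counter(paths)          # single pass over the "leaf buckets"
    top = leaf_counts.most_common(size)   # most popular leaves

    branch_counts = Counter()
    for path, count in top:
        parts = path.strip(separator).split(separator)
        # Every proper prefix of a kept leaf is a branch to backfill.
        for depth in range(1, len(parts)):
            branch_counts[separator.join(parts[:depth])] += count
    return dict(top), dict(branch_counts)

# Made-up paths of the form first_level/second_level/third_level.
paths = ["a/x/1", "a/x/1", "a/x/2", "a/y/3", "b/z/4", "b/z/4", "b/z/4"]
leaves, branches = top_leaves_with_branches(paths, size=2)
print(leaves)    # the two most popular leaves
print(branches)  # the branches those leaves belong to
```

Note that `a` is reported with a count of 2 here even though four documents sit under it, because only the retained leaf `a/x/1` contributes. That is the difference from the nested terms agg described above.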

A complete example could look like this:

PUT /filesystem
{
  "mappings": {
    "file": {
      "properties": {
        "path": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        }
      }
    }
  }
}

PUT /filesystem/file/1
{
  "path": "/My documents/Spreadsheets/Budget_2013.xls",
  "views": 10
}

PUT /filesystem/file/2
{
  "path": "/My documents/Spreadsheets/Budget_2014.xls",
  "views": 7
}

PUT /filesystem/file/3
{
  "path": "/My documents/Test.txt",
  "views": 1
}

GET /filesystem/file/_search?search_type=count
{
  "aggs": {
    "tree": {
      "path_hierarchy": {
        "field": "path",
        "separator": "/",
        "order": "total_views"
      },
      "aggs": {
        "total_views": {
          "sum": {
            "field": "views"
          }
        }
      }
    }
  }
}

And the result like this:

{
  "aggregations": {
    "tree": {
      "buckets": [
        {
          "key": "My documents",
          "doc_count": 3,
          "total_views": {
            "value": 18
          },
          "tree": {
            "buckets": [
              {
                "key": "Spreadsheets",
                "doc_count": 2,
                "total_views": {
                  "value": 17
                },
                "tree": {
                  "buckets": [
                    {
                      "key": "Budget_2013.xls",
                      "doc_count": 1,
                      "total_views": {
                        "value": 10
                      }
                    },
                    {
                      "key": "Budget_2014.xls",
                      "doc_count": 1,
                      "total_views": {
                        "value": 7
                      }
                    }
                  ]
                }
              },
              {
                "key": "Test.txt",
                "doc_count": 1,
                "total_views": {
                  "value": 1
                }
              }
            ]
          }
        }
      ]
    }
  }
}
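
For anyone who wants to see how such a response could be derived, the following Python sketch folds flat documents into the same nested bucket shape. It is only a client-side illustration: `path_tree` and its parameters are hypothetical names, not part of any Elasticsearch API, and the view counts are the ones shown in the result above.

```python
def path_tree(docs, field="path", metric="views", separator="/", name="tree"):
    """Fold flat documents into nested buckets shaped like the response above.

    Hypothetical helper: sums `metric` per path prefix and sorts each level
    by that sum, highest first.
    """
    root = {}  # key -> {"doc_count", "total", "children"}
    for doc in docs:
        level = root
        for part in doc[field].strip(separator).split(separator):
            node = level.setdefault(
                part, {"doc_count": 0, "total": 0, "children": {}}
            )
            node["doc_count"] += 1
            node["total"] += doc[metric]
            level = node["children"]

    def render(level):
        buckets = []
        ordered = sorted(level.items(), key=lambda kv: -kv[1]["total"])
        for key, node in ordered:
            bucket = {
                "key": key,
                "doc_count": node["doc_count"],
                "total_views": {"value": node["total"]},
            }
            if node["children"]:  # leaves carry no nested "tree" entry
                bucket[name] = {"buckets": render(node["children"])}
            buckets.append(bucket)
        return buckets

    return {"aggregations": {name: {"buckets": render(root)}}}

docs = [
    {"path": "/My documents/Spreadsheets/Budget_2013.xls", "views": 10},
    {"path": "/My documents/Spreadsheets/Budget_2014.xls", "views": 7},
    {"path": "/My documents/Test.txt", "views": 1},
]
result = path_tree(docs)
```

Each path segment becomes a bucket keyed by that segment, and each document adds to the `doc_count` and the summed views of every ancestor on its path.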

markharwood · 2015-01-20T18:22:39Z

I came across a use case that might be related. I want a tool to help me write IP address blocking rules by examining log records. A rule might be "ban everything from 121.205.*", or perhaps I need to be more selective: "ban everything from 121.205.247.*".
This is a decision every webmaster takes on the basis of how much good vs bad traffic comes from any one level. Ideally, a hierarchical aggregation would help by progressively inflating sections of the full agg tree, pursuing only those branches where the "bad vs good" mix is high, i.e. where there are more logged 404s than 200s in that address range. The existing breadth_first logic is a way of deferring computation so that only the best branches are expanded, but in this case we may need to introduce a new aggregation to help determine what "best" is: "risk" is not a scriptable property found in docs but is derived from a mix of the contents of 2 buckets (404s vs 200s), neither of which knows about the other.
Of course, with things like risky IP address ranges or directories on your hard drive that are using all your disk space, the real culprits that need to be chased down do not necessarily all exist at the same level in the hierarchy. Progressively expanding all branches of the tree in lock-step, a level at a time, is perhaps not the only approach required here. In some respects I like the idea of an energy-dissipation based model for exploring large information spaces using finite resources. Using the 'pulsing' model I originally outlined for tackling combinatorial explosions, we could use a prioritisation system to direct pulses of doc streams down the branches of the agg tree that could do with further inflation. When the time is up we can cut the exploration short, but we have at least directed our efforts down the most promising channels, which could be at different depths in the tree.
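
As a rough illustration of prioritised expansion under a budget, here is a small Python sketch. It is not the 'pulsing' model itself: it swaps in a plain priority queue (best-first search) as a stand-in, and it pre-counts every prefix up front, which a real aggregation would want to avoid. The function name `risky_prefixes`, the `budget` and `min_ratio` parameters, and the log records are all assumptions made for the example.

```python
import heapq
from collections import defaultdict

def risky_prefixes(records, budget, min_ratio=0.5):
    """Best-first expansion of IP prefixes, spending at most `budget` expansions.

    `records` is an iterable of (ip, status) pairs. "Risk" is taken to be the
    share of 404s under a prefix (an assumption for this sketch).
    """
    stats = defaultdict(lambda: [0, 0])  # prefix -> [bad, total]
    children = defaultdict(set)          # prefix -> next-level prefixes
    for ip, status in records:
        octets = ip.split(".")
        for depth in range(1, 5):
            prefix = ".".join(octets[:depth])
            stats[prefix][0] += int(status == 404)
            stats[prefix][1] += 1
            if depth > 1:
                children[".".join(octets[:depth - 1])].add(prefix)

    def risk(prefix):
        bad, total = stats[prefix]
        return bad / total

    # heapq is a min-heap, so risk is negated to pop the riskiest prefix first.
    heap = [(-risk(p), p) for p in stats if "." not in p]
    heapq.heapify(heap)
    explored = []
    while heap and budget > 0:
        neg_risk, prefix = heapq.heappop(heap)
        if -neg_risk < min_ratio:
            break  # nothing left that is risky enough to pursue
        explored.append((prefix, -neg_risk))
        budget -= 1
        for child in children[prefix]:
            heapq.heappush(heap, (-risk(child), child))
    return explored

# Made-up log records: (client IP, HTTP status).
records = [
    ("121.205.247.1", 404), ("121.205.247.2", 404), ("121.205.247.3", 404),
    ("121.205.10.9", 200),
    ("8.8.8.8", 200), ("8.8.4.4", 200),
]
explored = risky_prefixes(records, budget=3)
```

With a budget of three expansions, the sketch spends all of them descending the `121` branch and never touches `8`, so effort goes to the riskiest channel rather than to every branch a level at a time.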

clement-tourriere · 2015-06-17T13:37:07Z

I have made an implementation for this aggregation as a plugin.
You can test it here: https://github.com/opendatasoft/elasticsearch-aggregation-pathhierarchy

clintongormley · 2015-11-21T22:28:52Z

@jpountz this was your idea originally. Do you still think it is worth doing?

jpountz · 2015-11-23T10:43:09Z

I haven't seen this come up as much as I initially expected, but I think it can still be interesting. Closing for now; we can re-open in the future if needed.

markwalkom · 2015-12-08T03:27:33Z

Something from the forums - https://discuss.elastic.co/t/aggregation-on-a-materialized-path/36519

deviantony · 2015-12-08T07:08:57Z

I'm the author of the forum post. It would be neat if you guys could have a look at my problem, as it's related to this issue.

@clement-tourriere Does your plugin support ES 2.x?

deviantony · 2015-12-09T07:02:30Z

Just found out that the plugin does not support ES 2.x: opendatasoft/elasticsearch-aggregation-pathhierarchy#3 (comment)

saimaz · 2016-01-08T09:12:02Z

Just wondering, any plans to add this type of aggregation?

joemcelroy · 2016-01-18T19:39:16Z

Hi all,

Thought I'd give you a heads up on searchkit, which uses nested aggregations to build a hierarchical tree to filter results from. Check it out here: http://demo.searchkit.co/taxonomy

More details can be found here: http://docs.searchkit.co/stable/docs/components/navigation/hierarchical-refinement-filter.html

jprante · 2016-10-27T09:17:28Z

I'm late to the party, but some months ago, with the help of @clement-tourriere at jprante/elasticsearch-aggregations#1 it was possible to port the path hierarchy approach to the ES 2.x aggregation framework. With ES 5 now released, I plan to move forward. Please comment at https://github.com/jprante/elasticsearch-aggregations/issues if you have questions about my project or want to contribute.

ale0xb · 2017-03-24T12:55:56Z

Hello, any solutions to get this functionality in 5.x?

clintongormley added :Analytics/Aggregations Aggregations discuss labels Dec 11, 2014

jpountz removed the discuss label Jan 16, 2015

clement-tourriere pushed a commit to opendatasoft/elasticsearch-aggregation-pathhierarchy that referenced this issue Jun 17, 2015

Initial commit for path hierarchy aggregation implementation elastic/…

15167a5

…elasticsearch#8896

clintongormley added the discuss label Nov 21, 2015

jpountz closed this as completed Nov 23, 2015

vbura mentioned this issue Apr 6, 2017

Path hierarchy aggregation #23951

Closed

mwjames mentioned this issue Oct 14, 2017

Support for ElasticSearch SemanticMediaWiki/SemanticMediaWiki#2274

Closed
