Path hierarchy aggregation #8896

Closed
clintongormley opened this issue Dec 11, 2014 · 11 comments

Comments

@clintongormley
Contributor

A few users have used nested terms aggregations to try to visualise each level in a tree, such as a file system, e.g.:

{
  "aggs": {
    "first_level": {
      "terms": {
        "field": "first_level"
      },
      "aggs": {
        "second_level": {
          "terms": {
            "field": "second_level"
          },
          "aggs": {
            "third_level": {
              "terms": {
                "field": "third_level"
              }
            }
          }
        }
      }
    }
  }
}

This is very costly because it results in a combinatorial explosion. However, because this is a tree, it would be more efficient to store first_level+second_level+third_level in a single field and do a single pass over these "leaf buckets". Once we have the most popular leaves, we can backfill the branches (i.e. first_level+second_level and first_level).

The results would obviously be different to the nested terms agg: instead of having the most popular first_levels, then the most popular second_levels in the most popular first_levels (etc.), you'd have the most popular leaves, plus the first_level and second_level that those leaves belong to.
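
To make the "single pass, then backfill" idea concrete, here is a minimal Python sketch. It is not how Elasticsearch would implement it; the function name `top_leaves_with_branches`, the `size` parameter, and the sample paths are all hypothetical, and the sketch assumes the full path is already stored as one string per document.

```python
from collections import Counter


def top_leaves_with_branches(paths, size, separator="/"):
    """Count full paths in one pass, keep the `size` most frequent leaves,
    then backfill ancestor counts from those retained leaves only."""
    leaf_counts = Counter(paths)          # single pass over the leaf values
    top = leaf_counts.most_common(size)   # most popular leaves

    branch_counts = Counter()
    for path, count in top:
        parts = path.strip(separator).split(separator)
        # Every proper prefix of a leaf path is one of its ancestor branches.
        for depth in range(1, len(parts)):
            branch_counts[separator.join(parts[:depth])] += count
    return dict(top), dict(branch_counts)


# Hypothetical sample data: 3x a/b/c, 2x a/b/d, 1x a/e/f, 1x g/h/i
paths = ["a/b/c"] * 3 + ["a/b/d"] * 2 + ["a/e/f", "g/h/i"]
leaves, branches = top_leaves_with_branches(paths, size=2)
print(leaves)    # {'a/b/c': 3, 'a/b/d': 2}
print(branches)  # {'a': 5, 'a/b': 5}
```

Note that in this sketch the branch `a` is reported with a count of 5 even though 6 sample paths start with `a`: branch counts are derived only from the retained leaves, which is one way to read the difference from nested terms aggregations described above. A real implementation might choose to count all documents per branch instead.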

A complete example could look like this:

PUT /filesystem
{
  "mappings": {
    "file": {
      "properties": {
        "path": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        }
      }
    }
  }
}

PUT /filesystem/file/1
{
  "path": "/My documents/Spreadsheets/Budget_2013.xls",
  "views": 10
}

PUT /filesystem/file/2
{
  "path": "/My documents/Spreadsheets/Budget_2014.xls",
  "views": 7
}

PUT /filesystem/file/3
{
  "path": "/My documents/Test.txt",
  "views": 1
}

GET /filesystem/file/_search?search_type=count
{
  "aggs": {
    "tree": {
      "path_hierarchy": {
        "field": "path",
        "separator": "/",
        "order": "total_views"
      },
      "aggs": {
        "total_views": {
          "sum": {
            "field": "views"
          }
        }
      }
    }
  }
}

And the result like this:

{
  "aggregations": {
    "tree": {
      "buckets": [
        {
          "key": "My documents",
          "doc_count": 3,
          "total_views": {
            "value": 18
          },
          "tree": {
            "buckets": [
              {
                "key": "Spreadsheets",
                "doc_count": 2,
                "total_views": {
                  "value": 17
                },
                "tree": {
                  "buckets": [
                    {
                      "key": "Budget_2013.xls",
                      "doc_count": 1,
                      "total_views": {
                        "value": 10
                      }
                    },
                    {
                      "key": "Budget_2014.xls",
                      "doc_count": 1,
                      "total_views": {
                        "value": 7
                      }
                    }
                  ]
                }
              },
              {
                "key": "Test.txt",
                "doc_count": 1,
                "total_views": {
                  "value": 1
                }
              }
            ]
          }
        }
      ]
    }
  }
}
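
The response shape above can be reproduced client-side with a short Python sketch. This is only an illustration of the proposed output, not an Elasticsearch API: `build_tree` and `to_buckets` are hypothetical helpers, descending order by `total_views` is an assumption, and the view counts used (10, 7 and 1) are the ones that appear in the result buckets.

```python
def build_tree(docs, separator="/"):
    """Accumulate doc_count and summed views for every path prefix."""
    root = {}
    for doc in docs:
        level = root
        for part in doc["path"].strip(separator).split(separator):
            node = level.setdefault(
                part, {"doc_count": 0, "views": 0, "children": {}}
            )
            node["doc_count"] += 1
            node["views"] += doc["views"]
            level = node["children"]
    return root


def to_buckets(level):
    """Convert one tree level into buckets sorted by total_views (descending)."""
    buckets = []
    for key, node in level.items():
        bucket = {
            "key": key,
            "doc_count": node["doc_count"],
            "total_views": {"value": node["views"]},
        }
        if node["children"]:
            bucket["tree"] = {"buckets": to_buckets(node["children"])}
        buckets.append(bucket)
    buckets.sort(key=lambda b: b["total_views"]["value"], reverse=True)
    return buckets


docs = [
    {"path": "/My documents/Spreadsheets/Budget_2013.xls", "views": 10},
    {"path": "/My documents/Spreadsheets/Budget_2014.xls", "views": 7},
    {"path": "/My documents/Test.txt", "views": 1},
]
response = {"aggregations": {"tree": {"buckets": to_buckets(build_tree(docs))}}}
```

Unlike the proposed aggregation, this sketch visits every path segment of every document, so it shows the output format only and none of the intended efficiency gains.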
@markharwood
Contributor

I came across a use case that might be related. I want a tool to help me write IP address blocking rules by examining log records. A rule might be "ban everything from 121.205.*", or perhaps I need to be more selective: "ban everything from 121.205.247.*".
This is a decision every webmaster takes on the basis of how much good vs bad traffic comes from any one level. Ideally, a hierarchical aggregation would help by progressively inflating sections of the full agg tree, pursuing only those branches where the "bad vs good" mix is high, i.e. where there are more logged 404s than 200s in that address range. The existing breadth_first logic is a way of deferring computation to only the best branches, but here we may need to introduce a new aggregation to help determine what "best" is: "risk" is not a scriptable property found in docs but is derived from a mix of the contents of two buckets (404s vs 200s), neither of which knows about the other.
Of course, with things like risky IP address ranges or directories on your hard drive that are using all your disk space, the real culprits that need to be chased down do not necessarily all exist at the same level in the hierarchy. Progressively expanding all branches of the tree in lock-step, a level at a time, is perhaps not the only approach required here. In some respects I like the idea of an energy-dissipation-based model for exploring large information spaces using finite resources. Using the 'pulsing' model I originally outlined for tackling combinatorial explosions, we could use a prioritisation system to direct pulses of doc streams down the branches of the agg tree that could do with further inflation. When the time is up we can cut the exploration short, but we have at least directed our efforts down the most promising channels, which could be at different depths in the tree.
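
As a rough illustration of prioritised, uneven-depth exploration, here is a Python sketch. It is not the 'pulsing' model mentioned above; it swaps in a plain best-first search over IP prefixes with a fixed budget of visited nodes. The names `bad_ratio` and `explore`, the `budget` parameter, and the sample log records are all hypothetical.

```python
import heapq


def bad_ratio(records):
    """Fraction of records in this branch that are 404s."""
    bad = sum(1 for _, status in records if status == 404)
    return bad / len(records)


def explore(records, budget):
    """Best-first expansion of the IP prefix tree.

    records: list of (ip, status) pairs.
    Returns (prefix, bad_ratio, record_count) tuples in the order visited,
    stopping once `budget` prefixes have been visited.
    """
    # Heap entries are (-ratio, prefix_tuple, records); the riskiest branch
    # is popped first, and each prefix tuple is unique so ties are safe.
    heap = [(-bad_ratio(records), (), records)]
    visited = []
    while heap and len(visited) < budget:
        neg_ratio, prefix, recs = heapq.heappop(heap)
        if prefix:  # the empty root prefix is not reported
            visited.append((".".join(prefix), -neg_ratio, len(recs)))
        depth = len(prefix)
        if depth == 4:  # a full IPv4 address has no children
            continue
        children = {}
        for ip, status in recs:
            octet = ip.split(".")[depth]
            children.setdefault(prefix + (octet,), []).append((ip, status))
        for child_prefix, child_recs in children.items():
            heapq.heappush(
                heap, (-bad_ratio(child_recs), child_prefix, child_recs)
            )
    return visited


# Hypothetical log records: (client IP, HTTP status)
records = [
    ("121.205.247.1", 404),
    ("121.205.247.2", 404),
    ("121.205.3.4", 200),
    ("10.0.0.1", 200),
    ("10.0.0.2", 200),
    ("10.0.0.3", 404),
]
visited = explore(records, budget=3)
print([prefix for prefix, _, _ in visited])
```

With this sample data and a budget of three, the search descends through `121`, `121.205` and `121.205.247` and never spends effort on the `10.*` branch, which is the kind of uneven-depth exploration described above.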

clement-tourriere pushed a commit to opendatasoft/elasticsearch-aggregation-pathhierarchy that referenced this issue Jun 17, 2015
@clement-tourriere

I have implemented this aggregation as a plugin.
You can test it here: https://github.com/opendatasoft/elasticsearch-aggregation-pathhierarchy

@clintongormley
Contributor Author

@jpountz this was your idea originally. Do you still think it is worth doing?

@jpountz
Contributor

jpountz commented Nov 23, 2015

I haven't seen it as much as I initially expected. But I think this can still be interesting indeed. Closing for now, we can re-open in the future if needed.

@jpountz jpountz closed this as completed Nov 23, 2015
@markwalkom
Contributor

@deviantony

I'm the author of the forum post. It would be neat if you could have a look at my problem, as it's related to this issue.

@clement-tourriere Does your plugin support ES 2.x?

@deviantony

Just found out that the plugin does not support ES 2.x: opendatasoft/elasticsearch-aggregation-pathhierarchy#3 (comment)

@saimaz

saimaz commented Jan 8, 2016

Just wondering, any plans to add this type of aggregation?

@joemcelroy
Member

Hi all,

Thought I'd give you a heads-up on searchkit, which uses nested aggregations to build a hierarchical tree to filter results from. Check it out here: http://demo.searchkit.co/taxonomy

More details can be found here: http://docs.searchkit.co/stable/docs/components/navigation/hierarchical-refinement-filter.html

@jprante
Contributor

jprante commented Oct 27, 2016

I'm late to the party, but some months ago, with the help of @clement-tourriere at jprante/elasticsearch-aggregations#1, it was possible to port the path hierarchy approach to the ES 2.x aggregation framework. With ES 5 now released, I plan to move forward. Please comment at https://github.com/jprante/elasticsearch-aggregations/issues if you have questions about my project or want to contribute.

@ale0xb

ale0xb commented Mar 24, 2017

Hello, any solutions to get this functionality in 5.x?
