Path hierarchy aggregation #8896

@clintongormley

A few users have used nested terms aggregations to try to visualise each level in a tree, such as a file system, e.g.:

{
  "aggs": {
    "first_level": {
      "terms": {
        "field": "first_level"
      },
      "aggs": {
        "second_level": {
          "terms": {
            "field": "second_level"
          },
          "aggs": {
            "third_level": {
              "terms": {
                "field": "third_level"
              }
            }
          }
        }
      }
    }
  }
}

This is very costly because the number of buckets multiplies at every nested level, resulting in combinatorial explosion. However, because the data is a tree, it would be more efficient to store first_level+second_level+third_level in a single field and do a single pass over these "leaf buckets". Once we have the most popular leaves, we can backfill the branches (i.e. first_level+second_level and first_level).

The results would obviously differ from the nested terms agg: instead of having the most popular first_levels, then the most popular second_levels within the most popular first_levels (etc.), you'd have the most popular leaves, plus the first_level and second_level that those leaves belong to.
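To make the "single pass, then backfill" idea concrete, here is a minimal Python sketch. It is not how Elasticsearch would implement it; the function name `top_leaves_with_branches`, the `size` parameter and the sample paths are all hypothetical, and it assumes branch counts are derived only from the selected leaves (the proposal does not say whether they should instead cover all documents).

```python
from collections import Counter

def top_leaves_with_branches(paths, size, separator="/"):
    """Count full paths in one pass, keep the `size` most common leaves,
    then backfill counts for every ancestor branch of those leaves."""
    leaf_counts = Counter(paths)         # single pass over the leaf values
    top = leaf_counts.most_common(size)  # the most popular leaves
    branch_counts = Counter()
    for path, count in top:
        parts = path.strip(separator).split(separator)
        # Every proper prefix of a leaf path is a branch to backfill.
        for depth in range(1, len(parts)):
            branch_counts[separator.join(parts[:depth])] += count
    return dict(top), dict(branch_counts)

# Hypothetical documents: each string is one document's path value.
paths = [
    "a/b/c", "a/b/c", "a/b/c",
    "a/b/d", "a/b/d",
    "a/e/f",
    "x/y/z",
]
leaves, branches = top_leaves_with_branches(paths, size=2)
```

With `size=2` only the leaves `a/b/c` and `a/b/d` survive, so the branch `a` is backfilled with their combined count and `a/e/f` contributes nothing. That is the difference from nested terms aggregations described above.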

A complete example could look like this:

PUT /filesystem
{
  "mappings": {
    "file": {
      "properties": {
        "path": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        }
      }
    }
  }
}

PUT /filesystem/file/1
{
  "path": "/My documents/Spreadsheets/Budget_2013.xls",
  "views": 10
}

PUT /filesystem/file/2
{
  "path": "/My documents/Spreadsheets/Budget_2014.xls",
  "views": 7
}

PUT /filesystem/file/3
{
  "path": "/My documents/Test.txt",
  "views": 1
}

GET /filesystem/file/_search?search_type=count
{
  "aggs": {
    "tree": {
      "path_hierarchy": {
        "field": "path",
        "separator": "/",
        "order": "total_views"
      },
      "aggs": {
        "total_views": {
          "sum": {
            "field": "views"
          }
        }
      }
    }
  }
}

And the result like this:

{
  "aggregations": {
    "tree": {
      "buckets": [
        {
          "key": "My documents",
          "doc_count": 3,
          "total_views": {
            "value": 18
          },
          "tree": {
            "buckets": [
              {
                "key": "Spreadsheets",
                "doc_count": 2,
                "total_views": {
                  "value": 17
                },
                "tree": {
                  "buckets": [
                    {
                      "key": "Budget_2013.xls",
                      "doc_count": 1,
                      "total_views": {
                        "value": 10
                      }
                    },
                    {
                      "key": "Budget_2014.xls",
                      "doc_count": 1,
                      "total_views": {
                        "value": 7
                      }
                    }
                  ]
                }
              },
              {
                "key": "Test.txt",
                "doc_count": 1,
                "total_views": {
                  "value": 1
                }
              }
            ]
          }
        }
      ]
    }
  }
}
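As a rough illustration of how the response above could be assembled, the following Python sketch builds the same nested bucket structure from the three example documents, using the view counts that the result reports (10, 7 and 1). The helper names `path_hierarchy` and `render` are made up for this sketch, and sorting by the summed metric in descending order is an assumption about what `"order": "total_views"` is meant to do.

```python
def path_hierarchy(docs, field="path", metric="views", separator="/"):
    """Build nested buckets keyed by path segment, summing `metric` per bucket."""
    root = {}
    for doc in docs:
        level = root
        for part in doc[field].strip(separator).split(separator):
            node = level.setdefault(
                part, {"doc_count": 0, "total": 0, "children": {}}
            )
            node["doc_count"] += 1
            node["total"] += doc[metric]
            level = node["children"]
    return render(root)

def render(level):
    """Convert the internal dict into a bucket list, highest total first."""
    buckets = []
    for key, node in sorted(level.items(), key=lambda kv: -kv[1]["total"]):
        bucket = {
            "key": key,
            "doc_count": node["doc_count"],
            "total_views": {"value": node["total"]},
        }
        if node["children"]:  # leaves carry no nested "tree" entry
            bucket["tree"] = {"buckets": render(node["children"])}
        buckets.append(bucket)
    return buckets

docs = [
    {"path": "/My documents/Spreadsheets/Budget_2013.xls", "views": 10},
    {"path": "/My documents/Spreadsheets/Budget_2014.xls", "views": 7},
    {"path": "/My documents/Test.txt", "views": 1},
]
result = {"aggregations": {"tree": {"buckets": path_hierarchy(docs)}}}
```

Each document touches one bucket per path segment, so the work is proportional to the number of documents times the path depth rather than to the product of bucket counts at each level.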
