[Security Solution] Search Strategy slowdown #91269

dplumlee · 2021-02-12T02:02:12Z

Overview

A case was brought up recently that identified the security solution's search strategy as having a slowdown when certain queries were sent to elasticsearch, namely the last_event_time on the hosts tab. Some calls are seemingly running until the query/es times out and returns a backend closed connection error, others are just incredibly slow (20-30 seconds).

Initial triage

Looking at network files led us to identify the async search calls made to elasticsearch as a possible reason for the slowdowns, and might be queuing up causing any future queries to also be slow and making the app slow down as a whole. Upon closer inspection the track_total_hits field wasn't being passed down properly to the search strategy queries and might be the culprit causing the back up in queries, specifically on the hosts overview and hosts details page. A pr was put up to fix this, but further testing is needed to confirm whether or not this will provide enough ease on es to get the queries to run reasonably faster.

Will update as we progress

Related PRs

#91068
#91790

The text was updated successfully, but these errors were encountered:

elasticmachine · 2021-02-12T02:02:14Z

Pinging @elastic/security-solution (Team: SecuritySolution)

FrankHassanabad · 2021-02-16T07:31:11Z

Ok so I dug into this quite a bit and need some peer review from @dplumlee and a verification and validation of what I'm saying is truthy before we go and replace these aggregation queries with plain search + sort. However these first metrics and numbers coming back are looking really good.

To start with, here is some material to read through that I went through and then I will give my analysis and testing I conducted on the queries below:

Reading and studying

https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-profile.html
https://www.elastic.co/guide/en/kibana/current/xpack-profiler.html
https://www.elastic.co/blog/this-week-in-elasticsearch-and-apache-lucene-2019-05-31 (Section on Optimize sort on numeric fields)
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-clearcache.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html
https://github.com/elastic/elasticsearch/blob/master/docs/reference/query-dsl/distance-feature-query.asciidoc#how-the-distance_feature-query-calculates-relevance-scores

Thoughts and analysis

Adding track_total_hits: false to aggregations does not appear to make things faster, nor make them slower. It seems to have no effect during testing and experimentation. Documentation seems to back this up from the faster-retrieval-of-top-hits-in-elasticsearch-with-block-wax-wand:

If you include aggregations in your search requests, then this optimization
won't apply: Elasticsearch still needs to visit all matches to compute aggregations.
It doesn't necessarily mean that this optimization is not useful to you.

For instance say you have an e-commerce website that both returns
products and displays facets about the whole set of matches.
By running one request to compute top products and another one for facets,
the UI could start displaying hits even if facets aren't available yet.

However, adding and using track_total_hits: false to regular queries does speed things up considerably and using it with size: 1 and sort looks to be very promising and fast. So, it looks like where and when we can replace an aggregation query that has a max or min with a regular query/filter + sort + track_total_hits: false we are going to be 👍

https://github.com/elastic/elasticsearch/blob/master/docs/reference/query-dsl/distance-feature-query.asciidoc#skip-non-competitive-hits seems to back this up as well

Unlike the function_score query or other ways to change relevance scores,
the distance_feature query efficiently skips non-competitive hits when the
track_total_hits parameter is not true.

Testing and numbers

I tested with both the dev tools UI for performance as well as the regular dev tools output. See these two below for more information:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-profile.html
https://www.elastic.co/guide/en/kibana/current/xpack-profiler.html

Within our queries for things such as last_seen_event here:
https://github.com/elastic/kibana/blob/master/x-pack/plugins/security_solution/server/search_strategy/timeline/factory/events/last_event_time/query.events_last_event_time.dsl.ts#L82

You can test this with and without the profiler against server side data like so:

# Using an aggregation which is not fast against all indexes
GET auditbeat-*,endgame-*,filebeat-*,logs-*,packetbeat-*,winlogbeat-*,-*elastic-cloud-logs-*/_search?allow_no_indices=true&ignore_unavailable=true&human=true
{
  "profile": false,
  "track_total_hits": false,
  "aggregations": {
    "last_seen_event": {
      "max": {
        "field": "@timestamp"
      }
    }
  },
  "query": {
    "match_all": {}
  },
  "size": 0
}

Results with profile set to false come back on from ~2 seconds to ~4 seconds by using the "took" value from the response:

{
  "took" : 2286, <-- ~2 to 4 seconds
  "timed_out" : false,
  "_shards" : {
    "total" : 129,
    "successful" : 129,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "last_seen_event" : {
      "value" : 1.613460708483E12,
      "value_as_string" : "2021-02-16T07:31:48.483Z"
    }
  }
}

Changing "track_total_hits": false, between true and false does not have a performance impact as far as I can tell which backs up what the documentation describes.

Using instead a regular search query, + size: 1 + sort + track_total_hits: false does speed things up and reduces the ~2-4 seconds to ~50-200 milliseconds:

# Sorting + track_total_hits + size: 1 gives us around ~100 millisecond times
GET auditbeat-*,endgame-*,filebeat-*,logs-*,packetbeat-*,winlogbeat-*,-*elastic-cloud-logs-*/_search?allow_no_indices=true&ignore_unavailable=true&human=true
{
  "profile": false,
  "size": 1,
  "track_total_hits": false,
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ],
  "fields": [
    "@timestamp"
  ],
  "_source": false
}

{
  "took" : 102, <-- ~100-200 milliseconds
  "timed_out" : false,
  "_shards" : {
    "total" : 129,
    "successful" : 129,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "max_score" : null,
    "hits" : [
      {
        "_index" : "auditbeat-7.8.0-2021.02.11-000011",
        "_id" : "UC28qXcBtgfIO53sJ584",
        "_score" : null,
        "fields" : {
          "@timestamp" : [
            "2021-02-16T07:32:53.949Z"
          ]
        },
        "sort" : [
          1613460773949
        ]
      }
    ]
  }
}

"track_total_hits": false is key to the perf numbers and if you flip it to true you should see the numbers get as bad as the aggregation and in some cases possibly worse, but more than likely it is just as bad on average from what I can tell
running each multiple times.

If you're worried about caching messing up results you can do this to remove the all caches to test more:

# Clear the cache
POST /_cache/clear

Outside of this testing, if you flip the profile flags to true for each query and run them side by side for comparison or to analyze the functionality of profiling you will get the took values to be ~13 seconds for aggregations vs. ~5 seconds for search which reinforces there is a positive net gain.

If you're into looking and analyzing in depth the profile sections and learning about them I suggest using just filebeat-8.0.0-2020.12.31-000005 like so:

GET filebeat-8.0.0-2020.12.31-000005/_search?allow_no_indices=true&ignore_unavailable=true&human=true
{
  "profile": true,
  "track_total_hits": false,
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggregations": {
    "last_seen_event": {
      "max": {
        "field": "@timestamp"
      }
    }
  }
}

GET filebeat-8.0.0-2020.12.31-000005/_search?allow_no_indices=true&ignore_unavailable=true&human=true
{
  "profile": true,
  "track_total_hits": false,
  "query": {
    "match_all": {}
  },
  "size": 1,
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ],
  "fields": [
    "@timestamp"
  ],
  "_source": false
}

If you change the track_total_hits from false to true for the search you will see the LongDistanceFeatureQuery being removed for when we query match_all. I don't know if that has to do with the track_total_hits or other things but an interesting note to point out here.

FrankHassanabad · 2021-02-16T07:47:06Z

Second example from the file with an added filter from:
https://github.com/elastic/kibana/blob/master/x-pack/plugins/security_solution/server/search_strategy/timeline/factory/events/last_event_time/query.events_last_event_time.dsl.ts#L82

Where I replicate it and then replace it with the faster search + sort. However, I do play with the cache quite a bit to see the speeds with and without it. Speeds indicate faster order of magnitude of seconds vs milliseconds. With caching disabled it's 2 seconds vs 8 seconds. You can add &request_cache=true or &request_cache=false to the queries too I read but that didn't seem to have as much of an effect for me during testing.

# Clear the cache
POST /_cache/clear

# Another aggregation example where times are between 3-8 seconds depending on if the cache is cleared or not 
GET auditbeat-*,endgame-*,filebeat-*,logs-*,packetbeat-*,winlogbeat-*,-*elastic-cloud-logs-*/_search?allow_no_indices=true&ignore_unavailable=true&human=true
{
  "profile": false,
  "track_total_hits": false,
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "source.ip": "127.0.0.1"
          }
        },
        {
          "term": {
            "destination.ip": "127.0.0.1"
          }
        }
      ]
    }
  },
  "size": 0,
  "aggregations": {
    "last_seen_event": {
      "max": {
        "field": "@timestamp"
      }
    }
  }
}

# Faster times with using a bool filter and sort + size + track-total_hits: false (100 milliseconds to 2 or 3 seconds depending on if the cache is cleared)
GET auditbeat-*,endgame-*,filebeat-*,logs-*,packetbeat-*,winlogbeat-*,-*elastic-cloud-logs-*/_search?allow_no_indices=true&ignore_unavailable=true&human=true
{
  "profile": false,
  "track_total_hits": false,
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "should": [
            {
              "term": {
                "source.ip": "127.0.0.1"
              }
            },
            {
              "term": {
                "destination.ip": "127.0.0.1"
              }
            }
          ]
        }
      }
    }
  },
  "size": 1,
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ],
  "fields": [
    "@timestamp"
  ],
  "_source": false
}

@dplumlee

…nts (#91790) ## Summary Fixes performance issues and timeouts with first and last events. Previously we were using aggregations and max/min but that is slower in orders of magnitude vs. using sorting + size: 1 + track_total_hits: false. This PR also splits out the aggregate of first/last into two separate calls as we were calling the same aggregate twice for first/last twice rather than calling one query for first even and a second query for last event which is also faster when you don't use an aggregate even if you ran them multiple times back to back compared to using aggregates for min/max's. For how we determined performance numbers quantifiably and how we did profiling see this comment: #91269 (comment) Shoutout to @dplumlee for the help and the first cut and draft of this PR that I morphed into this PR. For any manual testing, please test: * Hosts overview page and look at "last events" * Network overview page and look at "last events" * "last events" on the timeline * Host details page the first and last seen event for a particular host. The more records the better. * Network details page and the first and last seen for a particular network. 127.0.0.1 is good as that has a lot of records. For qualitative viewing you should see a noticeable difference like so where the first seen/last seen/last events show up quicker on detailed pages from the left side compared to the right side: ![quantify_perf](https://user-images.githubusercontent.com/1151048/108309335-99773600-716e-11eb-8dae-c289947b188c.gif) ### Checklist Delete any items that are not applicable to this PR. - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

@dplumlee

…nts (elastic#91790) ## Summary Fixes performance issues and timeouts with first and last events. Previously we were using aggregations and max/min but that is slower in orders of magnitude vs. using sorting + size: 1 + track_total_hits: false. This PR also splits out the aggregate of first/last into two separate calls as we were calling the same aggregate twice for first/last twice rather than calling one query for first even and a second query for last event which is also faster when you don't use an aggregate even if you ran them multiple times back to back compared to using aggregates for min/max's. For how we determined performance numbers quantifiably and how we did profiling see this comment: elastic#91269 (comment) Shoutout to @dplumlee for the help and the first cut and draft of this PR that I morphed into this PR. For any manual testing, please test: * Hosts overview page and look at "last events" * Network overview page and look at "last events" * "last events" on the timeline * Host details page the first and last seen event for a particular host. The more records the better. * Network details page and the first and last seen for a particular network. 127.0.0.1 is good as that has a lot of records. For qualitative viewing you should see a noticeable difference like so where the first seen/last seen/last events show up quicker on detailed pages from the left side compared to the right side: ![quantify_perf](https://user-images.githubusercontent.com/1151048/108309335-99773600-716e-11eb-8dae-c289947b188c.gif) ### Checklist Delete any items that are not applicable to this PR. - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

@dplumlee

…nts (#91790) (#91797) ## Summary Fixes performance issues and timeouts with first and last events. Previously we were using aggregations and max/min but that is slower in orders of magnitude vs. using sorting + size: 1 + track_total_hits: false. This PR also splits out the aggregate of first/last into two separate calls as we were calling the same aggregate twice for first/last twice rather than calling one query for first even and a second query for last event which is also faster when you don't use an aggregate even if you ran them multiple times back to back compared to using aggregates for min/max's. For how we determined performance numbers quantifiably and how we did profiling see this comment: #91269 (comment) Shoutout to @dplumlee for the help and the first cut and draft of this PR that I morphed into this PR. For any manual testing, please test: * Hosts overview page and look at "last events" * Network overview page and look at "last events" * "last events" on the timeline * Host details page the first and last seen event for a particular host. The more records the better. * Network details page and the first and last seen for a particular network. 127.0.0.1 is good as that has a lot of records. For qualitative viewing you should see a noticeable difference like so where the first seen/last seen/last events show up quicker on detailed pages from the left side compared to the right side: ![quantify_perf](https://user-images.githubusercontent.com/1151048/108309335-99773600-716e-11eb-8dae-c289947b188c.gif) ### Checklist Delete any items that are not applicable to this PR. - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios Co-authored-by: Frank Hassanabad <frank.hassanabad@elastic.co>

dplumlee added bug Fixes for quality problems that affect the customer experience Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. labels Feb 12, 2021

dplumlee self-assigned this Feb 12, 2021

FrankHassanabad linked a pull request Feb 18, 2021 that will close this issue

[Security Solutions] Improves query performance of first and last events #91790

Merged

1 task

FrankHassanabad mentioned this issue Feb 18, 2021

[Security Solutions] Improves query performance of first and last events #91790

Merged

1 task

FrankHassanabad closed this as completed in #91790 Feb 18, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Security Solution] Search Strategy slowdown #91269

[Security Solution] Search Strategy slowdown #91269

dplumlee commented Feb 12, 2021 •

edited by FrankHassanabad

Loading

elasticmachine commented Feb 12, 2021

FrankHassanabad commented Feb 16, 2021 •

edited

Loading

FrankHassanabad commented Feb 16, 2021 •

edited

Loading

[Security Solution] Search Strategy slowdown #91269

[Security Solution] Search Strategy slowdown #91269

Comments

dplumlee commented Feb 12, 2021 • edited by FrankHassanabad Loading

Overview

Initial triage

Related PRs

elasticmachine commented Feb 12, 2021

FrankHassanabad commented Feb 16, 2021 • edited Loading

FrankHassanabad commented Feb 16, 2021 • edited Loading

dplumlee commented Feb 12, 2021 •

edited by FrankHassanabad

Loading

FrankHassanabad commented Feb 16, 2021 •

edited

Loading

FrankHassanabad commented Feb 16, 2021 •

edited

Loading