-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Security Solution] Search Strategy slowdown #91269
Comments
Pinging @elastic/security-solution (Team: SecuritySolution) |
Ok so I dug into this quite a bit and need some peer review from @dplumlee and a verification and validation of what I'm saying is truthy before we go and replace these aggregation queries with plain search + sort. However these first metrics and numbers coming back are looking really good. To start with, here is some material to read through that I went through and then I will give my analysis and testing I conducted on the queries below: Reading and studying https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand Thoughts and analysis Adding
However, adding and using https://github.com/elastic/elasticsearch/blob/master/docs/reference/query-dsl/distance-feature-query.asciidoc#skip-non-competitive-hits seems to back this up as well
Testing and numbers I tested with both the dev tools UI for performance as well as the regular dev tools output. See these two below for more information: Within our queries for things such as You can test this with and without the profiler against server side data like so: # Using an aggregation which is not fast against all indexes
GET auditbeat-*,endgame-*,filebeat-*,logs-*,packetbeat-*,winlogbeat-*,-*elastic-cloud-logs-*/_search?allow_no_indices=true&ignore_unavailable=true&human=true
{
"profile": false,
"track_total_hits": false,
"aggregations": {
"last_seen_event": {
"max": {
"field": "@timestamp"
}
}
},
"query": {
"match_all": {}
},
"size": 0
} Results with profile set to {
"took" : 2286, <-- ~2 to 4 seconds
"timed_out" : false,
"_shards" : {
"total" : 129,
"successful" : 129,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"last_seen_event" : {
"value" : 1.613460708483E12,
"value_as_string" : "2021-02-16T07:31:48.483Z"
}
}
} Changing Using instead a regular # Sorting + track_total_hits + size: 1 gives us around ~100 millisecond times
GET auditbeat-*,endgame-*,filebeat-*,logs-*,packetbeat-*,winlogbeat-*,-*elastic-cloud-logs-*/_search?allow_no_indices=true&ignore_unavailable=true&human=true
{
"profile": false,
"size": 1,
"track_total_hits": false,
"query": {
"match_all": {}
},
"sort": [
{
"@timestamp": {
"order": "desc"
}
}
],
"fields": [
"@timestamp"
],
"_source": false
} {
"took" : 102, <-- ~100-200 milliseconds
"timed_out" : false,
"_shards" : {
"total" : 129,
"successful" : 129,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"max_score" : null,
"hits" : [
{
"_index" : "auditbeat-7.8.0-2021.02.11-000011",
"_id" : "UC28qXcBtgfIO53sJ584",
"_score" : null,
"fields" : {
"@timestamp" : [
"2021-02-16T07:32:53.949Z"
]
},
"sort" : [
1613460773949
]
}
]
}
}
If you're worried about caching messing up results you can do this to remove the all caches to test more: # Clear the cache
POST /_cache/clear Outside of this testing, if you flip the profile flags to true for each query and run them side by side for comparison or to analyze the functionality of profiling you will get the If you're into looking and analyzing in depth the profile sections and learning about them I suggest using just GET filebeat-8.0.0-2020.12.31-000005/_search?allow_no_indices=true&ignore_unavailable=true&human=true
{
"profile": true,
"track_total_hits": false,
"query": {
"match_all": {}
},
"size": 0,
"aggregations": {
"last_seen_event": {
"max": {
"field": "@timestamp"
}
}
}
} GET filebeat-8.0.0-2020.12.31-000005/_search?allow_no_indices=true&ignore_unavailable=true&human=true
{
"profile": true,
"track_total_hits": false,
"query": {
"match_all": {}
},
"size": 1,
"sort": [
{
"@timestamp": {
"order": "desc"
}
}
],
"fields": [
"@timestamp"
],
"_source": false
} If you change the |
Second example from the file with an added filter from: Where I replicate it and then replace it with the faster search + sort. However, I do play with the cache quite a bit to see the speeds with and without it. Speeds indicate faster order of magnitude of seconds vs milliseconds. With caching disabled it's 2 seconds vs 8 seconds. You can add # Clear the cache
POST /_cache/clear # Another aggregation example where times are between 3-8 seconds depending on if the cache is cleared or not
GET auditbeat-*,endgame-*,filebeat-*,logs-*,packetbeat-*,winlogbeat-*,-*elastic-cloud-logs-*/_search?allow_no_indices=true&ignore_unavailable=true&human=true
{
"profile": false,
"track_total_hits": false,
"query": {
"bool": {
"should": [
{
"term": {
"source.ip": "127.0.0.1"
}
},
{
"term": {
"destination.ip": "127.0.0.1"
}
}
]
}
},
"size": 0,
"aggregations": {
"last_seen_event": {
"max": {
"field": "@timestamp"
}
}
}
} # Faster times with using a bool filter and sort + size + track-total_hits: false (100 milliseconds to 2 or 3 seconds depending on if the cache is cleared)
GET auditbeat-*,endgame-*,filebeat-*,logs-*,packetbeat-*,winlogbeat-*,-*elastic-cloud-logs-*/_search?allow_no_indices=true&ignore_unavailable=true&human=true
{
"profile": false,
"track_total_hits": false,
"query": {
"bool": {
"filter": {
"bool": {
"should": [
{
"term": {
"source.ip": "127.0.0.1"
}
},
{
"term": {
"destination.ip": "127.0.0.1"
}
}
]
}
}
}
},
"size": 1,
"sort": [
{
"@timestamp": {
"order": "desc"
}
}
],
"fields": [
"@timestamp"
],
"_source": false
} |
…nts (#91790) ## Summary Fixes performance issues and timeouts with first and last events. Previously we were using aggregations and max/min but that is slower in orders of magnitude vs. using sorting + size: 1 + track_total_hits: false. This PR also splits out the aggregate of first/last into two separate calls as we were calling the same aggregate twice for first/last twice rather than calling one query for first even and a second query for last event which is also faster when you don't use an aggregate even if you ran them multiple times back to back compared to using aggregates for min/max's. For how we determined performance numbers quantifiably and how we did profiling see this comment: #91269 (comment) Shoutout to @dplumlee for the help and the first cut and draft of this PR that I morphed into this PR. For any manual testing, please test: * Hosts overview page and look at "last events" * Network overview page and look at "last events" * "last events" on the timeline * Host details page the first and last seen event for a particular host. The more records the better. * Network details page and the first and last seen for a particular network. 127.0.0.1 is good as that has a lot of records. For qualitative viewing you should see a noticeable difference like so where the first seen/last seen/last events show up quicker on detailed pages from the left side compared to the right side:  ### Checklist Delete any items that are not applicable to this PR. - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
…nts (elastic#91790) ## Summary Fixes performance issues and timeouts with first and last events. Previously we were using aggregations and max/min but that is slower in orders of magnitude vs. using sorting + size: 1 + track_total_hits: false. This PR also splits out the aggregate of first/last into two separate calls as we were calling the same aggregate twice for first/last twice rather than calling one query for first even and a second query for last event which is also faster when you don't use an aggregate even if you ran them multiple times back to back compared to using aggregates for min/max's. For how we determined performance numbers quantifiably and how we did profiling see this comment: elastic#91269 (comment) Shoutout to @dplumlee for the help and the first cut and draft of this PR that I morphed into this PR. For any manual testing, please test: * Hosts overview page and look at "last events" * Network overview page and look at "last events" * "last events" on the timeline * Host details page the first and last seen event for a particular host. The more records the better. * Network details page and the first and last seen for a particular network. 127.0.0.1 is good as that has a lot of records. For qualitative viewing you should see a noticeable difference like so where the first seen/last seen/last events show up quicker on detailed pages from the left side compared to the right side:  ### Checklist Delete any items that are not applicable to this PR. - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
…nts (#91790) (#91797) ## Summary Fixes performance issues and timeouts with first and last events. Previously we were using aggregations and max/min but that is slower in orders of magnitude vs. using sorting + size: 1 + track_total_hits: false. This PR also splits out the aggregate of first/last into two separate calls as we were calling the same aggregate twice for first/last twice rather than calling one query for first even and a second query for last event which is also faster when you don't use an aggregate even if you ran them multiple times back to back compared to using aggregates for min/max's. For how we determined performance numbers quantifiably and how we did profiling see this comment: #91269 (comment) Shoutout to @dplumlee for the help and the first cut and draft of this PR that I morphed into this PR. For any manual testing, please test: * Hosts overview page and look at "last events" * Network overview page and look at "last events" * "last events" on the timeline * Host details page the first and last seen event for a particular host. The more records the better. * Network details page and the first and last seen for a particular network. 127.0.0.1 is good as that has a lot of records. For qualitative viewing you should see a noticeable difference like so where the first seen/last seen/last events show up quicker on detailed pages from the left side compared to the right side:  ### Checklist Delete any items that are not applicable to this PR. - [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios Co-authored-by: Frank Hassanabad <frank.hassanabad@elastic.co>
Overview
A case was brought up recently that identified the security solution's search strategy as having a slowdown when certain queries were sent to elasticsearch, namely the
last_event_time
on the hosts tab. Some calls are seemingly running until the query/es times out and returns abackend closed connection
error, others are just incredibly slow (20-30 seconds).Initial triage
Looking at network files led us to identify the async search calls made to elasticsearch as a possible reason for the slowdowns, and might be queuing up causing any future queries to also be slow and making the app slow down as a whole. Upon closer inspection the
track_total_hits
field wasn't being passed down properly to the search strategy queries and might be the culprit causing the back up in queries, specifically on the hosts overview and hosts details page. A pr was put up to fix this, but further testing is needed to confirm whether or not this will provide enough ease on es to get the queries to run reasonably faster.Will update as we progress
Related PRs
#91068
#91790
The text was updated successfully, but these errors were encountered: