Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expensive queries are causing unnecessary load and delays on Elasticsearch #93770

Open
16 of 45 tasks
Tracked by #119466
rudolf opened this issue Mar 5, 2021 · 7 comments
Open
16 of 45 tasks
Tracked by #119466
Labels
Feature:Search Querying infrastructure in Kibana impact:medium Addressing this issue will have a medium level of impact on the quality/strength of our product. loe:small Small Level of Effort Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Team:DataDiscovery Discover, search (e.g. data plugin and KQL), data views, saved searches. For ES|QL, use Team:ES|QL. Team:Detections and Resp Security Detection Response Team Team:Kibana Management Dev Tools, Index Management, Upgrade Assistant, ILM, Ingest Node Pipelines, and more Team:ML Team label for ML (also use :ml) Team:Presentation Presentation Team for Dashboard, Input Controls, and Canvas Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc.

Comments

@rudolf
Copy link
Contributor

rudolf commented Mar 5, 2021

Until #89915 (v7.12.0) saved objects didn't support paging through large result sets. Now that we have _search_after support, plugins who previously paged through "all" results by setting size: 10000 should be refactored to use search after instead.

The problem with creating searches with large batches of 10000 is that it blocks the Elasticsearch thread pool for a long time which negatively impacts the performance of other search queries. Since Kibana started using system indices for the saved objects index in 7.11, this has had a much bigger impact because these searches share a thread pool with the security index. Paging with smaller batches means faster responses per request, allowing the thread pool to interleave Kibana searches with other requests.

In addition to the performance impact on Elasticsearch, large searches also mean large response payloads which blocks the Kibana thread for an extended amount of time. This causes spikes in the event loop delay which impacts the performance of all plugins.

Short term: fix all 10k searches against the saved object indices

The following is a list of plugins performing searches with perPage: 10000. Please audit each occurrence and mark the task as complete with a link to the PR once it has been resolved. These links are based on a quick search, if the linked code isn't searching against a saved objects index with size > 1000 please mark the item as done.

Blocked on #91175 because that will make it significantly easier for teams to address these issues.
Done. Here are docs on the new point-in-time finder.

Medium term

@rudolf rudolf added the Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc label Mar 5, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-core (Team:Core)

@rudolf rudolf changed the title [wip] Expensive queries are causing unnecessary load and delays on Elasticsearch Expensive queries are causing unnecessary load and delays on Elasticsearch Mar 9, 2021
@lukeelmers
Copy link
Member

#91175 has been addressed, so teams should now be unblocked on moving forward with the short-term fixes outlined here.

@smith
Copy link
Contributor

smith commented Mar 31, 2021

None of the APM items listed are querying saved object indices. Checked them off.

@alexwizp
Copy link
Contributor

alexwizp commented May 5, 2021

changes related to the KibanaApp team are ready (#99031, #99023, #98914, #98903) but blocked by #99044

alexwizp added a commit that referenced this issue Aug 30, 2021
…lasticsearch (#98914)

* [TSVB] Expensive queries are causing unnecessary load and delays on Elasticsearch

Part of: #93770

* remove globalConfig

* fix tests

* fix finder.close

* cleanup code

* run queries concurrently

* add namespaces: ['*'],

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
alexwizp added a commit to alexwizp/kibana that referenced this issue Aug 30, 2021
…lasticsearch (elastic#98914)

* [TSVB] Expensive queries are causing unnecessary load and delays on Elasticsearch

Part of: elastic#93770

* remove globalConfig

* fix tests

* fix finder.close

* cleanup code

* run queries concurrently

* add namespaces: ['*'],

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
alexwizp added a commit that referenced this issue Aug 30, 2021
…elays on Elasticsearch (#99031)

* [Visualizations] Expensive queries are causing unnecessary load and delays on Elasticsearch

Part of: #93770

* fix CI

* fix typo

* fix namespaces issue

* fix tests

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
alexwizp added a commit to alexwizp/kibana that referenced this issue Aug 30, 2021
…elays on Elasticsearch (elastic#99031)

* [Visualizations] Expensive queries are causing unnecessary load and delays on Elasticsearch

Part of: elastic#93770

* fix CI

* fix typo

* fix namespaces issue

* fix tests

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
alexwizp added a commit that referenced this issue Aug 30, 2021
…lasticsearch (#99023)

* [Vega] Expensive queries are causing unnecessary load and delays on Elasticsearch

Part of: #93770

* Update get_usage_collector.test.ts

* add namespaces: ['*']

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
alexwizp added a commit to alexwizp/kibana that referenced this issue Aug 30, 2021
…lasticsearch (elastic#99023)

* [Vega] Expensive queries are causing unnecessary load and delays on Elasticsearch

Part of: elastic#93770

* Update get_usage_collector.test.ts

* add namespaces: ['*']

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
alexwizp added a commit that referenced this issue Aug 30, 2021
…s on Elasticsearch (#98903)

* [Data Table] Expensive queries are causing unnecessary load and delays on Elasticsearch

Part of #93770

* remove extra cycles

* fix PR comments

* fix finder.close

* code cleanup

* add namespaces: ['*'],

* fix jest

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
alexwizp added a commit to alexwizp/kibana that referenced this issue Aug 30, 2021
…s on Elasticsearch (elastic#98903)

* [Data Table] Expensive queries are causing unnecessary load and delays on Elasticsearch

Part of elastic#93770

* remove extra cycles

* fix PR comments

* fix finder.close

* code cleanup

* add namespaces: ['*'],

* fix jest

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
alexwizp added a commit that referenced this issue Aug 30, 2021
…lasticsearch (#98914) (#110446)

* [TSVB] Expensive queries are causing unnecessary load and delays on Elasticsearch

Part of: #93770

* remove globalConfig

* fix tests

* fix finder.close

* cleanup code

* run queries concurrently

* add namespaces: ['*'],

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
alexwizp added a commit that referenced this issue Aug 30, 2021
…lasticsearch (#99023) (#110448)

* [Vega] Expensive queries are causing unnecessary load and delays on Elasticsearch

Part of: #93770

* Update get_usage_collector.test.ts

* add namespaces: ['*']

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
alexwizp added a commit that referenced this issue Aug 30, 2021
…elays on Elasticsearch (#99031) (#110447)

* [Visualizations] Expensive queries are causing unnecessary load and delays on Elasticsearch

Part of: #93770

* fix CI

* fix typo

* fix namespaces issue

* fix tests

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
alexwizp added a commit that referenced this issue Aug 30, 2021
…s on Elasticsearch (#98903) (#110457)

* [Data Table] Expensive queries are causing unnecessary load and delays on Elasticsearch

Part of #93770

* remove extra cycles

* fix PR comments

* fix finder.close

* code cleanup

* add namespaces: ['*'],

* fix jest

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@majagrubic majagrubic removed the Team:DataDiscovery Discover, search (e.g. data plugin and KQL), data views, saved searches. For ES|QL, use Team:ES|QL. label Dec 16, 2021
@afharo
Copy link
Member

afharo commented Jan 27, 2022

All the items for @elastic/kibana-telemetry are listed in #96715. We'll try to prioritize them.

@yctercero
Copy link
Contributor

Just FYI, security solution platform has this ticket we've been trying to get to of moving over to PIT - #103944

@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
FrankHassanabad added a commit that referenced this issue Feb 15, 2022
…int in Time) and restructuring of folders (#124912)

## Summary

Changes the usage collector telemetry within security solutions to use PIT (Point in Time) and a few other bug fixes and restructuring.

* The main goal is to change the full queries for up to 10k items to be instead using 1k batched items at a time and PIT (Point in Time). See [this ticket](#93770) for more information and [here](#99031) for an example where they changed there code to use 1k batched items. I use PIT with SO object API, searches, and then composite aggregations which all support the PIT. The PIT timeouts are all set to 5 minutes and all the maximums of 10k to not increase memory more is still there. However, we should be able to increase the 10k limit at this point if we wanted to for usage collector to count beyond the 10k. The initial 10k was an elastic limitation that PIT now avoids.
* This also fixes a bug where the aggregations were only returning the top 10 items instead of the full 10k. That is changed to use [composite aggregations](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html). 
* This restructuring the folder structure to try and do [reductionism](https://en.wikipedia.org/wiki/Reductionism#In_computer_science) best we can. I could not do reductionism with the schema as the tooling does not allow it. But the rest is self-repeating in the way hopefully developers expect it to be. And also make it easier for developers to add new telemetry usage collector counters in the same fashion.
* This exchanges the hand spun TypeScript types in favor of using the `caseComments` and the `Sanitized Alerts` and the `ML job types` using Partial and other TypeScript tricks.
* This removes the [Cyclomatic Complexity](https://en.wikipedia.org/wiki/Cyclomatic_complexity) warnings coming from the linters by breaking down the functions into smaller units.
* This removes the "as casts" in all but 1 area which can lead to subtle TypeScript problems.
* This pushes down the logger and uses the logger to report errors and some debug information

### Checklist

Delete any items that are not applicable to this PR.

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
FrankHassanabad added a commit that referenced this issue Feb 15, 2022
… from saved objects to exception lists (#125182)

## Summary

Exposes the functionality of
* search_after
* point in time (pit)

From saved objects to the exception lists. This _DOES NOT_ expose these to the REST API just yet. Rather this exposes it at the API level to start with and changes code that had hard limits of 10k and other limited loops. I use the batching of 1k for this at a time as I thought that would be a decent batch guess and I see other parts of the code changed to it. It's easy to change the 1k if we find we need to throttle back more as we get feedback from others.

See this PR where `PIT` and `search_after` were first introduced: #89915
See these 2 issues where we should be using more paging and PIT (Point in Time) with search_after: #93770 #103944

The new methods added to the `exception_list_client.ts` client class are:
* openPointInTime
* closePointInTime
* findExceptionListItemPointInTimeFinder
* findExceptionListPointInTimeFinder
* findExceptionListsItemPointInTimeFinder
* findValueListExceptionListItemsPointInTimeFinder

The areas of functionality that have been changed:
* Exception list exports
* Deletion of lists
* Getting exception list items when generating signals

Note that currently we use our own ways of looping over the saved objects which you can see in the codebase such as this older way below which does work but had a limitation of 10k against saved objects and did not do point in time (PIT)

Older way example (deprecated):
```ts
  let page = 1;
  let ids: string[] = [];
  let foundExceptionListItems = await findExceptionListItem({
    filter: undefined,
    listId,
    namespaceType,
    page,
    perPage: PER_PAGE,
    pit: undefined,
    savedObjectsClient,
    searchAfter: undefined,
    sortField: 'tie_breaker_id',
    sortOrder: 'desc',
  });
  while (foundExceptionListItems != null && foundExceptionListItems.data.length > 0) {
    ids = [
      ...ids,
      ...foundExceptionListItems.data.map((exceptionListItem) => exceptionListItem.id),
    ];
    page += 1;
    foundExceptionListItems = await findExceptionListItem({
      filter: undefined,
      listId,
      namespaceType,
      page,
      perPage: PER_PAGE,
      pit: undefined,
      savedObjectsClient,
      searchAfter: undefined,
      sortField: 'tie_breaker_id',
      sortOrder: 'desc',
    });
  }
  return ids;
```

But now that is replaced with this newer way using PIT:
```ts
  // Stream the results from the Point In Time (PIT) finder into this array
  let ids: string[] = [];
  const executeFunctionOnStream = (response: FoundExceptionListItemSchema): void => {
    const responseIds = response.data.map((exceptionListItem) => exceptionListItem.id);
    ids = [...ids, ...responseIds];
  };

  await findExceptionListItemPointInTimeFinder({
    executeFunctionOnStream,
    filter: undefined,
    listId,
    maxSize: undefined, // NOTE: This is unbounded when it is "undefined"
    namespaceType,
    perPage: 1_000,
    savedObjectsClient,
    sortField: 'tie_breaker_id',
    sortOrder: 'desc',
  });
  return ids;
```

We also have areas of code that has perPage listed at 10k or a constant that represents 10k which this removes in most areas (but not all areas):
```ts
      const items = await client.findExceptionListsItem({
        listId: listIds,
        namespaceType: namespaceTypes,
        page: 1,
        pit: undefined,
        perPage: MAX_EXCEPTION_LIST_SIZE, // <--- Really bad to send in 10k per page at a time
        searchAfter: undefined,
        filter: [],
        sortOrder: undefined,
        sortField: undefined,
      });
```

That is now:
```ts
      // Stream the results from the Point In Time (PIT) finder into this array
      let items: ExceptionListItemSchema[] = [];
      const executeFunctionOnStream = (response: FoundExceptionListItemSchema): void => {
        items = [...items, ...response.data];
      };

      await client.findExceptionListsItemPointInTimeFinder({
        executeFunctionOnStream,
        listId: listIds,
        namespaceType: namespaceTypes,
        perPage: 1_000,
        filter: [],
        maxSize: undefined, // NOTE: This is unbounded when it is "undefined"
        sortOrder: undefined,
        sortField: undefined,
      });
```

Left over areas will be handled in separate PR's because they are in other people's code ownership areas.

### Checklist
- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
@afharo
Copy link
Member

afharo commented Jul 5, 2022

All the @elastic/kibana-telemetry items are handled in #135689

@petrklapka petrklapka added Feature:Search Querying infrastructure in Kibana Team:DataDiscovery Discover, search (e.g. data plugin and KQL), data views, saved searches. For ES|QL, use Team:ES|QL. and removed Team:AppServicesSv labels Nov 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Search Querying infrastructure in Kibana impact:medium Addressing this issue will have a medium level of impact on the quality/strength of our product. loe:small Small Level of Effort Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Team:DataDiscovery Discover, search (e.g. data plugin and KQL), data views, saved searches. For ES|QL, use Team:ES|QL. Team:Detections and Resp Security Detection Response Team Team:Kibana Management Dev Tools, Index Management, Upgrade Assistant, ILM, Ingest Node Pipelines, and more Team:ML Team label for ML (also use :ml) Team:Presentation Presentation Team for Dashboard, Input Controls, and Canvas Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc.
Projects
None yet
Development

No branches or pull requests

10 participants