Add support to SavedObjects.find for fetching more than 10k objects #77961
Pinging @elastic/kibana-platform (Team:Platform)
Thanks for filing this issue, @joshdover. I've just encountered this limitation in the scope of #72420 as well. In a nutshell, when an admin wants to bulk rotate the SO encryption key, we should fetch and process all saved objects (ideally in batches with a configurable size, to balance the load), from all spaces, for all SO types that may have encrypted attributes. And these days we may have quite a lot of them (alerts, actions, Fleet-related SOs). The fact that we also update some of the fetched results makes plain paging even trickier.

Do you happen to have any recommended workarounds for SO use cases like this? If not, is there anything we can help with to boost the priority of this enhancement? cc @elastic/kibana-security
This is obviously a very stupid option in terms of memory usage, but still asking: could using the
This is configurable via the
Hi Josh, as we discussed last week, the current limitation impacts the scalability of the Fleet effort. Every agent that connects to Fleet is stored as a saved object that can be managed in the UI. The limitation is currently not too bad for us, as we are actively working on improving performance to handle a large number of agents, so the number of users who will reach this limit is small. But we will soon want to get to a point where we can handle >10k agents smoothly, so the number of large-scale users will increase. #78520 describes how the current SO client limits us and our current UI workaround. cc @ph for awareness & prioritization
In the 7.12 release, the team is going to investigate the basic architecture.
@pgayvallet, so we did try to use the I am sharing that because using
I've been looking at the available options. For scenarios where we are doing bulk processing of a very large number of objects on the server side, all these solutions would work, as they would all allow 'scrolling' through all the results of a query that exceeds the limit.

However, as @XavierM mentioned in his comment, one of the most common scenarios where we face this limitation is when displaying the saved objects in a paginated table in the UI. To demonstrate, take the saved object management table as an example, where we display all the visible saved objects: in this table, we paginate the results in pages of, say, 100 items. The pagination buttons allow the user to navigate forward, back, to the first, or to the last page. Currently, when accessing the page PAGE that displays PER_PAGE results, we call the find API with the corresponding page and per-page parameters.

As the user is able to navigate from any page to any other page, backward or forward, none of the suggested solutions would work to address this.
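To make the table scenario concrete, here is a minimal sketch of the classic offset pagination described above. The constant mirrors the Elasticsearch default; the helper names are hypothetical, not actual Kibana APIs:

```typescript
// Default value of index.max_result_window in Elasticsearch.
const MAX_RESULT_WINDOW = 10_000;

interface PageRequest {
  page: number;    // 1-based page index, as in SavedObjects.find options
  perPage: number; // results per page
}

// Compute the `from`/`size` pair a find-style API would send to Elasticsearch.
function toFromSize({ page, perPage }: PageRequest): { from: number; size: number } {
  return { from: (page - 1) * perPage, size: perPage };
}

// Elasticsearch rejects a request when from + size exceeds max_result_window,
// so any page past this boundary is unreachable with plain offset pagination.
function exceedsWindow(req: PageRequest): boolean {
  const { from, size } = toFromSize(req);
  return from + size > MAX_RESULT_WINDOW;
}
```

With 100 items per page, page 100 is the last one that can be fetched; page 101 would already be rejected, which is exactly the limitation the paginated tables run into.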
Which is why I'm wondering: as our indices are now system indices, could we just ask the ES team to change the default value of this setting for them?
I don't think we'd need to wait for system indices for this? We should be able to apply this setting directly to the index during the migration process.
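For reference, `index.max_result_window` is a dynamic index setting, so it can be updated on a live index without waiting for system-index support. A hypothetical example against a Kibana index, with an illustrative value:

```
PUT /.kibana/_settings
{
  "index": {
    "max_result_window": 50000
  }
}
```

This only raises the ceiling, though; it does not remove the underlying cost of deep offset pagination.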
This is my primary question. I'm curious if the Elasticsearch performance issues scale with the number of documents or with the size (in bytes) of the results. Since we're primarily paginating large numbers of really small documents, I'm hoping it's the latter. Would anyone from @elastic/es-perf be able to shed light on this? Specifically, what is the reason for the limit?
I don't think changing the setting is the right fix. I think we'll have to do something similar to what @XavierM mentioned, where the UI works around the problem. I'm not sure what page size we currently use, but more than 100 results probably don't fit on a screen, which means 10k results is at least 100 pages. I don't think users will ever need more pages than that; they should rather narrow down their search. So then the UI could simply cap pagination at that point.

We would have to add a tiebreaker field to all saved objects to allow the find API to support >10k search results.
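The cursor-based alternative mentioned here can be sketched in miniature. This toy version mimics how Elasticsearch's `search_after` resumes strictly after the last hit of the previous page, using a (sort value, id) pair as the cursor; the types and helpers are illustrative only, not the SO repository's actual API:

```typescript
interface SavedObjectDoc {
  id: string;        // acts as the tiebreaker
  updatedAt: number; // the user-chosen sort field
}

// Cursor: [updatedAt, id] of the last hit of the previous page.
type Cursor = [number, string];

// Total order: sort field first, tiebreaker second.
function compareDocs(a: SavedObjectDoc, b: SavedObjectDoc): number {
  return a.updatedAt - b.updatedAt || a.id.localeCompare(b.id);
}

// Return the next `size` docs strictly after the cursor in sort order.
function searchAfter(docs: SavedObjectDoc[], size: number, cursor?: Cursor): SavedObjectDoc[] {
  const sorted = [...docs].sort(compareDocs);
  const start = cursor
    ? sorted.findIndex(
        (d) =>
          d.updatedAt > cursor[0] ||
          (d.updatedAt === cursor[0] && d.id.localeCompare(cursor[1]) > 0)
      )
    : 0;
  return start === -1 ? [] : sorted.slice(start, start + size);
}

// Walk the whole collection page by page without ever using from/size offsets,
// so no page is blocked by max_result_window.
function scrollAll(docs: SavedObjectDoc[], size: number): SavedObjectDoc[] {
  const out: SavedObjectDoc[] = [];
  let cursor: Cursor | undefined;
  for (;;) {
    const page = searchAfter(docs, size, cursor);
    if (page.length === 0) return out;
    out.push(...page);
    const last = page[page.length - 1];
    cursor = [last.updatedAt, last.id];
  }
}
```

Note the trade-off this encodes: forward iteration is cheap, but jumping to an arbitrary page requires walking there, which is why it suits export/bulk processing better than a random-access table.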
I overall agree with implementing that.
The good old tiebreaker field. Note that one notable constraint/limitation of such an approach is that we would only be allowed to sort by this field. Also, this would require migrating all SO objects during a migration to populate this field, which is kind of unsupported at the moment (even if adding 'internal' migrations to core to affect all types should be doable).
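A small sketch of why the tiebreaker matters: resuming with `search_after` on a non-unique sort value alone can silently skip documents. This is an illustrative toy, not the actual SO repository implementation:

```typescript
interface Doc {
  id: string;
  updatedAt: number;
}

// Fetch docs strictly after `after`, sorting on updatedAt only (no tiebreaker).
function nextPageNoTiebreaker(docs: Doc[], size: number, after?: number): Doc[] {
  const sorted = [...docs].sort((a, b) => a.updatedAt - b.updatedAt);
  const rest = after === undefined ? sorted : sorted.filter((d) => d.updatedAt > after);
  return rest.slice(0, size);
}

const docs: Doc[] = [
  { id: 'a', updatedAt: 1 },
  { id: 'b', updatedAt: 2 },
  { id: 'c', updatedAt: 2 }, // shares its sort value with 'b'
  { id: 'd', updatedAt: 3 },
];

const page1 = nextPageNoTiebreaker(docs, 2);         // a, b
const cursor = page1[page1.length - 1].updatedAt;    // 2
const page2 = nextPageNoTiebreaker(docs, 2, cursor); // only d: 'c' is skipped
```

Because the cursor says "strictly after updatedAt = 2", the not-yet-returned doc `c` with the same timestamp is lost. A unique tiebreaker in the sort (and in the cursor) removes the ambiguity, which is also why it would constrain what the primary sort can be combined with.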
As mentioned during the breakout session, we should probably split that into two distinct tasks:

- The main goal of this issue is to address the problems related to SO import/export and our consumers that need to 'scroll' over more than 10k objects.
- The 'paginated tables' issue when there are more than 10k objects should be addressed at a later time (and probably just by informing the user that only the first 10k objects can be displayed, as suggested in #77961 (comment)).
I've opened #86300 to discuss how we might add a tie_breaker, but as a first iteration we could use
Closing, as we should now have the ability to page through >10k objects as of #89915. (We still have a separate issue open regarding adding a tiebreaker, but that did not end up blocking this effort after all.)

[edit] Feel free to re-open if there's something I've missed here!
The current `find` API on the SavedObjectRepository cannot page through large data sets due to the `index.max_result_window` setting in Elasticsearch, which defaults to 10,000 objects. This is starting to limit what plugins can build, and we have several types of SOs that may have large numbers of objects now (SIEM's exception lists come to mind).

To alleviate this, we could add scrolling support to SavedObjects. However, there is one significant caveat: scrolls have a TTL in Elasticsearch, meaning that cursors are invalidated after a fairly short time period by default. Clients would need to be aware of this and handle it properly, and it may not be easy to realize this in development.
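The TTL caveat can be illustrated with a toy cursor: a scroll context is server-side state that expires after its keep-alive, and a client that pauses too long between fetches gets an error rather than the next batch. This class only mimics that behaviour with a wall-clock deadline; it is a sketch, not the Elasticsearch client:

```typescript
class ScrollContext<T> {
  private pos = 0;
  private deadline: number;

  constructor(private readonly hits: T[], private readonly ttlMs: number) {
    this.deadline = Date.now() + ttlMs;
  }

  // Each fetch renews the keep-alive, similar to passing a scroll duration
  // on every scroll request; a late fetch finds the context gone.
  next(size: number): T[] {
    if (Date.now() > this.deadline) {
      throw new Error('search_context_missing: scroll has expired');
    }
    this.deadline = Date.now() + this.ttlMs;
    const batch = this.hits.slice(this.pos, this.pos + size);
    this.pos += batch.length;
    return batch;
  }
}
```

This is the failure mode clients would have to handle: any consumer that does slow per-batch processing (or a UI where the user walks away) must either keep the cursor alive or be prepared to restart the scroll.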
Another option could be the `_async_search` APIs, but those are not available in OSS distributions.

This issue definitely needs further investigation, but I wanted to open it to start collecting use cases where it would be useful.
Related: #22636, #64715