Support pagination in REST APIs #64099

Open
cjcenizal opened this issue Oct 23, 2020 · 11 comments
Labels
:Data Management/CAT APIs Text APIs behind /_cat :Data Management/Stats Statistics tracking and retrieval APIs discuss >enhancement Team:Data Management Meta label for data/management team

Comments

@cjcenizal
Contributor

cjcenizal commented Oct 23, 2020

Situation

Management UIs such as Index Management currently provide client-side pagination of tabular information. For example, to show the user a table of the indices in the cluster, the client requests a complete list of indices and renders a subset of them into the table. As the user applies filters/search input and paginates through the table, the client performs the necessary logic to determine which indices to render to the table.
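
As a minimal sketch of the client-side approach described above: the client holds every row and filters and slices locally. The `IndexRow` type and the `paginate_client_side` helper are hypothetical names for illustration, not Kibana code.

```python
from dataclasses import dataclass


@dataclass
class IndexRow:
    name: str
    health: str


def paginate_client_side(rows, search, page, page_size):
    """Filter the full list, then slice out one page (pages are 0-based)."""
    matching = [r for r in rows if search.lower() in r.name.lower()]
    start = page * page_size
    return matching[start:start + page_size], len(matching)


# The client has already fetched every index in the cluster.
all_rows = [IndexRow(f"logs-{i:03d}", "green") for i in range(25)]
all_rows += [IndexRow(f"metrics-{i:03d}", "yellow") for i in range(5)]

page_rows, total = paginate_client_side(all_rows, "logs", page=2, page_size=10)
print([r.name for r in page_rows], total)
```

Note that the cost of holding `all_rows` is paid before any filtering or paging happens, which is the memory concern raised in the next section.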

Problem

This is problematic for clusters that contain many indices because it requires the Kibana server to store the full set of tabular information in memory before returning it to the client. In the above example, this would be the full set of indices returned by GET /*?expand_wildcards=hidden,all as well as those returned by GET /_cat/indices?format=json&h=health,status,index,uuid,pri,rep,docs.count,sth,store.size&expand_wildcards=hidden,all&index=*. If this occupies more memory than is allocated to the Kibana server, it will cause the server to run out of memory and crash. Note that we need to gather more information about the frequency of this problem before we can prioritize a solution:

  • We need to measure the relationship between response size and memory occupied
  • We need to determine an upper bound on the size of these responses for most users
  • We can compare this upper bound with the 1.4 GB default memory limit for the Kibana server to determine the frequency of this problem

Severity

Based on some conversation with members of the Kibana team, I think it's unlikely that we'll encounter a scenario where the size of an ES REST API response is large enough to cause Kibana to OOM. Given that, I think this is a very low priority feature.

Complexities

If we decide to move forward with implementing server-side pagination, we'll need to consider how the ES APIs will support the current client-side logic that impacts pagination behavior, such as filtering and searching. We'll have to audit our various Management UIs to get a complete picture of what kind of logic the ES APIs will need to support.

Related issues

@cjcenizal cjcenizal added needs:triage Requires assignment of a team area label Team:Deployment Management Meta label for Management Experience - Deployment Management team labels Oct 23, 2020
@calm4wei
Contributor

When there are many indexes and shards, running _cat/indices or _cat/shards from the client is very time-consuming, and the results are difficult to read.
I think the server could support paging parameters such as _cat/indices?v&p=page:${pageNumber},size:${pageSize}, and the REST API could then use the incoming paging parameters when requesting results from the server.
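
A minimal sketch of how the `p=page:N,size:M` value proposed above might be parsed and applied to a list of rows. The `p` parameter exists only as a proposal in this comment; the helper names and the choice of 1-based page numbers are assumptions.

```python
def parse_paging(p):
    """Parse a value such as 'page:2,size:50' into (page, size)."""
    fields = dict(part.split(":", 1) for part in p.split(","))
    page, size = int(fields["page"]), int(fields["size"])
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return page, size


def page_of_rows(rows, p):
    """Return the requested page; page numbers are assumed to start at 1."""
    page, size = parse_paging(p)
    start = (page - 1) * size
    return rows[start:start + size]


shards = [f"shard-{i}" for i in range(7)]
print(page_of_rows(shards, "page:2,size:3"))
```

Offset-style paging like this only gives consistent pages if the server returns rows in a stable order between requests.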

@mayya-sharipova mayya-sharipova added >enhancement :Data Management/CAT APIs Text APIs behind /_cat and removed needs:triage Requires assignment of a team area label labels Oct 26, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/CAT APIs)

@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Oct 26, 2020
@mayya-sharipova mayya-sharipova added :Core/Features/Features and removed Team:Data Management Meta label for data/management team labels Oct 26, 2020
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Oct 26, 2020
@albertzaharovits
Contributor

I remember this (or similar) coming up some time ago, when there was no point in time API.

But nowadays I think that the internal functionality behind the point in time API can be used for pagination with sorting. I think it's just a matter of settling on parameter and URL names consistently across our APIs. There's currently one example that we could follow, the relatively recently introduced query watches API, in addition to the old get watch API. Consequently, I think we could pull https://github.com/elastic/elasticsearch/blob/master/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/utils/persistence/SearchAfterDocumentsIterator.java from ML into server code.

@martijnvg or @jimczi can you please confirm this is a reasonable approach to extend to other APIs that need to return potentially long lists?
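
A rough sketch of the search_after-style iteration idea raised in this comment, run against an in-memory stand-in for a sorted search. It is not the ML `SearchAfterDocumentsIterator` class; `SearchAfterIterator`, `make_fake_search`, and the document fields are hypothetical.

```python
class SearchAfterIterator:
    """Yield hits in batches, resuming after the last sort values seen."""

    def __init__(self, search, batch_size):
        self._search = search  # callable(search_after, size) -> list of hits
        self._batch_size = batch_size

    def __iter__(self):
        search_after = None
        while True:
            hits = self._search(search_after, self._batch_size)
            if not hits:
                return
            yield from hits
            # Resume the next batch strictly after the last hit's sort values.
            search_after = hits[-1]["sort"]


def make_fake_search(docs):
    """Stand-in for a sorted search; sorts by (name, id) so ties are broken."""
    ordered = sorted(docs, key=lambda d: (d["name"], d["id"]))

    def search(search_after, size):
        hits = [{"doc": d, "sort": [d["name"], d["id"]]} for d in ordered]
        if search_after is not None:
            hits = [h for h in hits if h["sort"] > search_after]
        return hits[:size]

    return search


docs = [{"id": i, "name": f"key-{i % 3}"} for i in range(7)]
iterator = SearchAfterIterator(make_fake_search(docs), batch_size=2)
ids = [hit["doc"]["id"] for hit in iterator]
print(ids)
```

The unique `id` tiebreaker in the sort is what keeps documents with equal names from being skipped or repeated across batches.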

@martijnvg
Member

I think having general parameters for some basic form of server-side pagination makes sense. I think it comes down to coming up with terminology that makes sense across APIs. For some APIs that return a large response, it doesn't make sense to support pagination, because the response can't be broken down into units (for example, cluster state). However, in the case of the cat APIs, the unit is clear: a row. Also, I suspect that pagination parameters in the request body aren't possible in all places, due to conflicts with existing request body formats, but having common params in the query string should, I think, work.

One additional note about the cat APIs. These APIs should only be used via the terminal. Kibana and other systems should use the JSON-based APIs. I know that the cat APIs provide functionality that isn't supported in the core JSON APIs, but that is something we should address separately.

@albertzaharovits
Contributor

Thanks @martijnvg!

In Security we're immediately concerned about the get API Keys API, and others will follow, e.g. get users, privileges, roles. For these APIs the unit is clear (e.g. an API key, a user def).

To clarify, I think we have a choice to make between extending the existing Get APIs or introducing new Query APIs.
I think the existing Get APIs will become bloated if they now take generic query, sort and search_after parameters in the request body. For this reason I favor the new Query APIs approach.
For the Get APIs we might get by with building a term query in the request handler and basic asc/desc sorting params, which I think is enough for the basic table views in Kibana, but after a certain scale that's most likely not enough. Overall this feels counter-productive to me; it would be better to reuse common, powerful filter and paging parameters that users might be familiar with from our search APIs.
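
To make the trade-off concrete, here is a minimal sketch of a request body that reuses search-style `query`, `sort`, `search_after`, and `size` parameters. The `build_query_request` helper and the field names (`username`, `creation`, `id`) are hypothetical, and this is not the request format of any existing Get or Query API.

```python
import json


def build_query_request(filters, sort_field, order="asc", size=100, search_after=None):
    """Build a search-style request body from simple listing parameters."""
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    body = {
        "size": size,
        # Each filter becomes a term clause; the bool filter ANDs them together.
        "query": {"bool": {"filter": [{"term": {f: v}} for f, v in filters.items()]}},
        # A unique tiebreaker field (assumed here to be "id") keeps paging stable.
        "sort": [{sort_field: order}, {"id": "asc"}],
    }
    if search_after is not None:
        body["search_after"] = search_after
    return body


first_page = build_query_request({"username": "alice"}, "creation", order="desc", size=10)
print(json.dumps(first_page, indent=2))
```

A follow-up page would pass the last hit's sort values as `search_after`, leaving the rest of the body unchanged.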

@danhermann
Contributor

To clarify, I think we have a choice to make between extending the existing Get APIs or introducing new Query APIs.

I'm not sure if this is what you're suggesting, but I like the idea of a query API where cluster metadata such as API keys, cluster state, ingest pipelines, stats, etc., are exposed as documents in an index with all the features of the ES query DSL available. This would be much like system tables in relational databases, where table metadata, system configuration, stats, etc. are exposed through SQL, and it has the advantage of allowing users to interact with ES through a single unified API. It's not low-hanging fruit, but it would address a variety of requests that we get for being able to sort, paginate, summarize, etc. the responses to various APIs that return large results.

@martijnvg
Member

I like the idea of system indices for Elasticsearch configuration and stats. However, most of these things aren't really stored in an index, but perhaps we can make them appear retrievable/queryable? This is just an idea, but we could create a generic system query API:

  • Watches: GET /_system/_query/watches
  • Pipelines: GET /_system/_query/pipelines
  • Api keys: GET /_system/_query/api-keys

Implementing the query parameter for things that aren't stored in a system index is tricky. Perhaps we should also have a generic system list API that implements just pagination and sorting, with the system query API as an extension of that.

The goal would be a unified way to retrieve config and stats. Maybe something like this would help to get there.
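
A minimal sketch of the "generic system list API" idea above, supporting only sorting and pagination over resources that are not stored in an index. The `list_system_resources` function, the `providers` registry, and the sample data are all hypothetical.

```python
def list_system_resources(providers, kind, sort_by, descending=False, offset=0, size=10):
    """List one kind of resource with basic sorting and offset pagination."""
    if kind not in providers:
        raise KeyError(f"unknown resource kind: {kind}")
    # Each provider returns the full in-memory list for its resource kind.
    items = sorted(providers[kind](), key=lambda item: item[sort_by], reverse=descending)
    return {"total": len(items), "items": items[offset:offset + size]}


# Hypothetical providers standing in for pipelines and watches.
providers = {
    "pipelines": lambda: [{"id": "geoip"}, {"id": "apache"}, {"id": "nginx"}],
    "watches": lambda: [{"id": "disk-usage"}],
}

result = list_system_resources(providers, "pipelines", sort_by="id", size=2)
print(result)
```

Keeping the list operation this small is what would let a separate query operation be layered on top only for resources that can support it.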

@jasontedor
Member

I would like to push on the requirements a bit, before we embark too far on solving this because it's going to add complexity and increase the API surface area, etc. I start with the premise:

This is problematic for clusters that contain many indices because it requires the Kibana server to store the full set of tabular information in memory before returning it to the client. In the above example, this would be the full set of indices returned by GET /*?expand_wildcards=hidden,all as well as those returned by GET /_cat/indices?format=json&h=health,status,index,uuid,pri,rep,docs.count,sth,store.size&expand_wildcards=hidden,all&index=*. If this occupies more memory than is allocated to the Kibana server, it will cause the server to run out of memory and crash.

I struggle with this. If there are enough indices in the cluster that even a relatively small response like the list of indices is going to cause Kibana to run out of memory and crash, then how is Kibana going to deal with processing search and aggregation results into Discover or Lens? That is, I say relatively small because as soon as we have "enough indices" in the cluster to cause this problem, then surely there's enough data in the deployment to cause problems elsewhere in Kibana. If the Elasticsearch deployment is large, then Kibana needs to be large too. Responses to management APIs like listing indices are going to be small relative to the data that Elasticsearch could return in search results.

@albertzaharovits
Contributor

Point taken that paginating data that ES holds in memory in the cluster state is unimportant in practice. Maybe supporting the combination of Range and Content-Range HTTP headers at the network layer is appropriate in such cases, but I agree I don't feel the urgency of it.

But there is this case in Security where the get API Keys API returns docs from the .security index, and we should prepare for 10_000s of results. Also, unlike the other Security APIs, this uses field conjunction to filter down the results. We need to find a way to break down the response (for the ES node's sake too), which probably entails ordering. It is at this point that it sounds close to the point-in-time query API.

It could very well be a niche use case, so maybe we come up with a tailored approach and work from it to see if and how it generalizes?
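
For the Range and Content-Range idea mentioned earlier in this comment, a minimal sketch of serving a slice of a list from a range request. The custom `items` range unit and the `apply_item_range` helper are assumptions for illustration; nothing here reflects an existing Elasticsearch feature.

```python
import re


def apply_item_range(items, range_header):
    """Serve one slice of `items` for a header value such as 'items=0-9'."""
    match = re.fullmatch(r"items=(\d+)-(\d+)", range_header.strip())
    if not match:
        raise ValueError(f"unsupported Range header: {range_header!r}")
    first, last = int(match.group(1)), int(match.group(2))
    total = len(items)
    if first > last or first >= total:
        raise ValueError("range not satisfiable")
    last = min(last, total - 1)  # clamp to the final item
    content_range = f"items {first}-{last}/{total}"
    return items[first:last + 1], content_range


rows = list(range(25))
body, content_range = apply_item_range(rows, "items=20-29")
print(body, content_range)
```

The returned string follows the `unit first-last/total` shape of a Content-Range value, so a client can tell how many items remain without a separate count request.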

@cjcenizal
Contributor Author

Related to #74350, in which we're implementing pagination for the Get Snapshots API.

@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

10 participants