Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend size_limit setting in query engine to support unlimited index query. #703

Open
penghuo opened this issue Jul 20, 2022 · 7 comments
Open
Labels
enhancement New feature or request Priority-High

Comments

@penghuo
Copy link
Collaborator

penghuo commented Jul 20, 2022

Problem statements

Currently, the query.size_limit setting configure the maximum amount of documents to be pull from OpenSearch. The default value is: 200. for example, Let's say size_limit = 200, and index has 10K docs.

  • expected, return 200 rows.
    source=index
  • expected return 1 docs
    source=index | head 1
  • return 200 rows, it is not user expected.
    source=index | head 11000

Proposal

The query.size_limit configure the maximum amount of rows returned by query. The default value is: 200. size_limit must larger than 0. If the query has head(PPL) or limit(SQL). it will override the query.size_limit setting.

Expectation of search query .

  • should return 200 rows.
    source=index
  • should return 1 rows.
    source=index | head 1
  • should return 11000 rows.
    source=index | head 11000

Expectation of aggregation query.

  • should return 200 rows.
    source=index | stats request, count(*) by request
  • should return 11000 rows.
    source=index | stats request, count(*) by request | head 11000
@penghuo penghuo added enhancement New feature or request untriaged labels Jul 20, 2022
@penghuo penghuo mentioned this issue Jul 20, 2022
6 tasks
@dai-chen
Copy link
Collaborator

Probably another way of thinking about this: the size_limit setting is just for default behavior. If users specify a larger number by head or LIMIT, that means they're aware of what they're doing and just want to override the default limit value. This may be safer than setting size limit to -1 and user run a query without head command later?

@penghuo
Copy link
Collaborator Author

penghuo commented Jul 21, 2022

Probably another way of thinking about this: the size_limit setting is just for default behavior. If users specify a larger number by head or LIMIT, that means they're aware of what they're doing and just want to override the default limit value. This may be safer than setting size limit to -1 and user run a query without head command later?

Update the proposal as discussed.

@seankao-az
Copy link
Collaborator

seankao-az commented Aug 1, 2022

Design

OpenSearch Request

Request Operators

Non Aggregation Query

Interface to the OpenSearch engine used by the OpenSearchIndexScan physical plan

  1. OpenSearchQueryRequest
    The default request operator.
  2. OpenSearchScrollRequest
    This is used if query size exceeds the index.max_result_window setting. It invokes scroll requests to OpenSearch and fetches results in batches.
Aggregation Query

There's no scroll request for aggregation queries in OpenSearch.
For a composite (group by) aggregation query, the response contains a keyAfter field, which can be used in the next request to fetch the next buckets.

Request Builder

OpenSearchRequestBuilder builds OpenSearchQueryRequest or OpenSearchScrollRequest, depending on whether scrolling is needed.

Physical Plan Implementation

  1. Get index.max_result_window for indices.
  2. Initializes OpenSearchIndexScan, which contains OpenSearchRequestBuilder
  3. Visit logical plan with index scan as context, so logical operators visited will accumulate (push down) OpenSearch query and aggregation DSL on index scan. The operations are pushed down to the request builder.

Index Scan Execution

  1. Build the request upon plan.open()
  2. Fetch the results in batches of size maxResultWindow
  3. When plan.close(), clean up the cursor and context in OpenSearch engine if request type is OpenSearchScrollRequest

@seankao-az
Copy link
Collaborator

seankao-az commented Aug 11, 2022

Remaining issues:

  1. Extend query size limit for aggregation query requests
  2. Described as follows

Here we assume

query.size_limit = 200
index.max_result_window = 10000

These work as expected:

  • source=index returns 200 rows
  • source=index | head 1 returns 1 row
  • source=index | head 300 returns 300 rows
  • source=index | head 11000 returns 11000 rows using scroll
  • source=index | fields a,b returns 200 rows
  • source=index | fields a,b | head 1 returns 1 row

But these don't:

  • source=index | fields a,b | head 300 returns 200 rows
  • source=index | fields a,b | head 11000 returns 200 rows

The reason being that limit is only pushed down to index scan if they're optimized and merged into a single node.
In these two cases the index scan has query size 200 (query.size_limit).

Solution

Option 1

Better logical plan optimization so that the Project logical plan node doesn't block optimization for other plan nodes.
Project isn't merged with Relation / Index Scan, and thus stops Limit from merging with Relation / Index Scan

@seankao-az
Copy link
Collaborator

One note on the performance.
With this feature, there's no limitation on the size of the query result anymore, so it's possible that a single request-response cycle take too long and timeout.

@penghuo
Copy link
Collaborator Author

penghuo commented Oct 22, 2024

@seankao-az @dai-chen
Want to revisit the definition of plugins.query.size_limit,
currently, the definition of plugins.query.size_limit is , The new engine fetches a default size of index from OpenSearch set by this setting, the default value equals to max result window in index level (10000 by default). You can change the value to any value not greater than the max result window value in index level (index.max_result_window). https://github.com/opensearch-project/sql/blob/main/docs/user/admin/settings.rst#plugins-query-size-limit

In my opinion, there are two issues:

  1. It is unclear how plugins.query.size_limit works.
  2. It should not be tied to the max_result_window.

My proposal is
The query.size_limit configuration sets the maximum number of rows returned by a query. The default value is 10,000, and size_limit must be greater than 0.
Note: This limit applies regardless of whether the query includes HEAD (PPL) or LIMIT (SQL).

@seankao-az
Copy link
Collaborator

seankao-az commented Oct 23, 2024

makes sense to me. so query.size_limit can be any positive number, regardless to max_result_window.

Regarding

It is unclear how plugins.query.size_limit works.

I think we should let plugins.query.size_limit setting only decide the final result size, not size of any intermediate step.
Currently source=index | <other commands>, if no other operation is pushed down to DSL, then <other commands> will operate only on the 10000 (query.size_limit) results returned from the scan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Priority-High
Projects
None yet
Development

No branches or pull requests

4 participants