What is the bug?
In PPL queries where the HEAD command is not pushed down, the plugins.query.size_limit setting is used as the size parameter of the OpenSearch DSL request issued by the PPL index scan operator. The setting defaults to 200, so the scan returns at most 200 documents. Any PPL commands applied afterwards operate on this truncated result set and produce incorrect results.
How can one reproduce the bug?
Consider a PPL query where every command is pushed down to OpenSearch DSL. In this scenario, the behavior is as expected because OpenSearch DSL applies filtering, aggregation, sorting, and the size limit in order.
POST _plugins/_ppl
{
"query": """
source=opensearch_dashboards_sample_data_flights | sort -FlightTimeMin | head 5 | fields FlightTimeMin
"""
}
# Explain output
# Both the sort and head commands are pushed down to DSL, which returns the top 5 after sorting
{
"root": {
"name": "ProjectOperator",
"description": {
"fields": "[FlightTimeMin]"
},
"children": [
{
"name": "OpenSearchIndexScan",
"description": {
"request": """OpenSearchQueryRequest(indexName=opensearch_dashboards_sample_data_flights,
sourceBuilder={"from":0,"size":5,"timeout":"1m","_source":{"includes":["FlightTimeMin"],"excludes":[]},
"sort":[{"FlightTimeMin":{"order":"desc","missing":"_last"}}]}, searchDone=false)"""
},
"children": []
}
]
}
}
# Correct query result
{
"schema": [
{
"name": "FlightTimeMin",
"type": "float"
}
],
"datarows": [
[
1902.902
],
[
1837.689
],
[
1816.6504
],
[
1811.3477
],
[
1797.1661
]
],
"total": 5,
"size": 5
}
However, when not all commands are pushed down, the plugins.query.size_limit value (200 by default) is used as the DSL size. For instance, in a PPL query that includes an eval command, the DSL request returns at most 200 documents. Any sorting and limiting operations then run on this truncated dataset, yielding incorrect results.
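The effect of the truncation can be illustrated with a small simulation. This is only a sketch, not the plugin's code: the in-memory list stands in for the index, the values are synthetic, and `top_n_desc` is a hypothetical helper that mimics `sort -field | head n`.

```python
SIZE_LIMIT = 200  # default plugins.query.size_limit, per the description above


def top_n_desc(values, n):
    """Sort descending and keep the first n, like `sort -field | head n`."""
    return sorted(values, reverse=True)[:n]


# Hypothetical stand-in for an index: 10,000 synthetic values stored in
# ascending order, so the first 200 documents hold the smallest values.
flight_times = [float(i) for i in range(1, 10_001)]

# Buggy path: the scan returns only the first SIZE_LIMIT documents,
# and sort/head then run on that truncated set.
truncated = top_n_desc(flight_times[:SIZE_LIMIT], 5)

# Expected path: sort/head see every document.
correct = top_n_desc(flight_times, 5)

print(truncated)  # [200.0, 199.0, 198.0, 197.0, 196.0]
print(correct)    # [10000.0, 9999.0, 9998.0, 9997.0, 9996.0]
```

The truncated path returns the top 5 of an arbitrary 200-document subset, which is the kind of incorrect result described above.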
What is the expected behavior?
The plugins.query.size_limit setting is intended to act as a safeguard against excessively large result sets. It should not limit the number of documents scanned during query execution. To ensure correct query results, the proposal is to leverage the pagination capability introduced in PR 716.
Proposal
The proposal is to use pagination to handle larger datasets without compromising result accuracy. When subsequent PPL commands cannot be pushed down to OpenSearch DSL, the system should scan the entire DSL result set using pagination. This lets the PPL engine fetch all matching documents page by page and then apply the post-processing commands to generate the correct result. This aligns with the intuitive behavior observed in other databases and OpenSearch.
By implementing this approach, we ensure that the plugins.query.size_limit setting only controls the final result size, as expected, without affecting the correctness of the query results. This will prevent the truncation of results during intermediate processing steps, leading to accurate and reliable outcomes for PPL queries.
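The proposed flow could look roughly like the following. This is a minimal sketch under stated assumptions, not the implementation from PR 716: `make_fetch_page`, `scan_all`, and `run_query` are hypothetical names, the page fetcher reads from an in-memory list rather than issuing a paginated OpenSearch request, and the `hours` eval is an invented example of a command that is not pushed down.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Page = Tuple[List[dict], Optional[int]]  # (documents, cursor for the next page)


def make_fetch_page(docs: List[dict], page_size: int) -> Callable[[Optional[int]], Page]:
    """Build a hypothetical page fetcher over an in-memory list.

    A real implementation would issue a paginated OpenSearch request here.
    """
    def fetch_page(cursor: Optional[int]) -> Page:
        start = cursor or 0
        batch = docs[start:start + page_size]
        has_more = start + page_size < len(docs)
        return batch, (start + page_size if has_more else None)
    return fetch_page


def scan_all(fetch_page: Callable[[Optional[int]], Page]) -> Iterator[dict]:
    """Yield every document by following cursors until no page remains."""
    cursor = None
    while True:
        batch, cursor = fetch_page(cursor)
        yield from batch
        if cursor is None:
            return


def run_query(fetch_page, size_limit: int, head: int) -> List[dict]:
    """Apply eval, sort, and head in the engine over the full scan."""
    # eval (not pushed down): computed per document over the whole scan.
    evaluated = ({**d, "hours": d["FlightTimeMin"] / 60} for d in scan_all(fetch_page))
    # sort and head run on the complete, untruncated input.
    ordered = sorted(evaluated, key=lambda d: d["hours"], reverse=True)
    # size_limit only caps the final result size.
    return ordered[:min(head, size_limit)]


docs = [{"FlightTimeMin": float(i)} for i in range(1, 1001)]
result = run_query(make_fetch_page(docs, page_size=200), size_limit=200, head=5)
print([d["FlightTimeMin"] for d in result])  # [1000.0, 999.0, 998.0, 997.0, 996.0]
```

In this sketch the limit is applied only once, at the end, so the page size affects how many requests are made but not which rows are returned.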
TODO
Pagination on composite aggregation is not supported yet? Update limitation doc accordingly after this change.
For the simple case above, I think sort and limit could still be pushed down into OpenSearch DSL through the evalOperator when the eval is just an equal expression. We can do field replacement before pushing down.
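One way to read this suggestion, assuming "equal expression" means an eval that only aliases an existing field, is sketched below. The function `push_down_sort`, the `eval_aliases` mapping, and the `duration` alias are hypothetical and not part of the plugin. The returned dictionary only mirrors the shape of the sourceBuilder shown in the explain output above.

```python
def push_down_sort(eval_aliases, sort_field, descending, head):
    """Replace an eval alias with its source field, then build a DSL-like body.

    eval_aliases maps alias name -> original field name (hypothetical structure).
    """
    # Field replacement: sort on the underlying field instead of the alias.
    target = eval_aliases.get(sort_field, sort_field)
    order = "desc" if descending else "asc"
    return {"from": 0, "size": head, "sort": [{target: {"order": order}}]}


# Example: `eval duration = FlightTimeMin | sort -duration | head 5`
body = push_down_sort({"duration": "FlightTimeMin"}, "duration", True, 5)
print(body)
# {'from': 0, 'size': 5, 'sort': [{'FlightTimeMin': {'order': 'desc'}}]}
```

This only covers pure aliases. An eval that computes a new value would still need the full scan described in the proposal.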