Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Inaccurate PPL query results due to plugins.query.size_limit restriction #2802

Open
dai-chen opened this issue Jul 3, 2024 · 1 comment
Labels
bug Something isn't working PPL Piped processing language

Comments

@dai-chen
Copy link
Collaborator

dai-chen commented Jul 3, 2024

What is the bug?

Currently, in PPL queries without the HEAD command being pushed down, the plugins.query.size_limit setting is used as the size parameter in the OpenSearch DSL within the PPL index scan operator. This setting defaults to 200, which is insufficient for many use cases and results in incorrect query outcomes. This is primarily because the small size limit truncates the result set prematurely, leading to inaccurate query results when further PPL commands are applied.

How can one reproduce the bug?

Consider a PPL query where every command is pushed down to OpenSearch DSL. In this scenario, the behavior is as expected because OpenSearch DSL ensures filtering, aggregation, sorting and size limit executed in order.

POST _plugins/_ppl
{
  "query": """
    source=opensearch_dashboards_sample_data_flights | sort -FlightTimeMin | head 5 | fields FlightTimeMin
  """
}

# Explain output
# Both sort and head command pushed down to DSL which generates top 5 after sorting
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[FlightTimeMin]"
    },
    "children": [
      {
        "name": "OpenSearchIndexScan",
        "description": {
          "request": """OpenSearchQueryRequest(indexName=opensearch_dashboards_sample_data_flights, 
sourceBuilder={"from":0,"size":5,"timeout":"1m","_source":{"includes":["FlightTimeMin"],"excludes":[]},
"sort":[{"FlightTimeMin":{"order":"desc","missing":"_last"}}]}, searchDone=false)"""
        },
        "children": []
      }
    ]
  }
}

# Correct query result
{
  "schema": [
    {
      "name": "FlightTimeMin",
      "type": "float"
    }
  ],
  "datarows": [
    [
      1902.902
    ],
    [
      1837.689
    ],
    [
      1816.6504
    ],
    [
      1811.3477
    ],
    [
      1797.1661
    ]
  ],
  "total": 5,
  "size": 5
}

However, in cases where not all commands are pushed down, the default plugins.query.size_limit setting (200) is used. For instance, in a PPL query that includes an eval command, the default setting restricts the DSL output to 200 documents. As a result, any sorting and limiting operations are performed on this truncated dataset, yielding incorrect results.

POST _plugins/_ppl
{
  "query": """
    source=opensearch_dashboards_sample_data_flights
    | eval FlightMin = FlightTimeMin 
    | sort -FlightMin | head 5 | fields FlightMin
  """
}

# Explain output
# Without pushdown, the default setting value is used and DSL only output 200 docs
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[FlightMin]"
    },
    "children": [
      {
        "name": "LimitOperator",
        "description": {
          "limit": 5,
          "offset": 0
        },
        "children": [
          {
            "name": "SortOperator",
            "description": {
              "sortList": {
                "FlightMin": {
                  "sortOrder": "DESC",
                  "nullOrder": "NULL_LAST"
                }
              }
            },
            "children": [
              {
                "name": "EvalOperator",
                "description": {
                  "expressions": {
                    "FlightMin": "FlightTimeMin"
                  }
                },
                "children": [
                  {
                    "name": "OpenSearchIndexScan",
                    "description": {
                      "request": """OpenSearchQueryRequest(indexName=opensearch_dashboards_sample_data_flights, 
sourceBuilder={"from":0,"size":200,"timeout":"1m"}, searchDone=false)"""
                    },
                    "children": []
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

# Incorrect query result
{
  "schema": [
    {
      "name": "FlightMin",
      "type": "float"
    }
  ],
  "datarows": [
    [
      1493.3428
    ],
    [
      1404.9293
    ],
    [
      1227.7903
    ],
    [
      1138.5007
    ],
    [
      1090.7211
    ]
  ],
  "total": 5,
  "size": 5
}

What is the expected behavior?

The plugins.query.size_limit setting is intended to act as a safeguard to prevent excessively large result sets. It should not impact the number of documents scanned during query execution. To ensure the correctness of query results, it is proposed to leverage the pagination capability introduced in PR 716.

Proposal

The proposal involves using pagination to handle larger datasets without compromising result accuracy. When subsequent PPL commands cannot be pushed down to OpenSearch DSL, the system should scan the entire output results from the DSL using pagination. This approach allows the PPL engine to gather all necessary documents in pages, process them, and then apply the post-processing commands to generate the correct result. This aligns with the intuitive behavior observed in other databases and OpenSearch.

By implementing this approach, we ensure that the plugins.query.size_limit setting only controls the final result size, as expected, without affecting the correctness of the query results. This will prevent the truncation of results during intermediate processing steps, leading to accurate and reliable outcomes for PPL queries.

TODO

  1. Pagination on composite aggregation is not supported yet? Update limitation doc accordingly after this change.
  2. Check if any change required by SQL

Do you have any additional context?

  1. Plugin setting: https://github.com/opensearch-project/sql/blob/main/docs/user/admin/settings.rst#plugins-query-size-limit
  2. Current limitation: https://github.com/opensearch-project/sql/blob/main/docs/user/dql/basics.rst#limitation
@qianheng-aws
Copy link
Contributor

Just for the simple case above, I think it's able to support push down sort and limit into OpenSearch DSL through the evalOperator which is just equal expression. We can do fields replacement before pushing down.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working PPL Piped processing language
Projects
None yet
Development

No branches or pull requests

2 participants