[BUG] Inaccurate PPL query results due to plugins.query.size_limit restriction #2802

dai-chen · 2024-07-03T18:09:07Z

What is the bug?

Currently, when the head command in a PPL query is not pushed down, the PPL index scan operator uses the plugins.query.size_limit setting as the size parameter of the OpenSearch DSL request. The setting defaults to 200, which is too small for many use cases: the scan result is truncated before the remaining PPL commands run, so those commands operate on an incomplete dataset and return incorrect results.

How can one reproduce the bug?

Consider a PPL query where every command is pushed down to OpenSearch DSL. In this scenario the behavior is as expected, because OpenSearch DSL applies filtering, aggregation, sorting, and the size limit in that order.

POST _plugins/_ppl
{
  "query": """
    source=opensearch_dashboards_sample_data_flights | sort -FlightTimeMin | head 5 | fields FlightTimeMin
  """
}

# Explain output
# Both the sort and head commands are pushed down to the DSL, which returns the top 5 after sorting
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[FlightTimeMin]"
    },
    "children": [
      {
        "name": "OpenSearchIndexScan",
        "description": {
          "request": """OpenSearchQueryRequest(indexName=opensearch_dashboards_sample_data_flights, 
sourceBuilder={"from":0,"size":5,"timeout":"1m","_source":{"includes":["FlightTimeMin"],"excludes":[]},
"sort":[{"FlightTimeMin":{"order":"desc","missing":"_last"}}]}, searchDone=false)"""
        },
        "children": []
      }
    ]
  }
}

# Correct query result
{
  "schema": [
    {
      "name": "FlightTimeMin",
      "type": "float"
    }
  ],
  "datarows": [
    [
      1902.902
    ],
    [
      1837.689
    ],
    [
      1816.6504
    ],
    [
      1811.3477
    ],
    [
      1797.1661
    ]
  ],
  "total": 5,
  "size": 5
}

However, in cases where not all commands are pushed down, the default plugins.query.size_limit setting (200) is used. For instance, in a PPL query that includes an eval command, the default setting restricts the DSL output to 200 documents. As a result, any sorting and limiting operations are performed on this truncated dataset, yielding incorrect results.

POST _plugins/_ppl
{
  "query": """
    source=opensearch_dashboards_sample_data_flights
    | eval FlightMin = FlightTimeMin 
    | sort -FlightMin | head 5 | fields FlightMin
  """
}

# Explain output
# Without pushdown, the default setting value is used and the DSL returns only 200 docs
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[FlightMin]"
    },
    "children": [
      {
        "name": "LimitOperator",
        "description": {
          "limit": 5,
          "offset": 0
        },
        "children": [
          {
            "name": "SortOperator",
            "description": {
              "sortList": {
                "FlightMin": {
                  "sortOrder": "DESC",
                  "nullOrder": "NULL_LAST"
                }
              }
            },
            "children": [
              {
                "name": "EvalOperator",
                "description": {
                  "expressions": {
                    "FlightMin": "FlightTimeMin"
                  }
                },
                "children": [
                  {
                    "name": "OpenSearchIndexScan",
                    "description": {
                      "request": """OpenSearchQueryRequest(indexName=opensearch_dashboards_sample_data_flights, 
sourceBuilder={"from":0,"size":200,"timeout":"1m"}, searchDone=false)"""
                    },
                    "children": []
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

# Incorrect query result
{
  "schema": [
    {
      "name": "FlightMin",
      "type": "float"
    }
  ],
  "datarows": [
    [
      1493.3428
    ],
    [
      1404.9293
    ],
    [
      1227.7903
    ],
    [
      1138.5007
    ],
    [
      1090.7211
    ]
  ],
  "total": 5,
  "size": 5
}
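The difference between the two cases can be reproduced outside OpenSearch. The following is a minimal sketch, not the plugin's code: it assumes a synthetic in-memory "index" of made-up values in ascending order (not the sample flights data) and a hypothetical top_k_desc helper standing in for `sort -field | head k`.

```python
SIZE_LIMIT = 200  # default plugins.query.size_limit cited in this issue


def top_k_desc(values, k):
    """Sort descending and keep the first k values (like `sort -field | head k`)."""
    return sorted(values, reverse=True)[:k]


# Synthetic stand-in for the index: 1000 made-up values stored in ascending order.
docs = [float(i) for i in range(1000)]

# Pushdown case: sorting sees the whole index, then 5 hits are returned.
pushed_down = top_k_desc(docs, 5)

# Non-pushdown case: the scan returns only the first SIZE_LIMIT docs in index
# order, and sort/head then run on that truncated slice.
truncated = top_k_desc(docs[:SIZE_LIMIT], 5)

print("pushed down:", pushed_down)
print("truncated:  ", truncated)
```

The truncated run can only return values that happen to fall within the first 200 scanned documents, which is why the second query above reports a smaller maximum than the first.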

What is the expected behavior?

The plugins.query.size_limit setting is intended as a safeguard against excessively large result sets. It should not affect the number of documents scanned during query execution. To keep query results correct, I propose using the pagination capability introduced in PR 716.

Proposal

The proposal is to use pagination to handle larger datasets without compromising result accuracy. When subsequent PPL commands cannot be pushed down to OpenSearch DSL, the system should scan the full DSL result set using pagination. The PPL engine can then gather all necessary documents page by page and apply the post-processing commands to produce the correct result. This matches the intuitive behavior of other databases and of OpenSearch itself.

By implementing this approach, we ensure that the plugins.query.size_limit setting only controls the final result size, as expected, without affecting the correctness of the query results. This will prevent the truncation of results during intermediate processing steps, leading to accurate and reliable outcomes for PPL queries.
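As a rough illustration of the proposed flow, here is a minimal sketch under stated assumptions. The names scan_all, run_query, and fetch_page are hypothetical and not part of the plugin, and the offset-based paging over an in-memory list is a simplification: a real implementation would use whatever cursor mechanism the pagination feature provides.

```python
from typing import Callable, Iterator, List


def scan_all(fetch_page: Callable[[int, int], List[float]],
             page_size: int) -> Iterator[float]:
    """Yield every document by requesting consecutive pages until one comes back short."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:
            return
        offset += len(page)


def run_query(fetch_page, page_size: int, head: int, size_limit: int) -> List[float]:
    """Scan everything, apply sort + head in the engine, then cap the final result."""
    rows = sorted(scan_all(fetch_page, page_size), reverse=True)[:head]
    # size_limit only bounds what is returned to the client.
    return rows[:size_limit]


# Hypothetical in-memory index standing in for OpenSearch (made-up values).
index = [float(i) for i in range(1000)]


def fetch_page(offset: int, size: int) -> List[float]:
    return index[offset:offset + size]


result = run_query(fetch_page, page_size=200, head=5, size_limit=200)
print(result)
```

In this sketch the page size and the size limit are independent: the page size governs how documents are fetched, while the limit applies only to the rows handed back at the end.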

TODO

Pagination on composite aggregation is not supported yet? Update limitation doc accordingly after this change.
Check whether any change is required for SQL.

Do you have any additional context?


qianheng-aws · 2024-07-24T05:18:09Z

For the simple case above, I think sort and limit could still be pushed down into OpenSearch DSL through the EvalOperator, since the eval here is just an equality expression (a plain field alias). We can replace the fields before pushing down.
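The field replacement idea can be sketched as follows. This is an illustrative assumption, not the plugin's planner code: resolve_sort_fields and the alias dictionary are hypothetical, and only evals that are plain aliases of another field are handled.

```python
def resolve_sort_fields(sort_fields, eval_aliases):
    """Map each sort field through eval aliases to the underlying index field."""
    resolved = []
    for name, order in sort_fields:
        seen = set()
        # Follow alias chains, guarding against cycles.
        while name in eval_aliases and name not in seen:
            seen.add(name)
            name = eval_aliases[name]
        resolved.append((name, order))
    return resolved


# From the query above: eval FlightMin = FlightTimeMin
aliases = {"FlightMin": "FlightTimeMin"}
pushed = resolve_sort_fields([("FlightMin", "desc")], aliases)
print(pushed)
```

Once the sort key refers to an index field again, the sort and head could in principle be pushed down as in the first query.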

dai-chen added the bug, untriaged, and PPL labels and removed the untriaged label on Jul 3, 2024

qianheng-aws mentioned this issue Jul 24, 2024

[FEATURE] Top-K enhancement when having plan like SortOperator + LimitOperator #2857

Open

LantaoJin mentioned this issue Jul 25, 2024

[RFC] Change the default value of plugins.query.size_limit to 10000 (MAX_RESULT_WINDOW) #2859

Closed
