[Discover][Search Search] Discrepancy in results in Discover and devtools when sorting and size present #92843

Closed
majagrubic opened this issue Feb 25, 2021 · 8 comments
Labels
bug Fixes for quality problems that affect the customer experience Feature:Discover Discover Application Feature:Search Querying infrastructure in Kibana Team:Visualizations Visualization editors, elastic-charts and infrastructure

Comments

@majagrubic
Contributor

majagrubic commented Feb 25, 2021

Kibana version:
7.12, testing on cloud

We have a large dataset, currently returning 3156517 hits. When sorting on a column, Discover sends a size parameter together with a sort parameter, e.g.:

"size": 1000,
  "sort": [
    {
      "@timestamp": {
        "order": "asc",
        "unmapped_type": "boolean"
      }
    },
    {
      "relatedContent.og:title.keyword": {
        "order": "asc",
        "unmapped_type": "boolean"
      }
    }
  ],

In the response, all hits are returned, disregarding the size parameter:

"hits": {
  "total": 3156517,
  "max_score": null,

Due to this, sort appears broken. When executing the same request from the devtools console, only 1000 hits are returned, as I would expect.
Full request:

{
  "size": 1000,
  "sort": [
    {
      "@timestamp": {
        "order": "asc",
        "unmapped_type": "boolean"
      }
    },
    {
      "relatedContent.og:title.keyword": {
        "order": "asc",
        "unmapped_type": "boolean"
      }
    }
  ],
  "version": true,
  "fields": [
    {
      "field": "*",
      "include_unmapped": "true"
    },
    {
      "field": "@timestamp",
      "format": "strict_date_optional_time"
    },
    {
      "field": "relatedContent.article:modified_time",
      "format": "strict_date_optional_time"
    },
    {
      "field": "relatedContent.article:published_time",
      "format": "strict_date_optional_time"
    },
    {
      "field": "utc_time",
      "format": "strict_date_optional_time"
    }
  ],
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "30m",
        "time_zone": "Europe/Dublin",
        "min_doc_count": 1
      }
    }
  },
  "script_fields": {},
  "stored_fields": [
    "*"
  ],
  "runtime_mappings": {},
  "_source": false,
  "query": {
    "bool": {
      "must": [],
      "filter": [
        {
          "match_all": {}
        },
        {
          "range": {
            "@timestamp": {
              "gte": "2020-11-18T00:00:00.000Z",
              "lte": "2020-11-19T00:00:00.000Z",
              "format": "strict_date_optional_time"
            }
          }
        }
      ],
      "should": [],
      "must_not": [
        {
          "match_phrase": {
            "_id": "PAdPyncBulkzspp2YBUP"
          }
        }
      ]
    }
  },
  "highlight": {
    "pre_tags": [
      "@kibana-highlighted-field@"
    ],
    "post_tags": [
      "@/kibana-highlighted-field@"
    ],
    "fields": {
      "*": {}
    },
    "fragment_size": 2147483647
  }
}

Where is this discrepancy coming from? I assume something in the search_source?
I wonder if we should switch off sorting if this is something we can't support atm?

@majagrubic majagrubic added bug Fixes for quality problems that affect the customer experience Feature:Discover Discover Application Team:Visualizations Visualization editors, elastic-charts and infrastructure labels Feb 25, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-app (Team:KibanaApp)

@majagrubic majagrubic added Feature:Search Querying infrastructure in Kibana Team:AppServices labels Feb 25, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-app-services (Team:AppServices)

@wylieconlon
Contributor

@majagrubic Where is the bug here? size does not affect hits.total; hits.total tells you how many matches your query has. So your query can match 1 million documents, but you only get size results per page.

@lizozom
Contributor

lizozom commented Mar 1, 2021

@majagrubic
Generally speaking, the size request argument controls how many documents are returned alongside the aggregations.
The total response parameter is how many documents match the current query (and it's controlled by setting track_total_hits, if you like).

Could you please elaborate on the discrepancy you're seeing?

@majagrubic
Contributor Author

majagrubic commented Mar 4, 2021

I thought size should be the maximum number of documents returned by the query?
So, when executing the same request from DevTools console and Discover, the response is different.
Request:

GET logstash-*/_search
{
  "size": 1000,
  "sort": [
    {
      "@timestamp": {
        "order": "asc",
        "unmapped_type": "boolean"
      }
    },
    {
      "relatedContent.og:type.keyword": {
        "order": "desc",
        "unmapped_type": "boolean"
      }
    }
  ],
  "version": true,
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "30d",
        "time_zone": "Europe/Dublin",
        "min_doc_count": 1
      }
    }
  },
  "fields": [
    {
      "field": "@timestamp",
      "format": "date_time"
    },
    {
      "field": "relatedContent.article:modified_time",
      "format": "date_time"
    },
    {
      "field": "relatedContent.article:published_time",
      "format": "date_time"
    },
    {
      "field": "utc_time",
      "format": "date_time"
    }
  ],
  "script_fields": {},
  "stored_fields": [
    "@message",
    "@message.keyword",
    "@tags",
    "@tags.keyword",
    "@timestamp",
    "_id",
    "_index",
    "_score",
    "_source",
    "_type",
    "agent",
    "agent.keyword",
    "bytes",
    "clientip",
    "extension",
    "extension.keyword",
    "geo.coordinates",
    "geo.dest",
    "geo.src",
    "geo.srcdest",
    "headings",
    "headings.keyword",
    "host",
    "host.keyword",
    "id",
    "index",
    "index.keyword",
    "ip",
    "links",
    "links.keyword",
    "longValues",
    "longValues.keyword",
    "longValuesWithSpaces",
    "longValuesWithSpaces.keyword",
    "machine.os",
    "machine.os.keyword",
    "machine.ram",
    "memory",
    "meta.char",
    "meta.related",
    "meta.user.firstname",
    "meta.user.lastname",
    "phpmemory",
    "referer",
    "relatedContent.article:modified_time",
    "relatedContent.article:published_time",
    "relatedContent.article:section",
    "relatedContent.article:section.keyword",
    "relatedContent.article:tag",
    "relatedContent.article:tag.keyword",
    "relatedContent.og:description",
    "relatedContent.og:description.keyword",
    "relatedContent.og:image",
    "relatedContent.og:image.keyword",
    "relatedContent.og:image:height",
    "relatedContent.og:image:height.keyword",
    "relatedContent.og:image:width",
    "relatedContent.og:image:width.keyword",
    "relatedContent.og:site_name",
    "relatedContent.og:site_name.keyword",
    "relatedContent.og:title",
    "relatedContent.og:title.keyword",
    "relatedContent.og:type",
    "relatedContent.og:type.keyword",
    "relatedContent.og:url",
    "relatedContent.og:url.keyword",
    "relatedContent.twitter:card",
    "relatedContent.twitter:card.keyword",
    "relatedContent.twitter:description",
    "relatedContent.twitter:description.keyword",
    "relatedContent.twitter:image",
    "relatedContent.twitter:image.keyword",
    "relatedContent.twitter:site",
    "relatedContent.twitter:site.keyword",
    "relatedContent.twitter:title",
    "relatedContent.twitter:title.keyword",
    "relatedContent.url",
    "relatedContent.url.keyword",
    "request",
    "request.keyword",
    "response",
    "response.keyword",
    "spaces",
    "spaces.keyword",
    "thisisaverylongfieldnamethatevendoesnotcontainanyspaceswhyitcouldpotentiallybreakouruiinseveralplaces",
    "thisisaverylongfieldnamethatevendoesnotcontainanyspaceswhyitcouldpotentiallybreakouruiinseveralplaces.keyword",
    "timestamp",
    "timestamp.keyword",
    "unmappedField1",
    "unmappedField1.keyword",
    "unmappedField2",
    "url",
    "url.keyword",
    "utc_time",
    "xss",
    "xss.keyword"
  ],
  "runtime_mappings": {},
  "_source": {
    "excludes": []
  },
  "query": {
    "bool": {
      "must": [],
      "filter": [
        {
          "match_all": {}
        },
        {
          "range": {
            "@timestamp": {
              "gte": "2013-03-04T07:25:26.720Z",
              "lte": "2021-03-04T07:25:26.720Z",
              "format": "strict_date_optional_time"
            }
          }
        }
      ],
      "should": [],
      "must_not": [
        {
          "match_phrase": {
            "_id": "PAdPyncBulkzspp2YBUP"
          }
        }
      ]
    }
  },
  "highlight": {
    "pre_tags": [
      "@kibana-highlighted-field@"
    ],
    "post_tags": [
      "@/kibana-highlighted-field@"
    ],
    "fields": {
      "*": {}
    },
    "fragment_size": 2147483647
  }
}

Devtools response:

  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ... ]

Discover response:

 "hits": {
   "total": 35813999,
   "max_score": null,
   "hits": [ ... ]

Why this discrepancy?

@flash1293
Contributor

flash1293 commented Mar 4, 2021

Edit: Should have checked the code first.

@majagrubic I first thought this was related to the rest_total_hits_as_int flag for BWC, but after looking into the code I can't find a reference to it. @lizozom is it possible the search source is rewriting the result somehow so the consumers don't have to change how they are using the hit count?

Edit2: I think I found it:

* Temporary workaround until https://github.com/elastic/kibana/issues/26356 is addressed.

I guess Discover should change its own logic at some point. @timroes do we track this somewhere? I vaguely remember plans to change how Discover fetches hits plus aggregations, maybe it's part of that, but just to make sure.

@kertal
Member

kertal commented Mar 4, 2021

@flash1293 should be part of this task #69134

@timroes
Contributor

timroes commented Mar 4, 2021

(EDIT: I've updated my response since it was written in a somewhat confusing way)

As Matthias pointed out, the work to change the Discover logic is tracked in #69134. I've left a rather detailed description of how it should work in #69134 (comment).

And just to recap on the discrepancy (basically what Joe said):

Dev Tools sends the query as is, and the default behavior in ES is no longer to track total hits exactly (by default the count is only accurate up to 10,000, which is why the Dev Tools response shows value: 10000 with relation: gte). When using search source, we still set the track_total_hits flag by default, which makes sure we track the exact hit count. Also, for now we simulate the old API in the response, without relation (so we didn't need to adjust all apps yet). There is a flag, legacyHitsTotal: false (default is true), in the search API that disables this shimming behavior, so you can move apps to the new API response.
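
To make the shimming described above concrete, here is a minimal TypeScript sketch of collapsing the object form of hits.total into the legacy integer. The types and the toLegacyHitsTotal helper are hypothetical names made up for illustration; only the legacyHitsTotal flag name and the two response shapes come from this thread, and Kibana's actual implementation may differ.

```typescript
// Hypothetical types and helper, for illustration only; not Kibana's actual code.
type TotalHits = number | { value: number; relation: "eq" | "gte" };

interface SearchHits<T> {
  total?: TotalHits;
  max_score: number | null;
  hits: T[];
}

interface SearchResponse<T> {
  hits: SearchHits<T>;
}

// Collapse the object form of hits.total into the legacy plain integer.
// With legacyHitsTotal set to false, the response is passed through untouched.
function toLegacyHitsTotal<T>(
  response: SearchResponse<T>,
  legacyHitsTotal: boolean = true
): SearchResponse<T> {
  if (!legacyHitsTotal) return response;
  const total = response.hits.total;
  if (total === undefined || typeof total === "number") return response;
  return { ...response, hits: { ...response.hits, total: total.value } };
}

// The shape ES returns when the total is tracked exactly.
const fromEs: SearchResponse<unknown> = {
  hits: { total: { value: 35813999, relation: "eq" }, max_score: null, hits: [] },
};

const shimmed = toLegacyHitsTotal(fromEs);          // total becomes a plain number
const untouched = toLegacyHitsTotal(fromEs, false); // total keeps value and relation
console.log(shimmed.hits.total, untouched.hits.total);
```

Note that the shim drops relation, so a consumer of the legacy shape cannot tell an exact count from a lower bound; that is presumably why search source also requests exact tracking by default.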

The rest_total_hits_as_int flag (which Joe mentioned above) was replaced by track_total_hits, since the former is deprecated and might be removed soon, while track_total_hits will continue to exist.

So you can already set track_total_hits on a search source to false (or a number) together with legacyHitsTotal: false, in which case you get the real new behavior, the same as when querying ES directly. I'd also assume we want to remove the response transformation to the legacy format in the future, but we need to make sure ALL consumers have adapted to the new format beforehand, which is why the easier solution for now was to just transform the response.

Also, just for additional clarification: size does not affect how many documents are counted (and make it into hits.total), only how many documents are returned as part of hits.hits. So the upper bound of hits.total is controlled solely by track_total_hits and not by size.
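
The split between size and track_total_hits can be sketched with a small in-memory stand-in for a search. This is an illustration of the semantics only, not an Elasticsearch client call: simulateSearch and its types are hypothetical, and the matching documents are generated locally.

```typescript
// In-memory stand-in for a search; hypothetical helper, not an Elasticsearch client call.
type Relation = "eq" | "gte";

interface SimulatedHits<T> {
  total?: { value: number; relation: Relation };
  hits: T[];
}

function simulateSearch<T>(
  matches: T[],
  size: number,
  trackTotalHits: boolean | number
): SimulatedHits<T> {
  // size only limits the page of documents that is returned.
  const page = matches.slice(0, size);

  // track_total_hits: false omits the total altogether.
  if (trackTotalHits === false) return { hits: page };

  // track_total_hits: true counts every match exactly.
  if (trackTotalHits === true) {
    return { total: { value: matches.length, relation: "eq" }, hits: page };
  }

  // A number is a counting threshold: beyond it, only a lower bound is reported.
  const capped = matches.length > trackTotalHits;
  return {
    total: {
      value: capped ? trackTotalHits : matches.length,
      relation: capped ? "gte" : "eq",
    },
    hits: page,
  };
}

// 25,000 matching documents, page size 1000 in every case.
const matches = Array.from({ length: 25000 }, (_, i) => ({ id: i }));
const exactResult = simulateSearch(matches, 1000, true);
const cappedResult = simulateSearch(matches, 1000, 10000);
const untrackedResult = simulateSearch(matches, 1000, false);
console.log(exactResult.total, cappedResult.total, untrackedResult.total);
```

In all three calls hits.hits holds 1000 documents; only the reported total changes, which mirrors the difference between the Discover and Dev Tools responses above.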

Since there is no issue here present, I'll close this, but please let me know if there's anything still unclear.

@timroes timroes closed this as completed Mar 4, 2021