Unexpected number of total hits when running a kNN query #176

juliusbachnick · 2020-07-23T13:20:57Z

Hey,

First of all: Thanks for the effort you put into this plugin! I'm trying to run some Elasticsearch aggregations on the document set that is returned when querying for the neighbors of a vector. However, the number of returned total hits (as far as I'm aware all hits are used as document set for ES aggregations) is sometimes greater than the specified k in the original query.

For example, given the following statements are executed to set up the index:

PUT /my_index
{
  "settings": {
     "number_of_shards": 4,
     "index.knn": true,
     "index.knn.algo_param.ef_search": 200,
     "index.knn.algo_param.ef_construction": 200,
     "index.knn.algo_param.m": 32
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 2
      }
    }
  }
}

PUT /my_index/_doc/1
{
  "my_vector": [1.5, 2.5]
}

PUT /my_index/_doc/2
{
  "my_vector": [2.5, 3.5]
}

PUT /my_index/_doc/3
{
  "my_vector": [3.5, 4.5]
}

When executing a the following kNN query:

POST /my_index/_search
{
  "size": 3,
  "query": {
    "knn": {
      "my_vector": {
        "vector": [3, 4],
        "k": 3
      }
    }
  }
}

The returned total hits equal k:

{
    "took": 4,
    "timed_out": false,
    "_shards": {
        "total": 4,
        "successful": 4,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 3,
            "relation": "eq"
        },
[...]
}

However, once I add another document:

PUT /my_index/_doc/4
{
  "my_vector": [4.5, 5.5]
}

When running the exact same kNN query as above (with k=3), the number of reported total hits increases to 4, even though k was specified as 3:

{
    "took": 5,
    "timed_out": false,
    "_shards": {
        "total": 4,
        "successful": 4,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 4,
            "relation": "eq"
        },
[...]
}

I'm running the docker image amazon/opendistro-for-elasticsearch:1.4.0:

  "version" : {
    "number" : "7.4.2",
    "build_flavor" : "oss",
    "build_type" : "tar",
    "build_hash" : "2f90bbf7b93631e52bafb59b3b049cb44ec25e96",
    "build_date" : "2019-10-28T20:40:44.881551Z",
    "build_snapshot" : false,
    "lucene_version" : "8.2.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  }

I assumed that k would limit the number of returned hits (i.e. return early once the desired number of neighbours is reached), but could not really find anything more specific in the documentation, apart from:

In this case, k is the number of neighbors you want the query to return, but you must also include the size option. Otherwise, you get k results for each shard (and each segment) rather than k results for the entire query. The plugin supports a maximum k value of 10,000.

So I wonder, whether I'm not using the k parameter correct or whether this is a bug. I could aggregate all values on the client side using the returned documents as specified by the size parameter but one of the reasons that I chose the kNN implementation for ES, are the aggregations that I can run on top of queries within the same ES query.

Thanks for clarifying this :).

The text was updated successfully, but these errors were encountered:

vamshin · 2020-07-23T19:08:18Z

Hi @juliusbachnick,

This is not a bug and expected behavior from k-NN design. Plugin extracts k neighbors from each segment inside the shard and that accounts for atleast k hits.

During search, k-NN plugin asks each segment to return the 'k' results which accounts to 'k' hits per segment. size parameter in the query helps to give you closest neighbors among them. Having size=k would give atmost 'k' nearest neighbors. For example, in your case lets say all the first 3 docs went to segment0 and 4th doc went to segment1 then we would get 3 hits from segment 1 and 1 hit from segment 2 accounting to 4 hits.

You could forcemerge to 1 segment if you want hits to be <=k.

curl -X POST "localhost:9200/<indexName>/_forcemerge?max_num_segments=1&pretty"

juliusbachnick · 2020-08-03T10:02:49Z

Hey @vamshin,

Thank you for looking into this and for giving a detailed explanation of how the k-NN plugin works internally. Is the behaviour of getting up to k results per segment documented somewhere? Otherwise, it would be nice if this is added to avoid future confusion about the returned number of total hits.

The forcemerge approach seems like a workaround to me which I would like to avoid. As I understand, the k-NN Plugin should seamlessly integrate with Elasticsearch, i.e. it is possible to combine it with other ES features such as aggregations. However, given the underlying logic I'm not sure how I would use a k-NN query in conjunction with ES aggregations as the aggregations always operate on all returned hits. As long as the index only has one shard, I could work around this by using a sampler aggregation to only aggregate on the top k results. As soon as use more than one shard, this is not a viable option anymore.

Can you provide an example how aggregations on k results are supposed to be used if k+n total hits are returned?

vamshin · 2020-08-11T05:58:24Z

Hi @juliusbachnick,

Good point. Seems like a limitation on usage of aggregations on the k-NN search. Possible work around i could see is having shards with segments force merged to 1 segment. Need to dig into aggregations to see if there is a way to overcome this problem. Let us know if you figure out the solution. Will keep this thread open for further analysis.

fgrosse · 2020-10-11T16:25:19Z

Hey @vamshin thanks for looking into this topic. Is there any update on this issue? I am wondering if the described use case is in some way unusual since using Elasticsearch aggregations on top of the kNN results is listed as a feature in the second sentence of the README.md

[…] You can use aggregations and filter clauses to further refine your similarity search operations. […]

Should this section be updated to explain the limitations?

vamshin · 2021-02-27T06:35:43Z

Sorry for getting late on this thread. Elasticsearch performs aggregations on all the documents that are "hit". Since 'k' nearest neighbors are evaluated at segment level, each segment return at most 'k' hits. Could not think of a way to apply aggregations on just the final top 'k' results at coordinator node level. @juliusbachnick were you able to figure out a way to do this?

vamshin added the question Further information is requested label Jul 24, 2020

vamshin self-assigned this Feb 27, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unexpected number of total hits when running a kNN query #176

Unexpected number of total hits when running a kNN query #176

juliusbachnick commented Jul 23, 2020

vamshin commented Jul 23, 2020 •

edited

Loading

juliusbachnick commented Aug 3, 2020

vamshin commented Aug 11, 2020

fgrosse commented Oct 11, 2020

vamshin commented Feb 27, 2021 •

edited

Loading

Unexpected number of total hits when running a kNN query #176

Unexpected number of total hits when running a kNN query #176

Comments

juliusbachnick commented Jul 23, 2020

vamshin commented Jul 23, 2020 • edited Loading

juliusbachnick commented Aug 3, 2020

vamshin commented Aug 11, 2020

fgrosse commented Oct 11, 2020

vamshin commented Feb 27, 2021 • edited Loading

vamshin commented Jul 23, 2020 •

edited

Loading

vamshin commented Feb 27, 2021 •

edited

Loading