Skip to content
This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Unexpected number of total hits when running a kNN query #176

Open
juliusbachnick opened this issue Jul 23, 2020 · 5 comments
Open

Unexpected number of total hits when running a kNN query #176

juliusbachnick opened this issue Jul 23, 2020 · 5 comments
Assignees
Labels
question Further information is requested

Comments

@juliusbachnick
Copy link

Hey,

First of all: Thanks for the effort you put into this plugin! I'm trying to run some Elasticsearch aggregations on the document set that is returned when querying for the neighbors of a vector. However, the number of returned total hits (as far as I'm aware all hits are used as document set for ES aggregations) is sometimes greater than the specified k in the original query.

For example, given the following statements are executed to set up the index:

PUT /my_index
{
  "settings": {
     "number_of_shards": 4,
     "index.knn": true,
     "index.knn.algo_param.ef_search": 200,
     "index.knn.algo_param.ef_construction": 200,
     "index.knn.algo_param.m": 32
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 2
      }
    }
  }
}

PUT /my_index/_doc/1
{
  "my_vector": [1.5, 2.5]
}

PUT /my_index/_doc/2
{
  "my_vector": [2.5, 3.5]
}

PUT /my_index/_doc/3
{
  "my_vector": [3.5, 4.5]
}

When executing a the following kNN query:

POST /my_index/_search
{
  "size": 3,
  "query": {
    "knn": {
      "my_vector": {
        "vector": [3, 4],
        "k": 3
      }
    }
  }
}

The returned total hits equal k:

{
    "took": 4,
    "timed_out": false,
    "_shards": {
        "total": 4,
        "successful": 4,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 3,
            "relation": "eq"
        },
[...]
}

However, once I add another document:

PUT /my_index/_doc/4
{
  "my_vector": [4.5, 5.5]
}

When running the exact same kNN query as above (with k=3), the number of reported total hits increases to 4, even though k was specified as 3:

{
    "took": 5,
    "timed_out": false,
    "_shards": {
        "total": 4,
        "successful": 4,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 4,
            "relation": "eq"
        },
[...]
}

I'm running the docker image amazon/opendistro-for-elasticsearch:1.4.0:

  "version" : {
    "number" : "7.4.2",
    "build_flavor" : "oss",
    "build_type" : "tar",
    "build_hash" : "2f90bbf7b93631e52bafb59b3b049cb44ec25e96",
    "build_date" : "2019-10-28T20:40:44.881551Z",
    "build_snapshot" : false,
    "lucene_version" : "8.2.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  }

I assumed that k would limit the number of returned hits (i.e. return early once the desired number of neighbours is reached), but could not really find anything more specific in the documentation, apart from:

In this case, k is the number of neighbors you want the query to return, but you must also include the size option. Otherwise, you get k results for each shard (and each segment) rather than k results for the entire query. The plugin supports a maximum k value of 10,000.

So I wonder, whether I'm not using the k parameter correct or whether this is a bug. I could aggregate all values on the client side using the returned documents as specified by the size parameter but one of the reasons that I chose the kNN implementation for ES, are the aggregations that I can run on top of queries within the same ES query.

Thanks for clarifying this :).

@vamshin
Copy link
Member

vamshin commented Jul 23, 2020

Hi @juliusbachnick,

This is not a bug and expected behavior from k-NN design. Plugin extracts k neighbors from each segment inside the shard and that accounts for atleast k hits.

During search, k-NN plugin asks each segment to return the 'k' results which accounts to 'k' hits per segment. size parameter in the query helps to give you closest neighbors among them. Having size=k would give atmost 'k' nearest neighbors. For example, in your case lets say all the first 3 docs went to segment0 and 4th doc went to segment1 then we would get 3 hits from segment 1 and 1 hit from segment 2 accounting to 4 hits.

You could forcemerge to 1 segment if you want hits to be <=k.

curl -X POST "localhost:9200/<indexName>/_forcemerge?max_num_segments=1&pretty"

@vamshin vamshin added the question Further information is requested label Jul 24, 2020
@juliusbachnick
Copy link
Author

Hey @vamshin,

Thank you for looking into this and for giving a detailed explanation of how the k-NN plugin works internally. Is the behaviour of getting up to k results per segment documented somewhere? Otherwise, it would be nice if this is added to avoid future confusion about the returned number of total hits.

The forcemerge approach seems like a workaround to me which I would like to avoid. As I understand, the k-NN Plugin should seamlessly integrate with Elasticsearch, i.e. it is possible to combine it with other ES features such as aggregations. However, given the underlying logic I'm not sure how I would use a k-NN query in conjunction with ES aggregations as the aggregations always operate on all returned hits. As long as the index only has one shard, I could work around this by using a sampler aggregation to only aggregate on the top k results. As soon as use more than one shard, this is not a viable option anymore.

Can you provide an example how aggregations on k results are supposed to be used if k+n total hits are returned?

@vamshin
Copy link
Member

vamshin commented Aug 11, 2020

Hi @juliusbachnick,

Good point. Seems like a limitation on usage of aggregations on the k-NN search. Possible work around i could see is having shards with segments force merged to 1 segment. Need to dig into aggregations to see if there is a way to overcome this problem. Let us know if you figure out the solution. Will keep this thread open for further analysis.

@fgrosse
Copy link

fgrosse commented Oct 11, 2020

Hey @vamshin thanks for looking into this topic. Is there any update on this issue? I am wondering if the described use case is in some way unusual since using Elasticsearch aggregations on top of the kNN results is listed as a feature in the second sentence of the README.md

[…] You can use aggregations and filter clauses to further refine your similarity search operations. […]

Should this section be updated to explain the limitations?

@vamshin vamshin self-assigned this Feb 27, 2021
@vamshin
Copy link
Member

vamshin commented Feb 27, 2021

Sorry for getting late on this thread. Elasticsearch performs aggregations on all the documents that are "hit". Since 'k' nearest neighbors are evaluated at segment level, each segment return at most 'k' hits. Could not think of a way to apply aggregations on just the final top 'k' results at coordinator node level. @juliusbachnick were you able to figure out a way to do this?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants