Unexpected number of total hits when running a kNN query #176
Comments
Hi @juliusbachnick, this is not a bug; it is expected behavior given the k-NN design. The plugin extracts k neighbors from each segment inside the shard, so the total comes to at least k hits whenever every segment holds k or more documents. During search, the k-NN plugin asks each segment to return 'k' results, which amounts to up to 'k' hits per segment. You could force merge to 1 segment if you want the hits to be <= k.
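To make the per-segment behavior concrete, here is a toy model in plain Python. It is not the plugin's code: the segment layout, document ids, distances, and the helper names `knn_hits_per_segment` and `force_merge` are all made up for illustration. It only assumes what the comment above describes, namely that each segment contributes up to k nearest documents on its own.

```python
from typing import List, Tuple

Doc = Tuple[str, float]  # (document id, distance to the query vector)


def knn_hits_per_segment(segments: List[List[Doc]], k: int) -> List[Doc]:
    """Collect up to k nearest documents from every segment independently."""
    hits: List[Doc] = []
    for segment in segments:
        nearest = sorted(segment, key=lambda doc: doc[1])[:k]
        hits.extend(nearest)
    return hits


def force_merge(segments: List[List[Doc]]) -> List[List[Doc]]:
    """Toy stand-in for a force merge down to a single segment."""
    return [[doc for segment in segments for doc in segment]]


k = 3
# Hypothetical layout: three documents in one segment and a document
# indexed later that ended up in a second segment.
segments = [[("1", 0.1), ("2", 0.4), ("3", 0.9)], [("4", 0.2)]]

hits_before = knn_hits_per_segment(segments, k)
hits_after = knn_hits_per_segment(force_merge(segments), k)

print(len(hits_before))  # 4: three from the first segment, one from the second
print(len(hits_after))   # 3: a single segment returns at most k
```

Under these assumptions, the count of 4 with k=3 is what you would see if the added document landed in a new segment.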
Hey @vamshin, Thank you for looking into this and for giving a detailed explanation of how the k-NN plugin works internally. Is the behaviour of getting up to The Can you provide an example how aggregations on
Hi @juliusbachnick, good point. This seems to be a limitation of using aggregations with k-NN search. A possible workaround I can see is having shards whose segments are force merged to 1 segment. We need to dig into aggregations to see if there is a way to overcome this problem. Let us know if you figure out a solution. We will keep this thread open for further analysis.
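For reference, the force merge workaround maps to Elasticsearch's `_forcemerge` endpoint with `max_num_segments=1`. The sketch below only builds the request URL; the host, the index name `my-knn-index`, and the helper `force_merge_url` are assumptions for illustration, and sending the request (an HTTP POST) is left out.

```python
from urllib.parse import quote, urlencode


def force_merge_url(base_url: str, index: str, max_num_segments: int = 1) -> str:
    """Build the URL of a force merge request for one index."""
    query = urlencode({"max_num_segments": max_num_segments})
    return f"{base_url.rstrip('/')}/{quote(index, safe='')}/_forcemerge?{query}"


# Hypothetical host and index name; the request itself would be a POST.
url = force_merge_url("http://localhost:9200", "my-knn-index")
print(url)  # http://localhost:9200/my-knn-index/_forcemerge?max_num_segments=1
```

Indexing more documents afterwards creates new segments, so the merge would likely have to be repeated after writes for the hit count to stay within k.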
Hey @vamshin thanks for looking into this topic. Is there any update on this issue? I am wondering if the described use case is in some way unusual since using Elasticsearch aggregations on top of the kNN results is listed as a feature in the second sentence of the README.md
Should this section be updated to explain the limitations?
Sorry for the late reply on this thread. Elasticsearch performs aggregations on all the documents that are "hit". Since the 'k' nearest neighbors are evaluated at segment level, each segment returns at most 'k' hits. I could not think of a way to apply aggregations on just the final top 'k' results at the coordinator node level. @juliusbachnick were you able to figure out a way to do this?
Hey,
First of all: Thanks for the effort you put into this plugin! I'm trying to run some Elasticsearch aggregations on the document set that is returned when querying for the neighbors of a vector. However, the number of returned total hits (as far as I'm aware, all hits are used as the document set for ES aggregations) is sometimes greater than the `k` specified in the original query.
For example, given the following statements are executed to set up the index:
When executing the following kNN query:
The returned total hits equal `k`:
However, once I add another document:
When running the exact same kNN query as above (with `k=3`), the number of reported total hits increases to `4`, even though `k` was specified as `3`:
I'm running the docker image `amazon/opendistro-for-elasticsearch:1.4.0`:
I assumed that `k` would limit the number of returned hits (i.e. return early once the desired number of neighbours is reached), but could not really find anything more specific in the documentation, apart from:
So I wonder whether I'm not using the `k` parameter correctly or whether this is a bug. I could aggregate all values on the client side using the returned documents as specified by the `size` parameter, but one of the reasons that I chose the kNN implementation for ES is the aggregations that I can run on top of queries within the same ES query.
Thanks for clarifying this :).
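The client-side fallback mentioned above could look roughly like the following sketch. It assumes a search response in the usual Elasticsearch shape (`hits.hits` entries with `_score` and `_source`), that a higher `_score` means a closer neighbour, and that `size` was at least `k` so the response holds enough documents. The field name `color`, the sample response, and the helper `aggregate_top_k` are hypothetical.

```python
from collections import Counter
from typing import Any, Dict


def aggregate_top_k(response: Dict[str, Any], k: int, field: str) -> Counter:
    """Count the values of one field over the k best-scoring hits only."""
    hits = response["hits"]["hits"]
    top_k = sorted(hits, key=lambda hit: hit["_score"], reverse=True)[:k]
    return Counter(hit["_source"][field] for hit in top_k)


# Made-up response with four hits, as in the k=3 case described above.
response = {
    "hits": {
        "hits": [
            {"_id": "1", "_score": 0.9, "_source": {"color": "red"}},
            {"_id": "4", "_score": 0.8, "_source": {"color": "blue"}},
            {"_id": "2", "_score": 0.7, "_source": {"color": "red"}},
            {"_id": "3", "_score": 0.6, "_source": {"color": "blue"}},
        ]
    }
}

counts = aggregate_top_k(response, k=3, field="color")
print(dict(counts))  # {'red': 2, 'blue': 1}
```

This behaves like a terms aggregation restricted to the final top k, at the cost of moving the work out of the Elasticsearch query.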