Add new kNN search endpoint #79013

jtibshirani · 2021-10-12T20:30:58Z

This PR adds a new REST endpoint called _knn_search that supports retrieving nearest vectors:

POST index/_knn_search
{
  "knn": {
    "field": "image_vector",
    "query_vector": [0.3, 0.1, ...],
    "k": 10,
    "num_candidates": 100
  },
  "_source": false,
  "fields": [ "name", "date" ]
}

The response has the same format as a _search response. The k closest documents to query_vector are returned as hits, ranked by their proximity. The num_candidates parameter (named num_cands in earlier revisions of this PR) controls how many candidate vectors are gathered per shard before they are merged and sorted to produce the final top k. For the HNSW algorithm, num_candidates corresponds to efSearch. Increasing num_candidates usually improves recall at the expense of latency.
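The per-shard gathering and final merge described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the class `KnnMergeSketch`, the `ScoredDoc` type, and the convention that a higher score means a closer vector are all hypothetical and do not come from this PR.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration only; none of these names come from the PR. */
final class KnnMergeSketch {

    /** A scored hit. Assumption: a higher score means a closer vector. */
    static final class ScoredDoc {
        final String id;
        final double score;

        ScoredDoc(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    /**
     * Keeps at most numCandidates hits per shard, then merges all shards
     * and returns the k highest-scoring hits overall.
     */
    static List<ScoredDoc> mergeTopK(List<List<ScoredDoc>> shardResults, int k, int numCandidates) {
        Comparator<ScoredDoc> byScoreDesc =
            Comparator.comparingDouble((ScoredDoc d) -> d.score).reversed();
        List<ScoredDoc> merged = new ArrayList<>();
        for (List<ScoredDoc> shard : shardResults) {
            // Each shard contributes only its best numCandidates hits.
            shard.stream().sorted(byScoreDesc).limit(numCandidates).forEach(merged::add);
        }
        // The coordinating step sorts the combined candidates and keeps the top k.
        merged.sort(byScoreDesc);
        return new ArrayList<>(merged.subList(0, Math.min(k, merged.size())));
    }
}
```

In this toy model a larger numCandidates lets more hits from each shard compete in the merge. In a real approximate search it also widens the graph exploration, which is where the recall and latency trade-off comes from.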

Restrictions:

* The k parameter must be less than num_candidates. To prevent very expensive queries, num_candidates is limited to 10,000.
* Currently only one kNN search is allowed, but later we could extend the knn section to accept an array of kNN definitions.
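As a rough sketch of how the restrictions above might be checked, assuming a hypothetical `KnnParamsSketch` class whose error messages are invented for illustration: the 10,000 limit comes from the text, while the lower bound on k and the treatment of k equal to num_candidates are assumptions of this sketch.

```java
/** Hypothetical validator; names and messages are illustrative, not taken from the PR. */
final class KnnParamsSketch {

    /** Upper bound on num_candidates quoted in the restrictions above. */
    static final int MAX_NUM_CANDIDATES = 10_000;

    /** Throws IllegalArgumentException if the pair (k, numCandidates) is not acceptable. */
    static void validate(int k, int numCandidates) {
        if (k < 1) {
            // Assumption: at least one neighbor must be requested.
            throw new IllegalArgumentException("[k] must be greater than 0");
        }
        if (numCandidates > MAX_NUM_CANDIDATES) {
            throw new IllegalArgumentException("[num_candidates] cannot exceed [" + MAX_NUM_CANDIDATES + "]");
        }
        if (numCandidates < k) {
            // Assumption: k == numCandidates is accepted here; tighten the check
            // to (numCandidates <= k) if the bound is meant to be strict.
            throw new IllegalArgumentException("[num_candidates] cannot be less than [k]");
        }
    }
}
```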

The endpoint also supports these options from _search:

The routing parameter
All options for loading document content, including source filtering and fields. For simplicity, it uses the same default as _search where we return the whole _source for each hit.

Relates to #78473.

jtibshirani · 2021-10-12T21:18:42Z

Other notes:

* I went with k instead of a top-level size parameter. I thought it fit best with the _knn_search naming and users' expectations. It also feels more in line with our future ideas around combining multiple result sets together (like combining kNN and term-based query results, or extending the knn section to support multiple definitions).
* We will always return k top hits if they're available. The hits.total will always be equal to the number of candidate neighbors considered (num_cands * num_shards). This is a little arbitrary but seems okay to me.
* I think the logic happens to work with cross-cluster search, but it is not explicitly supported.
* In a follow-up, we need to avoid buggy interactions with index alias filters and nested documents.

Some items I'm looking for feedback on:

* What do you think about the API choices around k and num_cands?
* I tried to include just those _search options that could be important for testing out kNN search. Do you think there are any I missed?
* Related to this, I included not only source filtering and fields, but also docvalue_fields and stored_fields. It is common when benchmarking kNN to store the vector ID in doc values, and disable stored field loading entirely by passing stored_fields: _none_. My experiments show it can make a substantial difference in QPS. So having these options available lets us compare more fairly to other implementations in tests.

elasticmachine · 2021-10-12T22:35:18Z

Pinging @elastic/clients-team (Team:Clients)

elasticmachine · 2021-10-12T22:35:19Z

Pinging @elastic/es-search (Team:Search)

jpountz

> What do you think about the API choices around k and num_cands?

k sounds good to me, but I have a slight preference for num_candidates over num_cands.

...ctors/src/test/java/org/elasticsearch/xpack/vectors/action/KnnSearchRequestBuilderTests.java

mayya-sharipova

@jtibshirani Thanks for your work, this PR overall looks very nice to me. I left some comments.

...in/vectors/src/main/java/org/elasticsearch/xpack/vectors/action/KnnSearchRequestBuilder.java

...lugin/vectors/src/main/java/org/elasticsearch/xpack/vectors/query/KnnVectorQueryBuilder.java

...ctors/src/test/java/org/elasticsearch/xpack/vectors/action/KnnSearchRequestBuilderTests.java

...plugin/vectors/src/test/java/org/elasticsearch/xpack/vectors/query/KnnSearchActionTests.java

sethmlarson

Some nitty naming feedback on the API spec, mostly LGTM!

rest-api-spec/src/main/resources/rest-api-spec/api/vectors.knn_search.json

jtibshirani · 2021-10-15T17:53:41Z

@mayya-sharipova @sethmlarson this is now ready for another look. Notable changes:

* Rename num_cands -> num_candidates
* Rename REST spec name vectors.knn_search -> search_knn
* Make sure _source can accept boolean values

sethmlarson

LGTM!

mayya-sharipova

@jtibshirani Thanks for iterating! This LGTM, I just left a small comment to use OBJECT_ARRAY_BOOLEAN_OR_STRING for _source parsing, but addressing it doesn't need any review from me.
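To make the `_source` discussion concrete, here is a minimal sketch of accepting a boolean, a string, an array, or an object for that option. It is not the ObjectParser-based code from this PR: the class `SourceOptionSketch` and its fields are hypothetical, and the object form with `includes` and `excludes` keys follows the general `_search` convention rather than anything shown in this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; this is not the ObjectParser-based code from the PR. */
final class SourceOptionSketch {
    final boolean fetchSource;
    final List<String> includes;
    final List<String> excludes;

    private SourceOptionSketch(boolean fetchSource, List<String> includes, List<String> excludes) {
        this.fetchSource = fetchSource;
        this.includes = includes;
        this.excludes = excludes;
    }

    /** Accepts the already-decoded JSON value of "_source". */
    static SourceOptionSketch parse(Object value) {
        if (value instanceof Boolean) {
            // true or false switches source loading on or off as a whole.
            return new SourceOptionSketch((Boolean) value, Collections.emptyList(), Collections.emptyList());
        }
        if (value instanceof String || value instanceof List) {
            // A single pattern or an array of patterns lists the fields to include.
            return new SourceOptionSketch(true, toPatterns(value), Collections.emptyList());
        }
        if (value instanceof Map) {
            Map<?, ?> object = (Map<?, ?>) value;
            return new SourceOptionSketch(true, toPatterns(object.get("includes")), toPatterns(object.get("excludes")));
        }
        throw new IllegalArgumentException("[_source] must be a boolean, string, array, or object");
    }

    /** Normalizes a string or a list into a list of patterns; anything else yields an empty list. */
    private static List<String> toPatterns(Object value) {
        List<String> patterns = new ArrayList<>();
        if (value instanceof String) {
            patterns.add((String) value);
        } else if (value instanceof List) {
            for (Object item : (List<?>) value) {
                patterns.add(String.valueOf(item));
            }
        }
        return patterns;
    }
}
```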

jimczi

I left some additional comments but the API looks good to me.

jimczi · 2021-10-18T07:28:51Z

...in/vectors/src/main/java/org/elasticsearch/xpack/vectors/action/KnnSearchRequestBuilder.java

+    /**
+     * An optional timeout to control how long search is allowed to take.
+     */
+    private void timeout(TimeValue timeout) {


That looks like a good option to add, but I am afraid that it won't work with the current code.
First of all, I think we should avoid confusion with a request timeout. It's a shard timeout that allows stopping the collection early.
Moreover, the Lucene kNN query performs the approximate nearest neighbor search during the query rewrite.
The rewrite is not taken into account by this timeout, so I'd prefer that we remove the option for now and add it later with the modifications required to make it work.

jtibshirani

Oof, I didn't realize the timeout didn't take rewrites into account. I'll remove it and we can incorporate it later, as you suggest.

jimczi · 2021-10-18T13:06:19Z

rest-api-spec/src/main/resources/rest-api-spec/api/knn_search.json

+    "url":{
+      "paths":[
+        {
+          "path":"/_knn_search",


Is it needed? _search allows that for on-boarding, but we can be more strict here imo.

jtibshirani

Sure, I can remove this route.

jimczi · 2021-10-18T13:45:46Z

...in/vectors/src/main/java/org/elasticsearch/xpack/vectors/action/KnnSearchRequestBuilder.java

+    /**
+     * A list of docvalue fields to load and return.
+     */
+    private void docValueFields(List<FieldAndFormat> docValueFields) {


I wonder if we could delay adding these options. I feel like the fields option should handle this more consistently. Do you think it's really needed for 8.0 (same question for stored fields)?

jtibshirani

Pasting my comment from above which gives some context:

> I included not only source filtering and fields, but also docvalue_fields and stored_fields. It is common when benchmarking kNN to store the vector ID in doc values, and disable stored field loading entirely by passing stored_fields: _none_. My experiments show it can make a substantial difference in QPS. So having these options available lets us compare more fairly to other implementations in tests.

Using docvalue_fields with stored_fields: _none_ can make a substantial performance difference (~15% improvement in QPS). I'd like us to be able to represent Elasticsearch's performance as accurately and strongly as possible in benchmarks. These options are also self-contained and all fall under the category of "document loading". If we document them well, I don't think they'll add confusion or complexity (unlike "timeout"!).

jimczi

ok thanks for explaining

jimczi · 2021-10-18T13:49:01Z

...lugin/vectors/src/main/java/org/elasticsearch/xpack/vectors/query/KnnVectorQueryBuilder.java

+    @Override
+    protected QueryBuilder doRewrite(QueryRewriteContext queryRewriteContext) {
+        SearchExecutionContext context = queryRewriteContext.convertToSearchExecutionContext();
+        if (context != null && context.getFieldType(fieldName) == null) {


For a specialized query like this, I wonder if we should be strict by default and throw an error if the field doesn't exist? We can add an ignoreUnmapped option for the explicit case where the field might not exist in some indices, but that shouldn't be the default imo.

jtibshirani

For this new endpoint it does seem clearest to just throw an error when the field doesn't exist. I'll update this to throw an error.

Maybe we can start strict, but think through this more deeply when we add more capabilities to the API. The majority of query types ignore unmapped fields by default, and it'd be good to have consistent behavior across queries and different parts of the search.
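The two behaviors weighed in this thread, strict by default versus an opt-in lenient mode, can be sketched as follows. This is a hypothetical illustration: `StrictFieldLookupSketch`, the map standing in for index mappings, and the `ignoreUnmapped` flag (suggested above but not part of this PR) are all assumptions.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of strict versus lenient handling of an unmapped vector field. */
final class StrictFieldLookupSketch {

    /**
     * Looks up the mapped type of fieldName. The mappings argument stands in for
     * an index mapping (field name to type name); it is not an Elasticsearch API.
     */
    static Optional<String> resolveFieldType(Map<String, String> mappings, String fieldName, boolean ignoreUnmapped) {
        String type = mappings.get(fieldName);
        if (type == null) {
            if (ignoreUnmapped) {
                // Lenient mode: the caller treats the query as matching nothing.
                return Optional.empty();
            }
            // Strict mode (the default discussed above): fail loudly.
            throw new IllegalArgumentException("field [" + fieldName + "] does not exist in the mapping");
        }
        return Optional.of(type);
    }
}
```

Starting strict keeps the option of relaxing the behavior later without breaking callers, whereas going from lenient to strict would turn previously accepted requests into errors.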

jimczi

LGTM

jimczi · 2021-10-18T18:53:58Z

...in/vectors/src/main/java/org/elasticsearch/xpack/vectors/action/KnnSearchRequestBuilder.java

+    /**
+     * A list of docvalue fields to load and return.
+     */
+    private void docValueFields(List<FieldAndFormat> docValueFields) {


ok thanks for explaining

jtibshirani · 2021-10-18T20:20:22Z

Thanks everyone for the reviews.

The `_knn_search` endpoint does not accept an empty `index` parameter. Follow-up to #79013.

Adds a release highlight for the kNN search API. Relates to #78473 and #79013.

### Preview

https://elasticsearch_83755.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/8.0/release-highlights.html#_knn_search_api

elasticsearchmachine added the v8.0.0 label Oct 12, 2021

Add new kNN search endpoint

1fb6806

sethmlarson added the Team:Clients label Oct 12, 2021

jtibshirani force-pushed the knn-endpoint branch from 901ac83 to 1fb6806 Compare October 12, 2021 21:11

jtibshirani added the :Search/Search label Oct 12, 2021

jtibshirani marked this pull request as ready for review October 12, 2021 22:35

elasticmachine added the Team:Search label Oct 12, 2021

jtibshirani mentioned this pull request Oct 12, 2021

Integrate ANN search #78473

Closed

17 tasks

jpountz reviewed Oct 13, 2021

...ctors/src/test/java/org/elasticsearch/xpack/vectors/action/KnnSearchRequestBuilderTests.java (outdated, resolved)

mayya-sharipova reviewed Oct 15, 2021

...ctors/src/test/java/org/elasticsearch/xpack/vectors/action/KnnSearchRequestBuilderTests.java (outdated, resolved)

mayya-sharipova reviewed Oct 15, 2021

...plugin/vectors/src/test/java/org/elasticsearch/xpack/vectors/query/KnnSearchActionTests.java (resolved)

mayya-sharipova reviewed Oct 15, 2021

...plugin/vectors/src/test/java/org/elasticsearch/xpack/vectors/query/KnnSearchActionTests.java (resolved)

sethmlarson reviewed Oct 15, 2021

rest-api-spec/src/main/resources/rest-api-spec/api/vectors.knn_search.json (outdated, resolved)

jtibshirani added 4 commits October 15, 2021 10:45

Merge remote-tracking branch 'upstream/master' into pr

7fb097d

Small fixes in response to feedback

489bfb1

* Correct KnnVectorQueryBuilder equals and hashCode
* Remove printBoostAndQueryName
* Test fixes in KnnSearchRequestBuilderTests

Make sure _source can be passed as a boolean.

205a6c6

Rename REST spec name and num_cands

0739a26

Revert to using knn_search for REST spec name

1f40a8d

sethmlarson approved these changes Oct 15, 2021

mayya-sharipova approved these changes Oct 16, 2021

Merge remote-tracking branch 'upstream/master' into knn-endpoint

a1818fa

jimczi reviewed Oct 18, 2021

jtibshirani added 3 commits October 18, 2021 09:51

Address more comments

7c6c883

* Remove timeout parameter
* Require that URL path always includes index
* Throw error when vector field does not exist

Make sure we can parse strings and arrays for _source

74227c4

Simplify tests now that we are more restrictive

dfb5335

jtibshirani added 3 commits October 18, 2021 10:33

Merge remote-tracking branch 'upstream/master' into knn-endpoint

a36d7e9

Merge remote-tracking branch 'upstream/master' into knn-endpoint

ce420a2

Ensure the endpoint respects field aliases

a0c2bdf

jimczi approved these changes Oct 18, 2021

jtibshirani merged commit 74cce57 into elastic:master Oct 18, 2021

jtibshirani deleted the knn-endpoint branch October 18, 2021 20:27

sethmlarson mentioned this pull request Oct 21, 2021

Add types to 'knn_search' API elastic/elasticsearch-specification#932

Merged

jakelandis added v8.0.0-beta1 and removed v8.0.0 labels Oct 27, 2021

jtibshirani mentioned this pull request Nov 3, 2021

Correct description in kNN search rest spec #80313

Merged

jtibshirani added a commit that referenced this pull request Nov 3, 2021

Correct description in kNN search rest spec (#80313)

bdbd4e3

The `_knn_search` endpoint does not accept an empty `index` parameter. Follow-up to #79013.

jtibshirani added a commit that referenced this pull request Nov 3, 2021

Correct description in kNN search rest spec (#80313)

45add77

The `_knn_search` endpoint does not accept an empty `index` parameter. Follow-up to #79013.

jpountz added the release highlight label Jan 28, 2022

jrodewig mentioned this pull request Feb 9, 2022

[DOCS] Add 8.0 release highlight for kNN search API #83755

Merged

jtibshirani added :Search Relevance/Vectors and removed :Search/Search labels Jul 21, 2022
