Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new kNN search endpoint #79013

Merged
merged 13 commits into from
Oct 18, 2021
Merged

Conversation

jtibshirani
Copy link
Contributor

@jtibshirani jtibshirani commented Oct 12, 2021

This PR adds a new REST endpoint called _knn_search that supports retrieving nearest vectors:

POST index/_knn_search
{
  "knn": {
    "field": "image_vector",
    "query_vector": [0.3f, 0.1f, ...],
    "k": 10,
    "num_candidates": 100
  },
  "_source": false,
  "fields": [ "name", "date" ]
}

The response has the exact same format as a _search response. The k closest documents to query_vector are returned as hits, ranked by their proximity. The num_cands parameter controls how many candidate vectors are gathered per shard, before these are merged and sorted to produce the final top k. For the HNSW algorithm, num_cands corresponds to efSearch. Increasing num_cands usually improves recall at the expense of latency.

Restrictions:

  • The k parameter must be less than num_cands. To prevent very expensive queries, num_cands is limited to 10,000.
  • Currently only one kNN search is allowed, but later we could extend the knn section to accept an array of kNN definitions.

The endpoint also supports these options from _search:

  • The routing parameter
  • All options for loading document content, including source filtering and fields. For simplicity, it uses the same default as _search where we return the whole _source for each hit.

Relates to #78473.

@sethmlarson sethmlarson added the Team:Clients Meta label for clients team label Oct 12, 2021
@jtibshirani
Copy link
Contributor Author

Other notes:

  • I went with k instead of a top-level size parameter. I thought it fit best with the _knn_search naming and user's expectations. It also feels more in line with our future ideas around combining multiple result sets together (like combining kNN and term-based query results, or extending the knn section to support multiple definitions).
  • We will always return k top hits if they're available. The hits.total will always be equal to the number of candidate neighbors considered (num_cands * num_shards). This is a little arbitrary but seems okay to me.
  • I think the logic happens to work with cross-cluster search, but it is not explicitly supported.
  • In a follow-up, we need to avoid buggy interactions with index alias filters and nested documents.

Some items I'm looking for feedback on:

  • What do you think about the API choices around k and num_cands?
  • I tried to include just those _search options that could be important for testing out kNN search. Do you think there are any I missed?
  • Related to this, I included not only source filtering and fields, but also docvalue_fields and stored_fields. It is common when benchmarking kNN to store the vector ID in doc values, and disable stored field loading entirely by passing stored_fields: _none_. My experiments show it can make a substantial difference in QPS. So having these options available lets us compare more fairly to other implementations in tests.

@jtibshirani jtibshirani added the :Search/Search Search-related issues that do not fall into other categories label Oct 12, 2021
@jtibshirani jtibshirani marked this pull request as ready for review October 12, 2021 22:35
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Oct 12, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/clients-team (Team:Clients)

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@jtibshirani jtibshirani mentioned this pull request Oct 12, 2021
17 tasks
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about the API choices around k and num_cands?

k sounds good to me, but I have a slight preference for num_candidates over num_cands.

Copy link
Contributor

@mayya-sharipova mayya-sharipova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtibshirani Thanks for your work, this PR overall looks very nice to me. I left some comments

Copy link
Contributor

@sethmlarson sethmlarson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nitty naming feedback on the API spec, mostly LGTM!

* Correct KnnVectorQueryBuilder equals and hashCode
* Remove printBoostAndQueryName
* Test fixes in KnnSearchRequestBuilderTests
@jtibshirani
Copy link
Contributor Author

@mayya-sharipova @sethmlarson this is now ready for another look. Notable changes:

  • Rename num_cands -> num_candidates
  • Rename REST spec name vectors.knn_search -> search_knn
  • Make sure _source can accept boolean values

Copy link
Contributor

@sethmlarson sethmlarson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@mayya-sharipova mayya-sharipova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtibshirani Thanks for iterating! This LGTM, I just left a small comment to use OBJECT_ARRAY_BOOLEAN_OR_STRING for _source parsing, but addressing it doesn't need any review from me.

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some additional comments but the API looks good to me.

/**
* An optional timeout to control how long search is allowed to take.
*/
private void timeout(TimeValue timeout) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks like a good option to add but I am afraid that it won't work with the current code.
First of all, I think we should avoid the confusion with a request timeout. It's a shard timeout that allows to stop the collection early.
Moreover the Lucene knn Query performs the approximate nearest neighbor search during the query rewrite.
The rewrite is not taken into account by this timeout so I'd prefer that we remove the option for now and adds it later with the required modifications to make it work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof, I didn't realize the timeout didn't take rewrites into account. I'll remove it and we can incorporate it later, as you suggest.

"url":{
"paths":[
{
"path":"/_knn_search",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it needed ? _search allows that for on-boarding but we can be more strict here imo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can remove this route.

/**
* A list of docvalue fields to load and return.
*/
private void docValueFields(List<FieldAndFormat> docValueFields) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could delay adding these options. I feel like the fields option should handle this more consistently. Do you think it's really needed for 8.0 (same question for stored fields) ?

Copy link
Contributor Author

@jtibshirani jtibshirani Oct 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pasting my comment from above which gives some context:

I included not only source filtering and fields, but also docvalue_fields and stored_fields. It is common when benchmarking kNN to store the vector ID in doc values, and disable stored field loading entirely by passing stored_fields: _none_. My experiments show it can make a substantial difference in QPS. So having these options available lets us compare more fairly to other implementations in tests.

Using docvalue_fields with stored_fields: _none can really make a substantial latency difference (~15% improvement in QPS). I'd like for us to be able to represent Elasticsearch's performance as accurately + strongly as possible in benchmarks. These options are also self-contained and all fall under the category of "document loading". If we document it well, I don't think they'll add confusion/ complexity (unlike "timeout"!)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok thanks for explaining

@Override
protected QueryBuilder doRewrite(QueryRewriteContext queryRewriteContext) {
SearchExecutionContext context = queryRewriteContext.convertToSearchExecutionContext();
if (context != null && context.getFieldType(fieldName) == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a specialized query like this I wonder if we should be strict by default and throw an error if the field doesn't exist ? We can add an ignoreUnmapped option for the explicit case where the field might not exist in some indices but that shouldn't be the default imo.

Copy link
Contributor Author

@jtibshirani jtibshirani Oct 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this new endpoint it does seem clearest to just throw an error when the field doesn't exist. I'll update this to throw an error.

Maybe we can start strict, but think through this more deeply when we add more capabilities to the API. The majority of query types ignore unmapped fields by default, and it'd be good to have consistent behavior across queries/ different parts of the search.

* Remove timeout parameter
* Require that URL path always includes index
* Throw error when vector field does not exist
Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

/**
* A list of docvalue fields to load and return.
*/
private void docValueFields(List<FieldAndFormat> docValueFields) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok thanks for explaining

@jtibshirani
Copy link
Contributor Author

Thanks everyone for the reviews.

@jtibshirani jtibshirani merged commit 74cce57 into elastic:master Oct 18, 2021
@jtibshirani jtibshirani deleted the knn-endpoint branch October 18, 2021 20:27
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Oct 19, 2021
* upstream/master: (34 commits)
  Add extensionName() to security extension (elastic#79329)
  More robust and consistent allowAll indicesAccessControl (elastic#79415)
  Fix circuit breaker leak in MultiTerms aggregation (elastic#79362)
  guard geoline aggregation from parents aggegator that emit empty buckets (elastic#79129)
  Vector tiles: increase the size of the envelope used to clip geometries (elastic#79030)
  Revert "[ML] Add queue_capacity setting to start deployment API (elastic#79369)" (elastic#79374)
  Convert token service license object to LicensedFeature (elastic#79284)
  [TEST] Fix ShardPathTests for MDP (elastic#79393)
  Fix fleet search API with no checkpints (elastic#79400)
  Reduce BWC version for transient settings (elastic#79396)
  EQL: Rename a test class for eclipse (elastic#79254)
  Use search_coordination threadpool in field caps (elastic#79378)
  Use query param instead of a system property for opting in for new cluster health response code (elastic#79351)
  Add new kNN search endpoint (elastic#79013)
  Disable BWC tests
  Convert auditing license object to LicensedFeature (elastic#79280)
  Update BWC versions after backport of elastic#78551
  Enable InstantiatingObjectParser to pass context as a first argument (elastic#79206)
  Move xcontent filtering tests (elastic#79298)
  Update links to Fleet/Agent docs (elastic#79303)
  ...
jtibshirani added a commit that referenced this pull request Nov 3, 2021
The `_knn_search` endpoint does not accept an empty `index` parameter.

Follow-up to #79013.
jtibshirani added a commit that referenced this pull request Nov 3, 2021
The `_knn_search` endpoint does not accept an empty `index` parameter.

Follow-up to #79013.
@jtibshirani jtibshirani added :Search Relevance/Vectors Vector search and removed :Search/Search Search-related issues that do not fall into other categories labels Jul 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release highlight :Search Relevance/Vectors Vector search Team:Clients Meta label for clients team Team:Search Meta label for search team v8.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants