Skip to content
This repository has been archived by the owner on Aug 16, 2022. It is now read-only.

Latest commit

 

History

History
162 lines (139 loc) · 6.83 KB

approximate-knn.md

File metadata and controls

162 lines (139 loc) · 6.83 KB
layout title nav_order parent has_children has_math
default
Approximate Search
1
k-NN
false
true

Approximate k-NN Search

The approximate k-NN method uses nmslib's implementation of the HNSW algorithm to power k-NN search. In this case, approximate means that for a given search, the neighbors returned are an estimate of the true k-nearest neighbors. Of the three methods, this method offers the best search scalability for large data sets. Generally speaking, once the data set gets into the hundreds of thousands of vectors, this approach should be preferred.

This plugin builds an HNSW graph of the vectors for each "knn-vector field"/ "Lucene segment" pair during indexing that can be used to efficiently find the k-nearest neighbors to a query vector during search. To learn more about Lucene segments, please refer to Apache Lucene's documentation. These graphs are loaded into native memory during search and managed by a cache. To learn more about pre-loading graphs into memory, refer to the warmup API. Additionally, you can see what graphs are already loaded in memory, which you can learn more about in the stats API section.

Because the graphs are constructed during indexing, it is not possible to apply a filter on an index and then use this search method. All filters will be applied on the results produced by the approximate nearest neighbor search.

Get started with approximate k-NN

To use the k-NN plugin's approximate search functionality, you must first create a k-NN index with setting index.knn to true. This setting tells the plugin to create HNSW graphs for the index.

Additionally, if you are using the approximate k-nearest neighbor method, you should specify knn.space_type to the space that you are interested in. This setting cannot be changed after it is set. To see what spaces we support, please refer to the spaces section. By default, index.knn.space_type is l2. For more information on index settings, such as algorithm parameters that can be tweaked to tune performance, please refer to the documentation.

Next, you must add one or more fields of the knn_vector data type. Here is an example that creates an index with two knn_vector fields and uses cosine similarity:

PUT my-knn-index-1
{
  "settings": {
    "index": {
      "knn": true,
      "knn.space_type": "cosinesimil"
    }
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "dimension": 2
      },
      "my_vector2": {
        "type": "knn_vector",
        "dimension": 4
      }
    }
  }
}

The knn_vector data type supports a vector of floats that can have a dimension of up to 10,000, as set by the dimension mapping parameter.

In Elasticsearch, codecs handle the storage and retrieval of indices. The k-NN plugin uses a custom codec to write vector data to graphs so that the underlying k-NN search library can read it. {: .tip }

After you create the index, you can add some data to it:

POST _bulk
{ "index": { "_index": "my-knn-index-1", "_id": "1" } }
{ "my_vector1": [1.5, 2.5], "price": 12.2 }
{ "index": { "_index": "my-knn-index-1", "_id": "2" } }
{ "my_vector1": [2.5, 3.5], "price": 7.1 }
{ "index": { "_index": "my-knn-index-1", "_id": "3" } }
{ "my_vector1": [3.5, 4.5], "price": 12.9 }
{ "index": { "_index": "my-knn-index-1", "_id": "4" } }
{ "my_vector1": [5.5, 6.5], "price": 1.2 }
{ "index": { "_index": "my-knn-index-1", "_id": "5" } }
{ "my_vector1": [4.5, 5.5], "price": 3.7 }
{ "index": { "_index": "my-knn-index-1", "_id": "6" } }
{ "my_vector2": [1.5, 5.5, 4.5, 6.4], "price": 10.3 }
{ "index": { "_index": "my-knn-index-1", "_id": "7" } }
{ "my_vector2": [2.5, 3.5, 5.6, 6.7], "price": 5.5 }
{ "index": { "_index": "my-knn-index-1", "_id": "8" } }
{ "my_vector2": [4.5, 5.5, 6.7, 3.7], "price": 4.4 }
{ "index": { "_index": "my-knn-index-1", "_id": "9" } }
{ "my_vector2": [1.5, 5.5, 4.5, 6.4], "price": 8.9 }

Then you can execute an approximate nearest neighbor search on the data using the knn query type:

GET my-knn-index-1/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector2": {
        "vector": [2, 3, 5, 6],
        "k": 2
      }
    }
  }
}

k is the number of neighbors the search of each graph will return. You must also include the size option. This option indicates how many results the query actually returns. The plugin returns k amount of results for each shard (and each segment) and size amount of results for the entire query. The plugin supports a maximum k value of 10,000.

Using approximate k-NN with filters

If you use the knn query alongside filters or other clauses (e.g. bool, must, match), you might receive fewer than k results. In this example, post_filter reduces the number of results from 2 to 1:

GET my-knn-index-1/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector2": {
        "vector": [2, 3, 5, 6],
        "k": 2
      }
    }
  },
  "post_filter": {
    "range": {
      "price": {
        "gte": 5,
        "lte": 10
      }
    }
  }
}

Spaces

A space corresponds to the function used to measure the distance between 2 points in order to determine the k-nearest neighbors. From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how Elasticsearch scores results, where a greater score equates to a better result. To convert distances to Elasticsearch scores, we take 1 / (1 + distance). Currently, the k-NN plugin supports the following spaces:

spaceType Distance Function Elasticsearch Score
l2 \[ Distance(X, Y) = \sum_{i=1}^n (X_i - Y_i)^2 \] 1 / (1 + Distance Function)
l1 \[ Distance(X, Y) = \sum_{i=1}^n (X_i - Y_i) \] 1 / (1 + Distance Function)
cosinesimil \[ 1 - {A · B \over \|A\| · \|B\|} = 1 - {\sum_{i=1}^n (A_i · B_i) \over \sqrt{\sum_{i=1}^n A_i^2} · \sqrt{\sum_{i=1}^n B_i^2}}\] where \(\|A\|\) and \(\|B\|\) represent normalized vectors. 1 / (1 + Distance Function)
hammingbit Distance = countSetBits(X \(\oplus\) Y) 1 / (1 + Distance Function)

The cosine similarity formula does not include the 1 - prefix. However, because nmslib equates smaller scores with closer results, they return 1 - cosineSimilarity for their cosine similarity space---that's why 1 - is included in the distance function. {: .note }