
Commit 9cc6d81

Simran-B and nerpaula authored
DOC-765 | Cosine similarity fix for vector indexes (#719)
* Initial reference docs about the vector index
* HTTP API docs and refinements
* Version remark, OpenAPI minItems/maxItems and fix a type
* inBackground and parallelism are supported
* Review feedback, address cosine metric issue, reword inBackground, add parallelism
* Remove leftover line
* Add internal links to release notes
* Cosine similarity value out of range has been fixed
* Add innerProduct metric

---------

Co-authored-by: Paula Mihu <97217318+nerpaula@users.noreply.github.com>
1 parent 4f7090e commit 9cc6d81

10 files changed: +240 -43 lines changed


site/content/3.12/aql/functions/vector.md

Lines changed: 81 additions & 12 deletions
Original file line number | Diff line number | Diff line change
@@ -64,13 +64,14 @@ be found depends on the data as well as the search effort (see the `nProbe` opti
6464

6565
`APPROX_NEAR_COSINE(vector1, vector2, options) → similarity`
6666

67-
Retrieve the approximate angular similarity using the cosine metric, accelerated
68-
by a matching vector index.
6967

70-
The higher the cosine similarity value is, the more similar the two vectors
71-
are. The closer it is to 0, the more different they are. The value can also
72-
be negative, indicating that the vectors are not similar and point in opposite
73-
directions. You need to sort in descending order so that the most similar
68+
Retrieve the approximate cosine of the angle between two vectors, accelerated
69+
by a matching vector index with the `cosine` metric.
70+
71+
The closer the similarity value is to 1, the more similar the two vectors
72+
are. The closer it is to 0, the more different they are. The value can also be
73+
negative, down to -1, indicating that the vectors are not similar and point in opposite
74+
directions. You need to **sort in descending order** so that the most similar
7475
documents come first, which is what a vector index using the `cosine` metric
7576
can provide.
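
To make the value range and the sort direction concrete, here is a minimal sketch in plain Python. It computes the exact cosine similarity, not the approximate, index-accelerated value that `APPROX_NEAR_COSINE()` returns, and the helper name `cosine_similarity` and the sample vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Exact cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0]
docs = {
    "same_direction": [2.0, 0.0],   # similarity 1
    "orthogonal": [0.0, 3.0],       # similarity 0
    "opposite": [-4.0, 0.0],        # similarity -1
}

# Descending order puts the most similar vector first.
ranked = sorted(docs, key=lambda k: cosine_similarity(docs[k], query), reverse=True)
print(ranked)
```

The vector lengths differ, but only the direction influences the result, which is why the values stay within `[-1, 1]`.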
7677

@@ -83,8 +84,8 @@ can provide.
8384
closest Voronoi cells to consider for the search results. The larger the number,
8485
the slower the search but the better the search results. If not specified, the
8586
`defaultNProbe` value of the vector index is used.
86-
- returns **similarity** (number): The approximate angular similarity between
87-
both vectors.
87+
- returns **similarity** (number): The approximate cosine similarity of
88+
both normalized vectors. The value range is `[-1, 1]`.
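
The effect of `nProbe` can be illustrated with a toy inverted-file search in plain Python. This is only a sketch of the general idea (vectors are grouped by their closest centroid and a query scans the `nProbe` closest cells), not the Faiss implementation used by the server. All function names, centroids, and vectors are hypothetical, and L2 distance is used for simplicity.

```python
import math


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_cells(vectors, centroids):
    """Assign every vector to the cell of its closest centroid."""
    cells = {i: [] for i in range(len(centroids))}
    for key, vec in vectors.items():
        closest = min(range(len(centroids)), key=lambda i: l2(vec, centroids[i]))
        cells[closest].append(key)
    return cells


def search(query, vectors, centroids, cells, n_probe, limit):
    """Scan only the n_probe cells whose centroids are closest to the query."""
    probe_order = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))
    candidates = [key for i in probe_order[:n_probe] for key in cells[i]]
    return sorted(candidates, key=lambda key: l2(vectors[key], query))[:limit]


centroids = [[0.0, 0.0], [10.0, 0.0]]
vectors = {"a": [4.9, 0.0], "b": [5.2, 0.0], "c": [0.0, 0.0]}
cells = build_cells(vectors, centroids)
query = [4.8, 0.0]

narrow = search(query, vectors, centroids, cells, n_probe=1, limit=2)
wide = search(query, vectors, centroids, cells, n_probe=2, limit=2)
print(narrow, wide)
```

With one probed cell, the close neighbor `b` is missed because it lies just across the cell boundary. Probing both cells finds it, at the cost of scanning more candidates.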
8889

8990
**Examples**
9091

@@ -126,15 +127,83 @@ FOR docOuter IN coll
126127
RETURN { key: docOuter._key, neighbors }
127128
```
128129

130+
### APPROX_NEAR_INNER_PRODUCT()
131+
132+
<small>Introduced in: v3.12.6</small>
133+
134+
`APPROX_NEAR_INNER_PRODUCT(vector1, vector2, options) → similarity`
135+
136+
Retrieve the approximate dot product of two vectors, accelerated by a matching
137+
vector index with the `innerProduct` metric.
138+
139+
The higher the similarity value is, the more similar the two vectors
140+
are. The closer it is to 0, the more different they are. The value can also
141+
be negative, indicating that the vectors are not similar and point in opposite
142+
directions. You need to **sort in descending order** so that the most similar
143+
documents come first, which is what a vector index using the `innerProduct`
144+
metric can provide.
145+
146+
- **vector1** (array of numbers): The first vector. Either this parameter or
147+
`vector2` needs to reference a stored attribute holding the vector embedding.
148+
- **vector2** (array of numbers): The second vector. Either this parameter or
149+
`vector1` needs to reference a stored attribute holding the vector embedding.
150+
- **options** (object, _optional_):
151+
- **nProbe** (number, _optional_): How many neighboring centroids, that is,
152+
closest Voronoi cells to consider for the search results. The larger the number,
153+
the slower the search but the better the search results. If not specified, the
154+
`defaultNProbe` value of the vector index is used.
155+
- returns **similarity** (number): The approximate dot product
156+
of both vectors without normalization. The value range is unbounded.
157+
158+
**Examples**
159+
160+
Return up to `10` similar documents based on their closeness to the vector
161+
`@q` according to the inner product metric:
162+
163+
```aql
164+
FOR doc IN coll
165+
SORT APPROX_NEAR_INNER_PRODUCT(doc.vector, @q) DESC
166+
LIMIT 10
167+
RETURN doc
168+
```
169+
170+
Return up to `5` similar documents as well as the similarity value,
171+
considering `20` neighboring centroids, that is, closest Voronoi cells:
172+
173+
```aql
174+
FOR doc IN coll
175+
LET similarity = APPROX_NEAR_INNER_PRODUCT(doc.vector, @q, { nProbe: 20 })
176+
SORT similarity DESC
177+
LIMIT 5
178+
RETURN MERGE( { similarity }, doc)
179+
```
180+
181+
Return the similarity value and the document keys of up to `3` similar documents
182+
for multiple input vectors using a subquery. In this example, the input vectors
183+
are taken from ten random documents of the same collection:
184+
185+
```aql
186+
FOR docOuter IN coll
187+
LIMIT 10
188+
LET neighbors = (
189+
FOR docInner IN coll
190+
LET similarity = APPROX_NEAR_INNER_PRODUCT(docInner.vector, docOuter.vector)
191+
SORT similarity DESC
192+
LIMIT 3
193+
RETURN { key: docInner._key, similarity }
194+
)
195+
RETURN { key: docOuter._key, neighbors }
196+
```
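
The difference between the `innerProduct` and `cosine` metrics can be sketched in plain Python with exact calculations. The helper names and sample vectors below are made up for illustration, and the AQL functions return approximate values rather than these exact ones.

```python
import math


def dot(a, b):
    """Dot product without normalization; the result is unbounded."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Dot product of the normalized vectors; the result is in [-1, 1]."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1.0, 1.0]
docs = {
    "short_aligned": [1.0, 1.0],    # same direction as the query, small magnitude
    "long_off_axis": [10.0, 0.0],   # 45 degrees off, large magnitude
}

by_inner_product = sorted(docs, key=lambda k: dot(docs[k], query), reverse=True)
by_cosine = sorted(docs, key=lambda k: cosine(docs[k], query), reverse=True)
print(by_inner_product, by_cosine)
```

The inner product ranks the long vector first because magnitude contributes to the score, whereas the cosine ranks purely by angle.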
197+
129198
### APPROX_NEAR_L2()
130199

131-
`APPROX_NEAR_L2(vector1, vector2, options) → similarity`
200+
`APPROX_NEAR_L2(vector1, vector2, options) → distance`
132201

133202
Retrieve the approximate distance using the L2 (Euclidean) metric, accelerated
134-
by a matching vector index.
203+
by a matching vector index with the `l2` metric.
135204

136205
The closer the distance is to 0, the more similar the two vectors are. The higher
137-
the value, the more different the they are. You need to sort in ascending order
206+
the value, the more different they are. You need to **sort in ascending order**
138207
so that the most similar documents come first, which is what a vector index using
139208
the `l2` metric can provide.
140209

@@ -147,7 +216,7 @@ the `l2` metric can provide.
147216
for the search results. The larger the number, the slower the search but the
148217
better the search results. If not specified, the `defaultNProbe` value of
149218
the vector index is used.
150-
- returns **similarity** (number): The approximate L2 (Euclidean) distance between
219+
- returns **distance** (number): The approximate L2 (Euclidean) distance between
151220
both vectors.
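
A short sketch in plain Python shows why the sort order is ascending for the `l2` metric. It calculates the exact Euclidean distance, not the approximate value returned by `APPROX_NEAR_L2()`, and the helper name `l2_distance` and the sample data are hypothetical.

```python
import math


def l2_distance(a, b):
    """Exact Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [0.0, 0.0]
docs = {
    "near": [1.0, 1.0],
    "far": [3.0, 4.0],
    "identical": [0.0, 0.0],
}

# Ascending order puts the smallest distance, and thus the most similar vector, first.
ranked = sorted(docs, key=lambda k: l2_distance(docs[k], query))
print(ranked)
```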
152221

153222
**Examples**

site/content/3.12/develop/http-api/indexes/vector.md

Lines changed: 8 additions & 3 deletions
@@ -88,9 +88,14 @@ paths:
8888
properties:
8989
metric:
9090
description: |
91-
Whether to use `cosine` or `l2` (Euclidean) distance calculation.
92-
type: string
93-
enum: ["cosine", "l2"]
91+
The measure for calculating the vector similarity:
92+
- `"cosine"`: Angular similarity. Vectors are automatically
93+
normalized before insertion and search.
94+
- `"innerProduct"` (introduced in v3.12.6):
95+
Similarity in terms of angle and magnitude.
96+
Vectors are not normalized, making it faster than `cosine`.
97+
- `"l2"`: Euclidean distance.
98+
enum: ["cosine", "innerProduct", "l2"]
9499
dimension:
95100
description: |
96101
The vector dimension. The attribute to index needs to

site/content/3.12/index-and-search/indexing/working-with-indexes/vector-indexes.md

Lines changed: 8 additions & 8 deletions
@@ -14,10 +14,7 @@ data numerically and can be generated with machine learning models.
1414
You can then quickly find a given number of semantically similar documents by
1515
searching for close neighbors in a high-dimensional vector space.
1616

17-
The vector index implementation uses the [Faiss library](https://github.com/facebookresearch/faiss/)
18-
to support L2 and cosine metrics. The index used is IndexIVFFlat, the quantizer
19-
for L2 is IndexFlatL2, and the cosine uses IndexFlatIP, where vectors are
20-
normalized before insertion and search.
17+
The vector index implementation uses the [Faiss library](https://github.com/facebookresearch/faiss/).
2118

2219
## How to use vector indexes
2320

@@ -75,7 +72,13 @@ centroids and the quality of vector search thus degrades.
7572
write operations by not using an exclusive write lock for the duration
7673
of the index creation. The default is `false`.
7774
- **params**: The parameters as used by the Faiss library.
78-
- **metric** (string): Whether to use `cosine` or `l2` (Euclidean) distance calculation.
75+
- **metric** (string): The measure for calculating the vector similarity:
76+
- `"cosine"`: Angular similarity. Vectors are automatically
77+
normalized before insertion and search.
78+
- `"innerProduct"` (introduced in v3.12.6):
79+
Similarity in terms of angle and magnitude.
80+
Vectors are not normalized, making it faster than `cosine`.
81+
- `"l2"`: Euclidean distance.
7982
- **dimension** (number): The vector dimension. The attribute to index needs to
8083
have this many elements in the array that stores the vector embedding.
8184
- **nLists** (number): The number of Voronoi cells to partition the vector space
@@ -115,7 +118,6 @@ centroids and the quality of vector search thus degrades.
115118
{{< tabs "interfaces" >}}
116119

117120
{{< tab "Web interface" >}}
118-
{{< comment >}}TODO: Only in v3.12.6+
119121
1. In the **Collections** section, click the name or row of the desired collection.
120122
2. Go to the **Indexes** tab.
121123
3. Click **Add index**.
@@ -125,8 +127,6 @@ centroids and the quality of vector search thus degrades.
125127
under `param`.
126128
7. Optionally give the index a user-defined name.
127129
8. Click **Create**.
128-
{{< /comment >}}
129-
The web interface does not support vector indexes yet.
130130
{{< /tab >}}
131131

132132
{{< tab "arangosh" >}}

site/content/3.12/release-notes/version-3.12/incompatible-changes-in-3-12.md

Lines changed: 11 additions & 0 deletions
@@ -900,6 +900,17 @@ the following steps.
900900
4. Restore the dump to the new deployment. You can directly move from any
901901
3.11 or 3.12 version to 3.12.4 (or later) this way.
902902

903+
## Cosine similarity fix for vector indexes
904+
905+
<small>Introduced in: v3.12.6</small>
906+
907+
A normalization issue has been addressed for the experimental vector index type.
908+
It was possible for the cosine similarity value returned by `APPROX_NEAR_COSINE()`
909+
to be outside the expected range of `[-1, 1]`.
910+
911+
It is recommended to recreate all vector indexes that use the `cosine` metric
912+
after upgrading to v3.12.6 or later.
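
Recreating an index means dropping it and creating it again with the same definition. As a rough sketch, the following Python helper builds such a definition as a JSON body. The `metric`, `dimension`, and `nLists` parameters and the allowed metric values come from the index documentation in this commit. The `type` and `fields` attributes follow the general index definition format and are assumptions here, as are the helper name `vector_index_definition` and the example values.

```python
import json

# Metric values as listed in the OpenAPI enum of this commit.
ALLOWED_METRICS = ("cosine", "innerProduct", "l2")


def vector_index_definition(field, metric, dimension, n_lists):
    """Build a vector index definition (hypothetical helper, not a driver API)."""
    if metric not in ALLOWED_METRICS:
        raise ValueError(f"unsupported metric: {metric!r}")
    return {
        "type": "vector",        # assumed index type name
        "fields": [field],       # assumed: the attribute storing the embedding
        "params": {
            "metric": metric,
            "dimension": dimension,
            "nLists": n_lists,
        },
    }


definition = vector_index_definition("vector", "cosine", dimension=128, n_lists=100)
body = json.dumps(definition)
print(body)
```

Keeping the definition in one place makes it easier to create the index again with identical parameters after the old one has been dropped.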
913+
903914
## HTTP RESTful API
904915

905916
### JavaScript-based traversal using `/_api/traversal` removed

site/content/3.12/release-notes/version-3.12/whats-new-in-3-12.md

Lines changed: 12 additions & 0 deletions
@@ -1443,6 +1443,18 @@ utilizing vector indexes in queries.
14431443
Furthermore, a new error code `ERROR_QUERY_VECTOR_SEARCH_NOT_APPLIED` (1554)
14441444
has been added.
14451445

1446+
---
1447+
1448+
<small>Introduced in: v3.12.6</small>
1449+
1450+
Another metric has been added. The `innerProduct` is a vector similarity measure
1451+
calculated using the dot product of two vectors without normalizing them.
1452+
Therefore, it compares not only the angle but also the magnitudes.
1453+
1454+
The accompanying AQL function is the following:
1455+
1456+
- `APPROX_NEAR_INNER_PRODUCT()`
1457+
14461458
## Server options
14471459

14481460
### Effective and available startup options

site/content/3.13/aql/functions/vector.md

Lines changed: 81 additions & 12 deletions
@@ -64,13 +64,14 @@ be found depends on the data as well as the search effort (see the `nProbe` opti
6464

6565
`APPROX_NEAR_COSINE(vector1, vector2, options) → similarity`
6666

67-
Retrieve the approximate angular similarity using the cosine metric, accelerated
68-
by a matching vector index.
6967

70-
The higher the cosine similarity value is, the more similar the two vectors
71-
are. The closer it is to 0, the more different they are. The value can also
72-
be negative, indicating that the vectors are not similar and point in opposite
73-
directions. You need to sort in descending order so that the most similar
68+
Retrieve the approximate cosine of the angle between two vectors, accelerated
69+
by a matching vector index with the `cosine` metric.
70+
71+
The closer the similarity value is to 1, the more similar the two vectors
72+
are. The closer it is to 0, the more different they are. The value can also be
73+
negative, down to -1, indicating that the vectors are not similar and point in opposite
74+
directions. You need to **sort in descending order** so that the most similar
7475
documents come first, which is what a vector index using the `cosine` metric
7576
can provide.
7677

@@ -83,8 +84,8 @@ can provide.
8384
closest Voronoi cells to consider for the search results. The larger the number,
8485
the slower the search but the better the search results. If not specified, the
8586
`defaultNProbe` value of the vector index is used.
86-
- returns **similarity** (number): The approximate angular similarity between
87-
both vectors.
87+
- returns **similarity** (number): The approximate cosine similarity of
88+
both normalized vectors. The value range is `[-1, 1]`.
8889

8990
**Examples**
9091

@@ -126,15 +127,83 @@ FOR docOuter IN coll
126127
RETURN { key: docOuter._key, neighbors }
127128
```
128129

130+
### APPROX_NEAR_INNER_PRODUCT()
131+
132+
<small>Introduced in: v3.12.6</small>
133+
134+
`APPROX_NEAR_INNER_PRODUCT(vector1, vector2, options) → similarity`
135+
136+
Retrieve the approximate dot product of two vectors, accelerated by a matching
137+
vector index with the `innerProduct` metric.
138+
139+
The higher the similarity value is, the more similar the two vectors
140+
are. The closer it is to 0, the more different they are. The value can also
141+
be negative, indicating that the vectors are not similar and point in opposite
142+
directions. You need to **sort in descending order** so that the most similar
143+
documents come first, which is what a vector index using the `innerProduct`
144+
metric can provide.
145+
146+
- **vector1** (array of numbers): The first vector. Either this parameter or
147+
`vector2` needs to reference a stored attribute holding the vector embedding.
148+
- **vector2** (array of numbers): The second vector. Either this parameter or
149+
`vector1` needs to reference a stored attribute holding the vector embedding.
150+
- **options** (object, _optional_):
151+
- **nProbe** (number, _optional_): How many neighboring centroids, that is,
152+
closest Voronoi cells to consider for the search results. The larger the number,
153+
the slower the search but the better the search results. If not specified, the
154+
`defaultNProbe` value of the vector index is used.
155+
- returns **similarity** (number): The approximate dot product
156+
of both vectors without normalization. The value range is unbounded.
157+
158+
**Examples**
159+
160+
Return up to `10` similar documents based on their closeness to the vector
161+
`@q` according to the inner product metric:
162+
163+
```aql
164+
FOR doc IN coll
165+
SORT APPROX_NEAR_INNER_PRODUCT(doc.vector, @q) DESC
166+
LIMIT 10
167+
RETURN doc
168+
```
169+
170+
Return up to `5` similar documents as well as the similarity value,
171+
considering `20` neighboring centroids, that is, closest Voronoi cells:
172+
173+
```aql
174+
FOR doc IN coll
175+
LET similarity = APPROX_NEAR_INNER_PRODUCT(doc.vector, @q, { nProbe: 20 })
176+
SORT similarity DESC
177+
LIMIT 5
178+
RETURN MERGE( { similarity }, doc)
179+
```
180+
181+
Return the similarity value and the document keys of up to `3` similar documents
182+
for multiple input vectors using a subquery. In this example, the input vectors
183+
are taken from ten random documents of the same collection:
184+
185+
```aql
186+
FOR docOuter IN coll
187+
LIMIT 10
188+
LET neighbors = (
189+
FOR docInner IN coll
190+
LET similarity = APPROX_NEAR_INNER_PRODUCT(docInner.vector, docOuter.vector)
191+
SORT similarity DESC
192+
LIMIT 3
193+
RETURN { key: docInner._key, similarity }
194+
)
195+
RETURN { key: docOuter._key, neighbors }
196+
```
197+
129198
### APPROX_NEAR_L2()
130199

131-
`APPROX_NEAR_L2(vector1, vector2, options) → similarity`
200+
`APPROX_NEAR_L2(vector1, vector2, options) → distance`
132201

133202
Retrieve the approximate distance using the L2 (Euclidean) metric, accelerated
134-
by a matching vector index.
203+
by a matching vector index with the `l2` metric.
135204

136205
The closer the distance is to 0, the more similar the two vectors are. The higher
137-
the value, the more different the they are. You need to sort in ascending order
206+
the value, the more different they are. You need to **sort in ascending order**
138207
so that the most similar documents come first, which is what a vector index using
139208
the `l2` metric can provide.
140209

@@ -147,7 +216,7 @@ the `l2` metric can provide.
147216
for the search results. The larger the number, the slower the search but the
148217
better the search results. If not specified, the `defaultNProbe` value of
149218
the vector index is used.
150-
- returns **similarity** (number): The approximate L2 (Euclidean) distance between
219+
- returns **distance** (number): The approximate L2 (Euclidean) distance between
151220
both vectors.
152221

153222
**Examples**

site/content/3.13/develop/http-api/indexes/vector.md

Lines changed: 8 additions & 3 deletions
@@ -88,9 +88,14 @@ paths:
8888
properties:
8989
metric:
9090
description: |
91-
Whether to use `cosine` or `l2` (Euclidean) distance calculation.
92-
type: string
93-
enum: ["cosine", "l2"]
91+
The measure for calculating the vector similarity:
92+
- `"cosine"`: Angular similarity. Vectors are automatically
93+
normalized before insertion and search.
94+
- `"innerProduct"` (introduced in v3.12.6):
95+
Similarity in terms of angle and magnitude.
96+
Vectors are not normalized, making it faster than `cosine`.
97+
- `"l2"`: Euclidean distance.
98+
enum: ["cosine", "innerProduct", "l2"]
9499
dimension:
95100
description: |
96101
The vector dimension. The attribute to index needs to

site/content/3.13/index-and-search/indexing/working-with-indexes/vector-indexes.md

Lines changed: 8 additions & 5 deletions
@@ -14,10 +14,7 @@ data numerically and can be generated with machine learning models.
1414
You can then quickly find a given number of semantically similar documents by
1515
searching for close neighbors in a high-dimensional vector space.
1616

17-
The vector index implementation uses the [Faiss library](https://github.com/facebookresearch/faiss/)
18-
to support L2 and cosine metrics. The index used is IndexIVFFlat, the quantizer
19-
for L2 is IndexFlatL2, and the cosine uses IndexFlatIP, where vectors are
20-
normalized before insertion and search.
17+
The vector index implementation uses the [Faiss library](https://github.com/facebookresearch/faiss/).
2118

2219
## How to use vector indexes
2320

@@ -75,7 +72,13 @@ centroids and the quality of vector search thus degrades.
7572
write operations by not using an exclusive write lock for the duration
7673
of the index creation. The default is `false`.
7774
- **params**: The parameters as used by the Faiss library.
78-
- **metric** (string): Whether to use `cosine` or `l2` (Euclidean) distance calculation.
75+
- **metric** (string): The measure for calculating the vector similarity:
76+
- `"cosine"`: Angular similarity. Vectors are automatically
77+
normalized before insertion and search.
78+
- `"innerProduct"` (introduced in v3.12.6):
79+
Similarity in terms of angle and magnitude.
80+
Vectors are not normalized, making it faster than `cosine`.
81+
- `"l2"`: Euclidean distance.
7982
- **dimension** (number): The vector dimension. The attribute to index needs to
8083
have this many elements in the array that stores the vector embedding.
8184
- **nLists** (number): The number of Voronoi cells to partition the vector space
