TiDB vector search doc #18502
Merged
ti-chi-bot
merged 121 commits into
pingcap:master
from
EricZequan:VectorFunction-and-VectorIndex
Oct 22, 2024
Changes from 116 commits
48f906c
TiDB vector data type and vector index Doc
EricZequan 0b31525
remove vector index part
EricZequan 96f2701
modify cluster type
EricZequan 3fdab1b
fix
EricZequan 98e417b
modify expression
EricZequan 9bd83d6
fix ci
EricZequan 52934cb
fix comment
EricZequan 26027f9
fix ci
EricZequan df68638
fix ci
EricZequan febd534
fix comment
EricZequan 69429af
vector-search-overview: refine descriptions
qiancai d285b55
vector-search-data-types: refine descriptions
qiancai 3852ed3
vector-search-functions-and-operators: refine descriptions
qiancai 93b7e90
add remaining doc
EricZequan 15e18d0
remove rows
EricZequan 402f1d0
fix
EricZequan f869c42
get started: refine descriptions
qiancai 374b687
Merge branch 'VectorFunction-and-VectorIndex' of https://github.com/E…
qiancai ea2b0b1
integrate-with-django-orm: refine descriptions
qiancai 2cf3e79
Apply suggestions from code review
qiancai 0eea4d7
fix comment
EricZequan 93fe602
fix comment
EricZequan abe5eac
integrate-with-peewee/sqlalchemy: refine descriptions
qiancai a01ba0e
Merge branch 'VectorFunction-and-VectorIndex' of https://github.com/E…
qiancai 48f47af
get-started and integrate-with-jinaai-embedding: refine descriptions
qiancai f9b6dc0
get-started-using-sql: update connection instructions
qiancai ae94864
Update vector-search-data-types.md
breezewish f485b58
integrate-with-llamaindex: refine descriptions
qiancai 2e56e20
integrate-with-langchain: refine descriptions
qiancai bb7ce0b
overview and limitation: refine descriptions
qiancai d50b638
add vector index doc introduction
EricZequan 5290946
fix comment
EricZequan eb604f7
modify introduction of self-hosted tidb connection type
EricZequan 65656f8
modify tidb connection when using tidb self-hosted
EricZequan 99f9ab7
fix comment
EricZequan 2c5efa4
fix comment
EricZequan b2e32b7
fix comment
EricZequan 9a51513
shorten index example case
EricZequan c09d85f
fix comment
EricZequan bf063b7
fix comment
EricZequan 9eaf1f7
fix comment
EricZequan c62f7c9
fix comment
EricZequan 0083ea1
fix comment
EricZequan 6a673b9
index & improve performance: refine descriptions
qiancai 5ebe116
Merge branch 'VectorFunction-and-VectorIndex' of https://github.com/E…
qiancai a250dff
add vector index part in other document
EricZequan 94e8ab5
Merge branch 'VectorFunction-and-VectorIndex' of https://github.com/E…
EricZequan 634c602
modify index name when create vector index
EricZequan 4b54e6d
Update vector-search-improve-performance.md
EricZequan 5eeb336
refine descriptions for TiDB self-managed connection
qiancai c6dad29
Merge branch 'VectorFunction-and-VectorIndex' of https://github.com/E…
qiancai 9a77c65
fix comment
EricZequan 502bd52
vector-index: refine descriptions
qiancai 652b639
remove index part when create table in integration-doc
EricZequan 05784ea
Resolve merge conflicts
EricZequan cfacbef
fix comment
EricZequan 3a68b70
Merge remote-tracking branch 'upstream/master' into pr/18502
qiancai eb42ae8
TiDB Serverless -> TiDB Cloud Serverless
qiancai 8d938d4
add the experimental warning
qiancai f7a31f2
fix comment
EricZequan 894fcd4
fix comment
EricZequan 92e1aee
fix comment
EricZequan 0f91e5a
Apply suggestions from code review
qiancai 39958f3
UI changes: Endpoint Type -> Connection Type
qiancai f8af8f6
fix comment
EricZequan 12abce8
fix comment
EricZequan c9ef22f
fix comment
EricZequan 3350c10
remove 'vector64()' sytax
EricZequan f18d840
Update desc about tiflash upgrade
JaySon-Huang 9cbc09e
Update desc about br support
JaySon-Huang 13bc862
Add limitation about BR restore
JaySon-Huang 7704118
Update desc about limitation
JaySon-Huang d135903
add limit about cdc
wk989898 f374485
Update tiflash-configuration
JaySon-Huang 2ba7811
fix comment
EricZequan c7fedb4
Apply suggestions from code review
EricZequan 28164e7
fix comment
EricZequan b862a36
Merge remote-tracking branch 'upstream/master' into pr/18502
qiancai 7c4e8bd
Merge branch 'VectorFunction-and-VectorIndex' of https://github.com/E…
qiancai 95338ee
Update TOC.md
qiancai 1b5fa2f
Apply suggestions from code review
EricZequan 03d3cb5
Apply suggestions from code review
JaySon-Huang 1c39b37
Add limitation about encryption-at-rest
JaySon-Huang 0010207
Update format
lilin90 9fee36a
Update wording and format
lilin90 ad6eeb1
Remove description about future features
lilin90 44995da
Update wording
lilin90 ead511b
Merge branch 'VectorFunction-and-VectorIndex' of https://github.com/E…
lilin90 7fbbed3
Update wording
lilin90 f51d860
Using upper case VEC_COSINE_DISTANCE instead
JaySon-Huang c00b0fc
make "USING HNSW" as default
JaySon-Huang 6d13acc
Apply suggestions from code review
JaySon-Huang 36d03f9
Apply suggestions from code review
EricZequan e03d678
format udpates
qiancai 30118d0
remove "或删除" from the experimental warning
qiancai fac90f3
Merge branch 'VectorFunction-and-VectorIndex' of https://github.com/E…
qiancai 5480123
update
EricZequan 5e0c7d5
Apply suggestions from code review
EricZequan 18a286b
Apply suggestions from code review
EricZequan 97d997e
add "Cast between Vector ⇔ other data types" back
qiancai a93a7d2
Apply suggestions from code review
EricZequan dd38923
refine descriptions in vector-search-limitations
qiancai 6243bd0
fix comment
EricZequan 52e696f
vector-search-index: refine new changes
qiancai bf35116
Apply suggestions from code review
qiancai 6fa2321
fix a broken link
qiancai a13da5f
fix broken links
qiancai f87d771
Apply suggestions from code review
EricZequan 960218c
fix index naming
JaySon-Huang 8b3f476
Apply suggestions from code review
EricZequan 7ace7de
Update vector-search-improve-performance.md
EricZequan a3cd4e8
Apply suggestions from code review
EricZequan 9aa03cd
remove ORM operation
EricZequan b8a100a
remove part of ORM intro
EricZequan 57aacb9
Apply suggestions from code review
EricZequan 4f3560b
Update vector-search-index.md
EricZequan d82bb79
fix a broken link
qiancai c4d049e
add ORM-non-index doc
EricZequan cc21bc8
Revert "remove part of ORM intro"
EricZequan b50a16e
fix comment
EricZequan 7125775
Update punctuation
lilin90
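@@ -0,0 +1,245 @@
---
title: Vector Data Types
summary: This document introduces the vector data types in TiDB.
---

# Vector Data Types

A vector is a sequence of floating-point numbers, for example `[0.3, 0.5, -0.1, ...]`. TiDB provides dedicated vector data types for efficiently storing and accessing the vector embeddings that are widely used in AI applications.

> **Warning:**
>
> This feature is experimental and is not recommended for production environments. It might change without prior notice. If you find a bug, report it by filing an [issue](https://github.com/pingcap/tidb/issues) on GitHub.

The following vector data types are currently supported:

- `VECTOR`: stores a sequence of single-precision floating-point numbers (Float). The vector can have any number of dimensions.
- `VECTOR(D)`: stores a sequence of single-precision floating-point numbers (Float). The number of dimensions is fixed at `D`.

Compared with the [`JSON`](/data-type-json.md) type, the vector types offer the following advantages:

- Vector index support. You can build a [vector search index](/vector-search-index.md) to speed up queries.
- Dimension enforcement. Once a fixed dimension is specified, data with a different dimension is rejected on write.
- Better storage format. The vector data types are optimized specifically for vector data and outperform `JSON` in both space usage and performance.
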
## Syntax

You can use a string in the following format to represent a vector value:

```sql
'[<float>, <float>, ...]'
```

Example:

```sql
CREATE TABLE vector_table (
    id INT PRIMARY KEY,
    embedding VECTOR(3)
);

INSERT INTO vector_table VALUES (1, '[0.3, 0.5, -0.1]');

INSERT INTO vector_table VALUES (2, NULL);
```

If you insert a string that does not conform to this syntax as vector data, TiDB returns an error:

```sql
[tidb]> INSERT INTO vector_table VALUES (3, '[5, ]');
ERROR 1105 (HY000): Invalid vector text: [5, ]
```

In the following example, the dimension of the `embedding` vector column was defined as `3` when the table was created, so TiDB returns an error when you insert a vector with a different dimension:

```sql
[tidb]> INSERT INTO vector_table VALUES (4, '[0.3, 0.5]');
ERROR 1105 (HY000): vector has 2 dimensions, does not fit VECTOR(3)
```

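The two checks above (text syntax and fixed dimension) can be modelled on the client side before data is sent to TiDB. The following is a minimal Python sketch, not TiDB's own parser: the function name `parse_vector_text` and its `dims` parameter are hypothetical, and the error messages only imitate the ones shown above.

```python
from typing import List, Optional


def parse_vector_text(text: str, dims: Optional[int] = None) -> List[float]:
    """Parse a string such as '[0.3, 0.5, -0.1]' into a list of floats.

    `dims` imitates a fixed-dimension column such as VECTOR(3);
    leaving it as None imitates a plain VECTOR column.
    This is an illustrative model, not TiDB's implementation.
    """
    stripped = text.strip()
    if not (stripped.startswith("[") and stripped.endswith("]")):
        raise ValueError(f"Invalid vector text: {text}")

    inner = stripped[1:-1].strip()
    values: List[float] = []
    if inner:  # '[]' is treated as an empty vector
        for part in inner.split(","):
            try:
                values.append(float(part))
            except ValueError:
                # An empty or non-numeric element, as in '[5, ]'
                raise ValueError(f"Invalid vector text: {text}") from None

    if dims is not None and len(values) != dims:
        raise ValueError(
            f"vector has {len(values)} dimensions, does not fit VECTOR({dims})"
        )
    return values
```

For example, `parse_vector_text('[0.3, 0.5, -0.1]', dims=3)` returns the three values, while `parse_vector_text('[5, ]')` and `parse_vector_text('[0.3, 0.5]', dims=3)` both raise `ValueError`.
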
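For all functions and operators supported by the vector data types, see [Vector Functions and Operators](/vector-search-functions-and-operators.md).

For more information about vector search indexes, see [Vector Search Index](/vector-search-index.md).
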
## Store vectors with different dimensions

If you omit the dimension parameter of the `VECTOR` type, you can store vectors with different dimensions in the same column:

```sql
CREATE TABLE vector_table (
    id INT PRIMARY KEY,
    embedding VECTOR
);

INSERT INTO vector_table VALUES (1, '[0.3, 0.5, -0.1]'); -- Insert a 3-dimensional vector
INSERT INTO vector_table VALUES (2, '[0.3, 0.5]');       -- Insert a 2-dimensional vector
```

Note that you cannot build a [vector search index](/vector-search-index.md) on a column that stores vectors with different dimensions, because vector distances can only be calculated between vectors of the same dimension.

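To see why a distance is only defined for vectors of the same dimension, consider the Euclidean distance, which pairs up elements position by position. The sketch below is a plain Python illustration under that assumption. The helper name `l2_distance` is hypothetical and the error wording is borrowed from the arithmetic error shown elsewhere in this document; it is not a description of how TiDB computes distances.

```python
import math
from typing import Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of the same dimension."""
    if len(a) != len(b):
        # With mismatched dimensions some elements have no counterpart,
        # so the sum below would be undefined.
        raise ValueError(
            f"vectors have different dimensions: {len(a)} and {len(b)}"
        )
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Calling `l2_distance([0.3, 0.5, -0.1], [0.3, 0.5])` raises `ValueError`, which mirrors the reason an index cannot be built over such a column.
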
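## Comparison

You can compare vector data using [comparison operators](/vector-search-functions-and-operators.md#扩展的内置函数和运算符) such as `=`, `!=`, `<`, `>`, `<=`, and `>=`. For all functions and operators supported by the vector data types, see [Vector Functions and Operators](/vector-search-functions-and-operators.md).

When comparing vectors, TiDB compares them element by element, in order. For example:

- `[1] < [12]`
- `[1,2,3] < [1,2,5]`
- `[1,2,3] = [1,2,3]`
- `[2,2,3] > [1,2,3]`

When two vectors have different dimensions, TiDB compares them in lexicographical order, according to the following rules:

- The elements of the two vectors are compared numerically, one pair at a time.
- The first pair of elements that differ determines the result of the comparison between the two vectors.
- If one vector is a prefix of the other, the vector with fewer dimensions is **less than** the vector with more dimensions. For example, `[1,2,3] < [1,2,3,0]`.
- Two vectors of the same length with identical elements are **equal**.
- An empty vector is **less than** any non-empty vector. For example, `[] < [1]`.
- Two empty vectors are **equal**.
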
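The comparison rules above can be sketched in a few lines of Python. This is only a model of the stated rules, with a hypothetical function name `compare_vectors`; it is not TiDB's implementation.

```python
from typing import Sequence


def compare_vectors(a: Sequence[float], b: Sequence[float]) -> int:
    """Return -1 if a < b, 0 if a = b, and 1 if a > b (lexicographical order)."""
    # Compare element pairs in order; the first differing pair decides.
    for x, y in zip(a, b):
        if x != y:
            return -1 if x < y else 1
    # All shared positions are equal, so either the vectors are identical
    # or one is a prefix of the other; the shorter one is the smaller.
    return (len(a) > len(b)) - (len(a) < len(b))
```

Python's built-in tuple comparison follows the same ordering, so `tuple(a) < tuple(b)` gives the same answer as `compare_vectors(a, b) == -1`.
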
When comparing vectors, use an [explicit cast](#类型转换-cast) to convert the data from strings to the vector type. Otherwise, TiDB compares the values as strings:

```sql
-- The values given are actually strings, so TiDB compares them as strings
[tidb]> SELECT '[12.0]' < '[4.0]';
+--------------------+
| '[12.0]' < '[4.0]' |
+--------------------+
|                  1 |
+--------------------+
1 row in set (0.01 sec)

-- Explicitly cast to the vector type so that the vector comparison rules apply
[tidb]> SELECT VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]');
+--------------------------------------------------+
| VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]') |
+--------------------------------------------------+
|                                                0 |
+--------------------------------------------------+
1 row in set (0.01 sec)
```

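The reason the string comparison returns `1` is that strings are compared character by character: both values start with `[`, and the next characters are `1` and `4`, so `'[12.0]'` sorts first. The following Python snippet reproduces the effect. It assumes plain code-point string ordering (a SQL collation may order some characters differently, although not these ones) and uses `json.loads` only as a convenient stand-in for parsing vector text.

```python
import json

# Compared as strings: '[' == '[', then '1' < '4', so the result is True.
as_strings = '[12.0]' < '[4.0]'

# Compared as numeric sequences: 12.0 < 4.0 is False.
as_vectors = json.loads('[12.0]') < json.loads('[4.0]')

print(as_strings, as_vectors)
```

The two results differ in the same way as the two SQL queries above.
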
## Arithmetic

The vector data types support the arithmetic operators `+` and `-`, which perform element-wise addition and subtraction of two vectors. Arithmetic between vectors of different dimensions is not supported and results in an error.

Examples:

```sql
[tidb]> SELECT VEC_FROM_TEXT('[4]') + VEC_FROM_TEXT('[5]');
+---------------------------------------------+
| VEC_FROM_TEXT('[4]') + VEC_FROM_TEXT('[5]') |
+---------------------------------------------+
| [9]                                         |
+---------------------------------------------+
1 row in set (0.01 sec)

[tidb]> SELECT VEC_FROM_TEXT('[2,3,4]') - VEC_FROM_TEXT('[1,2,3]');
+-----------------------------------------------------+
| VEC_FROM_TEXT('[2,3,4]') - VEC_FROM_TEXT('[1,2,3]') |
+-----------------------------------------------------+
| [1,1,1]                                             |
+-----------------------------------------------------+
1 row in set (0.01 sec)

[tidb]> SELECT VEC_FROM_TEXT('[4]') + VEC_FROM_TEXT('[1,2,3]');
ERROR 1105 (HY000): vectors have different dimensions: 1 and 3
```

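Element-wise arithmetic with a dimension check can be modelled as follows. This is a minimal Python sketch of the behaviour described above; the helper names `vec_add` and `vec_sub` are hypothetical, and the error text only imitates the TiDB message.

```python
from typing import List, Sequence


def _check_same_dims(a: Sequence[float], b: Sequence[float]) -> None:
    """Reject operands whose dimensions differ."""
    if len(a) != len(b):
        raise ValueError(
            f"vectors have different dimensions: {len(a)} and {len(b)}"
        )


def vec_add(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Element-wise addition of two vectors of the same dimension."""
    _check_same_dims(a, b)
    return [x + y for x, y in zip(a, b)]


def vec_sub(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Element-wise subtraction of two vectors of the same dimension."""
    _check_same_dims(a, b)
    return [x - y for x, y in zip(a, b)]
```

With these helpers, `vec_add([4], [5])` gives `[9]` and `vec_sub([2, 3, 4], [1, 2, 3])` gives `[1, 1, 1]`, while `vec_add([4], [1, 2, 3])` raises `ValueError`.
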
## Type conversion (Cast)

### Convert between vectors and strings

You can use the following functions to convert between vectors and strings:

- `CAST(... AS VECTOR)`: converts a string to a vector
- `CAST(... AS CHAR)`: converts a vector to a string
- `VEC_FROM_TEXT`: converts a string to a vector
- `VEC_AS_TEXT`: converts a vector to a string

For ease of use, if a function accepts only vector data types (for example, the vector distance functions), you can also pass a correctly formatted string directly, and TiDB converts it implicitly:

```sql
-- VEC_DIMS accepts only vector types, so you can pass a string directly and TiDB implicitly converts it to a vector:
[tidb]> SELECT VEC_DIMS('[0.3, 0.5, -0.1]');
+------------------------------+
| VEC_DIMS('[0.3, 0.5, -0.1]') |
+------------------------------+
|                            3 |
+------------------------------+
1 row in set (0.01 sec)

-- You can also use VEC_FROM_TEXT to explicitly convert the string to a vector before passing it to VEC_DIMS:
[tidb]> SELECT VEC_DIMS(VEC_FROM_TEXT('[0.3, 0.5, -0.1]'));
+---------------------------------------------+
| VEC_DIMS(VEC_FROM_TEXT('[0.3, 0.5, -0.1]')) |
+---------------------------------------------+
|                                           3 |
+---------------------------------------------+
1 row in set (0.01 sec)

-- You can also convert explicitly using CAST(... AS VECTOR):
[tidb]> SELECT VEC_DIMS(CAST('[0.3, 0.5, -0.1]' AS VECTOR));
+----------------------------------------------+
| VEC_DIMS(CAST('[0.3, 0.5, -0.1]' AS VECTOR)) |
+----------------------------------------------+
|                                            3 |
+----------------------------------------------+
1 row in set (0.01 sec)
```

When an operator or function accepts multiple data types, TiDB does not perform implicit conversion. In that case, explicitly convert the string to a vector before passing it to the operator or function. For example, before a comparison you must explicitly convert strings to vectors; otherwise TiDB compares the values as strings rather than as vectors:

```sql
-- The values passed in are strings, so TiDB compares them as strings:
[tidb]> SELECT '[12.0]' < '[4.0]';
+--------------------+
| '[12.0]' < '[4.0]' |
+--------------------+
|                  1 |
+--------------------+
1 row in set (0.01 sec)

-- Convert to the vector type so that the vector comparison rules apply:
[tidb]> SELECT VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]');
+--------------------------------------------------+
| VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]') |
+--------------------------------------------------+
|                                                0 |
+--------------------------------------------------+
1 row in set (0.01 sec)
```

You can also explicitly convert a vector to a string. The following example uses the `VEC_AS_TEXT()` function:

```sql
-- The string is first implicitly converted to a vector and then explicitly converted back to a string, so the result is a normalized string:
[tidb]> SELECT VEC_AS_TEXT('[0.3, 0.5, -0.1]');
+---------------------------------+
| VEC_AS_TEXT('[0.3, 0.5, -0.1]') |
+---------------------------------+
| [0.3,0.5,-0.1]                  |
+---------------------------------+
1 row in set (0.01 sec)
```

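The round trip in this example (string to vector, then vector back to string) is what removes the spaces from the output. A rough Python model of that normalization is shown below. The function name `normalize_vector_text` is hypothetical, `json.loads` stands in for vector-text parsing, and the `"g"` float format (six significant digits) is an assumption that happens to reproduce the outputs in this document; TiDB's actual float formatting may differ.

```python
import json


def normalize_vector_text(text: str) -> str:
    """Parse vector text and re-serialize it without whitespace."""
    values = json.loads(text)
    if not isinstance(values, list):
        raise ValueError(f"Invalid vector text: {text}")
    # Format each element compactly, e.g. 9.0 -> '9' and 0.3 -> '0.3'.
    return "[" + ",".join(format(float(v), "g") for v in values) + "]"
```

For example, `normalize_vector_text('[0.3, 0.5, -0.1]')` returns `'[0.3,0.5,-0.1]'`.
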
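For other conversion functions, see [Vector Functions and Operators](/vector-search-functions-and-operators.md).

### Convert between vectors and other data types

Currently, TiDB cannot convert directly between vectors and other data types (such as `JSON`). However, you can use a string as an intermediate type within your SQL statement to perform the conversion.

Note that a vector column stored in a table cannot be converted to another data type using `ALTER TABLE ... MODIFY COLUMN ...`.

## Restrictions

For restrictions on the vector types, see [Vector Search Limitations](/vector-search-limitations.md) and [Vector Search Index Restrictions](/vector-search-index.md#使用限制).

## MySQL compatibility

The vector data types are supported only in TiDB, not in MySQL.

## See also

- [Vector Functions and Operators](/vector-search-functions-and-operators.md)
- [Vector Search Index](/vector-search-index.md)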
preview: https://pr.pingcap-docsite-preview.pages.dev/zh/tidb/dev/vector-search-overview