title | summary |
---|---|
Get Started with Vector Search via SQL |
Learn how to quickly get started with Vector Search in TiDB using SQL statements to power your generative AI applications. |
TiDB extends MySQL syntax to support Vector Search and introduce new Vector data types and several vector functions.
This tutorial demonstrates how to get started with TiDB Vector Search just using SQL statements. You will learn how to use the MySQL command-line client to complete the following operations:
- Connect to your TiDB cluster.
- Create a vector table.
- Store vector embeddings.
- Perform vector search queries.
Warning:
The vector search feature is experimental. It is not recommended that you use it in the production environment. This feature might be changed without prior notice. If you find a bug, you can report an issue on GitHub.
Note:
The vector search feature is only available for TiDB Self-Managed clusters and TiDB Cloud Serverless clusters.
To complete this tutorial, you need:
- MySQL command-line client (MySQL CLI) installed on your machine.
- A TiDB cluster.
If you don't have a TiDB cluster, you can create one as follows:
- Follow Deploy a local test TiDB cluster or Deploy a production TiDB cluster to create a local cluster.
- Follow Creating a TiDB Cloud Serverless cluster to create your own TiDB Cloud cluster.
If you don't have a TiDB cluster, you can create one as follows:
- (Recommended) Follow Creating a TiDB Cloud Serverless cluster to create your own TiDB Cloud cluster.
- Follow Deploy a local test TiDB cluster or Deploy a production TiDB cluster to create a local cluster of v8.4.0 or a later version.
Connect to your TiDB cluster depending on the TiDB deployment option you've selected.
-
Navigate to the Clusters page, and then click the name of your target cluster to go to its overview page.
-
Click Connect in the upper-right corner. A connection dialog is displayed.
-
In the connection dialog, select MySQL CLI from the Connect With drop-down list and keep the default setting of the Connection Type as Public.
-
If you have not set a password yet, click Generate Password to generate a random password.
-
Copy the connection command and paste it into your terminal. The following is an example for macOS:
mysql -u '<prefix>.root' -h '<host>' -P 4000 -D 'test' --ssl-mode=VERIFY_IDENTITY --ssl-ca=/etc/ssl/cert.pem -p'<password>'
After your TiDB Self-Managed cluster is started, execute your cluster connection command in the terminal.
The following is an example connection command for macOS:
mysql --comments --host 127.0.0.1 --port 4000 -u root
When creating a table, you can define a column as a vector column by specifying the VECTOR
data type.
For example, to create a table embedded_documents
with a three-dimensional VECTOR
column, execute the following SQL statements using your MySQL CLI:
USE test;
CREATE TABLE embedded_documents (
id INT PRIMARY KEY,
-- Column to store the original content of the document.
document TEXT,
-- Column to store the vector representation of the document.
embedding VECTOR(3)
);
The expected output is as follows:
Query OK, 0 rows affected (0.27 sec)
Insert three documents with their vector embeddings into the embedded_documents
table:
INSERT INTO embedded_documents
VALUES
(1, 'dog', '[1,2,1]'),
(2, 'fish', '[1,2,4]'),
(3, 'tree', '[1,0,0]');
The expected output is as follows:
Query OK, 3 rows affected (0.15 sec)
Records: 3 Duplicates: 0 Warnings: 0
Note
This example simplifies the dimensions of the vector embeddings and uses only 3-dimensional vectors for demonstration purposes.
In real-world applications, embedding models often produce vector embeddings with hundreds or thousands of dimensions.
To verify that the documents have been inserted correctly, query the embedded_documents
table:
SELECT * FROM embedded_documents;
The expected output is as follows:
+----+----------+-----------+
| id | document | embedding |
+----+----------+-----------+
| 1 | dog | [1,2,1] |
| 2 | fish | [1,2,4] |
| 3 | tree | [1,0,0] |
+----+----------+-----------+
3 rows in set (0.15 sec)
Similar to full-text search, users provide search terms to the application when using vector search.
In this example, the search term is "a swimming animal", and its corresponding vector embedding is assumed to be [1,2,3]
. In practical applications, you need to use an embedding model to convert the user's search term into a vector embedding.
Execute the following SQL statement, and TiDB will identify the top three documents closest to [1,2,3]
by calculating and sorting the cosine distances (vec_cosine_distance
) between the vector embeddings in the table.
SELECT id, document, vec_cosine_distance(embedding, '[1,2,3]') AS distance
FROM embedded_documents
ORDER BY distance
LIMIT 3;
The expected output is as follows:
+----+----------+---------------------+
| id | document | distance |
+----+----------+---------------------+
| 2 | fish | 0.00853986601633272 |
| 1 | dog | 0.12712843905603044 |
| 3 | tree | 0.7327387580875756 |
+----+----------+---------------------+
3 rows in set (0.15 sec)
The three terms in the search results are sorted by their respective distance from the queried vector: the smaller the distance, the more relevant the corresponding document
.
Therefore, according to the output, the swimming animal is most likely a fish, or a dog with a gift for swimming.