feat: initialize the vector search document structure (#17983)
* feat: initialize the vector search document structure
* fix: merge toc
* fix
* feat: add langchain + llamaindex integration guide
* feat: add jinaai embedding integration guide
* vector search: refine wording (#1)
* vector search: refine wording
* Discard changes to tidb-cloud/create-tidb-cluster-serverless.md
* remove "cluster with vector search enabled"
* Update tidb-cloud/vector-search-overview.md
* Apply suggestions from code review

Co-authored-by: Mini256 <minianter@foxmail.com>

---------

Co-authored-by: Mini256 <minianter@foxmail.com>

* feat: add peewee + sqlalchemy integration guide
* feat: add django integration quickstart
* add supported distance functions
* fix: add faqs

---------

Co-authored-by: Aolin <aolinz@outlook.com>
Showing 13 changed files with 2,313 additions and 22 deletions.
@@ -0,0 +1,41 @@

---
title: Vector Search FAQs
summary: Learn about the FAQs related to TiDB Vector Search.
---

# Vector Search FAQs

This document lists the most frequently asked questions about TiDB Vector Search.

## General FAQs

### What is TiDB Vector Search?
TiDB Vector Search lets you power generative AI applications and implement semantic or similarity search for text, images, videos, audio, or any other type of data. Rather than searching the data itself, vector search lets you search by the meaning of the data.

### What are the key use cases?

You can use machine learning models from providers such as OpenAI and Hugging Face to create vector embeddings and store them in TiDB. You can then use TiDB Vector Search for retrieval-augmented generation (RAG), semantic search, recommendation engines, dynamic personalization, and other use cases.

### Does Vector Search work with articles, images, or media files?

Yes. TiDB Vector Search can query any kind of data that can be turned into a vector embedding. You can store both the vector embeddings and the original data in the same TiDB cluster, or even in the same table, without setting up a separate vector search engine.

### What AI integrations does TiDB Vector Search support?

TiDB Vector Search currently integrates with [Langchain](/tidb-cloud/vector-search-integrate-with-langchain.md) and [LlamaIndex](/tidb-cloud/vector-search-integrate-with-llamaindex.md).

### Which vector embeddings does TiDB Vector Search support?

TiDB supports vector embeddings with up to 16000 dimensions.

### How can I speed up Vector Search?

You can create an index on the vector column to speed up Vector Search. See Build AI Apps with TiDB Vector Search for more details.

### How do I get support for Vector Search or general usage of TiDB Serverless?

We value your feedback and are always here to help. You can get support through either of the following channels:

- Discord: https://discord.gg/zcqexutz2R
- Support Portal: https://tidb.support.pingcap.com/

201 changes: 201 additions & 0 deletions
tidb-cloud/vector-search-get-started-via-python-client.md

@@ -0,0 +1,201 @@

---
title: Get Started with Vector Search Using the Python Client
summary: Learn how to quickly get started with the TiDB vector search feature in TiDB Cloud using a Python client and perform semantic searches.
---

# Get Started with Vector Search Using the Python Client

This tutorial demonstrates how to get started with the [vector search](/tidb-cloud/vector-search-overview.md) feature in TiDB Cloud using a Python client. You will learn how to use the Python client [`tidb-vector`](https://github.com/pingcap/tidb-vector-python) to:

- Set up your environment.
- Connect to your TiDB cluster.
- Create a vector table.
- Store vector embeddings.
- Perform vector search queries.

> **Note**
>
> The vector search feature is currently in beta and is only available for [TiDB Serverless](/tidb-cloud/select-cluster-tier.md#tidb-serverless) clusters.

## Prerequisites

To complete this tutorial, you need:

- [Python 3.8 or higher](https://www.python.org/downloads/) installed.
- [Git](https://git-scm.com/downloads) installed.
- A TiDB Serverless cluster. If you don't have one, follow [creating a TiDB Serverless cluster](/tidb-cloud/create-tidb-cluster-serverless.md) to create your own.

## Get started

This section demonstrates how to get started with the vector search feature using the Python client [`tidb-vector`](https://github.com/pingcap/tidb-vector-python).

To run the demo directly, check out the sample code in the [pingcap/tidb-vector-python](https://github.com/pingcap/tidb-vector-python/blob/main/examples/python-client-quickstart) repository.

### Step 1. Create a new Python project

In your preferred directory, create a new Python project and a file named `example.py`.

```shell
mkdir python-client-quickstart
cd python-client-quickstart
touch example.py
```

### Step 2. Install required dependencies

In your project directory, run the following command to install the necessary packages:

```shell
pip install sqlalchemy pymysql sentence-transformers tidb-vector python-dotenv
```

- `tidb-vector`: the Python client for interacting with the vector search feature in TiDB Cloud, which is based on [SQLAlchemy](https://www.sqlalchemy.org).
- [`sentence-transformers`](https://sbert.net): a Python library that provides pre-trained models for generating [vector embeddings](/tidb-cloud/vector-search-overview.md#vector-embedding) from text.

### Step 3. Configure the connection string to the TiDB cluster

1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page.

2. Click **Connect** in the upper-right corner. A connection dialog is displayed.

3. Ensure the configurations in the connection dialog match your operating environment.

    - **Endpoint Type** is set to `Public`.
    - **Branch** is set to `main`.
    - **Connect With** is set to `SQLAlchemy`.
    - **Operating System** matches your environment.

    > **Tip:**
    >
    > If your program is running in Windows Subsystem for Linux (WSL), switch to the corresponding Linux distribution.

4. Click the **PyMySQL** tab and copy the connection string.

    > **Tip:**
    >
    > If you have not set a password yet, click **Generate Password** to generate a random password.

5. In the root directory of your Python project, create a `.env` file and paste the connection string into it. (An optional check that the connection string loads correctly is sketched right after this list.)

    The following is an example for macOS:

    ```dotenv
    TIDB_DATABASE_URL="mysql+pymysql://<prefix>.root:<password>@gateway01.<region>.prod.aws.tidbcloud.com:4000/test?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true"
    ```
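Optionally, before writing any vector code, you can confirm that the connection string is readable from the `.env` file. The following is a minimal throwaway check (not part of `example.py`), assuming you run it from the project directory that contains the `.env` file:

```python
# Optional throwaway check: confirm that TIDB_DATABASE_URL is readable from the
# .env file in the current directory before moving on.
import os
from dotenv import load_dotenv

load_dotenv()
assert os.environ.get("TIDB_DATABASE_URL"), "TIDB_DATABASE_URL is not set in .env"
print("Connection string loaded.")
```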
### Step 4. Initialize the embedding model

An [embedding model](/tidb-cloud/vector-search-overview.md#embedding-model) transforms data into [vector embeddings](/tidb-cloud/vector-search-overview.md#vector-embedding). This example uses the pre-trained model [**msmarco-MiniLM-L12-cos-v5**](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L12-cos-v5) for text embedding. This lightweight model, provided by the `sentence-transformers` library, transforms text data into 384-dimensional vector embeddings.

To set up the model, copy the following code into the `example.py` file. This code initializes a `SentenceTransformer` instance and defines a `text_to_embedding()` function for later use.

```python
from sentence_transformers import SentenceTransformer

print("Downloading and loading the embedding model...")
embed_model = SentenceTransformer("sentence-transformers/msmarco-MiniLM-L12-cos-v5", trust_remote_code=True)
embed_model_dims = embed_model.get_sentence_embedding_dimension()

def text_to_embedding(text):
    """Generates vector embeddings for the given text."""
    embedding = embed_model.encode(text)
    return embedding.tolist()
```
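If you want to confirm that the model produces vectors of the expected size, you can temporarily append a small check like the following. This is only a sanity check, not part of the tutorial; the 384-dimension figure comes from the model description above:

```python
# Optional sanity check: msmarco-MiniLM-L12-cos-v5 is described above as producing
# 384-dimensional embeddings, so both values below are expected to be 384.
sample_embedding = text_to_embedding("hello world")
print(len(sample_embedding))
print(embed_model_dims)
```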
### Step 5. Connect to the TiDB cluster

Use the `TiDBVectorClient` class to connect to your TiDB cluster and create a table `embedded_documents` with a vector column to serve as the vector store.

> **Note**
>
> Ensure the dimension of your vector column matches the dimension of the vectors produced by your embedding model. For example, the **msmarco-MiniLM-L12-cos-v5** model generates vectors with 384 dimensions.

```python
import os
from tidb_vector.integrations import TiDBVectorClient
from dotenv import load_dotenv

# Load the connection string from the .env file
load_dotenv()

vector_store = TiDBVectorClient(
    # The table which will store the vector data.
    table_name='embedded_documents',
    # The connection string to the TiDB cluster.
    connection_string=os.environ.get('TIDB_DATABASE_URL'),
    # The dimension of the vector generated by the embedding model.
    vector_dimension=embed_model_dims,
    # Determine whether to recreate the table if it already exists.
    drop_existing_table=True,
)
```
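Because `drop_existing_table` is set to `True`, the table is dropped and recreated every time the script runs, which is convenient for this demo but discards any previously stored rows. If you later want reruns to keep existing data, a sketch like the following reuses the same parameters with the flag turned off:

```python
# A variant of the client above, reusing only parameters shown in this tutorial:
# keep the existing `embedded_documents` table (and its rows) across reruns.
vector_store = TiDBVectorClient(
    table_name='embedded_documents',
    connection_string=os.environ.get('TIDB_DATABASE_URL'),
    vector_dimension=embed_model_dims,
    drop_existing_table=False,
)
```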
### Step 6. Embed text data and store the vectors

In this step, you will prepare sample documents containing single words, such as "dog", "fish", and "tree". The following code uses the `text_to_embedding()` function to transform these text documents into vector embeddings, and then inserts them into the vector store.

```python
documents = [
    {
        "id": "f8e7dee2-63b6-42f1-8b60-2d46710c1971",
        "text": "dog",
        "embedding": text_to_embedding("dog"),
        "metadata": {"category": "animal"},
    },
    {
        "id": "8dde1fbc-2522-4ca2-aedf-5dcb2966d1c6",
        "text": "fish",
        "embedding": text_to_embedding("fish"),
        "metadata": {"category": "animal"},
    },
    {
        "id": "e4991349-d00b-485c-a481-f61695f2b5ae",
        "text": "tree",
        "embedding": text_to_embedding("tree"),
        "metadata": {"category": "plant"},
    },
]

vector_store.insert(
    ids=[doc["id"] for doc in documents],
    texts=[doc["text"] for doc in documents],
    embeddings=[doc["embedding"] for doc in documents],
    metadatas=[doc["metadata"] for doc in documents],
)
```
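The same pattern extends to your own data. The following is a minimal sketch, assuming a hypothetical list of texts and metadata of your own; it reuses only the `text_to_embedding()` function and the `vector_store.insert()` call shown above, with generated UUIDs as document IDs:

```python
import uuid

# Hypothetical example data: replace with your own texts and metadata.
my_texts = ["cat", "bird", "flower"]
my_metadatas = [{"category": "animal"}, {"category": "animal"}, {"category": "plant"}]

vector_store.insert(
    ids=[str(uuid.uuid4()) for _ in my_texts],
    texts=my_texts,
    embeddings=[text_to_embedding(text) for text in my_texts],
    metadatas=my_metadatas,
)
```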
### Step 7. Perform a vector search query

In this step, you will search for "a swimming animal", which does not directly match any words in the existing documents.

The following code uses the `text_to_embedding()` function again to convert the query text into a vector embedding, and then queries with this embedding to find the top three closest matches.

```python
def print_result(query, result):
    print(f"Search result (\"{query}\"):")
    for r in result:
        print(f"- text: \"{r.document}\", distance: {r.distance}")

query = "a swimming animal"
query_embedding = text_to_embedding(query)
search_result = vector_store.query(query_embedding, k=3)
print_result(query, search_result)
```

Run the `example.py` file. The output is as follows:

```plain
Search result ("a swimming animal"):
- text: "fish", distance: 0.4586619425596351
- text: "dog", distance: 0.6521646263795423
- text: "tree", distance: 0.7980725077476978
```

From the output, the swimming animal is most likely a fish, or perhaps a dog with a gift for swimming.

This demonstration shows how vector search can efficiently locate the most relevant documents, with results ranked by vector proximity: the smaller the distance, the more relevant the document.
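If you prefer to report similarity scores instead of raw distances, a small follow-up like the one below converts them, assuming the reported values are cosine distances (an assumption consistent with the cosine-similarity model used in this tutorial), in which case similarity equals `1 - distance`:

```python
# A small follow-up sketch: convert distances to similarity scores, assuming the
# reported values are cosine distances (so that similarity = 1 - distance).
for r in search_result:
    print(f"- text: \"{r.document}\", similarity: {1 - r.distance:.4f}")
```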
## See also

- [Vector Column](/tidb-cloud/vector-search-vector-column.md)
- [Vector Index](/tidb-cloud/vector-search-vector-index.md)