Add a section about AI/timescale vector (#2862)
Showing 13 changed files with 1,497 additions and 2 deletions.
---
title: Key vector database concepts
excerpt: The most important concepts for understanding vectors in PostgreSQL
products: [cloud]
keywords: [ai, vector, pgvector, timescale vector]
tags: [ai, vector]
---

<!-- vale Google.Headings = NO -->
# Key vector database concepts
<!-- vale Google.Headings = YES -->

## `Vector` data type

Vectors are stored inside the database in regular PostgreSQL tables using `vector` columns. The `vector` column type is provided by the [pgvector](https://github.com/pgvector/pgvector) extension. A common way to store vectors is alongside the data they represent. For example, to store embeddings for documents, a common table structure is:

```sql
CREATE TABLE IF NOT EXISTS document_embedding (
    id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    document_id BIGINT REFERENCES document(id),
    metadata JSONB,
    contents TEXT,
    embedding VECTOR(1536)
);
```

This table contains a primary key, a foreign key to the document table, some metadata, the text being embedded (in the `contents` column), and the embedded vector.

This may seem like a bit of an odd design: why aren't the embeddings simply a separate column in the document table? The answer has to do with the context-length limits of embedding models and of LLMs. When embedding data, there is a limit to the length of content you can embed (for example, OpenAI's ada-002 has a limit of [8191 tokens](https://platform.openai.com/docs/guides/embeddings/second-generation-models)), so if you are embedding a long piece of text, you have to break it up into smaller chunks and embed each chunk individually. At the database layer, there is therefore usually a one-to-many relationship between the thing being embedded and its embeddings, which is represented by a foreign key from the embedding to the thing.

Of course, if you do not want to store the original data in the database and are storing only the embeddings, that's totally fine too: just omit the foreign key from the table. Another popular alternative is to put the document reference into the metadata JSONB.
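
For illustration, here is a minimal sketch of that embeddings-only layout. The `document_id` key inside `metadata` is just a hypothetical convention, not a requirement:

```sql
-- Embeddings-only variant: no foreign key column. If you still need a link
-- back to the source document, keep it inside the metadata JSONB instead.
CREATE TABLE IF NOT EXISTS document_embedding (
    id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    metadata JSONB,   -- for example: {"document_id": 42, "source": "faq"}
    contents TEXT,
    embedding VECTOR(1536)
);
```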

## Querying vectors

The canonical vector query finds the stored vectors closest to an embedding of the user's query. This is also known as finding the [K nearest neighbors](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm).

In the example query below, `$1` is a parameter taking a query embedding, and the `<=>` operator calculates the distance between the query embedding and the embedding vectors stored in the database (returning a float value).

```sql
SELECT *
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10;
```

The query above returns the 10 rows with the smallest distance between the query's embedding and the row's embedding. Of course, this being PostgreSQL, you can add additional `WHERE` clauses (such as filters on the metadata), joins, and so on.
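
For example, here is a sketch of the same nearest-neighbor query restricted by a metadata filter; the `category` key is hypothetical and stands in for whatever you store in `metadata`:

```sql
-- Same K-nearest-neighbor search, limited to rows whose metadata matches.
SELECT id, contents
FROM document_embedding
WHERE metadata->>'category' = 'release-notes'  -- hypothetical metadata key
ORDER BY embedding <=> $1
LIMIT 10;
```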

### Vector distance types

The query shown above uses cosine distance (the `<=>` operator) as a measure of how similar two embeddings are. However, there are multiple ways to quantify how far apart two vectors are from each other.

<Highlight type="note">
In practice, the choice of distance measure doesn't matter much, and it is recommended to stick with cosine distance for most applications.
</Highlight>

#### Description of cosine distance, negative inner product, and Euclidean distance

Here's a succinct description of three common vector distance measures:

- **Cosine distance, a.k.a. angular distance**: This measures the cosine of the angle between two vectors. It's not a true "distance" in the mathematical sense but a similarity measure, where a smaller angle corresponds to a higher similarity. Cosine distance is particularly useful in high-dimensional spaces where the magnitude of the vectors (their length) is less important, such as in text analysis or information retrieval. The cosine ranges from -1 (exactly opposite) to 1 (exactly the same), with 0 typically indicating orthogonality (no similarity). See [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) for more.

- **Negative inner product**: This is simply the negative of the inner product (also known as the dot product) of two vectors. The inner product measures vector similarity based on the vectors' magnitudes and the cosine of the angle between them. A higher inner product indicates greater similarity. However, unlike cosine similarity, the magnitude of the vectors influences the inner product.

- **Euclidean distance**: This is the "ordinary" straight-line distance between two points in Euclidean space. In terms of vectors, it's the square root of the sum of the squared differences between corresponding elements of the vectors. This measure is sensitive to the magnitude of the vectors and is widely used in fields such as clustering and nearest neighbor search.

Many embedding systems (for example, OpenAI's ada-002) produce vectors of length 1 (unit vectors). For those systems, the ranking (ordering) produced by all three measures is the same. In particular:
- The cosine distance is `1 − dot product`.
- The negative inner product is `−dot product`.
- The Euclidean distance is related to the dot product: the squared Euclidean distance is `2(1 − dot product)`.
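
pgvector exposes one operator per measure: `<->` for Euclidean distance, `<#>` for negative inner product, and `<=>` for cosine distance. The following sketch computes all three for the same query embedding; on unit vectors, the resulting ranking is the same whichever operator you order by:

```sql
-- Compare the three distance measures for the same query embedding ($1).
-- The comments give the unit-vector relationships from the list above.
SELECT id,
       embedding <=> $1 AS cosine_distance,         -- 1 - dot product
       embedding <#> $1 AS negative_inner_product,  -- -dot product
       embedding <-> $1 AS euclidean_distance       -- sqrt(2 * (1 - dot product))
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10;
```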

<!-- vale Google.Headings = NO -->
#### Recommended vector distance
<!-- vale Google.Headings = YES -->

Using cosine distance, especially on unit vectors, is recommended. This recommendation is based on OpenAI's [recommendation](https://platform.openai.com/docs/guides/embeddings/which-distance-function-should-i-use) as well as the fact that the ranking of the different distance measures on unit vectors is the same.

## Vector search indexing (approximate nearest neighbor search)

In PostgreSQL and other relational databases, indexing is a way to speed up queries. For vector data, indexes speed up the similarity search query shown above, where you find the embeddings most similar to a given query embedding. This problem is often referred to as finding the [K nearest neighbors](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm).

<Highlight type="note">
The term "index" in the context of vector databases has multiple meanings. It can refer to both the storage mechanism for your data and the tool that enhances query efficiency. These docs use the latter meaning.
</Highlight>

Finding the K nearest neighbors is not a new problem in PostgreSQL, but existing techniques only work with low-dimensional data. These approaches cease to be effective with data of more than approximately 10 dimensions due to the "curse of dimensionality." Given that embeddings often consist of more than a thousand dimensions (OpenAI's are 1,536), new techniques had to be developed.

There are no known exact algorithms for efficiently searching in such high-dimensional spaces. Nevertheless, there are excellent approximate algorithms that fall into the category of approximate nearest neighbor algorithms.

<!-- vale Google.Colons = NO -->
There are 3 different indexing algorithms available as part of Timescale Vector: Timescale Vector index, pgvector HNSW, and pgvector ivfflat. The table below illustrates the high-level differences between these algorithms:
<!-- vale Google.Colons = YES -->

| Algorithm        | Build Speed | Query Speed | Need to rebuild after updates |
|------------------|-------------|-------------|-------------------------------|
| Timescale Vector | Slow        | Fastest     | No                            |
| pgvector HNSW    | Slowest     | Fast        | No                            |
| pgvector ivfflat | Fastest     | Slowest     | Yes                           |

See the [performance benchmarks](https://www.timescale.com/blog/how-we-made-postgresql-the-best-vector-database/) for details on how each index performs on a dataset of 1 million OpenAI embeddings.

## Recommended index types

For most applications, the Timescale Vector index is recommended.
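
As an illustration, index creation for each option might look like the sketch below. The pgvector statements use the standard `hnsw` and `ivfflat` access methods with cosine operators; the `tsv` access method name for the Timescale Vector index is an assumption here, so check the Timescale Vector documentation for the exact syntax and tuning parameters:

```sql
-- Timescale Vector index (access method name assumed; verify against the docs).
CREATE INDEX document_embedding_tsv_idx ON document_embedding
    USING tsv (embedding);

-- pgvector HNSW index using cosine distance operators.
CREATE INDEX document_embedding_hnsw_idx ON document_embedding
    USING hnsw (embedding vector_cosine_ops);

-- pgvector ivfflat index; should be rebuilt after large updates.
CREATE INDEX document_embedding_ivfflat_idx ON document_embedding
    USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
```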
---
title: Overview of Timescale Vector
excerpt: A description of Timescale Vector and vectors in general
products: [cloud]
keywords: [ai, vector, pgvector, timescale vector]
tags: [ai, vector]
---

# Overview of Timescale Vector

## What is Timescale Vector?
Timescale Vector is PostgreSQL++ for AI applications. With Timescale Vector, you can power production AI applications with PostgreSQL as your vector database, storing vector embeddings, relational data (for example, related metadata), and time-based data in the same database.

Timescale Vector is a cloud-based vector database. There is no self-hosted version at this time. To use Timescale Vector, [sign up here](https://console.cloud.timescale.com/signup?utm_campaign=vectorlaunch&utm_source=tsv-docs&utm_medium=direct).

<!-- vale Google.Headings = NO -->
## Timescale Vector vs. pgvector
<!-- vale Google.Headings = YES -->
[Pgvector](https://github.com/pgvector/pgvector) is a popular open source extension for vector storage and similarity search in PostgreSQL. Pgvector is packaged as part of Timescale Vector, so you can think of Timescale Vector as a complement to, not a replacement for, pgvector. Timescale Vector uses the same vector data type as pgvector and offers all of pgvector's capabilities (like HNSW and ivfflat indexes). Timescale Vector also offers features not present in pgvector, such as the Timescale Vector index and time-based vector search.

This makes it easy to migrate your existing pgvector deployment and take advantage of additional features for scale in Timescale Vector. You also have the flexibility to create different index types suited to your needs. See the [vector search indexing][vector-search-indexing] section for more information.

## What are vector embeddings?

Embeddings offer a way to represent the semantic essence of data and to compare data according to how closely related it is in meaning. In the database context, this is extremely powerful: think of it as full-text search on steroids. Vector databases let you store embeddings associated with data and then search for embeddings that are similar to a given query.

## Applications of vector embeddings

There are many applications where vector embeddings can be useful.

### Semantic search
Transcend the limitations of traditional keyword-driven search methods by creating systems that understand the intent and contextual meaning of a query, thereby returning more relevant results. Semantic search doesn't just seek exact word matches; it grasps the deeper intent behind a user's query. The result? Even if search terms differ in phrasing, relevant results are surfaced. Taking advantage of hybrid search, which marries lexical and semantic search methodologies, offers users a search experience that's both rich and accurate. It's not just about finding direct matches anymore; it's about tapping into contextually and conceptually similar content to meet user needs.

### Recommendation systems
Imagine a user who has shown interest in several articles on a single topic. With embeddings, the recommendation engine can delve deep into the semantic essence of those articles, surfacing other database items that resonate with the same theme. Recommendations thus move beyond superficial layers like tags or categories and dive into the very heart of the content.

### Retrieval augmented generation (RAG)
Supercharge generative AI by providing additional context to Large Language Models (LLMs) like OpenAI's GPT-4, Anthropic's Claude 2, and open source models like Llama 2. When a user poses a query, relevant database content is fetched and used to supplement the query as additional information for the LLM. This helps reduce LLM hallucinations, as it ensures the model's output is grounded in specific and relevant information, even if that information wasn't part of the model's original training data.

### Clustering
Embeddings also offer a robust solution for clustering data. Transforming data into these vectorized forms allows for nuanced comparisons between data points in a high-dimensional space. Through algorithms like K-means or hierarchical clustering, data can be grouped into semantic categories, offering insights that surface-level attributes might miss. This surfaces inherent data patterns, enriching both exploration and decision-making processes.

## Vector similarity search: How does it work?

At a high level, embeddings help a database look for data that is similar to a given piece of information (similarity search). This process includes a few steps:

- First, embeddings are created for the data and inserted into the database. This can take place either in an application or in the database itself.
- Second, when a user has a search query (for example, a question in chat), that query is transformed into an embedding.
- Third, the database takes the query embedding and searches for the closest matching (most similar) embeddings it has stored.

Under the hood, embeddings are represented as vectors (lists of numbers) that capture the essence of the data. To determine the similarity of two pieces of data, the database uses mathematical operations on vectors to get a distance measure (commonly Euclidean or cosine distance). During a search, the database returns those stored items where the distance between the query embedding and the stored embedding is as small as possible, suggesting the items are most similar.
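
As a minimal sketch of that flow, using the `document_embedding` table shown earlier and with embeddings computed in the application and passed in as parameters:

```sql
-- Step 1: embed the content in your application, then store text + embedding.
-- $1 = document id, $2 = metadata JSONB, $3 = raw text, $4 = its embedding.
INSERT INTO document_embedding (document_id, metadata, contents, embedding)
VALUES ($1, $2, $3, $4);

-- Steps 2 and 3: embed the user's query the same way, pass it in as $1,
-- and let the database return the closest stored embeddings.
SELECT contents, embedding <=> $1 AS distance
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 5;
```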

## Embedding models

Timescale Vector works with the most popular embedding models that have output vectors of 2,000 dimensions or less. Here are some popular choices for text embeddings for use with Timescale Vector:

- [OpenAI embedding models](https://platform.openai.com/docs/guides/embeddings): text-embedding-ada-002 is OpenAI's recommended embedding generation model.
- [Sentence transformers](https://huggingface.co/sentence-transformers): Several popular open source models for embedding generation from text.
- [Cohere representation models](https://docs.cohere.com/docs/models#representation): Cohere offers many models that can be used to generate embeddings from text in English or multiple languages.

See the [HuggingFace Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for more embedding model options.

And here are some popular choices for image embeddings:

- [OpenAI CLIP](https://github.com/openai/CLIP): Useful for applications involving text and images.
- [VGG](https://pytorch.org/vision/stable/models/vgg.html)
- [Vision Transformer (ViT)](https://github.com/lukemelas/PyTorch-Pretrained-ViT)

[vector-search-indexing]: /ai/:currentVersion:/concepts/#vector-search-indexing-approximate-nearest-neighbor-search
---
title: LangChain Integration
excerpt: LangChain integration for Timescale Vector
products: [cloud]
keywords: [ai, vector, pgvector, timescale vector, python, langchain]
tags: [ai, vector, python, langchain]
---

# LangChain integration

[LangChain](https://www.langchain.com/) is a popular framework for developing applications powered by LLMs. Timescale Vector has a native LangChain integration, enabling you to use Timescale Vector as a vector store and leverage all its capabilities in your applications built with LangChain.

Here are resources about using Timescale Vector with LangChain:

- [Getting started with LangChain and Timescale Vector](https://python.langchain.com/docs/integrations/vectorstores/timescalevector): Learn how to use Timescale Vector for (1) semantic search, (2) time-based vector search, (3) self-querying, and (4) creating indexes to speed up queries.
- [PostgreSQL Self Querying](https://python.langchain.com/docs/integrations/retrievers/self_query/timescalevector_self_query): Learn how to use Timescale Vector with self-querying in LangChain.
- [LangChain template: RAG with conversational retrieval](https://github.com/langchain-ai/langchain/tree/master/templates/rag-timescale-conversation): This template is used for conversational retrieval, one of the most popular LLM use cases. It passes both the conversation history and retrieved documents into an LLM for synthesis.
- [LangChain template: RAG with time-based search and self-query retrieval](https://github.com/langchain-ai/langchain/tree/master/templates/rag-timescale-hybrid-search-time): This template shows how to use timescale-vector with the self-query retriever to perform hybrid search on similarity and time. This is useful whenever your data has a strong time-based component.
- [Learn more about Timescale Vector and LangChain](https://blog.langchain.dev/timescale-vector-x-langchain-making-postgresql-a-better-vector-database-for-ai-applications/): A blog post about the unique capabilities that Timescale Vector brings to the LangChain ecosystem.