feat: Embedding Model for Databend #10689

BohuTANG · 2023-03-21T05:01:49Z

Summary

Tasks

feat: add cosine_distance for vector similarity compute #10737
Introduce a new data type VECTOR (alias of ARRAY<Float32>) #10769
change openai api crate from async to sync #10775
Implement ai_embedding_vector(<string>) to get data vectors from openai api
feat(query): add l2 distance operator <-> #12382
Implement IVF PQ index for VECTOR data type

Introduction

An embedding model is designed to map high-dimensional data into a lower-dimensional vector space, which facilitates various applications such as NLP, recommendation systems, and anomaly detection.

Obtaining Embedding Vectors with OpenAI API

To obtain embedding vectors from the OpenAI API, call its embeddings endpoint with one of OpenAI's pre-trained embedding models (a completion model only returns generated text, not a vector). Below is a Python example:

from openai import OpenAI

# Replace the placeholder with your own API key.
client = OpenAI(api_key="your_openai_api_key")

def get_embedding(text, model="text-embedding-ada-002"):
    # The embeddings endpoint returns one list of floats per input string.
    response = client.embeddings.create(model=model, input=text)
    return response.data[0].embedding

text = "Databend warehouse"
embedding = get_embedding(text)
print(embedding)

Storing Embedding Vectors in Databend

To store the embedding vectors returned by the OpenAI API in Databend, create a table with a VECTOR column to hold them. VECTOR is an alias of ARRAY(FLOAT32), and the column could later be covered by an IVF PQ index. Assuming you have connected to a Databend instance:

CREATE TABLE embeddings (
    id INT,
    text VARCHAR NOT NULL,
    vector VECTOR NOT NULL
);

Computing the Distance Between Vectors in Databend

Databend can compute the distance between a query vector and stored vectors using a built-in function called cosine_distance. This function calculates the distance between two ARRAY(FLOAT32) inputs and can be used directly in SQL queries.
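As a rough reference for what such a function computes, here is a minimal Python sketch. It assumes the usual definition, cosine distance = 1 - cosine similarity; this issue does not spell out the exact formula Databend uses, so treat the definition and the error handling as assumptions.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors (assumed definition)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])  # close to 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])      # 1.0
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])       # 2.0
```

Under this definition the result ranges from 0 (same direction) to 2 (opposite direction), so ordering by distance ascending returns the most similar rows first.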

However, calculating vector distance for every pair of vectors becomes computationally expensive and slow with large-scale datasets and high-dimensional vectors. To tackle this issue, we propose the following techniques:

Inverted File (IVF) Index: An inverted file is an index data structure that maps words or terms to their locations in a set of documents. Within a vector database, it stores a mapping from a set of quantized vectors to their locations. An inverted file enables fast and memory-efficient search for approximate nearest neighbors.
Product Quantization (PQ) Index: Product Quantization is a vector compression technique that reduces memory footprint and computational cost while searching for nearest neighbors in high-dimensional spaces. PQ quantizes the original vector space into a Cartesian product of multiple lower-dimensional subspaces, compressing each high-dimensional vector into a compact code by quantizing its sub-vectors and concatenating the quantization indices. This enables efficient and approximate distance computation between compressed vectors.
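To make the product quantization step more concrete, here is a small Python sketch. The codebooks are hypothetical and hard-coded (two subspaces, two centroids each); a real index would learn them with k-means on training vectors, and nothing here reflects how Databend would implement it.

```python
# Hypothetical codebooks: 2 subspaces of dimension 2, 2 centroids each.
CODEBOOKS = [
    [(0.0, 0.0), (1.0, 1.0)],  # centroids for dimensions 0-1
    [(0.0, 1.0), (1.0, 0.0)],  # centroids for dimensions 2-3
]
SUB_DIM = 2

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(vector):
    """Cut a vector into consecutive sub-vectors of SUB_DIM components."""
    return [tuple(vector[i:i + SUB_DIM]) for i in range(0, len(vector), SUB_DIM)]

def pq_encode(vector):
    """Replace each sub-vector with the index of its nearest centroid."""
    code = []
    for sub, centroids in zip(split(vector), CODEBOOKS):
        nearest = min(range(len(centroids)), key=lambda i: sq_dist(sub, centroids[i]))
        code.append(nearest)
    return code

def pq_distance(query, code):
    """Approximate squared distance between an uncompressed query and a PQ code."""
    # One lookup table per subspace: distance from the query sub-vector to each centroid.
    tables = [
        [sq_dist(sub, c) for c in centroids]
        for sub, centroids in zip(split(query), CODEBOOKS)
    ]
    return sum(table[idx] for table, idx in zip(tables, code))

code = pq_encode([0.9, 1.1, 0.1, 0.8])            # [1, 0]
approx = pq_distance([1.0, 1.0, 0.0, 1.0], code)  # 0.0
```

Each stored vector shrinks to one small integer per subspace, and a query only needs one table lookup per subspace to estimate its distance to a stored code.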

The IVF PQ index combines these techniques: the database vectors are first assigned to coarse clusters, each backed by an inverted list, and the vectors in each list are then compressed with product quantization. A query scans only the lists of the nearest clusters and compares against the compressed codes, which allows a fast and memory-efficient search for approximate nearest neighbors in high-dimensional vector spaces. This is particularly beneficial in large-scale multimedia retrieval systems.
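The inverted-file part of this scheme can be sketched in a few lines of Python. The coarse centroids, the `nprobe` parameter name, and the sample rows are all hypothetical. For brevity the lists hold raw vectors, whereas an IVF PQ index would hold PQ codes instead.

```python
# Hypothetical coarse centroids; a real index would learn them with k-means.
CENTROIDS = [(0.0, 0.0), (10.0, 10.0)]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroids(vector, n):
    """Indices of the n centroids closest to the vector, nearest first."""
    order = sorted(range(len(CENTROIDS)), key=lambda i: sq_dist(vector, CENTROIDS[i]))
    return order[:n]

def build_index(rows):
    """Group (id, vector) rows into one inverted list per centroid."""
    lists = {i: [] for i in range(len(CENTROIDS))}
    for row_id, vector in rows:
        lists[nearest_centroids(vector, 1)[0]].append((row_id, vector))
    return lists

def search(lists, query, k=1, nprobe=1):
    """Scan only the nprobe nearest lists and return the ids of the k closest rows."""
    candidates = []
    for i in nearest_centroids(query, nprobe):
        candidates.extend(lists[i])
    candidates.sort(key=lambda row: sq_dist(query, row[1]))
    return [row_id for row_id, _ in candidates[:k]]

index = build_index([(1, (0.5, 0.2)), (2, (9.0, 9.5)), (3, (1.0, 1.5))])
result = search(index, (0.8, 1.0), k=2, nprobe=1)  # [3, 1]
```

Row 2 is never compared against the query because its list is not probed. That skipped work is the speedup, and it is also why the result is approximate: a true neighbor sitting in an unprobed list would be missed.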

Example SQL Queries

CREATE TABLE embeddings (
    id INT,
    text VARCHAR NOT NULL,
    vector VECTOR NOT NULL
);

Insert sample data

INSERT INTO embeddings (id, text, vector) VALUES
(1, 'Databend warehouse', ARRAY[0.12, 0.34, -0.56, 0.78]),
(2, 'Data warehouse', ARRAY[-0.15, 0.37, 0.29, -0.22]);

Query

WITH query_vector AS (
    SELECT ARRAY[0.11, 0.33, -0.55, 0.77] AS vector
)
SELECT id, text, cosine_distance(vector, query_vector.vector) AS distance
FROM embeddings, query_vector
ORDER BY distance ASC
LIMIT 1;
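As a sanity check on the query above, the same brute-force search can be reproduced in Python over the two sample rows. This again assumes cosine distance is defined as 1 - cosine similarity; the variable names are illustrative only.

```python
import math

# The sample rows and query vector from the SQL example.
rows = [
    (1, "Databend warehouse", [0.12, 0.34, -0.56, 0.78]),
    (2, "Data warehouse", [-0.15, 0.37, 0.29, -0.22]),
]
query_vector = [0.11, 0.33, -0.55, 0.77]

def cosine_distance(a, b):
    """1 - cosine similarity (assumed definition)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Equivalent of ORDER BY distance ASC LIMIT 1.
nearest = min(rows, key=lambda row: cosine_distance(row[2], query_vector))
print(nearest[0], nearest[1])  # 1 Databend warehouse
```

Row 1 points in almost the same direction as the query vector, while row 2 has a negative dot product with it, so row 1 is the expected result.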


mokeyish · 2023-03-21T08:28:53Z

vector_distance: Similarity Metrics

BohuTANG · 2023-03-21T08:39:00Z

From the OpenAI docs:
https://platform.openai.com/docs/guides/embeddings/which-distance-function-should-i-use

We recommend [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). The choice of distance function typically doesn’t matter much.

OpenAI embeddings are normalized to length 1, which means that:

Cosine similarity can be computed slightly faster using just a dot product
Cosine similarity and Euclidean distance will result in the identical rankings
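A short Python check illustrates why both statements hold for unit-length vectors: if |a| = |b| = 1, then |a - b|^2 = 2 - 2(a · b), so a larger dot product always means a smaller Euclidean distance. The vectors and names below are made up for illustration and are normalized in the code rather than taken from the API.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = normalize([0.11, 0.33, -0.55, 0.77])
docs = {
    "a": normalize([0.12, 0.34, -0.56, 0.78]),
    "b": normalize([-0.15, 0.37, 0.29, -0.22]),
    "c": normalize([1.0, 0.0, 0.0, 0.0]),
}

# For unit vectors the dot product is the cosine similarity.
by_cosine = sorted(docs, key=lambda name: -dot(query, docs[name]))
by_euclidean = sorted(docs, key=lambda name: euclidean(query, docs[name]))

# Largest deviation from the identity |a - b|^2 = 2 - 2 (a . b).
max_error = max(
    abs(euclidean(query, v) ** 2 - (2.0 - 2.0 * dot(query, v))) for v in docs.values()
)
```

Both orderings come out the same, and `max_error` stays at floating-point noise.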

thatcort · 2023-08-07T17:42:18Z

When is the vector index feature expected to be complete?

BohuTANG · 2023-08-08T00:23:08Z

> When is the vector index feature expected to be complete?

There is already a PR for this: #11318

But there is still a lot of work to do.
Judging from our users' cases, their data is not large, so we have set this ticket to low priority.

thatcort · 2023-08-08T01:33:16Z

I'm considering Databend for querying over large data sets of text and vectors. Vector indexing would allow replacing the current vector DB and save a lot of money by using object storage. Would be great if you raised the priority of that feature!

BohuTANG · 2023-08-08T14:23:51Z

Thank you for your explanation. We will raise the priority of this feature, but there is still no definite expected time, as there are many higher-priority tasks that need to be completed.

thatcort · 2023-11-15T19:22:42Z

Another library worth looking at for vector ANN (approximate nearest neighbor) support is USearch: https://unum-cloud.github.io/usearch/

BohuTANG added the C-feature Category: feature label Mar 21, 2023

BohuTANG mentioned this issue Mar 23, 2023

feat: add cosine_distance for vector similarity compute #10737

Merged

BohuTANG changed the title ~~feat: Embedding Model Proposal for Databend(By ChatGPT4)~~ feat: Embedding Model for Databend Mar 25, 2023

This was referenced Mar 25, 2023

Release proposal: Nightly v1.1 #10334

Closed

chore(openai): change openai from async to sync #10785

Merged

BohuTANG closed this as completed in #10785 Mar 27, 2023

BohuTANG reopened this Mar 27, 2023

BohuTANG mentioned this issue Mar 27, 2023

feat(functions): ai_embedding_vector #10789

Merged

BohuTANG self-assigned this Mar 27, 2023

BohuTANG mentioned this issue Apr 14, 2023

Release proposal: Nightly v1.2 #11073

Closed

7 tasks

thatcort mentioned this issue Aug 7, 2023

Roadmap 2023 #9448

Open

9 tasks
