feat: Embedding Model for Databend #10689
Comments
From openai doc:
When is the vector index feature expected to be complete?
Indeed, there is already a PR, #11318, but a lot of work remains.
I'm considering Databend for querying over large data sets of text and vectors. Vector indexing would allow replacing the current vector DB and save a lot of money by using object storage. It would be great if you raised the priority of that feature!
Thank you for your explanation. We will raise the priority of this feature, but there is still no definite expected time, as there are many higher-priority tasks that need to be completed.
Another library worth looking at for vector ANN support is USearch: https://unum-cloud.github.io/usearch/
Summary
Tasks
- VECTOR (alias of ARRAY<Float32>) #10769
- ai_embedding_vector(<string>) to get data vectors from the OpenAI API
- VECTOR data type
Introduction
An embedding model is designed to map high-dimensional data into a lower-dimensional vector space, which facilitates various applications such as NLP, recommendation systems, and anomaly detection.
Obtaining Embedding Vectors with OpenAI API
To extract embedding vectors using the OpenAI API, utilize OpenAI's pre-trained language models. Below is a Python example:
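The original example did not survive on this page, so the following is only a minimal sketch of such a call, not the issue author's code. It assumes the public REST endpoint `https://api.openai.com/v1/embeddings` and the model name `text-embedding-ada-002`; both are assumptions that may have changed, and the helper names are hypothetical. Only the Python standard library is used.

```python
import json
import urllib.request

# Assumed endpoint and model name; check the current OpenAI documentation.
OPENAI_EMBEDDINGS_URL = "https://api.openai.com/v1/embeddings"
DEFAULT_MODEL = "text-embedding-ada-002"


def build_embedding_request(text, api_key, model=DEFAULT_MODEL):
    """Build (but do not send) the HTTP request for one input string."""
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        OPENAI_EMBEDDINGS_URL,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def parse_embedding_response(body):
    """Extract the first embedding vector from a decoded JSON response."""
    return [float(x) for x in body["data"][0]["embedding"]]


def get_embedding(text, api_key):
    """Send the request and return the embedding as a list of floats."""
    request = build_embedding_request(text, api_key)
    with urllib.request.urlopen(request) as response:
        return parse_embedding_response(json.load(response))
```

Calling `get_embedding("some text", api_key)` needs a valid API key and network access; the request builder and the response parser are kept separate so they can be checked offline.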
Storing Embedding Vectors in Databend
To store the embedding vectors returned by the OpenAI API in Databend, create a table with a column of type Vector (an alias of Array(Float32), which can carry an IVF PQ index) to hold the vectors. Assuming you have connected to a Databend instance:
Computing the Distance Between Vectors in Databend
Databend can compute the distance between a query vector and stored vectors using a built-in function called cosine_distance. This function calculates the distance between two ARRAY(FLOAT32) inputs and can be used directly in SQL queries.
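For reference, the quantity involved can be sketched in a few lines of Python. This assumes the usual definition, one minus the cosine similarity of the two vectors; the issue does not spell out the formula Databend uses, so treat this as an illustration rather than a description of the built-in function.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

Under this definition, vectors pointing the same way have distance 0, orthogonal vectors have distance 1, and opposite vectors have distance 2.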
However, calculating vector distance for every pair of vectors becomes computationally expensive and slow with large-scale datasets and high-dimensional vectors. To tackle this issue, we propose the following techniques:
The IVF PQ index is a combination of these techniques, where the database vectors are first quantized using product quantization, followed by the creation of an inverted file to index the quantized vectors. This approach allows for a fast and memory-efficient search of approximate nearest neighbors in high-dimensional vector spaces, particularly beneficial in large-scale multimedia retrieval systems.
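A toy version of the idea, in plain Python, may make the two stages concrete. This is a simplified sketch and not the implementation proposed in the PR: the coarse centroids and the per-subspace codebooks are passed in as fixed inputs (in practice they would be learned, typically with k-means), and the raw vectors are encoded rather than their residuals. All function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def split(vector, m):
    """Split a vector into m equal-length sub-vectors."""
    step = len(vector) // m
    return [vector[i * step:(i + 1) * step] for i in range(m)]


def pq_encode(vector, codebooks):
    """Product quantization: one codeword index per sub-vector."""
    subs = split(vector, len(codebooks))
    return tuple(nearest(sub, book) for sub, book in zip(subs, codebooks))


def build_index(vectors, coarse_centroids, codebooks):
    """Inverted file: coarse cell -> list of (vector id, PQ code)."""
    index = {cell: [] for cell in range(len(coarse_centroids))}
    for vec_id, vec in enumerate(vectors):
        cell = nearest(vec, coarse_centroids)
        index[cell].append((vec_id, pq_encode(vec, codebooks)))
    return index


def search(query, index, coarse_centroids, codebooks, nprobe=1, k=1):
    """Scan only the nprobe closest cells and rank by approximate distance."""
    cells = sorted(
        range(len(coarse_centroids)),
        key=lambda c: sq_dist(query, coarse_centroids[c]),
    )[:nprobe]
    # table[j][c]: distance from the j-th query sub-vector to codeword c.
    subs = split(query, len(codebooks))
    table = [[sq_dist(sub, word) for word in book]
             for sub, book in zip(subs, codebooks)]
    scored = []
    for cell in cells:
        for vec_id, code in index[cell]:
            approx = sum(table[j][c] for j, c in enumerate(code))
            scored.append((approx, vec_id))
    return [vec_id for _, vec_id in sorted(scored)[:k]]
```

The speedup comes from two places: only `nprobe` cells are scanned instead of the whole table, and each candidate is scored with a handful of table lookups instead of a full distance computation. The price is that results are approximate.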
Example SQL Queries
Insert sample data
Query
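The SQL from the original issue is not preserved on this page. As a stand-in, the sketch below builds the kind of statements this section describes from Python. The table name `articles`, its columns, and the `ARRAY(FLOAT32)` column type are hypothetical choices made for illustration; only `cosine_distance` comes from the text above.

```python
# Hypothetical table name; not taken from the original issue.
TABLE = "articles"


def vector_literal(vector):
    """Render a Python sequence as a SQL array literal, e.g. [0.1, 0.2]."""
    return "[" + ", ".join(repr(float(x)) for x in vector) + "]"


def create_table_sql():
    """Assumed schema: an id, a title, and one embedding column."""
    return f"CREATE TABLE {TABLE} (id INT, title VARCHAR, embedding ARRAY(FLOAT32))"


def insert_sql(row_id, title, embedding):
    """INSERT statement for one row; single quotes in the title are doubled."""
    safe_title = title.replace("'", "''")
    return (
        f"INSERT INTO {TABLE} (id, title, embedding) "
        f"VALUES ({int(row_id)}, '{safe_title}', {vector_literal(embedding)})"
    )


def nearest_sql(query_vector, k):
    """Top-k rows ordered by cosine_distance to the query vector."""
    return (
        f"SELECT id, title, cosine_distance(embedding, {vector_literal(query_vector)}) "
        f"AS distance FROM {TABLE} ORDER BY distance ASC LIMIT {int(k)}"
    )
```

String formatting is used here only to keep the generated SQL visible; with a real driver, parameter binding would be the safer choice where it is available.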