docs: refine the AI functions (#11205)
BohuTANG authored Apr 24, 2023
1 parent 6170d21 commit 9ba6313
Showing 4 changed files with 111 additions and 99 deletions.
2 changes: 1 addition & 1 deletion docs/doc/15-sql-functions/61-ai-functions/01-ai-to-sql.md
Expand Up @@ -40,7 +40,7 @@ openai_api_key = "<your-key>"

## Examples

In this example, a SQL query statement is generated from an instruction with the AI_TO_SQL function, and the resulting statement is executed to obtain the query results.
In this example, an SQL query statement is generated from an instruction with the AI_TO_SQL function, and the resulting statement is executed to obtain the query results.

1. Prepare data.

49 changes: 28 additions & 21 deletions docs/doc/15-sql-functions/61-ai-functions/02-ai-embedding-vector.md
Expand Up @@ -7,6 +7,8 @@ This document provides an overview of the ai_embedding_vector function in Databe

The main code implementation can be found [here](https://github.com/datafuselabs/databend/blob/1e93c5b562bd159ecb0f336bb88fd1b7f9dc4a62/src/common/openai/src/embedding.rs).

By default, Databend leverages the [text-embedding-ada](https://platform.openai.com/docs/models/embeddings) model for generating embeddings.

:::caution
Databend relies on OpenAI for `AI_EMBEDDING_VECTOR` and sends the embedding column data to OpenAI.

Expand All @@ -17,7 +19,6 @@ This function is available by default on [Databend Cloud](https://databend.com)

## Overview of ai_embedding_vector


The `ai_embedding_vector` function in Databend is a built-in function that generates vector embeddings for text data. It is useful for natural language processing tasks, such as document similarity, clustering, and recommendation systems.

The function takes a text input and returns a high-dimensional vector that represents the input text's semantic meaning and context. The embeddings are created using pre-trained models on large text corpora, capturing the relationships between words and phrases in a continuous space.
Expand All @@ -28,36 +29,42 @@ To create embeddings for a text document using the `ai_embedding_vector` functio
1. Create a table to store the documents:
```sql
CREATE TABLE documents (
doc_id INT,
text_content TEXT
id INT,
title VARCHAR,
content VARCHAR,
embedding ARRAY(FLOAT32)
);
```

2. Insert example documents into the table:
```sql
INSERT INTO documents (doc_id, text_content)
INSERT INTO documents(id, title, content)
VALUES
(1, 'Artificial intelligence is a fascinating field.'),
(2, 'Machine learning is a subset of AI.'),
(3, 'I love going to the beach on weekends.');
(1, 'A Brief History of AI', 'Artificial intelligence (AI) has been a fascinating concept of science fiction for decades...'),
(2, 'Machine Learning vs. Deep Learning', 'Machine learning and deep learning are two subsets of artificial intelligence...'),
(3, 'Neural Networks Explained', 'A neural network is a series of algorithms that endeavors to recognize underlying relationships...');
```

3. Create a table to store the embeddings:
3. Generate the embeddings:
```sql
CREATE TABLE embeddings (
doc_id INT,
text_content TEXT,
embedding ARRAY(FLOAT32)
);
UPDATE documents SET embedding = ai_embedding_vector(content) WHERE length(embedding) = 0;
```
After running the query, the embedding column in the table will contain the generated embeddings.

4. Generate embeddings for the text content and store them in the embeddings table:
```sql
INSERT INTO embeddings (doc_id, text_content, embedding)
SELECT doc_id, text_content, ai_embedding_vector(text_content)
FROM documents;
The embeddings are stored as an array of `FLOAT32` values in the embedding column, which has the `ARRAY(FLOAT32)` column type.

```
After running these SQL queries, the embeddings table will contain the generated embeddings for each document in the documents table. The embeddings are stored as an array of `FLOAT32` values in the embedding column, which has the `ARRAY(FLOAT32)` column type.
You can now use these embeddings for various natural language processing tasks, such as finding similar documents or clustering documents based on their content.

4. Inspect the embeddings:

You can now use these embeddings for various natural language processing tasks, such as finding similar documents or clustering documents based on their content.
```sql
SELECT length(embedding) FROM documents;
+-------------------+
| length(embedding) |
+-------------------+
| 1536 |
| 1536 |
| 1536 |
+-------------------+
```
The query above shows that the embedding generated for each document has a length of 1536 (its number of dimensions).
71 changes: 39 additions & 32 deletions docs/doc/15-sql-functions/61-ai-functions/04-ai-cosine-distance.md
@@ -1,59 +1,66 @@
---
title: 'COSINE_DISTANCE'
description: 'Measuring document similarity using the cosine_distance function in Databend'
description: 'Measuring similarity using the cosine_distance function in Databend'
---

This document provides an overview of the `cosine_distance` function in Databend and demonstrates how to measure document similarity using this function.
This document provides an overview of the cosine_distance function in Databend and demonstrates how to measure document similarity using this function.

:::info
The `cosine_distance` function performs vector computations within Databend and does not rely on the OpenAI API.
:::

## Overview of cosine_distance
The cosine_distance function performs vector computations within Databend and does not rely on the OpenAI API.

:::

The `cosine_distance` function in Databend is a built-in function that calculates the cosine distance between two vectors. It is commonly used in natural language processing tasks, such as document similarity and recommendation systems.
The cosine_distance function in Databend is a built-in function that calculates the cosine distance between two vectors. It is commonly used in natural language processing tasks, such as document similarity and recommendation systems.

Cosine distance measures how dissimilar two vectors are, based on the cosine of the angle between them: it is one minus the cosine similarity. The function takes two input vectors and returns that distance, with 0 indicating vectors that point in the same direction and 1 indicating orthogonal (unrelated) vectors. Values above 1, up to 2, arise only when the vectors point in opposing directions.
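
To make the definition concrete, here is a minimal Python sketch of cosine distance computed as one minus cosine similarity. It illustrates the formula only and is not Databend's implementation; the Python function name `cosine_distance` merely mirrors the SQL function, and the sample vectors are made up.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical sample vectors, chosen only to show the two reference points.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])      # vectors at a right angle
```

Parallel vectors give a distance of (numerically) zero regardless of their magnitudes, and perpendicular vectors give one, which is why the measure depends on direction rather than length.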

## Measuring similarity using cosine_distance
## Examples

To measure document similarity using the cosine_distance function, follow the example below. This example assumes that you have already created document embeddings using the ai_embedding_vector function and stored them in a table with the `ARRAY(FLOAT32)` column type.
**Creating a Table and Inserting Sample Data**

1. Create a table to store the documents and their embeddings:
Let's create a table to store some sample text documents and their corresponding embeddings:
```sql
CREATE TABLE documents (
doc_id INT,
text_content TEXT,
CREATE TABLE articles (
id INT,
title VARCHAR,
content VARCHAR,
embedding ARRAY(FLOAT32)
);

```

2. Insert example documents and their embeddings into the table:
Now, let's insert some sample documents into the table:
```sql
INSERT INTO documents (doc_id, text_content, embedding)
INSERT INTO articles (id, title, content, embedding)
VALUES
(1, 'Artificial intelligence is a fascinating field.', ai_embedding_vector('Artificial intelligence is a fascinating field.')),
(2, 'Machine learning is a subset of AI.', ai_embedding_vector('Machine learning is a subset of AI.')),
(3, 'I love going to the beach on weekends.', ai_embedding_vector('I love going to the beach on weekends.'));
(1, 'Python for Data Science', 'Python is a versatile programming language widely used in data science...', ai_embedding_vector('Python is a versatile programming language widely used in data science...')),
(2, 'Introduction to R', 'R is a popular programming language for statistical computing and graphics...', ai_embedding_vector('R is a popular programming language for statistical computing and graphics...')),
(3, 'Getting Started with SQL', 'Structured Query Language (SQL) is a domain-specific language used for managing relational databases...', ai_embedding_vector('Structured Query Language (SQL) is a domain-specific language used for managing relational databases...'));
```

3. Measure the similarity between a query document and the stored documents using the `cosine_distance` function:
**Querying for Similar Documents**

Now, let's find the documents that are most similar to a given query using the cosine_distance function:
```sql
SELECT doc_id, text_content, cosine_distance(embedding, ai_embedding_vector('What is a subfield of artificial intelligence?')) AS distance
FROM embeddings
ORDER BY distance ASC
LIMIT 5;
SELECT
id,
title,
content,
cosine_distance(embedding, ai_embedding_vector('How to use Python in data analysis?')) AS similarity
FROM
articles
ORDER BY
similarity ASC
LIMIT 3;
```
This SQL query calculates the cosine distance between the query document's embedding and the embeddings of the stored documents. The results are ordered by ascending distance, with the smallest distance indicating the highest similarity.

Result:
```sql
+--------+-------------------------------------------------+------------+
| doc_id | text_content | distance |
+--------+-------------------------------------------------+------------+
| 1 | Artificial intelligence is a fascinating field. | 0.10928339 |
| 2 | Machine learning is a subset of AI. | 0.13584924 |
| 3 | I love going to the beach on weekends. | 0.30774158 |
+--------+-------------------------------------------------+------------+
```
+------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+
| id | title | content | similarity |
+------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+
| 1 | Python for Data Science | Python is a versatile programming language widely used in data science... | 0.1142081 |
| 2 | Introduction to R | R is a popular programming language for statistical computing and graphics... | 0.18741018 |
| 3 | Getting Started with SQL | Structured Query Language (SQL) is a domain-specific language used for managing relational databases... | 0.25137568 |
+------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+
```
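
The ranking performed by the `ORDER BY ... ASC LIMIT` query above can be sketched outside the database as follows. This is an illustrative sketch, not how Databend executes the query: `rank_by_distance` is a hypothetical helper, and the two-dimensional vectors are toy stand-ins for real embeddings.

```python
import math

def cosine_distance(a, b):
    """Cosine distance as 1 - cosine similarity (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

def rank_by_distance(query, rows, limit):
    """Mimic ORDER BY distance ASC LIMIT n over (id, embedding) rows."""
    scored = [(doc_id, cosine_distance(emb, query)) for doc_id, emb in rows]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:limit]

# Toy 2-D "embeddings" (hypothetical values, for illustration only).
rows = [(1, [0.9, 0.1]), (2, [0.5, 0.5]), (3, [0.1, 0.9])]
ranked = rank_by_distance([1.0, 0.0], rows, limit=2)
ranked_ids = [doc_id for doc_id, _ in ranked]
```

Sorting in ascending order puts the closest match first, so the column aliased `similarity` in the query is really a distance: smaller values mean more similar documents.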
88 changes: 43 additions & 45 deletions docs/doc/15-sql-functions/61-ai-functions/index.md
Expand Up @@ -59,82 +59,80 @@ Databend provides built-in AI functions for various natural language processing

## Creating and storing embeddings using Databend


Here's an example:
Let's create a table to store some sample text documents and their corresponding embeddings:
```sql
CREATE TABLE documents (
doc_id INT,
text_content TEXT
);

INSERT INTO documents (doc_id, text_content)
VALUES
(1, 'Artificial intelligence is a fascinating field.'),
(2, 'Machine learning is a subset of AI.'),
(3, 'I love going to the beach on weekends.');

CREATE TABLE embeddings (
doc_id INT,
text_content TEXT,
CREATE TABLE articles (
id INT,
title VARCHAR,
content VARCHAR,
embedding ARRAY(FLOAT32)
);

INSERT INTO embeddings (doc_id, text_content, embedding)
SELECT doc_id, text_content, ai_embedding_vector(text_content)
FROM documents;
```

This SQL script creates a `documents` table, inserts the example documents, and then generates embeddings using the `ai_embedding_vector` function. The embeddings are stored in the embeddings table with the `ARRAY(FLOAT32)` column type.
Now, let's insert some sample documents into the table:
```sql
INSERT INTO articles (id, title, content, embedding)
VALUES
(1, 'Python for Data Science', 'Python is a versatile programming language widely used in data science...', ai_embedding_vector('Python is a versatile programming language widely used in data science...')),
(2, 'Introduction to R', 'R is a popular programming language for statistical computing and graphics...', ai_embedding_vector('R is a popular programming language for statistical computing and graphics...')),
(3, 'Getting Started with SQL', 'Structured Query Language (SQL) is a domain-specific language used for managing relational databases...', ai_embedding_vector('Structured Query Language (SQL) is a domain-specific language used for managing relational databases...'));
```

## Searching for similar documents using cosine distance

Suppose you have a question, "What is a subfield of artificial intelligence?", and you want to find the most related document from the stored embeddings. First, generate an embedding for the question using the `ai_embedding_vector` function:
Now, let's find the documents that are most similar to a given query using the [cosine_distance](04-ai-cosine-distance.md) function:
```sql
SELECT doc_id, text_content, cosine_distance(embedding, ai_embedding_vector('What is a subfield of artificial intelligence?')) AS distance
FROM embeddings
ORDER BY distance ASC
LIMIT 5;
SELECT
id,
title,
content,
cosine_distance(embedding, ai_embedding_vector('How to use Python in data analysis?')) AS similarity
FROM
articles
ORDER BY
similarity ASC
LIMIT 3;
```
This query will return the top 5 most similar documents to the input question, ordered by their cosine distance, with the smallest distance indicating the highest similarity.

Result:
```sql
+--------+-------------------------------------------------+------------+
| doc_id | text_content | distance |
+--------+-------------------------------------------------+------------+
| 1 | Artificial intelligence is a fascinating field. | 0.10928339 |
| 2 | Machine learning is a subset of AI. | 0.13584924 |
| 3 | I love going to the beach on weekends. | 0.30774158 |
+--------+-------------------------------------------------+------------+
+------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+
| id | title | content | similarity |
+------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+
| 1 | Python for Data Science | Python is a versatile programming language widely used in data science... | 0.1142081 |
| 2 | Introduction to R | R is a popular programming language for statistical computing and graphics... | 0.18741018 |
| 3 | Getting Started with SQL | Structured Query Language (SQL) is a domain-specific language used for managing relational databases... | 0.25137568 |
+------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+
```

## Generating text completions with Databend

Databend also supports a text completion function, ai_text_completion. For example, from the above output, we choose the document with the smallest cosine distance: "Artificial intelligence is a fascinating field." We can use this as context and provide the original question to the ai_text_completion function to generate a completion:
Databend also supports a text completion function, [ai_text_completion](03-ai-text-completion.md).

For example, from the above output, we choose the document with the smallest cosine distance: "Python is a versatile programming language widely used in data science...".

We can use this as context and provide the original question to the [ai_text_completion](03-ai-text-completion.md) function to generate a completion:

```sql
SELECT ai_text_completion('Artificial intelligence is a fascinating field. What is a subfield of artificial intelligence?') AS completion;
SELECT ai_text_completion('Python is a versatile programming language widely used in data science...') AS completion;
```

Result:
```sql
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| completion |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|
| A subfield of artificial intelligence is machine learning, which is the study of algorithms that allow computers to learn from data and improve their performance over time. Other subfields include natural language processing, computer vision, robotics, and deep learning. |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

*************************** 1. row ***************************
completion: and machine learning. It is known for its simplicity, readability, and ease of use. Python has a vast collection of libraries and frameworks that make it easy to perform complex tasks such as data analysis, visualization, and machine learning. Some of the popular libraries used in data science include NumPy, Pandas, Matplotlib, and Scikit-learn. Python is also used in web development, game development, and automation. Its popularity and versatility make it a valuable skill for programmers and data scientists.
```
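
The step of combining the retrieved context with the original question can be sketched on the client side as follows. This is a hedged illustration: `build_completion_query` is a hypothetical helper, not part of Databend or any client library, and the quote doubling is a simplification (a real client should prefer parameter binding where available).

```python
def build_completion_query(context, question):
    """Build a SELECT ai_text_completion(...) statement from context and question."""
    prompt = f"{context} {question}"
    # Double any single quotes so the prompt stays a valid SQL string literal.
    escaped = prompt.replace("'", "''")
    return f"SELECT ai_text_completion('{escaped}') AS completion;"

sql = build_completion_query(
    "Python is a versatile programming language widely used in data science...",
    "How to use Python in data analysis?",
)
```

Putting the context first and the question last gives the model the retrieved document as grounding before it sees what is being asked.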

You can experience these functions on our [Databend Cloud](https://databend.com), where you can sign up for a free trial and start using these AI functions right away. Databend's AI functions are designed to be easy to use, even for users who are not familiar with machine learning or natural language processing. With Databend, you can quickly and easily add powerful AI capabilities to your SQL queries and take your data analysis to the next level.
You can experience these functions on our [Databend Cloud](https://databend.com), where you can sign up for a free trial and start using these AI functions right away.

Databend's AI functions are designed to be easy to use, even for users who are not familiar with machine learning or natural language processing. With Databend, you can quickly and easily add powerful AI capabilities to your SQL queries and take your data analysis to the next level.

## Examples (https://ask.databend.rs)

We have utilized [Databend Cloud](https://databend.com) and its AI functions to create an interactive Q&A system for the https://databend.rs website. The demo site, https://ask.databend.rs, allows users to ask questions about any topic related to the https://databend.rs website.

:::note
:::info
You can also deploy Databend and configure the `openai_api_key`.
:::


1 comment on commit 9ba6313

@vercel vercel bot commented on 9ba6313 Apr 24, 2023

Successfully deployed to the following URLs:

databend – ./

databend-git-main-databend.vercel.app
databend.vercel.app
databend.rs
databend-databend.vercel.app