docs: refine the AI functions (#11205)

databendlabs · Apr 24, 2023
commit 9ba6313 · 1 parent 6170d21

Showing 4 changed files with 111 additions and 99 deletions.
diff --git a/docs/doc/15-sql-functions/61-ai-functions/01-ai-to-sql.md b/docs/doc/15-sql-functions/61-ai-functions/01-ai-to-sql.md
@@ -40,7 +40,7 @@ openai_api_key = "<your-key>"
 
 ## Examples
 
-In this example, a SQL query statement is generated from an instruction with the AI_TO_SQL function, and the resulting statement is executed to obtain the query results.
+In this example, an SQL query statement is generated from an instruction with the AI_TO_SQL function, and the resulting statement is executed to obtain the query results.
 
 1. Prepare data.
 

diff --git a/docs/doc/15-sql-functions/61-ai-functions/02-ai-embedding-vector.md b/docs/doc/15-sql-functions/61-ai-functions/02-ai-embedding-vector.md
@@ -7,6 +7,8 @@ This document provides an overview of the ai_embedding_vector function in Databe
 
 The main code implementation can be found [here](https://github.com/datafuselabs/databend/blob/1e93c5b562bd159ecb0f336bb88fd1b7f9dc4a62/src/common/openai/src/embedding.rs).
 
+By default, Databend uses the [text-embedding-ada-002](https://platform.openai.com/docs/models/embeddings) model to generate embeddings.
+
 :::caution
 Databend relies on OpenAI for `AI_EMBEDDING_VECTOR` and sends the embedding column data to OpenAI.
 
@@ -17,7 +19,6 @@ This function is available by default on [Databend Cloud](https://databend.com)
 
 ## Overview of ai_embedding_vector
 
-
 The `ai_embedding_vector` function in Databend is a built-in function that generates vector embeddings for text data. It is useful for natural language processing tasks, such as document similarity, clustering, and recommendation systems.
 
 The function takes a text input and returns a high-dimensional vector that represents the input text's semantic meaning and context. The embeddings are created using pre-trained models on large text corpora, capturing the relationships between words and phrases in a continuous space.
@@ -28,36 +29,42 @@ To create embeddings for a text document using the `ai_embedding_vector` functio
 1. Create a table to store the documents:
 ```sql
 CREATE TABLE documents (
-    doc_id INT,
-    text_content TEXT
+    id INT,
+    title VARCHAR,
+    content VARCHAR,
+    embedding ARRAY(FLOAT32)
 );
 ```
 
 2. Insert example documents into the table:
 ```sql
-INSERT INTO documents (doc_id, text_content)
+INSERT INTO documents(id, title, content)
 VALUES
-    (1, 'Artificial intelligence is a fascinating field.'),
-    (2, 'Machine learning is a subset of AI.'),
-    (3, 'I love going to the beach on weekends.');
+    (1, 'A Brief History of AI', 'Artificial intelligence (AI) has been a fascinating concept of science fiction for decades...'),
+    (2, 'Machine Learning vs. Deep Learning', 'Machine learning and deep learning are two subsets of artificial intelligence...'),
+    (3, 'Neural Networks Explained', 'A neural network is a series of algorithms that endeavors to recognize underlying relationships...'),
 ```
 
-3. Create a table to store the embeddings:
+3. Generate the embeddings:
 ```sql
-CREATE TABLE embeddings (
-    doc_id INT,
-    text_content TEXT,
-    embedding ARRAY(FLOAT32)
-);
+UPDATE documents SET embedding = ai_embedding_vector(content) WHERE length(embedding) = 0;
 ```
+After running the query, the embedding column in the table will contain the generated embeddings.
 
-4. Generate embeddings for the text content and store them in the embeddings table:
-```sql
-INSERT INTO embeddings (doc_id, text_content, embedding)
-SELECT doc_id, text_content, ai_embedding_vector(text_content)
-FROM documents;
+The embeddings are stored as an array of `FLOAT32` values in the embedding column, which has the `ARRAY(FLOAT32)` column type.
 
-```
-After running these SQL queries, the embeddings table will contain the generated embeddings for each document in the documents table. The embeddings are stored as an array of `FLOAT32` values in the embedding column, which has the `ARRAY(FLOAT32)` column type.
+You can now use these embeddings for various natural language processing tasks, such as finding similar documents or clustering documents based on their content.
+
+4. Inspect the embeddings:
 
-You can now use these embeddings for various natural language processing tasks, such as finding similar documents or clustering documents based on their content.
+```sql
+SELECT length(embedding) FROM documents;
++-------------------+
+| length(embedding) |
++-------------------+
+|              1536 |
+|              1536 |
+|              1536 |
++-------------------+
+```
+The query above shows that the embedding generated for each document has 1536 dimensions.
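
The `UPDATE ... WHERE length(embedding) = 0` step in this file fills only the rows whose embedding is still empty, so re-running it should not touch rows that already have one. A minimal Python sketch of that backfill pattern, outside Databend, may help illustrate it. The names `fake_embed` and `backfill_embeddings` are hypothetical, `fake_embed` is a deterministic stand-in rather than the OpenAI model, and the dimension of 8 is arbitrary.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def fake_embed(text: str, dim: int = 8) -> List[float]:
    """Hypothetical stand-in for an embedding model: deterministic, not semantic."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


def backfill_embeddings(rows: List[Row], embed: Callable[[str], List[float]]) -> int:
    """Embed only rows whose embedding is empty; return how many rows were updated."""
    updated = 0
    for row in rows:
        if len(row["embedding"]) == 0:  # mirrors WHERE length(embedding) = 0
            row["embedding"] = embed(row["content"])
            updated += 1
    return updated


rows = [
    {"id": 1, "content": "Artificial intelligence (AI) ...", "embedding": []},
    {"id": 2, "content": "Machine learning and deep learning ...", "embedding": []},
    {"id": 3, "content": "A neural network is ...", "embedding": [0.5] * 8},
]

first_pass = backfill_embeddings(rows, fake_embed)   # fills rows 1 and 2
second_pass = backfill_embeddings(rows, fake_embed)  # nothing left to fill
lengths = [len(r["embedding"]) for r in rows]
```

Because the filter checks for an empty array, the second pass is a no-op, and the row that already held an embedding is left unchanged.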
diff --git a/docs/doc/15-sql-functions/61-ai-functions/04-ai-cosine-distance.md b/docs/doc/15-sql-functions/61-ai-functions/04-ai-cosine-distance.md
@@ -1,59 +1,66 @@
 ---
 title: 'COSINE_DISTANCE'
-description: 'Measuring document similarity using the cosine_distance function in Databend'
+description: 'Measuring similarity using the cosine_distance function in Databend'
 ---
 
-This document provides an overview of the `cosine_distance` function in Databend and demonstrates how to measure document similarity using this function.
+This document provides an overview of the cosine_distance function in Databend and demonstrates how to measure document similarity using this function.
 
 :::info
-The `cosine_distance` function performs vector computations within Databend and does not rely on the OpenAI API.
-:::
 
-## Overview of cosine_distance
+The cosine_distance function performs vector computations within Databend and does not rely on the OpenAI API. 
+
+:::
 
-The `cosine_distance` function in Databend is a built-in function that calculates the cosine distance between two vectors. It is commonly used in natural language processing tasks, such as document similarity and recommendation systems.
+The cosine_distance function in Databend is a built-in function that calculates the cosine distance between two vectors. It is commonly used in natural language processing tasks, such as document similarity and recommendation systems.
 
 Cosine distance measures how dissimilar two vectors are, based on the cosine of the angle between them. The function takes two input vectors and returns 1 minus their cosine similarity: 0 indicates vectors that point in the same direction, 1 indicates orthogonal (unrelated) vectors, and values can in principle reach 2 for vectors that point in opposite directions.
 
-## Measuring similarity using cosine_distance
+## Examples
 
-To measure document similarity using the cosine_distance function, follow the example below. This example assumes that you have already created document embeddings using the ai_embedding_vector function and stored them in a table with the `ARRAY(FLOAT32)` column type.
+**Creating a Table and Inserting Sample Data**
 
-1. Create a table to store the documents and their embeddings:
+Let's create a table to store some sample text documents and their corresponding embeddings:
 ```sql
-CREATE TABLE documents (
-    doc_id INT,
-    text_content TEXT,
+CREATE TABLE articles (
+    id INT,
+    title VARCHAR,
+    content VARCHAR,
     embedding ARRAY(FLOAT32)
 );
-
 ```
 
-2. Insert example documents and their embeddings into the table:
+Now, let's insert some sample documents into the table:
 ```sql
-INSERT INTO documents (doc_id, text_content, embedding)
+INSERT INTO articles (id, title, content, embedding)
 VALUES
-    (1, 'Artificial intelligence is a fascinating field.', ai_embedding_vector('Artificial intelligence is a fascinating field.')),
-    (2, 'Machine learning is a subset of AI.', ai_embedding_vector('Machine learning is a subset of AI.')),
-    (3, 'I love going to the beach on weekends.', ai_embedding_vector('I love going to the beach on weekends.'));
+    (1, 'Python for Data Science', 'Python is a versatile programming language widely used in data science...', ai_embedding_vector('Python is a versatile programming language widely used in data science...')),
+    (2, 'Introduction to R', 'R is a popular programming language for statistical computing and graphics...', ai_embedding_vector('R is a popular programming language for statistical computing and graphics...')),
+    (3, 'Getting Started with SQL', 'Structured Query Language (SQL) is a domain-specific language used for managing relational databases...', ai_embedding_vector('Structured Query Language (SQL) is a domain-specific language used for managing relational databases...'));
 ```
 
-3. Measure the similarity between a query document and the stored documents using the `cosine_distance` function:
+**Querying for Similar Documents**
+
+Now, let's find the documents that are most similar to a given query using the cosine_distance function:
 ```sql
-SELECT doc_id, text_content, cosine_distance(embedding, ai_embedding_vector('What is a subfield of artificial intelligence?')) AS distance
-FROM embeddings
-ORDER BY distance ASC
-LIMIT 5;
+SELECT
+    id,
+    title,
+    content,
+    cosine_distance(embedding, ai_embedding_vector('How to use Python in data analysis?')) AS similarity
+FROM
+    articles
+ORDER BY
+    similarity ASC
+LIMIT 3;
 ```
-This SQL query calculates the cosine distance between the query document's embedding and the embeddings of the stored documents. The results are ordered by ascending distance, with the smallest distance indicating the highest similarity.
 
 Result:
 ```sql
-+--------+-------------------------------------------------+------------+
-| doc_id | text_content                                    | distance   |
-+--------+-------------------------------------------------+------------+
-|      1 | Artificial intelligence is a fascinating field. | 0.10928339 |
-|      2 | Machine learning is a subset of AI.             | 0.13584924 |
-|      3 | I love going to the beach on weekends.          | 0.30774158 |
-+--------+-------------------------------------------------+------------+
-```
++------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+
+| id   | title                    | content                                                                                                 | similarity |
++------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+
+|    1 | Python for Data Science  | Python is a versatile programming language widely used in data science...                               |  0.1142081 |
+|    2 | Introduction to R        | R is a popular programming language for statistical computing and graphics...                           | 0.18741018 |
+|    3 | Getting Started with SQL | Structured Query Language (SQL) is a domain-specific language used for managing relational databases... | 0.25137568 |
++------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+
+```
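
For readers who want to see what the similarity query in this file computes, here is a minimal Python sketch under the usual definition of cosine distance, 1 minus cosine similarity. This is an assumption consistent with the description in the doc (0 for identical direction, 1 for orthogonal vectors), not Databend's actual implementation. The helper `rank_by_distance` and the toy 2-dimensional vectors are hypothetical stand-ins for real embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(
    query: Sequence[float], docs: List[Tuple[str, Sequence[float]]]
) -> List[Tuple[str, float]]:
    """Score every document against the query and sort by ascending distance."""
    scored = [(title, cosine_distance(query, vec)) for title, vec in docs]
    return sorted(scored, key=lambda pair: pair[1])


# Toy vectors chosen so the distances are easy to check by hand.
docs = [
    ("opposite", [-1.0, 0.0]),
    ("orthogonal", [0.0, 1.0]),
    ("same direction", [2.0, 0.0]),
]
ranked = rank_by_distance([1.0, 0.0], docs)
titles = [title for title, _ in ranked]
```

Sorting by ascending distance puts the closest match first, which is what `ORDER BY similarity ASC` does in the SQL query. Vector length does not matter, only direction, so `[2.0, 0.0]` is at distance 0 from `[1.0, 0.0]`.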
diff --git a/docs/doc/15-sql-functions/61-ai-functions/index.md b/docs/doc/15-sql-functions/61-ai-functions/index.md
@@ -59,82 +59,80 @@ Databend provides built-in AI functions for various natural language processing
 
 ## Creating and storing embeddings using Databend
 
-
-Here's an example:
+Let's create a table to store some sample text documents and their corresponding embeddings:
 ```sql
-CREATE TABLE documents (
-    doc_id INT,
-    text_content TEXT
-);
-
-INSERT INTO documents (doc_id, text_content)
-VALUES
-    (1, 'Artificial intelligence is a fascinating field.'),
-    (2, 'Machine learning is a subset of AI.'),
-    (3, 'I love going to the beach on weekends.');
-
-CREATE TABLE embeddings (
-    doc_id INT,
-    text_content TEXT,
+CREATE TABLE articles (
+    id INT,
+    title VARCHAR,
+    content VARCHAR,
     embedding ARRAY(FLOAT32)
 );
-
-INSERT INTO embeddings (doc_id, text_content, embedding)
-SELECT doc_id, text_content, ai_embedding_vector(text_content)
-FROM documents;
 ```
 
-This SQL script creates a `documents` table, inserts the example documents, and then generates embeddings using the `ai_embedding_vector` function. The embeddings are stored in the embeddings table with the `ARRAY(FLOAT32)` column type.
+Now, let's insert some sample documents into the table:
+```sql
+INSERT INTO articles (id, title, content, embedding)
+VALUES
+    (1, 'Python for Data Science', 'Python is a versatile programming language widely used in data science...', ai_embedding_vector('Python is a versatile programming language widely used in data science...')),
+    (2, 'Introduction to R', 'R is a popular programming language for statistical computing and graphics...', ai_embedding_vector('R is a popular programming language for statistical computing and graphics...')),
+    (3, 'Getting Started with SQL', 'Structured Query Language (SQL) is a domain-specific language used for managing relational databases...', ai_embedding_vector('Structured Query Language (SQL) is a domain-specific language used for managing relational databases...'));
+```
 
 ## Searching for similar documents using cosine distance
 
-Suppose you have a question, "What is a subfield of artificial intelligence?", and you want to find the most related document from the stored embeddings. First, generate an embedding for the question using the `ai_embedding_vector` function:
+Now, let's find the documents that are most similar to a given query using the [cosine_distance](04-ai-cosine-distance.md) function:
 ```sql
-SELECT doc_id, text_content, cosine_distance(embedding, ai_embedding_vector('What is a subfield of artificial intelligence?')) AS distance
-FROM embeddings
-ORDER BY distance ASC
-LIMIT 5;
+SELECT
+    id,
+    title,
+    content,
+    cosine_distance(embedding, ai_embedding_vector('How to use Python in data analysis?')) AS similarity
+FROM
+    articles
+ORDER BY
+    similarity ASC
+LIMIT 3;
 ```
-This query will return the top 5 most similar documents to the input question, ordered by their cosine distance, with the smallest distance indicating the highest similarity.
 
 Result:
 ```sql
-+--------+-------------------------------------------------+------------+
-| doc_id | text_content                                    | distance   |
-+--------+-------------------------------------------------+------------+
-|      1 | Artificial intelligence is a fascinating field. | 0.10928339 |
-|      2 | Machine learning is a subset of AI.             | 0.13584924 |
-|      3 | I love going to the beach on weekends.          | 0.30774158 |
-+--------+-------------------------------------------------+------------+
++------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+
+| id   | title                    | content                                                                                                 | similarity |
++------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+
+|    1 | Python for Data Science  | Python is a versatile programming language widely used in data science...                               |  0.1142081 |
+|    2 | Introduction to R        | R is a popular programming language for statistical computing and graphics...                           | 0.18741018 |
+|    3 | Getting Started with SQL | Structured Query Language (SQL) is a domain-specific language used for managing relational databases... | 0.25137568 |
++------+--------------------------+---------------------------------------------------------------------------------------------------------+------------+
 ```
 
 ## Generating text completions with Databend
 
-Databend also supports a text completion function, ai_text_completion. For example, from the above output, we choose the document with the smallest cosine distance: "Artificial intelligence is a fascinating field." We can use this as context and provide the original question to the ai_text_completion function to generate a completion:
+Databend also supports a text completion function, [ai_text_completion](03-ai-text-completion.md).
+
+For example, from the above output, we choose the document with the smallest cosine distance: "Python is a versatile programming language widely used in data science...".
+
+We can use this as context and provide the original question to the [ai_text_completion](03-ai-text-completion.md) function to generate a completion:
 
 ```sql
-SELECT ai_text_completion('Artificial intelligence is a fascinating field. What is a subfield of artificial intelligence?') AS completion;
+SELECT ai_text_completion('Python is a versatile programming language widely used in data science...') AS completion;
 ```
 
 Result:
 ```sql
-+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| completion                                                                                                                                                                                                                                                                        |
-+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-|
-| A subfield of artificial intelligence is machine learning, which is the study of algorithms that allow computers to learn from data and improve their performance over time. Other subfields include natural language processing, computer vision, robotics, and deep learning.   |
-+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-```
 
+*************************** 1. row ***************************
+completion: and machine learning. It is known for its simplicity, readability, and ease of use. Python has a vast collection of libraries and frameworks that make it easy to perform complex tasks such as data analysis, visualization, and machine learning. Some of the popular libraries used in data science include NumPy, Pandas, Matplotlib, and Scikit-learn. Python is also used in web development, game development, and automation. Its popularity and versatility make it a valuable skill for programmers and data scientists.
+```
 
-You can experience these functions on our [Databend Cloud](https://databend.com), where you can sign up for a free trial and start using these AI functions right away. Databend's AI functions are designed to be easy to use, even for users who are not familiar with machine learning or natural language processing. With Databend, you can quickly and easily add powerful AI capabilities to your SQL queries and take your data analysis to the next level.
+You can experience these functions on our [Databend Cloud](https://databend.com), where you can sign up for a free trial and start using these AI functions right away.
 
+Databend's AI functions are designed to be easy to use, even for users who are not familiar with machine learning or natural language processing. With Databend, you can quickly and easily add powerful AI capabilities to your SQL queries and take your data analysis to the next level.
 
 ## Examples (https://ask.databend.rs)
 
 We have utilized [Databend Cloud](https://databend.com) and its AI functions to create an interactive Q&A system for the https://databend.rs website. The demo site, https://ask.databend.rs, allows users to ask questions about any topic related to the https://databend.rs website.
 
-:::note
+:::info
 You can also deploy Databend and configure the `openai_api_key`.
 :::
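
The text-completion example in `index.md` describes using the retrieved document as context together with the original question, while the updated query passes only the document text. As a rough illustration of how the two could be combined into one prompt string before it is handed to a completion function, here is a small Python sketch. The function `build_prompt` and its "Question:/Answer:" template are hypothetical choices for illustration, not something Databend or the Q&A demo is known to use.

```python
def build_prompt(context: str, question: str) -> str:
    """Join retrieved context and a user question into a single prompt string.

    The layout (context, blank line, 'Question:', 'Answer:') is an assumed
    template for illustration only.
    """
    context = context.strip()
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    if not context:
        # No retrieved document: fall back to the bare question.
        return f"Question: {question}\nAnswer:"
    return f"{context}\n\nQuestion: {question}\nAnswer:"


prompt = build_prompt(
    "Python is a versatile programming language widely used in data science...",
    "How to use Python in data analysis?",
)
```

The resulting string could then be supplied as the single argument of a completion call. Ending the prompt with `Answer:` is a common way to nudge a completion model toward answering rather than continuing the context.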