feat: add Voyage AI vectorizer integration
To configure a vectorizer with Voyage AI:

```sql
SELECT ai.create_vectorizer(
    'my_table'::regclass,
    embedding => ai.embedding_voyageai(
      'voyage-3-lite',
      512
    ),
    -- other parameters...
);
```

The vectorizer worker connects to the Voyage AI API with the API key
specified in the `VOYAGE_API_KEY` environment variable.

To get a vector embedding from SQL, use the `ai.voyageai_embed`
function:

```sql
SELECT ai.voyageai_embed('voyage-3-lite', 'text to embed');
```
JamesGuthrie committed Nov 26, 2024
1 parent 6a4a449 commit 81a6ac4
Showing 24 changed files with 119,471 additions and 7 deletions.
1 change: 1 addition & 0 deletions DEVELOPMENT.md
@@ -143,6 +143,7 @@ To set up the tests:
ENABLE_OLLAMA_TESTS=1
ENABLE_ANTHROPIC_TESTS=1
ENABLE_COHERE_TESTS=1
ENABLE_VOYAGEAI_TESTS=1
ENABLE_VECTORIZER_TESTS=1
ENABLE_DUMP_RESTORE_TESTS=1
ENABLE_PRIVILEGES_TESTS=1
2 changes: 2 additions & 0 deletions README.md
@@ -53,6 +53,7 @@ For other use cases, first [Install pgai](#installation) in Timescale Cloud, a p
* [OpenAI](./docs/openai.md) - configure pgai for OpenAI, then use the model to tokenize, embed, chat complete and moderate. This page also includes advanced examples.
* [Anthropic](./docs/anthropic.md) - configure pgai for Anthropic, then use the model to generate content.
* [Cohere](./docs/cohere.md) - configure pgai for Cohere, then use the model to tokenize, embed, chat complete, classify, and rerank.
* [Voyage AI](./docs/voyageai.md) - configure pgai for Voyage AI, then use the model to embed.
- Leverage LLMs for data processing tasks such as classification, summarization, and data enrichment ([see the OpenAI example](/docs/openai.md)).


@@ -166,6 +167,7 @@ You can use pgai to integrate AI from the following providers:
- [Anthropic](./docs/anthropic.md)
- [Cohere](./docs/cohere.md)
- [Llama 3 (via Ollama)](/docs/ollama.md)
- [Voyage AI](/docs/voyageai.md)
Learn how to [moderate](/docs/moderate.md) content directly in the database using triggers and background jobs.
42 changes: 42 additions & 0 deletions docs/vectorizer-api-reference.md
@@ -253,6 +253,7 @@ The embedding functions are:

- [ai.embedding_openai](#aiembedding_openai)
- [ai.embedding_ollama](#aiembedding_ollama)
- [ai.embedding_voyageai](#aiembedding_voyageai)

### ai.embedding_openai

@@ -343,6 +344,47 @@ The function takes several parameters to customize the Ollama embedding configuration:

A JSON configuration object that you can use in [ai.create_vectorizer](#create-vectorizers).

### ai.embedding_voyageai

You use the `ai.embedding_voyageai` function to use a Voyage AI model to generate embeddings.

The purpose of `ai.embedding_voyageai` is to:
- Define which Voyage AI model to use.
- Specify the dimensionality of the embeddings.
- Configure the model's truncation behaviour and the name of the API key.

#### Example usage

This function is used to create an embedding configuration object that is passed as an argument to [ai.create_vectorizer](#create-vectorizers):

```sql
SELECT ai.create_vectorizer(
    'my_table'::regclass,
    embedding => ai.embedding_voyageai(
      'voyage-3-lite',
      512,
      truncate => false,
      api_key_name => 'TEST_API_KEY'
    ),
    -- other parameters...
);
```

#### Parameters

The function takes several parameters to customize the Voyage AI embedding configuration:

| Name         | Type    | Default          | Required | Description |
|--------------|---------|------------------|----------|-------------|
| model        | text    | -                | ✔        | Specify the name of the Voyage AI [model](https://docs.voyageai.com/docs/embeddings#model-choices) to use. |
| dimensions   | int     | -                | ✔        | Define the number of dimensions for the embedding vectors. This should match the output dimensions of the chosen model. |
| truncate     | boolean | true             | ✖        | Truncates the end of each input to fit within the chosen model's context length. If set to false, embedding fails for any chunk that exceeds the context length. |
| api_key_name | text    | `VOYAGE_API_KEY` | ✖        | Set the name of the environment variable that contains the Voyage AI API key. This allows for flexible API key management without hardcoding keys in the database. On Timescale Cloud, you should set this to the name of the secret that contains the Voyage AI API key. |
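
The `api_key_name` lookup can be pictured with a minimal Python sketch. It assumes the key is read from the process environment by name; `resolve_api_key` is a hypothetical helper written for illustration and is not part of pgai or the vectorizer worker.

```python
import os
from typing import Mapping, Optional

DEFAULT_KEY_NAME = "VOYAGE_API_KEY"  # same default name as in this commit


def resolve_api_key(
    api_key_name: str = DEFAULT_KEY_NAME,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper: look up the API key by environment variable name."""
    # Fall back to the real process environment when no mapping is supplied.
    env = os.environ if env is None else env
    key = env.get(api_key_name)
    if not key:
        raise RuntimeError(f"environment variable {api_key_name} is not set")
    return key
```

Passing `api_key_name => 'TEST_API_KEY'` in SQL would correspond to calling `resolve_api_key("TEST_API_KEY")` in this sketch, so the key itself never has to be stored in the database.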

#### Returns

A JSON configuration object that you can use in [ai.create_vectorizer](#create-vectorizers).
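
To make the shape of that configuration object concrete, here is a rough Python mirror of the `json_object(... absent on null)` call in the SQL definition of `ai.embedding_voyageai`. The function name `embedding_voyageai_config` is invented for this sketch; only the keys and defaults come from the commit.

```python
import json
from typing import Any, Optional


def embedding_voyageai_config(
    model: str,
    dimensions: int,
    truncate: Optional[bool] = True,
    api_key_name: Optional[str] = "VOYAGE_API_KEY",
) -> dict[str, Any]:
    """Illustrative mirror of the SQL function; not part of pgai."""
    config = {
        "implementation": "voyageai",
        "config_type": "embedding",
        "model": model,
        "dimensions": dimensions,
        "truncate": truncate,
        "api_key_name": api_key_name,
    }
    # Mirror `absent on null`: keys whose value is None are left out.
    return {k: v for k, v in config.items() if v is not None}


print(json.dumps(embedding_voyageai_config("voyage-3-lite", 512, truncate=False), indent=2))
```

The `implementation` key is what `ai._validate_embedding` switches on, which is why this commit also adds a `'voyageai'` branch there.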

## Formatting configuration

You use the `ai.formatting_python_template` function in `pgai` to
167 changes: 167 additions & 0 deletions docs/voyageai.md
@@ -0,0 +1,167 @@
# Use pgai with Voyage AI

This page shows you how to:

- [Configure pgai for Voyage AI](#configure-pgai-for-voyage-ai)
- [Add AI functionality to your database](#usage)
- [Follow advanced AI examples](#advanced-examples)

## Configure pgai for Voyage AI

Most pgai functions require a [Voyage AI API key](https://docs.voyageai.com/docs/api-key-and-installation#authentication-with-api-keys).

- [Handle API keys using pgai from psql](#handle-api-keys-using-pgai-from-psql)
- [Handle API keys using pgai from python](#handle-api-keys-using-pgai-from-python)

### Handle API keys using pgai from psql

The API key is an [optional parameter to pgai functions](https://www.postgresql.org/docs/current/sql-syntax-calling-funcs.html).
You can either:

* [Run AI queries by passing your API key implicitly as a session parameter](#run-ai-queries-by-passing-your-api-key-implicitly-as-a-session-parameter)
* [Run AI queries by passing your API key explicitly as a function argument](#run-ai-queries-by-passing-your-api-key-explicitly-as-a-function-argument)

#### Run AI queries by passing your API key implicitly as a session parameter

To use a [session level parameter when connecting to your database with psql](https://www.postgresql.org/docs/current/config-setting.html#CONFIG-SETTING-SHELL)
to run your AI queries:

1. Set your Voyage AI key as an environment variable in your shell:
```bash
export VOYAGE_API_KEY="this-is-my-super-secret-api-key-dont-tell"
```
1. Use the session level parameter when you connect to your database:

```bash
PGOPTIONS="-c ai.voyage_api_key=$VOYAGE_API_KEY" psql -d "postgres://<username>:<password>@<host>:<port>/<database-name>"
```

1. Run your AI query:

`ai.voyage_api_key` is set for the duration of your psql session, so you do not need to specify it for pgai functions.

```sql
SELECT * FROM ai.voyageai_embed('voyage-3-lite', 'sample text to embed');
```

#### Run AI queries by passing your API key explicitly as a function argument

1. Set your Voyage AI key as an environment variable in your shell:
```bash
export VOYAGE_API_KEY="this-is-my-super-secret-api-key-dont-tell"
```

2. Connect to your database and set your API key as a [psql variable](https://www.postgresql.org/docs/current/app-psql.html#APP-PSQL-VARIABLES):

```bash
psql -d "postgres://<username>:<password>@<host>:<port>/<database-name>" -v voyage_api_key=$VOYAGE_API_KEY
```
Your API key is now available as a psql variable named `voyage_api_key` in your psql session.

You can also log into the database, then set `voyage_api_key` using the `\getenv` [metacommand](https://www.postgresql.org/docs/current/app-psql.html#APP-PSQL-META-COMMAND-GETENV):

```sql
\getenv voyage_api_key VOYAGE_API_KEY
```

3. Pass your API key to your parameterized query:
```sql
SELECT *
FROM ai.voyageai_embed('voyage-3-lite', 'sample text to embed', api_key=>$1)
\bind :voyage_api_key
\g
```

Use [\bind](https://www.postgresql.org/docs/current/app-psql.html#APP-PSQL-META-COMMAND-BIND) to pass the value of `voyage_api_key` to the parameterized query.

The `\bind` metacommand is available in psql version 16+.

4. Once you have used `\getenv` to load the environment variable into a psql variable,
you can optionally set it as a session-level parameter, which pgai functions then use implicitly.
```sql
SELECT set_config('ai.voyage_api_key', $1, false) IS NOT NULL
\bind :voyage_api_key
\g
```

```sql
SELECT * FROM ai.voyageai_embed('voyage-3-lite', 'sample text to embed');
```

### Handle API keys using pgai from python

1. In your Python environment, include the dotenv and postgres driver packages:

```bash
pip install python-dotenv
pip install psycopg2-binary
```

1. Set your Voyage AI key in a .env file or as an environment variable:
```bash
VOYAGE_API_KEY="this-is-my-super-secret-api-key-dont-tell"
DB_URL="your connection string"
```

1. Pass your API key as a parameter to your queries:

```python
import os
import psycopg2
from dotenv import load_dotenv
load_dotenv()
VOYAGE_API_KEY = os.environ["VOYAGE_API_KEY"]
DB_URL = os.environ["DB_URL"]
with psycopg2.connect(DB_URL) as conn:
    with conn.cursor() as cur:
        # pass the API key as a query parameter; do not build the SQL with string manipulation
        cur.execute("SELECT * FROM ai.voyageai_embed('voyage-3-lite', 'sample text to embed', api_key=>%s)", (VOYAGE_API_KEY,))
        records = cur.fetchall()
```

Do not use string manipulation to embed the key as a literal in the SQL query.
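
As a small illustration of that rule, the sketch below keeps the SQL text and the values separate so the key never appears in the query string. `build_embed_query` is a hypothetical helper introduced here, and parameterizing the model and input text as well is an assumption of this sketch, not something the example above requires.

```python
def build_embed_query(model: str, text: str, api_key: str) -> tuple[str, tuple[str, str, str]]:
    """Hypothetical helper: return the SQL text and its parameters separately."""
    # The placeholders are filled in by the driver, not by Python string formatting.
    sql = "SELECT * FROM ai.voyageai_embed(%s, %s, api_key => %s)"
    return sql, (model, text, api_key)


sql, params = build_embed_query("voyage-3-lite", "sample text to embed", "not-a-real-key")
```

With an open psycopg2 cursor, the pair can be passed straight to `cur.execute(sql, params)`.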


## Usage

This section shows you how to use AI directly from your database using SQL.

- [Embed](#embed): generate [embeddings](https://docs.voyageai.com/docs/embeddings) using a
specified model.

### Embed

Generate [embeddings](https://docs.voyageai.com/docs/embeddings) using a specified model.

- Request an embedding using a specific model:

```sql
SELECT ai.voyageai_embed
( 'voyage-3-lite'
, 'the purple elephant sits on a red mushroom'
);
```

The data returned looks like:

```text
voyageai_embed
--------------------------------------------------------
[0.005978798,-0.020522336,...-0.0022857306,-0.023699166]
(1 row)
```

- Pass an array of text inputs:

```sql
SELECT ai.voyageai_embed
( 'voyage-3-lite'
, array['Timescale is Postgres made Powerful', 'the purple elephant sits on a red mushroom']
);
```

22 changes: 22 additions & 0 deletions projects/extension/ai/voyageai.py
@@ -0,0 +1,22 @@
import voyageai
from typing import Optional, Generator

DEFAULT_KEY_NAME = "VOYAGE_API_KEY"


def embed(
    model: str,
    input: list[str],
    api_key: str,
    input_type: Optional[str] = None,
    truncation: Optional[bool] = None,
) -> Generator[tuple[int, list[float]], None, None]:
    client = voyageai.Client(api_key=api_key)
    args = {}
    if truncation is not None:
        args["truncation"] = truncation
    response = client.embed(input, model=model, input_type=input_type, **args)
    if not hasattr(response, "embeddings"):
        return
    for idx, obj in enumerate(response.embeddings):
        yield idx, obj
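
For readers who want to see what this generator yields without calling the Voyage AI API, here is a self-contained sketch that swaps in a stand-in client. `FakeClient`, `FakeResponse`, and `embed_with` are invented for illustration and their vectors are made up; only the control flow follows `embed` above.

```python
from dataclasses import dataclass, field
from typing import Generator, Optional


@dataclass
class FakeResponse:
    embeddings: list[list[float]] = field(default_factory=list)


class FakeClient:
    """Stand-in for voyageai.Client; returns one made-up 2-d vector per input."""

    def embed(self, texts, model, input_type=None, **kwargs):
        return FakeResponse([[float(len(t)), float(i)] for i, t in enumerate(texts)])


def embed_with(
    client,
    model: str,
    texts: list[str],
    input_type: Optional[str] = None,
    truncation: Optional[bool] = None,
) -> Generator[tuple[int, list[float]], None, None]:
    args = {}
    # Forward truncation only when the caller set it, as embed() does.
    if truncation is not None:
        args["truncation"] = truncation
    response = client.embed(texts, model=model, input_type=input_type, **args)
    # Yield (position of the input, embedding for that input).
    for idx, vector in enumerate(getattr(response, "embeddings", [])):
        yield idx, vector


pairs = list(embed_with(FakeClient(), "voyage-3-lite", ["ab", "cde"]))
```

Each yielded tuple lines up with one row of the `("index", embedding)` table returned by the array overload of `ai.voyageai_embed`.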
3 changes: 2 additions & 1 deletion projects/extension/requirements.txt
@@ -3,4 +3,5 @@ tiktoken==0.7.0
ollama==0.2.1
anthropic==0.29.0
cohere==5.5.8
backoff==2.2.1
backoff==2.2.1
voyageai==0.3.1
24 changes: 24 additions & 0 deletions projects/extension/sql/idempotent/008-embedding.sql
@@ -47,6 +47,28 @@ $func$ language sql immutable security invoker
set search_path to pg_catalog, pg_temp
;

-------------------------------------------------------------------------------
-- embedding_voyageai
create or replace function ai.embedding_voyageai
( model text
, dimensions int
, truncate boolean default true
, api_key_name text default 'VOYAGE_API_KEY'
) returns jsonb
as $func$
select json_object
( 'implementation': 'voyageai'
, 'config_type': 'embedding'
, 'model': model
, 'dimensions': dimensions
, 'truncate': truncate
, 'api_key_name': api_key_name
absent on null
)
$func$ language sql immutable security invoker
set search_path to pg_catalog, pg_temp
;

-------------------------------------------------------------------------------
-- _validate_embedding
create or replace function ai._validate_embedding(config jsonb) returns void
@@ -69,6 +91,8 @@ begin
-- ok
when 'ollama' then
-- ok
when 'voyageai' then
-- ok
else
if _implementation is null then
raise exception 'embedding implementation not specified';
48 changes: 48 additions & 0 deletions projects/extension/sql/idempotent/015-voyageai.sql
@@ -0,0 +1,48 @@
-------------------------------------------------------------------------------
-- voyageai_embed
-- generate an embedding from a text value
-- https://docs.voyageai.com/reference/embeddings-api
create or replace function ai.voyageai_embed
( model text
, input_text text
, input_type text default null
, api_key text default null
, api_key_name text default null
) returns @extschema:vector@.vector
as $python$
#ADD-PYTHON-LIB-DIR
import ai.voyageai
import ai.secrets
api_key_resolved = ai.secrets.get_secret(plpy, api_key, api_key_name, ai.voyageai.DEFAULT_KEY_NAME, SD)
for tup in ai.voyageai.embed(model, [input_text], api_key=api_key_resolved, input_type=input_type):
    return tup[1]
$python$
language plpython3u immutable parallel safe security invoker
set search_path to pg_catalog, pg_temp
;

-------------------------------------------------------------------------------
-- voyageai_embed
-- generate embeddings from an array of text values
-- https://docs.voyageai.com/reference/embeddings-api
create or replace function ai.voyageai_embed
( model text
, input_texts text[]
, api_key text default null
, api_key_name text default null
, input_type text default null
) returns table
( "index" int
, embedding @extschema:vector@.vector
)
as $python$
#ADD-PYTHON-LIB-DIR
import ai.voyageai
import ai.secrets
api_key_resolved = ai.secrets.get_secret(plpy, api_key, api_key_name, ai.voyageai.DEFAULT_KEY_NAME, SD)
for tup in ai.voyageai.embed(model, input_texts, api_key=api_key_resolved, input_type=input_type):
    yield tup
$python$
language plpython3u immutable parallel safe security invoker
set search_path to pg_catalog, pg_temp
;
5 changes: 4 additions & 1 deletion projects/extension/tests/contents/output16.expected
@@ -22,6 +22,7 @@ CREATE EXTENSION
function ai.drop_vectorizer(integer,boolean)
function ai.embedding_ollama(text,integer,text,boolean,jsonb,text)
function ai.embedding_openai(text,integer,text,text)
function ai.embedding_voyageai(text,integer,boolean,text)
function ai.enable_vectorizer_schedule(integer)
function ai.execute_vectorizer(integer)
function ai.formatting_python_template(text)
@@ -79,6 +80,8 @@ CREATE EXTENSION
function ai._vectorizer_should_create_vector_index(ai.vectorizer)
function ai._vectorizer_source_pk(regclass)
function ai._vectorizer_vector_index_exists(name,name,jsonb)
function ai.voyageai_embed(text,text,text,text,text)
function ai.voyageai_embed(text,text[],text,text,text)
sequence ai.vectorizer_id_seq
table ai.feature_flag
table ai.migration
@@ -87,7 +90,7 @@ CREATE EXTENSION
table ai.vectorizer_errors
view ai.secret_permissions
view ai.vectorizer_status
(83 rows)
(86 rows)

Table "ai._secret_permissions"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description