feat: add Voyage AI vectorizer integration #256

Merged: 7 commits, Dec 5, 2024 (showing changes from 4 commits)
1 change: 1 addition & 0 deletions DEVELOPMENT.md
@@ -143,6 +143,7 @@ To set up the tests:
ENABLE_OLLAMA_TESTS=1
ENABLE_ANTHROPIC_TESTS=1
ENABLE_COHERE_TESTS=1
ENABLE_VOYAGEAI_TESTS=1
ENABLE_VECTORIZER_TESTS=1
ENABLE_DUMP_RESTORE_TESTS=1
ENABLE_PRIVILEGES_TESTS=1
2 changes: 2 additions & 0 deletions README.md
@@ -53,6 +53,7 @@ For other use cases, first [Install pgai](#installation) in Timescale Cloud, a p
* [OpenAI](./docs/openai.md) - configure pgai for OpenAI, then use the model to tokenize, embed, chat complete and moderate. This page also includes advanced examples.
* [Anthropic](./docs/anthropic.md) - configure pgai for Anthropic, then use the model to generate content.
* [Cohere](./docs/cohere.md) - configure pgai for Cohere, then use the model to tokenize, embed, chat complete, classify, and rerank.
* [Voyage AI](./docs/voyageai.md) - configure pgai for Voyage AI, then use the model to embed.
- Leverage LLMs for data processing tasks such as classification, summarization, and data enrichment ([see the OpenAI example](/docs/openai.md)).


@@ -166,6 +167,7 @@ You can use pgai to integrate AI from the following providers:
- [Anthropic](./docs/anthropic.md)
- [Cohere](./docs/cohere.md)
- [Llama 3 (via Ollama)](/docs/ollama.md)
- [Voyage AI](/docs/voyageai.md)

Learn how to [moderate](/docs/moderate.md) content directly in the database using triggers and background jobs.

44 changes: 44 additions & 0 deletions docs/vectorizer-api-reference.md
@@ -253,6 +253,7 @@ The embedding functions are:

- [ai.embedding_openai](#aiembedding_openai)
- [ai.embedding_ollama](#aiembedding_ollama)
- [ai.embedding_voyageai](#aiembedding_voyageai)

### ai.embedding_openai

@@ -343,6 +344,49 @@ The function takes several parameters to customize the Ollama embedding configur

A JSON configuration object that you can use in [ai.create_vectorizer](#create-vectorizers).

### ai.embedding_voyageai

You use the `ai.embedding_voyageai` function to generate embeddings with a Voyage AI model.

The purpose of `ai.embedding_voyageai` is to:
- Define which Voyage AI model to use.
- Specify the dimensionality of the embeddings.
- Configure the model's truncation behaviour and API key name.
- Configure the input type.

#### Example usage

This function is used to create an embedding configuration object that is passed as an argument to [ai.create_vectorizer](#create-vectorizers):

```sql
SELECT ai.create_vectorizer(
    'my_table'::regclass,
    embedding => ai.embedding_voyageai(
        'voyage-3-lite',
        512,
        truncate => false,
        api_key_name => 'TEST_API_KEY'
    ),
    -- other parameters...
);
```

Contributor commented on `truncate => false,`:

> I believe we should not set `truncate` to false in all of our examples unless we explicitly want to show the behaviour when it is set to false. Otherwise, we might confuse users, who 99.9% of the time will want this to be true as the default.
>
> Suggested change: remove the `truncate => false,` line.

Member Author replied:

> done

#### Parameters

The function takes several parameters to customize the Voyage AI embedding configuration:

| Name | Type | Default | Required | Description |
|--------------|---------|------------------|----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| model | text | - | ✔ | Specify the name of the [Voyage AI model](https://docs.voyageai.com/docs/embeddings#model-choices) to use. |
| dimensions | int | - | ✔ | Define the number of dimensions for the embedding vectors. This should match the output dimensions of the chosen model. |
| truncate | boolean | true | ✖ | Truncates the end of each input to fit within the chosen model's context length. Embedding fails (for a given chunk) if set to false and the context length is exceeded. |
| input_type   | text    | 'document'       | ✖        | Type of the input text: null, 'query', or 'document'. |
| api_key_name | text | `VOYAGE_API_KEY` | ✖ | Set the name of the environment variable that contains the Voyage AI API key. This allows for flexible API key management without hardcoding keys in the database. On Timescale Cloud, you should set this to the name of the secret that contains the Voyage AI API key. |

#### Returns

A JSON configuration object that you can use in [ai.create_vectorizer](#create-vectorizers).
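
The returned object mirrors the parameters above. A minimal sketch of inspecting it on its own, based on the `ai.embedding_voyageai` definition in this PR (the key order of the `jsonb` output may differ):

```sql
SELECT ai.embedding_voyageai('voyage-3-lite', 512);
-- expected keys, with the defaults filled in:
-- {"model": "voyage-3-lite", "dimensions": 512, "truncate": true,
--  "input_type": "document", "api_key_name": "VOYAGE_API_KEY",
--  "config_type": "embedding", "implementation": "voyageai"}
```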

## Formatting configuration

You use the `ai.formatting_python_template` function in `pgai` to
182 changes: 182 additions & 0 deletions docs/voyageai.md
Member Author commented:

> I basically copy-pasted this from docs/openai.md and search-replaced "openai" with "voyageai", which points to the documentation here being quite duplicated.

@@ -0,0 +1,182 @@
# Use pgai with Voyage AI

This page shows you how to:

- [Configure pgai for Voyage AI](#configure-pgai-for-voyage-ai)
- [Add AI functionality to your database](#usage)
- [Follow advanced AI examples](#advanced-examples)

## Configure pgai for Voyage AI

Most pgai functions require a [Voyage AI API key](https://docs.voyageai.com/docs/api-key-and-installation#authentication-with-api-keys).
Collaborator commented:

> nit: We should reword this sentence (and the corresponding ones in the other docs). When it was first authored, we only supported OpenAI, so "most pgai functions" DID require an OpenAI key. This sentence was copy/pasted around. Now, most OpenAI functions require an OpenAI API key, but MOST pgai functions do not. Same goes for Voyage AI and all the other providers.

Member Author replied:

> I will take this up as a follow-on task.


- [Handle API keys using pgai from psql](#handle-api-keys-using-pgai-from-psql)
- [Handle API keys using pgai from python](#handle-api-keys-using-pgai-from-python)

### Handle API keys using pgai from psql

The API key is an [optional parameter to pgai functions](https://www.postgresql.org/docs/current/sql-syntax-calling-funcs.html).
You can either:

* [Run AI queries by passing your API key implicitly as a session parameter](#run-ai-queries-by-passing-your-api-key-implicitly-as-a-session-parameter)
* [Run AI queries by passing your API key explicitly as a function argument](#run-ai-queries-by-passing-your-api-key-explicitly-as-a-function-argument)

#### Run AI queries by passing your API key implicitly as a session parameter

To use a [session level parameter when connecting to your database with psql](https://www.postgresql.org/docs/current/config-setting.html#CONFIG-SETTING-SHELL)
to run your AI queries:

1. Set your Voyage AI key as an environment variable in your shell:
```bash
export VOYAGE_API_KEY="this-is-my-super-secret-api-key-dont-tell"
```
1. Use the session level parameter when you connect to your database:

```bash
PGOPTIONS="-c ai.voyage_api_key=$VOYAGE_API_KEY" psql -d "postgres://<username>:<password>@<host>:<port>/<database-name>"
```

1. Run your AI query:

`ai.voyage_api_key` is set for the duration of your psql session, so you do not need to specify it for pgai functions.

```sql
SELECT * FROM ai.voyageai_embed('voyage-3-lite', 'sample text to embed');
```

#### Run AI queries by passing your API key explicitly as a function argument

1. Set your Voyage AI key as an environment variable in your shell:
```bash
export VOYAGE_API_KEY="this-is-my-super-secret-api-key-dont-tell"
```

2. Connect to your database and set your API key as a [psql variable](https://www.postgresql.org/docs/current/app-psql.html#APP-PSQL-VARIABLES):

```bash
psql -d "postgres://<username>:<password>@<host>:<port>/<database-name>" -v voyage_api_key=$VOYAGE_API_KEY
```
Your API key is now available as a psql variable named `voyage_api_key` in your psql session.

You can also log into the database, then set `voyage_api_key` using the `\getenv` [metacommand](https://www.postgresql.org/docs/current/app-psql.html#APP-PSQL-META-COMMAND-GETENV):

```sql
\getenv voyage_api_key VOYAGE_API_KEY
```

3. Pass your API key to your parameterized query:
```sql
SELECT *
FROM ai.voyageai_embed('voyage-3-lite', 'sample text to embed', api_key=>$1)
ORDER BY created DESC
\bind :voyage_api_key
\g
```

Use [\bind](https://www.postgresql.org/docs/current/app-psql.html#APP-PSQL-META-COMMAND-BIND) to pass the value of `voyage_api_key` to the parameterized query.

The `\bind` metacommand is available in psql version 16+.

4. Once you have used `\getenv` to load the environment variable into a psql variable,
you can optionally set it as a session-level parameter, so that subsequent pgai calls pick it up without an explicit `api_key` argument.
```sql
SELECT set_config('ai.voyage_api_key', $1, false) IS NOT NULL
\bind :voyage_api_key
\g
```

```sql
SELECT * FROM ai.voyageai_embed('voyage-3-lite', 'sample text to embed');
```

### Handle API keys using pgai from python

1. In your Python environment, include the dotenv and postgres driver packages:

```bash
pip install python-dotenv
pip install psycopg2-binary
```

1. Set your Voyage AI key in a .env file or as an environment variable:
```bash
VOYAGE_API_KEY="this-is-my-super-secret-api-key-dont-tell"
DB_URL="your connection string"
```

1. Pass your API key as a parameter to your queries:

```python
import os
from dotenv import load_dotenv

load_dotenv()

VOYAGE_API_KEY = os.environ["VOYAGE_API_KEY"]
DB_URL = os.environ["DB_URL"]

import psycopg2

with psycopg2.connect(DB_URL) as conn:
    with conn.cursor() as cur:
        # pass the API key as a parameter to the query. don't use string manipulations
        cur.execute("SELECT * FROM ai.voyageai_embed('voyage-3-lite', 'sample text to embed', api_key=>%s)", (VOYAGE_API_KEY,))
        records = cur.fetchall()
```

Do not use string manipulation to embed the key as a literal in the SQL query.


## Usage

This section shows you how to use AI directly from your database using SQL.

- [Embed](#embed): generate [embeddings](https://docs.voyageai.com/docs/embeddings) using a
specified model.

### Embed

Generate [embeddings](https://docs.voyageai.com/docs/embeddings) using a specified model.

- Request an embedding using a specific model:

```sql
SELECT ai.voyageai_embed
( 'voyage-3-lite'
, 'the purple elephant sits on a red mushroom'
);
```

The data returned looks like:

```text
voyageai_embed
--------------------------------------------------------
[0.005978798,-0.020522336,...-0.0022857306,-0.023699166]
(1 row)
```

- Pass an array of text inputs:

```sql
SELECT ai.voyageai_embed
( 'voyage-3-lite'
, array['Timescale is Postgres made Powerful', 'the purple elephant sits on a red mushroom']
);
```
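
  With an array input, `ai.voyageai_embed` returns a set of rows rather than a single vector, pairing each zero-based `index` with its embedding (see the second `ai.voyageai_embed` signature in `projects/extension/sql/idempotent/015-voyageai.sql` below). A minimal sketch of selecting from it:

  ```sql
  SELECT "index", embedding
  FROM ai.voyageai_embed
  ( 'voyage-3-lite'
  , array['Timescale is Postgres made Powerful', 'the purple elephant sits on a red mushroom']
  );
  ```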

- Specify the input type

The Voyage AI API allows setting the `input_type` to `"document"` or
`"query"` (or leaving it unset). Setting this value correctly should enhance
retrieval quality:

```sql
SELECT ai.voyageai_embed
( 'voyage-3-lite'
, 'A query'
, input_type => 'query'
);
```


22 changes: 22 additions & 0 deletions projects/extension/ai/voyageai.py
@@ -0,0 +1,22 @@
import voyageai
from typing import Optional, Generator

DEFAULT_KEY_NAME = "VOYAGE_API_KEY"


def embed(
    model: str,
    input: list[str],
    api_key: str,
    input_type: Optional[str] = None,
    truncation: Optional[bool] = None,
) -> Generator[tuple[int, list[float]], None, None]:
    client = voyageai.Client(api_key=api_key)
    args = {}
    if truncation is not None:
        args["truncation"] = truncation
    response = client.embed(input, model=model, input_type=input_type, **args)
    if not hasattr(response, "embeddings"):
        return None
    # yield (index, embedding) pairs so callers can match outputs to inputs
    for idx, obj in enumerate(response.embeddings):
        yield idx, obj
3 changes: 2 additions & 1 deletion projects/extension/requirements.txt
@@ -3,4 +3,5 @@ tiktoken==0.7.0
ollama==0.2.1
anthropic==0.29.0
cohere==5.5.8
backoff==2.2.1
backoff==2.2.1
voyageai==0.3.1
1 change: 1 addition & 0 deletions projects/extension/setup.cfg
@@ -13,3 +13,4 @@ install_requires =
anthropic==0.29.0
cohere==5.5.8
backoff==2.2.1
voyageai==0.3.1
33 changes: 33 additions & 0 deletions projects/extension/sql/idempotent/008-embedding.sql
@@ -47,6 +47,37 @@ $func$ language sql immutable security invoker
set search_path to pg_catalog, pg_temp
;

-------------------------------------------------------------------------------
-- embedding_voyageai
create or replace function ai.embedding_voyageai
( model text
, dimensions int
, truncate boolean default true
, input_type text default 'document'
, api_key_name text default 'VOYAGE_API_KEY'
) returns jsonb
as $func$
begin
    if input_type is not null and input_type not in ('query', 'document') then
        -- Note: purposefully not using an enum here because types make life complicated
        raise exception 'invalid input_type for voyage ai "%"', input_type;
    end if;

    return json_object
    ( 'implementation': 'voyageai'
    , 'config_type': 'embedding'
    , 'model': model
    , 'dimensions': dimensions
    , 'truncate': truncate
    , 'input_type': input_type
    , 'api_key_name': api_key_name
    absent on null
    );
end
$func$ language plpgsql immutable security invoker
set search_path to pg_catalog, pg_temp
;

-------------------------------------------------------------------------------
-- _validate_embedding
create or replace function ai._validate_embedding(config jsonb) returns void
@@ -69,6 +100,8 @@ begin
-- ok
when 'ollama' then
-- ok
when 'voyageai' then
-- ok
else
if _implementation is null then
raise exception 'embedding implementation not specified';
48 changes: 48 additions & 0 deletions projects/extension/sql/idempotent/015-voyageai.sql
@@ -0,0 +1,48 @@
-------------------------------------------------------------------------------
-- voyageai_embed
-- generate an embedding from a text value
-- https://docs.voyageai.com/reference/embeddings-api
create or replace function ai.voyageai_embed
( model text
, input_text text
, input_type text default null
, api_key text default null
, api_key_name text default null
) returns @extschema:vector@.vector
as $python$
#ADD-PYTHON-LIB-DIR
import ai.voyageai
import ai.secrets
api_key_resolved = ai.secrets.get_secret(plpy, api_key, api_key_name, ai.voyageai.DEFAULT_KEY_NAME, SD)
for tup in ai.voyageai.embed(model, [input_text], api_key=api_key_resolved, input_type=input_type):
    return tup[1]
$python$
language plpython3u immutable parallel safe security invoker
set search_path to pg_catalog, pg_temp
;

-------------------------------------------------------------------------------
-- voyageai_embed
-- generate embeddings from an array of text values
-- https://docs.voyageai.com/reference/embeddings-api
create or replace function ai.voyageai_embed
( model text
, input_texts text[]
, api_key text default null
, api_key_name text default null
, input_type text default null
) returns table
( "index" int
, embedding @extschema:vector@.vector
)
as $python$
#ADD-PYTHON-LIB-DIR
import ai.voyageai
import ai.secrets
api_key_resolved = ai.secrets.get_secret(plpy, api_key, api_key_name, ai.voyageai.DEFAULT_KEY_NAME, SD)
for tup in ai.voyageai.embed(model, input_texts, api_key=api_key_resolved, input_type=input_type):
    yield tup
$python$
language plpython3u immutable parallel safe security invoker
set search_path to pg_catalog, pg_temp
;