feat: add Voyage AI vectorizer integration #256

Merged: JamesGuthrie merged 7 commits into main from jg/voyageai-vectorizer on Dec 5, 2024

Member

JamesGuthrie commented Nov 26, 2024

To configure a vectorizer with Voyage AI:

SELECT ai.create_vectorizer(
    'my_table'::regclass,
    embedding => ai.embedding_voyageai(
      'voyage-3-lite',
      512
    ),
    -- other parameters...
);

The vectorizer worker authenticates to the Voyage AI API using the API key
specified in the VOYAGE_API_KEY environment variable.
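
As a rough illustration of the environment-variable lookup described above, a worker-side helper could look like the sketch below. The function name `voyage_api_key` and its error message are hypothetical and not taken from the pgai code base; only the variable name `VOYAGE_API_KEY` comes from this PR.

```python
import os
from typing import Mapping


def voyage_api_key(
    env: Mapping[str, str] = os.environ,
    name: str = "VOYAGE_API_KEY",
) -> str:
    """Return the API key found under `name`, or fail with a clear error.

    Hypothetical helper: pgai's actual lookup may differ.
    """
    key = env.get(name, "").strip()
    if not key:
        # Fail early instead of sending an unauthenticated request.
        raise RuntimeError(f"{name} is not set; cannot authenticate to Voyage AI")
    return key
```

Passing the environment as a parameter keeps the lookup easy to exercise in tests without touching the real process environment.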

To get a vector embedding from SQL, use the ai.voyageai_embed
function:

SELECT ai.voyageai_embed('voyage-3-lite', 'text to embed');

JamesGuthrie requested a review from a team as a code owner on November 26, 2024 15:00
JamesGuthrie force-pushed the jg/voyageai-vectorizer branch 4 times, most recently from ce01c57 to baab4d2 on November 26, 2024 15:27
Member Author

Note: this file is 11MB because the call to voyageai.Client().tokenizer(<model name here>) uses huggingface under the hood, which dynamically downloads the tokenizer.

Member Author

There was also an issue with huggingface's caching making the sequence of requests it makes non-deterministic, so then tests would fail.

I fixed that and the "this file is huge" problem by ignoring requests to huggingface.co in VCR. The API is public, so that actually makes sense.
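
A minimal sketch of what "ignoring requests to huggingface.co in VCR" could look like, assuming the tests build their recorder with vcrpy's `ignore_hosts` option. The helper names `vcr_kwargs` and `bypasses_cassette` are hypothetical, and the exact-hostname check is only a simplified stand-in for what the recorder does with ignored hosts.

```python
from urllib.parse import urlparse

# Hosts whose traffic goes to the real network instead of being
# recorded to or replayed from a cassette.
IGNORED_HOSTS = ["huggingface.co"]


def vcr_kwargs() -> dict:
    """Keyword arguments intended for vcr.VCR(...) (hypothetical helper)."""
    return {"ignore_hosts": list(IGNORED_HOSTS)}


def bypasses_cassette(url: str) -> bool:
    """Simplified stand-in: treat a request as ignored on an exact hostname match."""
    return urlparse(url).hostname in IGNORED_HOSTS
```

With a configuration along these lines, the tokenizer download would not be stored in the cassette, while calls to the Voyage AI API would still be recorded and replayed.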

Contributor

That means tests are going to always make calls to that API?

Member Author

Yes. It's the huggingface tokenizer API. It returns an 11MB tokenizer file.

Contributor

> There was also an issue with huggingface's caching making the sequence of requests it makes non-deterministic, so then tests would fail.

Can't we just check the expectations without caring about the order?
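
One way to read the suggestion above is to compare recorded and observed requests as multisets rather than as sequences. The sketch below is only an illustration of that idea; the `Request` tuple shape and the function name are hypothetical and not part of the test suite in this PR.

```python
from collections import Counter
from typing import Iterable, Tuple

# A request reduced to the parts we compare: (HTTP method, URL).
Request = Tuple[str, str]


def same_requests_any_order(
    expected: Iterable[Request], actual: Iterable[Request]
) -> bool:
    """True if both sides contain the same requests the same number of times."""
    return Counter(expected) == Counter(actual)
```

This only helps when the same set of requests is made in a different order. If caching changes which requests are made at all, an order-insensitive comparison would still fail.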

JamesGuthrie force-pushed the jg/voyageai-vectorizer branch from baab4d2 to 81a6ac4 on November 26, 2024 15:34
Member Author

I basically copy-pasted this from docs/openai.md and search-replaced "openai" with "voyageai", which points to the documentation here being quite duplicated.

JamesGuthrie force-pushed the jg/voyageai-vectorizer branch 3 times, most recently from 64fc4cd to 2c9aca5 on November 26, 2024 19:18
JamesGuthrie force-pushed the jg/voyageai-vectorizer branch from 2c9aca5 to a4df434 on November 27, 2024 15:57
Contributor

smoya left a comment

Mainly nits and small suggestions. LGTM otherwise 💯

embedding => ai.embedding_voyageai(
'voyage-3-lite',
512,
truncate => false,
Contributor

I believe we should not set truncate to false in our examples unless we explicitly want to show the behaviour when it is set to false. Otherwise, we might confuse users, who 99.9% of the time will want the default value of true.

Suggested change
truncate => false,

Member Author

done

Collaborator

jgpruitt left a comment

awesome!

JamesGuthrie force-pushed the jg/voyageai-vectorizer branch from b1a2f9f to 30d4566 on December 3, 2024 07:59
JamesGuthrie requested a review from jgpruitt on December 4, 2024 11:53

## Configure pgai for Voyage AI

Most pgai functions require a [Voyage AI API key](https://docs.voyageai.com/docs/api-key-and-installation#authentication-with-api-keys).
Collaborator

nit: We should reword this sentence (and the corresponding ones in the other docs). When it was first authored, we only supported OpenAI, so "most pgai functions" DID require an OpenAI key. This sentence was copy/pasted around. Now, most OpenAI functions require an openai API key, but MOST pgai functions do not. Same goes for VoyageAI and all the other providers.

Member Author

I will take this up as a follow-on task.

JamesGuthrie merged commit 1b56d62 into main on Dec 5, 2024
5 checks passed
JamesGuthrie deleted the jg/voyageai-vectorizer branch on December 5, 2024 08:45

3 participants