feat: add Voyage AI vectorizer integration #256

JamesGuthrie · 2024-11-26T15:00:37Z

To configure a vectorizer with Voyage AI:

```sql
SELECT ai.create_vectorizer(
    'my_table'::regclass,
    embedding => ai.embedding_voyageai(
      'voyage-3-lite',
      512
    ),
    -- other parameters...
);
```

The vectorizer worker connects to the Voyage AI API with the API key
specified in the `VOYAGE_API_KEY` environment variable.
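
As a rough illustration of that lookup, the sketch below reads the key from the environment and fails early when it is missing. The helper name `voyage_api_key` is hypothetical and is not taken from the worker's code; only the `VOYAGE_API_KEY` variable name comes from this PR.

```python
import os


def voyage_api_key(environ=os.environ) -> str:
    """Hypothetical helper: return the Voyage AI key or fail with a clear message."""
    # VOYAGE_API_KEY is the variable named in this PR; everything else is assumed.
    key = environ.get("VOYAGE_API_KEY", "").strip()
    if not key:
        raise RuntimeError("VOYAGE_API_KEY is not set")
    return key
```

Passing a plain dict instead of `os.environ` makes the lookup easy to exercise in a test without touching the real environment.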

To get a vector embedding from SQL, use the `ai.voyageai_embed` function:

```sql
SELECT ai.voyageai_embed('voyage-3-lite', 'text to embed');
```

JamesGuthrie · 2024-11-26T15:32:33Z

.../pgai/tests/vectorizer/cassettes/voyageai-character_text_splitter-too-large-chunk_value.yaml

Note: this file is 11MB because the call to voyageai.Client().tokenizer(<model name here>) uses huggingface under the hood, which dynamically downloads the tokenizer.

There was also an issue with huggingface's caching making the sequence of requests it makes non-deterministic, so then tests would fail.

I fixed that and the "this file is huge" problem by ignoring requests to huggingface.co in VCR. The API is public, so that actually makes sense.
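
A minimal sketch of what ignoring that host could look like, assuming the tests pass options through to vcrpy, whose `ignore_hosts` option skips recording and playback for the listed hosts. The `VCR_CONFIG` name and the `is_ignored` helper are hypothetical and only illustrate the effect; the actual test setup in this PR may be wired differently.

```python
from urllib.parse import urlsplit

# Assumed to be handed to vcrpy, for example via a pytest fixture that returns this dict.
# `ignore_hosts` is a vcrpy option; the variable name here is illustrative.
VCR_CONFIG = {"ignore_hosts": ["huggingface.co"]}


def is_ignored(url: str, config: dict = VCR_CONFIG) -> bool:
    """Hypothetical helper: True if the request host is excluded from the cassette."""
    # Exact host comparison, so other hosts are still recorded and replayed.
    return urlsplit(url).hostname in config["ignore_hosts"]
```

With this configuration, the tokenizer download goes to the network on every run, while calls to other hosts are still served from the cassette.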

Does that mean the tests will always make calls to that API?

Yes. It's the huggingface tokenizer API. It returns an 11MB tokenizer file.

> There was also an issue with huggingface's caching making the sequence of requests it makes non-deterministic, so then tests would fail.

Can't we just set expectations without caring about the order?

JamesGuthrie · 2024-11-26T15:35:47Z

docs/voyageai.md

I basically copy-pasted this from docs/openai.md and search-replaced "openai" with "voyageai", which points to the documentation here being quite duplicated.

projects/extension/sql/idempotent/008-embedding.sql

smoya

Mainly nits and small suggestions. LGTM otherwise 💯

docs/vectorizer-api-reference.md

docs/voyageai.md

docs/vectorizer-api-reference.md

projects/extension/ai/voyageai.py

smoya · 2024-12-02T10:38:21Z

docs/vectorizer-api-reference.md

+    embedding => ai.embedding_voyageai(
+      'voyage-3-lite',
+      512,
+      truncate => false,


I believe we should not set `truncate` to false in all of our examples unless we explicitly want to show the behaviour when it is set to false. Otherwise, we might confuse users, who 99.9% of the time will want this to be true, the default.

Suggested change: remove the `truncate => false,` line.

jgpruitt

awesome!

projects/extension/sql/idempotent/015-voyageai.sql

projects/extension/tests/test_voyageai.py

projects/pgai/pgai/vectorizer/embeddings.py

jgpruitt · 2024-12-04T15:17:18Z

docs/voyageai.md

+
+## Configure pgai for Voyage AI
+
+Most pgai functions require a [Voyage AI API key](https://docs.voyageai.com/docs/api-key-and-installation#authentication-with-api-keys).


nit: We should reword this sentence (and the corresponding ones in the other docs). When it was first authored, we only supported OpenAI, so "most pgai functions" DID require an OpenAI key. This sentence was copy/pasted around. Now, most OpenAI functions require an openai API key, but MOST pgai functions do not. Same goes for VoyageAI and all the other providers.

I will take this up as a follow-on task.

JamesGuthrie requested a review from a team as a code owner November 26, 2024 15:00

JamesGuthrie force-pushed the jg/voyageai-vectorizer branch 4 times, most recently from ce01c57 to baab4d2 Compare November 26, 2024 15:27

JamesGuthrie force-pushed the jg/voyageai-vectorizer branch from baab4d2 to 81a6ac4 Compare November 26, 2024 15:34

JamesGuthrie force-pushed the jg/voyageai-vectorizer branch 3 times, most recently from 64fc4cd to 2c9aca5 Compare November 26, 2024 19:18

JamesGuthrie force-pushed the jg/voyageai-vectorizer branch from 2c9aca5 to a4df434 Compare November 27, 2024 15:57

smoya approved these changes Nov 27, 2024

JamesGuthrie and others added 3 commits November 28, 2024 16:55

chore: address review feedback

27ea0be

fix: add ApiKeyMixin to VoyageAI

c36c8de

fix: pass api key to voyageai API

5113516

smoya reviewed Dec 2, 2024

jgpruitt requested changes Dec 2, 2024

JamesGuthrie added 3 commits December 3, 2024 08:49

chore: address review feedback

0a845a6

Merge remote-tracking branch 'origin/main' into jg/voyageai-vectorizer

5fe98ae

chore: regenerate test outputs

30d4566

JamesGuthrie force-pushed the jg/voyageai-vectorizer branch from b1a2f9f to 30d4566 Compare December 3, 2024 07:59

JamesGuthrie requested a review from jgpruitt December 4, 2024 11:53

jgpruitt approved these changes Dec 4, 2024

JamesGuthrie merged commit 1b56d62 into main Dec 5, 2024
5 checks passed

JamesGuthrie deleted the jg/voyageai-vectorizer branch December 5, 2024 08:45

github-actions bot mentioned this pull request Dec 5, 2024

chore(main): release pgai 0.3.0 #277

Merged
