Use ChromaDB for our embeddings database and similarity search #460

granawkins · 2024-01-08T08:18:56Z

No description provided.

granawkins · 2024-01-08T09:12:01Z

mentat/feature_filters/embedding_similarity_filter.py

@@ -19,7 +19,7 @@ async def score(
            self.query, features, self.loading_multiplier
        )
        features_scored = zip(features, sim_scores)
-        return sorted(features_scored, key=lambda x: x[1], reverse=True)
+        return sorted(features_scored, key=lambda x: x[1])


We were using cosine_similarity, which is larger-is-better, but Chroma's default similarity func is 'l2', which is lower-is-better.
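
To illustrate why the sort direction flips, here is a minimal sketch in plain Python. It is not mentat's code, and the vectors and feature names are made up. It ranks the same features once by cosine similarity (larger is better, so `reverse=True`) and once by squared L2 distance (smaller is better, so the default ascending sort). For unit-length vectors, as assumed here, the two orderings agree.

```python
import math


def cosine_similarity(a, b):
    """Larger is better: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2_distance(a, b):
    """Smaller is better: 0.0 means the vectors are identical."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical unit-length embeddings, for illustration only.
query = [1.0, 0.0]
features = {
    "close": [0.8, 0.6],
    "far": [0.0, 1.0],
    "exact": [1.0, 0.0],
}

# Similarity scores: best match first requires a descending sort.
by_similarity = sorted(
    features,
    key=lambda name: cosine_similarity(query, features[name]),
    reverse=True,
)

# Distance scores: best match first is the default ascending sort.
by_distance = sorted(
    features,
    key=lambda name: squared_l2_distance(query, features[name]),
)
```

Keeping `reverse=True` after switching to a distance metric would silently return the worst matches first, which is the bug the diff above avoids.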

tests/conftest.py

tests/license_check.py

mentat/embeddings.py

granawkins · 2024-01-10T07:31:57Z

Now we're passing our llm_api_handler.call_embedding_api directly to ChromaDB. Only complication was that it takes a non-async func.
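
As a rough sketch of what "takes a non-async func" means in practice: the class below is a plain callable that accepts a list of texts and returns one vector per text, which is the general shape ChromaDB expects from an embedding function. `SyncEmbeddingFunction` and `fake_embed` are hypothetical names, the parameter name `input` is an assumption about the chromadb version in use, and no real API call is made.

```python
from typing import Callable, List


def fake_embed(texts: List[str]) -> List[List[float]]:
    """Stand-in for a synchronous embeddings API call (hypothetical)."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


class SyncEmbeddingFunction:
    """Plain (non-async) callable: list of documents in, list of vectors out."""

    def __init__(self, embed: Callable[[List[str]], List[List[float]]]):
        self._embed = embed

    def __call__(self, input: List[str]) -> List[List[float]]:
        # Must return synchronously; an `async def` here would hand the
        # caller a coroutine object instead of the vectors.
        return self._embed(input)


embedding_function = SyncEmbeddingFunction(fake_embed)
vectors = embedding_function(["def foo(): pass", "x = 1"])
```

An object like this would typically be handed to Chroma through the `embedding_function` argument when creating or fetching a collection.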

biobootloader · 2024-01-10T22:14:14Z

mentat/embeddings.py

+    def migrate_old_db(self):
+        """Temporary helper function to migrate sqlite3 to chromadb
+
+        Prior to January 2024, embeddings were fetched directly from the OpenAI API in
+        batches and saved to a db. We're currently using the same embeddings (ada-2) with
+        ChromaDB, so we might as well save the effort of re-fetching them. One drawback
+        is that ChromaDB saves the actual text, while our old schema did not, so migrated
+        records will have an empty documents field. This shouldn't be a problem. If it is,
+        we can just update the 'exists' method to require a non-empty "document" field.
+
+        TODO: erase this method/call after a few months


this is a nice optimization, but I'm not sure the extra complexity is worth saving users a couple cents to rebuild. It might be unlikely, but some bug from migrated records being slightly different could be a big headache and not worth the risk

Ok, makes sense. I was just looking at the size of my sqlite file, but (1) mine is probably bigger than anyone's and (2) it is indeed really fast and cheap to reload.

biobootloader · 2024-01-10T22:23:20Z

mentat/llm_api_handler.py

-    async def call_embedding_api(
+    def call_embedding_api(


hmm, a little too bad Chroma doesn't support async. It's probably not an issue now, but in the future, if we're doing embeddings in the background, it could make mentat unresponsive

Looks like some other people on Discord also want async, but no dice yet: https://discord.com/channels/1073293645303795742/1073293646083919946/1191885640120414218
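
For reference, one common workaround for a blocking library inside an async program is to push the synchronous call onto a worker thread with `asyncio.to_thread` (Python 3.9+). The sketch below uses only the standard library; `blocking_embed` is a hypothetical stand-in for a synchronous Chroma or embeddings call, and this is not something the PR itself does.

```python
import asyncio
import time
from typing import List, Tuple


def blocking_embed(texts: List[str]) -> List[int]:
    """Hypothetical stand-in for a slow, synchronous embedding call."""
    time.sleep(0.2)
    return [len(t) for t in texts]


async def heartbeat(ticks: List[float]) -> None:
    """Simulates the UI loop; records a tick each time it gets to run."""
    for _ in range(5):
        ticks.append(time.monotonic())
        await asyncio.sleep(0.02)


async def main() -> Tuple[List[int], int]:
    ticks: List[float] = []
    # The blocking call runs in a worker thread, so the event loop keeps
    # servicing the heartbeat while the embeddings are computed.
    result, _ = await asyncio.gather(
        asyncio.to_thread(blocking_embed, ["abc", "de"]),
        heartbeat(ticks),
    )
    return result, len(ticks)


result, tick_count = asyncio.run(main())
```

Calling `blocking_embed` directly inside `main` instead would stall the heartbeat for the full duration of the call, which is the unresponsiveness concern raised above.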

tests/license_check.py

granawkins · 2024-01-11T03:48:27Z

I messed up - I was running license_check locally outside a venv, so it was scanning my entire pip library. Chroma only requires 1 skip/UNKNOWN, and some different wordings of the Apache license.

granawkins added 7 commits January 8, 2024 13:13

draft duplicate of embeddings.py

81a25a6

solid

11edb7c

cleanup

26cf5ec

format fix

73bbbae

add UNKNOWN to the list of accepted licenses for chroma-hnswlib

ef5ad21

more with licenses

d354d02

hammer out license stuff

23defc1

granawkins commented Jan 8, 2024

tests/conftest.py (outdated, resolved)

tests/license_check.py (outdated, resolved)

PCSwingle reviewed Jan 8, 2024

tests/license_check.py (outdated, resolved)

PCSwingle reviewed Jan 8, 2024

mentat/embeddings.py (outdated, resolved)

use a custom MentatEmbeddingFunction for chromadb

a5fdfad

granawkins requested a review from PCSwingle January 10, 2024 07:29

biobootloader reviewed Jan 10, 2024

fix license_check

e877b32

granawkins added 2 commits January 12, 2024 09:58

small tweaks to embeddings

f9e00f5

remove migrate function

0866b18

granawkins requested a review from biobootloader January 12, 2024 03:10

biobootloader approved these changes Jan 12, 2024

granawkins merged commit 414cfc3 into main Jan 12, 2024
16 checks passed

gssakash-SxT mentioned this pull request Jan 16, 2024

Use embeddings of files/functions to select context for prompt #99

Closed
