Use ChromaDB for our embeddings database and similarity search #460
@@ -19,7 +19,7 @@ async def score(
 self.query, features, self.loading_multiplier
 )
 features_scored = zip(features, sim_scores)
-return sorted(features_scored, key=lambda x: x[1], reverse=True)
+return sorted(features_scored, key=lambda x: x[1])
We were using cosine_similarity, which is larger-is-better, but Chroma's default similarity func is 'l2', which is lower-is-better.
Now we're passing our
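To illustrate the sort-direction point in the comment above, here is a minimal sketch with toy vectors. The helper names and data are made up for illustration and are not taken from mentat or Chroma. It assumes the distance is squared Euclidean, which ranks items the same way as plain Euclidean distance.

```python
import math


def cosine_similarity(a, b):
    """Larger is better: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def l2_distance(a, b):
    """Lower is better: squared Euclidean distance (illustrative helper)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy data, purely hypothetical.
query = [1.0, 0.0]
features = {"close": [0.9, 0.1], "middle": [0.5, 0.5], "far": [0.0, 1.0]}

# A similarity score needs reverse=True to put the best match first...
by_similarity = sorted(
    features, key=lambda name: cosine_similarity(query, features[name]), reverse=True
)
# ...while a distance score must be sorted ascending (no reverse).
by_distance = sorted(features, key=lambda name: l2_distance(query, features[name]))

print(by_similarity)
print(by_distance)
```

For this toy data both orderings put the best match first. Keeping `reverse=True` with a distance score would instead rank the worst match first.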
mentat/embeddings.py
Outdated
def migrate_old_db(self):
    """Temporary helper function to migrate sqlite3 to chromadb

    Prior to January 2024, embeddings were fetched directly from the OpenAI API in
    batches and saved to a db. We're currently using the same embeddings (ada-2) with
    ChromaDB, so we might as well save the effort of re-fetching them. One drawback
    is that ChromaDB saves the actual text, while our old schema did not, so migrated
    records will have an empty documents field. This shouldn't be a problem. If it is,
    we can just update the 'exists' method to require a non-empty "document" field.

    TODO: erase this method/call after a few months
This is a nice optimization, but I'm not sure the extra complexity is worth saving users a couple of cents to rebuild. It might be unlikely, but a bug from migrated records being slightly different could be a big headache, and it's not worth the risk.
OK, makes sense. I was just looking at the size of my sqlite file, but (1) mine is probably bigger than anyone's and (2) it is indeed really fast and cheap to reload.
-async def call_embedding_api(
+def call_embedding_api(
Hmm, it's a little too bad Chroma doesn't support async. It's probably not an issue now, but in the future, if we're doing embeddings in the background, it could make mentat unresponsive.
Looks like some other people on Discord also want async, but no dice yet: https://discord.com/channels/1073293645303795742/1073293646083919946/1191885640120414218
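One possible workaround for the responsiveness concern, sketched under assumptions: a blocking call can be pushed onto a worker thread with `asyncio.to_thread` (Python 3.9+) so the event loop keeps running. `call_embedding_api_sync` below is a hypothetical stand-in that only sleeps; it is not mentat's or Chroma's actual function.

```python
import asyncio
import time


def call_embedding_api_sync(texts):
    """Hypothetical blocking call; sleeps to simulate slow embedding work."""
    time.sleep(0.05)
    return [[float(len(text))] for text in texts]


async def call_embedding_api(texts):
    # Run the blocking function in a worker thread so the loop stays free.
    return await asyncio.to_thread(call_embedding_api_sync, texts)


async def main():
    ticks = 0

    async def heartbeat():
        # Stands in for UI work that should keep running during embedding.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.01)

    heartbeat_task = asyncio.create_task(heartbeat())
    embeddings = await call_embedding_api(["a", "bcd"])
    heartbeat_task.cancel()
    return embeddings, ticks


embeddings, ticks = asyncio.run(main())
print(embeddings, ticks)
```

If the heartbeat counter advances while the embedding call is in flight, the loop was not blocked. This moves the wait off the event loop but does not make the underlying library asynchronous, and whether the client is safe to call from a worker thread would need to be checked.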
I messed up - I was running license_check locally outside a venv, so it was scanning my entire pip library. Chroma only requires 1 skip/UNKNOWN, and some different wordings of the Apache license.