[ENH] Upsert and a synthetic data set. #3253
This lays out the scaffolding for upsert.
.iter()
.enumerate()
.filter_map(|(idx, word)| {
    if embedding[idx >> 3] & (1 << (idx & 0x7)) != 0 {
some context on what this does would be helpful
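For context, a minimal sketch of what the quoted condition appears to do, assuming `embedding` is a byte slice used as a packed bitmap with one bit per word: `idx >> 3` selects the byte (`idx / 8`) and `idx & 0x7` selects the bit within that byte (`idx % 8`). The helper names `bit_is_set` and `selected_words` are hypothetical and not part of this PR.

```rust
/// Returns true if bit `idx` is set in a packed bitmap,
/// counting from the least-significant bit of each byte.
/// Assumption: the bitmap is stored as bytes (`u8`).
fn bit_is_set(bitmap: &[u8], idx: usize) -> bool {
    let byte = idx >> 3; // same as idx / 8: which byte holds the bit
    let bit = idx & 0x7; // same as idx % 8: which bit inside that byte
    bitmap[byte] & (1 << bit) != 0
}

/// Keeps only the words whose corresponding bit is set,
/// mirroring the iter/enumerate/filter_map chain in the diff.
fn selected_words<'a>(words: &[&'a str], bitmap: &[u8]) -> Vec<&'a str> {
    words
        .iter()
        .enumerate()
        .filter_map(|(idx, word)| {
            if bit_is_set(bitmap, idx) {
                Some(*word)
            } else {
                None
            }
        })
        .collect()
}
```

With ten words and the bitmap `[0b0000_0101, 0b0000_0010]`, bits 0, 2, and 9 are set, so the first, third, and tenth words are kept.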
pub fn embedding(&self) -> Vec<f32> {
    let mut result = vec![];
    let words = self.content.split_whitespace().collect::<Vec<_>>();
    for word in WORDS.iter() {
what happens if the wordlist changes? or will created collections never be re-used across chroma load invocations?
The words are set in the binary. They will "never" change.
right, they won't change for the same binary
I was asking if there would be a problem if we decided to change the wordlist in a new version or made it longer to simulate a larger embedding space
I'd make that a new data set. Any limitations there?
no, that makes sense 👍
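To illustrate why a changed word list is better treated as a new data set, here is a rough sketch of a word-list-driven embedding. It assumes the loop in the diff emits one `f32` per entry in `WORDS`, with `1.0` when the word occurs in the document and `0.0` otherwise; the loop body is not shown in the diff, so that scoring rule and the four-word list below are hypothetical.

```rust
/// Hypothetical stand-in for the word list compiled into the binary.
const WORDS: &[&str] = &["alpha", "beta", "gamma", "delta"];

/// Produces one f32 per entry in WORDS.
/// Assumption: 1.0 if the word appears in `content`, 0.0 otherwise.
fn embedding(content: &str) -> Vec<f32> {
    let words = content.split_whitespace().collect::<Vec<_>>();
    let mut result = Vec::with_capacity(WORDS.len());
    for word in WORDS.iter() {
        result.push(if words.contains(word) { 1.0 } else { 0.0 });
    }
    result
}
```

Under these assumptions the vector length equals `WORDS.len()` and each position is tied to one specific word, so lengthening or reordering the list would change both the dimension and the meaning of every stored vector. That is consistent with the conclusion above to give a different word list its own data set.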
This introduces the upsert call and a synthetic data set capable of upsert.