
Smart Collections FR: Pinecone Adapter #4

Open
arminta7 opened this issue Dec 27, 2022 · 8 comments
Labels
enhancement New feature or request

Comments

@arminta7

Would it be possible to have the option to store the embeddings in Pinecone?

@arminta7 arminta7 changed the title FR: Store embedding sun Pinecone FR: Store embeddings in Pinecone Dec 27, 2022
@brianpetro
Owner

brianpetro commented Dec 27, 2022

It's possible.

Integrating Pinecone would require:

  • reformatting the embeddings to the "Pinecone vectors array format"
    • storing vector.metadata (file.path, file.mtime, etc.)
  • a garbage collection process to keep the embeddings stored in Pinecone up-to-date
  • replacing the current cosine similarity calculation with a query to the Pinecone API
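The first two steps could be sketched roughly like this (a hypothetical sketch: the embeddings.json record shape and helper names are illustrative assumptions, not the plugin's actual API):

```javascript
// Hypothetical sketch: converting locally stored embeddings into the
// Pinecone "vectors array" upsert format, with note metadata attached.
// Assumes embeddings.json looks like:
// { "<file.path>": { vec: number[], mtime: number }, ... }
function toPineconeVectors(embeddings) {
  return Object.entries(embeddings).map(([path, entry]) => ({
    id: path,                      // a stable id per note
    values: entry.vec,             // the embedding itself
    metadata: {
      "file.path": path,           // maps results back to notes
      "file.mtime": entry.mtime,   // lets garbage collection spot stale vectors
    },
  }));
}

// Garbage collection sketch: any remote record whose mtime is older than
// the local copy (or that has no local copy) is stale and should be
// re-upserted or deleted.
function findStaleIds(localEmbeddings, remoteRecords) {
  return remoteRecords
    .filter((r) => {
      const local = localEmbeddings[r.metadata["file.path"]];
      return !local || local.mtime > r.metadata["file.mtime"];
    })
    .map((r) => r.id);
}
```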

I would consider doing this mainly because of performance, but calculating cosine similarity on my vault containing ~1,500 notes runs pretty smoothly at the moment.
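For reference, the in-memory calculation being replaced is a brute-force cosine similarity over every stored vector, something like this generic sketch (not the plugin's exact code):

```javascript
// Generic cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force nearest notes: O(vault size) per query, which is why very
// large vaults eventually benefit from a hosted index like Pinecone.
function nearest(query, embeddings, topK = 5) {
  return Object.entries(embeddings)
    .map(([path, entry]) => ({ path, score: cosineSimilarity(query, entry.vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

At ~1,500 notes the linear scan is cheap; the cost only becomes noticeable as the vault grows.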

Is the performance why you are asking about this? Or is there another reason?

Thanks!

@brianpetro brianpetro added the enhancement New feature or request label Dec 27, 2022
@arminta7
Author

My vault is about 20k notes. Part of it is performance. The other is being able to reuse the embeddings for other things rather than paying for the process multiple times.

@brianpetro
Owner

20K is significantly more notes than I have tested with myself. Your embeddings.json file must be almost 2GB! And that's all pulled into memory, which would likely cause performance degradation on an average computer.

Regarding reusing the embeddings, the main issue with that is synchronization—the metadata limit for Pinecone is 10kb. So smaller notes will fit completely into the metadata at this limit, but not the larger notes (>~10,000 characters).

  1. One way to get around this is to store references to the notes in the metadata (e.g., file.path), but that requires that the "secondary" applications have access to your notes' file system.

  2. Another possibility is to limit all the embeddings to <10kb, which would decrease the maximum "chunk" sizes to about a third of what they are now (~8,000 tokens ~= 30,000 characters ~= 30kb). This way, you could avoid accessing your notes file system directly from "secondary" applications.
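The two options boil down to a size check at indexing time. A minimal sketch, assuming the 10kb metadata limit cited above (names here are illustrative, not the plugin's API):

```javascript
const METADATA_LIMIT_BYTES = 10 * 1024; // Pinecone metadata limit cited above

// Decide whether a note's text can ride along in metadata (option 2) or
// whether only a reference back to the file can be stored (option 1).
function buildMetadata(path, mtime, text) {
  const full = { "file.path": path, "file.mtime": mtime, text };
  const size = new TextEncoder().encode(JSON.stringify(full)).length;
  if (size <= METADATA_LIMIT_BYTES) return full;     // option 2: self-contained
  return { "file.path": path, "file.mtime": mtime }; // option 1: reference only
}
```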

I feel option 2 goes against the Obsidian.md ethos of "owning your data" since all your notes would be hosted in the cloud.

Option 1 has its drawbacks, too. "Secondary" applications outside of Obsidian would be more difficult to develop. However, other Obsidian plugins (e.g., Smart Completions) will have no problem reusing the embeddings stored within Obsidian. So it depends on your use case.

What is the average number of notes in an Obsidian vault? If it's much more than what I've anticipated (<1,000 notes), then I think option 1 could make sense for performance reasons. That said, performance has been an afterthought up to this point. There is still likely a lot of low-hanging fruit in terms of performance that wouldn't require an additional API service provider.

I'm thinking out loud here, so any feedback would be appreciated.

Thanks!

@arminta7
Author

Yes, the file is... unwieldy lol.

I don't know too much about the specifics of the different options. I know there's also something like Weaviate? Not sure if that's better. It is open source, right?

Just checked and my largest note is ~4 million characters. And plenty of others over 10k.

As far as the average number of notes? I have no idea. I'm probably on the larger end, not the largest I've heard. I'm sure there are plenty over 1,000 notes.

@brianpetro
Owner

Thanks for suggesting Weaviate. It's pretty comparable to Pinecone. Hosting your own instance looks non-trivial and may not be easily packaged into the plugin. I'll have to look into it more before saying so for sure.

It needs further research, but there should be a relatively simple way to manage the vector calculations better. The storage file could be split based on a cosine similarity clustering algorithm, and the calculations could then be prioritized by nearest cluster. I'm surprised I haven't seen anything like this, but I haven't looked much.
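The clustering idea could look something like this crude nearest-centroid sketch (a real implementation would produce the centroids with k-means or similar; all names here are hypothetical):

```javascript
// Cosine similarity helper: dot(a, b) / (|a| * |b|).
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Index of the centroid most similar to `vec`.
function nearestCentroid(vec, centroids) {
  let best = 0, bestScore = -Infinity;
  centroids.forEach((c, i) => {
    const s = cosine(vec, c);
    if (s > bestScore) { bestScore = s; best = i; }
  });
  return best;
}

// Split one big embeddings map into per-cluster maps, so each query can
// load and scan only the nearest cluster instead of the whole vault.
function partition(embeddings, centroids) {
  const clusters = centroids.map(() => ({}));
  for (const [path, entry] of Object.entries(embeddings)) {
    clusters[nearestCentroid(entry.vec, centroids)][path] = entry;
  }
  return clusters;
}
```

A query would then compare against the centroids first and search the winning cluster's file, trading a little recall for much less work per query.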

I'll continue to look into this. Thanks for the feedback.

@vguillet

I second this. Being able to pull embeddings from Pinecone would potentially allow leveraging purpose-made embedding tools capable of ingesting a large variety of files (PowerPoints/PDFs, for example). This could in turn unlock better query responses while also keeping a single base embedding repository shared across all tools leveraging personal data!

@brianpetro
Owner

@vguillet I see you already commented on brianpetro/obsidian-smart-connections#27 , thanks!

It's a similar idea. I still think a Pinecone/Weaviate integration will happen. But I need to learn more about how people are using them.

@brianpetro brianpetro transferred this issue from brianpetro/obsidian-smart-connections Jun 27, 2024
@brianpetro brianpetro changed the title FR: Store embeddings in Pinecone Smart Collections FR: Pinecone Adapter Jun 27, 2024
@brianpetro
Owner

Recorded response: https://youtu.be/J5ARc_91fzs
