Integrate Weaviate as another DocumentStore #957 #1064
Cross tagging issue #957 |
@venuraja79 Thank you for creating PR |
@lalitpagaria @tholor Just trying to get some attention :) - Once we decide on the above design aspects and also an initial review of the source code, we can update the PR with tests & next improved version. |
Hey @venuraja79,
First of all: Thanks a lot for working on this! I think it will be a great addition to Haystack - especially as Weaviate allows vector search with scalar filters which FAISS and Milvus don't allow for (yet).
Sorry for my late reply here - was a crazy week and wanted to first check weaviate out as I am not familiar with their design yet.
However, here are a few comments / observations / questions already, so as not to stall the process and, above all, to gain clarity on the major design decisions:
update embeddings - My pov is that update embeddings is not required because
weaviate will automatically generate embedding if it isn't passed. Is this OK?
This is actually the core question I have. I only had a quick look at weaviate, but I understand that they have their own module tied to the actual "Document Store", where the user can pick a model and weaviate takes care of embedding creation.
This is fundamentally different from all other vector stores in Haystack. Just using their models is not really an option, as the choice seems rather limited there and we want to be able to easily swap different document stores and combine them with our retrievers in Haystack. I see two directions:
a) We don't use their models and just pass embeddings / update them as with the other document stores.
=> Seems to be the cleaner solution to me as we have full compatibility with Haystack models, the API will be quite similar to the others and we have control over model configuration (GPU, preprocessing, distribution ...). Not sure though if this plays nicely with weaviate and if there are any nice features in weaviate that we wouldn't be able to use in this case.
b) We allow both: just passing embeddings from Haystack or running a weaviate model
=> This might be quite complex to implement in a user-friendly way. We would need to expose the model choice somehow in the documentstore and add a "dummy WeaviateRetriever". This will be fundamentally different to the other documentstores and could add confusion further down the line in pipelines.
write documents - if there is a failure in a batch, we will log an error message.
The user will debug and figure out how to format the documents correctly. Is this OK?
This behavior will change soon in #1069. If the implementation is ready before this PR gets merged, we can update it in here, too. For now, let's not worry about it.
To implement filters in query, GraphQL is supported. Can we allow the user to provide the filter as a text
instead of a Dict? An example below -
where_filter = {
"path": ["wordCount"],
"operator": "GreaterThan",
"valueInt": 1000
}
We will need the user to pass 'path', 'operator' and both 'value' & 'valueType'
in case we want to construct the filter text on the fly. Any suggestions please?
I think it will be cleaner to have the same syntax as in the other document stores and then convert the dict to weaviate format under the hood. Again, this will simplify compatibility of different pipelines so that users can easily do prototyping in A and deploy it in B.
FYI We plan to improve the syntax and functionality of filtering in the other document stores (e.g. to support date ranges, numerical comparisons etc). I don't think it makes sense to mix it with this PR though. It would just blow up the complexity. So let's implement the current filter style here and refactor it once we have settled on the new filter syntax in Haystack.
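The "convert the dict to weaviate format under the hood" idea could look roughly like the sketch below. It is only an illustration: the function name is hypothetical, the input shape (`{"field": ["value1", "value2"]}`) is an assumption about the current Haystack filter style, and the `valueString` key and `And`/`Or` operand nesting are assumptions modeled on the `where_filter` example quoted above.

```python
from typing import Any, Dict, List


def haystack_filters_to_weaviate_where(filters: Dict[str, List[str]]) -> Dict[str, Any]:
    """Convert {"field": ["v1", "v2"]} into a Weaviate-style where filter (hypothetical helper)."""
    field_clauses: List[Dict[str, Any]] = []
    for field, values in filters.items():
        if not values:
            raise ValueError(f"no values given for filter field '{field}'")
        # One Equal clause per accepted value of this field.
        value_clauses = [
            {"path": [field], "operator": "Equal", "valueString": value}
            for value in values
        ]
        if len(value_clauses) == 1:
            field_clauses.append(value_clauses[0])
        else:
            # Several values for one field: any of them may match.
            field_clauses.append({"operator": "Or", "operands": value_clauses})
    if not field_clauses:
        raise ValueError("filters must contain at least one field")
    if len(field_clauses) == 1:
        return field_clauses[0]
    # Several fields: all of them must match.
    return {"operator": "And", "operands": field_clauses}
```

With this shape, users keep writing filters the same way as for the other document stores, and only the store knows about the Weaviate-specific syntax.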
haystack/document_store/weaviate.py
def __init__(
    self,
    weaviate_url: str = "http://localhost:8080",
Let's keep naming of args as similar to other document stores as possible.
In this case, I would suggest keeping it similar to elasticsearch and split it into host + port.
(I know we use sql_url in FAISS. The reasoning back then was that a bare host would confuse people, as they would probably look for a FAISS host in a FAISSDocumentstore.)
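A minimal sketch of the host + port split, assuming the store simply joins the two values into the URL it hands to the Weaviate client. The helper name and the defaults are hypothetical, chosen to match the `http://localhost:8080` default in the diff above.

```python
def build_weaviate_url(host: str = "http://localhost", port: int = 8080) -> str:
    """Join host and port into a single URL (hypothetical helper)."""
    # Strip a trailing slash so "http://localhost/" does not yield "http://localhost/:8080".
    return f"{host.rstrip('/')}:{port}"
```

The constructor would then accept `host` and `port` like the Elasticsearch store and build the URL internally.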
haystack/document_store/weaviate.py
See https://www.semi.technology/developers/weaviate/current/data-schema/schema-configuration.html
:param module_name: Vectorization module to convert data into vectors. Default is "text2vec-transformers"
For more details, see https://www.semi.technology/developers/weaviate/current/modules/
:param update_existing_documents: Whether to update any existing documents with the same ID when adding
As you saw in #1069, this arg is going to change soon.
Thank you @tholor for the feedback. #2. Filter Implementation: Another challenge is that Elasticsearch has a _meta field to store any additional custom (dynamic?) key-value pairs. However, Weaviate doesn't support anything like that by default because of the schema requirement. As a hack, I converted all the meta into a string during write and converted it back to a dict while reading. But this will not help with filtering by any metadata. I am planning to do some experiments with a meta class or to raise a question in the Weaviate forum.
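The string round trip described above might look like this. It is a sketch only, assuming JSON as the serialization format (the comment does not say which format was used); both function names are hypothetical.

```python
import json
from typing import Any, Dict


def meta_to_property(meta: Dict[str, Any]) -> str:
    """Serialize a document's meta dict into one string property on write."""
    return json.dumps(meta, sort_keys=True)


def property_to_meta(raw: str) -> Dict[str, Any]:
    """Parse the stored string back into a dict on read."""
    return json.loads(raw) if raw else {}
```

Because the whole dict ends up in a single opaque string property, Weaviate cannot filter on individual meta keys, which is the limitation noted above.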
@tholor, @lalitpagaria Please see this issue that I created in Weaviate repo for creating dynamic properties. We can make a decision on filter implementation based on the direction here. Thank you! |
Additional updates:
I think we have a few options - your views/suggestions please! Also, waiting for feedback from Weaviate team. Schema for the meta class:
|
@venuraja79 Thanks for the update.
This will save us from adding hackish code. But it will add another embedding store (alongside FAISS and Milvus) to Haystack that does not support filtering. @tholor WDYT?
Hi all – I just noticed this PR (thanks for mentioning it on the Weaviate Slack channel @venuraja79) RE: @tholor
I would agree that this would add the most value to Haystack because Weaviate is created with a scalable “vector-first” approach (also see the feature overview). We also have some internal benchmarks ready that you are probably going to like 😊 When it comes to horizontal scalability, we will be releasing a detailed architecture roadmap for this very soon on our website (I’ll leave a link here when it's live), but I can already share that the ETA is in Q3 this year. Long story short, it seems the timing is perfect. Around the time this feature is finished, horizontal scalability is around the corner in Weaviate.
Great to see this PR develop and Weaviate being used! As you've already identified absolutely correctly, you can use Weaviate as a "pure vector storage and search" or optionally use modules to include the encoding journey as well. I'm also really happy to read that the combining of scalar and vector search is explicitly highlighted, as that's one of the things that make Weaviate really unique in my opinion. The filtering happens pre-search by the way, so no messing around with very large The planned auto-schema feature has also been mentioned in this thread quite a lot. It's currently not in short-term prioritization yet, but it's often asked for and we'll make sure that it gets prioritized soon. I can let you know once we have an ETA for it. If you want to get started adding metadata before that point already, you can also simply cache the schema on your side and add a new property every time you encounter a previously unseen one. (Essentially this is what we're going to do in the feature, too). See this Slack post for a more detailed outline of this workaround.
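The "cache the schema on your side" workaround could be sketched as follows. This is an illustration under assumptions: the `SchemaCache` class is hypothetical, the actual client call that creates a property is injected as a callable rather than named, and the `{"name": ..., "dataType": ["string"]}` property shape is an assumption about Weaviate's schema format.

```python
from typing import Any, Callable, Dict, Iterable, Set


class SchemaCache:
    """Tracks known property names and creates unseen ones on demand (hypothetical helper)."""

    def __init__(
        self,
        class_name: str,
        create_property: Callable[[str, Dict[str, Any]], None],
        known: Iterable[str] = (),
    ) -> None:
        self.class_name = class_name
        # Injected callable that performs the real schema update in Weaviate.
        self._create_property = create_property
        self._known: Set[str] = set(known)

    def ensure_properties(self, meta: Dict[str, Any]) -> None:
        """Create a schema property for every meta key not seen before."""
        for name in meta:
            if name not in self._known:
                # Assumption: every meta value is stored as a string property.
                self._create_property(self.class_name, {"name": name, "dataType": ["string"]})
                self._known.add(name)
```

Calling `ensure_properties(doc.meta)` before each write keeps the schema in step with the metadata, so only the first occurrence of a key costs a schema update.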
Thanks for the pointers @etiennedi @bobvanluijt ! @venuraja79 I see two ways forward regarding meta data filtering:
Both ways are fine for me. 1) is obviously more elegant but also more work. I leave the choice up to you. FYI: we will soon also work on a better filtering syntax to allow for more data types in all documentstores (#625). Right now we only support string filtering end-to-end. |
Thank you all for the help! |
Awesome! I think then we are on the last mile of this PR :) |
@tholor @lalitpagaria This is mostly code complete except for a couple of outstanding issues. please review when you get a chance. Few points to note: Limitations
Update Embeddings: Known Issue: |
@venuraja79 Tests are failing because of missing
|
Hi @venuraja79, some answers inlined:
we want to open this up, but unfortunately it hasn't made the cut for
This is no longer the case in
Thanks for raising this, we'll think about what it would take to allow arbitrary url-safe keys. In theory there should be nothing in the way of it, but it would definitely be a larger change, as for now Weaviate relies on UUID keys quite a bit. So, only expect changes in the long-term here.
This will also be loosened once we loosen up the allowed characters in property names.
Do I understand correctly that these are limitations in this PR and not in Weaviate? Because Weaviate's filters should be quite flexible and allow a lot more than this.
I think I need a bit more context on this one. But as mentioned above, the requirement to pass an embedding will go away in
If you aren't using a vectorizer, Weaviate does not know how text can be embedded as vectors. The idea behind the vectorizer system in Weaviate is that it is used both at import time as well as at search time. This makes sure the same Hope I could help with the Weaviate-related questions. |
Thank you @etiennedi for your quick response. Most of the comments here are related to this PR except the last question, please see inline -
Yes, you are correct! This is a limitation for this PR as we want to keep this filter feature simple to start with.
This is a PR feature that updates embeddings only for the documents that didn't have a vector associated with them while indexing. Your answer already helps here.
Yes, thank you! This is our understanding as well. Because weaviate is primarily a vector search engine, we only allow querying with an embedding. I hope this is OK for now. Anyway, waiting for the team to confirm.
Thanks for the reply.
You can also perform inverted index searches in Weaviate using the I fully agree that it makes most sense to use Weaviate for vector search, however. :)
This is in my (biased) opinion one of the biggest strengths of Weaviate. It allows for scaling things together that live together, i.e. document store, inverted index and vector index always grow together. And in a multi-node setup cross-node traffic is minimized. Also it allows for efficient pre-filtering when combining vector and scalar search. :) |
Additionally, links: |
Looking really good!
Left one minor comment about test markers.
We should also adjust the description in the docstring slightly - I can take care of this.
Should be ready to merge very soon 🎉
Yes, that's completely correct. FAISS and milvus are pure "vector stores" and therefore we use SQL in our implementation to store docs and metadata.
The focus of weaviate is clearly on vector search. We can optionally also support |
The query() method supports keyword matching via filters. For example, {"text" : "some text"} will return all documents that contain "some text". However, if the user passes only query text (as a method parameter), we raise a NotImplementedError. This is because there can be many properties to search and it is difficult to construct the query (for now). In short, the user has to provide both the property and the keywords as a filter. In addition, we also allow the user to pass a custom GraphQL query, which can be very powerful.
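The argument handling described above can be pictured with a small sketch. The function and parameter names are hypothetical, and the precedence given to a custom GraphQL query is an assumption; it mirrors the described behavior rather than reproducing the PR's code.

```python
from typing import Any, Dict, Optional


def choose_query_mode(
    query: Optional[str],
    filters: Optional[Dict[str, Any]],
    custom_query: Optional[str],
) -> str:
    """Decide how a keyword query would be executed (hypothetical helper)."""
    if custom_query:
        # A user-supplied GraphQL query is passed through as-is.
        return "custom"
    if filters:
        # Property and keywords are both given, so a where filter can be built.
        return "filters"
    if query:
        # Plain query text alone does not say which properties to search.
        raise NotImplementedError(
            "Pass the property and keywords as a filter, or provide a custom GraphQL query."
        )
    raise ValueError("Provide filters or a custom GraphQL query.")
```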
.github/workflows/ci.yml
@@ -76,6 +76,9 @@ jobs:
    - name: Run Milvus
      run: docker run -d -p 19530:19530 -p 19121:19121 milvusdb/milvus:1.1.0-cpu-d050721-5e559c

    - name: Run Weaviate
      run: docker run -d -p 8080:8080 --name haystack_test_weaviate --env AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED='true' --env PERSISTENCE_DATA_PATH='/var/lib/weaviate' semitechnologies/weaviate:1.3.0
Suggestion to set Weaviate to version 1.4.0
Hi @venuraja79 – Weaviate 1.4.0 is just released so you might want to use that as mentioned here. |
Looking good to me. Let's update to 1.4.0 and merge once CI is passing
Proposed changes:
Status (please check what you already did):
Discussion Points:
update embeddings - My pov is that update embeddings is not required because
weaviate will automatically generate embedding if it isn't passed. Is this OK?
write documents - if there is a failure in a batch, we will log an error message.
The user will debug and figure out how to format the documents correctly. Is this OK?
To implement filters in query, GraphQL is supported. Can we allow the user to provide the filter as a text
instead of a Dict? An example below -
where_filter = {
"path": ["wordCount"],
"operator": "GreaterThan",
"valueInt": 1000
}
We will need the user to pass 'path', 'operator' and both 'value' & 'valueType'
in case we want to construct the filter text on the fly. Any suggestions please?
Open items:
code:
tests
ci/cd
-add weaviate dockers in ci.yml