Add PineconeDocumentStore
#2254
…ert, query, delete)
…-doc-store Haystack master changes merged into Pinecone doc store branch
…into pinecone-doc-store merge origin
Great work so far! Before merging, we would definitely need to add Pinecone to our document store tests. Furthermore, it would be nice to make use of the newly introduced filter_utils.py for converting the filters, as this makes maintenance of filters across all document stores easier.
We also need to make sure that the typing is compliant with mypy.
haystack/document_stores/pinecone.py
Outdated
environment: str = "us-west1-gcp",
sql_url: str = "sqlite:///pinecone_document_store.db",
pinecone_index: Optional["pinecone.Index"] = None,
vector_dim: int = 768,
We deprecated vector_dim for FAISS and Milvus and use embedding_dim instead.
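A minimal sketch of how such a rename could be handled, assuming the deprecated vector_dim argument is kept for a transition period and mapped onto embedding_dim with a warning. The class name ExampleDocumentStore and the warning text are hypothetical and not taken from this PR.

```python
import warnings
from typing import Optional


class ExampleDocumentStore:
    """Hypothetical stand-in for a document store; not the class from this PR."""

    def __init__(self, embedding_dim: int = 768, vector_dim: Optional[int] = None):
        if vector_dim is not None:
            # Keep accepting the old name for now, but steer callers to the new one.
            warnings.warn(
                "The 'vector_dim' parameter is deprecated, use 'embedding_dim' instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            embedding_dim = vector_dim
        self.embedding_dim = embedding_dim
```

Callers passing only embedding_dim see no warning; callers still passing vector_dim keep working but are told to migrate.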
updated
haystack/document_stores/pinecone.py
Outdated
def _convert_pinecone_result_to_document(self, result: dict, return_embedding: bool) -> Document:
"""
Convert Pinecone result dict into haystack document object. This is more involved because
weaviate search result dict varies between get and query interfaces.
I think we should remove mentions of Weaviate here.
updated
haystack/document_stores/pinecone.py
Outdated
)
return document

def _validate_params_load_from_disk(self, sig: Signature, locals: dict, kwargs: dict):
I think this method is not needed.
removed
haystack/document_stores/pinecone.py
Outdated
)
if len(document_objects) > 0:
add_vectors = False if document_objects[0].embedding is None else True
# I don't think below is required
Not sure what you are referring to with this comment.
I think I had already removed the code but missed the comment; I have now removed the comment as well.
haystack/document_stores/pinecone.py
Outdated
self.index: self.embedding_field,
}

def _build_filter_clause(self, filters: Dict[str, Union[str, int, float, bool, list]]) -> dict:
For uniformity with the remaining document stores, it would be better to add a convert_to_pinecone method to the classes in filter_utils.py. That way, we can simply call LogicalFilterClause.parse(filters).convert_to_pinecone().
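To illustrate the kind of conversion meant here, the following is a rough, self-contained sketch. It does not reproduce filter_utils.py: the function to_pinecone_filter is hypothetical, it only handles flat {field: value} filters, and it assumes Pinecone's metadata filter operators $eq, $in and $and.

```python
from typing import Any, Dict, List, Union

FilterValue = Union[str, int, float, bool, List[Any]]


def to_pinecone_filter(filters: Dict[str, FilterValue]) -> Dict[str, Any]:
    """Hypothetical helper: turn a flat {field: value} filter dict into a
    Pinecone-style metadata filter (assumed operators: $eq, $in, $and)."""
    clauses: List[Dict[str, Any]] = []
    for field, value in filters.items():
        if isinstance(value, list):
            # A list means "field matches any of these values".
            clauses.append({field: {"$in": value}})
        else:
            clauses.append({field: {"$eq": value}})
    if not clauses:
        return {}
    if len(clauses) == 1:
        return clauses[0]
    # Multiple fields are combined conjunctively.
    return {"$and": clauses}
```

Keeping this logic in one shared place, rather than in a per-store _build_filter_clause, is what makes filters easier to maintain across document stores.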
Updated to use filter_utils.py; I've removed _build_filter_clause.
haystack/document_stores/pinecone.py
Outdated
Weaviate get methods return the data items in properties key, whereas the query doesn't.
"""
score = None
content = ""
Is content populated somewhere else? Otherwise, I think it will always be an empty string.
content is extracted with content = super().get_documents_by_id([doc.id for doc in documents])
@bogdankostic Could you please explain what the outcome was here? I see that get_documents_by_id() in SQLDocumentStore returns a List of Document (not a List of the content of each Document):
haystack/haystack/document_stores/sql.py
Line 178 in 46fa166
) -> List[Document]:
Where is content set?
haystack/document_stores/pinecone.py
Outdated
# check there are vectors
count = self.get_embedding_count(index)
if count == 0:
raise Exception("No documents exist, try creating documents with write_embeddings first.")
I think this should be changed to either write_documents or update_embeddings.
updated comment, now reads "either write_documents or update_embeddings"
haystack/document_stores/pinecone.py
Outdated
count = stats["namespaces"][""]["vector_count"] if stats["namespaces"].get("") else 0
return count

def train_index(
This method can be removed.
removed
haystack/document_stores/pinecone.py
Outdated
""" | ||
raise NotImplementedError("save method not implemented for PineconeDocumentStore")

def _load_init_params_from_config(
I think this method is not needed.
removed
haystack/document_stores/pinecone.py
Outdated
)

@classmethod
def load():
The cls argument is missing here.
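For context: Python passes the class implicitly as the first argument of a classmethod, so a signature of def load(): raises a TypeError as soon as it is called. A minimal sketch of the corrected shape, using a hypothetical ExampleStore class rather than the PR's code:

```python
class ExampleStore:
    """Hypothetical class used only to illustrate the classmethod signature."""

    def __init__(self, index: str = "document"):
        self.index = index

    @classmethod
    def load(cls, index: str = "document") -> "ExampleStore":
        # cls is the class the method was called on, so subclasses
        # get an instance of their own type back.
        return cls(index=index)
```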
added in latest commit, is now def load(cls)
…store_tests # Conflicts: # test/conftest.py
Looks very good to me 👍 Only some smaller things that don't concern the main functionality. I would like to see some of the Exceptions become DocumentStoreError, and the docstring of PineconeDocumentStore should describe in a bit more detail how to obtain an API key and what it means that the DocumentStore is hosted. I also didn't understand the role of content = "". Please explain that in a comment in the code. Happy to approve once that is addressed. 🙂
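As a sketch of the first point, a bare Exception could be swapped for a dedicated error type along these lines. The DocumentStoreError class below is a local stand-in so the example is self-contained, and check_embedding_count is a hypothetical helper mirroring the check quoted earlier in this review, not code from the PR.

```python
class DocumentStoreError(Exception):
    """Local stand-in for the DocumentStoreError mentioned in the review."""


def check_embedding_count(count: int) -> None:
    # Hypothetical helper: fail with a specific error type instead of a bare
    # Exception, so callers can catch document store problems selectively.
    if count == 0:
        raise DocumentStoreError(
            "No documents exist, try creating documents with "
            "write_documents or update_embeddings first."
        )
```

A caller can then use except DocumentStoreError to handle this case without also swallowing unrelated exceptions.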
LGTM! 👍
# Conflicts: # docs/_src/api/api/pipelines.md
…-store' into pr/2254
Proposed changes:
haystack/document_stores/pinecone.py, embeddings and metadata are stored in Pinecone, content is stored in a local SQL DB
Status:
I put together two notebooks for testing, if it helps: