Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PineconeDocumentStore #2254

Merged
merged 80 commits into from
Mar 21, 2022

Conversation

jamescalam
Copy link
Contributor

@jamescalam jamescalam commented Feb 27, 2022

Proposed changes:

Status:

  • First draft (up for discussions & feedback)
  • Final code
  • Added tests
  • Updated documentation

I put together two notebooks for testing, if it helps:

@CLAassistant
Copy link

CLAassistant commented Feb 27, 2022

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@bogdankostic bogdankostic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work so far! Before merging, we would definitely need to add Pinecone to our document store tests. Furthermore, it would be nice to make use of the newly introduced filter_utils.py for converting the filters, as this makes maintenance of filters across all document stores easier.

We also need to make sure that the typing is compliant with mypy.

environment: str = "us-west1-gcp",
sql_url: str = "sqlite:///pinecone_document_store.db",
pinecone_index: Optional["pinecone.Index"] = None,
vector_dim: int = 768,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We deprecated vector_dim for FAISS and Milvus and use embedding_dim instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

def _convert_pinecone_result_to_document(self, result: dict, return_embedding: bool) -> Document:
"""
Convert Pinecone result dict into haystack document object. This is more involved because
weaviate search result dict varies between get and query interfaces.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should remove mentions of Weaviate here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

)
return document

def _validate_params_load_from_disk(self, sig: Signature, locals: dict, kwargs: dict):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this method is not needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

)
if len(document_objects) > 0:
add_vectors = False if document_objects[0].embedding is None else True
# I don't think below is required
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you are referring to with this comment..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I had already removed the code, but missed the comment - now I removed the comment

self.index: self.embedding_field,
}

def _build_filter_clause(self, filters: Dict[str, Union[str, int, float, bool, list]]) -> dict:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For uniformity with the remaining document stores, it would be better to add convert_to_pinecone method to the classes in filter_utils.py. Like this, we can simply call LogicalFilterClause.parse(filters).convert_to_pinecone().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use filter_utils.py, I've removed _build_filter_clause

Weaviate get methods return the data items in properties key, whereas the query doesn't.
"""
score = None
content = ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is content somewhere else populated? Otherwise, I think it will always be an empty string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

content is extracted with content = super().get_documents_by_id([doc.id for doc in documents])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bogdankostic Could you please explain what was the outcome here? I see that get_documents_by_id() in SQLDocumentStore returns a List of Document (not a List of content of Document):

) -> List[Document]:

Where is content set?

# check there are vectors
count = self.get_embedding_count(index)
if count == 0:
raise Exception("No documents exist, try creating documents with write_embeddings first.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be changed to either write_documents or update_embeddings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated comment, now reads "either write_documents or update_embeddings"

count = stats["namespaces"][""]["vector_count"] if stats["namespaces"].get("") else 0
return count

def train_index(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method can be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

"""
raise NotImplementedError("save method not implemented for PineconeDocumentStore")

def _load_init_params_from_config(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this method is not needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

)

@classmethod
def load():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cls argument is missing here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added in latest commit, is now def load(cls)

@jamescalam jamescalam requested a review from bogdankostic March 2, 2022 12:31
@bogdankostic bogdankostic added the type:feature New feature or request label Mar 2, 2022
@bogdankostic bogdankostic changed the title Pinecone doc store Add PineconeDocumentStore Mar 14, 2022
@julian-risch julian-risch self-requested a review March 15, 2022 08:28
Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good to me 👍 Only some smaller things that don't concern the main functionality. I would like to see some of the Exceptions become DocumentStoreError, the doc string of PineconeDocumentStore should describe a bit more how to obtain an API key and what it means that the DocumentStore is hosted. And I didn't understand the role of content = "". Please explain that in a comment in the code. Happy to approve once that is addressed. 🙂

haystack/document_stores/pinecone.py Show resolved Hide resolved
haystack/document_stores/pinecone.py Outdated Show resolved Hide resolved
haystack/document_stores/pinecone.py Outdated Show resolved Hide resolved
Weaviate get methods return the data items in properties key, whereas the query doesn't.
"""
score = None
content = ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bogdankostic Could you please explain what was the outcome here? I see that get_documents_by_id() in SQLDocumentStore returns a List of Document (not a List of content of Document):

) -> List[Document]:

Where is content set?

haystack/document_stores/pinecone.py Outdated Show resolved Hide resolved
haystack/document_stores/pinecone.py Outdated Show resolved Hide resolved
haystack/document_stores/pinecone.py Show resolved Hide resolved
haystack/document_stores/pinecone.py Outdated Show resolved Hide resolved
haystack/document_stores/pinecone.py Outdated Show resolved Hide resolved
haystack/document_stores/pinecone.py Outdated Show resolved Hide resolved
Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants