DocArray as a Retriever #6031

Merged
merged 10 commits into from
Jun 17, 2023

Contributor

@jupyterjazz jupyterjazz commented Jun 12, 2023

DocArray as a Retriever

DocArray is an open-source tool for managing multi-modal data. It lets you store and search your data using a range of document index backends. This PR introduces DocArrayRetriever, which works with any available backend and serves as a retriever for LangChain apps.

Also, I added 2 notebooks:
DocArray Backends - intro to all 5 currently supported backends, how to initialize, index, and use them as a retriever
DocArray Usage - showcasing what additional search parameters you can pass to create versatile retrievers

Example:

from docarray.index import InMemoryExactNNIndex
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.retrievers import DocArrayRetriever


# define document schema
class MyDoc(BaseDoc):
    description: str
    description_embedding: NdArray[1536]


embeddings = OpenAIEmbeddings()
# create documents
descriptions = ["description 1", "description 2"]
desc_embeddings = embeddings.embed_documents(texts=descriptions)
docs = DocList[MyDoc](
    [
        MyDoc(description=desc, description_embedding=embedding)
        for desc, embedding in zip(descriptions, desc_embeddings)
    ]
)

# initialize document index with data
db = InMemoryExactNNIndex[MyDoc](docs)

# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="description_embedding",
    content_field="description",
)

# find the relevant documents (get_relevant_documents returns a list)
relevant_docs = retriever.get_relevant_documents("action movies")
print(relevant_docs)
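
For readers who want to see the idea behind the example above without an OpenAI key or a DocArray install, here is a minimal conceptual sketch of exact nearest-neighbour retrieval in plain Python. It is not DocArray's or LangChain's implementation. `ToyDoc`, `cosine_similarity`, and `exact_nn_search` are hypothetical names introduced only for illustration, the three-dimensional vectors are made-up stand-ins for real embeddings, and cosine similarity is assumed as the metric.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ToyDoc:
    # Hypothetical stand-in for MyDoc above; not a DocArray class.
    description: str
    description_embedding: Sequence[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_nn_search(
    docs: List[ToyDoc], query_embedding: Sequence[float], top_k: int = 1
) -> List[ToyDoc]:
    """Score every document against the query and return the top_k best matches."""
    ranked = sorted(
        docs,
        key=lambda d: cosine_similarity(d.description_embedding, query_embedding),
        reverse=True,
    )
    return ranked[:top_k]


# Made-up embeddings; a real setup would get these from an embedding model.
toy_docs = [
    ToyDoc("description 1", [1.0, 0.0, 0.0]),
    ToyDoc("description 2", [0.0, 1.0, 0.0]),
]

hits = exact_nn_search(toy_docs, query_embedding=[0.9, 0.1, 0.0], top_k=1)
print([d.description for d in hits])  # prints ['description 1']
```

The retriever in the PR wraps the same flow: embed the query text, search the chosen embedding field, and return the content field of the best matches.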

Who can review?

@dev2049

Signed-off-by: jupyterjazz <saba.sturua@jina.ai>
Signed-off-by: jupyterjazz <saba.sturua@jina.ai>
Signed-off-by: jupyterjazz <saba.sturua@jina.ai>
Signed-off-by: jupyterjazz <saba.sturua@jina.ai>
@jpzhangvincent
Contributor

It would be nice to also add jina's annlite for the vector store option as well.

@jupyterjazz
Contributor Author

Hey @jpzhangvincent, annlite is not yet compatible with the new DocArray version, but we might add support in the future. Thanks for the suggestion!

Contributor

@hwchase17 hwchase17 left a comment

We don't need two separate notebooks about DocArray in the retrievers section.

@hwchase17 hwchase17 added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Jun 16, 2023
Signed-off-by: jupyterjazz <saba.sturua@jina.ai>
@vercel

vercel bot commented Jun 16, 2023

Name: langchain
Status: ✅ Ready
Updated (UTC): Jun 16, 2023 7:45pm

@vercel vercel bot temporarily deployed to Preview June 16, 2023 08:00 Inactive
@jupyterjazz
Contributor Author

@hwchase17 @vowelparrot @dev2049

I'm not sure why Vercel is failing; it seems to fail for all other recent PRs as well.

Signed-off-by: jupyterjazz <saba.sturua@jina.ai>
@vercel vercel bot temporarily deployed to Preview June 16, 2023 18:52 Inactive
Signed-off-by: jupyterjazz <saba.sturua@jina.ai>
Signed-off-by: jupyterjazz <saba.sturua@jina.ai>
@vercel vercel bot temporarily deployed to Preview June 16, 2023 19:29 Inactive
@vercel vercel bot temporarily deployed to Preview June 16, 2023 19:45 Inactive
Signed-off-by: jupyterjazz <saba.sturua@jina.ai>
@vercel

vercel bot commented Jun 16, 2023

@jupyterjazz is attempting to deploy a commit to the LangChain Team on Vercel.

A member of the Team first needs to authorize it.

Signed-off-by: jupyterjazz <saba.sturua@jina.ai>
@jupyterjazz
Contributor Author

hey @hwchase17 @vowelparrot @dev2049

I think Vercel needs some approval from your side and CI should be green afterwards. The comment about separate notebooks is addressed!

@hwchase17 hwchase17 merged commit 427551e into langchain-ai:master Jun 17, 2023
This was referenced Jun 25, 2023