forked from stanfordnlp/dspy
Commit
Merge pull request stanfordnlp#1036 from Anush008/qdrant-rm-doc
docs: Qdrant RM example
Showing 2 changed files with 303 additions and 1 deletion.
examples/integrations/qdrant/qdrant_retriever_example.ipynb (302 additions & 0 deletions)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,302 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# DSPy Retriever using Qdrant\n", | ||
"\n", | ||
"This notebook walks you through using Qdrant as a retriever in DSPy. We'll load a dataset into Qdrant and retrieve relevant context from it through a DSPy retriever." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"#### Setup" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%pip install dspy-ai[qdrant]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"#### Configure Constants" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"This notebook assumes you have a Qdrant instance running at http://localhost:6333/. To learn more about setting up Qdrant, you can refer to the [quickstart guide](https://qdrant.tech/documentation/quick-start/)." | ||
] | ||
}, | ||
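{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"If you don't have an instance available yet, the cell below is a minimal sketch for starting one locally with Docker. It assumes Docker is installed; see the quickstart guide above for other deployment options." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Sketch, assuming Docker is installed: start a local Qdrant server on the default port 6333\n", | ||
"# !docker run -d -p 6333:6333 qdrant/qdrant" | ||
] | ||
}, | ||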
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"COLLECTION_NAME = \"DBPEDIA-DSPY\"\n", | ||
"QDRANT_URL = \"http://localhost:6333\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Ingesting data\n", | ||
"\n", | ||
"We'll load the [Qdrant/dbpedia-entities-openai3-text-embedding-3-small-1536-100K](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-small-1536-100K) dataset, which contains DBpedia entries along with embeddings pre-computed using OpenAI's `text-embedding-3-small`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%pip install datasets" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from datasets import load_dataset\n", | ||
"\n", | ||
"# We will use a small subset of the dataset\n", | ||
"dataset = (\n", | ||
" load_dataset(\n", | ||
" \"Qdrant/dbpedia-entities-openai3-text-embedding-3-small-1536-100K\",\n", | ||
" streaming=True,\n", | ||
" split=\"train\",\n", | ||
" )\n", | ||
" .take(1000)\n", | ||
" .remove_columns([\"openai\", \"combined_text\"])\n", | ||
")" | ||
] | ||
}, | ||
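{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"As an optional sanity check, we can peek at a single streamed record to see which payload fields are available. The field names come from the dataset itself; the `\"text\"` field is the one our retriever will point at later." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Peek at one record; an IterableDataset restarts its stream on each new iteration\n", | ||
"sample = next(iter(dataset))\n", | ||
"list(sample.keys())" | ||
] | ||
}, | ||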
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Set up a client that points to a Qdrant instance at http://localhost:6333/." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from qdrant_client import QdrantClient\n", | ||
"\n", | ||
"client = QdrantClient(url=QDRANT_URL)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"We [create a collection](https://qdrant.tech/documentation/concepts/collections/#create-a-collection) with the appropriate dimensions and distance metric to load our dataset into." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"True" | ||
] | ||
}, | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"from qdrant_client import models\n", | ||
"\n", | ||
"client.create_collection(\n", | ||
" collection_name=COLLECTION_NAME,\n", | ||
" vectors_config=models.VectorParams(\n", | ||
" size=1536,\n", | ||
" distance=models.Distance.COSINE,\n", | ||
" ),\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"We can now load the dataset to be indexed in Qdrant. The `upload_collection` method accepts arguments to configure the batch size and parallelism. We'll go with the defaults here; a sketch with those arguments made explicit follows the upload below." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Materialize the streamed entries so the popped embedding field is actually removed from the payload\n", | ||
"points = list(dataset)\n", | ||
"vectors = [entry.pop(\"text-embedding-3-small-1536-embedding\") for entry in points]\n", | ||
"\n", | ||
"client.upload_collection(collection_name=COLLECTION_NAME, vectors=vectors, payload=points)" | ||
] | ||
}, | ||
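{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"For larger datasets, the same upload can be tuned explicitly. The cell below is a hedged sketch rather than something you need to run: `batch_size` and `parallel` are assumed to be the relevant `upload_collection` arguments in your installed `qdrant-client`, so verify them against its documentation first." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Illustration only: re-running an upload would duplicate the points inserted above.\n", | ||
"# Argument names are assumed from the qdrant-client API; check your installed version.\n", | ||
"# client.upload_collection(\n", | ||
"#     collection_name=COLLECTION_NAME,\n", | ||
"#     vectors=vectors,\n", | ||
"#     payload=points,\n", | ||
"#     batch_size=64,  # points per upload request\n", | ||
"#     parallel=2,  # number of parallel upload workers\n", | ||
"# )" | ||
] | ||
}, | ||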
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The loading is now complete. You can browse through the entries at http://localhost:6333/dashboard." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"#### Initialize Qdrant retriever and OpenAI vectorizer" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The Qdrant retriever lets us configure which vectorizer to use. We'll use the `OpenAIVectorizer` with the `text-embedding-3-small` model, matching the embeddings in our dataset.\n", | ||
"\n", | ||
"We can also specify which field in the Qdrant payload holds the document content. Based on the dataset we loaded, that field is `\"text\"`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import os\n", | ||
"\n", | ||
"os.environ[\"OPENAI_API_KEY\"] = \"<YOUR_OPENAI_API_KEY>\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from dsp.modules.sentence_vectorizer import OpenAIVectorizer\n", | ||
"\n", | ||
"vectorizer = OpenAIVectorizer(model=\"text-embedding-3-small\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from dspy.retrieve.qdrant_rm import QdrantRM\n", | ||
"\n", | ||
"qdrant_retriever = QdrantRM(\n", | ||
" qdrant_client=client,\n", | ||
" qdrant_collection_name=COLLECTION_NAME,\n", | ||
" vectorizer=vectorizer,\n", | ||
" document_field=\"text\",\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"With the `qdrant_retriever` instantiated, we can now configure `dspy` to use it." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import dspy\n", | ||
"\n", | ||
"dspy.settings.configure(rm=qdrant_retriever)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Trying out the retriever\n", | ||
"\n", | ||
"We can use the `dspy.Retrieve` class to query our retriever, just as it's done in DSPy RAG pipelines." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"Prediction(\n", | ||
" passages=['CounterSpy is a proprietary spyware removal program for Microsoft Windows software developed by Sunbelt Software.', 'In computing, the diff utility is a data comparison tool that calculates and displays the differences between two files. Unlike edit distance notions used for other purposes, diff is line-oriented rather than character-oriented, but it is like Levenshtein distance in that it tries to determine the smallest set of deletions and insertions to create one file from the other.', \"AudioDesk is an audio workstation application by Mark of the Unicorn (MOTU) for the Mac OS. It is a multi-track recording, editing, and mixing application, with both offline file-based processing and realtime effects. It is a more basic version of MOTU's Digital Performer DAW software. Much of the graphical user interface (GUI) and its operation are similar to Digital Performer, although it lacks some of Digital Performer's features.\"]\n", | ||
")" | ||
] | ||
}, | ||
"execution_count": 6, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"retrieve = dspy.Retrieve()\n", | ||
"\n", | ||
"retrieve(\"Some computer programs.\")" | ||
] | ||
}, | ||
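{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Going a step further, the configured retriever can be dropped into a complete DSPy program. The sketch below is a hedged example rather than part of the Qdrant integration itself: it assumes an OpenAI language model is reachable via the `OPENAI_API_KEY` set earlier, and the `GenerateAnswer` and `RAG` names are illustrative." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Minimal RAG-style sketch; the model name and class names are illustrative assumptions\n", | ||
"lm = dspy.OpenAI(model=\"gpt-3.5-turbo\")\n", | ||
"dspy.settings.configure(lm=lm, rm=qdrant_retriever)\n", | ||
"\n", | ||
"class GenerateAnswer(dspy.Signature):\n", | ||
"    \"\"\"Answer the question using the retrieved context.\"\"\"\n", | ||
"\n", | ||
"    context = dspy.InputField(desc=\"relevant passages\")\n", | ||
"    question = dspy.InputField()\n", | ||
"    answer = dspy.OutputField(desc=\"a short answer\")\n", | ||
"\n", | ||
"class RAG(dspy.Module):\n", | ||
"    def __init__(self, k=3):\n", | ||
"        super().__init__()\n", | ||
"        self.retrieve = dspy.Retrieve(k=k)\n", | ||
"        self.generate = dspy.ChainOfThought(GenerateAnswer)\n", | ||
"\n", | ||
"    def forward(self, question):\n", | ||
"        context = self.retrieve(question).passages\n", | ||
"        return self.generate(context=context, question=question)\n", | ||
"\n", | ||
"# Example usage (makes an LM call):\n", | ||
"# RAG()(\"What is the diff utility used for?\").answer" | ||
] | ||
}, | ||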
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"We can successfully retrieve results relevant to the query from our Qdrant collection, and the same retriever can be dropped into larger DSPy programs like the sketch above." | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.10.13" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |