Release new docs
Milvus-doc-bot authored and Milvus-doc-bot committed Oct 28, 2024
1 parent ebb7e77 commit 12eca46
Showing 1 changed file with 51 additions and 56 deletions.
107 changes: 51 additions & 56 deletions v2.4.x/site/en/integrations/integrate_with_sentencetransformers.md
@@ -6,22 +6,22 @@ title: Movie Search Using Milvus and SentenceTransformers

# Movie Search Using Milvus and SentenceTransformers

In this example, we are going to be going over a Wikipedia article search using Milvus and the SentenceTransformers library. The dataset we are searching through is [Wikipedia Movie Plots with Summaries](https://huggingface.co/datasets/vishnupriyavr/wiki-movie-plots-with-summaries) hosted on HuggingFace.
In this example, we will search movie plot summaries using Milvus and the SentenceTransformers library. The dataset we will use is [Wikipedia Movie Plots with Summaries](https://huggingface.co/datasets/vishnupriyavr/wiki-movie-plots-with-summaries) hosted on HuggingFace.

Let's get started!

## Required Libraries

For this example, we are going to be using `pymilvus` to connect to Milvus, `sentence-transformers` to generate vector embeddings, and `datasets` to download the example dataset.
For this example, we will use `pymilvus` to connect to Milvus, `sentence-transformers` to generate vector embeddings, and `datasets` to download the example dataset.

```shell
pip install pymilvus sentence-transformers datasets tqdm
```

```python
from datasets import load_dataset
from pymilvus import MilvusClient, connections
from pymilvus import FieldSchema, CollectionSchema, DataType, Collection
from pymilvus import MilvusClient
from pymilvus import FieldSchema, CollectionSchema, DataType
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
```
@@ -33,7 +33,7 @@ collection_name = "movie_embeddings"
```

## Downloading and Opening the Dataset
In a single line, `datasets` allows us to download and open a dataset. The library will cache the dataset locally and use that copy next time it is run. Each row contains the details of a movie that has an accompanying Wikipedia article. We only make use of the `Title` and `PlotSummary` columns.
In a single line, `datasets` allows us to download and open a dataset. The library will cache the dataset locally and use that copy next time it is run. Each row contains the details of a movie that has an accompanying Wikipedia article. We make use of the `Title`, `PlotSummary`, `Release Year`, and `Origin/Ethnicity` columns.

```python
ds = load_dataset("vishnupriyavr/wiki-movie-plots-with-summaries", split="train")
@@ -46,34 +46,30 @@ At this point, we are going to begin setting up Milvus. The steps are as follows
1. Create a Milvus Lite database in a local file. (Replace this URI with the server address when using Milvus Standalone or Milvus Distributed.)

```python
connections.connect(uri="./sentence_transformers_example.db")
client = MilvusClient(uri="./sentence_transformers_example.db")
```

2. Create the data schema. This specifies the fields that comprise an element including the dimension of the vector embedding.

```python
fields = [
    FieldSchema(name='id', dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name='title', dtype=DataType.VARCHAR, max_length=256),
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, dim=embedding_dim)
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=256),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=embedding_dim),
    FieldSchema(name="year", dtype=DataType.INT64),
    FieldSchema(name="origin", dtype=DataType.VARCHAR, max_length=64),
]

schema = CollectionSchema(fields=fields, enable_dynamic_field=False)
collection = Collection(name=collection_name, schema=schema)
client.create_collection(collection_name=collection_name, schema=schema)
```

3. Define the vector search indexing algorithm. Milvus Lite implements brute force search and HNSW, whereas Milvus Standalone and Milvus Distributed implement a wide variety of methods. For this scale of data, the naive brute force search suffices.
3. Define the vector search indexing algorithm. Milvus Lite supports the FLAT index type, whereas Milvus Standalone and Milvus Distributed implement a wide variety of methods such as IVF, HNSW, and DiskANN. For the small scale of data in this demo, any index type suffices, so we use the simplest one, FLAT.

```python
params = {
'index_type':"FLAT",
'metric_type': "IP"
}

collection.create_index(
'embedding',
params
)
index_params = client.prepare_index_params()
index_params.add_index(field_name="embedding", index_type="FLAT", metric_type="IP")
client.create_index(collection_name, index_params)
```

Once these steps are done, we are ready to insert data into the collection and perform a search. Any data added will be indexed automatically and be available to search immediately. If the data is very fresh, the search might be slower because brute-force search is used on data that is still being indexed.
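
To make the FLAT index with the IP metric more concrete, here is a minimal pure-Python sketch of what a brute-force inner-product search computes. It is an illustration only and not how Milvus implements it; the helper name `flat_ip_search` and the toy vectors are made up for this example.

```python
from typing import List, Tuple


def inner_product(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def flat_ip_search(
    query: List[float], vectors: List[List[float]], limit: int
) -> List[Tuple[int, float]]:
    """Score every stored vector against the query and keep the top `limit`.

    Hypothetical helper: a FLAT-style search compares the query with all
    vectors (no approximation); with the IP metric, a larger score means
    a closer match.
    """
    scored = [(idx, inner_product(query, vec)) for idx, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Toy 3-dimensional "embeddings" (made-up values for illustration).
stored = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
top = flat_ip_search([1.0, 0.0, 0.0], stored, limit=2)
print(top)
```

Because every vector is scored, the cost grows linearly with the collection size, which is acceptable for a demo-sized dataset like this one.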
@@ -90,16 +86,14 @@ We loop over the rows of the data, embed the plot summary field, and insert enti

```python
for batch in tqdm(ds.batch(batch_size=512)):
    embeddings = model.encode(batch['PlotSummary'])
    data = [{"title": title, "embedding": embedding} for title, embedding in zip(batch['Title'], embeddings)]
    res = collection.insert(data=data)
```

To be safe, we flush the data writing queue and check that the expected number of elements is present in the database.

```python
collection.flush()
print(collection.num_entities)
    embeddings = model.encode(batch["PlotSummary"])
    data = [
        {"title": title, "embedding": embedding, "year": year, "origin": origin}
        for title, embedding, year, origin in zip(
            batch["Title"], embeddings, batch["Release Year"], batch["Origin/Ethnicity"]
        )
    ]
    res = client.insert(collection_name=collection_name, data=data)
```
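
The shape of the records built in the loop can be checked without running the embedding model. The sketch below mirrors the list comprehension on a made-up batch and uses a stand-in `fake_encode` function (an assumption for illustration; it is not part of `sentence-transformers`), so only the record layout is meaningful, not the vectors.

```python
from typing import Dict, List


def fake_encode(texts: List[str]) -> List[List[float]]:
    """Stand-in for model.encode: one tiny made-up vector per text."""
    return [[float(len(text)), 0.0] for text in texts]


def build_records(batch: Dict[str, list]) -> List[dict]:
    """Turn a column-oriented batch into one dict per row."""
    embeddings = fake_encode(batch["PlotSummary"])
    return [
        {"title": title, "embedding": embedding, "year": year, "origin": origin}
        for title, embedding, year, origin in zip(
            batch["Title"], embeddings, batch["Release Year"], batch["Origin/Ethnicity"]
        )
    ]


# Made-up batch using the same column names the tutorial reads from the dataset.
sample_batch = {
    "Title": ["Movie A", "Movie B"],
    "PlotSummary": ["A short plot.", "Another plot summary."],
    "Release Year": [1985, 1990],
    "Origin/Ethnicity": ["American", "British"],
}
records = build_records(sample_batch)
print(records[0])
```

Each record's keys match the non-auto-generated fields of the schema (`title`, `embedding`, `year`, `origin`); the `id` field is omitted because the schema sets `auto_id=True`.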

<div class="alert note">
@@ -109,7 +103,7 @@ The above operation is relatively time-consuming because embedding takes time. T
</div>

## Performing the Search
With all the data inserted into Milvus, we can start performing our searches. In this example, we are going to search for movies based on the plot. Because we are doing a batch search, the search time is shared across the movie searches. (Can you guess what the intended result was based on the movie search?)
With all the data inserted into Milvus, we can start performing our searches. In this example, we are going to search for movies based on plot summaries from Wikipedia. Because we are doing a batch search, the search time is shared across the movie searches. (Can you guess which movie each query description is meant to retrieve?)

```python
queries = [
@@ -122,60 +116,61 @@ queries = [
]

# Search the database based on input text
def embed_search(data):
    embeds = model.encode(data)
    return [x for x in embeds]
def embed_query(data):
    vectors = model.encode(data)
    return [x for x in vectors]


search_data = embed_search(queries)
query_vectors = embed_query(queries)

res = collection.search(
data=search_data,
res = client.search(
collection_name=collection_name,
data=query_vectors,
filter='origin == "American" and year > 1945 and year < 2000',
anns_field="embedding",
param={},
limit=3,
output_fields=['title']
output_fields=["title"],
)

for idx, hits in enumerate(res):
    print('Title:', queries[idx])
    # print('Search Time:', end-start)
    print('Results:')
    print("Query:", queries[idx])
    print("Results:")
    for hit in hits:
        print( hit.entity.get('title'), '(', round(hit.distance, 2), ')')
        print(hit["entity"].get("title"), "(", round(hit["distance"], 2), ")")
    print()
```

The results are:

```shell
Title: An archaeologist searches for ancient artifacts while fighting Nazis.
Query: An archaeologist searches for ancient artifacts while fighting Nazis.
Results:
"Pimpernel" Smith ( 0.48 )
Phantom of Chinatown ( 0.42 )
Counterblast ( 0.41 )
Love Slaves of the Amazons ( 0.4 )
A Time to Love and a Time to Die ( 0.39 )
The Fifth Element ( 0.39 )

Title: Teenagers in detention learn about themselves.
Query: Teenagers in detention learn about themselves.
Results:
The Breakfast Club ( 0.54 )
Up the Academy ( 0.46 )
Fame ( 0.43 )

Title: A teenager fakes illness to get off school and have adventures with two friends.
Query: A teenager fakes illness to get off school and have adventures with two friends.
Results:
Ferris Bueller's Day Off ( 0.48 )
Fever Lake ( 0.47 )
A Walk to Remember ( 0.45 )
Losin' It ( 0.39 )

Title: A young couple with a kid look after a hotel during winter and the husband goes insane.
Query: A young couple with a kid look after a hotel during winter and the husband goes insane.
Results:
Always a Bride ( 0.54 )
Fast and Loose ( 0.49 )
The Shining ( 0.48 )
The Four Seasons ( 0.42 )
Highball ( 0.41 )

Title: Four turtles fight bad guys.
Query: Four turtles fight bad guys.
Results:
TMNT 2: Out of the Shadows ( 0.49 )
Teenage Mutant Ninja Turtles II: The Secret of the Ooze ( 0.47 )
Gamera: Super Monster ( 0.43 )
Devil May Hare ( 0.43 )
Attack of the Giant Leeches ( 0.42 )
```
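
The `filter` argument in the search call limits results to entities whose scalar fields satisfy the expression. As a rough mental model (not a description of how Milvus evaluates it internally), `origin == "American" and year > 1945 and year < 2000` corresponds to the plain Python predicate below; the `matches_filter` name and the sample rows are hypothetical.

```python
def matches_filter(row: dict) -> bool:
    """Python equivalent of: origin == "American" and year > 1945 and year < 2000."""
    return row["origin"] == "American" and 1945 < row["year"] < 2000


# Made-up rows using the scalar fields defined in the schema.
rows = [
    {"title": "Movie A", "year": 1985, "origin": "American"},
    {"title": "Movie B", "year": 1940, "origin": "American"},
    {"title": "Movie C", "year": 1990, "origin": "British"},
    {"title": "Movie D", "year": 2000, "origin": "American"},
]
kept = [row["title"] for row in rows if matches_filter(row)]
print(kept)
```

Both year bounds are strict, so a film released in exactly 1945 or 2000 is excluded.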
