If I want to return the corresponding ID instead of the doc text itself after a query, how should I implement it?
Answered by
xhluca
Oct 6, 2024
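You can pass the ids to the corpus argument when you call retriever.retrieve, e.g. retriever.retrieve(..., corpus=your_corpus_ids). Alternatively, you can also pass it when you initialize the retriever.

import bm25s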
import Stemmer # optional: for stemming
# Create your corpus here
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]
corpus_ids = [
    "doc_1",
    "doc_2",
    "doc_3",
    "doc_4",
]
# optional: create a stemmer
stemmer = Stemmer.Stemmer("english")
# Tokenize the corpus and only keep the ids (faster and saves memory)
corpus_tokens = bm25s.tokenize(corpus, stopwords="en", stemmer=stemmer)
# Create the BM25 model and index the corpus
retriever = bm25s.BM25()
retriever.index(corpus_tokens)
# Query the corpus
query = "does the fish purr like a cat?"
query_tokens = bm25s.tokenize(query, stemmer=stemmer)
# Get top-k results as a tuple of (doc ids, scores). Both are arrays of shape (n_queries, k)
results, scores = retriever.retrieve(query_tokens, corpus=corpus_ids, k=2)
for i in range(results.shape[1]):
    doc, score = results[0, i], scores[0, i]
    print(f"Rank {i+1} (score: {score:.2f}): {doc}")
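
If you would rather keep the retriever unaware of your ids, another option is to call retrieve without a corpus and map the returned positions back to ids yourself. The sketch below assumes that retrieve then returns integer document indices of shape (n_queries, k). Note that indices_to_ids is a hypothetical helper, not part of bm25s, and example_indices is a made-up stand-in for real retrieve output.

```python
# Hypothetical helper (not part of bm25s): map corpus positions to your own ids.
def indices_to_ids(indices, ids):
    """Map a (n_queries, k) nested sequence of corpus positions to document ids."""
    # int() lets this accept plain ints as well as NumPy integer scalars.
    return [[ids[int(i)] for i in row] for row in indices]


corpus_ids = ["doc_1", "doc_2", "doc_3", "doc_4"]

# Stand-in for the index array you would get back from retrieve();
# these values are made up for illustration, not real retrieval output.
example_indices = [[3, 0]]

print(indices_to_ids(example_indices, corpus_ids))
```

This keeps the id lookup in your own code, which can be handy if the ids live somewhere else (a database, a dataframe) and you only need the positions from the retriever.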