How can I obtain the doc ID? #64

Ask-sola · 2024-10-06T09:40:53Z

If I want to return the corresponding ID instead of the doc text itself after a query, how should I implement it?

xhluca · 2024-10-06T20:58:46Z

xhluca
Oct 6, 2024
Maintainer

You can pass the IDs as the corpus when you call retriever.retrieve, e.g. retriever.retrieve(..., corpus=your_corpus_ids). Alternatively, you can pass them when you initialize the retriever.

import bm25s
import Stemmer  # optional: for stemming

# Create your corpus here
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

corpus_ids = [
  "doc_1",
  "doc_2",
  "doc_3",
  "doc_4",
]

# optional: create a stemmer
stemmer = Stemmer.Stemmer("english")

# Tokenize the corpus and only keep the ids (faster and saves memory)
corpus_tokens = bm25s.tokenize(corpus, stopwords="en", stemmer=stemmer)

# Create the BM25 model and index the corpus
retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# Query the corpus
query = "does the fish purr like a cat?"
query_tokens = bm25s.tokenize(query, stemmer=stemmer)

# Get top-k results as a tuple of (doc ids, scores). Both are arrays of shape (n_queries, k)
results, scores = retriever.retrieve(query_tokens, corpus=corpus_ids, k=2)

for i in range(results.shape[1]):
    doc, score = results[0, i], scores[0, i]
    print(f"Rank {i+1} (score: {score:.2f}): {doc}")
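If you would rather call retrieve without a corpus, my understanding is that bm25s then returns integer positions into the indexed corpus, so you can map them to your IDs yourself. Here is a minimal sketch of that lookup in plain Python. The helper name indices_to_ids and the sample positions are made up for illustration and are not part of bm25s or real retrieval output.

```python
def indices_to_ids(index_rows, corpus_ids):
    """Map each row of integer document positions to the matching IDs."""
    return [[corpus_ids[int(i)] for i in row] for row in index_rows]


corpus_ids = ["doc_1", "doc_2", "doc_3", "doc_4"]

# Hypothetical positions for one query with k=2.
# These are placeholder values, not real bm25s output.
index_rows = [[3, 0]]

id_rows = indices_to_ids(index_rows, corpus_ids)
print(id_rows)
```

This only works if corpus_ids is in the same order as the corpus you indexed. Passing corpus=corpus_ids to retrieve, as in the snippet above, does the same lookup for you.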
