How can I obtain the doc ID? #64

Answered by xhluca
Ask-sola asked this question in Q&A

You can pass the IDs as the corpus when you call retriever.retrieve, e.g. retriever.retrieve(..., corpus=your_corpus_ids). Alternatively, you can pass them when you initialize the retriever.

import bm25s
import Stemmer  # optional: for stemming

# Create your corpus here
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

corpus_ids = [
    "doc_1",
    "doc_2",
    "doc_3",
    "doc_4",
]

# optional: create a stemmer
stemmer = Stemmer.Stemmer("english")

# Tokenize the corpus and only keep the ids (faster and saves memory)
corpu…
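Conceptually, passing the IDs as the corpus amounts to looking up each retrieved position in a list that runs parallel to the documents. Here is a minimal sketch of that lookup in plain Python. It does not call bm25s: `indices_to_ids` is a hypothetical helper, and `top_indices` is made-up stand-in data, not real retriever output.

```python
# Parallel list: position i in corpus_ids names document i in the corpus.
corpus_ids = ["doc_1", "doc_2", "doc_3", "doc_4"]


def indices_to_ids(index_rows, ids):
    """Map each row of positional indices (one row per query) to document IDs."""
    return [[ids[i] for i in row] for row in index_rows]


# Hypothetical top-2 positions for a single query (illustrative data only).
top_indices = [[0, 3]]

top_ids = indices_to_ids(top_indices, corpus_ids)
print(top_ids)  # [['doc_1', 'doc_4']]
```

The same idea applies if you prefer to keep the retriever free of any corpus and resolve IDs yourself afterwards, as long as the ID list stays in the same order as the documents that were indexed.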

Answer selected by xhluca

This discussion was converted from issue #62 on October 06, 2024 20:58.