Pre-release v0.2.0: Speeding up retrieval with numba, and new stopwords #46

xhluca · 2024-08-18T17:39:40Z

xhluca
Aug 18, 2024
Maintainer

This is a pretty exciting pre-release! It introduces a major new feature planned for v0.2.0, which will come out soon. I hope you get to try it and share your thoughts in this thread!

To try it:

pip install numba
pip install "bm25s[full]==0.2.0rc6"

What's Changed

Add numba integration to allow for faster scoring and retrieval by @xhluca in #41
Add stopwords for 10 new languages by @bm777 in #33
Add type hint for texts argument in tokenize function and replace time.time() with time.monotonic() by @dantetemplar in #44

New Contributors

@bm777 made their first contribution in #33
@dantetemplar made their first contribution in #44

Full Changelog: 0.1.10...0.2.0rc6

Notes about new numba integration

In PR #41, we added support for Numba's nopython-mode JIT compilation, which gives a substantial speedup. For example, on NQ we went from 41 queries/s to 91.83 queries/s (see bm25-benchmark).

Changes

We added an option to use a numba backend for top-k selection when you retrieve text. Simply use retriever.retrieve(..., backend_selection="numba") to activate it.
We changed how the relevance score is computed, making it faster by default and even faster when numba is used. You can now call retriever.activate_numba_scorer() to enable the numba scorer.
New tests for numba: tests/numba/test_topk_numba.py
New example using numba: examples/retrieve_with_numba.py
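Since numba is an optional install, here is a minimal sketch of one way to pick the selection backend at runtime. The helper pick_selection_backend is hypothetical (not part of bm25s), and falling back to "auto" assumes that value is accepted by backend_selection; only "numba" is confirmed in the notes above.

```python
import importlib.util


def pick_selection_backend():
    """Hypothetical helper: return "numba" if numba is importable, else "auto".

    "auto" as a fallback value is an assumption, not confirmed by these notes.
    """
    if importlib.util.find_spec("numba") is not None:
        return "numba"
    return "auto"


backend = pick_selection_backend()
print(f"Using backend_selection={backend!r}")

# With a loaded retriever and tokenized queries, it could be used like this:
# if backend == "numba":
#     retriever.activate_numba_scorer()
# results = retriever.retrieve(queries_tokenized, k=3, backend_selection=backend)
```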

Detailed notes

New scoring approaches (numba ready)

You can find the function _compute_relevance_from_scores_legacy in bm25s/scoring.py to see how the old scoring worked. We now also have _compute_relevance_from_scores_jit_ready, an alternative to both the legacy and the default relevance scoring functions. It is slow out of the box but can be much faster once compiled with numba.njit(_compute_relevance_from_scores_jit_ready). Moreover, our default relevance scoring function is now faster than the legacy approach and has been moved directly to the main BM25 class as a staticmethod called _compute_relevance_from_scores. It can be overridden with your own function, such as _compute_relevance_from_scores_jit_ready or _compute_relevance_from_scores_legacy.
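To give an idea of what "JIT ready" means here, below is a rough sketch of a loop-based scorer that is slow in plain Python but compiles cleanly with numba.njit. The function accumulate_scores and the toy index are hypothetical stand-ins, not the actual code in bm25s/scoring.py; the sketch assumes the per-token scores are stored in a CSC-like layout (data, indices, indptr).

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # numba is optional; fall back to plain Python
    njit = None


def accumulate_scores(data, indptr, indices, num_docs, query_token_ids):
    """Hypothetical stand-in: sum precomputed token scores per document."""
    scores = np.zeros(num_docs, dtype=np.float32)
    for i in range(len(query_token_ids)):
        token_id = query_token_ids[i]
        # Entries for this token live in data[start:end] / indices[start:end]
        start = indptr[token_id]
        end = indptr[token_id + 1]
        for j in range(start, end):
            scores[indices[j]] += data[j]
    return scores


# Compile with numba when available, otherwise keep the plain function
accumulate_scores_fast = njit(accumulate_scores) if njit is not None else accumulate_scores

# Toy index with 3 tokens and 4 documents (made-up values)
data = np.array([1.0, 0.5, 2.0, 0.25, 1.5], dtype=np.float32)
indices = np.array([0, 2, 1, 2, 3], dtype=np.int32)
indptr = np.array([0, 2, 4, 5], dtype=np.int32)
query = np.array([0, 1], dtype=np.int32)

scores = accumulate_scores_fast(data, indptr, indices, 4, query)
print(scores)
```

Explicit loops over typed arrays are what numba compiles well, which is why this style is slow when interpreted and fast once jitted.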

New selection algorithm powered by numba (`topk`)

We created a bm25s.numba.selection module that can be imported only when numba is available. It offers a topk function that behaves mostly the same as bm25s.selection.topk; the only difference is that documents with the same score may be returned in a different order. It is used automatically when backend_selection="numba" is passed.
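The tie-ordering caveat is easy to see with a small NumPy sketch. The function topk_sketch below is hypothetical and is not the implementation of either topk in bm25s; it only illustrates that partition-based top-k selection guarantees the returned scores, not which of several tied documents come back.

```python
import numpy as np


def topk_sketch(scores, k):
    """Hypothetical top-k: partition first, then sort only the k winners."""
    k = min(k, scores.shape[0])
    # argpartition places the k largest entries in the last k slots, unordered
    part = np.argpartition(scores, -k)[-k:]
    # Sort those k entries by descending score
    order = np.argsort(scores[part])[::-1]
    top_idx = part[order]
    return scores[top_idx], top_idx


# Documents 1, 2 and 4 are tied for the best score
scores = np.array([0.2, 1.5, 1.5, 0.9, 1.5], dtype=np.float32)
top_scores, top_idx = topk_sketch(scores, k=2)
print(top_scores, top_idx)
```

The two returned scores are always 1.5, but which two of the three tied documents are returned depends on the selection algorithm, so compare backends on scores rather than on exact document order.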

Usage

Here's an example of how to leverage the numba speedups:

import Stemmer  # provided by the PyStemmer package

import bm25s.hf  # also makes the top-level bm25s package available

def main(repo_name="xhluca/bm25s-fiqa-index"):
    queries = [
        "Is chemotherapy effective for treating cancer?",
        "Is Cardiac injury is common in critical cases of COVID-19?",
    ]

    retriever = bm25s.hf.BM25HF.load_from_hub(
        repo_name, load_corpus=False, mmap=False
    )

    # Tokenize the queries
    stemmer = Stemmer.Stemmer("english")
    queries_tokenized = bm25s.tokenize(queries, stemmer=stemmer)

    # Retrieve the top-k results
    retriever.activate_numba_scorer()
    results = retriever.retrieve(queries_tokenized, k=3, backend_selection="numba")
    # show first results
    result = results.documents[0]
    print(f"First score (# 1 result):{results.scores[0, 0]}")
    print(f"First result (# 1 result):\n{result[0]}")

if __name__ == "__main__":
    main()
