# Add singlehop qa benchmark #9

**Merged** · 3 commits · Jan 8, 2025
6 changes: 3 additions & 3 deletions README-zh.md

@@ -91,7 +91,7 @@ python -m flexrag.entrypoints.prepare_index \
     saving_fields=$CORPUS_FIELDS \
     retriever_type=dense \
     dense_config.database_path=$DB_PATH \
-    dense_config.encode_fields='[text]' \
+    dense_config.encode_fields=[text] \
     dense_config.passage_encoder_config.encoder_type=hf \
     dense_config.passage_encoder_config.hf_config.model_path='facebook/contriever' \
     dense_config.passage_encoder_config.hf_config.device_id=[0,1,2,3] \
@@ -114,7 +114,7 @@ python -m flexrag.entrypoints.prepare_index \
     saving_fields=$CORPUS_FIELDS \
     retriever_type=bm25s \
     bm25s_config.database_path=$DB_PATH \
-    bm25s_config.indexed_fields='[title,text]' \
+    bm25s_config.indexed_fields=[title,text] \
     bm25s_config.method=lucene \
     bm25s_config.batch_size=512 \
     bm25s_config.log_interval=100000 \
@@ -312,7 +312,7 @@ FlexRAG adopts a **modular** architecture design, allowing you to easily customize and extend the framework
 </p>
 
 # 📊 Benchmarks
-We have conducted extensive benchmarks using FlexRAG. For details, please refer to the [benchmarks](benchmarks.md) page.
+We have conducted extensive benchmarks using FlexRAG. For details, please refer to the [benchmarks](benchmarks/README.md) page.
 
 # 🏷️ License
 This repository is open-sourced under the **MIT License**. For details, please refer to the [LICENSE](LICENSE) file.
6 changes: 3 additions & 3 deletions README.md

@@ -93,7 +93,7 @@ python -m flexrag.entrypoints.prepare_index \
     saving_fields=$CORPUS_FIELDS \
     retriever_type=dense \
     dense_config.database_path=$DB_PATH \
-    dense_config.encode_fields='[text]' \
+    dense_config.encode_fields=[text] \
     dense_config.passage_encoder_config.encoder_type=hf \
     dense_config.passage_encoder_config.hf_config.model_path='facebook/contriever' \
     dense_config.passage_encoder_config.hf_config.device_id=[0,1,2,3] \
@@ -116,7 +116,7 @@ python -m flexrag.entrypoints.prepare_index \
     saving_fields=$CORPUS_FIELDS \
     retriever_type=bm25s \
     bm25s_config.database_path=$DB_PATH \
-    bm25s_config.indexed_fields='[title,text]' \
+    bm25s_config.indexed_fields=[title,text] \
     bm25s_config.method=lucene \
     bm25s_config.batch_size=512 \
     bm25s_config.log_interval=100000 \
@@ -315,7 +315,7 @@ FlexRAG is designed with a **modular** architecture, allowing you to easily customize and extend the framework
 </p>
 
 # 📊 Benchmarks
-We have conducted extensive benchmarks using the FlexRAG framework. For more details, please refer to the [benchmarks](benchmarks.md) page.
+We have conducted extensive benchmarks using the FlexRAG framework. For more details, please refer to the [benchmarks](benchmarks/README.md) page.
 
 # 🏷️ License
 This repository is licensed under the **MIT License**. See the [LICENSE](LICENSE) file for details.
1 change: 0 additions & 1 deletion benchmarks.md

This file was deleted.

7 changes: 7 additions & 0 deletions benchmarks/README.md
@@ -0,0 +1,7 @@
# Benchmarks
This directory contains benchmarks for the FlexRAG framework. We conduct these experiments for the following reasons:
1. We hope to help users gain a deeper understanding of the various components in RAG and of their impact on the overall RAG system.
2. We aim to exercise the FlexRAG framework itself, which helps us catch potential bugs.

## Directory Structure
- [`singlehop_qa.md`](singlehop_qa.md): This file contains the benchmark results for the single-hop QA task.
115 changes: 115 additions & 0 deletions benchmarks/singlehop_qa.md
@@ -0,0 +1,115 @@
# Benchmark for Single-Hop QA Tasks
To better understand the performance of each component in FlexRAG within the RAG pipeline, we have conducted a series of experiments on a variety of single-hop QA datasets, including PopQA, Natural Questions (NQ), and TriviaQA. The experiments are divided into five categories: sparse retriever benchmarks, dense retriever benchmarks, index benchmarks, reranker benchmarks, and generator benchmarks. We also provide best practices for single-hop QA tasks.

All experiments are conducted using the `ModularAssistant` in the FlexRAG framework.
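
For orientation, the sketch below shows the retrieve → rerank → generate loop that such a modular assistant runs; the benchmarks below swap out one component at a time while holding the others fixed. This is a conceptual sketch only: the function and component interfaces are illustrative assumptions, not FlexRAG's actual API.

```python
# Conceptual sketch of a modular RAG pipeline. The component interfaces
# below are illustrative assumptions, not FlexRAG's actual API.
def answer(question: str, retriever, generator, reranker=None, k: int = 10) -> str:
    contexts = retriever.search(question, top_k=k)      # sparse or dense retrieval
    if reranker is not None:
        contexts = reranker.rerank(question, contexts)  # optional reranking stage
    prompt = "\n".join(c["text"] for c in contexts) + f"\n\nQuestion: {question}"
    return generator.generate(prompt)                   # answer from retrieved context
```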

## Sparse Retriever Benchmarks
> Experiment Settings: In these experiments, we employ Qwen/Qwen2-7B-Instruct as our generator. All other settings follow the defaults of `ModularAssistant`.

| Methods | PopQA(%) | | | NQ(%) | | | TriviaQA(%) | | | Average | | |
| ------------- | :------: | :---: | :-------: | :---: | :---: | :-------: | :---------: | :---: | :-------: | :-----: | :---: | :-------: |
| | F1 | EM | Recall@10 | F1 | EM | Recall@10 | F1 | EM | Recall@10 | F1 | EM | Recall@10 |
| BM25s+Lucene | 57.88 | 52.75 | 68.48 | 38.79 | 30.00 | 54.74 | 65.93 | 58.02 | 61.98 | 54.20 | 46.92 | 61.73 |
| BM25s+BM25+ | 57.88 | 52.75 | 68.48 | 38.79 | 30.00 | 54.74 | 65.92 | 58.01 | 61.99 | 54.20 | 46.92 | 61.74 |
| BM25s+BM25l | 57.97 | 53.04 | 66.55 | 36.54 | 28.12 | 50.39 | 62.70 | 54.75 | 58.15 | 52.40 | 45.30 | 58.36 |
| BM25s+Atire | 57.88 | 52.75 | 68.48 | 38.79 | 30.00 | 54.74 | 65.92 | 58.01 | 61.99 | 54.20 | 46.92 | 61.74 |
| ElasticSearch | 57.29 | 52.39 | 66.12 | 36.70 | 28.39 | 52.05 | 65.94 | 58.35 | 62.23 | 53.31 | 46.38 | 60.13 |
| Typesense | 19.41 | 17.80 | 26.38 | 20.57 | 14.88 | 15.87 | 44.69 | 38.49 | 20.48 | 28.22 | 23.72 | 20.91 |

Observations:
- BM25s+Lucene, BM25s+BM25+, and BM25s+Atire perform almost identically on all datasets.
- Typesense struggles to retrieve relevant documents when given natural-language queries.
- ElasticSearch delivers balanced performance across all three datasets and offers a wide range of retrieval features.

Conclusion:
Given its strong performance and simple installation, BM25S is the better choice of sparse retriever for research and prototyping. For more complex requirements, consider ElasticSearch, whose out-of-the-box configuration and rich feature set provide more powerful retrieval capabilities.
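
To illustrate how little setup BM25S needs, here is a minimal standalone sketch based on the `bm25s` Python package's documented interface; the corpus and query are toy stand-ins rather than the benchmark collection.

```python
import bm25s

# Toy corpus standing in for the benchmark retrieval collection.
corpus = [
    "Paris is the capital of France.",
    "The Eiffel Tower is located in Paris.",
    "Mount Everest is the highest mountain on Earth.",
]

retriever = bm25s.BM25()                 # defaults to the Lucene scoring variant
retriever.index(bm25s.tokenize(corpus))  # tokenize and build the index

query_tokens = bm25s.tokenize("What is the capital of France?")
results, scores = retriever.retrieve(query_tokens, corpus=corpus, k=2)
print(results[0], scores[0])             # top-2 passages and their BM25 scores
```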

## Dense Retriever Benchmarks
> Experiment Settings: In these experiments, we employ Qwen/Qwen2-7B-Instruct as our generator. All other settings follow the defaults of `ModularAssistant`.

| Methods | PopQA(%) | | | NQ(%) | | | TriviaQA(%) | | | Average | | |
| -------------------------------------------- | :------: | :---: | :-------: | :---: | :---: | :-------: | :---------: | :---: | :-------: | :-----: | :---: | :-------: |
| | F1 | EM | Recall@10 | F1 | EM | Recall@10 | F1 | EM | Recall@10 | F1 | EM | Recall@10 |
| facebook/contriever-msmarco | 64.14 | 59.04 | 80.77 | 49.67 | 39.03 | 75.65 | 70.36 | 62.55 | 68.26 | 61.39 | 53.54 | 74.89 |
| intfloat/e5-base-v2 | 59.74 | 54.25 | 77.20 | 50.05 | 39.56 | 78.84 | 71.66 | 63.79 | 70.63 | 60.48 | 52.53 | 75.56 |
| BAAI/bge-m3 | 63.65 | 58.76 | 83.42 | 50.98 | 40.36 | 80.00 | 71.92 | 63.85 | 71.10 | 62.18 | 54.32 | 78.17 |
| sentence-transformers/msmarco-MiniLM-L-12-v3 | 64.76 | 59.11 | 80.84 | 42.78 | 33.77 | 64.60 | 58.10 | 50.72 | 51.77 | 55.21 | 47.87 | 65.74 |
| nomic-ai/nomic-embed-text-v1.5 | 65.06 | 59.90 | 81.70 | 50.31 | 40.08 | 78.14 | 69.10 | 61.32 | 67.50 | 61.49 | 53.77 | 75.78 |
| jinaai/jina-embeddings-v3 | 67.43 | 62.33 | 86.20 | 50.02 | 40.17 | 81.52 | 70.06 | 62.14 | 79.51 | 62.50 | 54.88 | 82.41 |
| facebook/dragon-plus-query-encoder | 66.67 | 61.69 | 84.06 | 46.79 | 37.17 | 73.80 | 70.30 | 62.54 | 68.40 | 61.25 | 53.80 | 75.42 |

Observations:
- All dense retrievers outperform the sparse retrievers.
- jina-embeddings-v3 and BGE M3 achieve the best overall performance.
- MiniLM offers a balanced trade-off between performance and efficiency.

Conclusion:
We recommend facebook/contriever-msmarco or E5 for academic use, as they are used in many papers and strike a good balance between performance and efficiency. For building a prototype or production system, we recommend jina-embeddings-v3 or BGE M3, which achieve the best overall performance.
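
For reference, below is a minimal sketch of encoding queries with facebook/contriever-msmarco via Hugging Face Transformers. Contriever's model card uses mean pooling over token embeddings, and this snippet follows that recipe; dense retrieval then scores passages by the inner product of query and passage embeddings.

```python
import torch
from transformers import AutoModel, AutoTokenizer

def mean_pooling(token_embeddings: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Average token embeddings, ignoring padding positions.
    token_embeddings = token_embeddings.masked_fill(~mask[..., None].bool(), 0.0)
    return token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever-msmarco")
model = AutoModel.from_pretrained("facebook/contriever-msmarco")

queries = ["where was marie curie born?"]
inputs = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
embeddings = mean_pooling(outputs.last_hidden_state, inputs["attention_mask"])
print(embeddings.shape)  # (num_queries, hidden_size), e.g. (1, 768)
```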


## Index Benchmarks
> Experiment Settings: In these experiments, we employ Qwen/Qwen2-7B-Instruct as our generator and facebook/contriever-msmarco as our dense retriever. All other settings follow the defaults of `ModularAssistant`.

| Methods | PopQA(%) | | | NQ(%) | | | TriviaQA(%) | | | Average | | |
| ---------------------- | :------: | :---: | :---: | :---: | :---: | :---: | :---------: | :---: | :---: | :-----: | :---: | :---: |
| | F1 | EM | Succ | F1 | EM | Succ | F1 | EM | Succ | F1 | EM | Succ |
| FLAT | 63.65 | 58.40 | 82.20 | 49.20 | 39.11 | 77.95 | 70.61 | 62.70 | 80.03 | 61.15 | 53.40 | 80.06 |
| Faiss Auto(nprobe=32) | 51.50 | 47.03 | 67.19 | 48.17 | 37.89 | 75.21 | 69.34 | 61.56 | 78.36 | 56.34 | 48.83 | 73.59 |
| Faiss Auto(nprobe=128) | 59.91 | 54.97 | 76.20 | 49.05 | 38.53 | 77.23 | 70.14 | 62.31 | 79.49 | 59.70 | 51.94 | 77.64 |
| Faiss Auto(nprobe=512) | 64.14 | 59.04 | 81.42 | 49.62 | 39.11 | 77.87 | 70.48 | 62.57 | 79.80 | 61.41 | 53.57 | 79.70 |
| Faiss Refine | 64.11 | 58.90 | 81.27 | 48.91 | 38.34 | 77.81 | 70.24 | 62.43 | 79.89 | 61.09 | 53.22 | 79.66 |
| ScaNN | 63.26 | 58.11 | 82.13 | | | | | | | | | |
| Annoy(40000) | | | | | | | | | | | | |
| Annoy(400000) | | | | | | | | | | | | |
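
The nprobe values above control how many inverted-file clusters Faiss visits per query: larger values approach the recall of the exhaustive FLAT index at the cost of latency, which matches the trend in the table. Below is a minimal sketch of that trade-off with a plain Faiss IVF index; the random vectors are stand-ins, and how the Faiss Auto setting actually builds its index may differ.

```python
import faiss
import numpy as np

d = 768                                             # embedding dimension
xb = np.random.rand(100_000, d).astype("float32")   # stand-in passage embeddings
xq = np.random.rand(4, d).astype("float32")         # stand-in query embeddings

quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024, faiss.METRIC_INNER_PRODUCT)
index.train(xb)   # learn the coarse clustering
index.add(xb)

for nprobe in (32, 128, 512):
    index.nprobe = nprobe               # clusters visited per query
    scores, ids = index.search(xq, 10)  # higher nprobe -> recall closer to FLAT
```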


## Reranker Benchmarks
> Experiment Settings: In these experiments, we employ Qwen/Qwen2-7B-Instruct, facebook/contriever-msmarco, and Faiss Auto(nprobe=512) as our generator, dense retriever, and index, respectively. All other settings follow the defaults of `ModularAssistant`.

| Methods | PopQA(%) | | | NQ(%) | | | TriviaQA(%) | | | Average | | |
| ----------------------------------------- | :------: | :---: | :---: | :---: | :---: | :---: | :---------: | :---: | :---: | :-----: | :---: | :---: |
| | F1 | EM | Succ | F1 | EM | Succ | F1 | EM | Succ | F1 | EM | Succ |
| BAAI/bge-reranker-v2-m3 | 66.02 | 60.76 | 86.92 | 50.94 | 40.53 | 81.91 | 74.58 | 66.71 | 84.81 | 63.85 | 56.00 | 84.55 |
| colbert-ir/colbertv2.0 | 65.44 | 60.47 | 83.56 | 47.18 | 37.06 | 77.53 | 72.13 | 64.24 | 81.47 | 61.58 | 53.92 | 80.85 |
| jinaai/jina-reranker-v2-base-multilingual | 66.31 | 60.97 | 86.49 | 49.35 | 38.78 | 81.00 | 73.00 | 65.01 | 83.03 | 62.89 | 54.92 | 83.51 |
| jinaai/jina-colbert-v2 | 66.73 | 61.47 | 85.78 | 49.59 | 39.20 | 79.86 | 73.24 | 65.36 | 82.96 | 63.19 | 55.34 | 82.87 |
| unicamp-dl/InRanker-base | 66.05 | 60.90 | 86.63 | 48.77 | 38.50 | 79.78 | 73.38 | 65.47 | 83.20 | 62.73 | 54.96 | 83.20 |
| rankGPT(Qwen/Qwen2-7B-Instruct) | 63.11 | 58.26 | 77.91 | 49.50 | 39.06 | 75.90 | 70.13 | 62.31 | 79.11 | 60.91 | 53.21 | 77.64 |

Observations:
- Adding a reranker significantly improves the performance of the retrieval system.
- Cross-encoder-based rerankers outperform the other rerankers.
- BGE-reranker-M3 achieves the best performance across all datasets.
- rankGPT depends heavily on the quality of the underlying generator and has the highest overhead.

Conclusion:
We recommend using rerankers in latency-insensitive scenarios. For building a prototype, we recommend BGE-reranker-M3 or jina-reranker, which deliver the best performance across all datasets.
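
As an illustration, here is a minimal sketch of reranking retrieved passages with BAAI/bge-reranker-v2-m3, assuming the `FlagEmbedding` package's documented interface; the query and passages are toy examples.

```python
from FlagEmbedding import FlagReranker

# use_fp16=True speeds up GPU inference at a small precision cost.
reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

query = "where was marie curie born?"
passages = [
    "Marie Curie was born in Warsaw in 1867.",
    "The curie is a unit of radioactivity.",
]

# Higher score = more relevant; sort retrieved passages by score.
scores = reranker.compute_score([[query, p] for p in passages])
reranked = [p for _, p in sorted(zip(scores, passages), reverse=True)]
```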

## Generator Benchmarks
> Experiment Settings: In these experiments, we employ facebook/contriever-msmarco as our dense retriever. All other settings follow the defaults of `ModularAssistant`. Specifically, the Llama-3.3-70B-Instruct and Qwen2.5-72B-Instruct models are deployed with Ollama using 4-bit quantization, while the remaining models are deployed with vLLM and accessed through its OpenAI-compatible API.

| Methods | PopQA(%) | | | NQ(%) | | | TriviaQA(%) | | | Average | | |
| ------------------------------------- | :------: | :---: | :---: | :---: | :---: | :---: | :---------: | :---: | :---: | :-----: | :---: | :---: |
| | F1 | EM | Succ | F1 | EM | Succ | F1 | EM | Succ | F1 | EM | Succ |
| Qwen/Qwen2-7B-Instruct \* | 22.18 | 20.01 | - | 24.95 | 16.73 | - | 50.03 | 42.91 | - | 32.39 | 26.55 | - |
| Qwen/Qwen2-7B-Instruct | 64.14 | 59.04 | 81.42 | 49.62 | 39.11 | 77.87 | 70.48 | 62.57 | 79.80 | 61.41 | 53.57 | 79.70 |
| Qwen/Qwen2.5-7B-Instruct \* | 21.68 | 19.30 | - | 24.33 | 15.57 | - | 51.02 | 44.29 | - | 32.34 | 26.39 | - |
| Qwen/Qwen2.5-7B-Instruct | 61.89 | 55.18 | 81.42 | 47.89 | 36.79 | 77.87 | 70.35 | 62.32 | 79.80 | 60.04 | 51.43 | 79.70 |
| Qwen/Qwen2.5-72B-Instruct \* | 3.10 | 0.00 | - | 4.26 | 0.00 | - | 9.31 | 0.01 | - | 5.56 | 0.00 | - |
| Qwen/Qwen2.5-72B-Instruct | 13.97 | 0.00 | 81.42 | 7.06 | 0.00 | 77.87 | 13.77 | 0.03 | 79.80 | 11.60 | 0.01 | 79.70 |
| meta-llama/Llama-3.1-8B-Instruct \* | 22.08 | 19.44 | - | 34.33 | 23.41 | - | 64.46 | 56.52 | - | 40.29 | 33.12 | - |
| meta-llama/Llama-3.1-8B-Instruct | 63.20 | 55.83 | 81.42 | 47.58 | 35.73 | 77.87 | 71.75 | 62.97 | 79.80 | 60.84 | 51.51 | 79.70 |
| meta-llama/Llama-3.3-70B-Instruct \* | 30.14 | 27.81 | - | 46.40 | 32.30 | - | 79.60 | 72.04 | - | 52.05 | 44.05 | - |
| meta-llama/Llama-3.3-70B-Instruct | 64.95 | 56.83 | 81.42 | 51.29 | 37.40 | 77.87 | 77.11 | 68.20 | 79.80 | 64.45 | 54.14 | 79.70 |
| mistralai/Mistral-7B-Instruct-v0.3 \* | 21.21 | 18.08 | - | 25.87 | 14.96 | - | 59.66 | 50.42 | - | 35.58 | 27.82 | - |
| mistralai/Mistral-7B-Instruct-v0.3 | 54.65 | 43.03 | 81.42 | 38.92 | 24.71 | 77.87 | 67.28 | 56.26 | 79.80 | 53.62 | 41.33 | 79.70 |
| nvidia/Llama3-ChatQA-2-8B \* | 22.70 | 17.08 | - | 28.41 | 18.70 | - | 59.30 | 49.99 | - | 36.80 | 28.59 | - |
| nvidia/Llama3-ChatQA-2-8B | 60.36 | 53.82 | 81.42 | 49.84 | 39.09 | 77.87 | 71.84 | 62.67 | 79.80 | 60.68 | 51.86 | 79.70 |

> \* marks runs without retrieval (closed-book generation).

Observations:
- Without retrieval, models like Llama and Mistral achieve relatively high answer accuracy, and their performance continues to improve as model size increases.
- With retrieval, Qwen2-7B-Instruct shows the most significant improvement.
- Qwen2.5-72B failed to follow the instruction to generate a brief answer and instead produced longer responses, causing a sharp decline in both F1 and EM; this may be related to the use of a quantized model.
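
The collapse in the last observation is easiest to see from how the metrics are computed. Below is a sketch of the standard SQuAD-style EM and token-level F1 commonly used for such QA evaluations (FlexRAG's exact implementation may differ in details): a verbose answer shares few tokens with a short gold answer, so both metrics drop even when the correct span is contained in the response.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    # Lowercase, strip punctuation and articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1(pred: str, gold: str) -> float:
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

# A verbose answer scores EM = 0.0 and F1 ~= 0.22, despite containing the gold span:
pred = "The capital of France is the beautiful city of Paris."
print(exact_match(pred, "Paris"), round(f1(pred, "Paris"), 2))
```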

23 changes: 11 additions & 12 deletions docs/source/conf.py

@@ -6,11 +6,11 @@
 # -- Project information -----------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
 
-project = 'FlexRAG Documentation'
-html_short_title = 'FlexRAG Documentation'
-copyright = '2025, ZhuochengZhang'
-author = 'ZhuochengZhang'
-release = '0.1.2'
+project = "FlexRAG Documentation"
+html_short_title = "FlexRAG Documentation"
+copyright = "2025, ZhuochengZhang"
+author = "ZhuochengZhang"
+release = "0.1.2"
 
 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
@@ -20,17 +20,16 @@
     "myst_parser",
 ]
 
-templates_path = ['_templates']
+templates_path = ["_templates"]
 exclude_patterns = []
 
-
 
 # -- Options for HTML output -------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
 
-html_theme = 'piccolo_theme'
-html_static_path = ['_static']
+html_theme = "piccolo_theme"
+html_static_path = ["_static", "../../assets"]
 html_theme_options = {
-    "source_url": 'https://github.com/ictnlp/flexrag',
-    "source_icon": "github"
-}
+    "source_url": "https://github.com/ictnlp/flexrag",
+    "source_icon": "github",
+}
2 changes: 1 addition & 1 deletion docs/source/index.rst

@@ -2,7 +2,7 @@
    sphinx-quickstart on Thu Jan 2 10:05:21 2025.
    You can adapt this file completely to your liking, but it should at least
    contain the root `toctree` directive.
-.. image:: http://cdn.zhangzhuocheng.top/flexrag-wide.png
+.. image:: ../../assets/flexrag-wide.png
    :alt: FlexRAG
    :align: center
9 changes: 6 additions & 3 deletions src/flexrag/retriever/index/scann_index.py

@@ -1,12 +1,13 @@
 import os
+import re
 import shutil
 from dataclasses import dataclass
 
 import numpy as np
 
 from flexrag.utils import LOGGER_MANAGER
 
-from .index_base import DenseIndexBase, DenseIndexBaseConfig, DENSE_INDEX
+from .index_base import DENSE_INDEX, DenseIndexBase, DenseIndexBaseConfig
 
 logger = LOGGER_MANAGER.get_logger("flexrag.retrievers.index.scann")

@@ -120,12 +121,14 @@ def clean(self):
             return
         if os.path.exists(self.index_path):
             shutil.rmtree(self.index_path)
-        self.index = self._prepare_index()
+        self.index = None
         return
 
     @property
     def embedding_size(self) -> int:
-        return self.index.num_columns()
+        if self.index is None:
+            raise RuntimeError("Index is not built yet.")
+        return int(re.search("input_dim: [0-9]+", self.index.config()).group()[11:])
 
     @property
     def is_trained(self) -> bool:
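
For context on the new `embedding_size` implementation: since `clean()` now leaves `self.index` as `None` rather than rebuilding it, the property first guards against an unbuilt index, then recovers the dimensionality by parsing ScaNN's textual config. A small sketch of that parsing step, where the config string is a hypothetical excerpt:

```python
import re

# Hypothetical excerpt of what a ScaNN index's config() text might contain.
config_text = "num_neighbors: 10 input_dim: 768 distance_measure: dot_product"

match = re.search("input_dim: [0-9]+", config_text)
embedding_size = int(match.group()[11:])  # drop the 11-char "input_dim: " prefix
print(embedding_size)  # 768
```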