diff --git a/README-zh.md b/README-zh.md
index 451c571..b69285f 100644
--- a/README-zh.md
+++ b/README-zh.md
@@ -91,7 +91,7 @@ python -m flexrag.entrypoints.prepare_index \
     saving_fields=$CORPUS_FIELDS \
     retriever_type=dense \
     dense_config.database_path=$DB_PATH \
-    dense_config.encode_fields='[text]' \
+    dense_config.encode_fields=[text] \
     dense_config.passage_encoder_config.encoder_type=hf \
     dense_config.passage_encoder_config.hf_config.model_path='facebook/contriever' \
     dense_config.passage_encoder_config.hf_config.device_id=[0,1,2,3] \
@@ -114,7 +114,7 @@ python -m flexrag.entrypoints.prepare_index \
     saving_fields=$CORPUS_FIELDS \
     retriever_type=bm25s \
     bm25s_config.database_path=$DB_PATH \
-    bm25s_config.indexed_fields='[title,text]' \
+    bm25s_config.indexed_fields=[title,text] \
     bm25s_config.method=lucene \
     bm25s_config.batch_size=512 \
     bm25s_config.log_interval=100000 \
@@ -312,7 +312,7 @@ FlexRAG 采用**模块化**架构设计,让您可以轻松定制和扩展框

 # 📊 基准测试
-我们利用 FlexRAG 进行了大量的基准测试,详情请参考 [benchmarks](benchmarks.md) 页面。
+我们利用 FlexRAG 进行了大量的基准测试,详情请参考 [benchmarks](benchmarks/README.md) 页面。
 
 # 🏷️ 许可证
 本仓库采用 **MIT License** 开源协议. 详情请参考 [LICENSE](LICENSE) 文件。
diff --git a/README.md b/README.md
index 7fbcc19..3bb43ca 100644
--- a/README.md
+++ b/README.md
@@ -93,7 +93,7 @@ python -m flexrag.entrypoints.prepare_index \
     saving_fields=$CORPUS_FIELDS \
     retriever_type=dense \
     dense_config.database_path=$DB_PATH \
-    dense_config.encode_fields='[text]' \
+    dense_config.encode_fields=[text] \
     dense_config.passage_encoder_config.encoder_type=hf \
     dense_config.passage_encoder_config.hf_config.model_path='facebook/contriever' \
     dense_config.passage_encoder_config.hf_config.device_id=[0,1,2,3] \
@@ -116,7 +116,7 @@ python -m flexrag.entrypoints.prepare_index \
     saving_fields=$CORPUS_FIELDS \
     retriever_type=bm25s \
     bm25s_config.database_path=$DB_PATH \
-    bm25s_config.indexed_fields='[title,text]' \
+    bm25s_config.indexed_fields=[title,text] \
     bm25s_config.method=lucene \
     bm25s_config.batch_size=512 \
     bm25s_config.log_interval=100000 \
@@ -315,7 +315,7 @@ FlexRAG is designed with a **modular** architecture, allowing you to easily cust

 # 📊 Benchmarks
-We have conducted extensive benchmarks using the FlexRAG framework. For more details, please refer to the [benchmarks](benchmarks.md) page.
+We have conducted extensive benchmarks using the FlexRAG framework. For more details, please refer to the [benchmarks](benchmarks/README.md) page.
 
 # 🏷️ License
 This repository is licensed under the **MIT License**. See the [LICENSE](LICENSE) file for details.
diff --git a/benchmarks.md b/benchmarks.md
deleted file mode 100644
index 19c355a..0000000
--- a/benchmarks.md
+++ /dev/null
@@ -1 +0,0 @@
-Comming Soon
\ No newline at end of file
diff --git a/benchmarks/README.md b/benchmarks/README.md
new file mode 100644
index 0000000..29fcbc3
--- /dev/null
+++ b/benchmarks/README.md
@@ -0,0 +1,7 @@
+# Benchmarks
+This directory contains benchmarks for the FlexRAG framework. We conduct these experiments for the following reasons:
+1. We hope to help users gain a deeper understanding of the various components in RAG and of their impact on the overall RAG system.
+2. We aim to exercise the FlexRAG framework itself, which helps us catch potential bugs.
+
+## Directory Structure
+- [`singlehop_qa.md`](singlehop_qa.md): Benchmark results for the single-hop QA task.
\ No newline at end of file
diff --git a/benchmarks/singlehop_qa.md b/benchmarks/singlehop_qa.md
new file mode 100644
index 0000000..3033b52
--- /dev/null
+++ b/benchmarks/singlehop_qa.md
@@ -0,0 +1,115 @@
+# Benchmark for Single-Hop QA Tasks
+To better understand the performance of each FlexRAG component within the RAG pipeline, we have conducted a series of experiments on a variety of single-hop QA datasets, including PopQA, Natural Questions (NQ), and TriviaQA. The experiments are divided into five categories: sparse retriever benchmarks, dense retriever benchmarks, index benchmarks, reranker benchmarks, and generator benchmarks. We also provide best practices for single-hop QA tasks.
+
+All experiments are conducted with the `ModularAssistant` in the FlexRAG framework.
+
+## Sparse Retriever Benchmarks
+> Experiment Settings: In these experiments, we employ Qwen/Qwen2-7B-Instruct as our generator. All other settings are the defaults of `ModularAssistant`.
+
+| Methods       | PopQA(%) |       |           | NQ(%) |       |           | TriviaQA(%) |       |           | Average |       |           |
+| ------------- | :------: | :---: | :-------: | :---: | :---: | :-------: | :---------: | :---: | :-------: | :-----: | :---: | :-------: |
+|               | F1       | EM    | Recall@10 | F1    | EM    | Recall@10 | F1          | EM    | Recall@10 | F1      | EM    | Recall@10 |
+| BM25s+Lucene  | 57.88    | 52.75 | 68.48     | 38.79 | 30.00 | 54.74     | 65.93       | 58.02 | 61.98     | 54.20   | 46.92 | 61.73     |
+| BM25s+BM25+   | 57.88    | 52.75 | 68.48     | 38.79 | 30.00 | 54.74     | 65.92       | 58.01 | 61.99     | 54.20   | 46.92 | 61.74     |
+| BM25s+BM25l   | 57.97    | 53.04 | 66.55     | 36.54 | 28.12 | 50.39     | 62.70       | 54.75 | 58.15     | 52.40   | 45.30 | 58.36     |
+| BM25s+Atire   | 57.88    | 52.75 | 68.48     | 38.79 | 30.00 | 54.74     | 65.92       | 58.01 | 61.99     | 54.20   | 46.92 | 61.74     |
+| ElasticSearch | 57.29    | 52.39 | 66.12     | 36.70 | 28.39 | 52.05     | 65.94       | 58.35 | 62.23     | 53.31   | 46.38 | 60.13     |
+| Typesense     | 19.41    | 17.80 | 26.38     | 20.57 | 14.88 | 15.87     | 44.69       | 38.49 | 20.48     | 28.22   | 23.72 | 20.91     |
+
+Observations:
+- BM25s+Lucene, BM25s+BM25+, and BM25s+Atire perform nearly identically on all datasets.
+- Typesense struggles to retrieve relevant documents when given natural-language queries.
+- ElasticSearch delivers balanced performance across all three datasets and offers a wide range of retrieval features.
+
+Conclusion:
+Considering its excellent performance and simple installation, we believe BM25S is the better choice for a sparse retriever in research and prototyping scenarios. For more complex requirements, try ElasticSearch, whose out-of-the-box configuration and rich feature set provide more powerful retrieval capabilities.
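+
+For reference, the sketch below restates the BM25S indexing command from the README (shown earlier in this diff). It is a partial sketch: the README command is truncated in the hunk above, so only the flags visible there are repeated, and `$CORPUS_FIELDS` / `$DB_PATH` are placeholders to be set by the user.
+
+```bash
+# Build a BM25S index with the Lucene scoring variant; swapping
+# bm25s_config.method is assumed to be how the BM25+, BM25L, and ATIRE
+# rows above were produced.
+python -m flexrag.entrypoints.prepare_index \
+    saving_fields=$CORPUS_FIELDS \
+    retriever_type=bm25s \
+    bm25s_config.database_path=$DB_PATH \
+    bm25s_config.indexed_fields=[title,text] \
+    bm25s_config.method=lucene \
+    bm25s_config.batch_size=512 \
+    bm25s_config.log_interval=100000
+```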
+
+## Dense Retriever Benchmarks
+> Experiment Settings: In these experiments, we employ Qwen/Qwen2-7B-Instruct as our generator. All other settings are the defaults of `ModularAssistant`.
+
+| Methods                                      | PopQA(%) |       |           | NQ(%) |       |           | TriviaQA(%) |       |           | Average |       |           |
+| -------------------------------------------- | :------: | :---: | :-------: | :---: | :---: | :-------: | :---------: | :---: | :-------: | :-----: | :---: | :-------: |
+|                                              | F1       | EM    | Recall@10 | F1    | EM    | Recall@10 | F1          | EM    | Recall@10 | F1      | EM    | Recall@10 |
+| facebook/contriever-msmarco                  | 64.14    | 59.04 | 80.77     | 49.67 | 39.03 | 75.65     | 70.36       | 62.55 | 68.26     | 61.39   | 53.54 | 74.89     |
+| intfloat/e5-base-v2                          | 59.74    | 54.25 | 77.20     | 50.05 | 39.56 | 78.84     | 71.66       | 63.79 | 70.63     | 60.48   | 52.53 | 75.56     |
+| BAAI/bge-m3                                  | 63.65    | 58.76 | 83.42     | 50.98 | 40.36 | 80.00     | 71.92       | 63.85 | 71.10     | 62.18   | 54.32 | 78.17     |
+| sentence-transformers/msmarco-MiniLM-L-12-v3 | 64.76    | 59.11 | 80.84     | 42.78 | 33.77 | 64.60     | 58.10       | 50.72 | 51.77     | 55.21   | 47.87 | 65.74     |
+| nomic-ai/nomic-embed-text-v1.5               | 65.06    | 59.90 | 81.70     | 50.31 | 40.08 | 78.14     | 69.10       | 61.32 | 67.50     | 61.49   | 53.77 | 75.78     |
+| jinaai/jina-embeddings-v3                    | 67.43    | 62.33 | 86.20     | 50.02 | 40.17 | 81.52     | 70.06       | 62.14 | 79.51     | 62.50   | 54.88 | 82.41     |
+| facebook/dragon-plus-query-encoder           | 66.67    | 61.69 | 84.06     | 46.79 | 37.17 | 73.80     | 70.30       | 62.54 | 68.40     | 61.25   | 53.80 | 75.42     |
+
+Observations:
+- All dense retrievers outperform the sparse retrievers evaluated above.
+- jina-embeddings-v3 and BGE-M3 achieve the best average performance.
+- MiniLM offers a balanced trade-off between performance and efficiency.
+
+Conclusion:
+We recommend facebook/contriever-msmarco or E5 for academic use, as they appear in many papers and strike a good balance between performance and efficiency. For building a prototype or production system, we recommend jina-embeddings-v3 or BGE-M3, which achieve the best overall performance.
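+
+The dense indexes above are built analogously to the dense indexing command in the README (shown earlier in this diff); we assume each encoder in the table is swapped in via `model_path` as below. This is again a partial sketch with placeholder variables, and the BAAI/bge-m3 value is only an example.
+
+```bash
+# Build a dense index with an alternative HuggingFace passage encoder.
+python -m flexrag.entrypoints.prepare_index \
+    saving_fields=$CORPUS_FIELDS \
+    retriever_type=dense \
+    dense_config.database_path=$DB_PATH \
+    dense_config.encode_fields=[text] \
+    dense_config.passage_encoder_config.encoder_type=hf \
+    dense_config.passage_encoder_config.hf_config.model_path='BAAI/bge-m3' \
+    dense_config.passage_encoder_config.hf_config.device_id=[0,1,2,3]
+```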
+
+## Index Benchmarks
+> Experiment Settings: In these experiments, we employ Qwen/Qwen2-7B-Instruct as our generator and facebook/contriever-msmarco as our dense retriever. All other settings are the defaults of `ModularAssistant`.
+
+| Methods                | PopQA(%) |       |       | NQ(%) |       |       | TriviaQA(%) |       |       | Average |       |       |
+| ---------------------- | :------: | :---: | :---: | :---: | :---: | :---: | :---------: | :---: | :---: | :-----: | :---: | :---: |
+|                        | F1       | EM    | Succ  | F1    | EM    | Succ  | F1          | EM    | Succ  | F1      | EM    | Succ  |
+| FLAT                   | 63.65    | 58.40 | 82.20 | 49.20 | 39.11 | 77.95 | 70.61       | 62.70 | 80.03 | 61.15   | 53.40 | 80.06 |
+| Faiss Auto(nprobe=32)  | 51.50    | 47.03 | 67.19 | 48.17 | 37.89 | 75.21 | 69.34       | 61.56 | 78.36 | 56.34   | 48.83 | 73.59 |
+| Faiss Auto(nprobe=128) | 59.91    | 54.97 | 76.20 | 49.05 | 38.53 | 77.23 | 70.14       | 62.31 | 79.49 | 59.70   | 51.94 | 77.64 |
+| Faiss Auto(nprobe=512) | 64.14    | 59.04 | 81.42 | 49.62 | 39.11 | 77.87 | 70.48       | 62.57 | 79.80 | 61.41   | 53.57 | 79.70 |
+| Faiss Refine           | 64.11    | 58.90 | 81.27 | 48.91 | 38.34 | 77.81 | 70.24       | 62.43 | 79.89 | 61.09   | 53.22 | 79.66 |
+| ScaNN                  | 63.26    | 58.11 | 82.13 |       |       |       |             |       |       |         |       |       |
+| Annoy(40000)           |          |       |       |       |       |       |             |       |       |         |       |       |
+| Annoy(400000)          |          |       |       |       |       |       |             |       |       |         |       |       |
+
+> Note: blank cells are results that have not been collected yet.
+
+## Reranker Benchmarks
+> Experiment Settings: In these experiments, we employ Qwen/Qwen2-7B-Instruct as the generator, facebook/contriever-msmarco as the dense retriever, and Faiss Auto(nprobe=512) as the index. All other settings are the defaults of `ModularAssistant`.
+
+| Methods                                   | PopQA(%) |       |       | NQ(%) |       |       | TriviaQA(%) |       |       | Average |       |       |
+| ----------------------------------------- | :------: | :---: | :---: | :---: | :---: | :---: | :---------: | :---: | :---: | :-----: | :---: | :---: |
+|                                           | F1       | EM    | Succ  | F1    | EM    | Succ  | F1          | EM    | Succ  | F1      | EM    | Succ  |
+| BAAI/bge-reranker-v2-m3                   | 66.02    | 60.76 | 86.92 | 50.94 | 40.53 | 81.91 | 74.58       | 66.71 | 84.81 | 63.85   | 56.00 | 84.55 |
+| colbert-ir/colbertv2.0                    | 65.44    | 60.47 | 83.56 | 47.18 | 37.06 | 77.53 | 72.13       | 64.24 | 81.47 | 61.58   | 53.92 | 80.85 |
+| jinaai/jina-reranker-v2-base-multilingual | 66.31    | 60.97 | 86.49 | 49.35 | 38.78 | 81.00 | 73.00       | 65.01 | 83.03 | 62.89   | 54.92 | 83.51 |
+| jinaai/jina-colbert-v2                    | 66.73    | 61.47 | 85.78 | 49.59 | 39.20 | 79.86 | 73.24       | 65.36 | 82.96 | 63.19   | 55.34 | 82.87 |
+| unicamp-dl/InRanker-base                  | 66.05    | 60.90 | 86.63 | 48.77 | 38.50 | 79.78 | 73.38       | 65.47 | 83.20 | 62.73   | 54.96 | 83.20 |
+| rankGPT(Qwen/Qwen2-7B-Instruct)           | 63.11    | 58.26 | 77.91 | 49.50 | 39.06 | 75.90 | 70.13       | 62.31 | 79.11 | 60.91   | 53.21 | 77.64 |
+
+Observations:
+- Adding a reranker significantly improves the performance of the retrieval system.
+- Cross-encoder-based rerankers outperform the other reranker types.
+- BGE-reranker-v2-m3 achieves the best average performance.
+- rankGPT depends heavily on the quality of the underlying generator and has the highest overhead.
+
+Conclusion:
+We recommend using rerankers in latency-insensitive scenarios. For building a prototype, we recommend BGE-reranker-v2-m3 or the jina rerankers, which achieve the best overall performance.
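+
+This diff does not show how a reranker is attached to the `ModularAssistant`, so the sketch below is illustrative only: the entrypoint and the `reranker_type`/`reranker_config.*` keys are hypothetical names modeled on the retriever flags above, not verified FlexRAG options.
+
+```bash
+# HYPOTHETICAL flags: modeled on the naming convention of the commands
+# shown earlier; consult the FlexRAG documentation for the real options.
+python -m flexrag.entrypoints.run_assistant \
+    reranker_type=cross_encoder \
+    reranker_config.model_path='BAAI/bge-reranker-v2-m3'
+```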
+
+## Generator Benchmarks
+> Experiment Settings: In these experiments, we employ facebook/contriever-msmarco as our dense retriever. All other settings are the defaults of `ModularAssistant`. Specifically, the Llama-3.3-70B-Instruct and Qwen2.5-72B-Instruct models are deployed with Ollama using 4-bit quantization, while the remaining models are deployed with vLLM and queried through the OpenAI-compatible API it provides.
+
+| Methods                               | PopQA(%) |       |       | NQ(%) |       |       | TriviaQA(%) |       |       | Average |       |       |
+| ------------------------------------- | :------: | :---: | :---: | :---: | :---: | :---: | :---------: | :---: | :---: | :-----: | :---: | :---: |
+|                                       | F1       | EM    | Succ  | F1    | EM    | Succ  | F1          | EM    | Succ  | F1      | EM    | Succ  |
+| Qwen/Qwen2-7B-Instruct \*             | 22.18    | 20.01 | -     | 24.95 | 16.73 | -     | 50.03       | 42.91 | -     | 32.39   | 26.55 | -     |
+| Qwen/Qwen2-7B-Instruct                | 64.14    | 59.04 | 81.42 | 49.62 | 39.11 | 77.87 | 70.48       | 62.57 | 79.80 | 61.41   | 53.57 | 79.70 |
+| Qwen/Qwen2.5-7B-Instruct \*           | 21.68    | 19.30 | -     | 24.33 | 15.57 | -     | 51.02       | 44.29 | -     | 32.34   | 26.39 | -     |
+| Qwen/Qwen2.5-7B-Instruct              | 61.89    | 55.18 | 81.42 | 47.89 | 36.79 | 77.87 | 70.35       | 62.32 | 79.80 | 60.04   | 51.43 | 79.70 |
+| Qwen/Qwen2.5-72B-Instruct \*          | 3.10     | 0.00  | -     | 4.26  | 0.00  | -     | 9.31        | 0.01  | -     | 5.56    | 0.00  | -     |
+| Qwen/Qwen2.5-72B-Instruct             | 13.97    | 0.00  | 81.42 | 7.06  | 0.00  | 77.87 | 13.77       | 0.03  | 79.80 | 11.60   | 0.01  | 79.70 |
+| meta-llama/Llama-3.1-8B-Instruct \*   | 22.08    | 19.44 | -     | 34.33 | 23.41 | -     | 64.46       | 56.52 | -     | 40.29   | 33.12 | -     |
+| meta-llama/Llama-3.1-8B-Instruct      | 63.20    | 55.83 | 81.42 | 47.58 | 35.73 | 77.87 | 71.75       | 62.97 | 79.80 | 60.84   | 51.51 | 79.70 |
+| meta-llama/Llama-3.3-70B-Instruct \*  | 30.14    | 27.81 | -     | 46.40 | 32.30 | -     | 79.60       | 72.04 | -     | 52.05   | 44.05 | -     |
+| meta-llama/Llama-3.3-70B-Instruct     | 64.95    | 56.83 | 81.42 | 51.29 | 37.40 | 77.87 | 77.11       | 68.20 | 79.80 | 64.45   | 54.14 | 79.70 |
+| mistralai/Mistral-7B-Instruct-v0.3 \* | 21.21    | 18.08 | -     | 25.87 | 14.96 | -     | 59.66       | 50.42 | -     | 35.58   | 27.82 | -     |
+| mistralai/Mistral-7B-Instruct-v0.3    | 54.65    | 43.03 | 81.42 | 38.92 | 24.71 | 77.87 | 67.28       | 56.26 | 79.80 | 53.62   | 41.33 | 79.70 |
+| nvidia/Llama3-ChatQA-2-8B \*          | 22.70    | 17.08 | -     | 28.41 | 18.70 | -     | 59.30       | 49.99 | -     | 36.80   | 28.59 | -     |
+| nvidia/Llama3-ChatQA-2-8B             | 60.36    | 53.82 | 81.42 | 49.84 | 39.09 | 77.87 | 71.84       | 62.67 | 79.80 | 60.68   | 51.86 | 79.70 |
+
+> \* marks runs without retrieval, i.e., the generator answers the question directly.
+
+Observations:
+- Without retrieval, models such as Llama and Mistral already answer with relatively high accuracy, and their accuracy continues to improve as model size increases.
+- With retrieval, Qwen2-7B-Instruct shows the most significant improvement.
+- Qwen2.5-72B-Instruct failed to follow the instruction to generate a brief answer and produced longer responses instead, causing a sharp drop in both F1 and EM. This may be related to the use of a quantized model.
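+
+For reference, the two serving setups described above can be reproduced roughly as follows. The model names are taken from the table; the ports and tags are common defaults rather than the exact values used in our runs.
+
+```bash
+# Serve a 7B model with vLLM; it exposes an OpenAI-compatible API
+# (port 8000 by default) that the assistant can query.
+python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2-7B-Instruct
+
+# Pull a 70B model with Ollama; the default tags are 4-bit quantized
+# builds, and Ollama serves its HTTP API on port 11434.
+ollama pull llama3.3:70b
+```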
+
diff --git a/docs/source/conf.py b/docs/source/conf.py
index 0049fc4..ca0b994 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -6,11 +6,11 @@
 # -- Project information -----------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
 
-project = 'FlexRAG Documentation'
-html_short_title = 'FlexRAG Documentation'
-copyright = '2025, ZhuochengZhang'
-author = 'ZhuochengZhang'
-release = '0.1.2'
+project = "FlexRAG Documentation"
+html_short_title = "FlexRAG Documentation"
+copyright = "2025, ZhuochengZhang"
+author = "ZhuochengZhang"
+release = "0.1.2"
 
 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
@@ -20,17 +20,16 @@
     "myst_parser",
 ]
 
-templates_path = ['_templates']
+templates_path = ["_templates"]
 exclude_patterns = []
 
-
 # -- Options for HTML output -------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
 
-html_theme = 'piccolo_theme'
-html_static_path = ['_static']
+html_theme = "piccolo_theme"
+html_static_path = ["_static", "../../assets"]
 html_theme_options = {
-    "source_url": 'https://github.com/ictnlp/flexrag',
-    "source_icon": "github"
-}
\ No newline at end of file
+    "source_url": "https://github.com/ictnlp/flexrag",
+    "source_icon": "github",
+}
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 581cba4..b11cb80 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -2,7 +2,7 @@
    sphinx-quickstart on Thu Jan  2 10:05:21 2025.
    You can adapt this file completely to your liking, but it should at least
    contain the root `toctree` directive.
 
-.. image:: http://cdn.zhangzhuocheng.top/flexrag-wide.png
+.. image:: ../../assets/flexrag-wide.png
    :alt: FlexRAG
    :align: center
diff --git a/src/flexrag/retriever/index/scann_index.py b/src/flexrag/retriever/index/scann_index.py
index f9d68d5..9a3e97f 100644
--- a/src/flexrag/retriever/index/scann_index.py
+++ b/src/flexrag/retriever/index/scann_index.py
@@ -1,4 +1,5 @@
 import os
+import re
 import shutil
 from dataclasses import dataclass
 
@@ -6,7 +7,7 @@
 
 from flexrag.utils import LOGGER_MANAGER
 
-from .index_base import DenseIndexBase, DenseIndexBaseConfig, DENSE_INDEX
+from .index_base import DENSE_INDEX, DenseIndexBase, DenseIndexBaseConfig
 
 logger = LOGGER_MANAGER.get_logger("flexrag.retrievers.index.scann")
 
@@ -120,12 +121,14 @@ def clean(self):
             return
         if os.path.exists(self.index_path):
             shutil.rmtree(self.index_path)
-        self.index = self._prepare_index()
+        self.index = None
         return
 
     @property
     def embedding_size(self) -> int:
-        return self.index.num_columns()
+        if self.index is None:
+            raise RuntimeError("Index is not built yet.")
+        return int(re.search("input_dim: [0-9]+", self.index.config()).group()[11:])
 
     @property
     def is_trained(self) -> bool: