Added support for the HyDE method in query analysis for the RAG module #1413

Open
wants to merge 17 commits into
base: main

lanlanguai

Features
Added the HyDE method for query-analysis in the RAG module, including an example for better understanding.
Fixed a TypeError in TestRAGEmbeddingFactory caused by static methods not being callable. The previous code passed static methods as parameters for parameterized testing, but a staticmethod object referenced inside the class body is not directly callable on Python versions before 3.10, which raised a TypeError. This was resolved by converting the static methods to regular functions defined outside the class.
Feature Docs
No additional documentation provided.

Influence
As an optional process in RAG, query-analysis will rewrite queries to enhance search results.
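
To illustrate what this query rewrite does, here is a minimal sketch of the HyDE idea. It is not the code in this PR. A stand-in generator produces a hypothetical answer document, and that text (optionally together with the original query, mirroring the include_original option) becomes what gets embedded for retrieval. The names `fake_llm` and `hyde_embedding_strs` are hypothetical; in a real pipeline an actual LLM call would replace `fake_llm`.

```python
from typing import Callable, List


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (hypothetical); returns a canned passage."""
    return f"A passage that plausibly answers: {prompt}"


def hyde_embedding_strs(
    query: str,
    generate: Callable[[str], str] = fake_llm,
    include_original: bool = True,
) -> List[str]:
    """Return the strings that would be embedded for retrieval under HyDE."""
    # Step 1: ask the generator for a hypothetical document answering the query.
    hypothetical_doc = generate(query)
    embedding_strs = [hypothetical_doc]
    # Step 2: optionally keep the original query alongside the hypothetical doc.
    if include_original:
        embedding_strs.append(query)
    return embedding_strs


strs = hyde_embedding_strs("What does Bob like?")
print(len(strs))
```

With include_original disabled, only the hypothetical document is embedded, so retrieval depends entirely on the quality of the generated text.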

Result
All unit tests for the new features have passed.
The query-analysis process in the RAG module runs smoothly, effectively rewriting and optimizing queries for better search results.
Other
Added a detailed description of the changes and fixes made in the submission.

Mock functions (mock_openai_embedding, mock_azure_embedding, mock_gemini_embedding, and mock_ollama_embedding) have been added.
Reason for adding:
To fix the issue that static methods are not callable. The previous code passed static methods as parameters to a parameterized test, but the static methods were not callable objects, which resulted in a TypeError. Factory.py
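
One plausible way this kind of failure arises is sketched below, under the assumption that the static methods were referenced inside the class body. The class name `BrokenFactoryTests` and the function bodies are hypothetical; only the mock function names come from this PR. Inside a class body, a decorated name is bound to the staticmethod wrapper itself, and that wrapper only became callable in Python 3.10.

```python
import sys


class BrokenFactoryTests:
    @staticmethod
    def mock_openai_embedding() -> str:
        return "openai"

    # Inside the class body this name is bound to the staticmethod wrapper,
    # not to the underlying function, so the list stores the wrapper.
    CASES = [mock_openai_embedding]


def mock_openai_embedding() -> str:
    """Module-level replacement: a plain function, callable everywhere."""
    return "openai"


def mock_ollama_embedding() -> str:
    """Second module-level mock, also a plain function."""
    return "ollama"


FIXED_CASES = [mock_openai_embedding, mock_ollama_embedding]

wrapper = BrokenFactoryTests.CASES[0]
if sys.version_info < (3, 10):
    # On older interpreters, calling the wrapper directly raises TypeError.
    try:
        wrapper()
    except TypeError as exc:
        print(f"TypeError: {exc}")

# The module-level functions can be passed around and called on any version.
print([case() for case in FIXED_CASES])
```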
@lanlanguai lanlanguai closed this Jul 26, 2024
@lanlanguai lanlanguai deleted the rag_HyDE branch July 26, 2024 06:29
@lanlanguai lanlanguai restored the rag_HyDE branch July 26, 2024 06:30
@lanlanguai lanlanguai reopened this Jul 26, 2024
@codecov-commenter

Codecov Report

Attention: Patch coverage is 15.62500% with 27 lines in your changes missing coverage. Please review.

Project coverage is 55.66%. Comparing base (c0abe17) to head (2819b2e).
Report is 12 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| metagpt/rag/query_analysis/HyDE.py | 0.00% | 14 Missing ⚠️ |
| metagpt/rag/factories/HyDEQueryTransformFactory.py | 0.00% | 13 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1413       +/-   ##
===========================================
+ Coverage   30.64%   55.66%   +25.01%     
===========================================
  Files         320      323        +3     
  Lines       19426    19458       +32     
===========================================
+ Hits         5954    10831     +4877     
+ Misses      13472     8627     -4845     


@@ -20,6 +20,10 @@ embedding:
embed_batch_size: 100
dimensions: # output dimension of embedding model

# RAG Analysis
hyde:
Collaborator

Use a structure like the following to support more configuration inside rag:
rag:
  query:
    hyde:
      include_original: True

api_key: "YOUR_API_KEY"
Collaborator

No need to commit this file if there are no related changes.

from pydantic import BaseModel

from metagpt.const import DATA_PATH, EXAMPLE_DATA_PATH
from metagpt.logs import logger
from metagpt.rag.engines import SimpleEngine
from metagpt.rag.factories.HyDEQueryTransformFactory import HyDEQueryTransformFactory
Collaborator

File names are usually lowercase with '_' separators.

@@ -212,6 +214,22 @@ async def init_and_query_es(self):
answer = await engine.aquery(TRAVEL_QUESTION)
self._print_query_result(answer)

async def use_HyDe(self):
Collaborator

Rename this to use_hyde, and keep the spelling uniform: HyDE, not HyDe.

@@ -51,6 +52,9 @@ class Config(CLIParams, YamlModel):
# RAG Embedding
embedding: EmbeddingConfig = EmbeddingConfig()

# RAG Analysis
hyde: HydeConfig = HydeConfig()
Collaborator

HyDEConfig

@@ -0,0 +1,5 @@
from metagpt.utils.yaml_model import YamlModel

Collaborator

Use rag_config.py to support an independent RAG configuration.


if self._include_original:
embedding_strs.extend(query_bundle.embedding_strs)
logger.info(f" Hypothetical doc:{embedding_strs} ")
Collaborator

Usually we don't log the embedding strings; they are too long and don't make a good log message.
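
One way to act on this is to log a short summary instead of the full strings. The sketch below is an illustration only; the helper name `summarize_for_log`, the logger name, and the 80-character limit are assumptions, not part of the PR.

```python
import logging
from typing import List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("hyde_example")


def summarize_for_log(embedding_strs: List[str], max_chars: int = 80) -> str:
    """Describe the strings by count plus a truncated preview of the first one."""
    if not embedding_strs:
        return "0 strings"
    first = embedding_strs[0]
    preview = first[:max_chars]
    # Mark the preview as truncated only when text was actually cut off.
    suffix = "..." if len(first) > max_chars else ""
    return f"{len(embedding_strs)} strings, first: {preview!r}{suffix}"


# Logs a bounded-length message even though the first string is 500 characters.
logger.info("Hypothetical doc: %s", summarize_for_log(["x" * 500, "original query"]))
```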

engine = SimpleEngine.from_docs(input_files=[TRAVEL_DOC_PATH])
# create HyDE query engine
hyde_query_transformr = HyDEQueryTransformFactory().create_hyde_query_transform()
hyde_query_engine = TransformQueryEngine(engine, hyde_query_transformr)
Collaborator

How can this be integrated with SimpleEngine, rather than using TransformQueryEngine directly?
What I mean is a single engine entry point that supports query rewrite, rerank, and so on.
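
A rough sketch of what a single entry point with optional stages could look like follows. This is not SimpleEngine's API: the class `SketchEngine`, its constructor arguments, and the keyword-overlap retrieval are all hypothetical, chosen only to keep the example self-contained.

```python
from typing import Callable, List, Optional

QueryTransform = Callable[[str], str]
Reranker = Callable[[str, List[str]], List[str]]


class SketchEngine:
    """Hypothetical engine with optional query-rewrite and rerank stages."""

    def __init__(
        self,
        docs: List[str],
        query_transform: Optional[QueryTransform] = None,
        reranker: Optional[Reranker] = None,
    ) -> None:
        self._docs = docs
        self._query_transform = query_transform
        self._reranker = reranker

    def retrieve(self, query: str) -> List[str]:
        # Stage 1 (optional): rewrite the query, e.g. with a HyDE-style transform.
        search_query = self._query_transform(query) if self._query_transform else query
        # Stage 2: toy retrieval that keeps docs sharing any word with the query.
        terms = set(search_query.lower().split())
        hits = [doc for doc in self._docs if terms & set(doc.lower().split())]
        # Stage 3 (optional): reorder the hits against the original query.
        if self._reranker:
            hits = self._reranker(query, hits)
        return hits


engine = SketchEngine(
    ["Bob likes traveling", "Alice likes cooking"],
    query_transform=lambda q: q + " traveling",
)
print(engine.retrieve("Bob hobby"))
```

Callers construct one engine and enable stages through arguments, instead of wrapping the engine in a separate transform engine.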

# 1. save docs
engine = SimpleEngine.from_docs(input_files=[TRAVEL_DOC_PATH])
# create HyDE query engine
hyde_query_transformr = HyDEQueryTransformFactory().create_hyde_query_transform()
Collaborator

Add comparison results on datasets with and without the HyDE method.

@@ -23,13 +23,9 @@ rag:
# RAG Query Analysis
query_analysis:
hyde:
include_original: true # In the query rewrite, determines whether to include the original
include_original: True # In the query rewrite, determines whether to include the original
Collaborator

Use true, not True.

@@ -0,0 +1,63 @@
from typing import Any, Dict, Optional
from llama_index.core.llms import LLM
Collaborator

Why import this? It is not used.

api_version: ""
embed_batch_size: 100
dimensions: # output dimension of embedding model
embedding:
Collaborator

Don't change this embedding section.

@lanlanguai
Author

The configuration information and results from running the configurations with and without the HyDE method using metagpt/rag/benchmark/hotpotqa.py are as follows:

| Model | Sample_Size | HyDE_Used | Exact_Match | F1_Score |
|---|---|---|---|---|
| deepseek | 20 | yes | 0.1 | 0.289846 |
| deepseek | 20 | no | 0.1 | 0.265604 |
| gpt4-o | 20 | yes | 0.55 | 0.726190 |
| gpt4-o | 20 | no | 0.45 | 0.626190 |
| gpt4-o | 100 | yes | 0.6 | 0.752560 |
| gpt4-o | 100 | no | 0.57 | 0.741560 |
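
The per-configuration differences can be read off with a short script. The figures are copied from the results above; the variable names and the pairing logic are illustrative assumptions, not part of the benchmark script.

```python
from typing import Dict, Tuple

# (model, sample_size, hyde_used) -> (exact_match, f1_score), from the results above.
results: Dict[Tuple[str, int, bool], Tuple[float, float]] = {
    ("deepseek", 20, True): (0.1, 0.289846),
    ("deepseek", 20, False): (0.1, 0.265604),
    ("gpt4-o", 20, True): (0.55, 0.726190),
    ("gpt4-o", 20, False): (0.45, 0.626190),
    ("gpt4-o", 100, True): (0.6, 0.752560),
    ("gpt4-o", 100, False): (0.57, 0.741560),
}

# Difference (with HyDE minus without HyDE) for each model and sample size.
deltas: Dict[Tuple[str, int], Tuple[float, float]] = {}
for (model, size, used), (em, f1) in results.items():
    if used:
        base_em, base_f1 = results[(model, size, False)]
        deltas[(model, size)] = (round(em - base_em, 6), round(f1 - base_f1, 6))

for key, (d_em, d_f1) in deltas.items():
    print(key, d_em, d_f1)
```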

"""This example show how to use HyDE: HyDE enhances search results by generating Hypothetical doc(virtual
article), for more details please refer to the paper: http://arxiv.org/abs/2212.10496
Query Result:
Bob likes traveling.
Contributor

Is the comment correct?

from metagpt.configs.redis_config import RedisConfig
from metagpt.configs.s3_config import S3Config
from metagpt.configs.search_config import SearchConfig
from metagpt.configs.workspace_config import WorkspaceConfig
from metagpt.const import CONFIG_ROOT, METAGPT_ROOT
from MetaGPT.metagpt.configs.rag_config import RAGConfig
Contributor

Needs to be deleted


class RAGConfig(YamlModel):
embedding: EmbeddingConfig = EmbeddingConfig()
query_analysis: QueryAnalysisConfig = QueryAnalysisConfig()
Contributor

It is recommended to define QueryAnalysisConfig and EmbeddingConfig in rag_config.py, without separate hyde_config.py and query_analysis_config.py files.
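
A sketch of how a single rag_config.py could hold all of these classes is shown below. It uses standard-library dataclasses as a stand-in for MetaGPT's YamlModel so that it runs on its own; the field sets are trimmed to the options mentioned in this PR, and the layout is an assumption, not the merged design.

```python
from dataclasses import dataclass, field


@dataclass
class HyDEConfig:
    # Whether the original query is kept alongside the hypothetical document.
    include_original: bool = True


@dataclass
class QueryAnalysisConfig:
    hyde: HyDEConfig = field(default_factory=HyDEConfig)


@dataclass
class EmbeddingConfig:
    # Only one embedding option is shown here for brevity.
    embed_batch_size: int = 100


@dataclass
class RAGConfig:
    embedding: EmbeddingConfig = field(default_factory=EmbeddingConfig)
    query_analysis: QueryAnalysisConfig = field(default_factory=QueryAnalysisConfig)


config = RAGConfig()
print(config.query_analysis.hyde.include_original)
```

Keeping the nested classes in one module means the top-level config only needs a single import for everything under rag.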

@geekan
Owner

geekan commented Oct 20, 2024

@better629
