Deep research #2
Conversation
/gemini review all changes
Summary of Changes
Hello @0xjunhao, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've introduced a powerful "deep research" system that enables the application to perform iterative, AI-driven web searches and reasoning to answer complex questions. This new capability is seamlessly integrated into our existing evaluation framework, allowing us to test its performance. Additionally, I've made minor adjustments to thread management and improved the robustness of data loading for our evaluation datasets.
Highlights
- Implemented a "Deep Research" Framework: I've added a new sampler/deep_research.py module that orchestrates multi-step research. This includes fetching content from URLs (including PDFs), summarizing information, performing iterative reasoning, and generating concise answers based on gathered data (a rough sketch of this loop follows the list below).
- Integrated Ubicloud Models and Deep Research Samplers: I've introduced UbicloudCompletionSampler and DeepRsearchCompletionSampler to allow our system to leverage Ubicloud-hosted AI models and the new deep research pipeline directly within our sampling and evaluation infrastructure.
- Updated Evaluation Configuration: The simple_evals.py script now includes the new samplers and switches the default grading model to a Ubicloud-hosted llama-3-3-70b-turbo. The evaluation scope has been temporarily narrowed to the simpleqa dataset to focus testing on the new features.
- Enhanced Data Loading for SimpleQA: I've refactored the simpleqa_eval.py to use the requests library for fetching the test dataset CSV, improving the reliability and handling of network operations during data retrieval.
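To make the control flow concrete, here is a rough, self-contained sketch of the research loop. The function names echo the hunks reviewed below, but the state fields and helper bodies are stand-ins for illustration, not the PR's actual implementation:

```python
from typing import Optional, TypedDict

class ResearchState(TypedDict, total=False):
    topic: str
    thinking: str
    notes: list   # accumulated summaries (assumed field)
    answer: str

def deep_dive(state: ResearchState, search_engine: str) -> ResearchState:
    # Stand-in: the real version asks an LLM for search queries, then
    # fetches and summarizes the results for each one.
    state.setdefault("notes", []).append(f"searched {search_engine} for {state['topic']}")
    return state

def deep_research(topic: str, depth: int, search_engine: str,
                  initial_state: Optional[dict], debug: bool) -> str:
    # Resume from a saved state if one was provided.
    state: ResearchState = initial_state or {"topic": topic}
    for _ in range(depth):
        # Each round digs into new subtopics and updates the notes.
        state = deep_dive(state, search_engine)
    # The real code asks a writing model to produce the final answer.
    state["answer"] = "; ".join(state["notes"])
    return state["answer"]

print(deep_research("example topic", depth=2, search_engine="duckduckgo",
                    initial_state=None, debug=False))
```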
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
/gemini what is deep research?
Code Review
This pull request introduces a significant new feature: a "deep research" agent capable of using search engines and language models to answer questions. It also adds new samplers for this agent and for Ubicloud endpoints. The changes are extensive and add valuable functionality.
My review focuses on improving the robustness, maintainability, and correctness of the new code. Key findings include a potential infinite recursion bug, incorrect type hints, and some unused code and parameters that could be cleaned up. I've also pointed out a bug in command-line argument parsing and some areas where logging could be improved over print statements.
```python
def deep_dive(state: ResearchState, search_engine: str) -> ResearchState:
    """Deep dive into a few areas."""
    subtopics = inference([
        create_system_message("""
Generate 1 to 2 search queries on the given topic.
Return a JSON array of strings.
Each string should be a well-structured search engine query.
At least one of them should explicitly ask for trustworthy sources, such as wiki, gov, bookforum, or fandom."""),
        create_user_message({
            "topic": state["topic"],
            "thinking": state.get("thinking", "")
        })
    ], InferenceMode.JSON)
    if subtopics is None:
        logger.warning("No JSON object found. Retrying deep_dive...")
        return deep_dive(state, search_engine)
    for subtopic in subtopics:
        state = gather_information(subtopic, state, search_engine)
    return state
```
This function uses recursion to retry if it fails to get subtopics from the inference call. However, there is no base case to stop the recursion after a certain number of retries. This can lead to a RecursionError and a stack overflow if the inference call consistently returns None. This is a critical issue that could crash the application.
```python
def deep_dive(state: ResearchState, search_engine: str, _retries_left: int = 3) -> ResearchState:
    """Deep dive into a few areas."""
    if _retries_left == 0:
        logger.error("Failed to get subtopics after multiple retries.")
        return state
    subtopics = inference([
        create_system_message("""
Generate 1 to 2 search queries on the given topic.
Return a JSON array of strings.
Each string should be a well-structured search engine query.
At least one of them should explicitly ask for trustworthy sources, such as wiki, gov, bookforum, or fandom."""),
        create_user_message({
            "topic": state["topic"],
            "thinking": state.get("thinking", "")
        })
    ], InferenceMode.JSON)
    if subtopics is None:
        logger.warning(f"No JSON object found. Retrying deep_dive... ({4 - _retries_left}/3)")
        return deep_dive(state, search_engine, _retries_left - 1)
    for subtopic in subtopics:
        state = gather_information(subtopic, state, search_engine)
    return state
```
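An alternative that avoids recursion altogether is a bounded retry loop. The sketch below reuses the module's inference, gather_information, logger, and InferenceMode names, and introduces a hypothetical build_deep_dive_messages helper standing in for the two messages constructed above:

```python
def deep_dive(state: ResearchState, search_engine: str, max_retries: int = 3) -> ResearchState:
    """Deep dive into a few areas, retrying the JSON inference in a bounded loop."""
    for attempt in range(1, max_retries + 1):
        # build_deep_dive_messages is a hypothetical helper wrapping the
        # same system/user messages as the recursive version above.
        subtopics = inference(build_deep_dive_messages(state), InferenceMode.JSON)
        if subtopics is not None:
            for subtopic in subtopics:
                state = gather_information(subtopic, state, search_engine)
            return state
        logger.warning("No JSON object found. Retrying deep_dive... (%d/%d)", attempt, max_retries)
    logger.error("Failed to get subtopics after %d retries.", max_retries)
    return state
```

Unlike the recursive version, this keeps the call stack flat no matter how many retries are configured.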
```python
def deep_research(
        topic: str, depth: int, search_engine: str, initial_state: Optional[dict], debug: bool) -> None:
```
The function is type-hinted to return None, but it actually returns state["answer"] on line 397. This type mismatch can cause issues with static analysis tools and makes the code harder to understand and maintain.
```diff
-        topic: str, depth: int, search_engine: str, initial_state: Optional[dict], debug: bool) -> None:
+        topic: str, depth: int, search_engine: str, initial_state: Optional[dict], debug: bool) -> str:
```
| help="Path to a JSON file to resume state from") | ||
| parser.add_argument("--search_engine", type=str, default="duckduckgo", | ||
| help="The search engine to use, either DuckDuckGo or Tavily.") | ||
| parser.add_argument("--debug", type=bool, default=False, help="Run in debug mode") |
Using type=bool for command-line flags in argparse is problematic and doesn't work as one might expect. For instance, --debug false would still result in True. The standard and recommended way to handle boolean flags is to use action="store_true".
| parser.add_argument("--debug", type=bool, default=False, help="Run in debug mode") | |
| parser.add_argument("--debug", action="store_true", help="Run in debug mode") |
```python
    def __init__(
        self,
        model: str = "llama-3-3-70b-turbo",
        system_message: str | None = None,
        temperature: float = 0.5,
        max_tokens: int = 1024,
        depth: int = 3,
    ):
```
The model, temperature, and max_tokens parameters are initialized but never used. The deep_research function that this sampler calls has its own hardcoded model configurations and does not accept these parameters. This makes the sampler's API misleading and could cause confusion for developers using this class. Please remove these unused parameters and their corresponding instance attributes.
```python
    )


class DeepRsearchCompletionSampler(SamplerBase):
```
There is a typo in the class name DeepRsearchCompletionSampler. It should be DeepResearchCompletionSampler. This should be corrected here and in any files where it is imported, such as simple_evals.py.
```diff
-class DeepRsearchCompletionSampler(SamplerBase):
+class DeepResearchCompletionSampler(SamplerBase):
```
```python
    def _handle_image(
        self, image: str, encoding: str = "base64", format: str = "png", fovea: int = 768
    ):
        new_image = {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/{format};{encoding},{image}",
            },
        }
        return new_image

    def _handle_text(self, text: str):
        return {"type": "text", "text": text}
```
| print("Bad Request Error", e) | ||
| return "" | ||
| except Exception as e: | ||
| exception_backoff = 2**trial # expontial back off | ||
| print( | ||
| f"Rate limit exception so wait and retry {trial} after {exception_backoff} sec", | ||
| e, | ||
| ) |
```python
        # for eval_name in ["simpleqa", "mmlu", "math", "gpqa", "mgsm", "drop", "humaneval"]
        for eval_name in ["simpleqa"]
```
The list of evaluations to run is hardcoded to just ["simpleqa"], with the full list commented out. This appears to be for debugging purposes. This should be reverted to the full list before merging to ensure all evaluations are run.
```diff
-        # for eval_name in ["simpleqa", "mmlu", "math", "gpqa", "mgsm", "drop", "humaneval"]
-        for eval_name in ["simpleqa"]
+        for eval_name in ["simpleqa", "mmlu", "math", "gpqa", "mgsm", "drop", "humaneval"]
```
Code Review
This pull request introduces a significant new feature for performing deep research using a combination of web search and large language models. The implementation is comprehensive, covering everything from data fetching and summarization to report generation. My review focuses on improving the robustness, maintainability, and clarity of the new code. I've identified a few critical issues, including a potential for unbounded recursion and a sampler class with misleading and unused code. I have also provided suggestions to address incorrect type hints, argument parsing, and brittle implementation details. Addressing these points will make the new feature more reliable and easier to maintain.
```python
    if subtopics is None:
        logger.warning("No JSON object found. Retrying deep_dive...")
        return deep_dive(state, search_engine)
```
The recursive call to deep_dive when subtopics is None lacks a retry limit. If the inference call consistently fails to produce a valid JSON, this will lead to unbounded recursion and eventually a RecursionError. To make this more robust, you should introduce a retry mechanism with a maximum number of attempts.
```python
    def __init__(
        self,
        model: str = "llama-3-3-70b-turbo",
        system_message: str | None = None,
        temperature: float = 0.5,
        max_tokens: int = 1024,
        depth: int = 3,
    ):
        self.api_key_name = "OPENAI_API_KEY"
        self.client = OpenAI(base_url=f"https://{model}.ai.ubicloud.com/v1/")
        # using api_key=os.environ.get("OPENAI_API_KEY") # please set your API_KEY
        self.model = model
        self.system_message = system_message
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.image_format = "url"
        self.depth = depth
```
This class's __init__ method contains a significant amount of unused and misleading code. It accepts parameters (model, temperature, max_tokens) that are ignored in __call__. The self.client attribute is also initialized but never used. This makes the class difficult to understand and use correctly. The code should be refactored to remove all unused elements.
```diff
     def __init__(
         self,
-        model: str = "llama-3-3-70b-turbo",
         system_message: str | None = None,
-        temperature: float = 0.5,
-        max_tokens: int = 1024,
         depth: int = 3,
     ):
         self.api_key_name = "OPENAI_API_KEY"
-        self.client = OpenAI(base_url=f"https://{model}.ai.ubicloud.com/v1/")
-        # using api_key=os.environ.get("OPENAI_API_KEY") # please set your API_KEY
-        self.model = model
         self.system_message = system_message
-        self.temperature = temperature
-        self.max_tokens = max_tokens
-        self.image_format = "url"
         self.depth = depth
```
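For context, a hypothetical call site after this refactor (with the class renamed per the earlier comment); it assumes SamplerBase instances are invoked with an OpenAI-style message list, as the __call__ hunk below suggests:

```python
# Hypothetical usage sketch, not from the PR itself.
sampler = DeepResearchCompletionSampler(depth=3)
answer = sampler([{"role": "user", "content": "What is the tallest mountain in Wales?"}])
print(answer)
```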
```python
    def get_model(self) -> str:
        """Retrieve the corresponding model name for the inference mode."""
        return globals().get(f"{self.name}_MODEL")
```
Using globals().get(...) to retrieve model names is brittle and relies on a naming convention for global variables. This makes the code harder to understand and maintain, as the connection between the InferenceMode enum and the model strings is implicit. A more robust and explicit approach would be to use a dictionary to map enum members to model names.
```python
        return {
            "SUMMARIZATION": SUMMARIZATION_MODEL,
            "REASONING": REASONING_MODEL,
            "WRITING": WRITING_MODEL,
            "JSON": JSON_MODEL,
        }[self.name]
```
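An alternative sketch that avoids the lookup entirely by storing the model name as each enum member's value (placeholder model strings for illustration; the PR keeps these in *_MODEL globals):

```python
from enum import Enum

class InferenceMode(Enum):
    # Placeholder model names for illustration only.
    SUMMARIZATION = "summarization-model"
    REASONING = "reasoning-model"
    WRITING = "writing-model"
    JSON = "json-model"

    def get_model(self) -> str:
        # The member's value is the model name, so no globals() access
        # or separate mapping dict is needed.
        return self.value
```

One caveat: Enum treats members with identical values as aliases, so this pattern only works if each mode maps to a distinct model name.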
```python
def deep_research(
        topic: str, depth: int, search_engine: str, initial_state: Optional[dict], debug: bool) -> None:
```
The function deep_research is type-hinted to return None, but it actually returns state["answer"] which is a string. This mismatch can be misleading for developers using this function and can cause issues with static analysis tools. The return type hint should be updated to str.
```python
def deep_research(
        topic: str, depth: int, search_engine: str, initial_state: Optional[dict], debug: bool) -> str:
```
| parser.add_argument("--search_engine", type=str, default="duckduckgo", | ||
| help="The search engine to use, either DuckDuckGo or Tavily.") | ||
| parser.add_argument("--debug", type=bool, default=False, help="Run in debug mode") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
```python
        return deep_research.deep_research(
            message_list[0]["content"], depth=self.depth, search_engine="tavily", initial_state=None, debug=False)
```
The code accesses message_list[0] without checking if message_list is empty. If an empty list is passed, this will raise an IndexError. You should add a check to handle this case gracefully.
```diff
+        if not message_list:
+            return ""
         return deep_research.deep_research(
             message_list[0]["content"], depth=self.depth, search_engine="tavily", initial_state=None, debug=False)
```
The term "deep research" in this pull request refers to an automated process implemented in sampler/deep_research.py. Specifically, the deep_research function orchestrates multi-step research: it generates search queries, fetches content from URLs (including PDFs), summarizes the gathered information, and reasons over it iteratively before producing a concise answer.
In essence, it's an automated system designed to simulate a researcher's process of searching, reading, synthesizing, and answering questions on a topic, going beyond a single search query to explore related concepts in depth.