
Conversation

@0xjunhao (Owner) commented Aug 9, 2025

No description provided.

@0xjunhao (Owner, Author) commented Aug 9, 2025

/gemini review all changes

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @0xjunhao, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've introduced a powerful "deep research" system that enables the application to perform iterative, AI-driven web searches and reasoning to answer complex questions. This new capability is seamlessly integrated into our existing evaluation framework, allowing us to test its performance. Additionally, I've made minor adjustments to thread management and improved the robustness of data loading for our evaluation datasets.

Highlights

  • Implemented a "Deep Research" Framework: I've added a new sampler/deep_research.py module that orchestrates multi-step research. This includes fetching content from URLs (including PDFs), summarizing information, performing iterative reasoning, and generating concise answers based on gathered data.
  • Integrated Ubicloud Models and Deep Research Samplers: I've introduced UbicloudCompletionSampler and DeepRsearchCompletionSampler to allow our system to leverage Ubicloud-hosted AI models and the new deep research pipeline directly within our sampling and evaluation infrastructure.
  • Updated Evaluation Configuration: The simple_evals.py script now includes the new samplers and switches the default grading model to a Ubicloud-hosted llama-3-3-70b-turbo. The evaluation scope has been temporarily narrowed to the simpleqa dataset to focus testing on the new features.
  • Enhanced Data Loading for SimpleQA: I've refactored simpleqa_eval.py to use the requests library for fetching the test dataset CSV, improving the reliability and error handling of network operations during data retrieval (a sketch of this pattern follows the list).
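
As a rough sketch of that last pattern (the function name and return shape below are illustrative assumptions, not the actual simpleqa_eval.py code):

import csv
import io

import requests

def fetch_simpleqa_csv(url: str) -> list[dict]:
    # Illustrative only: download the CSV with an explicit timeout and
    # surface HTTP errors instead of silently parsing an error page.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return list(csv.DictReader(io.StringIO(response.text)))
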
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
---------------------|---------------------|------------------------------------------------------------------------------------------
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@0xjunhao (Owner, Author) commented Aug 9, 2025

/gemini what is deep research?

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a significant new feature: a "deep research" agent capable of using search engines and language models to answer questions. It also adds new samplers for this agent and for Ubicloud endpoints. The changes are extensive and add valuable functionality.

My review focuses on improving the robustness, maintainability, and correctness of the new code. Key findings include a potential infinite recursion bug, incorrect type hints, and some unused code and parameters that could be cleaned up. I've also pointed out a bug in command-line argument parsing and some areas where logging could be improved over print statements.

Comment on lines +261 to +279
def deep_dive(state: ResearchState, search_engine: str) -> ResearchState:
    """Deep dive into a few areas."""
    subtopics = inference([
        create_system_message("""
Generate 1 to 2 search queries on the given topic.
Return a JSON array of strings.
Each string should be a well-structured search engine query.
At least one of them should explicitly ask for trustworthy sources, such as wiki, gov, bookforum, or fandom."""),
        create_user_message({
            "topic": state["topic"],
            "thinking": state.get("thinking", "")
        })
    ], InferenceMode.JSON)
    if subtopics is None:
        logger.warning("No JSON object found. Retrying deep_dive...")
        return deep_dive(state, search_engine)
    for subtopic in subtopics:
        state = gather_information(subtopic, state, search_engine)
    return state


critical

This function uses recursion to retry if it fails to get subtopics from the inference call. However, there is no base case to stop the recursion after a certain number of retries. This can lead to a RecursionError and a stack overflow if the inference call consistently returns None. This is a critical issue that could crash the application.

def deep_dive(state: ResearchState, search_engine: str, _retries_left: int = 3) -> ResearchState:
    """Deep dive into a few areas."""
    if _retries_left == 0:
        logger.error("Failed to get subtopics after multiple retries.")
        return state

    subtopics = inference([
        create_system_message("""
Generate 1 to 2 search queries on the given topic.
Return a JSON array of strings.
Each string should be a well-structured search engine query.
At least one of them should explicitly ask for trustworthy sources, such as wiki, gov, bookforum, or fandom."""),
        create_user_message({
            "topic": state["topic"],
            "thinking": state.get("thinking", "")
        })
    ], InferenceMode.JSON)
    if subtopics is None:
        logger.warning(f"No JSON object found. Retrying deep_dive... ({4 - _retries_left}/3)")
        return deep_dive(state, search_engine, _retries_left - 1)
    for subtopic in subtopics:
        state = gather_information(subtopic, state, search_engine)
    return state



def deep_research(
        topic: str, depth: int, search_engine: str, initial_state: Optional[dict], debug: bool) -> None:


high

The function is type-hinted to return None, but it actually returns state["answer"] on line 397. This type mismatch can cause issues with static analysis tools and makes the code harder to understand and maintain.

Suggested change
        topic: str, depth: int, search_engine: str, initial_state: Optional[dict], debug: bool) -> None:
        topic: str, depth: int, search_engine: str, initial_state: Optional[dict], debug: bool) -> str:

help="Path to a JSON file to resume state from")
parser.add_argument("--search_engine", type=str, default="duckduckgo",
help="The search engine to use, either DuckDuckGo or Tavily.")
parser.add_argument("--debug", type=bool, default=False, help="Run in debug mode")


high

Using type=bool for command-line flags in argparse is problematic and doesn't work as one might expect. For instance, --debug false would still result in True. The standard and recommended way to handle boolean flags is to use action="store_true".

Suggested change
parser.add_argument("--debug", type=bool, default=False, help="Run in debug mode")
parser.add_argument("--debug", action="store_true", help="Run in debug mode")

Comment on lines +24 to +31
    def __init__(
        self,
        model: str = "llama-3-3-70b-turbo",
        system_message: str | None = None,
        temperature: float = 0.5,
        max_tokens: int = 1024,
        depth: int = 3,
    ):


high

The model, temperature, and max_tokens parameters are initialized but never used. The deep_research function that this sampler calls has its own hardcoded model configurations and does not accept these parameters. This makes the sampler's API misleading and could cause confusion for developers using this class. Please remove these unused parameters and their corresponding instance attributes.

)


class DeepRsearchCompletionSampler(SamplerBase):


medium

There is a typo in the class name DeepRsearchCompletionSampler. It should be DeepResearchCompletionSampler. This should be corrected here and in any files where it is imported, such as simple_evals.py.

Suggested change
class DeepRsearchCompletionSampler(SamplerBase):
class DeepResearchCompletionSampler(SamplerBase):

Comment on lines +42 to +55
    def _handle_image(
        self, image: str, encoding: str = "base64", format: str = "png", fovea: int = 768
    ):
        new_image = {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/{format};{encoding},{image}",
            },
        }
        return new_image

    def _handle_text(self, text: str):
        return {"type": "text", "text": text}


medium

These methods are defined but are not used anywhere in the class. This represents dead code and should be removed to improve maintainability.

Comment on lines +70 to +77
print("Bad Request Error", e)
return ""
except Exception as e:
exception_backoff = 2**trial # expontial back off
print(
f"Rate limit exception so wait and retry {trial} after {exception_backoff} sec",
e,
)


medium

Using print for logging errors and status messages is not ideal for production code. It's better to use the logging module, which allows for configurable log levels, formats, and destinations. Please replace these print statements with appropriate logger calls (e.g., logger.error, logger.warning).
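
A rough sketch of what the retry loop could look like with logging in place of print (the helper name and overall structure here are illustrative, not the PR's actual code):

import logging
import time

logger = logging.getLogger(__name__)

def call_with_retries(request_fn, max_trials: int = 5) -> str:
    # Illustrative helper: request_fn stands in for the actual API call.
    for trial in range(max_trials):
        try:
            return request_fn()
        except Exception as e:
            exception_backoff = 2 ** trial  # exponential backoff
            logger.warning(
                "Rate limit exception, retry %d after %d sec: %s",
                trial, exception_backoff, e,
            )
            time.sleep(exception_backoff)
    logger.error("Giving up after %d trials", max_trials)
    return ""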

Comment on lines +39 to +52
    def _handle_image(
        self, image: str, encoding: str = "base64", format: str = "png", fovea: int = 768
    ):
        new_image = {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/{format};{encoding},{image}",
            },
        }
        return new_image

    def _handle_text(self, text: str):
        return {"type": "text", "text": text}


medium

These helper methods are not used within the class. They should be removed to avoid dead code.

Comment on lines +71 to +79
except openai.BadRequestError as e:
print("Bad Request Error", e)
return ""
except Exception as e:
exception_backoff = 2**trial # expontial back off
print(
f"Rate limit exception so wait and retry {trial} after {exception_backoff} sec",
e,
)


medium

The use of print for logging errors and retry attempts should be replaced with the logging module for better control over log levels and output in a production environment.

Comment on lines +174 to +175
# for eval_name in ["simpleqa", "mmlu", "math", "gpqa", "mgsm", "drop", "humaneval"]
for eval_name in ["simpleqa"]


medium

The list of evaluations to run is hardcoded to just ["simpleqa"], with the full list commented out. This appears to be for debugging purposes. This should be reverted to the full list before merging to ensure all evaluations are run.

Suggested change
# for eval_name in ["simpleqa", "mmlu", "math", "gpqa", "mgsm", "drop", "humaneval"]
for eval_name in ["simpleqa"]
for eval_name in ["simpleqa", "mmlu", "math", "gpqa", "mgsm", "drop", "humaneval"]

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a significant new feature for performing deep research using a combination of web search and large language models. The implementation is comprehensive, covering everything from data fetching and summarization to report generation. My review focuses on improving the robustness, maintainability, and clarity of the new code. I've identified a few critical issues, including a potential for unbounded recursion and a sampler class with misleading and unused code. I have also provided suggestions to address incorrect type hints, argument parsing, and brittle implementation details. Addressing these points will make the new feature more reliable and easier to maintain.

Comment on lines +274 to +276
    if subtopics is None:
        logger.warning("No JSON object found. Retrying deep_dive...")
        return deep_dive(state, search_engine)


critical

The recursive call to deep_dive when subtopics is None lacks a retry limit. If the inference call consistently fails to produce a valid JSON, this will lead to unbounded recursion and eventually a RecursionError. To make this more robust, you should introduce a retry mechanism with a maximum number of attempts.

Comment on lines +24 to +40
    def __init__(
        self,
        model: str = "llama-3-3-70b-turbo",
        system_message: str | None = None,
        temperature: float = 0.5,
        max_tokens: int = 1024,
        depth: int = 3,
    ):
        self.api_key_name = "OPENAI_API_KEY"
        self.client = OpenAI(base_url=f"https://{model}.ai.ubicloud.com/v1/")
        # using api_key=os.environ.get("OPENAI_API_KEY")  # please set your API_KEY
        self.model = model
        self.system_message = system_message
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.image_format = "url"
        self.depth = depth


critical

This class's __init__ method contains a significant amount of unused and misleading code. It accepts parameters (model, temperature, max_tokens) that are ignored in __call__. The self.client attribute is also initialized but never used. This makes the class difficult to understand and use correctly. The code should be refactored to remove all unused elements.

Suggested change
    def __init__(
        self,
        model: str = "llama-3-3-70b-turbo",
        system_message: str | None = None,
        temperature: float = 0.5,
        max_tokens: int = 1024,
        depth: int = 3,
    ):
        self.api_key_name = "OPENAI_API_KEY"
        self.client = OpenAI(base_url=f"https://{model}.ai.ubicloud.com/v1/")
        # using api_key=os.environ.get("OPENAI_API_KEY")  # please set your API_KEY
        self.model = model
        self.system_message = system_message
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.image_format = "url"
        self.depth = depth
    def __init__(
        self,
        system_message: str | None = None,
        depth: int = 3,
    ):
        self.api_key_name = "OPENAI_API_KEY"
        self.system_message = system_message
        self.depth = depth


    def get_model(self) -> str:
        """Retrieve the corresponding model name for the inference mode."""
        return globals().get(f"{self.name}_MODEL")


high

Using globals().get(...) to retrieve model names is brittle and relies on a naming convention for global variables. This makes the code harder to understand and maintain, as the connection between the InferenceMode enum and the model strings is implicit. A more robust and explicit approach would be to use a dictionary to map enum members to model names.

        return {
            "SUMMARIZATION": SUMMARIZATION_MODEL,
            "REASONING": REASONING_MODEL,
            "WRITING": WRITING_MODEL,
            "JSON": JSON_MODEL,
        }[self.name]

Comment on lines +379 to +380
def deep_research(
        topic: str, depth: int, search_engine: str, initial_state: Optional[dict], debug: bool) -> None:


high

The function deep_research is type-hinted to return None, but it actually returns state["answer"] which is a string. This mismatch can be misleading for developers using this function and can cause issues with static analysis tools. The return type hint should be updated to str.

def deep_research(
        topic: str, depth: int, search_engine: str, initial_state: Optional[dict], debug: bool) -> str:

help="Path to a JSON file to resume state from")
parser.add_argument("--search_engine", type=str, default="duckduckgo",
help="The search engine to use, either DuckDuckGo or Tavily.")
parser.add_argument("--debug", type=bool, default=False, help="Run in debug mode")


high

Using type=bool with argparse does not create a boolean flag as one might expect. For example, --debug false will be interpreted as True. The correct way to handle boolean flags is to use action="store_true".

    parser.add_argument("--debug", action="store_true", help="Run in debug mode")

)


class DeepRsearchCompletionSampler(SamplerBase):


high

There is a typo in the class name DeepRsearchCompletionSampler. It should be DeepResearchCompletionSampler. This should be corrected for clarity and consistency.

Suggested change
class DeepRsearchCompletionSampler(SamplerBase):
class DeepResearchCompletionSampler(SamplerBase):

Comment on lines +66 to +67
        return deep_research.deep_research(
            message_list[0]["content"], depth=self.depth, search_engine="tavily", initial_state=None, debug=False)


medium

The code accesses message_list[0] without checking if message_list is empty. If an empty list is passed, this will raise an IndexError. You should add a check to handle this case gracefully.

Suggested change
        return deep_research.deep_research(
            message_list[0]["content"], depth=self.depth, search_engine="tavily", initial_state=None, debug=False)
        if not message_list:
            return ""
        return deep_research.deep_research(
            message_list[0]["content"], depth=self.depth, search_engine="tavily", initial_state=None, debug=False)

@gemini-code-assist (bot) commented

The term "deep research" in this pull request refers to an automated process implemented in sampler/deep_research.py that leverages large language models (LLMs) and web search capabilities to comprehensively investigate a given topic.

Specifically, the deep_research function performs the following steps:

  1. Information Gathering: It starts by performing web searches related to the initial topic.
  2. Content Summarization: It fetches content from the search results (including handling PDFs) and summarizes them using an LLM.
  3. Iterative Deep Dive: For a specified depth, it iteratively generates new search queries (subtopics) based on the current understanding and then gathers and summarizes information for these subtopics. This allows for a more thorough exploration of related areas.
  4. Reasoning and Synthesis: An LLM is used to "think" about and synthesize all the gathered information, prioritizing trustworthy sources.
  5. Answer Generation: Finally, it generates a concise and accurate answer to the original research topic based on the accumulated knowledge.

In essence, it's an automated system designed to simulate a researcher's process of searching, reading, synthesizing, and answering questions on a topic, going beyond a single search query to explore related concepts in depth.
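
Put schematically (gather_information, deep_dive, inference, and InferenceMode appear in the snippets quoted above; the state shape and exact call sequence below are assumptions for illustration, not the actual implementation):

def deep_research_outline(topic: str, depth: int, search_engine: str) -> str:
    # Schematic outline of the flow described above -- not the real code.
    state = {"topic": topic, "summaries": [], "thinking": "", "answer": ""}
    # Steps 1-2: initial search, content fetching (HTML/PDF), summarization.
    state = gather_information(topic, state, search_engine)
    for _ in range(depth):
        # Step 3: generate subtopic queries and gather more material.
        state = deep_dive(state, search_engine)
    # Step 4: synthesize the gathered summaries, favoring trustworthy sources.
    state["thinking"] = inference(state["summaries"], InferenceMode.REASONING)
    # Step 5: produce a concise answer from the accumulated knowledge.
    state["answer"] = inference(state, InferenceMode.WRITING)
    return state["answer"]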
