Multi-Path Reasoning and Tool Integration in LLMs
This repository contains the implementation of the Multi-Path ReAct Agent, a hybrid reasoning framework that extends the standard ReAct paradigm by integrating Tree-of-Thought (ToT) search strategies. By exploring multiple reasoning-action trajectories and systematically integrating external tools, this approach improves performance on complex reasoning tasks using smaller language models like Gemma 3 27b.
- Multi-Path Exploration: Unlike linear Chain-of-Thought (CoT) or standard ReAct, this agent explores a tree of reasoning steps, allowing it to backtrack and recover from early errors.
- Tool Integration: Integrates external tools, including an Equation Solver, Calculator, and Wikipedia Search, to provide factual grounding and reduce hallucinations.
- Pruning & Evaluation: Uses a model-based evaluator to score reasoning states (0-10) and prune unpromising paths during a Breadth-First Search (BFS); a minimal sketch of this loop follows this list.
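As a rough illustration of how these pieces fit together, the loop below sketches BFS over reasoning states with model-based pruning. This is a minimal sketch: `propose_steps` and `score_state` are hypothetical placeholders for the LLM proposal and evaluation calls, not the actual functions in this repository.

```python
# Minimal sketch of the multi-path (BFS) loop described above.
def bfs_search(question, propose_steps, score_state,
               max_depth=5, branch_factor=3, beam_width=3):
    frontier = [("", 0.0)]  # (partial reasoning/action trace, evaluator score)
    for _ in range(max_depth):
        candidates = []
        for trace, _ in frontier:
            # Expand each surviving state into several candidate next steps
            # (thoughts or tool actions).
            for step in propose_steps(question, trace, k=branch_factor):
                new_trace = f"{trace}\n{step}".strip()
                # The model-based evaluator scores the new state on a 0-10 scale.
                candidates.append((new_trace, score_state(question, new_trace)))
        if not candidates:
            break
        # Prune: keep only the top-scoring states for the next BFS level.
        frontier = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return frontier[0][0]  # highest-scoring reasoning trace found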
The source code is located in the src/ directory:
- src/proposed_method/our_method.py: Implementation of the proposed tree-based ReAct agent, including the BFS search algorithm, state expansion, and scoring mechanisms.
- src/baselines/react_baseline.py: Implementation of the standard single-path ReAct baseline, which interleaves thought and action steps linearly.
- src/baselines/llm_baseline.py: Script for running the vanilla LLM (input-output) baseline.
- src/baselines/cot_baseline.py: Script for the Chain-of-Thought (CoT) baseline.
- src/tools/dataset_utils.py: Utilities for loading and processing benchmarks (GSM8K, MMLU, HotpotQA, AI2 ARC, Hendrycks MATH) from Hugging Face.
- src/tools/math_tools.py: Math-based tools such as a calculator and an equation solver.
- src/tools/retriever.py: Wikipedia-based retriever tool.
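For a sense of what the tool modules provide, the snippet below sketches a calculator, an equation solver, and a Wikipedia lookup. The function names and signatures are illustrative assumptions, not the exact interfaces in math_tools.py or retriever.py.

```python
import sympy as sp
import wikipedia

def calculator(expression: str) -> str:
    # Evaluate an arithmetic expression symbolically instead of using eval().
    return str(sp.sympify(expression))

def equation_solver(equation: str, symbol: str = "x") -> str:
    # Solve an equation written as "lhs = rhs" for the given symbol,
    # e.g. "2*x + 3 = 7" -> "[2]".
    lhs, rhs = equation.split("=")
    roots = sp.solve(sp.Eq(sp.sympify(lhs), sp.sympify(rhs)), sp.Symbol(symbol))
    return str(roots)

def wikipedia_search(query: str, sentences: int = 2) -> str:
    # Return a short summary of the top Wikipedia hit for factual grounding.
    return wikipedia.summary(query, sentences=sentences)
```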
- Clone the repository:

  ```bash
  git clone https://github.com/Rish-01/ToT-Agent.git
  cd ToT-Agent
  ```

- Install dependencies: Ensure you have the necessary Python packages installed (e.g., datasets, google-generativeai or transformers, torch).

  ```bash
  pip install -r requirements.txt
  ```

- Set up API Keys: This project uses the Gemma 3 27b-it model via API, so you must configure your API access accordingly.

  ```bash
  export GEMMA_API_KEY="your_api_key_here"
  ```
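A minimal sketch of consuming that key with the google-generativeai client follows; the model identifier string is an assumption, so substitute whatever name your API exposes.

```python
import os
import google.generativeai as genai

# Read the key exported above and configure the client.
genai.configure(api_key=os.environ["GEMMA_API_KEY"])

# "gemma-3-27b-it" is an assumed identifier; check the models available to your account.
model = genai.GenerativeModel("gemma-3-27b-it")
response = model.generate_content("Question: What is 17 * 23? Thought:")
print(response.text)
```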
Run the proposed approach using:
```bash
python -m src.proposed_method.our_method
```

We evaluated our approach against three baselines across five datasets. The table below reports the Mean Accuracy ± Standard Deviation and the [95% Confidence Interval] over three random seeds (30 samples per dataset).
| Dataset | I/O Baseline | CoT | ReAct | Proposed Approach |
|---|---|---|---|---|
| GSM8K | 0.933 ± 0.034 [0.850, 1.000] | 0.947 ± 0.003 [0.940, 0.955] | 0.900 ± 0.010 [0.890, 0.910] | 0.856 ± 0.019 [0.808, 0.903] |
| Hendrycks MATH | 0.978 ± 0.019 [0.931, 1.000] | 0.925 ± 0.009 [0.902, 0.948] | 0.820 ± 0.024 [0.796, 0.844] | 0.844 ± 0.051 [0.717, 0.972] |
| AI2 ARC | 0.967 ± 0.000 [0.967, 0.967] | 0.915 ± 0.014 [0.881, 0.950] | 0.900 ± 0.025 [0.875, 0.925] | 1.000 ± 0.000 [1.000, 1.000] |
| HotpotQA | 0.456 ± 0.102 [0.202, 0.710] | 0.564 ± 0.028 [0.494, 0.634] | 0.833 ± 0.041 [0.792, 0.874] | 0.789 ± 0.019 [0.741, 0.837] |
| MMLU | 0.744 ± 0.084 [0.533, 0.955] | 0.828 ± 0.014 [0.794, 0.862] | 0.800 ± 0.028 [0.772, 0.828] | 1.000 ± 0.000 [1.000, 1.000] |
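The bracketed intervals are consistent with a Student-t interval over the three seed accuracies (an assumption on our part, not a statement of the exact procedure used); for example:

```python
import numpy as np
from scipy import stats

def t_interval(mean, std, n=3, confidence=0.95):
    # 95% confidence interval for the mean of n seed accuracies,
    # clipped to the valid [0, 1] accuracy range.
    sem = std / np.sqrt(n)
    lo, hi = stats.t.interval(confidence, df=n - 1, loc=mean, scale=sem)
    return max(lo, 0.0), min(hi, 1.0)

print(t_interval(0.933, 0.034))  # roughly (0.85, 1.00), cf. the GSM8K I/O cell
```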
- Knowledge Tasks: Our method answered all 30 samples correctly on AI2 ARC and MMLU, demonstrating stronger handling of multiple-choice and commonsense reasoning tasks than all baselines.
- Math Tasks: On simpler math tasks (GSM8K), our method slightly underperformed compared to CoT (0.856 vs 0.947), likely due to limited tool expressiveness and the high efficacy of linear reasoning for grade-school math.
- Abhijit Chunduru - UMass Amherst
- Aditi Ravindra - UMass Amherst
- Rishab Sharma - UMass Amherst