Tree Search for Language Model Agents

We propose an inference-time tree search algorithm to enable language model agents to perform exploration and multi-step planning in interactive web environments. This repository demonstrates how to run our method on the VisualWebArena and WebArena benchmarks.

TODOs

Add other options besides gpt-4o for the value function

News

[07/24/2024]: Released trajectories of the gpt-4o agent.
[06/19/2024]: GitHub repo released.

Install

# Python 3.10 or 3.11 recommended
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
playwright install
pip install -e .

End-to-end Evaluation on (V)WA

Setup the standalone environments. Please check out this page for details.
Configurate the urls for each website. First, export the DATASET to be visualwebarena:

export DATASET=visualwebarena

Then, set the URL for the websites

export CLASSIFIEDS="<your_classifieds_domain>:9980"
export CLASSIFIEDS_RESET_TOKEN="4b61655535e7ed388f0d40a93600254c"  # Default reset token for classifieds site, change if you edited its docker-compose.yml
export SHOPPING="<your_shopping_site_domain>:7770"
export REDDIT="<your_reddit_domain>:9999"
export WIKIPEDIA="<your_wikipedia_domain>:8888"
export HOMEPAGE="<your_homepage_domain>:4399"

If you want to run on the WebArena tasks instead, make sure to also set up the CMS, GitLab, and map environments, and then set their respective environment variables:

export DATASET=webarena
export SHOPPING_ADMIN="<your_e_commerce_cms_domain>:7780/admin"
export GITLAB="<your_gitlab_domain>:8023"
export MAP="<your_map_domain>:3000"

Generate config files for each test example:

python scripts/generate_test_data.py

You will see *.json files generated in the config_files folder. Each file contains the configuration for one test example.

Obtain and save the auto-login cookies for all websites:

bash prepare.sh

Set up API keys.

If using OpenAI models, set a valid OpenAI API key (starting with sk-) as the environment variable:

export OPENAI_API_KEY=your_key

Launch the evaluation. For example, to reproduce our GPT-4o + Search agent, you can run the script provided:

bash scripts/run_vwa_shopping_search.sh

This script will run the search agent with the default hyperparams from our paper on the full set of VWA shopping tasks. Note that the baselines that include a captioning model run on GPU by default (e.g., BLIP-2-T5XL as the captioning model will take up approximately 12GB of GPU VRAM). Similarly, the other bash scripts in scripts/ reproduce the results on the other VWA sites and the text-only WA environment.

By default, the scripts run experiments with the agents with search. If you wish to reproduce the baseline results (without search), set --agent_type prompt when executing run.py.

Running Llama-3 models

If you wish to run the Llama-3 models we have in our paper, first set up a vLLM OpenAI compatible server. Then, update the OPENAI_BASE_URL environment variable in scripts/run_llama_vwa_shopping_search.sh to reflect the URL that the vLLM server is running on. This particular script shows how to run the Llama-3 agent on the VWA shopping environment; it is otherwise very similar to the OpenAI scripts for running on the other environments.

Agent Trajectories

We release the agent trajectories and results of the gpt-4o agent (with gpt-4o as the reward function) here. They are saved in the same format specified in run.py.

Citation

If you methods or code useful, please consider citing our paper:

@article{koh2024tree,
  title={Tree Search for Language Model Agents},
  author={Koh, Jing Yu and McAleer, Stephen and Fried, Daniel and Salakhutdinov, Ruslan},
  journal={arXiv preprint arXiv:2407.01476},
  year={2024}
}

Acknowledgements

Our code is heavily based off the VisualWebArena codebase and the WebArena codebase.

Name		Name	Last commit message	Last commit date
Latest commit History 96 Commits
agent		agent
browser_env		browser_env
coco_images		coco_images
config_files		config_files
environment_docker		environment_docker
evaluation_harness		evaluation_harness
llms		llms
media		media
scripts		scripts
tests		tests
.gitignore		.gitignore
CITATION.cff		CITATION.cff
LICENSE		LICENSE
README.md		README.md
paper.pdf		paper.pdf
prepare.sh		prepare.sh
requirements.txt		requirements.txt
run.py		run.py
run_demo.py		run_demo.py
setup.cfg		setup.cfg
setup.py		setup.py
wa_parallel_run.sh		wa_parallel_run.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Tree Search for Language Model Agents

TODOs

News

Install

End-to-end Evaluation on (V)WA

Running Llama-3 models

Agent Trajectories

Citation

Acknowledgements

About

Releases 1

Packages

Languages

License

kohjingyu/search-agents

Folders and files

Latest commit

History

Repository files navigation

Tree Search for Language Model Agents

TODOs

News

Install

End-to-end Evaluation on (V)WA

Running Llama-3 models

Agent Trajectories

Citation

Acknowledgements

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages