This repository contains a curated list of agent-related papers from ICLR 2025, sorted by average rating in descending order.
- Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models (Rating: 8.00)
- Spider 2.0: Can Language Models Resolve Real-World Enterprise Text-to-SQL Workflows? (Rating: 8.00)
- Do as We Do, Not as You Think: the Conformity of Large Language Models (Rating: 7.50)
- Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence (Rating: 7.00)
- Online Neuro-Symbolic Predicate Invention for High-Level Planning (Rating: 7.00)
- MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering (Rating: 7.00)
- Monte Carlo Planning with Large Language Model for Text-Based Games (Rating: 7.00)
- Self-Evolving Multi-Agent Networks for Software Development (Rating: 7.00)
- Scaling Test-Time Compute Optimally Can be More Effective than Scaling LLM Parameters (Rating: 6.75)
- Language Models Trained to do Arithmetic Predict Human Risky and Intertemporal Choice (Rating: 6.75)
- PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks (Rating: 6.75)
- OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning (Rating: 6.75)
- AFlow: Automating Agentic Workflow Generation (Rating: 6.75)
- Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents (Rating: 6.75)
- EmbodiedSAM: Online Segment Any 3D Thing in Real Time (Rating: 6.67)
- Strong Preferences Affect the Robustness of Value Alignment (Rating: 6.50)
- Facilitating Multi-turn Function Calling for LLMs via Compositional Instruction Tuning (Rating: 6.50)
- Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel (Rating: 6.50)
- Learning Closed-Loop Concept-Guided Policies from Unlabeled Demonstrations (Rating: 6.50)
- VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks (Rating: 6.50)
- Building Math Agents with Multi-Turn Iterative Preference Learning (Rating: 6.50)
- An Investigation of Conformal Isometry Hypothesis for Grid Cells (Rating: 6.50)
- Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage (Rating: 6.50)
- VisualAgentBench: Towards Large Multimodal Models as Visual Agents (Rating: 6.50)
- MMFakeBench: A Mixed-Source Multimodal Misinformation Detection Benchmark for LVLMs (Rating: 6.40)
- Robust Function-Calling for On-Device Language Model via Function Masking (Rating: 6.40)
- Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment (Rating: 6.40)
- DSBench: How Far Are Data Science Agents from Becoming Data Science Experts? (Rating: 6.40)
- Active Task Disambiguation with LLMs (Rating: 6.33)
- AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs (Rating: 6.33)
- SPA-Bench: A Comprehensive Benchmark for Smartphone Agent Evaluation (Rating: 6.33)
- LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code (Rating: 6.25)
- An Intelligent Agentic System for Complex Image Restoration Problems (Rating: 6.25)
- Discriminator-Guided Embodied Planning for LLM Agent (Rating: 6.25)
- Evidence from the Synthetic Laboratory: Language Models as Auction Participants (Rating: 6.25)
- Harnessing Webpage UIs for Text-Rich Visual Understanding (Rating: 6.25)
- OSDA Agent: Leveraging Large Language Models for De Novo Design of Organic Structure Directing Agents (Rating: 6.25)
- Learn-by-interact: A Data-Centric Framework For Self-Adaptive Agents in Realistic Environments (Rating: 6.25)
- Hypothetical Minds: Scaffolding Theory of Mind for Multi-Agent Tasks with Large Language Models (Rating: 6.25)
- DelTA: An Online Document-Level Translation Agent Based on Multi-Level Memory (Rating: 6.25)
- Can Multimodal Foundation Models Perform Visual Temporal Reasoning? (Rating: 6.25)
- Visual Agents as Fast and Slow Thinkers (Rating: 6.25)
- DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback (Rating: 6.25)
- Generalized Principal-Agent Problem with a Learning Agent (Rating: 6.25)
- RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code (Rating: 6.25)
- AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents (Rating: 6.25)
- τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains (Rating: 6.25)
- Counterfactual Concept Bottleneck Models (Rating: 6.25)
- OS-ATLAS: Foundation Action Model for Generalist GUI Agents (Rating: 6.25)
- MaestroMotif: Skill Design from Artificial Intelligence Feedback (Rating: 6.25)
- DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agent (Rating: 6.25)
- Lightweight Neural App Control (Rating: 6.25)
- Language Agents Meet Causality -- Bridging LLMs and Causal World Models (Rating: 6.25)
- Steering Masked Discrete Diffusion Models via Discrete Denoising Posterior Prediction (Rating: 6.25)
- On Bits and Bandits: Quantifying the Regret-Information Trade-off (Rating: 6.25)
- DataGen: Unified Synthetic Dataset Generation via Large Language Models (Rating: 6.25)
- MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark (Rating: 6.25)
- ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer (Rating: 6.20)
- Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs (Rating: 6.20)
- Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology (Rating: 6.00)
- OpenRCA: Can Large Language Models Locate the Root Cause of Software Failures? (Rating: 6.00)
- Multimodal Situational Safety (Rating: 6.00)
- EC-Diffuser: Multi-Object Manipulation via Entity-Centric Behavior Generation (Rating: 6.00)
- Near-Optimal Online Learning for Multi-Agent Submodular Coordination: Tight Approximation and Communication Efficiency (Rating: 6.00)
- Autonomous agents from automatic reward modeling and planning (Rating: 6.00)
- AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents (Rating: 6.00)
- Training Language Models to Critique with Multi-Agent Feedback (Rating: 6.00)
- ImProver: Agent-Based Automated Proof Optimization (Rating: 6.00)
- PathGen-1.6M: 1.6 Million Pathology Image-text Pairs Generation through Multi-agent Collaboration (Rating: 6.00)
- Mixture-of-Agents Enhances Large Language Model Capabilities (Rating: 6.00)
- Deep Learning Algorithms for Mean Field Optimal Stopping in Finite Space and Discrete Time (Rating: 6.00)
- Agent S: An Open Agentic Framework that Uses Computers Like a Human (Rating: 6.00)
- ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery (Rating: 6.00)
- Interactive Speculative Planning: Enhance Agent Efficiency through Co-design of System and User Interface (Rating: 6.00)
- Aligned LLMs Are Not Aligned Browser Agents (Rating: 6.00)
- COMBO: Compositional World Models for Embodied Multi-Agent Cooperation (Rating: 6.00)
- MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding (Rating: 6.00)
- Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents (Rating: 6.00)
- OccProphet: Pushing the Efficiency Frontier of Camera-Only 4D Occupancy Forecasting with an Observer-Forecaster-Refiner Framework (Rating: 6.00)
- Neural Interactive Proofs (Rating: 6.00)
- GOAL: A Generalist Combinatorial Optimization Agent Learning (Rating: 6.00)
- Advantage Alignment Algorithms (Rating: 6.00)
- Scaling Large Language Model-based Multi-Agent Collaboration (Rating: 6.00)
- Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training (Rating: 6.00)
- LoRA-Gen: Specializing Language Model via Online LoRA Generation (Rating: 6.00)
- MMRole: A Comprehensive Framework for Developing and Evaluating Multimodal Role-Playing Agents (Rating: 6.00)
- Do LLMs "know" internally when they follow instructions? (Rating: 6.00)
- AutoAdvExBench: Benchmarking Autonomous Exploitation of Adversarial Example Defenses (Rating: 5.83)
- EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage (Rating: 5.80)
- POIL: Preference Optimization for Imitation Learning (Rating: 5.80)
- MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents (Rating: 5.75)
- Discrete Latent Plans via Semantic Skill Abstractions (Rating: 5.75)
- LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture (Rating: 5.75)
- Graph Neural Networks Gone Hogwild (Rating: 5.75)
- Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers (Rating: 5.75)
- CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL (Rating: 5.75)
- Audio Large Language Models Can Be Descriptive Speech Quality Evaluators (Rating: 5.75)
- Tool-Planner: Task Planning with Clusters across Multiple Tools (Rating: 5.75)
- WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models (Rating: 5.75)
- Agent Skill Acquisition for Large Language Models via CycleQD (Rating: 5.75)
- AlphaZero Neural Scaling and Zipf's Law: a Tale of Board Games and Power Laws (Rating: 5.75)
- AgentHarm: Benchmarking Robustness of LLM Agents on Harmful Tasks (Rating: 5.75)
- MindSearch: Mimicking Human Minds Elicits Deep AI Searcher (Rating: 5.75)
- Grounding Multimodal Large Language Model in GUI World (Rating: 5.75)
- Knowledge Graph Based Agent For Complex, Knowledge-Intensive QA in Medicine (Rating: 5.75)
- InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation (Rating: 5.75)
- LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs (Rating: 5.75)
- OpenHands: An Open Platform for AI Software Developers as Generalist Agents (Rating: 5.75)
- RL, but don't do anything I wouldn't do (Rating: 5.75)
- SmartPretrain: Model-Agnostic and Dataset-Agnostic Representation Learning for Motion Prediction (Rating: 5.75)
- Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs (Rating: 5.75)
- Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning (Rating: 5.75)
- Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent (Rating: 5.75)
- HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions (Rating: 5.75)
- Integrating Expertise of Software Engineering Agents (Rating: 5.75)
- Language Guided Skill Discovery (Rating: 5.75)
- Collab: Controlled Decoding using Mixture of Agents for LLM Alignment (Rating: 5.75)
- AgentQuest: Benchmarking LLM and VLM Agents on Long-Horizon Interactive Tasks (Rating: 5.75)
- NeSyC: A Neuro-symbolic Continual Learner For Complex Embodied Tasks in Open Domains (Rating: 5.75)
- Neural Exploratory Landscape Analysis (Rating: 5.75)
- CityNav: Language-Goal Aerial Navigation Dataset Using Geographic Information (Rating: 5.75)
- Moral Alignment for LLM Agents (Rating: 5.67)
- Robotouille: An Asynchronous Planning Benchmark for LLM Agents (Rating: 5.67)
- CtD: Composition through Decomposition in Emergent Communication (Rating: 5.67)
- Commit0: Library Generation from Scratch (Rating: 5.67)
- MetaDesigner: Advancing Artistic Typography through AI-Driven, User-Centric, and Multilingual WordArt Synthesis (Rating: 5.67)
- Counterfactual Effect Decomposition in Multi-Agent Sequential Decision Making (Rating: 5.67)
- When Prompt Engineering Meets Software Engineering: CNL-P as Natural and Robust "APIs" for Human-AI Interaction (Rating: 5.67)
- Steering Large Language Models between Code Execution and Textual Reasoning (Rating: 5.67)
- Agents' Room: Narrative Generation through Multi-step Collaboration (Rating: 5.67)
- VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning (Rating: 5.60)
- Benchmarking Agentic Workflow Generation (Rating: 5.60)
- Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner (Rating: 5.60)
- Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents (Rating: 5.60)
- Scattered Forest Search: Smarter Code Space Exploration with LLMs (Rating: 5.60)
- EMOS: Embodiment-aware Heterogeneous Multi-robot Operating System with LLM Agents (Rating: 5.50)
- GUI-World: A GUI-oriented Dataset for Multimodal LLM-based Agents (Rating: 5.50)
- CaPo: Cooperative Plan Optimization for Efficient Embodied Multi-Agent Cooperation (Rating: 5.50)
- When LLMs Play the Telephone Game: Cumulative Changes and Attractors in Iterated Cultural Transmissions (Rating: 5.50)
- Can Textual Gradient Work in Federated Learning? (Rating: 5.50)
- RLSF: Reinforcement Learning via Symbolic Feedback (Rating: 5.50)
- Balancing Act: Diversity and Consistency in Large Language Model Ensembles (Rating: 5.50)
- Language Model Non-Myopic Generation for Reasoning and Planning (Rating: 5.50)
- Do LLMs estimate uncertainty well in instruction-following? (Rating: 5.50)
- MIRAI: Evaluating LLM Agents for International Event Forecasting (Rating: 5.50)
- DCA-Bench: A Benchmark for Dataset Curation Agents (Rating: 5.50)
- ToolACE: Enhancing Function Calling with Accuracy, Complexity, and Diversity (Rating: 5.50)
- Stochastic Semi-Gradient Descent for Learning Mean Field Games with Population-Aware Function Approximation (Rating: 5.50)
- FinBench: Benchmarking LLMs in Complex Financial Problem Solving and Reasoning (Rating: 5.50)
- SENSEI: Semantic Exploration Guided by Foundation Models to Learn Versatile World Models (Rating: 5.50)
- ML-Bench: Evaluating Large Language Models for Code Generation in Repository-Level Machine Learning Tasks (Rating: 5.50)
- An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandits (Rating: 5.50)
- Mr.Steve: Instruction-Following Agents in Minecraft with What-Where-When Memory (Rating: 5.50)
- Compositional Hardness of Code in Large Language Models - A Probabilistic Perspective (Rating: 5.50)
- PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding (Rating: 5.50)
- Meta-Referential Games to Learn Compositional Learning Behaviours (Rating: 5.50)
- JudgeBench: A Benchmark for Evaluating LLM-Based Judges (Rating: 5.50)
- Multi-Agent Collaborative Data Selection for Efficient Language Model Pretraining (Rating: 5.50)
- Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation (Rating: 5.50)
- AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments (Rating: 5.50)
- LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models (Rating: 5.50)
- Adaptive In-conversation Team Building for Language Model Agents (Rating: 5.50)
- Eligibility Traces for Confounding Robust Off-Policy Evaluation: A Causal Approach (Rating: 5.50)
- Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance (Rating: 5.50)
- Modeling dynamic social vision highlights gaps between deep learning and humans (Rating: 5.50)
- EmpathyRobot: A Dataset and Benchmark for Empathetic Task Planning of Robotic Agent (Rating: 5.50)
- JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework (Rating: 5.50)
- Tree Search for Language Model Agents (Rating: 5.50)
- MarS: a Financial Market Simulation Engine Powered by Generative Foundation Model (Rating: 5.50)
- Can We Trust Embodied Agents? Exploring Backdoor Attacks against Embodied LLM-Based Decision-Making Systems (Rating: 5.50)
- Do LLM Agents Have Regret? A Case Study in Online Learning and Games (Rating: 5.50)
- Generative World Explorer (Rating: 5.50)
- KnowTrace: Explicit Knowledge Tracing for Structured Retrieval-Augmented Generation (Rating: 5.50)
- Automated Red Teaming with GOAT: the Generative Offensive Agent Tester (Rating: 5.40)
- RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards (Rating: 5.40)
- MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct (Rating: 5.40)
- General Scene Adaptation for Vision-and-Language Navigation (Rating: 5.40)
- Empowering LLM Agents with Zero-Shot Optimal Decision-Making through Q-learning (Rating: 5.40)
- DenseGrounding: Improving Dense Language-Vision Semantics for Ego-centric 3D Visual Grounding (Rating: 5.40)
- Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale (Rating: 5.40)
- Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning (Rating: 5.40)
- Human Simulacra: Benchmarking the Personification of Large Language Models (Rating: 5.40)
- On the Convergence of No-Regret Dynamics in Information Retrieval Games with Proportional Ranking Functions (Rating: 5.33)
- DoF: A Diffusion Factorization Framework for Offline Multi-Agent Decision Making (Rating: 5.33)
- The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind (Rating: 5.33)
- Agent-as-a-Judge: Evaluating Agents with Agents (Rating: 5.33)
- Evolving Alignment via Asymmetric Self-Play (Rating: 5.33)
- Grounding Robot Policies with Visuomotor Language Guidance (Rating: 5.33)
- BraiNav: Incorporating Human Brain Activity to Enhance Robustness in Embodied Visual Navigation (Rating: 5.33)
- Multiagent Finetuning of Language Models (Rating: 5.33)
- What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices (Rating: 5.33)
- Benchmarking Intelligent LLM Agents for Conversational Data Analysis (Rating: 5.33)
- ADAM: An Embodied Causal Agent in Open-World Environments (Rating: 5.25)
- Competing Large Language Models in Multi-Agent Gaming Environments (Rating: 5.25)
- Towards Machine Theory of Mind with Large Language Model-Augmented Inverse Planning (Rating: 5.25)
- Graph-constrained Reasoning: Faithful Reasoning on Knowledge Graphs with Large Language Models (Rating: 5.25)
- AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML (Rating: 5.25)
- GIVE: Structured Reasoning with Knowledge Graph Inspired Veracity Extrapolation (Rating: 5.25)
- GuardAgent: Safeguard LLM Agent by a Guard Agent via Knowledge-Enabled Reasoning (Rating: 5.25)
- Auction-Based Regulation for Artificial Intelligence (Rating: 5.25)
- ToolBridge: An Open-Source Dataset to Equip LLMs with External Tool Capabilities (Rating: 5.25)
- SeCom: On Memory Construction and Retrieval for Personalized Conversational Agents (Rating: 5.25)
- ACC-Debate: An Actor-Critic Approach to Multi-Agent Debate (Rating: 5.25)
- Prompt Injection Benchmark for Foundation Model Integrated Systems (Rating: 5.25)
- Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention (Rating: 5.25)
- Monty Hall and Optimized Conformal Prediction to Improve Decision-Making with LLMs (Rating: 5.25)
- Fourier Head: Helping Large Language Models Learn Complex Probability Distributions (Rating: 5.25)
- Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge? (Rating: 5.25)
- ThinkBot: Embodied Instruction Following with Thought Chain Reasoning (Rating: 5.25)
- Agent-to-Sim: Learning Interactive Behavior Model from Casual Longitudinal Videos (Rating: 5.25)
- MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization (Rating: 5.25)
- Communicating Activations Between Language Model Agents (Rating: 5.25)
- AgentGym: Evaluating and Evolving Large Language Model-based Agents across Diverse Environments (Rating: 5.25)
- How to Correctly Do Semantic Backpropagation on Language-based Agentic Systems (Rating: 5.25)
- Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents (Rating: 5.25)
- Efficiently Scanning and Resampling Spatio-Temporal Tasks with Irregular Observations (Rating: 5.25)
- Adapting Communicating MLLMs on the Fly in Referring Expression Tasks (Rating: 5.25)
- Towards Efficient and Scalable Multi-agent Reasoning via Bayesian Nash Equilibrium (Rating: 5.25)
- Video Action Differencing (Rating: 5.25)
- A Contextual Online Learning Theory of Brokerage (Rating: 5.25)
- RwR: A Reason-while-Retrieve framework for Reasoning on Scene Graphs with LLMs (Rating: 5.25)
- Private Mechanism Design via Quantile Estimation (Rating: 5.25)
- MVGS: Multi-view-regulated Gaussian Splatting for Novel View Synthesis (Rating: 5.25)
- From Commands to Prompts: LLM-based Semantic File System (Rating: 5.25)
- ToolGen: Unified Tool Retrieval and Calling via Generation (Rating: 5.25)
- SpiritSight Agent: Advanced GUI Agent with One Look (Rating: 5.25)
- Efficient Active Imitation Learning with Random Network Distillation (Rating: 5.25)
- DUET: Decentralized Bilevel Optimization without Lower-Level Strong Convexity (Rating: 5.25)
- The Impact of Element Ordering on LM Agent Performance (Rating: 5.25)
- Coding Reliable LLM-based Integrated Task and Knowledge Agents with GenieWorksheets (Rating: 5.20)
- Tell Me What You Don't Know: Enhancing Refusal Capabilities of Role-Playing Agents via Representation Space Analysis and Editing (Rating: 5.20)
- Federated Coordination: Private and Distributed Strategy Alignment (Rating: 5.20)
- Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems (Rating: 5.20)
- BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments (Rating: 5.20)
- HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale (Rating: 5.17)
- AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials (Rating: 5.00)
- 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination (Rating: 5.00)
- ChemAgent: Self-updating Memories in Large Language Models Improves Chemical Reasoning (Rating: 5.00)
- Strategist: Self-improvement of LLM Decision Making via Bi-Level Tree Search (Rating: 5.00)
- Zero-Shot Task-Level Adaptation via Coarse-to-Fine Policy Refinement and Holistic-Local Contrastive Representations (Rating: 5.00)
- Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System (Rating: 5.00)
- REvolve: Reward Evolution with Large Language Models using Human Feedback (Rating: 5.00)
- Neuralized Markov Random Field for Interaction-Aware Stochastic Human Trajectory Prediction (Rating: 5.00)
- Breaking Mental Set to Improve Reasoning through Diverse Multi-Agent Debate (Rating: 5.00)
- BRIDGE: Bootstrapping Text to Guide Time-Series Generation via Multi-Agent Iterative Optimisation and Diffusion Modelling (Rating: 5.00)
- Re-Aligning Language to Visual Objects with an Agentic Workflow (Rating: 5.00)
- Can VLMs Play Action Role-Playing Games? Take Black Myth Wukong as a Study Case (Rating: 5.00)
- Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems (Rating: 5.00)
- On the Diversity of Synthetic Data and its Impact on Training Large Language Models (Rating: 5.00)
- Better than Your Teacher: LLM Agents that learn from Privileged AI Feedback (Rating: 5.00)
- IGOR: Image-GOal Representations are the Atomic Building Blocks for Next-Level Generalization in Embodied AI (Rating: 5.00)
- Mora: Enabling Generalist Video Generation via A Multi-Agent Framework (Rating: 5.00)
- Rational Decision-Making Agent with Learning Internal Utility Judgment (Rating: 5.00)
- Efficacy of Language Model Self-Play in Non-Zero-Sum Games (Rating: 5.00)
- STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models (Rating: 5.00)
- AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions (Rating: 5.00)
- Intelligent Go-Explore: Standing on the Shoulders of Giant Foundation Models (Rating: 5.00)
- Riemannian Manifold Learning for Stackelberg Games with Neural Flow Representations (Rating: 5.00)
- Exploring Prosocial Irrationality for LLM Agents: A Social Cognition View (Rating: 5.00)
- AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs (Rating: 5.00)
- OmniParser for Pure Vision Based GUI Agent (Rating: 5.00)
- Separate the Wheat from the Chaff: Winnowing Down Divergent Views in Retrieval Augmented Generation (Rating: 5.00)
- Triples as the Key: Structuring Makes Decomposition and Verification Easier in LLM-based TableQA (Rating: 5.00)
- Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants (Rating: 5.00)
- Towards Full Delegation: Designing Ideal Agentic Behaviors for Travel Planning (Rating: 5.00)
- Sample Efficient Alignment for LLMs (Rating: 5.00)
- Closed-Loop Long-Horizon Robotic Planning via Equilibrium Sequence Modeling (Rating: 5.00)
- ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL (Rating: 5.00)
- Improving Large Language Model based Multi-Agent Framework through Dynamic Workflow Updating (Rating: 5.00)
- Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems (Rating: 5.00)
- CoPS: Empowering LLM Agents with Provable Cross-Task Experience Sharing (Rating: 5.00)
- Actions Speak Louder Than Words: Rate-Reward Trade-off in Markov Decision Processes (Rating: 5.00)
- ChinaTravel: A Real-World Benchmark for Language Agents in Chinese Travel Planning (Rating: 5.00)
- OptiBench: Benchmarking Large Language Models in Optimization Modeling with Equivalence-Detection Evaluation (Rating: 5.00)
- DRESSing Up LLM: Efficient Stylized Question-Answering via Style Subspace Editing (Rating: 5.00)
- Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration (Rating: 5.00)
- Auto-Arena: Automating LLM Evaluations with Agent Peer Battles and Committee Discussions (Rating: 5.00)
- WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning (Rating: 5.00)
- InvestAlign: Align LLMs with Investor Decision-Making under Herd Behavior (Rating: 5.00)
- Informing Reinforcement Learning Agents by Grounding Language to Markov Decision Processes (Rating: 5.00)
- Understanding Prejudice and Fidelity of Diverge-to-Converge Multi-Agent Systems (Rating: 5.00)
- Who Should Join the Decision-Making Table? Targeted Expert Selection for Enhanced Human-AI Collaboration (Rating: 4.83)
- Digi-Q: Transforming VLMs to Device-Control Agents via Value-Based Offline RL (Rating: 4.80)
- MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning (Rating: 4.80)
- On the Resilience of Multi-Agent Systems with Malicious Agents (Rating: 4.80)
- Haland: Human-AI Coordination via Policy Generation from Language-guided Diffusion (Rating: 4.80)
- AgentMonitor: A Plug-and-Play Framework for Predictive and Secure Multi-Agent Systems (Rating: 4.80)
- Knapsack Schema Linking Agent for LLM-Based Text-to-SQL Generation (Rating: 4.80)
- Stochastic Matching Bandits under Preference Feedback (Rating: 4.80)
- How language models extrapolate outside the training data: A Case study in Textualized Gridworld (Rating: 4.80)
- Empowering Users in Digital Privacy Management through Interactive LLM-Based Agents (Rating: 4.80)
- Deep Exploration with PAC-Bayes (Rating: 4.75)
- On the Modeling Capabilities of Large Language Models for Sequential Decision Making (Rating: 4.75)
- War and Peace (WarAgent): LLM-based Multi-Agent Simulation of World Wars (Rating: 4.75)
- Task-oriented Sequential Grounding in 3D Scenes (Rating: 4.75)
- BioKGBench: A Knowledge Graph Checking Benchmark of AI Agent for Biomedical Science (Rating: 4.75)
- 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Referred Object Grounding (Rating: 4.75)
- Automated Design of Agentic Systems (Rating: 4.75)
- Language-conditioned Multi-Style Policies with Reinforcement Learning (Rating: 4.75)
- LLF-Bench: A Benchmark for Interactive Learning from Language Feedback (Rating: 4.75)
- MAD-Sherlock: Multi-Agent Debates for Out-of-Context Misinformation Detection (Rating: 4.75)
- Research Town: Simulator of Research Community (Rating: 4.75)
- ToM-agent: Large Language Models as Theory of Mind Aware Generative Agents with Counterfactual Reflection (Rating: 4.75)
- COMMA: A Communicative Multimodal Multi-Agent Benchmark (Rating: 4.75)
- SWE-bench Multimodal: Do Autonomous Programming Systems Generalize to New Software Domains? (Rating: 4.75)
- Dissecting Adversarial Robustness of Multimodal LM Agents (Rating: 4.75)
- ESDMotion: End-to-end Motion Prediction Only with SD Maps (Rating: 4.75)
- MetaTool: Facilitating Large Language Models to Master Tools with Meta-task Augmentation (Rating: 4.75)
- Data Interpreter: An LLM Agent For Data Science (Rating: 4.75)
- Sparse Rewards Can Self-Train Dialogue Agents (Rating: 4.75)
- MISR: Measuring Instrumental Self-Reasoning in Frontier Models (Rating: 4.75)
- Controlling Large Language Model Agents with Entropic Activation Steering (Rating: 4.75)
- Emergence of Hierarchical Emotion Representations in Large Language Models (Rating: 4.75)
- JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking (Rating: 4.75)
- Truthful Aggregation of LLMs with an Application to Online Advertising (Rating: 4.75)
- LifelongSotopia: Evaluating Social Intelligence Of Language Agents Over Lifelong Social Interactions (Rating: 4.75)
- DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient Self-Driving (Rating: 4.75)
- MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents (Rating: 4.75)
- A Research on Result Interpretability of Medical AI Based on Large Language Model (Rating: 4.75)
- DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects (Rating: 4.75)
- WebCanvas: Benchmarking Web Agents in Online Environments (Rating: 4.75)
- RoundTable: Investigating Group Decision-Making Mechanism in Multi-Agent Collaboration (Rating: 4.75)
- Interactive Dialogue Agents via Reinforcement Learning with Hindsight Regenerations (Rating: 4.75)
- STRIDE: A Tool-Assisted LLM Agent Framework for Strategic and Interactive Decision-Making (Rating: 4.75)
- NNetscape Navigator: Complex Demonstrations for Web Agents Without a Demonstrator (Rating: 4.75)
- Knowing What Not to Do: Leverage Language Model Insights for Action Space Pruning in Multi-agent Reinforcement Learning (Rating: 4.75)
- ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents (Rating: 4.75)
- Simulate Before Act: Model-Based Planning for Web Agents (Rating: 4.75)
- Modeling Unseen Environments with Language-guided Composable Causal Components in Reinforcement Learning (Rating: 4.75)
- Beyond Numeric Awards: In-Context Dueling Bandits with LLM Agents (Rating: 4.75)
- BOIL: Learning Environment Personalized Information (Rating: 4.75)
- Query-Efficient Planning with Language Models (Rating: 4.75)
- GLEE: A Framework and Benchmark for LLM Evaluation in Language-based Economics (Rating: 4.75)
- TestAgent: An Adaptive and Intelligent Expert for Human Assessment (Rating: 4.75)
- Enhancing Language Model Agents using Diversity of Thoughts (Rating: 4.75)
- Prioritize Alignment in Dataset Distillation (Rating: 4.75)
- Wolf: Accurate Video Captioning with a World Summarization Framework (Rating: 4.75)
- DialSim: A Real-Time Simulator for Evaluating Long-Term Multi-Party Dialogue Understanding of Conversational Agents (Rating: 4.75)
- From an LLM Swarm to a PDDL-empowered Hive: Planning Self-executed Instructions in a Multi-modal Jungle (Rating: 4.67)
- Deviation Ratings: A general, clone invariant rating method (Rating: 4.67)
- ControlAgent: Automating Control System Design via Novel Integration of LLM Agents and Domain Expertise (Rating: 4.67)
- Direct Multi-agent Motion Generation Preference Alignment with Implicit Feedback from Demonstrations (Rating: 4.67)
- Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities (Rating: 4.67)
- Improving the Efficiency of Test-Time Search in LLMs with Backtracking (Rating: 4.67)
- Disentangling Reasoning Tokens and Boilerplate Tokens For Language Model Fine-tuning (Rating: 4.67)
- VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use (Rating: 4.67)
- Review and Rebuttal: Zero-shot In-context Adversarial Learning for Improving Research Ideation (Rating: 4.67)
- LAM Simulator: Advancing Large Action Model Training for Agent via Online Exploration and Feedback Simulation (Rating: 4.67)
- GridAgent: A 2D Grid-Based Game Framework And Benchmark For Multimodal Large Language Models (Rating: 4.67)
- Vision-Language Models Provide Promptable Representations for Reinforcement Learning (Rating: 4.67)
- A Generalist Hanabi Agent (Rating: 4.64)
- Unlocking Video-LLM via Agent-of-Thoughts Distillation (Rating: 4.60)
- DPM: Dual Preferences-based Multi-Agent Reinforcement Learning (Rating: 4.60)
- IDEA: Enhancing the Rule Learning Ability of Large Language Model Agent through Induction, Deduction, and Abduction (Rating: 4.60)
- LARM: Large Auto-Regressive Model for Long-Horizon Embodied Intelligence (Rating: 4.50)
- Cognitive Insights and Stable Coalition Matching for Fostering Multi-Agent Cooperation (Rating: 4.50)
- Chain of Ideas: Revolutionizing Research in Idea Development with LLM Agents (Rating: 4.50)
- Agent Workflow Memory (Rating: 4.50)
- MALLM-GAN: Multi-Agent Large Language Model as Generative Adversarial Network for Synthesizing Tabular Data (Rating: 4.50)
- RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation (Rating: 4.50)
- UrbanWorld: An Urban World Model for 3D City Generation (Rating: 4.50)
- Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses (Rating: 4.50)
- Large Legislative Models: Towards Efficient AI Policymaking in Economic Simulations (Rating: 4.50)
- Progressive LLM Alignments Using Two-Player Games (Rating: 4.50)
- Uncertainty-aware Human Mobility Modeling and Anomaly Detection (Rating: 4.50)
- MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models (Rating: 4.50)
- Personalized Federated Learning via Variational Message Passing (Rating: 4.50)
- Towards Safe and Honest AI Agents with Neural Self-Other Overlap (Rating: 4.50)
- Improving Planning with Large Language Models: A Modular Agentic Architecture (Rating: 4.50)
- Choices are More Important than Efforts: LLM Enables Efficient Multi-Agent Exploration (Rating: 4.50)
- Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs (Rating: 4.50)
- DiverseAgentEntropy: Quantifying Black-Box LLM Uncertainty through Diverse Perspectives and Multi-Agent Interaction (Rating: 4.50)
- CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants (Rating: 4.50)
- Visually Descriptive Language Model for Vector Graphics Reasoning (Rating: 4.50)
- PREDICT: Preference Reasoning by Evaluating Decomposed preferences Inferred from Candidate Trajectories (Rating: 4.50)
- Large Language Model-driven Large Neighborhood Search for Large-Scale MILP Problems (Rating: 4.50)
- Towards Human-like Virtual Beings: Simulating Human Behavior in 3D Scenes (Rating: 4.50)
- Zodiac: A Cardiologist-Level LLM Framework for Multi-Agent Diagnostics (Rating: 4.50)
- RedCodeAgent: Automatic Red-teaming Agent against Code Agents (Rating: 4.50)
- Advancing Algorithmic Trading with Large Language Models: A Reinforcement Learning Approach for Stock Market Optimization (Rating: 4.50)
- NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative (Rating: 4.50)
- PokeChamp: an Expert-level Minimax Language Agent for Competitive Pokemon (Rating: 4.50)
- LLaMP: Large Language Model Made Powerful for High-fidelity Materials Knowledge Retrieval (Rating: 4.50)
- AISciVision: A Framework for Specializing Large Multimodal Models in Scientific Image Classification (Rating: 4.50)
- CycleResearcher: Improving Automated Research via Automated Review (Rating: 4.50)
- Shell Games: Control Protocols for Adversarial AI Agents (Rating: 4.50)
- Versatile Motion-Language Models for Multi-turn Interactive Agents (Rating: 4.50)
- FlowAgent: a New Paradigm for Workflow Agent (Rating: 4.50)
- Q* Agent: Optimizing Language Agents with Q-Guided Exploration (Rating: 4.50)
- Simulating Human-like Daily Activities with Desire-driven Autonomy (Rating: 4.50)
- AgentRefine: Enhancing Agent Generalization through Refinement Tuning (Rating: 4.50)
- RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning (Rating: 4.50)
- Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning (Rating: 4.40)
- CP-Guard+: A New Paradigm for Malicious Agent Detection and Defense in Collaborative Perception (Rating: 4.40)
- ICDA: Interactive Causal Discovery through Large Language Model Agents (Rating: 4.40)
- Adversarial Attacks on Cooperative Multi-agent Bandits (Rating: 4.40)
- Adaptive Video Understanding Agent: Enhancing Efficiency with Dynamic Frame Sampling and Feedback-driven Reasoning (Rating: 4.40)
- ReFeR: Improving Evaluation and Reasoning through Hierarchy of Models (Rating: 4.40)
- Synthesizing Bonds: Enhancing Adult Attachment Predictions with LLM-Generated Data (Rating: 4.33)
- Multi-Agent Path Finding via Decision Transformer and LLM Collaboration (Rating: 4.33)
- Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models (Rating: 4.33)
- Efficient Reinforcement Learning for Global Decision Making in the Presence of Local Agents at Scale (Rating: 4.33)
- Enhance Reasoning for Large Language Models with Reinforcement Learning in the Game Werewolf (Rating: 4.33)
- Towards Specialized Web Agents Using Production-Scale Workflow Data (Rating: 4.33)
- Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation (Rating: 4.33)
- CogMath: Evaluating LLMs' Authentic Mathematical Ability from a Cognitive Perspective (Rating: 4.33)
- Detecting Out-of-Context Misinformation via Multi-Agent and Multi-Grained Retrieval (Rating: 4.33)
- Last Iterate Convergence in Monotone Mean Field Games (Rating: 4.33)
- EcoAct: Economic Agent Determines When to Register What Action (Rating: 4.33)
- Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation (Rating: 4.33)
- CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only (Rating: 4.33)
- CodeCloak: A Method for Mitigating Code Leakage by LLM Code Assistants (Rating: 4.33)
- MemSim: A Bayesian Simulator for Evaluating Memory of LLM-based Personal Assistants (Rating: 4.25)
- Decoding Intelligence: A Framework for Certifying Knowledge Comprehension in LLMs (Rating: 4.25)
- VideoAgent: Self-Improving Video Generation (Rating: 4.25)
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition (Rating: 4.25)
- CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents (Rating: 4.25)
- SnapMem: Snapshot-based 3D Scene Memory for Embodied Exploration and Reasoning (Rating: 4.25)
- MetaAgent: Automatically Building Multi-Agent System based on Finite State Machine (Rating: 4.25)
- Teaching Transformers Causal Reasoning through Axiomatic Training (Rating: 4.25)
- Learning 4D Embodied World Models (Rating: 4.25)
- Contextual Experience Replay for Continual Learning of Language Agents (Rating: 4.25)
- Large Language Models Can Self-Improve At Web Agent Tasks (Rating: 4.25)
- Talking Vehicles: Cooperative Driving via Natural Language (Rating: 4.25)
- Open-World Planning via Lifted Regression with LLM-based Affordances for Embodied Agents (Rating: 4.25)
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction (Rating: 4.25)
- ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents (Rating: 4.25)
- Towards Evaluating Generalist Agents: An Automated Benchmark in Open World (Rating: 4.25)
- SMART: Self-Learning Meta-strategy Agent for Reasoning Tasks (Rating: 4.25)
- OpenCity: A Scalable Platform to Simulate Urban Activities with Massive LLM Agents (Rating: 4.25)
- OASIS: Open Agents Social Interaction Simulations on a Large Scale (Rating: 4.25)
- Students Rather Than Experts: A New AI for Education Pipeline to Model More Human-like and Personalised Early Adolescences (Rating: 4.25)
- Provably Efficient and Practical Self-Play for Better LLM Alignment (Rating: 4.25)
- Agents Help Agents: Exploring Training-Free Knowledge Distillation for Small Language Models in Data Science Code Generation (Rating: 4.25)
- Evaluating the Goal-Directedness of Large Language Models (Rating: 4.25)
- SimSiam Naming Game: A Unified Approach for Representation Learning and Emergent Communication (Rating: 4.25)
- In-Context Learning for Games (Rating: 4.25)
- Optimizing Inference-Time Reasoning in LLMs via Retrieval-Augmented Reflection (Rating: 4.25)
- AltDev: Achieving Real-Time Alignment in Multi-Agent Software Development (Rating: 4.25)
- A Third-Person Appraisal Agent: Learning to Reason About Emotions in Conversational Contexts (Rating: 4.25)
- How Can LLM Guide RL? A Value-Based Approach (Rating: 4.25)
- Explicit-Constrained Single Agent for Enhanced Task-Solving in LLMs (Rating: 4.25)
- LLMs for Generalizable Language-Conditioned Policy Learning under Minimal Data Requirements (Rating: 4.25)
- MorphAgent: Empowering Agents through Self-Evolving Profiles and Decentralized Collaboration (Rating: 4.25)
- Uncertainty Quantification with Generative-Semantic Entropy Estimation for Large Language Models (Rating: 4.25)
- UI-Pro: A Hidden Recipe for Building Vision-Language Models for GUI Grounding (Rating: 4.25)
- AgentSquare: Automatic LLM Agent Search in Modular Design Space (Rating: 4.25)
- CogniPair - Dynamic LLM Matching Algorithm in Chaotic Environments Mimicking Human Cognitive Processes for Relationship Pairing (Rating: 4.25)
- Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? (Rating: 4.25)
- AutoHijacker: Automatic Indirect Prompt Injection Against Black-box LLM Agents (Rating: 4.25)
- Skill Discovery using Language Models (Rating: 4.25)
- ReAcTree: Hierarchical Task Planning with Dynamic Tree Expansion using LLM Agent Nodes (Rating: 4.25)
- Odyssey: Empowering Minecraft Agents with Open-World Skills (Rating: 4.25)
- 'No' Matters: Out-of-Distribution Detection in Multimodality Long Dialogue (Rating: 4.20)
- ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World (Rating: 4.20)
- B-MoCA: Benchmarking Mobile Device Control Agents across Diverse Configurations (Rating: 4.20)
- AdvWeb: Controllable Black-box Attacks on VLM-powered Web Agents (Rating: 4.20)
- LossAgent: Towards Any Optimization Objectives for Image Processing with LLM Agents (Rating: 4.20)
- Large-Scale Dynamic Graph Generation via LLM-based Agent Simulation (Rating: 4.20)
- PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play (Rating: 4.20)
- VeSX: A Framework Featured by Verification, Self-Correction and In-context Learning for Web Automation Tasks (Rating: 4.20)
- Improving Model Alignment Through Collective Intelligence of Open-Source Models (Rating: 4.20)
- Evolving Symbolic 3D Visual Grounder with Weakly Supervised Reflection (Rating: 4.17)
- Plan B: Training LLMs to fail less severely (Rating: 4.00)
- Memory-Driven Multimodal Chain of Thought for Embodied Long-Horizon Task Planning (Rating: 4.00)
- The Ability of Large Language Models to Evaluate Constraint-satisfaction in Agent Responses to Open-ended Requests (Rating: 4.00)
- Towards Reliable Offline Reinforcement Learning via Lyapunov Uncertainty Control (Rating: 4.00)
- Computing Ex Ante Equilibrium in Heterogeneous Zero-Sum Team Games (Rating: 4.00)
- MS$^3$M: Multi-Stage State Space Model for Motion Forecasting (Rating: 4.00)
- Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines (Rating: 4.00)
- Two Heads Are Better Than One: A Multi-Agent System Has the Potential to Improve Scientific Idea Generation (Rating: 4.00)
- Leveraging Imitation Learning and LLMs for Efficient Hierarchical Reinforcement Learning (Rating: 4.00)
- GenoAgent: A Baseline method for LLM-Based Exploration of Gene Expression Data in Alignment with Bioinformaticians (Rating: 4.00)
- EnvBridge: Bridging Diverse Environments with Cross-Environment Knowledge Transfer for Embodied AI (Rating: 4.00)
- Efficient Predictive Counterfactual Regret Minimization$^+$ Algorithm in Solving Extensive-Form Games (Rating: 4.00)
- Contextual Bandits with Entropy-based Human Feedback (Rating: 4.00)
- LeanAgent: Lifelong Learning for Formal Theorem Proving (Rating: 4.00)
- MultiMedia-Agent: A Multimodal Agent for Multimedia Content Generation (Rating: 4.00)
- SWE-Bench+: Enhanced Coding Benchmark for LLMs (Rating: 4.00)
- Scaling Laws for Pre-training Agents and World Models (Rating: 4.00)
- AutoRedTeamer: An Autonomous Red Teaming Agent Against Language Models (Rating: 4.00)
- SmartBackdoor: Malicious Language Model Agents that Avoid Being Caught (Rating: 4.00)
- CALF: Benchmarking Evaluation of LFQA Using Chinese Examinations (Rating: 4.00)
- Sketch-Plan-Generalize: Learning Inductive Representations for Grounded Spatial Concepts (Rating: 4.00)
- Shapley Value Approximation based on k-Additive Games (Rating: 4.00)
- MAC: A Multimodal Benchmark for Understanding and Generating Academic Journal Covers (Rating: 4.00)
- On Inherent 3D Reasoning of VLMs in Indoor Scene Layout Design (Rating: 4.00)
- YOLO-MARL: You Only LLM Once for Multi-agent Reinforcement Learning (Rating: 4.00)
- Entropy-Based Uncertainty Modeling for Trajectory Prediction in Autonomous Driving (Rating: 4.00)
- Large Language Model Critics for Execution-Free Evaluation of Code Changes (Rating: 4.00)
- iAgent: LLM Agent as a Shield between User and Recommender Systems (Rating: 4.00)
- Benchmark for Temporal, Ambiguous, and Grounded Embodied Question-Answering (Rating: 4.00)
- ReGen: Generative Robot Simulation via Inverse Design (Rating: 4.00)
- Optimal Transport-Based Domain Alignment as a Preprocessing Step for Federated Learning (Rating: 4.00)
- Understanding Data Poisoning Attacks for RAG: Insights and Algorithms (Rating: 4.00)
- DSMentor: Enhancing Data Science Agents with Curriculum Learning and Online Knowledge Accumulation (Rating: 4.00)
- SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI (Rating: 4.00)
- Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning (Rating: 4.00)
- Grey-box Prompt Optimization and Fine-Tuning for Cloud-Edge LLM Agents (Rating: 4.00)
- Gödel Agent: A Self-Referential Framework Helps for Recursively Self-Improvement (Rating: 4.00)
- MuLan: Multimodal-LLM Agent for Progressive and Interactive Multi-Object Diffusion (Rating: 4.00)
- ANALOGXPERT: AUTOMATING ANALOG TOPOLOGY SYNTHESIS BY INCORPORATING CIRCUIT DESIGN EXPERTISE INTO LARGE LANGUAGE MODELS (Rating: 4.00)
- Designing Deep Learning Programs with Large Language Models (Rating: 4.00)
- Inverse Attention Agent in Multi-Agent System (Rating: 4.00)
- Multi-Grained Knowledge for Retrieval-Augmented Question Answering on Hyper-long Contexts (Rating: 4.00)
- SCALE: Augmenting Content Analysis via LLM Agents and AI-Human Collaboration (Rating: 4.00)
- Embodied Instruction Following in Unknown Environments (Rating: 4.00)
- Language-Guided Object-Centric World Models for Predictive Control (Rating: 4.00)
- Denial-of-Service Poisoning Attacks against Large Language Models (Rating: 4.00)
- Symbolic Learning Enables Self-Evolving Agents (Rating: 4.00)
- AD-H: Autonomous Driving with Hierarchical Agents (Rating: 4.00)
- Towards LLM4Floorplan: Agents Can Do What Engineers Do in Chip Design (Rating: 4.00)
- Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback (Rating: 4.00)
- A Super-Aligned Driving Generalist Is Your Cockpit (Rating: 3.83)
- Make LLMs better zero-shot reasoners: structure-oriented autonomous reasoning (Rating: 3.83)
- LLM-Mediated Guidance of MARL Systems (Rating: 3.80)
- LLMPhy: Complex Physical Reasoning Using Large Language Models and World Models (Rating: 3.80)
- HeurAgenix: A Multi-Agent LLM-Based Paradigm for Adaptive Heuristic Evolution and Selection in Combinatorial Optimization (Rating: 3.80)
- Unlocking Speech Instruction Data Potential with Query Rewriting (Rating: 3.75)
- Multi-Modal Foundation Models Induce Interpretable Molecular Graph Languages (Rating: 3.75)
- Verbalized Bayesian Persuasion (Rating: 3.75)
- S3E: Semantic Symbolic State Estimation With Vision-Language Foundation Models (Rating: 3.75)
- Egocentric Vision Language Planning (Rating: 3.75)
- Thought-Retriever: Don’t Just Retrieve Raw Data, Retrieve Thoughts (Rating: 3.75)
- SparsitySolver: Efficient Reinforcement Learning-based Pruning for LLMs (Rating: 3.75)
- MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control (Rating: 3.75)
- Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial? (Rating: 3.75)
- Inductive Linguistic Reasoning with Large Language Models (Rating: 3.75)
- Feynman: Knowledge-Infused Diagramming Agent for Scaling Visual Reasoning Data (Rating: 3.75)
- EmbodiedCity: A Benchmark Platform for Embodied Agent in Real-world City Environment (Rating: 3.75)
- NextBestPath: Efficient 3D Mapping of Unseen Environments (Rating: 3.75)
- Large language models as windows on the mental structure of psychopathology (Rating: 3.75)
- Harnessing Input-adaptive Inference for Efficient Vision-and-Language Navigation (Rating: 3.75)
- From Reward Shaping to Q-Shaping: Achieving Unbiased Learning with LLM-Guided Knowledge (Rating: 3.67)
- A little less conversation, a little more action, please: Investigating the physical common-sense of LLMs in a 3D embodied environment (Rating: 3.67)
- Adversarial Testing in LLMs: Insights into Decision-Making Vulnerabilities (Rating: 3.67)
- Flow-of-Action: SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis (Rating: 3.67)
- InteractiveCOT: Aligning Dynamic Chain-of-Thought Planning for Embodied Decision-Making (Rating: 3.67)
- Learning to Imitate with Less: Efficient Individual Behavior Modeling in Chess (Rating: 3.67)
- Decentralized Blockchain-based Robust Multi-agent Multi-armed Bandit (Rating: 3.67)
- DYSTIL: Dynamic Strategy Induction with Large Language Models for Reinforcement Learning (Rating: 3.67)
- Solving Robotics Problems in Zero-Shot with Vision-Language Models (Rating: 3.67)
- Scalable and Accurate Graph Reasoning with LLM-based Multi-Agents (Rating: 3.67)
- Boundless Socratic Learning (Rating: 3.60)
- Learning a Bi-directional Driving Data Generator via Large Multi-modal Model Tuning (Rating: 3.60)
- A Scalable Communication Protocol for Networks of Large Language Models (Rating: 3.50)
- SELA: Tree-Search Enhanced LLM Agents for Automated Machine Learning (Rating: 3.50)
- ALIA: An LLM for Industrial Assets using Synthetic Data (Rating: 3.50)
- FAIRMINDSIM: ALIGNMENT OF BEHAVIOR, EMOTION, AND BELIEF IN HUMANS AND LLM AGENTS AMID ETHICAL DILEMMAS (Rating: 3.50)
- Defend against Jailbreak Attacks via Debate with Partially Perceptive Agents (Rating: 3.50)
- Cracking the Collective Mind: Adversarial Manipulation in Multi-Agent Systems (Rating: 3.50)
- Beyond Browsing: API-Based Web Agents (Rating: 3.50)
- AutoCoder: Enhancing Code Large Language Model with AIEV-INSTRUCT (Rating: 3.50)
- FEABench: Evaluating Language Models on Real World Physics Reasoning Ability (Rating: 3.50)
- AutoPR: Automatically Pull Request Generation for Fix Issued Bugs of CodeBase (Rating: 3.50)
- Massively Multi-Agents Reveal That Large Language Models Can Understand Value (Rating: 3.50)
- SimUSER: When Language Models Pretend to Be Believable Users in Recommender Systems (Rating: 3.50)
- Enhancing Software Agents with Monte Carlo Tree Search and Hindsight Feedback (Rating: 3.50)
- Zero-Shot Goal Dialogue via Reinforcement Learning on Imagined Conversations (Rating: 3.50)
- AIME: AI System Optimization via Multiple LLM Evaluators (Rating: 3.50)
- Planning in Strawberry Fields: Evaluating and Improving the Planning and Scheduling Capabilities of LRM o1 (Rating: 3.50)
- REDO: Execution-Free Runtime Error Detection for Coding Agents (Rating: 3.50)
- Agent-G: An Agentic Framework for Graph Retrieval Augmented Generation (Rating: 3.50)
- LLM-Exp: Exploring the Policy in Reinforcement Learning with Large Language Models (Rating: 3.50)
- Value Explicit Pretraining for Learning Transferable Representations (Rating: 3.50)
- Logic Agent: Enhancing Validity with Logic Rule Invocation (Rating: 3.50)
- iMotion-LLM: Motion Prediction Instruction Tuning (Rating: 3.50)
- Autoverse: an Evolvable Game Language for Learning Robust Embodied Agents (Rating: 3.50)
- Extracting Heuristics from Large Language Models for Reward Shaping in Reinforcement Learning (Rating: 3.50)
- Self-controller: Controlling LLMs with Multi-round Step-by-step Self-awareness (Rating: 3.50)
- LASER: Script Execution by Autonomous Agents for On-demand Traffic Simulation (Rating: 3.50)
- EconAI: Preference-driven Agents Simulating Economic Activities via Large Language Model (Rating: 3.50)
- VISION-LANGUAGE MODELS AS TRAINERS FOR INSTRUCTION-FOLLOWING AGENTS (Rating: 3.50)
- Your Agent Can Defend Itself against Backdoor Attacks (Rating: 3.50)
- Probing the contents of text, behavior, and brain data toward improving human-LLM alignment (Rating: 3.50)
- LLMs Synergy: From Closed-Source Prototyping to Open-Source Model based Instruction Following (Rating: 3.40)
- Training Open-ended Policies to follow Video-prompt Instructions with Reinforcement Learning (Rating: 3.40)
- Multi-Agent Causal Discovery Using Large Language Models (Rating: 3.40)
- GFLAgent: Green Federated Learning Agent for Alleviating Heterogeneity (Rating: 3.40)
- MAC-CAFE: Multi-actor, Centralized Critic Architecture for Feedback-driven Editing (Rating: 3.25)
- TeamCraft: A Benchmark for Embodied Multi-Agent Systems in Minecraft (Rating: 3.25)
- DataSciBench: An LLM Agent Benchmark for Data Science (Rating: 3.20)
- How Social is It? A Benchmark for LLMs' Capabilities in Multi-user Multi-turn Social Agent Tasks (Rating: 3.00)
- HomieBot: an Adaptive System for Embodied Mobile Manipulation in Open Environments (Rating: 3.00)
- Entering Real Social World! Benchmarking the Theory of Mind and Socialization Capabilities of LLMs from a First-person Perspective (Rating: 3.00)
- SOP-Agent: Empower General Purpose AI Agent with Domain-Specific SOPs (Rating: 3.00)
- ProCEED: Prototype Consolidation and Ensemble-based Exemplar-Free Deep Incremental Learning (Rating: 3.00)
- EchoQA: Tuning into the Heart of Echocardiogram Reports (Rating: 3.00)
- Seeker: Enhancing Exception Handling in Code with a LLM-based Multi-Agent Approach (Rating: 3.00)
- Enhancing Multi-Agent Learning in Real-World Interactive Environments through Process Reward Decomposition (Rating: 3.00)
- Test-Time RAG: Enhancing Long Context Understanding in LLMs with Retrieval-Augmented Mechanisms (Rating: 3.00)
- Grounded Robotic Action-Rule Induction through Language Models (GRAIL) (Rating: 3.00)
- Investigating Self-Attention: Its Impact on Sample Efficiency in Deep Reinforcement Learning (Rating: 3.00)
- Orca: Enhancing Role-Playing Abilities of Large Language Models by Integrating Personality Traits (Rating: 3.00)
- Rapfi: Distilling Efficient Neural Network for the Game of Gomoku (Rating: 3.00)
- Planning with MCTS: Enhancing Problem-Solving in Large Language Models (Rating: 3.00)
- StarCraft II Arena: Evaluating LLMs in Strategic Planning, Real-Time Decision Making, and Adaptability (Rating: 3.00)
- ChemThinker: Thinking Like a Chemist with Multi-Agent LLMs for Deep Molecular Insights (Rating: 3.00)
- DebUnc: Improving Large Language Model Agent Communication Via Uncertainty Metrics (Rating: 3.00)
- AutoModel: Autonomous Model Development for Image Classification with LLM Agents (Rating: 3.00)
- Very Large-Scale Multi-Agent Simulation with LLM-Powered Agents (Rating: 3.00)
- On the Design and Analysis of LLM-Based Algorithms (Rating: 3.00)
- I Want to Break Free! Persuasion and Anti-Social Behavior of LLMs in Multi-Agent Settings with Social Hierarchy (Rating: 3.00)
- Human-like Communication Strategies for Improved Multi-Agent Reinforcement Learning (Rating: 3.00)
- GLIMO: Grounding Large Language Models With Imperfect World Models (Rating: 3.00)
- IDS-Agent: An LLM Agent for Explainable Intrusion Detection in IoT Networks (Rating: 3.00)
- Foundation Models for Enhanced Exploration in Reinforcement Learning (Rating: 3.00)
- ActionFiller: Fill-In-The-Blank Prompting for OS Agent (Rating: 3.00)
- FALCON: A Feedback-Driven Adaptive Long/Short-Term Memory Reinforced Coding Optimization (Rating: 3.00)
- LOB-Bench: Benchmarking Generative AI for Finance - with an Application to Limit Order Book Markets (Rating: 2.67)
- RePrompt: Prompt Engineering for Large Language Models Agents through Reflection (Rating: 2.50)
- Towards Autonomous Agents: Adaptive-planning, Reasoning, and Acting in Language Models (Rating: 2.50)
- Adversarial Multi-Agent Evaluation of Large Language Models through Iterative Debate (Rating: 2.50)
- DrugAgent: Multi-Agent Large Language Model-Based Reasoning for Drug-Target Interaction Prediction and Repurposing (Rating: 2.50)
- Why Solving Multi-agent Path Finding with Large Language Models has not Succeeded Yet (Rating: 2.50)
- EMERGENCE OF GROUNDED, OPTIMALLY COMPOSITIONAL SPATIAL LANGUAGE AMONG HOMOGENEOUS AGENTS (Rating: 2.33)
- Leveraging System-Prompt Attention to Counteract Novel Jailbreak Attacks (Rating: 2.33)
- D2Coder: large language models based agent for coding with dynamic debugging tools (Rating: 2.33)
- Poly-Autoregressive Modeling for Interacting Entities (Rating: 2.33)
- EReLELA: Exploration in Reinforcement Learning via Emergent Language Abstractions (Rating: 2.33)
- Generate explorative goals with large language model guidance (Rating: 2.00)
- CELI: CONTROLLER-EMBEDDED LANGUAGE MODEL INTERACTIONS (Rating: N/A)
- Tooling or Not Tooling? The Impact of Tools on Language Agents for Chemistry Problem Solving (Rating: N/A)
- Agential AI for integrated continual learning, deliberative behavior, and comprehensible models (Rating: N/A)
- Self-Improving Logic from Experimental Observations (Rating: N/A)
- Collaborative Theorem Proving with Large Language Models: Enhancing Formal Proofs with ProofRefiner (Rating: N/A)
- Hindsight Planner: A Closed-loop few-shot planner for Embodied Instruction Following (Rating: N/A)
- A collaborative Multi-Agent LLM Approach for Knowledge Graph Curation and query from multimodal data sources (Rating: N/A)
- WALL-E: World Alignment by Rule Learning Improves World Model-based LLM Agents (Rating: N/A)
Ratings are averages of the scores given by ICLR 2025 reviewers. Papers are sorted by average rating in descending order, and papers without a score (Rating: N/A) are listed last.
Feel free to submit a PR or issue if you find any errors or have suggestions for improvement.
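
Before opening a PR, it may help to check that a new entry keeps the list in order. The snippet below is a minimal sketch, not part of this repository: the names `ENTRY`, `parse_entry`, and `is_sorted` are hypothetical, and it assumes every entry follows the `- Title (Rating: X.XX)` pattern used in this list, with `N/A` entries allowed only at the end.

```python
import re

# Matches entries shaped like "- Title (Rating: 4.75)" or "- Title (Rating: N/A)".
# Assumption: every list line follows this pattern.
ENTRY = re.compile(r"^- (?P<title>.+) \(Rating: (?P<rating>\d+(?:\.\d+)?|N/A)\)$")


def parse_entry(line):
    """Return (title, rating) for a list line; rating is None for N/A.

    Returns None when the line is not a list entry.
    """
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    rating = match.group("rating")
    return match.group("title"), (None if rating == "N/A" else float(rating))


def is_sorted(lines):
    """Check that ratings never increase and that N/A entries come last."""
    entries = [e for e in map(parse_entry, lines) if e is not None]
    # Map N/A to -inf so it ranks below every numeric rating.
    keys = [float("-inf") if rating is None else rating for _, rating in entries]
    return all(a >= b for a, b in zip(keys, keys[1:]))


sample = [
    "- A Generalist Hanabi Agent (Rating: 4.64)",
    "- Agent Workflow Memory (Rating: 4.50)",
    "- CELI: CONTROLLER-EMBEDDED LANGUAGE MODEL INTERACTIONS (Rating: N/A)",
]
print(is_sorted(sample))
```

Lines that do not match the entry pattern (such as the introduction or these closing notes) are skipped, so the check can be run over the whole file.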