ICLR 2025 Agent-Related Papers

This repository contains a curated list of agent-related papers from ICLR 2025, sorted by average rating in descending order.

Papers

Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models (Rating: 8.00)
Spider 2.0: Can Language Models Resolve Real-World Enterprise Text-to-SQL Workflows? (Rating: 8.00)
Do as We Do, Not as You Think: the Conformity of Large Language Models (Rating: 7.50)
Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence (Rating: 7.00)
Online Neuro-Symbolic Predicate Invention for High-Level Planning (Rating: 7.00)
MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering (Rating: 7.00)
Monte Carlo Planning with Large Language Model for Text-Based Games (Rating: 7.00)
Self-Evolving Multi-Agent Networks for Software Development (Rating: 7.00)
Scaling Test-Time Compute Optimally Can be More Effective than Scaling LLM Parameters (Rating: 6.75)
Language Models Trained to do Arithmetic Predict Human Risky and Intertemporal Choice (Rating: 6.75)
PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks (Rating: 6.75)
OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning (Rating: 6.75)
AFlow: Automating Agentic Workflow Generation (Rating: 6.75)
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents (Rating: 6.75)
EmbodiedSAM: Online Segment Any 3D Thing in Real Time (Rating: 6.67)
Strong Preferences Affect the Robustness of Value Alignment (Rating: 6.50)
Facilitating Multi-turn Function Calling for LLMs via Compositional Instruction Tuning (Rating: 6.50)
Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel (Rating: 6.50)
Learning Closed-Loop Concept-Guided Policies from Unlabeled Demonstrations (Rating: 6.50)
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks (Rating: 6.50)
Building Math Agents with Multi-Turn Iterative Preference Learning (Rating: 6.50)
An Investigation of Conformal Isometry Hypothesis for Grid Cells (Rating: 6.50)
Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage (Rating: 6.50)
VisualAgentBench: Towards Large Multimodal Models as Visual Agents (Rating: 6.50)
MMFakeBench: A Mixed-Source Multimodal Misinformation Detection Benchmark for LVLMs (Rating: 6.40)
Robust Function-Calling for On-Device Language Model via Function Masking (Rating: 6.40)
Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment (Rating: 6.40)
DSBench: How Far Are Data Science Agents from Becoming Data Science Experts? (Rating: 6.40)
Active Task Disambiguation with LLMs (Rating: 6.33)
AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs (Rating: 6.33)
SPA-Bench: A Comprehensive Benchmark for Smartphone Agent Evaluation (Rating: 6.33)
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code (Rating: 6.25)
An Intelligent Agentic System for Complex Image Restoration Problems (Rating: 6.25)
Discriminator-Guided Embodied Planning for LLM Agent (Rating: 6.25)
Evidence from the Synthetic Laboratory: Language Models as Auction Participants (Rating: 6.25)
Harnessing Webpage UIs for Text-Rich Visual Understanding (Rating: 6.25)
OSDA Agent: Leveraging Large Language Models for De Novo Design of Organic Structure Directing Agents (Rating: 6.25)
Learn-by-interact: A Data-Centric Framework For Self-Adaptive Agents in Realistic Environments (Rating: 6.25)
Hypothetical Minds: Scaffolding Theory of Mind for Multi-Agent Tasks with Large Language Models (Rating: 6.25)
DelTA: An Online Document-Level Translation Agent Based on Multi-Level Memory (Rating: 6.25)
Can Multimodal Foundation Models Perform Visual Temporal Reasoning? (Rating: 6.25)
Visual Agents as Fast and Slow Thinkers (Rating: 6.25)
DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback (Rating: 6.25)
Generalized Principal-Agent Problem with a Learning Agent (Rating: 6.25)
RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code (Rating: 6.25)
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents (Rating: 6.25)
τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains (Rating: 6.25)
Counterfactual Concept Bottleneck Models (Rating: 6.25)
OS-ATLAS: Foundation Action Model for Generalist GUI Agents (Rating: 6.25)
MaestroMotif: Skill Design from Artificial Intelligence Feedback (Rating: 6.25)
DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agent (Rating: 6.25)
Lightweight Neural App Control (Rating: 6.25)
Language Agents Meet Causality -- Bridging LLMs and Causal World Models (Rating: 6.25)
Steering Masked Discrete Diffusion Models via Discrete Denoising Posterior Prediction (Rating: 6.25)
On Bits and Bandits: Quantifying the Regret-Information Trade-off (Rating: 6.25)
DataGen: Unified Synthetic Dataset Generation via Large Language Models (Rating: 6.25)
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark (Rating: 6.25)
ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer (Rating: 6.20)
Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs (Rating: 6.20)
Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology (Rating: 6.00)
OpenRCA: Can Large Language Models Locate the Root Cause of Software Failures? (Rating: 6.00)
Multimodal Situational Safety (Rating: 6.00)
EC-Diffuser: Multi-Object Manipulation via Entity-Centric Behavior Generation (Rating: 6.00)
Near-Optimal Online Learning for Multi-Agent Submodular Coordination: Tight Approximation and Communication Efficiency (Rating: 6.00)
Autonomous agents from automatic reward modeling and planning (Rating: 6.00)
AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents (Rating: 6.00)
Training Language Models to Critique with Multi-Agent Feedback (Rating: 6.00)
ImProver: Agent-Based Automated Proof Optimization (Rating: 6.00)
PathGen-1.6M: 1.6 Million Pathology Image-text Pairs Generation through Multi-agent Collaboration (Rating: 6.00)
Mixture-of-Agents Enhances Large Language Model Capabilities (Rating: 6.00)
Deep Learning Algorithms for Mean Field Optimal Stopping in Finite Space and Discrete Time (Rating: 6.00)
Agent S: An Open Agentic Framework that Uses Computers Like a Human (Rating: 6.00)
ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery (Rating: 6.00)
Interactive Speculative Planning: Enhance Agent Efficiency through Co-design of System and User Interface (Rating: 6.00)
Aligned LLMs Are Not Aligned Browser Agents (Rating: 6.00)
COMBO: Compositional World Models for Embodied Multi-Agent Cooperation (Rating: 6.00)
MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding (Rating: 6.00)
Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents (Rating: 6.00)
OccProphet: Pushing the Efficiency Frontier of Camera-Only 4D Occupancy Forecasting with an Observer-Forecaster-Refiner Framework (Rating: 6.00)
Neural Interactive Proofs (Rating: 6.00)
GOAL: A Generalist Combinatorial Optimization Agent Learning (Rating: 6.00)
Advantage Alignment Algorithms (Rating: 6.00)
Scaling Large Language Model-based Multi-Agent Collaboration (Rating: 6.00)
Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training (Rating: 6.00)
LoRA-Gen: Specializing Language Model via Online LoRA Generation (Rating: 6.00)
MMRole: A Comprehensive Framework for Developing and Evaluating Multimodal Role-Playing Agents (Rating: 6.00)
Do LLMs "know" internally when they follow instructions? (Rating: 6.00)
AutoAdvExBench: Benchmarking Autonomous Exploitation of Adversarial Example Defenses (Rating: 5.83)
EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage (Rating: 5.80)
POIL: Preference Optimization for Imitation Learning (Rating: 5.80)
MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents (Rating: 5.75)
Discrete Latent Plans via Semantic Skill Abstractions (Rating: 5.75)
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture (Rating: 5.75)
Graph Neural Networks Gone Hogwild (Rating: 5.75)
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers (Rating: 5.75)
CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL (Rating: 5.75)
Audio Large Language Models Can Be Descriptive Speech Quality Evaluators (Rating: 5.75)
Tool-Planner: Task Planning with Clusters across Multiple Tools (Rating: 5.75)
WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models (Rating: 5.75)
Agent Skill Acquisition for Large Language Models via CycleQD (Rating: 5.75)
AlphaZero Neural Scaling and Zipf's Law: a Tale of Board Games and Power Laws (Rating: 5.75)
AgentHarm: Benchmarking Robustness of LLM Agents on Harmful Tasks (Rating: 5.75)
MindSearch: Mimicking Human Minds Elicits Deep AI Searcher (Rating: 5.75)
Grounding Multimodal Large Language Model in GUI World (Rating: 5.75)
Knowledge Graph Based Agent For Complex, Knowledge-Intensive QA in Medicine (Rating: 5.75)
InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation (Rating: 5.75)
LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs (Rating: 5.75)
OpenHands: An Open Platform for AI Software Developers as Generalist Agents (Rating: 5.75)
RL, but don't do anything I wouldn't do (Rating: 5.75)
SmartPretrain: Model-Agnostic and Dataset-Agnostic Representation Learning for Motion Prediction (Rating: 5.75)
Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs (Rating: 5.75)
Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning (Rating: 5.75)
Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent (Rating: 5.75)
HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions (Rating: 5.75)
Integrating Expertise of Software Engineering Agents (Rating: 5.75)
Language Guided Skill Discovery (Rating: 5.75)
Collab: Controlled Decoding using Mixture of Agents for LLM Alignment (Rating: 5.75)
AgentQuest: Benchmarking LLM and VLM Agents on Long-Horizon Interactive Tasks (Rating: 5.75)
NeSyC: A Neuro-symbolic Continual Learner For Complex Embodied Tasks in Open Domains (Rating: 5.75)
Neural Exploratory Landscape Analysis (Rating: 5.75)
CityNav: Language-Goal Aerial Navigation Dataset Using Geographic Information (Rating: 5.75)
Moral Alignment for LLM Agents (Rating: 5.67)
Robotouille: An Asynchronous Planning Benchmark for LLM Agents (Rating: 5.67)
CtD: Composition through Decomposition in Emergent Communication (Rating: 5.67)
Commit0: Library Generation from Scratch (Rating: 5.67)
MetaDesigner: Advancing Artistic Typography through AI-Driven, User-Centric, and Multilingual WordArt Synthesis (Rating: 5.67)
Counterfactual Effect Decomposition in Multi-Agent Sequential Decision Making (Rating: 5.67)
When Prompt Engineering Meets Software Engineering: CNL-P as Natural and Robust "APIs" for Human-AI Interaction (Rating: 5.67)
Steering Large Language Models between Code Execution and Textual Reasoning (Rating: 5.67)
Agents' Room: Narrative Generation through Multi-step Collaboration (Rating: 5.67)
VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning (Rating: 5.60)
Benchmarking Agentic Workflow Generation (Rating: 5.60)
Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner (Rating: 5.60)
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents (Rating: 5.60)
Scattered Forest Search: Smarter Code Space Exploration with LLMs (Rating: 5.60)
EMOS: Embodiment-aware Heterogeneous Multi-robot Operating System with LLM Agents (Rating: 5.50)
GUI-World: A GUI-oriented Dataset for Multimodal LLM-based Agents (Rating: 5.50)
CaPo: Cooperative Plan Optimization for Efficient Embodied Multi-Agent Cooperation (Rating: 5.50)
When LLMs Play the Telephone Game: Cumulative Changes and Attractors in Iterated Cultural Transmissions (Rating: 5.50)
Can Textual Gradient Work in Federated Learning? (Rating: 5.50)
RLSF: Reinforcement Learning via Symbolic Feedback (Rating: 5.50)
Balancing Act: Diversity and Consistency in Large Language Model Ensembles (Rating: 5.50)
Language Model Non-Myopic Generation for Reasoning and Planning (Rating: 5.50)
Do LLMs estimate uncertainty well in instruction-following? (Rating: 5.50)
MIRAI: Evaluating LLM Agents for International Event Forecasting (Rating: 5.50)
DCA-Bench: A Benchmark for Dataset Curation Agents (Rating: 5.50)
ToolACE: Enhancing Function Calling with Accuracy, Complexity, and Diversity (Rating: 5.50)
Stochastic Semi-Gradient Descent for Learning Mean Field Games with Population-Aware Function Approximation (Rating: 5.50)
FinBench: Benchmarking LLMs in Complex Financial Problem Solving and Reasoning (Rating: 5.50)
SENSEI: Semantic Exploration Guided by Foundation Models to Learn Versatile World Models (Rating: 5.50)
ML-Bench: Evaluating Large Language Models for Code Generation in Repository-Level Machine Learning Tasks (Rating: 5.50)
An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandits (Rating: 5.50)
Mr.Steve: Instruction-Following Agents in Minecraft with What-Where-When Memory (Rating: 5.50)
Compositional Hardness of Code in Large Language Models - A Probabilistic Perspective (Rating: 5.50)
PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding (Rating: 5.50)
Meta-Referential Games to Learn Compositional Learning Behaviours (Rating: 5.50)
JudgeBench: A Benchmark for Evaluating LLM-Based Judges (Rating: 5.50)
Multi-Agent Collaborative Data Selection for Efficient Language Model Pretraining (Rating: 5.50)
Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation (Rating: 5.50)
AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments (Rating: 5.50)
LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models (Rating: 5.50)
Adaptive In-conversation Team Building for Language Model Agents (Rating: 5.50)
Eligibility Traces for Confounding Robust Off-Policy Evaluation: A Causal Approach (Rating: 5.50)
Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance (Rating: 5.50)
Modeling dynamic social vision highlights gaps between deep learning and humans (Rating: 5.50)
EmpathyRobot: A Dataset and Benchmark for Empathetic Task Planning of Robotic Agent (Rating: 5.50)
JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework (Rating: 5.50)
Tree Search for Language Model Agents (Rating: 5.50)
MarS: a Financial Market Simulation Engine Powered by Generative Foundation Model (Rating: 5.50)
Can We Trust Embodied Agents? Exploring Backdoor Attacks against Embodied LLM-Based Decision-Making Systems (Rating: 5.50)
Do LLM Agents Have Regret? A Case Study in Online Learning and Games (Rating: 5.50)
Generative World Explorer (Rating: 5.50)
KnowTrace: Explicit Knowledge Tracing for Structured Retrieval-Augmented Generation (Rating: 5.50)
Automated Red Teaming with GOAT: the Generative Offensive Agent Tester (Rating: 5.40)
RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards (Rating: 5.40)
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct (Rating: 5.40)
General Scene Adaptation for Vision-and-Language Navigation (Rating: 5.40)
Empowering LLM Agents with Zero-Shot Optimal Decision-Making through Q-learning (Rating: 5.40)
DenseGrounding: Improving Dense Language-Vision Semantics for Ego-centric 3D Visual Grounding (Rating: 5.40)
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale (Rating: 5.40)
Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning (Rating: 5.40)
Human Simulacra: Benchmarking the Personification of Large Language Models (Rating: 5.40)
On the Convergence of No-Regret Dynamics in Information Retrieval Games with Proportional Ranking Functions (Rating: 5.33)
DoF: A Diffusion Factorization Framework for Offline Multi-Agent Decision Making (Rating: 5.33)
The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind (Rating: 5.33)
Agent-as-a-Judge: Evaluating Agents with Agents (Rating: 5.33)
Evolving Alignment via Asymmetric Self-Play (Rating: 5.33)
Grounding Robot Policies with Visuomotor Language Guidance (Rating: 5.33)
BraiNav: Incorporating Human Brain Activity to Enhance Robustness in Embodied Visual Navigation (Rating: 5.33)
Multiagent Finetuning of Language Models (Rating: 5.33)
What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices (Rating: 5.33)
Benchmarking Intelligent LLM Agents for Conversational Data Analysis (Rating: 5.33)
ADAM: An Embodied Causal Agent in Open-World Environments (Rating: 5.25)
Competing Large Language Models in Multi-Agent Gaming Environments (Rating: 5.25)
Towards Machine Theory of Mind with Large Language Model-Augmented Inverse Planning (Rating: 5.25)
Graph-constrained Reasoning: Faithful Reasoning on Knowledge Graphs with Large Language Models (Rating: 5.25)
AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML (Rating: 5.25)
GIVE: Structured Reasoning with Knowledge Graph Inspired Veracity Extrapolation (Rating: 5.25)
GuardAgent: Safeguard LLM Agent by a Guard Agent via Knowledge-Enabled Reasoning (Rating: 5.25)
Auction-Based Regulation for Artificial Intelligence (Rating: 5.25)
ToolBridge: An Open-Source Dataset to Equip LLMs with External Tool Capabilities (Rating: 5.25)
SeCom: On Memory Construction and Retrieval for Personalized Conversational Agents (Rating: 5.25)
ACC-Debate: An Actor-Critic Approach to Multi-Agent Debate (Rating: 5.25)
Prompt Injection Benchmark for Foundation Model Integrated Systems (Rating: 5.25)
Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention (Rating: 5.25)
Monty Hall and Optimized Conformal Prediction to Improve Decision-Making with LLMs (Rating: 5.25)
Fourier Head: Helping Large Language Models Learn Complex Probability Distributions (Rating: 5.25)
Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge? (Rating: 5.25)
ThinkBot: Embodied Instruction Following with Thought Chain Reasoning (Rating: 5.25)
Agent-to-Sim: Learning Interactive Behavior Model from Casual Longitudinal Videos (Rating: 5.25)
MACPO: Weak-to-Strong Alignment via Multi-Agent Contrastive Preference Optimization (Rating: 5.25)
Communicating Activations Between Language Model Agents (Rating: 5.25)
AgentGym: Evaluating and Evolving Large Language Model-based Agents across Diverse Environments (Rating: 5.25)
How to Correctly Do Semantic Backpropagation on Language-based Agentic Systems (Rating: 5.25)
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents (Rating: 5.25)
Efficiently Scanning and Resampling Spatio-Temporal Tasks with Irregular Observations (Rating: 5.25)
Adapting Communicating MLLMs on the Fly in Referring Expression Tasks (Rating: 5.25)
Towards Efficient and Scalable Multi-agent Reasoning via Bayesian Nash Equilibrium (Rating: 5.25)
Video Action Differencing (Rating: 5.25)
A Contextual Online Learning Theory of Brokerage (Rating: 5.25)
RwR: A Reason-while-Retrieve framework for Reasoning on Scene Graphs with LLMs (Rating: 5.25)
Private Mechanism Design via Quantile Estimation (Rating: 5.25)
MVGS: Multi-view-regulated Gaussian Splatting for Novel View Synthesis (Rating: 5.25)
From Commands to Prompts: LLM-based Semantic File System (Rating: 5.25)
ToolGen: Unified Tool Retrieval and Calling via Generation (Rating: 5.25)
SpiritSight Agent: Advanced GUI Agent with One Look (Rating: 5.25)
Efficient Active Imitation Learning with Random Network Distillation (Rating: 5.25)
DUET: Decentralized Bilevel Optimization without Lower-Level Strong Convexity (Rating: 5.25)
The Impact of Element Ordering on LM Agent Performance (Rating: 5.25)
Coding Reliable LLM-based Integrated Task and Knowledge Agents with GenieWorksheets (Rating: 5.20)
Tell Me What You Don't Know: Enhancing Refusal Capabilities of Role-Playing Agents via Representation Space Analysis and Editing (Rating: 5.20)
Federated Coordination: Private and Distributed Strategy Alignment (Rating: 5.20)
Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems (Rating: 5.20)
BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments (Rating: 5.20)
HyperAgent: Generalist Software Engineering Agents to Solve Coding Tasks at Scale (Rating: 5.17)
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials (Rating: 5.00)
3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination (Rating: 5.00)
ChemAgent: Self-updating Memories in Large Language Models Improves Chemical Reasoning (Rating: 5.00)
Strategist: Self-improvement of LLM Decision Making via Bi-Level Tree Search (Rating: 5.00)
Zero-Shot Task-Level Adaptation via Coarse-to-Fine Policy Refinement and Holistic-Local Contrastive Representations (Rating: 5.00)
Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System (Rating: 5.00)
REvolve: Reward Evolution with Large Language Models using Human Feedback (Rating: 5.00)
Neuralized Markov Random Field for Interaction-Aware Stochastic Human Trajectory Prediction (Rating: 5.00)
Breaking Mental Set to Improve Reasoning through Diverse Multi-Agent Debate (Rating: 5.00)
BRIDGE: Bootstrapping Text to Guide Time-Series Generation via Multi-Agent Iterative Optimisation and Diffusion Modelling (Rating: 5.00)
Re-Aligning Language to Visual Objects with an Agentic Workflow (Rating: 5.00)
Can VLMs Play Action Role-Playing Games? Take Black Myth Wukong as a Study Case (Rating: 5.00)
Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems (Rating: 5.00)
On the Diversity of Synthetic Data and its Impact on Training Large Language Models (Rating: 5.00)
Better than Your Teacher: LLM Agents that learn from Privileged AI Feedback (Rating: 5.00)
IGOR: Image-GOal Representations are the Atomic Building Blocks for Next-Level Generalization in Embodied AI (Rating: 5.00)
Mora: Enabling Generalist Video Generation via A Multi-Agent Framework (Rating: 5.00)
Rational Decision-Making Agent with Learning Internal Utility Judgment (Rating: 5.00)
Efficacy of Language Model Self-Play in Non-Zero-Sum Games (Rating: 5.00)
STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models (Rating: 5.00)
AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions (Rating: 5.00)
Intelligent Go-Explore: Standing on the Shoulders of Giant Foundation Models (Rating: 5.00)
Riemannian Manifold Learning for Stackelberg Games with Neural Flow Representations (Rating: 5.00)
Exploring Prosocial Irrationality for LLM Agents: A Social Cognition View (Rating: 5.00)
AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs (Rating: 5.00)
OmniParser for Pure Vision Based GUI Agent (Rating: 5.00)
Separate the Wheat from the Chaff: Winnowing Down Divergent Views in Retrieval Augmented Generation (Rating: 5.00)
Triples as the Key: Structuring Makes Decomposition and Verification Easier in LLM-based TableQA (Rating: 5.00)
Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants (Rating: 5.00)
Towards Full Delegation: Designing Ideal Agentic Behaviors for Travel Planning (Rating: 5.00)
Sample Efficient Alignment for LLMs (Rating: 5.00)
Closed-Loop Long-Horizon Robotic Planning via Equilibrium Sequence Modeling (Rating: 5.00)
ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL (Rating: 5.00)
Improving Large Language Model based Multi-Agent Framework through Dynamic Workflow Updating (Rating: 5.00)
Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems (Rating: 5.00)
CoPS: Empowering LLM Agents with Provable Cross-Task Experience Sharing (Rating: 5.00)
Actions Speak Louder Than Words: Rate-Reward Trade-off in Markov Decision Processes (Rating: 5.00)
ChinaTravel: A Real-World Benchmark for Language Agents in Chinese Travel Planning (Rating: 5.00)
OptiBench: Benchmarking Large Language Models in Optimization Modeling with Equivalence-Detection Evaluation (Rating: 5.00)
DRESSing Up LLM: Efficient Stylized Question-Answering via Style Subspace Editing (Rating: 5.00)
Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration (Rating: 5.00)
Auto-Arena: Automating LLM Evaluations with Agent Peer Battles and Committee Discussions (Rating: 5.00)
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning (Rating: 5.00)
InvestAlign: Align LLMs with Investor Decision-Making under Herd Behavior (Rating: 5.00)
Informing Reinforcement Learning Agents by Grounding Language to Markov Decision Processes (Rating: 5.00)
Understanding Prejudice and Fidelity of Diverge-to-Converge Multi-Agent Systems (Rating: 5.00)
Who Should Join the Decision-Making Table? Targeted Expert Selection for Enhanced Human-AI Collaboration (Rating: 4.83)
Digi-Q: Transforming VLMs to Device-Control Agents via Value-Based Offline RL (Rating: 4.80)
MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning (Rating: 4.80)
On the Resilience of Multi-Agent Systems with Malicious Agents (Rating: 4.80)
Haland: Human-AI Coordination via Policy Generation from Language-guided Diffusion (Rating: 4.80)
AgentMonitor: A Plug-and-Play Framework for Predictive and Secure Multi-Agent Systems (Rating: 4.80)
Knapsack Schema Linking Agent for LLM-Based Text-to-SQL Generation (Rating: 4.80)
Stochastic Matching Bandits under Preference Feedback (Rating: 4.80)
How language models extrapolate outside the training data: A Case study in Textualized Gridworld (Rating: 4.80)
Empowering Users in Digital Privacy Management through Interactive LLM-Based Agents (Rating: 4.80)
Deep Exploration with PAC-Bayes (Rating: 4.75)
On the Modeling Capabilities of Large Language Models for Sequential Decision Making (Rating: 4.75)
War and Peace (WarAgent): LLM-based Multi-Agent Simulation of World Wars (Rating: 4.75)
Task-oriented Sequential Grounding in 3D Scenes (Rating: 4.75)
BioKGBench: A Knowledge Graph Checking Benchmark of AI Agent for Biomedical Science (Rating: 4.75)
3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Referred Object Grounding (Rating: 4.75)
Automated Design of Agentic Systems (Rating: 4.75)
Language-conditioned Multi-Style Policies with Reinforcement Learning (Rating: 4.75)
LLF-Bench: A Benchmark for Interactive Learning from Language Feedback (Rating: 4.75)
MAD-Sherlock: Multi-Agent Debates for Out-of-Context Misinformation Detection (Rating: 4.75)
Research Town: Simulator of Research Community (Rating: 4.75)
ToM-agent: Large Language Models as Theory of Mind Aware Generative Agents with Counterfactual Reflection (Rating: 4.75)
COMMA: A Communicative Multimodal Multi-Agent Benchmark (Rating: 4.75)
SWE-bench Multimodal: Do Autonomous Programming Systems Generalize to New Software Domains? (Rating: 4.75)
Dissecting Adversarial Robustness of Multimodal LM Agents (Rating: 4.75)
ESDMotion: End-to-end Motion Prediction Only with SD Maps (Rating: 4.75)
MetaTool: Facilitating Large Language Models to Master Tools with Meta-task Augmentation (Rating: 4.75)
Data Interpreter: An LLM Agent For Data Science (Rating: 4.75)
Sparse Rewards Can Self-Train Dialogue Agents (Rating: 4.75)
MISR: Measuring Instrumental Self-Reasoning in Frontier Models (Rating: 4.75)
Controlling Large Language Model Agents with Entropic Activation Steering (Rating: 4.75)
Emergence of Hierarchical Emotion Representations in Large Language Models (Rating: 4.75)
JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking (Rating: 4.75)
Truthful Aggregation of LLMs with an Application to Online Advertising (Rating: 4.75)
LifelongSotopia: Evaluating Social Intelligence Of Language Agents Over Lifelong Social Interactions (Rating: 4.75)
DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient Self-Driving (Rating: 4.75)
MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents (Rating: 4.75)
A Research on Result Interpretability of Medical AI Based on Large Language Model (Rating: 4.75)
DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects (Rating: 4.75)
WebCanvas: Benchmarking Web Agents in Online Environments (Rating: 4.75)
RoundTable: Investigating Group Decision-Making Mechanism in Multi-Agent Collaboration (Rating: 4.75)
Interactive Dialogue Agents via Reinforcement Learning with Hindsight Regenerations (Rating: 4.75)
STRIDE: A Tool-Assisted LLM Agent Framework for Strategic and Interactive Decision-Making (Rating: 4.75)
NNetscape Navigator: Complex Demonstrations for Web Agents Without a Demonstrator (Rating: 4.75)
Knowing What Not to Do: Leverage Language Model Insights for Action Space Pruning in Multi-agent Reinforcement Learning (Rating: 4.75)
ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents (Rating: 4.75)
Simulate Before Act: Model-Based Planning for Web Agents (Rating: 4.75)
Modeling Unseen Environments with Language-guided Composable Causal Components in Reinforcement Learning (Rating: 4.75)
Beyond Numeric Awards: In-Context Dueling Bandits with LLM Agents (Rating: 4.75)
BOIL: Learning Environment Personalized Information (Rating: 4.75)
Query-Efficient Planning with Language Models (Rating: 4.75)
GLEE: A Framework and Benchmark for LLM Evaluation in Language-based Economics (Rating: 4.75)
TestAgent: An Adaptive and Intelligent Expert for Human Assessment (Rating: 4.75)
Enhancing Language Model Agents using Diversity of Thoughts (Rating: 4.75)
Prioritize Alignment in Dataset Distillation (Rating: 4.75)
Wolf: Accurate Video Captioning with a World Summarization Framework (Rating: 4.75)
DialSim: A Real-Time Simulator for Evaluating Long-Term Multi-Party Dialogue Understanding of Conversational Agents (Rating: 4.75)
From an LLM Swarm to a PDDL-empowered Hive: Planning Self-executed Instructions in a Multi-modal Jungle (Rating: 4.67)
Deviation Ratings: A general, clone invariant rating method (Rating: 4.67)
ControlAgent: Automating Control System Design via Novel Integration of LLM Agents and Domain Expertise (Rating: 4.67)
Direct Multi-agent Motion Generation Preference Alignment with Implicit Feedback from Demonstrations (Rating: 4.67)
Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities (Rating: 4.67)
Improving the Efficiency of Test-Time Search in LLMs with Backtracking (Rating: 4.67)
Disentangling Reasoning Tokens and Boilerplate Tokens For Language Model Fine-tuning (Rating: 4.67)
VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use (Rating: 4.67)
Review and Rebuttal: Zero-shot In-context Adversarial Learning for Improving Research Ideation (Rating: 4.67)
LAM Simulator: Advancing Large Action Model Training for Agent via Online Exploration and Feedback Simulation (Rating: 4.67)
GridAgent: A 2D Grid-Based Game Framework And Benchmark For Multimodal Large Language Models (Rating: 4.67)
Vision-Language Models Provide Promptable Representations for Reinforcement Learning (Rating: 4.67)
A Generalist Hanabi Agent (Rating: 4.64)
Unlocking Video-LLM via Agent-of-Thoughts Distillation (Rating: 4.60)
DPM: Dual Preferences-based Multi-Agent Reinforcement Learning (Rating: 4.60)
IDEA: Enhancing the Rule Learning Ability of Large Language Model Agent through Induction, Deduction, and Abduction (Rating: 4.60)
LARM: Large Auto-Regressive Model for Long-Horizon Embodied Intelligence (Rating: 4.50)
Cognitive Insights and Stable Coalition Matching for Fostering Multi-Agent Cooperation (Rating: 4.50)
Chain of Ideas: Revolutionizing Research in Idea Development with LLM Agents (Rating: 4.50)
Agent Workflow Memory (Rating: 4.50)
MALLM-GAN: Multi-Agent Large Language Model as Generative Adversarial Network for Synthesizing Tabular Data (Rating: 4.50)
RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation (Rating: 4.50)
UrbanWorld: An Urban World Model for 3D City Generation (Rating: 4.50)
Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses (Rating: 4.50)
Large Legislative Models: Towards Efficient AI Policymaking in Economic Simulations (Rating: 4.50)
Progressive LLM Alignments Using Two-Player Games (Rating: 4.50)
Uncertainty-aware Human Mobility Modeling and Anomaly Detection (Rating: 4.50)
MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models (Rating: 4.50)
Personalized Federated Learning via Variational Message Passing (Rating: 4.50)
Towards Safe and Honest AI Agents with Neural Self-Other Overlap (Rating: 4.50)
Improving Planning with Large Language Models: A Modular Agentic Architecture (Rating: 4.50)
Choices are More Important than Efforts: LLM Enables Efficient Multi-Agent Exploration (Rating: 4.50)
Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs (Rating: 4.50)
DiverseAgentEntropy: Quantifying Black-Box LLM Uncertainty through Diverse Perspectives and Multi-Agent Interaction (Rating: 4.50)
CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants (Rating: 4.50)
Visually Descriptive Language Model for Vector Graphics Reasoning (Rating: 4.50)
PREDICT: Preference Reasoning by Evaluating Decomposed preferences Inferred from Candidate Trajectories (Rating: 4.50)
Large Language Model-driven Large Neighborhood Search for Large-Scale MILP Problems (Rating: 4.50)
Towards Human-like Virtual Beings: Simulating Human Behavior in 3D Scenes (Rating: 4.50)
Zodiac: A Cardiologist-Level LLM Framework for Multi-Agent Diagnostics (Rating: 4.50)
RedCodeAgent: Automatic Red-teaming Agent against Code Agents (Rating: 4.50)
Advancing Algorithmic Trading with Large Language Models: A Reinforcement Learning Approach for Stock Market Optimization (Rating: 4.50)
NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative (Rating: 4.50)
PokeChamp: an Expert-level Minimax Language Agent for Competitive Pokemon (Rating: 4.50)
LLaMP: Large Language Model Made Powerful for High-fidelity Materials Knowledge Retrieval (Rating: 4.50)
AISciVision: A Framework for Specializing Large Multimodal Models in Scientific Image Classification (Rating: 4.50)
CycleResearcher: Improving Automated Research via Automated Review (Rating: 4.50)
Shell Games: Control Protocols for Adversarial AI Agents (Rating: 4.50)
Versatile Motion-Language Models for Multi-turn Interactive Agents (Rating: 4.50)
FlowAgent: a New Paradigm for Workflow Agent (Rating: 4.50)
Q* Agent: Optimizing Language Agents with Q-Guided Exploration (Rating: 4.50)
Simulating Human-like Daily Activities with Desire-driven Autonomy (Rating: 4.50)
AgentRefine: Enhancing Agent Generalization through Refinement Tuning (Rating: 4.50)
RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning (Rating: 4.50)
Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning (Rating: 4.40)
CP-Guard+: A New Paradigm for Malicious Agent Detection and Defense in Collaborative Perception (Rating: 4.40)
ICDA: Interactive Causal Discovery through Large Language Model Agents (Rating: 4.40)
Adversarial Attacks on Cooperative Multi-agent Bandits (Rating: 4.40)
Adaptive Video Understanding Agent: Enhancing Efficiency with Dynamic Frame Sampling and Feedback-driven Reasoning (Rating: 4.40)
ReFeR: Improving Evaluation and Reasoning through Hierarchy of Models (Rating: 4.40)
Synthesizing Bonds: Enhancing Adult Attachment Predictions with LLM-Generated Data (Rating: 4.33)
Multi-Agent Path Finding via Decision Transformer and LLM Collaboration (Rating: 4.33)
Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models (Rating: 4.33)
Efficient Reinforcement Learning for Global Decision Making in the Presence of Local Agents at Scale (Rating: 4.33)
Enhance Reasoning for Large Language Models with Reinforcement Learning in the Game Werewolf (Rating: 4.33)
Towards Specialized Web Agents Using Production-Scale Workflow Data (Rating: 4.33)
Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation (Rating: 4.33)
CogMath: Evaluating LLMs' Authentic Mathematical Ability from a Cognitive Perspective (Rating: 4.33)
Detecting Out-of-Context Misinformation via Multi-Agent and Multi-Grained Retrieval (Rating: 4.33)
Last Iterate Convergence in Monotone Mean Field Games (Rating: 4.33)
EcoAct: Economic Agent Determines When to Register What Action (Rating: 4.33)
Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation (Rating: 4.33)
CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only (Rating: 4.33)
CodeCloak: A Method for Mitigating Code Leakage by LLM Code Assistants (Rating: 4.33)
MemSim: A Bayesian Simulator for Evaluating Memory of LLM-based Personal Assistants (Rating: 4.25)
Decoding Intelligence: A Framework for Certifying Knowledge Comprehension in LLMs (Rating: 4.25)
VideoAgent: Self-Improving Video Generation (Rating: 4.25)
Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition (Rating: 4.25)
CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents (Rating: 4.25)
SnapMem: Snapshot-based 3D Scene Memory for Embodied Exploration and Reasoning (Rating: 4.25)
MetaAgent: Automatically Building Multi-Agent System based on Finite State Machine (Rating: 4.25)
Teaching Transformers Causal Reasoning through Axiomatic Training (Rating: 4.25)
Learning 4D Embodied World Models (Rating: 4.25)
Contextual Experience Replay for Continual Learning of Language Agents (Rating: 4.25)
Large Language Models Can Self-Improve At Web Agent Tasks (Rating: 4.25)
Talking Vehicles: Cooperative Driving via Natural Language (Rating: 4.25)
Open-World Planning via Lifted Regression with LLM-based Affordances for Embodied Agents (Rating: 4.25)
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction (Rating: 4.25)
ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents (Rating: 4.25)
Towards Evaluating Generalist Agents: An Automated Benchmark in Open World (Rating: 4.25)
SMART: Self-Learning Meta-strategy Agent for Reasoning Tasks (Rating: 4.25)
OpenCity: A Scalable Platform to Simulate Urban Activities with Massive LLM Agents (Rating: 4.25)
OASIS: Open Agents Social Interaction Simulations on a Large Scale (Rating: 4.25)
Students Rather Than Experts: A New AI for Education Pipeline to Model More Human-like and Personalised Early Adolescences (Rating: 4.25)
Provably Efficient and Practical Self-Play for Better LLM Alignment (Rating: 4.25)
Agents Help Agents: Exploring Training-Free Knowledge Distillation for Small Language Models in Data Science Code Generation (Rating: 4.25)
Evaluating the Goal-Directedness of Large Language Models (Rating: 4.25)
SimSiam Naming Game: A Unified Approach for Representation Learning and Emergent Communication (Rating: 4.25)
In-Context Learning for Games (Rating: 4.25)
Optimizing Inference-Time Reasoning in LLMs via Retrieval-Augmented Reflection (Rating: 4.25)
AltDev: Achieving Real-Time Alignment in Multi-Agent Software Development (Rating: 4.25)
A Third-Person Appraisal Agent: Learning to Reason About Emotions in Conversational Contexts (Rating: 4.25)
How Can LLM Guide RL? A Value-Based Approach (Rating: 4.25)
Explicit-Constrained Single Agent for Enhanced Task-Solving in LLMs (Rating: 4.25)
LLMs for Generalizable Language-Conditioned Policy Learning under Minimal Data Requirements (Rating: 4.25)
MorphAgent: Empowering Agents through Self-Evolving Profiles and Decentralized Collaboration (Rating: 4.25)
Uncertainty Quantification with Generative-Semantic Entropy Estimation for Large Language Models (Rating: 4.25)
UI-Pro: A Hidden Recipe for Building Vision-Language Models for GUI Grounding (Rating: 4.25)
AgentSquare: Automatic LLM Agent Search in Modular Design Space (Rating: 4.25)
CogniPair - Dynamic LLM Matching Algorithm in Chaotic Environments Mimicking Human Cognitive Processes for Relationship Pairing (Rating: 4.25)
Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? (Rating: 4.25)
AutoHijacker: Automatic Indirect Prompt Injection Against Black-box LLM Agents (Rating: 4.25)
Skill Discovery using Language Models (Rating: 4.25)
ReAcTree: Hierarchical Task Planning with Dynamic Tree Expansion using LLM Agent Nodes (Rating: 4.25)
Odyssey: Empowering Minecraft Agents with Open-World Skills (Rating: 4.25)
'No' Matters: Out-of-Distribution Detection in Multimodality Long Dialogue (Rating: 4.20)
ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World (Rating: 4.20)
B-MoCA: Benchmarking Mobile Device Control Agents across Diverse Configurations (Rating: 4.20)
AdvWeb: Controllable Black-box Attacks on VLM-powered Web Agents (Rating: 4.20)
LossAgent: Towards Any Optimization Objectives for Image Processing with LLM Agents (Rating: 4.20)
Large-Scale Dynamic Graph Generation via LLM-based Agent Simulation (Rating: 4.20)
PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play (Rating: 4.20)
VeSX: A Framework Featured by Verification, Self-Correction and In-context Learning for Web Automation Tasks (Rating: 4.20)
Improving Model Alignment Through Collective Intelligence of Open-Source Models (Rating: 4.20)
Evolving Symbolic 3D Visual Grounder with Weakly Supervised Reflection (Rating: 4.17)
Plan B: Training LLMs to fail less severely (Rating: 4.00)
Memory-Driven Multimodal Chain of Thought for Embodied Long-Horizon Task Planning (Rating: 4.00)
The Ability of Large Language Models to Evaluate Constraint-satisfaction in Agent Responses to Open-ended Requests (Rating: 4.00)
Towards Reliable Offline Reinforcement Learning via Lyapunov Uncertainty Control (Rating: 4.00)
Computing Ex Ante Equilibrium in Heterogeneous Zero-Sum Team Games (Rating: 4.00)
MS$^3$M: Multi-Stage State Space Model for Motion Forecasting (Rating: 4.00)
Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines (Rating: 4.00)
Two Heads Are Better Than One: A Multi-Agent System Has the Potential to Improve Scientific Idea Generation (Rating: 4.00)
Leveraging Imitation Learning and LLMs for Efficient Hierarchical Reinforcement Learning (Rating: 4.00)
GenoAgent: A Baseline method for LLM-Based Exploration of Gene Expression Data in Alignment with Bioinformaticians (Rating: 4.00)
EnvBridge: Bridging Diverse Environments with Cross-Environment Knowledge Transfer for Embodied AI (Rating: 4.00)
Efficient Predictive Counterfactual Regret Minimization$^+$ Algorithm in Solving Extensive-Form Games (Rating: 4.00)
Contextual Bandits with Entropy-based Human Feedback (Rating: 4.00)
LeanAgent: Lifelong Learning for Formal Theorem Proving (Rating: 4.00)
MultiMedia-Agent: A Multimodal Agent for Multimedia Content Generation (Rating: 4.00)
SWE-Bench+: Enhanced Coding Benchmark for LLMs (Rating: 4.00)
Scaling Laws for Pre-training Agents and World Models (Rating: 4.00)
AutoRedTeamer: An Autonomous Red Teaming Agent Against Language Models (Rating: 4.00)
SmartBackdoor: Malicious Language Model Agents that Avoid Being Caught (Rating: 4.00)
CALF: Benchmarking Evaluation of LFQA Using Chinese Examinations (Rating: 4.00)
Sketch-Plan-Generalize: Learning Inductive Representations for Grounded Spatial Concepts (Rating: 4.00)
Shapley Value Approximation based on k-Additive Games (Rating: 4.00)
MAC: A Multimodal Benchmark for Understanding and Generating Academic Journal Covers (Rating: 4.00)
On Inherent 3D Reasoning of VLMs in Indoor Scene Layout Design (Rating: 4.00)
YOLO-MARL: You Only LLM Once for Multi-agent Reinforcement Learning (Rating: 4.00)
Entropy-Based Uncertainty Modeling for Trajectory Prediction in Autonomous Driving (Rating: 4.00)
Large Language Model Critics for Execution-Free Evaluation of Code Changes (Rating: 4.00)
iAgent: LLM Agent as a Shield between User and Recommender Systems (Rating: 4.00)
Benchmark for Temporal, Ambiguous, and Grounded Embodied Question-Answering (Rating: 4.00)
ReGen: Generative Robot Simulation via Inverse Design (Rating: 4.00)
Optimal Transport-Based Domain Alignment as a Preprocessing Step for Federated Learning (Rating: 4.00)
Understanding Data Poisoning Attacks for RAG: Insights and Algorithms (Rating: 4.00)
DSMentor: Enhancing Data Science Agents with Curriculum Learning and Online Knowledge Accumulation (Rating: 4.00)
SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI (Rating: 4.00)
Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning (Rating: 4.00)
Grey-box Prompt Optimization and Fine-Tuning for Cloud-Edge LLM Agents (Rating: 4.00)
Gödel Agent: A Self-Referential Framework Helps for Recursively Self-Improvement (Rating: 4.00)
MuLan: Multimodal-LLM Agent for Progressive and Interactive Multi-Object Diffusion (Rating: 4.00)
ANALOGXPERT: Automating Analog Topology Synthesis by Incorporating Circuit Design Expertise into Large Language Models (Rating: 4.00)
Designing Deep Learning Programs with Large Language Models (Rating: 4.00)
Inverse Attention Agent in Multi-Agent System (Rating: 4.00)
Multi-Grained Knowledge for Retrieval-Augmented Question Answering on Hyper-long Contexts (Rating: 4.00)
SCALE: Augmenting Content Analysis via LLM Agents and AI-Human Collaboration (Rating: 4.00)
Embodied Instruction Following in Unknown Environments (Rating: 4.00)
Language-Guided Object-Centric World Models for Predictive Control (Rating: 4.00)
Denial-of-Service Poisoning Attacks against Large Language Models (Rating: 4.00)
Symbolic Learning Enables Self-Evolving Agents (Rating: 4.00)
AD-H: Autonomous Driving with Hierarchical Agents (Rating: 4.00)
Towards LLM4Floorplan: Agents Can Do What Engineers Do in Chip Design (Rating: 4.00)
Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback (Rating: 4.00)
A Super-Aligned Driving Generalist Is Your Cockpit (Rating: 3.83)
Make LLMs better zero-shot reasoners: structure-oriented autonomous reasoning (Rating: 3.83)
LLM-Mediated Guidance of MARL Systems (Rating: 3.80)
LLMPhy: Complex Physical Reasoning Using Large Language Models and World Models (Rating: 3.80)
HeurAgenix: A Multi-Agent LLM-Based Paradigm for Adaptive Heuristic Evolution and Selection in Combinatorial Optimization (Rating: 3.80)
Unlocking Speech Instruction Data Potential with Query Rewriting (Rating: 3.75)
Multi-Modal Foundation Models Induce Interpretable Molecular Graph Languages (Rating: 3.75)
Verbalized Bayesian Persuasion (Rating: 3.75)
S3E: Semantic Symbolic State Estimation With Vision-Language Foundation Models (Rating: 3.75)
Egocentric Vision Language Planning (Rating: 3.75)
Thought-Retriever: Don’t Just Retrieve Raw Data, Retrieve Thoughts (Rating: 3.75)
SparsitySolver: Efficient Reinforcement Learning-based Pruning for LLMs (Rating: 3.75)
MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control (Rating: 3.75)
Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial? (Rating: 3.75)
Inductive Linguistic Reasoning with Large Language Models (Rating: 3.75)
Feynman: Knowledge-Infused Diagramming Agent for Scaling Visual Reasoning Data (Rating: 3.75)
EmbodiedCity: A Benchmark Platform for Embodied Agent in Real-world City Environment (Rating: 3.75)
NextBestPath: Efficient 3D Mapping of Unseen Environments (Rating: 3.75)
Large language models as windows on the mental structure of psychopathology (Rating: 3.75)
Harnessing Input-adaptive Inference for Efficient Vision-and-Language Navigation (Rating: 3.75)
From Reward Shaping to Q-Shaping: Achieving Unbiased Learning with LLM-Guided Knowledge (Rating: 3.67)
A little less conversation, a little more action, please: Investigating the physical common-sense of LLMs in a 3D embodied environment (Rating: 3.67)
Adversarial Testing in LLMs: Insights into Decision-Making Vulnerabilities (Rating: 3.67)
Flow-of-Action: SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis (Rating: 3.67)
InteractiveCOT: Aligning Dynamic Chain-of-Thought Planning for Embodied Decision-Making (Rating: 3.67)
Learning to Imitate with Less: Efficient Individual Behavior Modeling in Chess (Rating: 3.67)
Decentralized Blockchain-based Robust Multi-agent Multi-armed Bandit (Rating: 3.67)
DYSTIL: Dynamic Strategy Induction with Large Language Models for Reinforcement Learning (Rating: 3.67)
Solving Robotics Problems in Zero-Shot with Vision-Language Models (Rating: 3.67)
Scalable and Accurate Graph Reasoning with LLM-based Multi-Agents (Rating: 3.67)
Boundless Socratic Learning (Rating: 3.60)
Learning a Bi-directional Driving Data Generator via Large Multi-modal Model Tuning (Rating: 3.60)
A Scalable Communication Protocol for Networks of Large Language Models (Rating: 3.50)
SELA: Tree-Search Enhanced LLM Agents for Automated Machine Learning (Rating: 3.50)
ALIA: An LLM for Industrial Assets using Synthetic Data (Rating: 3.50)
FAIRMINDSIM: Alignment of Behavior, Emotion, and Belief in Humans and LLM Agents amid Ethical Dilemmas (Rating: 3.50)
Defend against Jailbreak Attacks via Debate with Partially Perceptive Agents (Rating: 3.50)
Cracking the Collective Mind: Adversarial Manipulation in Multi-Agent Systems (Rating: 3.50)
Beyond Browsing: API-Based Web Agents (Rating: 3.50)
AutoCoder: Enhancing Code Large Language Model with AIEV-INSTRUCT (Rating: 3.50)
FEABench: Evaluating Language Models on Real World Physics Reasoning Ability (Rating: 3.50)
AutoPR: Automatically Pull Request Generation for Fix Issued Bugs of CodeBase (Rating: 3.50)
Massively Multi-Agents Reveal That Large Language Models Can Understand Value (Rating: 3.50)
SimUSER: When Language Models Pretend to Be Believable Users in Recommender Systems (Rating: 3.50)
Enhancing Software Agents with Monte Carlo Tree Search and Hindsight Feedback (Rating: 3.50)
Zero-Shot Goal Dialogue via Reinforcement Learning on Imagined Conversations (Rating: 3.50)
AIME: AI System Optimization via Multiple LLM Evaluators (Rating: 3.50)
Planning in Strawberry Fields: Evaluating and Improving the Planning and Scheduling Capabilities of LRM o1 (Rating: 3.50)
REDO: Execution-Free Runtime Error Detection for Coding Agents (Rating: 3.50)
Agent-G: An Agentic Framework for Graph Retrieval Augmented Generation (Rating: 3.50)
LLM-Exp: Exploring the Policy in Reinforcement Learning with Large Language Models (Rating: 3.50)
Value Explicit Pretraining for Learning Transferable Representations (Rating: 3.50)
Logic Agent: Enhancing Validity with Logic Rule Invocation (Rating: 3.50)
iMotion-LLM: Motion Prediction Instruction Tuning (Rating: 3.50)
Autoverse: an Evolvable Game Language for Learning Robust Embodied Agents (Rating: 3.50)
Extracting Heuristics from Large Language Models for Reward Shaping in Reinforcement Learning (Rating: 3.50)
Self-controller: Controlling LLMs with Multi-round Step-by-step Self-awareness (Rating: 3.50)
LASER: Script Execution by Autonomous Agents for On-demand Traffic Simulation (Rating: 3.50)
EconAI: Preference-driven Agents Simulating Economic Activities via Large Language Model (Rating: 3.50)
Vision-Language Models as Trainers for Instruction-Following Agents (Rating: 3.50)
Your Agent Can Defend Itself against Backdoor Attacks (Rating: 3.50)
Probing the contents of text, behavior, and brain data toward improving human-LLM alignment (Rating: 3.50)
LLMs Synergy : From Closed-Source Prototyping to Open-Source Model based Instruction Following (Rating: 3.40)
Training Open-ended Policies to follow Video-prompt Instructions with Reinforcement Learning (Rating: 3.40)
Multi-Agent Causal Discovery Using Large Language Models (Rating: 3.40)
GFLAgent: Green Federated Learning Agent for Alleviating Heterogeneity (Rating: 3.40)
MAC-CAFE: Multi-actor, Centralized Critic Architecture for Feedback-driven Editing (Rating: 3.25)
TeamCraft: A Benchmark for Embodied Multi-Agent Systems in Minecraft (Rating: 3.25)
DataSciBench: An LLM Agent Benchmark for Data Science (Rating: 3.20)
How Social is It? A Benchmark for LLMs' Capabilities in Multi-user Multi-turn Social Agent Tasks (Rating: 3.00)
HomieBot: an Adaptive System for Embodied Mobile Manipulation in Open Environments (Rating: 3.00)
Entering Real Social World! Benchmarking the Theory of Mind and Socialization Capabilities of LLMs from a First-person Perspective (Rating: 3.00)
SOP-Agent: Empower General Purpose AI Agent with Domain-Specific SOPs (Rating: 3.00)
ProCEED: Prototype Consolidation and Ensemble-based Exemplar-Free Deep Incremental Learning (Rating: 3.00)
EchoQA: Tuning into the Heart of Echocardiogram Reports (Rating: 3.00)
Seeker: Enhancing Exception Handling in Code with a LLM-based Multi-Agent Approach (Rating: 3.00)
Enhancing Multi-Agent Learning in Real-World Interactive Environments through Process Reward Decomposition (Rating: 3.00)
Test-Time RAG: Enhancing Long Context Understanding in LLMs with Retrieval-Augmented Mechanisms (Rating: 3.00)
Grounded Robotic Action-Rule Induction through Language Models (GRAIL) (Rating: 3.00)
Investigating Self-Attention: Its Impact on Sample Efficiency in Deep Reinforcement Learning (Rating: 3.00)
Orca: Enhancing Role-Playing Abilities of Large Language Models by Integrating Personality Traits (Rating: 3.00)
Rapfi: Distilling Efficient Neural Network for the Game of Gomoku (Rating: 3.00)
Planning with MCTS: Enhancing Problem-Solving in Large Language Models (Rating: 3.00)
StarCraft II Arena: Evaluating LLMs in Strategic Planning, Real-Time Decision Making, and Adaptability (Rating: 3.00)
ChemThinker: Thinking Like a Chemist with Multi-Agent LLMs for Deep Molecular Insights (Rating: 3.00)
DebUnc: Improving Large Language Model Agent Communication Via Uncertainty Metrics (Rating: 3.00)
AutoModel: Autonomous Model Development for Image Classification with LLM Agents (Rating: 3.00)
Very Large-Scale Multi-Agent Simulation with LLM-Powered Agents (Rating: 3.00)
On the Design and Analysis of LLM-Based Algorithms (Rating: 3.00)
I Want to Break Free! Persuasion and Anti-Social Behavior of LLMs in Multi-Agent Settings with Social Hierarchy (Rating: 3.00)
Human-like Communication Strategies for Improved Multi-Agent Reinforcement Learning (Rating: 3.00)
GLIMO: Grounding Large Language Models With Imperfect World Models (Rating: 3.00)
IDS-Agent: An LLM Agent for Explainable Intrusion Detection in IoT Networks (Rating: 3.00)
Foundation Models for Enhanced Exploration in Reinforcement Learning (Rating: 3.00)
ActionFiller: Fill-In-The-Blank Prompting for OS Agent (Rating: 3.00)
FALCON: A Feedback-Driven Adaptive Long/Short-Term Memory Reinforced Coding Optimization (Rating: 3.00)
LOB-Bench: Benchmarking Generative AI for Finance - with an Application to Limit Order Book Markets (Rating: 2.67)
RePrompt: Prompt Engineering for Large Language Models Agents through Reflection (Rating: 2.50)
Towards Autonomous Agents: Adaptive-planning, Reasoning, and Acting in Language Models (Rating: 2.50)
Adversarial Multi-Agent Evaluation of Large Language Models through Iterative Debate (Rating: 2.50)
DrugAgent: Multi-Agent Large Language Model-Based Reasoning for Drug-Target Interaction Prediction and Repurposing (Rating: 2.50)
Why Solving Multi-agent Path Finding with Large Language Models has not Succeeded Yet (Rating: 2.50)
Emergence of Grounded, Optimally Compositional Spatial Language among Homogeneous Agents (Rating: 2.33)
Leveraging System-Prompt Attention to Counteract Novel Jailbreak Attacks (Rating: 2.33)
D2Coder: large language models based agent for coding with dynamic debugging tools (Rating: 2.33)
Poly-Autoregressive Modeling for Interacting Entities (Rating: 2.33)
EReLELA: Exploration in Reinforcement Learning via Emergent Language Abstractions (Rating: 2.33)
Generate explorative goals with large language model guidance (Rating: 2.00)
CELI: Controller-Embedded Language Model Interactions (Rating: N/A)
Tooling or Not Tooling? The Impact of Tools on Language Agents for Chemistry Problem Solving (Rating: N/A)
Agential AI for integrated continual learning, deliberative behavior, and comprehensible models (Rating: N/A)
Self-Improving Logic from Experimental Observations (Rating: N/A)
Collaborative Theorem Proving with Large Language Models: Enhancing Formal Proofs with ProofRefiner (Rating: N/A)
Hindsight Planner: A Closed-loop few-shot planner for Embodied Instruction Following (Rating: N/A)
A collaborative Multi-Agent LLM Approach for Knowledge Graph Curation and query from multimodal data sources (Rating: N/A)
WALL-E: World Alignment by Rule Learning Improves World Model-based LLM Agents (Rating: N/A)

Note

Each rating is the average of the scores given by ICLR 2025 reviewers. Papers are sorted by average rating, highest first; papers without a rating (N/A) are listed last.
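
For readers who want to re-sort or filter the list, here is a minimal Python sketch of the ordering described in this note. It assumes each entry follows the `Title (Rating: X.XX)` pattern used above and that the average is a plain arithmetic mean rounded to two decimals. How this repository actually computed the averages is not documented here, so treat `average_rating`, `parse_entry`, and `sort_entries` as hypothetical helper names, and the reviewer scores in the usage example as made up for illustration.

```python
import re
from statistics import mean

# Matches "Title (Rating: 4.50)" or "Title (Rating: N/A)". The greedy title
# group keeps parentheses inside a title, e.g. "... (GRAIL) (Rating: 3.00)".
ENTRY = re.compile(r"^(?P<title>.+) \(Rating: (?P<rating>N/A|\d+(?:\.\d+)?)\)$")


def average_rating(scores):
    """Arithmetic mean of reviewer scores, or None if there are no scores.

    Assumption: the listed ratings are simple means rounded to two decimals.
    """
    return round(mean(scores), 2) if scores else None


def parse_entry(line):
    """Split one list entry into (title, rating); rating is None for N/A."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"not a list entry: {line!r}")
    rating = match.group("rating")
    return match.group("title"), (None if rating == "N/A" else float(rating))


def sort_entries(entries):
    """Sort (title, rating) pairs by rating, highest first, unrated last.

    sorted() is stable, so entries with equal ratings keep their input order.
    """
    return sorted(entries, key=lambda e: (e[1] is None, -(e[1] or 0.0)))


if __name__ == "__main__":
    lines = [
        "Agent Workflow Memory (Rating: 4.50)",
        "WALL-E: World Alignment by Rule Learning Improves World Model-based LLM Agents (Rating: N/A)",
        "Query-Efficient Planning with Language Models (Rating: 4.75)",
    ]
    for title, rating in sort_entries([parse_entry(l) for l in lines]):
        shown = "N/A" if rating is None else f"{rating:.2f}"
        print(f"{title} (Rating: {shown})")

    # Hypothetical reviewer scores, for illustration only.
    print(average_rating([5, 3, 6, 5]))
```

The sort key puts unrated papers after all rated ones and otherwise orders by descending rating, which matches the order of the list above.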

Contributing

Feel free to submit a PR or issue if you find any errors or have suggestions for improvement.
