[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-luo-junyu--Awesome-Agent-Papers":3,"tool-luo-junyu--Awesome-Agent-Papers":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",156804,2,"2026-04-15T11:34:33",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":77,"owner_email":77,"owner_twitter":77,"owner_website":77,"owner_url":78,"languages":77,"stars":79,"forks":80,"last_commit_at":81,"license":77,"difficulty_score":82,"env_os":83,"env_gpu":84,"env_ram":84,"env_deps":85,"category_tags":88,"github_topics":89,"view_count":32,"oss_zip_url":77,"oss_zip_packed_at":77,"status":17,"created_at":94,"updated_at":95,"faqs":96,"releases":97},7794,"luo-junyu\u002FAwesome-Agent-Papers","Awesome-Agent-Papers","[Up-to-date] Large Language Model Agent: A Survey on Methodology, Applications and Challenges","Awesome-Agent-Papers 是一个专注于大语言模型（LLM）智能体领域的开源学术资源库，旨在系统性地收集、整理并分类该方向的前沿研究论文。随着 2023 年以来智能体研究的爆发式增长，相关文献分散且更新迅速，开发者与研究者往往难以全面把握技术脉络。Awesome-Agent-Papers 
通过构建清晰的分类体系，将零散的研究成果整合为涵盖智能体构建、多智能体协作、自我进化、工具调用、安全伦理及基准测试等关键维度的知识图谱，有效解决了信息碎片化问题，帮助用户快速定位核心文献并理解技术演进路径。\n\n该项目不仅提供了一份动态更新的论文清单，还配套了深入的综述文章，详细阐述了从架构设计到实际落地的方法论与挑战。其独特的亮点在于对“多智能体协作失败原因”及“自动化工作流”等新兴议题的敏锐追踪，为社区提供了宝贵的洞察。无论是希望深入探索算法原理的科研人员，还是寻求落地解决方案的 AI 工程师，亦或是关注行业趋势的技术决策者，都能从中获得极具价值的参考指引，是进入 LLM 智能体领域不可或缺的知识导航站。","# 🤖 Comprehensive LLM Agent Research Collection\n\n\u003Cdiv align=\"center\">\n\n![Awesome](https:\u002F\u002Fcdn.rawgit.com\u002Fsindresorhus\u002Fawesome\u002Fd7305f38d29fed78fa85652e3a63e154dd8e8829\u002Fmedia\u002Fbadge.svg)\n[![commit](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flast-commit\u002Fluo-junyu\u002FAwesome-Agent-Papers?color=blue)](https:\u002F\u002Fgithub.com\u002Fluo-junyu\u002FAwesome-Agent-Papers\u002Fcommits\u002Fmain)\n[![PR](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPRs-Welcome-red)](https:\u002F\u002Fgithub.com\u002Fluo-junyu\u002FAwesome-Agent-Papers\u002Fpulls)\n\n\u003C\u002Fdiv>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fluo-junyu_Awesome-Agent-Papers_readme_0dcbdccd61b6.png\" width=\"90%\" alt=\"LLM Agent Research Overview\">\n\u003C\u002Fp>\n\n## 🌟 Overview\n\nThis repository contains a **comprehensive collection** of research papers on Large Language Model (LLM) agents. We organize papers across key categories including agent construction, collaboration mechanisms, evolution, tools, security, benchmarks, and applications.\n\nOur taxonomy provides a structured framework for understanding the rapidly evolving field of LLM agents, from architectural foundations to practical implementations. 
The repository bridges fragmented research threads by highlighting connections between agent design principles and emergent behaviors.\n\n📄 **[Read our survey paper here](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.21460)**\n\n\nOur survey covers the rapidly evolving field of LLM agents, with a significant increase in research publications since 2023.\n\n\n## 📑 Table of Contents\n\n- [🌟 Overview](#-overview)\n- [📊 Statistics & Trends](#-statistics--trends)\n- [🔍 Key Categories](#-key-categories)\n- [📚 Resource List](#-resource-list)\n  - [Agent Collaboration](#agent-collaboration)\n  - [Agent Construction](#agent-construction)\n  - [Agent Evolution](#agent-evolution)\n  - [Applications](#applications)\n  - [Datasets & Benchmarks](#datasets--benchmarks)\n  - [Ethics](#ethics)\n  - [Security](#security)\n  - [Survey](#survey)\n  - [Tools](#tools)\n- [🤝 Contributing](#-contributing)\n\n## 🔍 Key Categories\n\n- **🏗️ Agent Construction**: Methodologies and architectures for building LLM agents\n- **👥 Agent Collaboration**: Frameworks for multi-agent interaction and cooperation\n- **🌱 Agent Evolution**: Self-improvement and learning capabilities of agents\n- **🔧 Tools**: Integration of external tools and APIs with LLM agents\n- **🛡️ Security**: Security concerns and protections for LLM agent systems\n- **📊 Benchmarks**: Evaluation frameworks and datasets for testing agent capabilities\n- **💡 Applications**: Real-world implementations and use cases\n\n\n## 📚 Resource List\n\n### Agent Collaboration\n\n
- **[Foam-Agent: Towards Automated Intelligent CFD Workflows](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.04997)** (*2025*) `Arxiv`\n  > The paper presents Foam-Agent, a multi-agent framework automating CFD workflows from natural language. It features unique retrieval, file-generation and error-correction systems, lowering expertise barriers.\n\n
- **[Why Do Multi-Agent LLM Systems Fail?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.13657)** (*2025*) `Arxiv`\n  > Presents MAST, a taxonomy for MAS failures. Develops an LLM-as-a-Judge pipeline and open-sources data to guide MAS development.\n\n
- **[Linear formation control of multi-agent systems](https:\u002F\u002Fwww.sciencedirect.com\u002Fscience\u002Farticle\u002Fpii\u002FS0005109824004291)** (*2025*)\n  > A new distributed leader–follower control architecture (linear formation control) is proposed for formation variations, with new concepts and estimation methods.\n\n
- **[MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.01935)** (*2025*) `Arxiv`\n  > Introduces MultiAgentBench to evaluate LLM-based multi-agent systems. Assesses collaboration and competition, evaluates protocols and strategies, code & data open-sourced.\n\n
- **[A Survey of AI Agent Protocols](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.16736)** (*2025*) `Arxiv`\n  > Paper analyzes existing LLM agent protocols, proposes a classification, explores future directions for next-gen protocols.\n\n
- **[C^2: Scalable Auto-Feedback for LLM-based Chart Generation](https:\u002F\u002Faclanthology.org\u002F2025.naacl-long.232\u002F)** (*2025*) `*ACL`\n  > The paper introduces C2, a framework with an auto-feedback provider and a reference-free dataset, eliminating human curation, open-sourced at chartsquared.github.io.\n\n
- **[AgentRxiv: Towards Collaborative Autonomous Research](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.18102)** (*2025*) `Arxiv`\n  > Introduces AgentRxiv, a framework enabling LLM agent labs to share research on a preprint server for collaboration, aiding future AI design with humans.\n\n
- **[Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.05707)** (*2025*) `Arxiv`\n  > Proposes multiagent finetuning for language models. Specializes models via multiagent-generated data, preserving diverse reasoning chains for better self-improvement.\n\n
- **[From Debate to Equilibrium: Belief-Driven Multi-Agent LLM Reasoning via Bayesian Nash Equilibrium](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08292)** (*2025*) `ICML`\n  > Proposes ECON, a hierarchical RL paradigm recasting multi-LLM coordination as a BNE game, with a tighter regret bound and scalability.\n\n
- **[Chain of Agents: Large language models collaborating on long-context tasks](https:\u002F\u002Fresearch.google\u002Fblog\u002Fchain-of-agents-large-language-models-collaborating-on-long-context-tasks\u002F)** (*2025*) `Blog`\n  > Proposes Chain-of-Agents, a training-free, task-agnostic framework using LLM collaboration for long-context tasks, outperforming RAG and long-context LLMs.\n\n
- **[CS-Agent: LLM-based Community Search via Dual-agent Collaboration](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.09549)** (*2025*) `Arxiv`\n  > Proposes CS-Agent with dual-agent collaboration (Solver, Validator) and a Decider for LLM-based community search, addressing limitations without fine-tuning.\n\n
- **[MUA-RL: Multi-turn User-interacting Agent Reinforcement Learning for agentic tool use](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.18669)** (*2025*) `Arxiv`\n  > MUA-RL integrates LLM-simulated users into the RL loop for agentic tool use, enabling dynamic multi-turn user interaction learning.\n\n
- **[CoMet: Metaphor-Driven Covert Communication for Multi-Agent Language Games](https:\u002F\u002Faclanthology.org\u002F2025.acl-long.389\u002F)** (*2025*) `*ACL`\n  > CoMet introduces a framework for LLM-based agents to process metaphors, combining a hypothesis-based reasoner and a self-reflective generator. This novel approach enhances strategic, covert communication in multi-agent language games through nuanced metaphor interpretation and application.\n\n
- **[Thought Communication in Multiagent Collaboration](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.20733)** (*2025*) `Arxiv`\n  > This paper introduces thought communication, a new paradigm enabling agents to share latent thoughts directly, going beyond natural language. It provides a theoretical framework for identifying and structuring these thoughts, enhancing multi-agent collaboration.\n\n
- **[Cache-to-Cache: Direct Semantic Communication Between Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.03215)** (*2025*) `Arxiv`\n  > This paper proposes Cache-to-Cache (C2C), a method for direct semantic communication between LLMs using their internal KV-cache, bypassing inefficient text generation to enable richer, lower-latency inter-model collaboration.\n\n
- **[Adaptive Collaboration Strategy for LLMs in Medical Decision Making](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.15155)** (*2024*) `NeurIPS`\n  > Proposes Medical Decision-making Agents (MDAgents) to assign LLM collaboration structures, adapting to task complexity and exploring group consensus.\n\n
- **[ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.13007)** (*2024*) `*ACL`\n  > Proposes ReConcile, a multi-model multi-agent framework like a round-table conference, enhancing LLM collaborative reasoning via discussion and voting.\n\n
- **[MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.00352)** (*2024*) `ICLR`\n  > Introduces MetaGPT, a meta-programming framework integrating human workflows into LLM-based multi-agent systems, improving task breakdown and error reduction.\n\n
- **[Debating with More Persuasive LLMs Leads to More Truthful Answers](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.06782)** (*2024*) `ICML`\n  > The paper explores if weaker models can assess stronger ones via debate. It shows debate helps non-experts and optimising debaters aids truth-finding without ground truth.\n\n
- **[Roco: Dialectic multi-robot collaboration with large language models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.04738)** (*2024*) `Arxiv`\n  > Proposes using pre-trained LLMs for multi-robot high-level communication and low-level path planning, with in-context improvement, and introduces RoCoBench.\n\n
- **[AutoAct: Automatic Agent Learning from Scratch for QA via Self-Planning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.05268)** (*2024*) `*ACL`\n  > AutoAct is an automatic QA agent learning framework. It synthesizes trajectories without external help and uses labor-division for task completion.\n\n
- **[Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.12954)** (*2024*) `Arxiv`\n  > Introduces meta-prompting, a task-agnostic scaffolding to turn one LM into a multi-role system, integrates external tools, enhancing task performance.\n\n
- **[Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.19118)** (*2024*) `*ACL`\n  > The paper proposes a Multi-Agent Debate framework to solve the Degeneration-of-Thought (DoT) problem in LLMs, encouraging divergent thinking for complex tasks.\n\n
- **[AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors](https:\u002F\u002Fopenreview.net\u002Fforum?id=EHg5GDnyq1)** (*2024*) `ICLR`\n  > The paper proposes AgentVerse, a multi-agent framework inspired by human group dynamics, facilitating collaboration and revealing emergent behaviors in agents.\n\n
- **[ChatDev: Communicative Agents for Software Development](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.07924)** (*2024*) `*ACL`\n  > This paper presents ChatDev, a chat-powered framework. Specialized LLM agents collaborate via unified linguistic communication, bridging phases for autonomous task-solving.\n\n
- **[ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate](https:\u002F\u002Fopenreview.net\u002Fforum?id=FQepisCUWu)** (*2024*) `ICLR`\n  > The paper presents ChatEval, a multi-agent referee team, leveraging multi-agent debate for text evaluation, offering a human-mimicking process.\n\n
- **[A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration](https:\u002F\u002Fopenreview.net\u002Fforum?id=XII0Wp1XA9#discussion)** (*2024*) `COLM`\n  > This paper proposes DyLAN, a framework for LLM-powered agent collaboration. It selects agents dynamically and uses a two-stage paradigm for task-solving.\n\n
- **[AgentCoord: Visually Exploring Coordination Strategy for LLM-based Multi-Agent Collaboration](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.11943)** (*2024*) `Arxiv`\n  > Presents a visual exploration framework for multi-agent coordination strategy design, converts goals to strategies, allowing user intervention.\n\n
- **[TradingAgents: Multi-Agents LLM Financial Trading Framework](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.20138)** (*2024*) `Arxiv`\n  > This paper proposes a novel stock trading framework with LLM-powered agents in specialized roles, simulating real-world collaboration to improve trading performance.\n\n
- **[AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.08155)** (*2023*) `COLM`\n  > AutoGen is an open-source framework enabling LLM app building with multi-agent conversation, customizable agents, and flexible interaction definition.\n\n
- **[Improving Factuality and Reasoning in Language Models through Multiagent Debate](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14325)** (*2023*) `ICML`\n  > A multi-agent debate approach is presented to improve LLMs'
reasoning and factuality, applicable to black-box models with a unified procedure.\n\n
- **[Autonomous chemical research with large language models](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41586-023-06792-0)** (*2023*) `Nature`\n  > The paper introduces Coscientist, an AI system driven by GPT-4. It integrates tools, shows potential in research, and demonstrates AI's versatility and efficacy.\n\n### Agent Construction\n\n
- **[Planning with Multi-Constraints via Collaborative Language Agents](https:\u002F\u002Faclanthology.org\u002F2025.coling-main.672\u002F)** (*2025*) `*ACL`\n  > This paper proposes PMC, a zero-shot method for LLM-based multi-agent systems. It simplifies complex, constraint-heavy task planning via task decomposition.\n\n
- **[Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2024\u002Fhash\u002Fb631da756d1573c24c9ba9c702fde5a9-Abstract-Datasets_and_Benchmarks_Track.html)** (*2025*) `NeurIPS`\n  > The paper proposes an Embodied Agent Interface to unify tasks, modules, and metrics, comprehensively assessing LLMs for embodied decision making.\n\n
- **[SPeCtrum: A Grounded Framework for Multidimensional Identity Representation in LLM-Based Agent](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.08599)** (*2025*) `Arxiv`\n  > Introduces SPeCtrum, a framework integrating S, P, C for LLM agent personas. Enhances identity realism, enabling personalized AI interactions.\n\n
- **[Adaptive Thinking via Mode Policy Optimization for Social Language Agents](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.02156)** (*2025*)\n  > Proposes the Adaptive Mode Learning (AML) framework and AMPO algorithm, offering multi-granular modes, context-aware switching, and token-efficient reasoning.\n\n
- **[On Architecture of LLM agents](http:\u002F\u002Fwww.injoit.ru\u002Findex.php\u002Fj1\u002Farticle\u002Fview\u002F2057)** (*2025*) `Arxiv`\n  > The paper discusses LLM agent architecture. Agents are a key area in AI, acting like mashups and robots, and frameworks can simplify their creation.\n\n
- **[Unified Mind Model: Reimagining Autonomous Agents in the LLM Era](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.03459)** (*2025*) `Arxiv`\n  > This paper proposes the Unified Mind Model (UMM) for human-level agents. It also develops MindOS to create task-specific agents without programming.\n\n
- **[ATLaS: Agent Tuning via Learning Critical Steps](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.02197)** (*2025*) `Arxiv`\n  > Proposes ATLaS to identify critical steps in expert trajectories for LLM agent tuning, reducing cost and enhancing generalization.\n\n
- **[Cognitive AI Memory: A Framework for More Human-like Memory in LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.13044)** (*2025*) `Arxiv`\n  > The paper proposes the CAIM framework inspired by cognitive AI for LLMs, with three modules, enhancing long-term human-AI interaction by holistic memory modeling.\n\n
- **[Adaptive Graph Pruning: A Task-Adaptive Multi-Agent Collaboration Framework](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.02951)** (*2025*) `Arxiv`\n  > Proposes Adaptive Graph Pruning (AGP), a task-adaptive multi-agent framework jointly optimizing agent quantity and communication topology via a two-stage strategy.\n\n
- **[Agents of Change: Self-Evolving LLM Agents for Strategic Planning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04651)** (*2025*) `Arxiv`\n  > Puts LLM agents in strategically challenging environments, uses the Catan game for benchmarking, and proposes a multi-agent architecture for self-improvement.\n\n
- **[Reinforcing Large Language Model Reasoning through Multi-Agent Reflection](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08379)** (*2025*) `ICML`\n  > The paper models multi-turn refinement as an MDP and introduces DPSDP, an RL algorithm for iterative answer refinement, showing theoretical and empirical benefits.\n\n
- **[Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.19828)** (*2025*) `Arxiv`\n  > Proposes Memory-R1, an RL framework with two agents for LLMs to actively manage and utilize external memory, offering insights into RL-enabled behavior.\n\n
- **[BudgetThinker: Empowering Budget-aware LLM Reasoning with Control Tokens](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.17196)** (*2025*) `Arxiv`\n  > Introduces BudgetThinker, a framework for budget-aware LLM reasoning. Inserts control tokens and uses a two-stage training pipeline for efficient, controllable reasoning.\n\n
- **[A-MEM: Agentic Memory for LLM Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12110)** (*2025*) `Arxiv`\n  > Paper proposes an agentic memory system for LLMs, organizing memories like a Zettelkasten, enabling dynamic updates and more adaptive memory management.\n\n
- **[MemoCue: Empowering LLM-Based Agents for Human Memory Recall via Strategy-Guided Querying](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.23633)** (*2025*) `Arxiv`\n  > Proposes MemoCue, a strategy-guided agent with a Recall Router framework, using a 5W Recall Map and a hierarchical recall tree to enhance memory recall via cue-rich queries.\n\n
- **[Analyzing Information Sharing and Coordination in Multi-Agent Planning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.12981)** (*2025*) `Arxiv`\n  > This paper constructs an LLM-based MAS for travel planning, introducing a notebook for structured info sharing and an orchestrator for reflective coordination to enhance long-horizon planning.\n\n
- **[AutoAgents: A Framework for Automatic Agent Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.17288)** (*2024*) `IJCAI`\n  > Introduces AutoAgents, a framework generating and coordinating specialized agents per task. Incorporates an observer. Offers new complex-task-tackling perspectives.\n\n
- **[MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.00352)** (*2024*) `ICLR`\n  > MetaGPT is a meta-programming framework integrating human workflows into LLM-based multi-agent collaborations, streamlining workflows and reducing errors.\n\n
- **[Cognitive Architectures for Language Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.02427)** (*2024*) `TMLR`\n  > Proposes CoALA, a framework for language agents with modular memory, action space, and decision-making, organizing work and guiding future development.\n\n
- **[Executable Code Actions Elicit Better LLM Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.01030)** (*2024*) `ICML`\n  > This work proposes CodeAct using executable Python code for LLM agents, unifying the action space, and builds an open-source agent with a tuned dataset.\n\n
- **[ChatDev: Communicative Agents for Software Development](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.07924)** (*2024*) `*ACL`\n  > The paper introduces ChatDev, an LLM-powered framework enabling agents to collaborate via language for software design, coding, and testing, unifying phases.\n\n
- **[Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2024\u002Fpapers\u002FWei_Editable_Scene_Simulation_for_Autonomous_Driving_via_Collaborative_LLM-Agents_CVPR_2024_paper.pdf)** (*2024*) `CVPR\u002FICCV\u002FECCV`\n  > This paper presents ChatSim, enabling editable 3D driving scene simulations via natural language. It uses LLM-agent collaboration with new neural radiance field and lighting estimation methods.\n\n
- **[A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.02170)** (*2024*) `COLM`\n  > A framework named DyLAN is proposed for LLM-powered agent collaboration. It has a two-stage paradigm with dynamic agent selection and communication for tasks.\n\n
- **[More Agents Is All You Need](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.05120)** (*2024*) `TMLR`\n  > The paper proposes Agent Forest, a sampling-and-voting method. It's orthogonal to existing ones, enhancing LLMs with performance correlated to task difficulty.\n\n
- **[Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.02957)** (*2024*) `Arxiv`\n  > Presents Agent Hospital, a hospital simulacrum with LLM-powered agents. Doctor agents evolve without manual labeling, and the methods benefit broader applications.\n\n
- **[Empowering biomedical discovery with AI agents](https:\u002F\u002Fwww.cell.com\u002Fcell\u002Ffulltext\u002FS0092-8674(24)01070-5?&target=_blank)** (*2024*) `Others`\n  > Paper proposes “AI scientists” as collaborative agents integrating AI and biomedical tools. They combine human and AI abilities and impact multiple biomedical areas.\n\n
- **[SMART-LLM: Smart Multi-Agent Robot Task Planning using Large Language Models](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F10802322)** (*2024*) `IROS`\n  > Proposes SMART-LLM, an LLM-based framework for multi-robot task planning.
Creates a benchmark dataset and offers resources on https:\u002F\u002Fsites.google.com\u002Fview\u002Fsmart-llm\u002F.\n\n- **[Perceive, Reflect, and Plan: Designing LLM Agent for Goal-Directed City Navigation without Instructions](http:\u002F\u002Farxiv.org\u002Fabs\u002F2408.04168)** (*2024*) `Arxiv`\n  > The paper presents a novel LLM agent workflow with perception, reflection, and planning for goal - directed city navigation, avoiding baseline drawbacks.\n\n- **[Enhancing the General Agent Capabilities of Low-Parameter LLMs through Tuning and Multi-Branch Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.19962)** (*2024*) `Arxiv`\n  > Proposes constructing agent - specific data with GPT - 4 and fine - tuning small - parameter LLMs. Multi - path reasoning and task decomposition improve agent performance.\n\n- **[PlanCritic: Formal Planning with Human Feedback](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.00300)** (*2024*) `Arxiv`\n  > Presents a feedback - driven plan critic, optimizing plans via RL with human feedback and GA, bridging gaps in planner research.\n\n- **[Enhancing Robot Task Planning: Integrating Environmental Information and Feedback Insights through Large Language Models](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F10661782)** (*2024*) `CCC`\n  > Presents EnviroFeedback Planner, integrating environmental info into prompt building and feedback for better agent execution in task planning.\n\n- **[Devil's Advocate: Anticipatory Reflection for LLM Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.16334)** (*2024*) `Arxiv`\n  > A novel approach equips LLM agents with introspection, prompting task decomposition, continuous self - assessment, and three - fold intervention for better consistency and adaptability.\n\n- **[Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.17167)** (*2024*) `*ACL`\n  > 
Presents UltraTool, a benchmark for LLMs in real - world tool use. It evaluates the whole process, independently assesses planning, and removes pre - defined toolset restrictions.\n\n- **[On the Structural Memory of LLM Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15266)** (*2024*) `Arxiv`\n  > This paper explores how memory structures and retrieval methods affect LLM - based agents, finding task - specific advantages and iterative retrieval's superiority.\n\n- **[CAMEL: Communicative Agents for \"Mind\" Exploration of Large Language Model Society](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.17760)** (*2023*) `NeurIPS`\n  > Proposes a role - playing communicative agent framework, offers scalable study approach for multi - agent systems, and open - sources library.\n\n- **[AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.08155)** (*2023*) `Arxiv`\n  > AutoGen is an open - source framework for LLM apps. It enables customizable multi - agent conversation, flexible programming, and building diverse apps.\n\n- **[AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.13010)** (*2023*) `Arxiv`\n  > The paper introduces AgentCoder, a multi - agent framework for code generation. 
It addresses balancing issues and outperforms existing methods.\n\n- **[War and Peace (WarAgent): Large Language Model-based Multi-Agent Simulation of World Wars](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17227)** (*2023*) `Arxiv`\n  > Proposes WarAgent, an LLM - powered multi - agent system for simulating historical conflicts, offering new insights for conflict resolution and peacekeeping.\n\n- **[Describe, Explain, Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agents](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Fhash\u002F6b8dfb8c0c12e6fafc6c256cb08a5ca7-Abstract-Conference.html)** (*2023*) `NeurIPS`\n  > The paper studies Minecraft planning for multi - task agents. It identifies two challenges and proposes a method to address inefficient planning.\n\n- **[TPTU: Large Language Model-based AI Agents for Task Planning and Tool Usage](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.03427)** (*2023*) `Arxiv`\n  > Presents a framework for LLM - based AI agents, designs two agent types, evaluating TPTU abilities to guide LLM use in AI apps.\n\n### Agent Evolution\n\n- **[Evolutionary optimization of model merging recipes](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs42256-024-00975-8)** (*2025*) `NMI`\n  > Proposes an evolutionary approach for model merging, operating in two spaces, enabling cross - domain merging and introducing a new model composition paradigm.\n\n- **[CREAM: Consistency Regularized Self-Rewarding Language Models](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Vf6RDObyEF)** (*2025*) `ICLR`\n  > This paper formulates a framework for self - rewarding LLM, introduces regularization, and proposes CREAM to use reward consistency for more reliable data.\n\n- **[KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.03101)** (*2025*) `NAACL`\n  > The paper presents KNOWAGENT, using action knowledge base and self - learning 
  > to enhance LLM planning and mitigate hallucinations.

- **[STeCa: Step-level Trajectory Calibration for LLM Agent Learning](https://arxiv.org/abs/2502.14276)** (*2025*) `*ACL`
  > The paper proposes STeCa, a framework for LLM agent learning. It constructs calibrated trajectories via step-level reward comparison and LLM reflection.

- **[SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks](https://arxiv.org/abs/2503.15478)** (*2025*) `Arxiv`
  > Introduces the ColBench benchmark. Proposes SWEET-RL, which uses training-time information for a critic model, offering step-level rewards to optimize LLM agents.

- **[DualRAG: A Dual-Process Approach to Integrate Reasoning and Retrieval for Multi-Hop Question Answering](https://arxiv.org/abs/2504.18243)** (*2025*) `Arxiv`
  > The paper proposes DualRAG, a dual-process framework integrating reasoning and retrieval for MHQA. Its coupled processes form a cycle and work well across scales.

- **[Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward](https://arxiv.org/abs/2508.12800)** (*2025*) `Arxiv`
  > Proposes the Atomic Thought paradigm and the Atom-Searcher RL framework, integrating thought units and rewards for better agentic deep research with unique supervision and reasoning.

- **[PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning](https://arxiv.org/abs/2508.21104)** (*2025*) `Arxiv`
  > Proposes PVPO, an RL method with an advantage reference anchor and pre-sampling.
  > Corrects bias, cuts rollout reliance, and selects high-gain data.

- **[SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents](https://arxiv.org/abs/2508.02085)** (*2025*) `Arxiv`
  > SE-Agent optimizes multi-step reasoning via self-evolution with revision, recombination, and refinement to expand the search space and leverage cross-trajectory inspiration.

- **[LLM Collaboration With Multi-Agent Reinforcement Learning](https://arxiv.org/abs/2508.04652)** (*2025*) `Arxiv`
  > Models LLM collaboration as cooperative MARL and develops the MAGRPO algorithm to enable effective cooperation without complex individual rewards.

- **[VLM Can Be a Good Assistant: Enhancing Embodied Visual Tracking with Self-Improving Vision-Language Models](https://arxiv.org/abs/2505.20718)** (*2025*) `Arxiv`
  > This paper introduces a self-improving framework that enhances embodied visual tracking by integrating a VLM.
  > It uses a novel memory-augmented self-reflection mechanism to enable the VLM to learn from failures and assist in proactive recovery.

- **[EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle](https://arxiv.org/abs/2510.16079)** (*2025*) `Arxiv`
  > This paper introduces a framework for LLM agents to self-improve through a closed-loop lifecycle, distilling past experiences into abstract principles to guide future decision-making and enable iterative strategy refinement.

- **[Self-Improving LLM Agents at Test-Time](https://arxiv.org/abs/2510.07841)** (*2025*) `Arxiv`
  > This paper introduces a test-time self-improvement method where an agent identifies its uncertain predictions, generates similar training examples, and fine-tunes itself on them, enabling efficient and effective self-evolution.

- **[CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards](https://arxiv.org/abs/2510.08529)** (*2025*) `Arxiv`
  > This framework enables autonomous agent co-evolution by generating intrinsic rewards from inter-agent discussions, optimized via reinforcement learning without external supervision.

- **[Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation](https://arxiv.org/pdf/2402.11443)** (*2024*) `Arxiv`
  > A benchmark self-evolving multi-agent framework extends benchmarks, using six operations for fine-grained LLM evaluation and aiding model selection.

- **[Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization](https://aclanthology.org/2024.acl-long.292.pdf)** (*2024*) `ACL`
  > Proposes Agent-Pro, an LLM-based agent using policy-level reflection and optimization.
  > It evolves via a dynamic belief process and DFS for better policies.

- **[Coevolving with the Other You: Fine-Tuning LLM with Sequential Cooperative Multi-Agent Reinforcement Learning](https://proceedings.neurips.cc/paper_files/paper/2024/file/1c2b1c8f7d317719a9ce32dd7386ba35-Paper-Conference.pdf)** (*2024*) `NeurIPS`
  > The paper proposes CORY, extending LLM fine-tuning to a multi-agent framework. Agents coevolve, potentially outperforming PPO for real-world refinement.

- **[A Survey on Self-Evolution of Large Language Models](https://arxiv.org/pdf/2404.14387)** (*2024*) `Arxiv`
  > Presents a framework for LLM self-evolution with four phases. Categorizes objectives, summarizes the literature, and points out challenges and future directions.

- **[LLM-Evolve: Evaluation for LLM’s Evolving Capability on Benchmarks](https://aclanthology.org/2024.emnlp-main.940.pdf)** (*2024*) `EMNLP`
  > This paper proposes LLM-Evolve, an innovative framework extending benchmarks to sequential settings, enabling LLMs to learn from past experiences.

- **[CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing](https://openreview.net/pdf?id=Sx038qxjek)** (*2024*) `ICLR`
  > The paper introduces CRITIC, a framework enabling LLMs to self-correct via tool interaction, highlighting external feedback's role in LLMs' self-improvement.

- **[Iterative Translation Refinement with Large Language Models](https://aclanthology.org/2024.eamt-1.17.pdf)** (*2024*) `EAMT`
  > The paper proposes iterative prompting of LLMs for self-correcting translations. It emphasizes source-anchoring and shows improved human-perceived quality.

- **[Agent Alignment in Evolving Social Norms](https://arxiv.org/pdf/2401.04620)** (*2024*) `Arxiv`
  > Proposes EvolutionaryAgent, an evolutionary framework for agent alignment.
  > Transforms alignment into evolution/selection, applicable to various LLMs.

- **[Mitigating the Alignment Tax of RLHF](https://aclanthology.org/2024.emnlp-main.35.pdf)** (*2024*) `EMNLP`
  > The paper reveals the alignment tax in RLHF. It proposes HMA via model averaging to balance alignment and forgetting, maximizing performance with minimal tax.

- **[Self-Rewarding Language Models](https://arxiv.org/pdf/2401.10020)** (*2024*) `Arxiv`
  > The paper studies Self-Rewarding LMs using LLM-as-a-Judge to self-reward during training, opening the door for continuous improvement.

- **[V-STaR: Training Verifiers for Self-Taught Reasoners](https://openreview.net/pdf?id=stmqBSW2dV)** (*2024*) `COLM`
  > Proposes V-STaR to train a verifier using both correct and incorrect self-generated solutions, improving solution selection and reasoning ability.

- **[RLCD: Reinforcement learning from contrastive distillation for LM alignment](https://openreview.net/pdf?id=v3XXtxWKi6)** (*2024*) `ICLR`
  > Proposes RLCD, a method for LM alignment without human feedback. Creates preference pairs via contrasting prompts to train a preference model.

- **[Language Model Self-Improvement by Reinforcement Learning Contemplation](https://openreview.net/pdf?id=38E4yUbrgr)** (*2024*) `ICLR`
  > This paper presents RLC, a novel LMSI method leveraging the evaluation-generation gap. It improves models without supervision and has broad applicability.

- **[ProAgent: Building Proactive Cooperative Agents with Large Language Models](https://ojs.aaai.org/index.php/AAAI/article/view/29710/31219)** (*2024*) `AAAI`
  > Proposes ProAgent, an LLM-based framework for proactive agents.
  > It can adapt behavior, analyze states, infer intentions, and is modular, addressing zero-shot issues.

- **[Agent Planning with World Knowledge Model](https://openreview.net/pdf?id=j6kJSS9O6I)** (*2024*) `NeurIPS`
  > Presents a parametric World Knowledge Model (WKM) for agent planning, synthesizing knowledge, guiding global and local planning, and showing unique potential.

- **[Refining Guideline Knowledge for Agent Planning Using Textgrad](https://www.computer.org/csdl/proceedings-article/ickg/2024/088200a102/24sKrMSCxr2)** (*2024*) `ICKG`
  > This paper introduces Textgrad to optimize Guideline Knowledge for agents' embodied tasks, enabling auto-optimization via text gradients and failed-trajectory analysis.

- **[Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate](https://arxiv.org/abs/2305.19118)** (*2024*) `Arxiv`
  > This paper proposes a Multi-Agent Debate framework to solve the Degeneration-of-Thought problem in LLMs, encouraging divergent thinking.

- **[LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error](https://aclanthology.org/2024.acl-long.570/)** (*2024*) `*ACL`
  > Existing LLMs' tool-use accuracy is low.
  > A novel simulated trial-and-error method, inspired by biological systems, is proposed for better tool learning.

- **[Richelieu: Self-Evolving LLM-Based Agents for AI Diplomacy](https://arxiv.org/abs/2407.06813)** (*2024*) `NeurIPS`
  > This paper introduces a self-evolving LLM-based agent for Diplomacy that integrates strategic planning, goal-oriented negotiation, and a novel self-play mechanism for autonomous evolution without human intervention.

- **[Simulating Human-like Daily Activities with Desire-driven Autonomy](https://arxiv.org/abs/2412.06435)** (*2024*) `Arxiv`
  > This paper introduces a desire-driven autonomous agent (D2A) that uses a dynamic Value System to enable an LLM to autonomously propose and select tasks, motivated by intrinsic human-like needs rather than explicit instructions.

- **[AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback](https://proceedings.neurips.cc/paper_files/paper/2023/file/5fc47800ee5b30b8777fdd30abcaaf3b-Paper-Conference.pdf)** (*2023*) `NeurIPS`
  > AlpacaFarm addresses challenges in LLM development. It simulates feedback cheaply, offers evaluation and method implementations, and validates them end to end.

- **[SELF-REFINE: Iterative Refinement with Self-Feedback](https://openreview.net/pdf?id=S37hOerQLB)** (*2023*) `NeurIPS`
  > Introduces SELF-REFINE, an approach for iterative LLM output refinement without extra training, demonstrating test-time improvement of LLMs.

- **[Self-Evolution Learning for Discriminative Language Model Pretraining](https://aclanthology.org/2023.findings-acl.254.pdf)** (*2023*) `EMNLP`
  > Presents Self-Evolution learning (SE), a method for token masking and learning.
  > It exploits data knowledge and uses novel smoothing, improving linguistic learning.

- **[Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning](https://arxiv.org/pdf/2311.08182)** (*2023*) `Arxiv`
  > The paper introduces DIVERSEEVOL, a self-evolving mechanism for label-efficient instruction tuning, enhancing data diversity without human/LLM intervention.

- **[SELFEVOLVE: A Code Evolution Framework via Large Language Models](https://arxiv.org/pdf/2306.02907)** (*2023*) `Arxiv`
  > Proposes SELFEVOLVE, a two-step pipeline using LLMs as knowledge providers and self-reflective programmers, with no need for special test cases.

- **[SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions](https://aclanthology.org/2023.acl-long.754.pdf)** (*2023*) `ACL`
  > Introduces SELF-INSTRUCT, a framework to boost language models' instruction following via self-generated samples, almost annotation-free.

- **[Large Language Models are Better Reasoners with Self-Verification](https://aclanthology.org/2023.findings-emnlp.167.pdf)** (*2023*) `EMNLP`
  > The paper proposes that LLMs have self-verification abilities.
  > It uses backward verification, taking CoT conclusions as conditions, to improve reasoning performance.

- **[CODET: CODE GENERATION WITH GENERATED TESTS](https://openreview.net/pdf?id=ktrw68Cmu9c)** (*2023*) `ICLR`
  > The paper proposes CODET, a method using pre-trained LMs to auto-generate test cases for code samples, facilitating better solution selection.

- **[Evolving Diverse Red-team Language Models in Multi-round Multi-agent Games](https://arxiv.org/pdf/2310.00322)** (*2023*) `Arxiv`
  > Introduces a dynamic Red Team Game to analyze multi-round interactions and develops GRTS to mitigate mode collapse, paving the way for LLM safety.

- **[Improving Factuality and Reasoning in Language Models through Multiagent Debate](https://arxiv.org/abs/2305.14325)** (*2023*) `Arxiv`
  > A multi-agent debate approach for LLMs is proposed. It enhances reasoning and factuality, is applicable to black-box models, and has potential for LLM advancement.

- **[CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society](https://arxiv.org/pdf/2303.17760)** (*2023*) `NeurIPS`
  > The paper proposes a role-playing framework for autonomous agent cooperation, offers a scalable study approach, and open-sources a library.

- **[STaR: Self-Taught Reasoner Bootstrapping Reasoning With Reasoning](https://openreview.net/pdf?id=_3ELRdg2sgI)** (*2022*) `NeurIPS`
  > Proposes STaR, a technique leveraging few rationale examples and rationale-free data to bootstrap complex reasoning, letting models learn from self-generated reasoning.

### Applications

- **[An active inference strategy for prompting reliable responses from large language models in medical practice](https://doi.org/10.1038/s41746-025-01516-2)** (*2025*) `npj Digital Medicine`
  > The paper proposes a domain-specific dataset and an active inference-based prompting protocol to
  > address LLM issues, enabling its safe medical integration.

- **[An evaluation framework for clinical use of large language models in patient interaction tasks](https://doi.org/10.1038/s41591-024-03328-5)** (*2025*) `Nature Medicine`
  > Presents CRAFT-MD, an approach using natural dialogues for LLM clinical evaluation. Proposes recommendations for future LLM evaluation to enhance medical practice.

- **[Large Language Models lack essential metacognition for reliable medical reasoning](https://doi.org/10.1038/s41467-024-55628-6)** (*2025*) `Nature Communications`
  > Develops MetaMedQA to evaluate models' metacognition in medical reasoning, revealing deficiencies and emphasizing the need for metacognition-based frameworks.

- **[Balancing autonomy and expertise in autonomous synthesis laboratories](https://doi.org/10.1038/s43588-025-00769-x)** (*2025*) `Nature Computational Science`
  > A comment on barriers in autonomous synthesis labs, proposing a human-on-the-loop approach and strategies for optimizing labs' features.

- **[SimUSER: Simulating User Behavior with Large Language Models for Recommender System Evaluation](https://arxiv.org/pdf/2504.12722)** (*2025*) `Arxiv`
  > Introduces SimUSER, an agent framework using personas for cost-effective user simulation in recommender system evaluation, refining parameters for real-world engagement.

- **[Swarm Autonomy: From Agent Functionalization to Machine Intelligence](https://advanced.onlinelibrary.wiley.com/doi/full/10.1002/adma.202312956)** (*2025*) `Advanced Materials`
  > This review summarizes synthetic swarms from agent basics to applications, discussing emergent machine intelligence for real-world autonomous swarm design.

- **[ShowUI: One Vision-Language-Action Model for GUI Visual Agent](https://openaccess.thecvf.com/content/CVPR2025/html/Lin_ShowUI_One_Vision-Language-Action_Model_for_GUI_Visual_Agent_CVPR_2025_paper.html)** (*2025*) `CVPR/ICCV/ECCV`
  > A vision-language-action model for GUI visual agents with UI-guided token selection, interleaved streaming, and curated datasets advances GUI assistance.

- **[Agent Laboratory: Using LLM Agents as Research Assistants](https://arxiv.org/abs/2501.04227)** (*2025*) `Arxiv`
  > The paper introduces Agent Laboratory, an LLM-based framework for full-cycle research. It reduces costs, and user feedback improves quality, accelerating discovery.

- **[Towards Scientific Intelligence: A Survey of LLM-based Scientific Agents](https://arxiv.org/abs/2503.24047)** (*2025*) `Arxiv`
  > The paper reviews LLM-based scientific agents, highlights differences from general agents, and offers a roadmap for scientific discovery.

- **[CitySim: Modeling Urban Behaviors and City Dynamics with Large-Scale LLM-Driven Agent Simulation](https://arxiv.org/abs/2506.21805)** (*2025*) `Arxiv`
  > The paper presents CitySim, an urban simulator using LLMs.
  > It uses a recursive value-driven approach and endows agents with key features, enabling scalable urban studies.

- **[A Survey of AI for Materials Science: Foundation Models, LLM Agents, Datasets, and Tools](https://arxiv.org/abs/2506.20743)** (*2025*) `Arxiv`
  > This paper surveys FMs in MatSci, introducing a taxonomy, discussing advances, reviewing resources, assessing pros and cons, and suggesting future directions.

- **[An Auditable Agent Platform For Automated Molecular Optimisation](https://www.arxiv.org/abs/2508.03444)** (*2025*) `Arxiv`
  > A hierarchical agent framework automates molecular optimisation, creates auditable reasoning trajectories, and converts LLMs into auditable design systems.

- **[PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation](https://arxiv.org/abs/2508.21720)** (*2025*) `Arxiv`
  > A training-free PosterForest framework is proposed. It uses a Poster Tree and multi-agent collaboration for poster generation, addressing structure and integration challenges.

- **[Automated Clinical Problem Detection from SOAP Notes using a Collaborative Multi-Agent LLM Architecture](https://arxiv.org/abs/2508.21803)** (*2025*) `Arxiv`
  > A collaborative multi-agent system (MAS) mimicking a clinical team analyzes the S&O sections of SOAP notes, offering a path to better clinical decision support tools.

- **[Think in Games: Learning to Reason in Games via Reinforcement Learning with Large Language Models](https://arxiv.org/abs/2508.21365)** (*2025*) `Arxiv`
  > Proposes the Think in Games (TiG) framework enabling LLMs to gain procedural knowledge via game-environment interaction, bridging knowledge gaps and enhancing transparency.

- **[AlphaEvolve: A coding agent for scientific and algorithmic discovery](https://arxiv.org/abs/2506.13131)** (*2025*) `Arxiv`
  > The paper presents AlphaEvolve,
  > an evolutionary coding agent. It autonomously improves algorithms, discovers new ones, and broadens the scope of automated discovery.

- **[aiXiv: A Next-Generation Open Access Ecosystem for Scientific Discovery Generated by AI Scientists](https://arxiv.org/abs/2508.15126)** (*2025*) `Arxiv`
  > This paper introduces aiXiv, a multi-agent open-access platform that enables AI-generated research to be submitted, reviewed, and iteratively refined through seamless human-AI collaboration, addressing the lack of appropriate publication venues.

- **[GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis](https://arxiv.org/abs/2507.21035)** (*2025*) `Arxiv`
  > This framework introduces a multi-agent system that combines structured workflows with autonomous planning for gene expression analysis.
  > Its core novelty is a guided-planning approach where agents dynamically adapt a shared analytical plan, ensuring both reliability and flexibility in scientific discovery.

- **[Motif: Intrinsic Motivation from Artificial Intelligence Feedback](https://arxiv.org/pdf/2310.00166)** (*2024*) `ICLR`
  > The paper proposes Motif, a method to interface LLM prior knowledge with agents via intrinsic rewards, yielding intuitive behaviors and progress on tough tasks.

- **[Baba Is AI: Break the Rules to Beat the Benchmark](https://arxiv.org/pdf/2407.13729)** (*2024*) `ICML`
  > Presents a benchmark based on the puzzle game Baba Is You, testing whether agents can break and manipulate rules to solve tasks.

- **[Large language model-empowered agents for simulating macroeconomic activities](https://aclanthology.org/2024.acl-long.829/)** (*2024*) `*ACL`
  > This paper uses large language model-empowered agents to simulate macroeconomic activities, offering a novel approach in economic applications.

- **[CompeteAI: Understanding the Competition Dynamics in Large Language Model-based Agents](https://arxiv.org/abs/2310.17512)** (*2024*) `ICML`
  > The paper studies competition dynamics among LLM-based agents in economic applications, offering novel insights for the field.

- **[Understanding the benefits and challenges of using large language model-based conversational agents for mental well-being support](https://pmc.ncbi.nlm.nih.gov/articles/PMC10785945/)** (*2024*) `AMIA`
  > The paper explores the benefits and challenges of large language model-based conversational agents for mental well-being support in psychology applications.

- **[Exploring Collaboration Mechanisms for LLM Agents](https://aclanthology.org/2024.acl-long.782/)** (*2024*) `*ACL`
  > This paper explores collaboration mechanisms for LLM agents in psychology applications, bringing novel ideas to large model-based agents.

- **[Simulating Human Society with Large Language Model Agents: City, Social Media, and Economic System](https://dl.acm.org/doi/10.1145/3589335.3641253)** (*2024*) `WWW`
  > The paper applies large language model agents to simulate human society, covering city, social media, and economic systems, a novel contribution to the field.

- **[Can large language models transform computational social science?](https://aclanthology.org/2024.cl-1.8/)** (*2024*) `*ACL`
  > The paper explores whether large language models can transform computational social science in social applications, offering novel insights.

- **[AgentCF: Collaborative Learning with Autonomous Language Agents for Recommender Systems](https://arxiv.org/pdf/2310.09233)** (*2024*) `SIGIR`
  > Proposes AgentCF to simulate user-item interactions. It treats users and items as agents and uses collaborative learning to model two-sided relations, inspiring behavior simulation.

- **[On Generative Agents in Recommendation](https://arxiv.org/abs/2310.10108)** (*2024*) `SIGIR`
  > Proposes Agent4Rec, an LLM-empowered user simulator for recommenders with profile, memory, and action modules, exploring human behavior simulation.

- **[ChatDev: Communicative Agents for Software Development](https://aclanthology.org/2024.acl-long.810/)** (*2024*) `*ACL`
  > Introduces ChatDev, a chat-powered development framework in which LLM agents communicate in unified natural language through design, coding, and testing, bridging phases via language.
  > https://github.com/OpenBMB/ChatDev

- **[CRISPR-GPT: An LLM Agent for Automated Design of Gene-Editing Experiments](https://arxiv.org/abs/2404.18021)** (*2024*) `Arxiv`
  > Introduces CRISPR-GPT, an LLM agent with domain knowledge and tools for auto-designing gene-editing experiments. It addresses ethics and bridges the researcher-technique gap.

- **[SciAgents: Automating Scientific Discovery Through Bioinspired Multi-Agent Intelligent Graph Reasoning](https://advanced.onlinelibrary.wiley.com/doi/full/10.1002/adma.202413523)** (*2024*) `Advanced Materials`
  > SciAgents uses ontological KGs, LLMs, and multi-agent systems to uncover interdisciplinary relations in biomaterials, autonomously generating and refining hypotheses for discovery.

- **[Medical large language models are susceptible to targeted misinformation attacks](https://doi.org/10.1038/s41746-024-01282-7)** (*2024*) `npj Digital Medicine`
  > The paper reveals that LLMs in medicine are vulnerable. Manipulating just 1.1% of weights can inject incorrect facts, stressing the need for security measures.

- **[CellAgent: An LLM-driven Multi-Agent Framework for Automated Single-cell Data Analysis](https://arxiv.org/abs/2407.09811)** (*2024*) `Arxiv`
  > Introduces CellAgent, an LLM-driven multi-agent framework for scRNA-seq analysis.
  > It has expert roles, decision-making, and self-iterative mechanisms, reducing workload.

- **[Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents](https://arxiv.org/pdf/2302.01560)** (*2023*) `NeurIPS`
  > The paper proposes "DEPS", an interactive planning approach with LLMs for multi-task agents, refining plans and showing effectiveness across domains.

- **[Language Models Meet World Models: Embodied Experiences Enhance Language Models](https://arxiv.org/abs/2305.10626.pdf)** (*2023*) `NeurIPS`
  > Combines language and world models using embodied experiences. Applicable in simulation games, enhancing model capabilities.

- **[ChessGPT: Bridging Policy Learning and Language Modeling](https://proceedings.neurips.cc/paper_files/paper/2023/hash/16b14e3f288f076e0ca73bdad6405f77-Abstract-Datasets_and_Benchmarks.html)** (*2023*) `NeurIPS`
  > The paper bridges policy learning and language modeling, with potential applications in competition games, offering a novel approach for large model-based agents.

- **[Mindagent: Emergent gaming interaction](https://arxiv.org/pdf/2309.09971)** (*2023*) `Arxiv`
  > This paper explores emergent gaming interaction in cooperation games, presenting novel applications for large model-based agents.

- **[Exploring large language models for communication games: An empirical study on Werewolf](https://arxiv.org/abs/2309.04658)** (*2023*) `Arxiv`
  > This paper empirically explores large language models in Werewolf, a communication game, contributing novel applications in this area.

- **[Language as reality: a co-creative storytelling game experience in 1001 nights using generative AI](https://ojs.aaai.org/index.php/AIIDE/article/view/27539)** (*2023*) `AAAI`
  > This paper presents a co-creative
  > storytelling game in "1001 Nights" via generative AI, contributing a fresh application in game generation.

- **[TradingGPT: Multi-Agent System with Layered Memory and Distinct Characters for Enhanced Financial Trading Performance](https://arxiv.org/abs/2309.03736)** (*2023*) `Arxiv`
  > TradingGPT presents a multi-agent system with layered memory and distinct characters to boost financial trading performance, a novel approach in the field.

- **[Using large language models to simulate multiple humans and replicate human subject studies](https://proceedings.mlr.press/v202/aher23a/aher23a.pdf)** (*2023*) `ICML`
  > The paper applies large language models to simulate humans for replicating subject studies, contributing novel methods in psychological applications.

- **[Generative Agents: Interactive Simulacra of Human Behavior](https://arxiv.org/abs/2304.03442)** (*2023*) `UIST`
  > The paper presents generative agents simulating human behavior, with novel application to society, a potential addition to the study of large model-based agents.

- **[Self-collaboration Code Generation via ChatGPT](https://arxiv.org/abs/2304.07590)** (*2023*) `TOSEM`
  > A self-collaboration framework for code generation using LLMs like ChatGPT is proposed. It forms virtual teams of agents, improving complex task handling without human intervention.

- **[Language models can solve computer tasks](https://openreview.net/pdf?id=M6OmjAZ4CX)** (*2023*) `NeurIPS`
  > The paper presents RCI, a simple prompting scheme enabling pre-trained LLMs to execute computer tasks via natural language, enhancing reasoning and outperforming other methods.

- **[ChemCrow: Augmenting large-language models with chemistry tools](https://arxiv.org/abs/2304.05376)** (*2023*) `Arxiv`
  > Introduces ChemCrow, an LLM chemistry agent integrating 18 tools.
  > It augments LLMs in chemistry, automates tasks, and bridges experimental and computational chemistry.

- **[AlphaFlow: autonomous discovery and optimization of multi-step chemistry using a self-driven fluidic lab guided by reinforcement learning](https://www.nature.com/articles/s41467-023-37139-y)** (*2023*) `Nature Communications`
  > The paper presents AlphaFlow, a self-driven fluidic lab using reinforcement learning for multi-step chemistry discovery, demonstrating its potential for new synthetic routes beyond cALD.

- **[Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents](https://proceedings.mlr.press/v162/huang22a.html)** (*2022*) `ICML`
  > This paper presents language models as zero-shot planners for embodied agents, with applications in simulation games, offering novel actionable-knowledge extraction.

- **[Stress-testing the resilience of the Austrian healthcare system using agent-based simulation](https://doi.org/10.1038/s41467-022-31766-7)** (*2022*) `Nature Communications`
  > A data-driven agent-based framework quantifies regional healthcare resilience to shocks, helping identify care-access bottlenecks and relating systemic to individual indicators.

### Datasets & Benchmarks

- **[AgentHarm: Benchmarking Robustness of LLM Agents on Harmful Tasks](https://openreview.net/pdf?id=AC5n7xHuR1)** (*2025*) `ICLR`
  > Proposes AgentHarm, a new benchmark for LLM agents' robustness.
  > Covers 11 harm categories, enabling evaluation of attacks and defenses.

- **[AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator](https://aclanthology.org/2025.coling-main.680.pdf)** (*2025*) `*ACL`
  > Introduces AI Hospital for simulating medical interactions, develops the MVME benchmark, and proposes a dispute-resolution mechanism to boost LLMs' clinical abilities.

- **[Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation](https://aclanthology.org/2025.coling-main.223.pdf)** (*2025*) `*ACL`
  > A multi-agent benchmark self-evolving framework dynamically evaluates LLMs. It reframes instances and extends datasets, aiding model selection and benchmark evolution.

- **[DCA-Bench: A Benchmark for Dataset Curation Agents](https://openreview.net/pdf?id=a4sknPttwV)** (*2025*) `ICLR`
  > The paper sets up a benchmark for LLM agents to detect issues in wild datasets, curates test cases, and proposes an evaluation framework, promoting real-world curation.

- **[MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents](https://arxiv.org/pdf/2501.14654)** (*2025*) `Arxiv`
  > The paper introduces MedAgentBench, a benchmark for medical LLM agents with clinically derived tasks and realistic data, enabling evaluation and optimization in the medical domain.

- **[MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering](https://openreview.net/pdf?id=6s5uXNWGIh)** (*2025*) `ICLR`
  > Introduces MLE-bench for evaluating AI agents in ML engineering. Curates tasks, sets baselines, evaluates models, and open-sources code for future research.

- **[EgoLife: Towards Egocentric Life Assistant](https://arxiv.org/pdf/2503.03803)** (*2025*) `Arxiv`
  > Introduces the EgoLife project for an egocentric life assistant.
Creates the EgoLife Dataset and EgoLifeQA tasks for daily life assistance.\n\n- **[DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.07703)** (*2025*) `ICLR`\n  > The paper introduces DSBench, a comprehensive benchmark for data science agents with realistic tasks, bridging the gap between benchmarks and real-world applications.\n\n- **[Towards Internet-Scale Training For Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.06776)** (*2025*) `Arxiv`\n  > The paper develops a pipeline for Internet-scale agent training without extensive human annotations, with LLMs handling task generation, execution, and review.\n\n- **[macOSWorld: An Interactive Benchmark for GUI Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04135)** (*2025*) `Arxiv`\n  > The paper presents macOSWorld, the first macOS GUI agent benchmark, with multilingual tasks and a safety subset, bridging OS evaluation gaps.\n\n- **[Humanity's Last Exam](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.14249)** (*2025*) `Arxiv`\n  > Introduces Humanity's Last Exam (HLE), a multi-modal, broad-coverage benchmark for LLMs; it exposes a clear capability gap and is publicly released for research.\n\n- **[MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.12806)** (*2025*) `Arxiv, *ACL`\n  > Introduces MCPEval, an MCP-based open-source framework for automating LLM agent evaluation, standardizing metrics and reducing manual work.\n\n- **[IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.18223)** (*2025*) `Arxiv`\n  > Introduces IDA-Bench, a novel benchmark for LLMs in multi-round data analysis, emphasizing the balance between instruction-following and reasoning.\n\n- **[SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.11791)** (*2025*) `Arxiv`\n  > Introduces SEC-bench, an 
automated benchmarking framework for LLM agents on real-world security tasks, with a novel multi-agent scaffold to create datasets.\n\n- **[MMSearch-Plus: A Simple Yet Challenging Benchmark for Multimodal Browsing Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.21475)** (*2025*) `Arxiv`\n  > Introduces the MMSearch-Plus benchmark for multimodal browsing agents, with novel curation and an agent framework, addressing genuine multimodal challenges.\n\n- **[MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents](https:\u002F\u002Faclanthology.org\u002F2025.acl-long.421\u002F)** (*2025*) `*ACL`\n  > This paper introduces MultiAgentBench for evaluating multi-agent systems in diverse scenarios, assesses protocols and strategies such as cognitive planning, and releases code and data.\n\n- **[Establishing Best Practices for Building Rigorous Agentic Benchmarks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.02825)** (*2025*) `Arxiv`\n  > Many agentic benchmarks have setup or reward issues. The paper introduces the ABC guidelines to make agentic evaluation rigorous.\n\n- **[UserBench: An Interactive Gym Environment for User-Centric Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.22034)** (*2025*) `Arxiv`\n  > Introduces UserBench, an interactive environment with simulated users to evaluate agents on proactive collaboration. 
It measures their ability to clarify vague, evolving goals through multi-turn dialogue and tool use, highlighting a critical gap between task completion and user alignment.\n\n- **[PillagerBench: A Competitive Multi-Agent Benchmark for Evaluating LLM-based Agents in Minecraft](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.06235)** (*2025*) `Arxiv`\n  > This paper introduces PillagerBench, a benchmark for competitive multi-agent evaluation in Minecraft, and TactiCrafter, an agent that uses human-readable tactics and learns causal dependencies to adapt to opponents.\n\n- **[UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.20977)** (*2025*) `CVPR\u002FICCV\u002FECCV`\n  > UnrealZoo is a high-fidelity 3D virtual world platform with diverse entities and enhanced tools for embodied AI. It enables efficient multi-agent training and reveals that environmental diversity is crucial for developing generalizable agents that can handle open-world complexity.\n\n- **[Probe by Gaming: A Game-based Benchmark for Assessing Conceptual Knowledge in LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.17512)** (*2025*) `Arxiv`\n  > CK-Arena introduces a multi-agent game benchmark to evaluate LLMs' conceptual reasoning through interactive tasks like describing and differentiating concepts, moving beyond static factual recall.\n\n- **[NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.07172)** (*2025*) `Arxiv`\n  > NewtonBench introduces a scalable and memorization-resistant benchmark for scientific law discovery. 
It evaluates agents' ability to uncover hidden principles through interactive model exploration, moving beyond static function fitting to capture the authentic scientific process.\n\n- **[LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.15760)** (*2025*) `Arxiv`\n  > This paper introduces LiveMCP-101, a benchmark of 101 real-world queries requiring multi-tool orchestration. Its key novelty is an evaluation method using ground-truth execution plans to better reflect dynamic environments, rigorously testing agent capabilities.\n\n- **[Achilles Heel of Distributed Multi-Agent Systems](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.07461)** (*2025*) `Arxiv`\n  > This paper proposes a Distributed Multi-Agent System (DMAS) framework and identifies it as vulnerable to critical trustworthiness issues such as free riding and malicious attacks, serving as a red-teaming tool for future research.\n\n- **[AgentBench: Evaluating LLMs as Agents](https:\u002F\u002Fopenreview.net\u002Fpdf?id=zAdUB0aCTQ)** (*2024*) `ICLR`\n  > Presents AgentBench with 8 environments to evaluate LLM agents, identifies failure reasons, and offers improvement strategies such as multi-round alignment training.\n\n- **[AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents](https:\u002F\u002Faclanthology.org\u002F2024.naacl-demo.19.pdf)** (*2024*) `*ACL`\n  > Proposes AgentQuest, a framework with modular benchmarks\u002Fmetrics and two new evaluation metrics for tracking LLM agent progress.\n\n- **[BENCHAGENTS: Automated Benchmark Creation with Agent Interaction](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.22584)** (*2024*) `Arxiv`\n  > Introduces BENCHAGENTS, an LLM-based framework to automate benchmark creation for complex capabilities, ensuring data quality with agent interaction and human feedback.\n\n- **[Benchmarking Data Science 
Agents](https:\u002F\u002Faclanthology.org\u002F2024.acl-long.308.pdf)** (*2024*) `*ACL`\n  > Presents DSEval, a novel evaluation paradigm and benchmarks for data science agents, with a bootstrapped method for better coverage and comprehensiveness.\n\n- **[Benchmarking Large Language Models as AI Research Agents](https:\u002F\u002Fopenreview.net\u002Fpdf?id=N9wD4RFWY0)** (*2024*) `ICLR`\n  > Proposes MLAgent-Bench for benchmarking AI research agents, designs an LLM-based agent, and identifies key challenges for such agents.\n\n- **[Benchmarking Large Language Models for Multi-agent Systems: A Comparative Analysis of AutoGen, CrewAI, and TaskWeaver](https:\u002F\u002Flink.springer.com\u002Fchapter\u002F10.1007\u002F978-3-031-70415-4_4)** (*2024*) `Others`\n  > The paper benchmarks three LLM-powered multi-agent systems (AutoGen, CrewAI, TaskWeaver) on ML code generation, advancing collaborative problem-solving research.\n\n- **[BLADE: Benchmarking Language Model Agents](https:\u002F\u002Faclanthology.org\u002F2024.findings-emnlp.815.pdf)** (*2024*) `*ACL`\n  > This paper presents BLADE, a benchmark for automatically evaluating agents' multifaceted approaches to open-ended research, enabling agent evaluation for data-driven science.\n\n- **[CRAB: Cross-platform agent benchmark for multi-modal embodied language model agents](https:\u002F\u002Fopenreview.net\u002Fpdf?id=kyExS4V0H7)** (*2024*) `NeurIPS`\n  > Introduces Crab, a cross-environment agent benchmark framework with a graph-based evaluation method and a task construction mechanism, and develops Crab Benchmark-v0.\n\n- **[CToolEval: A Chinese Benchmark for LLM-Powered Agent Evaluation in Real-World API Interactions](https:\u002F\u002Faclanthology.org\u002F2024.findings-acl.928.pdf)** (*2024*) `*ACL`\n  > Proposes the CToolEval benchmark for Chinese LLM agents with 398 APIs. 
Presents an evaluation framework and releases data\u002Fcode to promote agent-level research.\n\n- **[DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models](https:\u002F\u002Faclanthology.org\u002F2024.emnlp-main.748.pdf)** (*2024*) `*ACL`\n  > Introduces DA-Code, a code generation benchmark for LLMs on agent-based data science. It has unique tasks and real data, requires complex languages, and is released on GitHub.\n\n- **[Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2024\u002Fhash\u002Fb631da756d1573c24c9ba9c702fde5a9-Abstract-Datasets_and_Benchmarks_Track.html)** (*2024*) `NeurIPS`\n  > Proposes an Embodied Agent Interface to unify tasks, modules, and metrics, comprehensively assessing LLMs for embodied decision making.\n\n- **[GTA: A Benchmark for General Tool Agents](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2024\u002Ffile\u002F8a75ee6d4b2eb0b777f549a32a5a5c28-Paper-Datasets_and_Benchmarks_Track.pdf)** (*2024*) `NeurIPS`\n  > Proposes GTA, a benchmark for general tool agents with real queries, deployed tools, and multimodal inputs to assess LLMs' real-world problem-solving.\n\n- **[LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.04372)** (*2024*) `CVPR\u002FICCV\u002FECCV`\n  > Presents LaMPilot, integrating LLMs into autonomous driving for following user instructions. 
Introduces LaMPilot-Bench and releases code\u002Fdata for further research.\n\n- **[ML Research Benchmark](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.22553)** (*2024*) `Arxiv`\n  > The paper presents the ML Research Benchmark with 7 tasks for AI agents, offering a framework to assess them on real-world research challenges.\n\n- **[MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.18961)** (*2024*) `Arxiv`\n  > The paper introduces the MMAU benchmark, with offline tasks across 5 domains and 5 capabilities, enhancing the interpretability of LLM agents.\n\n- **[OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.17553)** (*2024*) `CVPR\u002FICCV\u002FECCV`\n  > Introduces OmniACT, a first-of-its-kind dataset and benchmark for assessing agents' ability in computer task automation, covering desktop apps.\n\n- **[OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2024\u002Ffile\u002F5d413e48f84dc61244b6be550f1cd8f5-Paper-Datasets_and_Benchmarks_Track.pdf)** (*2024*) `NeurIPS`\n  > Introduces OSWorld, a scalable real-computer environment 
for multimodal agents, creates a 369-task benchmark, and aids multimodal generalist agent development.\n\n- **[Revisiting Benchmark and Assessment: An Agent-based Exploratory Dynamic Evaluation Framework for LLMs](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.11507)** (*2024*) `Arxiv`\n  > The paper introduces Benchmark+ and Assessment+ and proposes the TestAgent framework, enabling dynamic benchmark generation and domain-adaptive assessments for LLMs.\n\n- **[Seal-Tools: Self-instruct Tool Learning Dataset for Agent Tuning and Detailed Benchmark](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.08355)** (*2024*) `Others`\n  > A new tool-learning dataset, Seal-Tools, generated with a self-instruct method, featuring hard instances and strict metrics as a new benchmark for LLMs' tool-calling.\n\n- **[Tapilot-Crossing: Benchmarking and Evolving LLMs Towards Interactive Data Analysis Agents](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.05307)** (*2024*) `Arxiv`\n  > The paper introduces the Tapilot-Crossing benchmark for evaluating LLM agents in interactive data analysis and proposes the AIR strategy to evolve LLMs into effective agents.\n\n- **[TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.14161)** (*2024*) `Arxiv`\n  > This paper introduces TheAgentCompany, an extensible benchmark for evaluating AI agents on real-world tasks, mimicking a software company environment.\n\n- **[Tur[k]ingBench: A Challenge Benchmark for Web Agents](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.11905)** (*2024*) `Arxiv`\n  > Presents TurkingBench, a benchmark using natural HTML pages from crowdsourcing. 
Develops an evaluation framework to spur web-based agent progress.\n\n- **[Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models](https:\u002F\u002Faclanthology.org\u002F2024.findings-acl.557\u002F)** (*2024*) `*ACL`\n  > The paper identifies issues in agent training and proposes Agent-FLAN to fine-tune LLMs, decomposing the corpus and using negative samples to reduce hallucinations.\n\n- **[AgentBank: Towards Generalized LLM Agents via Fine-Tuning on 50000+ Interaction Trajectories](https:\u002F\u002Faclanthology.org\u002F2024.findings-emnlp.116\u002F)** (*2024*) `*ACL`\n  > Introduces AgentBank, a large trajectory data collection. Uses novel annotation, fine-tunes LLMs to obtain Samoyed, and shows data scaling for agent capabilities.\n\n- **[AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning](http:\u002F\u002Farxiv.org\u002Fabs\u002F2402.15506)** (*2024*) `Arxiv`\n  > Introduces AgentOhana to unify agent trajectories from diverse sources, enabling a balanced training pipeline, and presents xLAM-v0.1 for AI agents.\n\n- **[AgentTuning: Enabling Generalized Agent Abilities for LLMs](https:\u002F\u002Faclanthology.org\u002F2024.findings-acl.181\u002F)** (*2024*) `*ACL`\n  > This paper presents AgentTuning, a method using AgentInstruct and hybrid tuning to boost LLMs' agent abilities without sacrificing generality, and open-sources models.\n\n- **[Executable Code Actions Elicit Better LLM Agents](https:\u002F\u002Fproceedings.mlr.press\u002Fv235\u002Fwang24h.html)** (*2024*) `ICML`\n  > This paper proposes CodeAct, using Python code for LLM agents' actions. 
It creates a dataset and a finetuned agent with self-debugging for complex tasks.\n\n- **[AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.18901)** (*2024*) `*ACL`\n  > Built AppWorld Engine and Benchmark to address gaps in existing tool-use benchmarks, enabling rich and interactive code generation for agents.\n\n- **[SheetAgent: Towards A Generalist Agent for Spreadsheet Reasoning and Manipulation via Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.03636)** (*2024*) `WWW`\n  > The paper introduces the SheetRM benchmark and proposes SheetAgent, an LLM-based agent with three modules, enabling autonomous spreadsheet reasoning and manipulation.\n\n- **[GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.15341)** (*2024*) `Arxiv`\n  > This paper introduces GenoTEX, a benchmark for LLM agents in gene expression analysis, and GenoAgent, a multi-agent system using self-correcting code generation to automate the entire bioinformatics pipeline.\n\n- **[FireAct: Toward Language Agent Fine-tuning](http:\u002F\u002Farxiv.org\u002Fabs\u002F2310.05915)** (*2023*) `Arxiv`\n  > The paper explores LM fine-tuning for language agents. 
It proposes FireAct using diverse data, revealing benefits and offering experimental insights.\n\n### Ethics\n\n- **[Medical large language models are vulnerable to data-poisoning attacks](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41591-024-03445-1)** (*2025*) `Nature Medicine`\n  > The paper assesses LLM data-poisoning attacks, finds that even low-ratio misinformation harms models, and proposes a graph-based mitigation strategy.\n\n- **[Foundation Models and Fair Use](https:\u002F\u002Fwww.jmlr.org\u002Fpapers\u002Fv24\u002F23-0569.html)** (*2024*) `JMLR`\n  > Discusses legal and ethical risks of foundation models trained on copyrighted data, suggests technical mitigations, and advocates law-tech co-evolution for fair use.\n\n- **[Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model](https:\u002F\u002Fwww.jmlr.org\u002Fpapers\u002Fv24\u002F23-0069.html)** (*2023*) `JMLR`\n  > The paper quantifies BLOOM's carbon footprint across its life cycle and studies its inference emissions, discussing estimation challenges and future research.\n\n- **[LLaMA: Open and Efficient Foundation Language Models](https:\u002F\u002Fai.meta.com\u002Fresearch\u002Fpublications\u002Fllama-open-and-efficient-foundation-language-models\u002F)** (*2023*)\n  > Introduces the LLaMA models (7B-65B parameters), trained with public datasets only and released to the research community.\n\n- **[Predictability and Surprise in Large Generative Models](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3531146.3533229)** (*2022*) `FAccT`\n  > This paper reveals large generative models' paradox of predictability and unpredictability, shows harms, and lists AI community interventions.\n\n- **[On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 
🦜](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3442188.3445922)** (*2021*) `FAccT`\n  > The paper questions the ever-growing size of language models, explores the associated risks, and offers mitigation recommendations beyond just scaling up.\n\n- **[Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2021\u002Fhash\u002F2e855f9489df0712b4bd8ea9e2848c5a-Abstract.html)** (*2021*) `NeurIPS`\n  > Proposes PALMS, an iterative process using values-targeted datasets to change language model behavior with a small, curated dataset.\n\n- **[GPT-3: Its Nature, Scope, Limits, and Consequences](https:\u002F\u002Flink.springer.com\u002Farticle\u002F10.1007\u002Fs11023-020-09548-1)** (*2020*)\n  > The paper analyzes GPT-3 via reversible\u002Firreversible questions, presents three tests it fails, and outlines the consequences of artefact industrialization.\n\n- **[Energy and Policy Considerations for Modern Deep Learning Research](https:\u002F\u002Fojs.aaai.org\u002Findex.php\u002FAAAI\u002Farticle\u002Fview\u002F7123)** (*2020*) `AAAI`\n  > It reveals the high costs of large neural network computation, quantifies NLP model costs, and offers recommendations for cost reduction and equity.\n\n- **[Defending Against Neural Fake News](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper\u002F2019\u002Fhash\u002F3e9f0fc9b2f89e043bc6233994dfcf76-Abstract.html)** (*2019*) `NeurIPS`\n  > Presents Grover for controllable text generation to study neural fake-news risks, shows Grover's value for self-defense, and discusses ethics.\n\n### Security\n\n- **[RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.08966)** (*2025*) `Arxiv`\n  > The paper introduces RTBAS for tool-based agent systems (TBAS), adapting Information Flow Control and using novel screeners to automatically handle tool calls, reducing user burden.\n\n- **[Red-Teaming LLM Multi-Agent Systems via 
Communication Attacks](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.14847)** (*2025*) `Arxiv`\n  > Presents Agent-in-the-Middle (AiTM), a novel attack on LLM-MAS via message manipulation, showing the need for multi-agent system security.\n\n- **[Unveiling Privacy Risks in LLM Agent Memory](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13172)** (*2025*) `Arxiv`\n  > The paper proposes MEXTRA, a black-box attack to extract private information from LLM agent memory, and explores the factors behind leakage, urging safeguards.\n\n- **[AEIA-MN: Evaluating the Robustness of Multimodal LLM-Powered Mobile Agents Against Active Environmental Injection Attacks](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.13053)** (*2025*) `Arxiv`\n  > The paper defines the Active Environment Injection Attack (AEIA) and proposes AEIA-MN to evaluate MLLM-based agents' robustness against such threats.\n\n- **[Firewalls to Secure Dynamic LLM Agentic Networks](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.01822)** (*2025*) `Arxiv`\n  > The paper identifies communication properties for LLM agentic networks, proposes a design that balances them, and constructs firewall rules via simulations.\n\n- **[AutoHijacker: Automatic Indirect Prompt Injection against Black-box LLM Agents](https:\u002F\u002Fopenreview.net\u002Fpdf?id=2VmB01D9Ef)** (*2025*) `Arxiv`\n  > Proposes AutoHijacker, an automatic indirect black-box prompt injection attack. 
It uses LLM-as-optimizers, batch-based optimization, and a trainable memory.\n\n- **[AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fpdf\u002F10.1145\u002F3716628)** (*2025*) `ACM Computing Surveys`\n  > This paper categorizes emerging security threats to AI agents into four knowledge gaps, aiming to inspire research toward more secure agent applications.\n\n- **[SAGA: A Security Architecture for Governing AI Agentic Systems](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.21034)** (*2025*) `Arxiv`\n  > Proposes SAGA, a security architecture for governing agentic systems, offering user oversight and fine-grained access control, enabling secure agent deployment.\n\n- **[WebInject: Prompt Injection Attack to Web Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.11717)** (*2025*) `Arxiv`\n  > Proposes WebInject, a prompt injection attack on web agents. It adds pixel perturbations, trains a neural network for mapping, and solves an optimization problem.\n\n- **[Web Fraud Attacks on LLM-driven Multi-Agent Systems](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.01211)** (*2025*) `Arxiv`\n  > This paper introduces Web Fraud Attacks, a novel method exploiting vulnerabilities in LLM-driven multi-agent systems by inducing them to visit malicious websites through domain tampering and link camouflage, bypassing complex jailbreaking techniques.\n\n- **[Attacking LLMs and AI Agents: Advertisement Embedding Attacks Against Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.17674)** (*2025*) `Arxiv`\n  > Introduces Advertisement Embedding Attacks (AEA), a novel threat that hijacks LLMs to inject covert promotional or malicious content into outputs. 
It details two low-cost attack vectors and proposes a prompt-based defense, highlighting a critical gap in AI security.\n\n- **[Beyond Data Privacy: New Privacy Risks for Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.14278)** (*2025*) `Arxiv`\n  > This paper argues that beyond training data leakage, the deployment of LLMs introduces novel privacy risks from their autonomous reasoning and integration into applications, enabling sophisticated, large-scale attacks that threaten individual and societal security.\n\n- **[Privacy in Action: Towards Realistic Privacy Mitigation and Evaluation for LLM-Powered Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.17488)** (*2025*) `Arxiv`\n  > This paper introduces PrivacyChecker, a model-agnostic mitigation method, and PrivacyLens-Live, a dynamic benchmark. They address novel privacy risks in LLM agents by integrating contextual integrity into agent protocols.\n\n- **[PrivWeb: Unobtrusive and Content-aware Privacy Protection For Web Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.11939)** (*2025*) `Arxiv`\n  > This work introduces PrivWeb, a privacy framework for web agents that uses a local LLM to automatically anonymize on-screen data based on user preferences, balancing automated protection with user control through adaptive, context-aware notifications.\n\n- **[DemonAgent: Dynamically Encrypted Multi-Backdoor Implantation Attack on LLM-based Agent](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12575)** (*2025*) `*ACL`\n  > Proposes the Dynamically Encrypted Multi-Backdoor Implantation Attack, which uses dynamic encryption and sub-fragments to bypass safety audits, and presents the AgentBackdoorEval dataset.\n\n- **[CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14529)** (*2025*) `Arxiv`\n  > Presents Contagious Recursive Blocking Attacks (Corba) on LLM-MASs. 
Novel in propagation and resource depletion, and hard to mitigate via alignment.\n\n- **[G-Safeguard: A Topology-Guided Security Lens and Treatment on LLM-based Multi-agent Systems](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.11127)** (*2025*) `*ACL`\n  > The paper introduces G-Safeguard for LLM-MAS. It uses graph neural networks for anomaly detection and topological intervention, and is adaptable and combinable with mainstream MAS.\n\n- **[AgentHarm: Benchmarking Robustness of LLM Agents on Harmful Tasks](https:\u002F\u002Fopenreview.net\u002Fforum?id=AC5n7xHuR1)** (*2025*) `ICLR`\n  > Proposes AgentHarm, a new benchmark for LLM agent misuse with 110 malicious tasks, enabling evaluation of attacks and defenses.\n\n- **[Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.08586)** (*2025*) `Arxiv`\n  > This paper analyzes unique security and privacy vulnerabilities of LLM agents, provides an attack taxonomy, and conducts simple attacks requiring no ML knowledge.\n\n- **[A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.15585)** (*2025*) `Arxiv`\n  > This paper introduces \"full-stack\" safety for LLMs, covering the whole lifecycle, with extensive literature and unique insights on research directions.\n\n- **[LLM-based Multi-Agent Systems: Techniques and Business Perspectives](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.14033?)** (*2024*) `Arxiv`\n  > This paper explores LLM-based Multi-Agent Systems (LaMAS), presents its advantages, provides a protocol, and sees it as a solution for artificial collective intelligence.\n\n- **[BlockAgents: Towards Byzantine-Robust LLM-Based 
Multi-Agent Coordination via Blockchain](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fpdf\u002F10.1145\u002F3674399.3674445)** (*2024*) `TURC`\n  > The paper proposes BlockAgents, integrating blockchain into LLM-based multi-agent systems. It features PoT and multi-metric evaluation to mitigate Byzantine behaviors.\n\n- **[Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.07283)** (*2024*) `Arxiv`\n  > Reveals LLM-to-LLM prompt injection in multi-agent systems, proposes “Prompt Infection”, and the “LLM Tagging” defense to enhance security.\n\n- **[AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents](https:\u002F\u002Fopenreview.net\u002Fpdf?id=m1YYAQjO3w)** (*2024*) `NeurIPS`\n  > Introduces AgentDojo, an extensible evaluation framework for AI agents on untrusted data, aiming to foster reliable and robust agent design.\n\n- **[AGENTPOISON: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2024\u002Ffile\u002Feb113910e9c3f6242541c1652e30dfd6-Paper-Conference.pdf)** (*2024*) `NeurIPS`\n  > Proposes AGENTPOISON, a novel backdoor attack on LLM agents by poisoning memory\u002FKB. 
No extra training is needed, and it shows good transferability and stealth.\n\n- **[AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.04783)** (*2024*) `Arxiv`\n  > Proposes AutoDefense, a multi-agent framework that filters harmful LLM responses, is robust against attacks, and enables small LMs to defend larger ones.\n\n- **[Imprompter: Tricking LLM Agents into Improper Tool Use](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.14923)** (*2024*) `Arxiv`\n  > The paper contributes to agent-based system security, presents obfuscated adversarial prompt attacks, and shows they work on multiple agents.\n\n- **[Targeting the Core: A Simple and Effective Method to Attack RAG-based Agents via Direct LLM Manipulation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.04415)** (*2024*) `Arxiv`\n  > This paper reveals a critical LLM vulnerability via adversarial prefixes, highlighting the need for multi-layered security in agent-based architectures.\n\n- **[Prompt Injection as a Defense Against LLM-driven Cyberattacks](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.20911)** (*2024*) `Arxiv`\n  > Proposes Mantis, an open-source defensive framework that uses prompt injection to counter LLM-driven cyberattacks and can autonomously hack back attackers.\n\n- **[Evil Geniuses: Delving into the Safety of LLM-based Agents](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.11855)** (*2024*) `Arxiv`\n  > The paper explores LLM-based agent safety from three aspects. 
It proposes a template-based attack and an \"Evil Geniuses\" method for in-depth analysis.\n\n- **[Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.02644?)** (*2024*) `Arxiv`\n  > Introduces Agent Security Bench (ASB) for formalizing and benchmarking LLM-based agent attacks\u002Fdefenses, revealing vulnerabilities and future work.\n\n- **[AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.09024)** (*2024*) `Arxiv`\n  > A new benchmark, AgentHarm, is proposed for LLM agent misuse research, with diverse malicious tasks and unique scoring criteria.\n\n- **[CLAS 2024: The Competition for LLM and Agent Safety](https:\u002F\u002Fopenreview.net\u002Fpdf?id=GIDw94AlZK)** (*2024*) `Arxiv`\n  > CLAS 2024 advances the understanding of LLM and agent safety via three tracks, fostering community collaboration for safer AI systems.\n\n- **[The Task Shield: Enforcing Task Alignment to Defend Against Indirect Prompt Injection in LLM Agents](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.16682)** (*2024*) `Arxiv`\n  > The paper proposes reframing agent security to ensure task alignment. 
It develops Task Shield to verify that instructions contribute to user goals.\n\n- **[WIPI: A New Web Threat for LLM-Driven Web Agents](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.16965)** (*2024*) `Arxiv`\n  > This paper introduces a novel threat, WIPI, which indirectly controls Web Agents via web-page instructions, enhancing attack efficiency and stealth.\n\n- **[Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.08567)** (*2024*) `Arxiv`\n  > The paper reveals 'infectious jailbreak' in multi-agent MLLM systems, shows its feasibility, and proposes a defense principle for restraining the spread.\n\n- **[PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety](https:\u002F\u002Faclanthology.org\u002F2024.acl-long.812\u002F)** (*2024*) `ACL`\n  > This paper explores multi-agent system safety through agent psychology, proposes the PsySafe framework, and offers insights for future research.\n\n- **[Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.16950)** (*2024*) `Arxiv`\n  > The paper introduces the foot-in-the-door attack on ReAct agents. 
It proposes a reflection mechanism to mitigate this security vulnerability.\n\n- **[AGENT-SAFETYBENCH: Evaluating the Safety of LLM Agents](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.14470)** (*2024*) `Arxiv`\n  > The paper introduces AGENT-SAFETYBENCH to evaluate LLM agent safety, identifies flaws, stresses the need for advanced strategies, and will release the benchmark.\n\n- **[INJECAGENT: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.02691)** (*2024*) `Arxiv`\n  > Introduces INJECAGENT, a benchmark for assessing the IPI attack vulnerability of tool-integrated LLM agents, categorizing attack intents.\n\n- **[TrustAgent: Towards Safe and Trustworthy LLM-based Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.01586)** (*2024*) `*ACL`\n  > Presents TrustAgent, an agent-constitution-based framework. It uses three strategies for LLM-agent safety, examines the impact on helpfulness, and reveals the importance of LLM reasoning.\n\n- **[Watch Out for Your Agents! 
Investigating Backdoor Threats to LLM-Based Agents](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2024\u002Fhash\u002Fb6e9d6f4f3428cd5f3f9e9bbae2cab10-Abstract-Conference.html)** (*2024*) `NeurIPS`\n  > This work formulates a framework for agent backdoor attacks, analyzes their diverse forms, and reveals a need for targeted defenses against them.\n\n- **[R-Judge: Benchmarking Safety Risk Awareness for LLM Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.10019)** (*2024*) `*ACL`\n  > Introduces R-Judge, a benchmark for evaluating LLM agents' safety risk awareness, shows room for improvement, and reveals an effective fine-tuning approach.\n\n- **[NetSafe: Exploring the Topological Safety of Multi-agent Networks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.15686)** (*2024*) `*ACL`\n  > This paper offers a topological view on multi-agent network safety, proposes NetSafe, and identifies new phenomena, guiding future safety research.\n\n- **[A Trembling House of Cards? Mapping Adversarial Attacks against Language Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.10196)** (*2024*) `Arxiv`\n  > Presents the first systematic mapping of adversarial attacks on language agents, with a framework and 12 scenarios, stressing the urgency of understanding these risks.\n\n### Survey\n\n- **[A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.15585)** (*2025*) `Arxiv`\n  > This paper first introduces \"full-stack\" safety for LLMs, covering the whole lifecycle, with rich literature and unique insights for future research.\n\n- **[Trust but Verify! 
A Survey on Verification Design for Test-time Scaling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.16665)** (*2025*) `Arxiv`\n  > This survey covers diverse TTS verification approaches and presents a unified view of verifier training, filling a gap in the literature.\n\n- **[A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.21148)** (*2025*) `Arxiv`\n  > This survey reframes Sci-LLM development around data, analyzes datasets and evaluation, and proposes a shift to closed-loop systems for scientific discovery.\n\n- **[Evaluation and Benchmarking of LLM Agents: A Survey](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.21504)** (*2025*) `Arxiv`\n  > Presents a 2D taxonomy for LLM agent evaluation, highlights enterprise challenges, and identifies future research directions for systematic assessment.\n\n- **[A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.07407)** (*2025*) `Arxiv`\n  > This paper provides a comprehensive survey of self-evolving AI agents, introducing a unified framework and reviewing techniques, domain-specific strategies, evaluation, safety and ethics to bridge foundation models and lifelong agentic systems.\n\n- **[The Landscape of Agentic Reinforcement Learning for LLMs: A Survey](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.02547)** (*2025*) `Arxiv`\n  > This survey formalizes the shift from LLMs as generators to autonomous agents, proposing a dual taxonomy of 
core capabilities and applications. It positions reinforcement learning as the key mechanism for integrating these modules into adaptive, robust agentic behavior.\n\n- **[Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.02189)** (*2025*) `Arxiv`\n  > This paper offers a systematic VLM overview: model info, architectures, benchmarks, applications, and challenges, with details in a GitHub repo.\n\n- **[The Future is Agentic: Definitions, Perspectives, and Open Challenges of Multi-Agent Recommender Systems](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.02097)** (*2025*) `Arxiv`\n  > This paper explores LLM agents in recommender systems, introducing a formalism, use cases, and challenges, and paving the way for next-generation services.\n\n- **[Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.08586)** (*2025*) `Arxiv`\n  > Analyzes unique security & privacy vulnerabilities of LLM agents, provides an attack taxonomy, and conducts simple attacks on popular agents.\n\n- **[Multi-Agent Collaboration Mechanisms: A Survey of LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.06322)** (*2025*) `Arxiv`\n  > This paper surveys LLM-based Multi-Agent Systems, introduces a framework, explores applications, and identifies challenges and directions for AI collective intelligence.\n\n- **[AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3716628)** (*2025*) `ACM Computing Surveys`\n  > This survey identifies four security knowledge gaps for AI agents. 
It reviews threats, shows progress\u002Flimitations, and inspires future research.\n\n- **[Large Model Based Agents: State-of-the-Art, Cooperation Paradigms, Security and Privacy, and Future Trends](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.14457)** (*2024*) `Arxiv`\n  > Paper explores future autonomous collaboration of LM agents, covers current state, collaboration paradigms, security risks, and proposes future research directions.\n\n- **[Agent AI: Surveying the Horizons of Multimodal Interaction](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.03568)** (*2024*) `Arxiv`\n  > Defines \"Agent AI\" for interactive multimodal systems, explores action prediction, mitigates model hallucinations, and envisions virtual interactions.\n\n- **[Large Language Model based Multi-Agents: A Survey of Progress and Challenges](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.01680)** (*2024*) `Arxiv`\n  > This survey discusses essential aspects and challenges of LLM-based multi-agent systems, provides datasets, and maintains a GitHub repo for the latest research.\n\n- **[Large Multimodal Agents: A Survey](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.15116)** (*2024*) `Arxiv`\n  > Reviews LLM-driven multimodal agents, categorizes research, compiles evaluation methods, and proposes future directions.\n\n- **[Understanding the planning of LLM agents: A survey](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.02716)** (*2024*) `Arxiv`\n  > This survey offers the first systematic view of LLM-based agent planning, categorizes related works, analyzes directions, and discusses challenges.\n\n- **[Computational Experiments Meet Large Language Model Based Agents: A Survey and Perspective](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.00262)** (*2024*) `Arxiv`\n  > The paper explores combining computational experiments with LLM-based Agents, outlines their history, mutual advantages, and addresses challenges and trends.\n\n- **[Personal LLM Agents: Insights and Survey about the 
Capability, Efficiency and Security](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.05459)** (*2024*) `Arxiv`\n  > The paper focuses on Personal LLM Agents, discusses architecture, challenges, and solutions, envisioning them as a major software paradigm.\n\n- **[The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.11584)** (*2024*) `Arxiv`\n  > This survey assesses AI agent implementations, sharing capabilities, insights, and design considerations, highlighting key themes for robust systems.\n\n- **[Exploring Large Language Model based Intelligent Agents: Definitions, Methods, and Prospects](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.03428)** (*2024*) `Arxiv`\n  > This paper surveys LLM-based intelligent agents in single- and multi-agent systems, covering definitions, components, deployment mechanisms, and envisions their prospects.\n\n- **[Position Paper: Agent AI Towards a Holistic Intelligence](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.00833)** (*2024*) `Arxiv`\n  > Paper proposes Agent Foundation Model for embodied intelligence, discusses Agent AI's domain capabilities and interdisciplinary potential, guiding future research.\n\n- **[Large Language Model based Multi-Agents: A Survey of Progress and Challenges](https:\u002F\u002Fwww.ijcai.org\u002Fproceedings\u002F2024\u002F0890.pdf)** (*2024*) `IJCAI`\n  > This survey delves into LLM-based multi-agent systems. It covers operation domains, agent profiles, and skill-development means. 
It also lists datasets and maintains a GitHub repo.\n\n- **[LLM With Tools: A Survey](http:\u002F\u002Farxiv.org\u002Fabs\u002F2409.18807)** (*2024*) `Arxiv`\n  > Presents a standardized tool-integration paradigm, explores challenges and innovative solutions, including the idea of LLMs creating tools, and reproduces results.\n\n- **[A Survey on the Memory Mechanism of Large Language Model based Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.13501)** (*2024*) `Arxiv`\n  > This paper comprehensively surveys LLM-based agents' memory mechanisms, reviewing design and evaluation, presenting applications, and suggesting future directions.\n\n- **[A Survey on Large Language Model-Based Game Agents](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.02039)** (*2024*) `Arxiv`\n  > Paper offers a holistic overview of LLM-based game agents, details architecture, surveys agents across game genres, and presents future R&D directions.\n\n- **[Large Language Models and Games: A Survey and Roadmap](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.18659)** (*2024*) `Arxiv`\n  > This paper surveys LLM applications in games, identifies LLM roles, discusses unexplored areas, and reconciles potential and limitations, paving the way for new research.\n\n- **[Navigating the Risks: A Survey of Security, Privacy, and Ethics Threats in LLM-Based Agents](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.09523)** (*2024*) `Arxiv`\n  > This survey analyzes security, privacy, and ethics threats in LLM-based agents, proposes a taxonomy, and suggests future research directions.\n\n- **[Security of AI Agents](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.08689)** (*2024*) `Arxiv`\n  > The paper identifies AI agents' security vulnerabilities from a system perspective, introduces defenses, and offers ways to enhance their safety and reliability.\n\n- **[The Emerged Security and Privacy of LLM Agent: A Survey with Case Studies](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.19354)** (*2024*) `Arxiv`\n  > This paper comprehensively overviews LLM agents' privacy and security issues, covering threats, impacts, defenses, and trends, with case studies to inspire future research.\n\n- **[Inferring the Goals of Communicating Agents from Actions and Instructions](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.16207)** (*2024*) `ICML Workshop`\n  > The paper models human inferential ability in cooperation. 
It uses GPT-3 for instruction utterances and multi-modal Bayesian inverse planning to infer goals, showing the importance of verbal communication.\n\n- **[Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.09097)** (*2024*) `Arxiv`\n  > Paper analyzes LLM red-teaming attacks (e.g., gradient-based, RL) and defenses, aiming to foster more secure and reliable language models.\n\n- **[Deconstructing The Ethics of Large Language Models from Long-standing Issues to New-emerging Dilemmas: A Survey](https:\u002F\u002Fui.adsabs.harvard.edu\u002Fabs\u002F2024arXiv240605392D\u002Fabstract)** (*2024*)\n  > This paper surveys LLMs' ethical challenges from old to new, analyzes related research, and emphasizes integrating ethics into LLM development.\n\n- **[A survey on large language model based autonomous agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.11432)** (*2023*) `FCS`\n  > This paper surveys LLM-based autonomous agents, offers a unified construction framework, overviews applications, and presents challenges and future directions.\n\n- **[The rise and potential of large language model based agents: a survey](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.07864)** (*2023*) `SCIS`\n  > This paper surveys LLM-based agents, presents a general framework, explores applications, delves into agent societies, and discusses key topics and problems.\n\n- **[Large Language Model Alignment: A Survey](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.15025)** (*2023*) `Arxiv`\n  > This survey categorizes LLM alignment methods, explores related 
issues, presents benchmarks, and envisions future research for capable and safe LLMs.\n\n- **[Ethical and social risks of harm from Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.04359)** (*2021*) `Arxiv`\n  > Paper analyzes risks of large-scale LMs, outlines six areas with 21 risks, suggests mitigations, and points to further research directions.\n\n- **[On the Opportunities and Risks of Foundation Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2108.07258)** (*2021*) `Arxiv`\n  > This paper details opportunities and risks of foundation models, notes emergent capabilities & homogenization issues, calls for interdisciplinary research.\n\n- **[Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims](https:\u002F\u002Farxiv.org\u002Fabs\u002F2004.07213)** (*2020*) `Arxiv`\n  > The paper proposes steps for different stakeholders to enhance verifiability of AI claims, analyzes ten mechanisms, and gives related recommendations.\n\n- **[Actionable Auditing: Investigating the Impact of Publicly Naming Biased Performance Results of Commercial AI Products](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3306618.3314244)** (*2019*) `AIES`\n  > This paper analyzes the impact of disclosing biased AI results via the Gender Shades audit, showing it can prompt companies to reduce algorithmic disparities.\n\n### Tools\n\n- **[ToolCoder: A Systematic Code-Empowered Tool Learning Framework for Large Language Models](http:\u002F\u002Farxiv.org\u002Fabs\u002F2502.11404)** (*2025*) `Arxiv`\n  > Proposes ToolCoder, reformulating tool learning as code generation. 
Transforms queries into Python scaffolds, reuses code, and debugs systematically.\n\n- **[VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19255)** (*2025*) `Arxiv`\n  > Introduces VTool-R1, the first framework training VLMs for multimodal thought chains. Integrates visual tools into RFT, enabling strategic tool use without process supervision.\n\n- **[Re-Invoke: Tool Invocation Rewriting for Zero-Shot Tool Retrieval](http:\u002F\u002Farxiv.org\u002Fabs\u002F2408.01875)** (*2024*) `Arxiv`\n  > Introduces Re-Invoke, an unsupervised tool retrieval method for large toolsets, with query synthesis, intent extraction, and multi-view ranking.\n\n- **[Chain of Tools: Large Language Model is an Automatic Multi-tool Learner](http:\u002F\u002Farxiv.org\u002Fabs\u002F2405.16533)** (*2024*) `Arxiv`\n  > Proposes Automatic Tool Chain (ATC) for LLMs as multi-tool users, a black-box probing method for tool learning, and builds the ToolFlow benchmark.\n\n- **[EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction](http:\u002F\u002Farxiv.org\u002Fabs\u002F2401.06201)** (*2024*) `Arxiv`\n  > Introduces EASYTOOL, a framework that transforms diverse tool docs into concise instructions for LLMs, enhancing tool-using capabilities.\n\n- **[ToolGen: Unified Tool Retrieval and Calling via Generation](http:\u002F\u002Farxiv.org\u002Fabs\u002F2410.03439)** (*2024*) `Arxiv`\n  > Introduces ToolGen, integrating tool knowledge into LLM parameters via unique tokens, turning tool retrieval into generation, enhancing LLM versatility and autonomy.\n\n- **[ToolNet: Connecting Large Language Models with Massive Tools via Tool Graph](http:\u002F\u002Farxiv.org\u002Fabs\u002F2403.00839)** (*2024*) `Arxiv`\n  > The paper proposes ToolNet, a plug-and-play framework. 
It organizes tools into a graph, enabling LLMs to handle thousands of tools more effectively.\n\n- **[ToolPlanner: A Tool Augmented LLM for Multi Granularity Instructions with Path Planning and Feedback](http:\u002F\u002Farxiv.org\u002Fabs\u002F2409.14826)** (*2024*) `Arxiv`\n  > The paper constructs MGToolBench to reflect real-world scenarios and proposes ToolPlanner with path planning and feedback for better task completion and instruction-following.\n\n- **[Making Language Models Better Tool Learners with Execution Feedback](https:\u002F\u002Faclanthology.org\u002F2024.naacl-long.195\u002F)** (*2024*) `*ACL`\n  > Proposes TRICE, a two-stage framework that allows models to learn from tool execution feedback, deciding when and how to use tools effectively.\n\n- **[Leveraging Large Language Models to Improve REST API Testing](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3639476.3639769)** (*2024*) `ICSE-NIER`\n  > The paper presents RESTGPT, leveraging LLMs to extract rules and generate values from API specs, addressing limitations of existing methods.\n\n- **[LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error](https:\u002F\u002Faclanthology.org\u002F2024.acl-long.570\u002F)** (*2024*) `*ACL`\n  > Existing LLMs have low tool-use correctness. 
The paper proposes STE, a bio-inspired method with trial, imagination, and memory, boosting tool learning.\n\n- **[Skills-in-Context: Unlocking Compositionality in Large Language Models](https:\u002F\u002Faclanthology.org\u002F2024.findings-emnlp.812\u002F)** (*2024*) `*ACL`\n  > The paper proposes Skills-in-Context (SKiC) prompting in in-context learning, unlocking LLMs' compositional ability and enabling zero-shot generalization.\n\n- **[TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs](https:\u002F\u002Fspj.science.org\u002Fdoi\u002F10.34133\u002Ficomputing.0063)** (*2024*) `Others`\n  > Paper proposes connecting foundation models with APIs to complete tasks, leveraging models' abilities like conversation and code generation for real-world use.\n\n- **[Gorilla: Large Language Model Connected with Massive APIs](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2024\u002Fhash\u002Fe4c61f578ff07830f5c37378dd3ecb0d-Abstract-Conference.html)** (*2024*) `NeurIPS`\n  > Developed Gorilla, a fine-tuned LLaMA, with RAT training. It mitigates hallucination and uses retrieval for better API call writing, shown via APIBench.\n\n- **[LARGE LANGUAGE MODELS AS TOOL MAKERS](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.17126)** (*2024*) `ICLR`\n  > The paper presents LATM, a closed-loop framework enabling LLMs to make and use their own tools, dividing labor for cost-efficiency and extending cache applicability.\n\n- **[Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents](http:\u002F\u002Farxiv.org\u002Fabs\u002F2306.03314)** (*2023*) `Arxiv`\n  > A novel multi-agent framework enhances LLMs' capabilities. 
It addresses limitations and shows potential toward AGI via diverse case studies.\n\n- **[Recommender AI Agent: Integrating Large Language Models for Interactive Recommendations](http:\u002F\u002Farxiv.org\u002Fabs\u002F2308.16505)** (*2023*) `Arxiv`\n  > Paper bridges recommender models and LLMs with \"InteRecAgent\", using LLMs as the brain and models as tools, enabling interactive recommendation.\n\n- **[ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs](http:\u002F\u002Farxiv.org\u002Fabs\u002F2307.16789)** (*2023*) `Arxiv`\n  > Introduces the ToolLLM framework with the ToolBench dataset, a novel decision-tree algorithm, and ToolEval. Fine-tunes LLaMA to obtain ToolLLaMA with good generalization.\n\n- **[TPTU-v2: Boosting Task Planning and Tool Usage of Large Language Model-based Agents in Real-world Systems](http:\u002F\u002Farxiv.org\u002Fabs\u002F2311.11315)** (*2023*) `Arxiv`\n  > The paper introduces a framework for boosting LLM-based agents' TPTU in real-world systems, with an API Retriever, LLM Finetuner, and Demo Selector.\n\n- **[TPTU: Large Language Model-based AI Agents for Task Planning and Tool Usage](http:\u002F\u002Farxiv.org\u002Fabs\u002F2308.03427)** (*2023*) `Arxiv`\n  > Proposes an LLM-based AI agent framework, designs two agent types, evaluates TPTU abilities, offering guidance for LLM use in AI.\n\n- **[GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Fhash\u002Fe393677793767624f2821cec8bdd02f1-Abstract-Conference.html)** (*2023*) `NeurIPS`\n  > Proposes GPT4Tools via self-instruction to enable open-source LLMs to use tools, provides a benchmark, and shows broad applicability.\n\n- **[API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs](https:\u002F\u002Faclanthology.org\u002F2023.emnlp-main.187\u002F)** (*2023*) 
`*ACL`\n  > Introduces API-Bank for tool-augmented LLMs. It develops an evaluation system, builds a training set, and highlights future challenges.\n\n- **[ChatCoT: Tool-Augmented Chain-of-Thought Reasoning on Chat-based Large Language Models](https:\u002F\u002Faclanthology.org\u002F2023.findings-emnlp.985\u002F)** (*2023*) `*ACL`\n  > Proposes ChatCoT, a tool-augmented CoT reasoning framework for chat-based LLMs that models CoT as multi-turn chats, unifying reasoning and tool use.\n\n- **[ToolQA: A Dataset for LLM Question Answering with External Tools](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Fhash\u002F9cb2a7495900f8b602cb10159246a016-Abstract-Datasets_and_Benchmarks.html)** (*2023*) `NeurIPS`\n  > Introduced the ToolQA dataset to evaluate LLMs' external-tool use in QA. Used scalable curation, minimized data overlap, and offered new evaluation directions.\n\n- **[On the Tool Manipulation Capability of Open-source Large Language Models](http:\u002F\u002Farxiv.org\u002Fabs\u002F2305.16504)** (*2023*) `Arxiv`\n  > The paper revisits classical LLM methods for open-source LLMs in tool manipulation, creates ToolBench, and offers a practical human-supervised recipe.\n\n- **[RestGPT: Connecting Large Language Models with Real-World RESTful APIs](http:\u002F\u002Farxiv.org\u002Fabs\u002F2306.06624)** (*2023*) `Arxiv`\n  > This paper proposes RestGPT, connecting LLMs with RESTful APIs via a planning mechanism and an API executor. 
It also offers RestBench for evaluation.\n\n- **[Toolformer: Language Models Can Teach Themselves to Use Tools](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Fhash\u002Fd842425e4bf79ba039352da0f658a906-Abstract-Conference.html)** (*2023*) `NeurIPS`\n  > The paper proposes Toolformer, enabling LMs to teach themselves to use external tools via APIs with few demonstrations, enhancing zero-shot task performance.\n\n- **[WebCPM: Interactive Web Search for Chinese Long-form Question Answering](https:\u002F\u002Faclanthology.org\u002F2023.acl-long.499\u002F)** (*2023*) `*ACL`\n  > Presents WebCPM, the first Chinese LFQA dataset with interactive web search. Records search behaviors, fine-tunes models, and makes resources public.\n\n- **[ToolCoder: Teach Code Generation Models to use API search tools](http:\u002F\u002Farxiv.org\u002Fabs\u002F2305.04032)** (*2023*) `Arxiv`\n  > Proposes ToolCoder, integrating API search tools into code generation. Uses ChatGPT for annotation and fine-tuning, innovatively incorporating tools in the process.\n\n- **[ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases](http:\u002F\u002Farxiv.org\u002Fabs\u002F2306.05301)** (*2023*) `Arxiv`\n  > The paper presents ToolAlpaca, a framework to generate a tool-use corpus and learn generalized skills on compact models, showing feasibility for such models.\n\n- **[ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Fhash\u002F8fd1a81c882cd45f64958da6284f4a3f-Abstract-Conference.html)** (*2023*) `NeurIPS`\n  > The paper proposes ToolkenGPT, using tool embeddings to let LLMs master tools like predicting tokens, addressing existing integration limitations.\n\n- **[MultiTool-CoT: GPT-3 Can Use Multiple External Tools with Chain of Thought Prompting](https:\u002F\u002Faclanthology.org\u002F2023.acl-short.130\u002F)** (*2023*) 
`*ACL`\n  > Proposes MultiTool-CoT, a framework using CoT prompting to integrate multiple external tools into reasoning for better performance on NumGLUE.\n\n- **[CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models](https:\u002F\u002Faclanthology.org\u002F2023.findings-emnlp.462\u002F)** (*2023*) `*ACL`\n  > Proposes CREATOR to enable LLMs to create tools, disentangling creation and execution. Introduces the Creation Challenge, revolutionizing the problem-solving paradigm.\n\n- **[GEAR: Augmenting Language Models with Generalizable and Efficient Tool Resolution](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.08775)** (*2023*) `Arxiv`\n  > Introduces GEAR, a generalizable and efficient query-tool grounding algorithm that delegates to an SLM\u002FLLM, improving precision at reduced cost.\n\n- **[Dify](https:\u002F\u002Fgithub.com\u002Flanggenius\u002Fdify)** (*2023*)\n  > Dify is an open-source LLM app development platform. Its interface integrates multiple features, enabling rapid prototype-to-production.\n\n- **[LangChain](https:\u002F\u002Fgithub.com\u002Flangchain-ai\u002Flangchain)** (*2023*)\n  > LangChain simplifies the LLM app lifecycle, offering development components, production tools, and a deployment platform for large-model-based agents.\n\n- **[WebGPT: Browser-assisted question-answering with human feedback](http:\u002F\u002Farxiv.org\u002Fabs\u002F2112.09332)** (*2022*) `Arxiv`\n  > Fine-tunes GPT-3 for long-form Q&A with web browsing, using imitation learning, human feedback, and reference collection, a novel approach.\n\n- **[Task Bench: A Parameterized Benchmark for Evaluating Parallel Runtime Performance](https:\u002F\u002Fwww.computer.org\u002Fcsdl\u002Fproceedings-article\u002Fsc\u002F2020\u002F999800a864\u002F1oeOToMWZBC)** (*2020*) `SC`\n  > Task Bench is a parameterized benchmark for distributed programming systems. 
It simplifies benchmarking and has a novel metric, METG, to assess systems.\n\n\n## 🤝 Contributing\n\nWe welcome contributions to expand our collection. You can:\n- Submit a pull request to add papers or resources\n- Open an issue to suggest additional papers or resources\n- Submit your paper at [our submission form](https:\u002F\u002Fforms.office.com\u002Fr\u002FsW0Zzymi5b) or email us at luo.junyu@outlook.com\n\nWe regularly update the repository to include new research.\n\n## 📝 Citation\n\nIf you find our survey helpful, please consider citing our work:\n\n```\n\n@article{agentsurvey2025,\n  title={Large Language Model Agent: A Survey on Methodology, Applications and Challenges},\n  author={Luo, J. and Zhang, W. and Yuan, Y. and others},\n  journal={arXiv preprint arXiv:2503.21460},\n  year={2025}\n}\n\n```\n\n---\n\n\u003Cp align=\"center\">\n  \u003Ci>For questions or suggestions, please open an issue or contact the repository maintainers.\u003C\u002Fi>\n\u003C\u002Fp>\n\n\n","# 🤖 A Comprehensive Collection of LLM Agent Research\n\n\u003Cdiv align=\"center\">\n\n![Awesome](https:\u002F\u002Fcdn.rawgit.com\u002Fsindresorhus\u002Fawesome\u002Fd7305f38d29fed78fa85652e3a63e154dd8e8829\u002Fmedia\u002Fbadge.svg)\n[![commit](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flast-commit\u002Fluo-junyu\u002FAwesome-Agent-Papers?color=blue)](https:\u002F\u002Fgithub.com\u002Fluo-junyu\u002FAwesome-Agent-Papers\u002Fcommits\u002Fmain)\n[![PR](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPRs-Welcome-red)](https:\u002F\u002Fgithub.com\u002Fluo-junyu\u002FAwesome-Agent-Papers\u002Fpulls)\n\n\u003C\u002Fdiv>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fluo-junyu_Awesome-Agent-Papers_readme_0dcbdccd61b6.png\" width=\"90%\" alt=\"LLM Agent Research Overview\">\n\u003C\u002Fp>\n\n## 🌟 
Overview\n\nThis repository collects a **comprehensive set of research papers** on large language model (LLM) agents. Papers are organized into key categories, including agent construction, collaboration mechanisms, evolution, tools, security, benchmarks, and applications.\n\nOur taxonomy provides a structured framework for understanding the rapidly developing field of LLM agents, covering everything from architectural foundations to practical applications. By surfacing the connections between agent design principles and emergent behaviors, the repository ties together otherwise scattered research threads.\n\n📄 **[Read our survey paper here](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.21460)**\n\n\nOur survey covers the fast-evolving field of LLM agents, where the number of research papers has grown markedly since 2023.\n\n\n## 📑 Table of Contents\n\n- [🌟 Overview](#-overview)\n- [📊 Statistics & Trends](#-statistics--trends)\n- [🔍 Key Categories](#-key-categories)\n- [📚 Resource List](#-resource-list)\n  - [Agent Collaboration](#agent-collaboration)\n  - [Agent Construction](#agent-construction)\n  - [Agent Evolution](#agent-evolution)\n  - [Applications](#applications)\n  - [Datasets & Benchmarks](#datasets--benchmarks)\n  - [Ethics](#ethics)\n  - [Security](#security)\n  - [Survey](#survey)\n  - [Tools](#tools)\n- [🤝 Contributing](#-contributing)\n\n## 🔍 Key Categories\n\n- **🏗️ Agent Construction**: Methodologies and architectures for building LLM agents\n- **👥 Agent Collaboration**: Frameworks for multi-agent interaction and cooperation\n- **🌱 Agent Evolution**: Self-improvement and learning capabilities of agents\n- **🔧 Tools**: Integration of external tools and APIs with LLM agents\n- **🛡️ Security**: Security concerns and safeguards for LLM agent systems\n- **📊 Benchmarks**: Evaluation frameworks and datasets for testing agent capabilities\n- **💡 Applications**: Real-world implementations and use cases\n\n\n## 📚 Resource List\n\n### Agent Collaboration\n\n- **[Foam-Agent: Towards Automated Intelligent CFD Workflows](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.04997)** (*2025*) `Arxiv`\n  > Presents Foam-Agent, a multi-agent framework that automates CFD workflows from natural language. Its distinctive retrieval, file-generation, and error-correction systems lower the barrier of expertise.\n\n- **[Why Do Multi-Agent LLM Systems Fail?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.13657)** (*2025*) `Arxiv`\n  > Proposes MAST, a taxonomy for analyzing MAS failures, develops an LLM-as-a-judge pipeline, and open-sources the data to guide MAS development.\n\n- **[Linear Formation Control of Multi-Agent Systems](https:\u002F\u002Fwww.sciencedirect.com\u002Fscience\u002Farticle\u002Fpii\u002FS0005109824004291)** (*2025*)\n  > Proposes a new distributed leader-follower control architecture (linear formation control) for formation changes, introducing new concepts and estimation methods.\n\n- **[MultiAgentBench: Evaluating the Collaboration and Competition of LLM Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.01935)** (*2025*) `Arxiv`\n  > Introduces MultiAgentBench for evaluating LLM-based multi-agent systems. The benchmark assesses collaboration and competition, protocols and strategies, with open-sourced code and data.\n\n- **[A Survey of AI Agent Protocols](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.16736)** (*2025*) `Arxiv`\n  > Analyzes existing LLM agent protocols, proposes a classification scheme, and discusses directions for next-generation protocols.\n\n- **[C^2: Scalable Auto-Feedback for LLM-based Chart Generation](https:\u002F\u002Faclanthology.org\u002F2025.naacl-long.232\u002F)** (*2025*) `*ACL`\n  > 
Introduces a framework named C2 comprising an automatic feedback provider and a reference-free dataset, requiring no human annotation; resources are open-sourced at chartsquared.github.io.\n\n- **[AgentRxiv: Towards Collaborative Autonomous Research](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.18102)** (*2025*) `Arxiv`\n  > Proposes AgentRxiv, a framework that lets LLM agent laboratories share research findings on a preprint server to foster collaboration, informing future human-AI co-design of research.\n\n- **[Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.05707)** (*2025*) `Arxiv`\n  > Proposes a multiagent finetuning approach for language models that optimizes models on multiagent-generated data while preserving diverse reasoning chains, yielding better self-improvement.\n\n- **[From Debate to Equilibrium: Belief-Driven Multi-Agent LLM Reasoning via Bayesian Nash Equilibrium](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08292)** (*2025*) `ICML`\n  > Proposes ECON, a hierarchical reinforcement-learning paradigm that recasts multi-LLM coordination as a BNE game, with tighter regret bounds and stronger scalability.\n\n- **[Chain of Agents: Large Language Models Collaborating on Long-Context Tasks](https:\u002F\u002Fresearch.google.blog\u002Fchain-of-agents-large-language-models-collaborating-on-long-context-tasks\u002F)** (*2025*) `Blog`\n  > Proposes Chain-of-Agents, a training-free, task-agnostic framework that uses LLM collaboration to handle long-context tasks, outperforming RAG and long-context LLMs.\n\n- **[CS-Agent: LLM-based Community Search via Dual-Agent Collaboration](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.09549)** (*2025*) `Arxiv`\n  > Proposes CS-Agent, a dual-agent collaboration scheme (solver and validator) with a decider for LLM-driven community search, addressing existing limitations without fine-tuning.\n\n- **[MUA-RL: Multi-Turn User-Interacting Agent Reinforcement Learning for Agentic Tool Use](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.18669)** (*2025*) `Arxiv`\n  > MUA-RL integrates LLM-simulated users into the reinforcement learning loop for dynamic multi-turn user-interaction learning, better supporting agentic tool use.\n\n- **[CoMet: Metaphor-Driven Covert Communication for Multi-Agent Language Games](https:\u002F\u002Faclanthology.org\u002F2025.acl-long.389\u002F)** (*2025*) `*ACL`\n  > CoMet equips LLM-driven agents to handle metaphors, combining a hypothesis-based reasoner with a self-reflective generator. This novel approach strengthens strategic covert communication in multi-agent language games through nuanced metaphor interpretation and use.\n\n- **[Thought Communication in Multiagent Collaboration](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.20733)** (*2025*) `Arxiv`\n  > Proposes a new thought-communication paradigm in which agents share latent thoughts directly, going beyond the limits of natural language, together with a theoretical framework for identifying and structuring these thoughts to further improve multi-agent collaboration.\n\n- **[Cache-to-Cache: Direct Semantic Communication Between Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.03215)** (*2025*) `Arxiv`\n  > Proposes Cache-to-Cache (C2C), which uses LLMs' internal KV caches for direct semantic communication between models, bypassing inefficient text generation to support richer, lower-latency cross-model collaboration.\n\n- **[Adaptive Collaboration Strategy for LLMs in Medical Decision Making](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.15155)** (*2024*) `NeurIPS`\n  > Proposes Medical Decision-making Agents (MDAgents), a framework that dynamically assigns collaboration structures to LLMs, adapting to task complexity and exploring group consensus.\n\n- 
**[ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.13007)** (*2024*) `*ACL`\n  > Proposes ReConcile, a round-table-style multi-model multi-agent system that enhances collaborative LLM reasoning through discussion and voting.\n\n- **[MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.00352)** (*2024*) `ICLR`\n  > Introduces the MetaGPT framework, which encodes human workflows into LLM-based multi-agent systems, improving task decomposition and reducing errors.\n\n- **[Debating with More Persuasive LLMs Leads to More Truthful Answers](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.06782)** (*2024*) `ICML`\n  > Examines whether weaker models can evaluate stronger ones through debate. Shows that debate helps non-experts and that optimizing debaters aids truth-seeking in the absence of ground truth.\n\n- **[RoCo: Dialectic Multi-Robot Collaboration with Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.04738)** (*2024*) `Arxiv`\n  > Proposes using pre-trained LLMs for high-level multi-robot communication and low-level path planning, with in-context learning, and introduces the RoCoBench benchmark.\n\n- **[AutoAct: Automatic Agent Learning from Scratch for QA via Self-Planning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.05268)** (*2024*) `*ACL`\n  > AutoAct is an automatic agent-learning framework for QA that synthesizes action trajectories without external assistance and completes tasks through a division of labor.\n\n- **[Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.12954)** (*2024*) `Arxiv`\n  > Proposes meta-prompting, a task-agnostic scaffolding technique that turns a single language model into a multi-role system with external tool integration to improve task performance.\n\n- **[Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.19118)** (*2024*) `*ACL`\n  > Proposes a multi-agent debate framework to address the degeneration-of-thought (DoT) problem in LLMs, encouraging divergent thinking on complex tasks.\n\n- **[AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors](https:\u002F\u002Fopenreview.net\u002Fforum?id=EHg5GDnyq1)** (*2024*) `ICLR`\n  > Proposes AgentVerse, a framework inspired by human group dynamics that facilitates collaboration among agents and reveals their emergent behaviors.\n\n- **[ChatDev: Communicative Agents for Software Development](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.07924)** (*2024*) `*ACL`\n  > Introduces ChatDev, a chat-powered collaborative framework in which specialized LLM agents communicate in unified language, bridging development phases for autonomous task solving.\n\n- **[ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate](https:\u002F\u002Fopenreview.net\u002Fforum?id=FQepisCUWu)** (*2024*) `ICLR`\n  > Proposes ChatEval, a multi-agent referee team that uses multi-agent debate for text evaluation, offering a human-mimicking review process.\n\n- **[A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration](https:\u002F\u002Fopenreview.net\u002Fforum?id=XII0Wp1XA9#discussion)** (*2024*) `COLM`\n  > Proposes the DyLAN framework for LLM-powered agent collaboration, which dynamically selects agents and adopts a two-stage paradigm for problem solving.\n\n- 
**[AgentCoord: Visually Exploring Coordination Strategy for LLM-based Multi-Agent Collaboration](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.11943)** (*2024*) `Arxiv`\n  > Proposes a visual exploration framework for designing coordination strategies in multi-agent collaboration, converting goals into strategies and allowing user intervention.\n\n- **[TradingAgents: Multi-Agents LLM Financial Trading Framework](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.20138)** (*2024*) `Arxiv`\n  > Proposes a novel stock-trading framework in which LLM-powered agents take on specialized roles, mirroring real-world collaboration to improve trading performance.\n\n- **[AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.08155)** (*2023*) `COLM`\n  > AutoGen is an open-source framework for building LLM applications with multi-agent conversations, offering customizable agents and flexible interaction definitions.\n\n- **[Improving Factuality and Reasoning in Language Models through Multiagent Debate](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14325)** (*2023*) `ICML`\n  > Proposes a multiagent debate approach that improves LLM reasoning and factual accuracy, applies to black-box models, and uses a unified procedure.\n\n- **[Autonomous chemical research with large language models](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41586-023-06792-0)** (*2023*) `Nature`\n  > Introduces Coscientist, a GPT-4-driven system that integrates multiple tools, showing great potential in scientific research and demonstrating AI's versatility and efficiency.\n\n\n\n### Agent Construction\n\n- **[Planning with Multi-Constraints via Collaborative Language Agents](https:\u002F\u002Faclanthology.org\u002F2025.coling-main.672\u002F)** (*2025*) `*ACL`\n  > Proposes PMC, a zero-shot technique for LLM-driven multi-agent system construction that simplifies complex, multi-constraint task planning through task decomposition.\n\n- **[Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2024\u002Fhash\u002Fb631da756d1573c24c9ba9c702fde5a9-Abstract-Datasets_and_Benchmarks_Track.html)** (*2025*) `NeurIPS`\n  > Proposes an embodied agent interface that unifies tasks, modules, and metrics for comprehensive evaluation of LLMs in embodied decision making.\n\n- **[SPeCtrum: A Grounded Framework for Multidimensional Identity Representation in LLM-Based Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.08599)** (*2025*) `Arxiv`\n  > Introduces SPeCtrum, which integrates S, P, and C dimensions to construct LLM agent personas, enhancing identity realism for more personalized AI interactions.\n\n- **[Adaptive Thinking via Mode Policy Optimization for Social Language Agents](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.02156)** (*2025*)\n  > Proposes the Adaptive Mode Learning (AML) framework and the AMPO algorithm, offering multi-granularity mode switching, context awareness, and efficient token usage.\n\n- **[On the Architecture of LLM Agents](http:\u002F\u002Fwww.injoit.ru\u002Findex.php\u002Fj1\u002Farticle\u002Fview\u002F2057)** (*2025*) `Arxiv`\n  > Discusses architectures for LLM agents. Agents are a key direction in AI, resembling both hybrids and robots, and a suitable framework can simplify their creation.\n\n- **[Unified Mind Model: Reimagining Autonomous Agents in the LLM Era](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.03459)** 
(*2025*) `Arxiv`\n  > Proposes the Unified Mind Model (UMM) for human-level agents and develops MindOS, which creates task-specific agents without programming.\n\n- **[ATLaS: Agent Tuning via Learning Critical Steps](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.02197)** (*2025*) `Arxiv`\n  > Proposes ATLaS, which identifies critical steps in expert trajectories to tune LLM agents, reducing cost and improving generalization.\n\n- **[Cognitive AI Memory: A Framework for More Human-Like Memory in LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.13044)** (*2025*) `Arxiv`\n  > Inspired by cognitive AI, proposes the CAIM framework for LLMs, with three modules that enhance long-term human-machine interaction through holistic memory modeling.\n\n- **[Adaptive Graph Pruning: A Task-Adaptive Multi-Agent Collaboration Framework](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.02951)** (*2025*) `Arxiv`\n  > Proposes the Adaptive Graph Pruning (AGP) framework, which uses a two-stage strategy to jointly optimize agent count and communication topology for task-adaptive multi-agent collaboration.\n\n- **[Agents of Change: Self-Evolving LLM Agents for Strategic Planning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04651)** (*2025*) `Arxiv`\n  > Places LLM agents in strategically challenging environments, using the game of Catan as a benchmark, and proposes a self-improving multi-agent architecture.\n\n- **[Reinforcing LLM Reasoning through Multi-Agent Reflection](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08379)** (*2025*) `ICML`\n  > Models multi-turn refinement as a Markov decision process and introduces the DPSDP algorithm for iterative answer refinement, demonstrating theoretical and empirical advantages.\n\n- **[Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.19828)** (*2025*) `Arxiv`\n  > Proposes the Memory-R1 framework, in which two agents let LLMs actively manage and utilize external memory, offering new insights for RL-driven behavior.\n\n- **[BudgetThinker: Empowering Budget-Aware LLM Reasoning with Control Tokens](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.17196)** (*2025*) `Arxiv`\n  > Introduces the BudgetThinker framework for budget-aware LLM reasoning, inserting control tokens and using a two-stage training pipeline for efficient, controllable reasoning.\n\n- **[A-MEM: Agentic Memory for LLM Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12110)** (*2025*) `Arxiv`\n  > Proposes an agentic memory system for LLMs that organizes memories Zettelkasten-style, supporting dynamic updates and more flexible memory management.\n\n- **[MemoCue: Empowering LLM-Based Agents for Human Memory Recall via Strategy-Guided Querying](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.23633)** (*2025*) `Arxiv`\n  > Proposes the strategy-guided MemoCue agent with a recall-router framework, using a 5W recall map and hierarchical recall trees to improve memory recall through cue-rich queries.\n\n- **[Analysis of Information Sharing and Coordination in Multi-Agent Planning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.12981)** (*2025*) `Arxiv`\n  > Builds an LLM-based multi-agent system for travel planning, introducing a structured information-sharing notebook and a reflective coordinator to improve long-horizon planning.\n\n- **[AutoAgents: A Framework for Automatic Agent Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.17288)** (*2024*) `IJCAI`\n  > Introduces AutoAgents, which automatically generates and coordinates specialized agents per task, with a built-in observer, offering a fresh perspective on complex task solving.\n\n- 
**[MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.00352)** (*2024*) `ICLR`\n  > MetaGPT is a meta-programming framework that integrates human workflows into LLM-based multi-agent collaboration, streamlining processes and reducing errors.\n\n- **[Cognitive Architectures for Language Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.02427)** (*2024*) `TMLR`\n  > Proposes the CoALA framework, giving language agents modular memory, action spaces, and decision-making to organize existing work and guide future development.\n\n- **[Executable Code Actions Elicit Better LLM Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.01030)** (*2024*) `ICML`\n  > Proposes CodeAct, which uses executable Python code to define LLM agents' action space, and builds an open-source agent from a curated tuning dataset.\n\n- **[ChatDev: Communicative Agents for Software Development](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.07924)** (*2024*) `ACL`\n  > Introduces ChatDev, in which agents collaborate via language across software design, coding, and testing, seamlessly linking the phases.\n\n- **[Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2024\u002Fpapers\u002FWei_Editable_Scene_Simulation_for_Autonomous_Driving_via_Collaborative_LLM-Agents_CVPR_2024_paper.pdf)** (*2024*) `CVPR\u002FICCV\u002FECCV`\n  > Proposes ChatSim, enabling editable simulation of 3D driving scenes via natural language, combining LLM agents with a novel neural radiance field and lighting estimation.\n\n- **[A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.02170)** (*2024*) `COLM`\n  > Proposes DyLAN, a framework for LLM-powered agent collaboration that adopts a two-stage paradigm with dynamic agent selection and task-specific communication.\n\n- **[More Agents Is All You Need](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.05120)** (*2024*) `TMLR`\n  > Proposes the Agent Forest sampling-and-voting method, orthogonal to existing approaches, which improves LLM performance according to task difficulty.\n\n- **[Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.02957)** (*2024*) `Arxiv`\n  > Presents Agent Hospital, an LLM-powered hospital simulacrum in which doctor agents evolve without manual annotation; the approach generalizes to broader domains.\n\n- **[Empowering biomedical discovery with AI agents](https:\u002F\u002Fwww.cell.com\u002Fcell\u002Ffulltext\u002FS0092-8674(24)01070-5?&target=_blank)** (*2024*) `Others`\n  > Proposes the concept of “AI scientists”: collaborative agents that integrate AI with biological tools, combining human and AI capabilities with far-reaching impact across biology.\n\n- **[SMART-LLM: Smart Multi-Agent Robot Task Planning using Large Language Models](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F10802322)** (*2024*) `IROS`\n  > Proposes the SMART-LLM framework for multi-robot task planning, creates a benchmark dataset, and provides resources at https:\u002F\u002Fsites.google.com\u002Fview\u002Fsmart-llm\u002F.\n\n- 
**[Perceive, Reflect, and Plan: Designing LLM Agents for Goal-Directed City Navigation without Instructions](http:\u002F\u002Farxiv.org\u002Fabs\u002F2408.04168)** (*2024*) `Arxiv`\n  > Proposes a novel LLM agent workflow of perception, reflection, and planning for goal-directed city navigation, avoiding the pitfalls of baseline methods.\n\n- **[Enhancing the General Agent Capabilities of Low-Parameter LLMs through Fine-Tuning and Multi-Branch Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.19962)** (*2024*) `Arxiv`\n  > Proposes using GPT-4 to construct agent-specific data and fine-tune low-parameter LLMs; multi-path reasoning and task decomposition further improve agent performance.\n\n- **[PlanCritic: Formal Planning with Human Feedback](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.00300)** (*2024*) `Arxiv`\n  > Proposes a feedback-based plan critic that optimizes plans with reinforcement learning from human feedback and a genetic algorithm, filling a gap in planning research.\n\n- **[Enhancing Robot Task Planning: Integrating Environmental Information and Feedback Insights through Large Language Models](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F10661782)** (*2024*) `CCC`\n  > Proposes the EnviroFeedback Planner, which folds environmental information into prompt construction and feedback to improve agents' task-planning execution.\n\n- **[Devil's Advocate: Anticipatory Reflection for LLM Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.16334)** (*2024*) `Arxiv`\n  > Endows LLM agents with self-reflection via task decomposition, continuous self-assessment, and a three-fold intervention strategy, markedly improving consistency and adaptability.\n\n- **[Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.17167)** (*2024*) `*ACL`\n  > Proposes UltraTool, a benchmark for evaluating LLMs' real-world tool use. It covers the full workflow, evaluates planning independently, and removes the restriction of pre-defined toolsets.\n\n- **[On the Structural Memory of LLM Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15266)** (*2024*) `Arxiv`\n  > Examines how memory structures and retrieval methods affect LLM-based agents, finding task-specific advantages and the superiority of iterative retrieval.\n\n- **[CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.17760)** (*2023*) `NeurIPS`\n  > Proposes a role-playing communicative-agent framework, offering a scalable approach to multi-agent research, with an open-sourced library.\n\n- **[AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.08155)** (*2023*) `Arxiv`\n  > AutoGen is an open-source framework for LLM applications supporting customizable multi-agent conversations, flexible programming, and diverse application building.\n\n- **[AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.13010)** (*2023*) `Arxiv`\n  > Introduces AgentCoder, a multi-agent framework for code generation that resolves the balancing problem and outperforms existing methods.\n\n- **[War and Peace (WarAgent): Large Language Model-based Multi-Agent Simulation of World Wars](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.17227)** (*2023*) `Arxiv`\n  > Proposes WarAgent, an LLM-powered multi-agent system that simulates historical conflicts, offering new insights for conflict resolution and peacekeeping.\n\n- 
**[Describe, Explain, Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agents](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Fhash\u002F6b8dfb8c0c12e6fafc6c256cb08a5ca7-Abstract-Conference.html)** (*2023*) `NeurIPS`\n  > Studies the challenges facing multi-task agents using Minecraft planning as a testbed and proposes a method that addresses inefficient planning.\n\n- **[TPTU: Task Planning and Tool Usage of Large Language Model-based AI Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.03427)** (*2023*) `Arxiv`\n  > Proposes an LLM-based agent framework with two agent types, evaluating TPTU abilities to guide the use of LLMs in AI applications.\n\n\n\n### Agent Evolution\n\n- **[Evolutionary optimization of model merging recipes](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs42256-024-00975-8)** (*2025*) `NMI`\n  > Proposes an evolutionary model-merging method operating in two spaces, enabling cross-domain merging and introducing a new model-composition paradigm.\n\n- **[CREAM: Consistency Regularized Self-Rewarding Language Models](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Vf6RDObyEF)** (*2025*) `ICLR`\n  > Proposes a self-rewarding LLM framework with regularization, introducing CREAM, which exploits reward consistency to obtain more reliable data.\n\n- **[KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.03101)** (*2025*) `NAACL`\n  > Introduces KnowAgent, which strengthens LLM planning with an action knowledge base and self-learning while mitigating hallucination.\n\n- **[STeCa: Step-Level Trajectory Calibration for LLM Agent Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14276)** (*2025*) `*ACL`\n  > Proposes the STeCa framework, which constructs calibrated trajectories through step-level reward comparison and LLM reflection.\n\n- **[SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.15478)** (*2025*) `Arxiv`\n  > Introduces the ColBench benchmark and proposes SWEET-RL, which uses training-time information to build a critic that provides step-level rewards to optimize LLM agents.\n\n- **[DualRAG: A Dual-Process Approach to Integrate Reasoning and Retrieval for Multi-Hop Question Answering](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.18243)** (*2025*) `Arxiv`\n  > Proposes DualRAG, a dual-process framework coupling reasoning and retrieval for MHQA; the coupled processes form a loop and perform well across scales.\n\n- **[Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.12800)** (*2025*) `Arxiv`\n  > Proposes the atomic-thought paradigm and the Atom-Searcher RL framework, integrating thought units with reward mechanisms for distinctive supervision and reasoning that better support agentic deep research.\n\n- **[PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.21104)** (*2025*) `Arxiv`\n  > Proposes PVPO, a reinforcement-learning method with an advantage reference anchor and pre-sampling that corrects bias, reduces reliance on rollouts, and selects high-gain data.\n\n- **[SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.02085)** (*2025*) `Arxiv`\n  > SE-Agent optimizes multi-step reasoning through self-evolution, recombination, and refinement, enlarging the search space and drawing on cross-trajectory inspiration for more efficient reasoning.\n\n- 
**[LLM Collaboration with Multi-Agent Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.04652)** (*2025*) `Arxiv`\n  > Models LLM collaboration as cooperative MARL and develops the MAGRPO algorithm, achieving effective cooperation without complex individual rewards.\n\n- **[VLM Can Be a Good Assistant: Enhancing Embodied Visual Tracking with Self-Improving Vision-Language Models]** (*2025*) `Arxiv`\n  > Proposes a self-improving framework that integrates VLMs to boost embodied visual tracking, with a novel memory-augmented self-reflection mechanism that lets the VLM learn from failures and assist active recovery.\n\n- **[EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle]** (*2025*) `Arxiv`\n  > Introduces a self-evolution framework for LLM agents that distills past experience into abstract principles through a closed-loop lifecycle, guiding future decisions and iteratively refining the policy.\n\n- **[Test-Time Self-Improving LLM Agents]** (*2025*) `Arxiv`\n  > Proposes a test-time self-improvement method in which agents identify uncertain predictions, generate similar training samples, and fine-tune on them for efficient, effective self-evolution.\n\n- **[CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards]** (*2025*) `Arxiv`\n  > The framework generates intrinsic rewards from inter-agent discussion and optimizes them with reinforcement learning, enabling autonomous agent co-evolution without external supervision.\n\n- **[Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation]** (*2024*) `Arxiv`\n  > A benchmark self-evolving multi-agent framework extends benchmarks with six operations for fine-grained LLM evaluation, aiding model selection.\n\n- **[Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization]** (*2024*) `ACL`\n  > Proposes Agent-Pro, an LLM-based agent that evolves better policies via policy-level reflection and optimization, using a dynamic belief process and depth-first search.\n\n- **[Coevolving with the Other You: Fine-Tuning LLM with Sequential Cooperative Multi-Agent Reinforcement Learning]** (*2024*) `NeurIPS`\n  > Proposes CORY, extending LLM fine-tuning to a multi-agent setting in which agents coevolve, potentially outperforming PPO for optimization in practical applications.\n\n- **[A Survey on Self-Evolution of Large Language Models]** (*2024*) `Arxiv`\n  > Introduces a four-stage LLM self-evolution framework, categorizes objectives, summarizes the literature, and identifies open challenges and future directions.\n\n- **[LLM-Evolve: Evaluating LLMs' Evolving Capability on Benchmarks]** (*2024*) `EMNLP`\n  > Proposes the LLM-Evolve framework, which extends benchmarks to sequential settings so LLMs can learn from past experience.\n\n- **[CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing]** (*2024*) `ICLR`\n  > Introduces the CRITIC framework, enabling LLMs to self-correct through tool interaction and highlighting the role of external feedback in LLM self-improvement.\n\n- **[Iterative Translation Refinement with Large Language Models]** (*2024*) `EAMT`\n  > Proposes iterative prompting to guide LLMs through self-correcting translation, emphasizing anchoring in the source text and showing marked gains in human-perceived quality.\n\n- **[Agent Alignment in Evolving Social Norms]** (*2024*) `Arxiv`\n  > Proposes EvolutionaryAgent, an evolutionary framework for agent alignment that reframes alignment as evolution and selection, applicable to a range of LLMs.\n\n- **[Mitigating the Alignment Tax of RLHF]** (*2024*) `EMNLP`\n  > Reveals the alignment tax in RLHF and proposes HMA via model averaging to balance alignment against forgetting, maximizing performance at minimal cost.\n\n- **[Self-Rewarding Language Models]** (*2024*) `Arxiv`\n  > Studies LLM-as-a-judge self-rewarding during training, opening a new door to continual improvement.\n\n- **[V-STaR: Training Verifiers for Self-Taught Reasoners]** (*2024*) `COLM`\n  > Proposes V-STaR, which trains verifiers on both correct and incorrect self-generated solutions, improving solution selection and reasoning.\n\n- **[RLCD: Reinforcement Learning from Contrastive Distillation for LM Alignment]** (*2024*) `ICLR`\n  > Proposes RLCD, which aligns LLMs without human feedback by generating preference pairs from contrasting prompts and training a preference model.\n\n- **[Language Model Self-Improvement by Reinforcement Learning Contemplation]** 
(*2024*) `ICLR`\n  > Proposes the RLC method, which exploits the gap between evaluation and generation to improve models without supervision, with broad application prospects.\n\n- **[ProAgent: Building Proactive Cooperative Agents with Large Language Models]** (*2024*) `AAAI`\n  > Proposes the ProAgent framework for building proactive LLM-based agents that adapt behavior, analyze state, and infer intent, with a modular design that handles zero-shot settings.\n\n- **[Agent Planning with World Knowledge Model]** (*2024*) `NeurIPS`\n  > Proposes a parametric World Knowledge Model (WKM) for agent planning that integrates knowledge to guide global and local planning, showing distinctive potential.\n\n- **[Optimizing Guideline Knowledge for Agent Planning with Textgrad]** (*2024*) `ICKG`\n  > Introduces Textgrad for optimizing guideline knowledge in embodied agent tasks, automating optimization through textual gradients and failure-trajectory analysis.\n\n- **[Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate]** (*2024*) `Arxiv`\n  > Proposes a multi-agent debate framework to counter degeneration of thought in LLMs and encourage divergent thinking.\n\n- **[LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error]** (*2024*) `ACL`\n  > Existing LLMs use tools with low accuracy; this paper proposes a novel, biologically inspired simulated trial-and-error method to improve tool learning.\n\n- **[Richelieu: Self-Evolving LLM-Based Agents for AI Diplomacy](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.06813)** (*2024*) `NeurIPS`\n  > Proposes a self-evolving LLM agent for the game of Diplomacy, integrating strategic planning, goal-oriented negotiation, and a novel self-play mechanism for autonomous evolution without human intervention.\n\n- **[Simulating Human-like Daily Activities with Desire-driven Autonomy](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.06435)** (*2024*) `Arxiv`\n  > Introduces a desire-driven autonomous agent (D2A) that uses a dynamic value system so LLMs can propose and select tasks autonomously, motivated by intrinsic human-like needs rather than explicit instructions.\n\n- **[AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Ffile\u002F5fc47800ee5b30b8777fdd30abcaaf3b-Paper-Conference.pdf)** (*2023*) `NeurIPS`\n  > AlpacaFarm addresses challenges in LLM development by simulating feedback at low cost and providing evaluation and method implementations to validate the end-to-end pipeline.\n\n- **[SELF-REFINE: Iterative Refinement with Self-Feedback](https:\u002F\u002Fopenreview.net\u002Fpdf?id=S37hOerQLB)** (*2023*) `NeurIPS`\n  > Proposes SELF-REFINE, a technique for iteratively refining LLM outputs without extra training, demonstrating test-time performance gains.\n\n- **[Self-Evolution Learning for Discriminative Language Model Pretraining](https:\u002F\u002Faclanthology.org\u002F2023.findings-acl.254.pdf)** (*2023*) `EMNLP`\n  > Proposes self-evolution (SE) learning for token masking and learning, exploiting data knowledge with a novel smoothing technique to improve language learning.\n\n- **[Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.08182)** (*2023*) `Arxiv`\n  > Introduces DiverseEvol, a self-evolving mechanism for label-efficient instruction tuning that enhances data diversity without human or LLM intervention.\n\n- **[SELFEVOLVE: A Code Evolution Framework via Large Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.02907)** (*2023*) `Arxiv`\n  > Proposes the SELFEVOLVE framework, a two-step pipeline that uses the LLM as both knowledge provider and self-reflective programmer, requiring no special test cases.\n\n- 
**[SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions](https:\u002F\u002Faclanthology.org\u002F2023.acl-long.754.pdf)** (*2023*) `ACL`\n  > Introduces the SELF-INSTRUCT framework, which improves language models' instruction following almost annotation-free via automatically generated samples.\n\n- **[Large Language Models are Better Reasoners with Self-Verification](https:\u002F\u002Faclanthology.org\u002F2023.findings-emnlp.167.pdf)** (*2023*) `EMNLP`\n  > Argues that LLMs have self-verification abilities: backward verification, conditioning on chain-of-thought conclusions, improves reasoning performance.\n\n- **[CODET: Code Generation with Generated Tests](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ktrw68Cmu9c)** (*2023*) `ICLR`\n  > Proposes CODET, which uses pre-trained language models to automatically generate test cases for code samples, helping select better solutions.\n\n- **[Diverse Red-Teaming Language Model Evolution in Multi-Round Multi-Agent Games](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.00322)** (*2023*) `Arxiv`\n  > Introduces dynamic red-team games to analyze multi-round interaction and develops GRTS to mitigate mode collapse, laying a foundation for LLM safety.\n\n- **[Improving Factuality and Reasoning in Language Models through Multiagent Debate](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14325)** (*2023*) `Arxiv`\n  > Proposes a multiagent debate approach for LLMs that strengthens reasoning and factual accuracy, works with black-box models, and promises to advance LLMs.\n\n- **[CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.17760)** (*2023*) `NeurIPS`\n  > Proposes a role-playing framework for autonomous agent cooperation, offering a scalable research approach with an open-sourced library.\n\n- **[STaR: Bootstrapping Reasoning With Reasoning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=_3ELRdg2sgI)** (*2022*) `NeurIPS`\n  > Proposes STaR, which bootstraps complex reasoning from a few rationale examples and rationale-free data, letting the model learn from its own generated rationales.\n\n\n\n### Applications\n\n- **[An active inference strategy for prompting reliable responses from large language models in medical practice](https:\u002F\u002Fdoi.org\u002F10.1038\u002Fs41746-025-01516-2)** (*2025*) `npj Digital Medicine`\n  > Proposes a domain-specific dataset and an active-inference-based prompting protocol to address LLM shortcomings, enabling safe use in healthcare.\n\n- **[A framework for the clinical evaluation of large language models in patient interaction tasks](https:\u002F\u002Fdoi.org\u002F10.1038\u002Fs41591-024-03328-5)** (*2025*) `Nature Medicine`\n  > Introduces CRAFT-MD, which evaluates LLMs clinically via natural dialogues, and offers recommendations for future LLM evaluation to improve medical practice.\n\n- **[Large language models lack the metacognition required for reliable medical reasoning](https:\u002F\u002Fdoi.org\u002F10.1038\u002Fs41467-024-55628-6)** (*2025*) `Nature Communications`\n  > Develops MetaMedQA to assess models' metacognition in medical reasoning, revealing deficiencies and underscoring the need for metacognition-grounded frameworks.\n\n- **[Balancing autonomy and expertise in autonomous synthesis laboratories](https:\u002F\u002Fdoi.org\u002F10.1038\u002Fs43588-025-00769-x)** (*2025*) `Nature Computational Science`\n  > Comments on obstacles facing autonomous synthesis labs, proposing human-in-the-loop approaches and strategies to optimize lab operation.\n\n- 
**[SimUSER: Simulating User Behavior with Large Language Models for Recommender System Evaluation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.12722)** (*2025*) `Arxiv`\n  > Introduces SimUSER, which uses personas to simulate user behavior at low cost for recommender-system evaluation, optimizing parameters to raise engagement in practice.\n\n- **[Autonomy in Swarms: From Agent Functionalization to Machine Intelligence](https:\u002F\u002Fadvanced.onlinelibrary.wiley.com\u002Fdoi\u002Ffull\u002F10.1002\u002Fadma.202312956)** (*2025*) `Advanced Materials`\n  > Reviews synthetic-swarm research from agent fundamentals to applications, discussing the role of emergent machine intelligence in designing real-world autonomous swarms.\n\n- **[ShowUI: One Vision-Language-Action Model for GUI Visual Agent](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2025\u002Fhtml\u002FLin_ShowUI_One_Vision-Language-Action_Model_for_GUI_Visual_Agent_CVPR_2025_paper.html)** (*2025*) `CVPR\u002FICCV\u002FECCV`\n  > Introduces a vision-language-action model for GUI visual agents that advances GUI assistance through UI-guided token selection, interleaved streaming, and curated datasets.\n\n- **[Agent Laboratory: Using LLM Agents as Research Assistants](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.04227)** (*2025*) `Arxiv`\n  > Introduces Agent Laboratory, an LLM-based full-cycle research framework that cuts costs while improving quality through user feedback, accelerating scientific discovery.\n\n- **[Toward Scientific Intelligence: A Survey of LLM-based Scientific Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.24047)** (*2025*) `Arxiv`\n  > Reviews LLM-based scientific agents, highlighting how they differ from general-purpose agents and proposing a roadmap for scientific discovery.\n\n- **[CitySim: Modeling Urban Behaviors and City Dynamics with Large-Scale LLM-Driven Agent Simulation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.21805)** (*2025*) `Arxiv`\n  > Proposes CitySim, an LLM-based urban simulator that uses a recursive, value-driven approach to endow agents with key traits, enabling scalable urban research.\n\n- **[A Survey of AI for Materials Science: Foundation Models, LLM Agents, Datasets, and Tools](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.20743)** (*2025*) `Arxiv`\n  > Surveys foundation models in materials science, proposing a taxonomy, discussing recent advances, reviewing resources, weighing strengths and weaknesses, and outlining future directions.\n\n- **[An Auditable Agent Platform for Automated Molecular Optimisation](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2508.03444)** (*2025*) `Arxiv`\n  > This hierarchical agent framework automates molecular optimization, producing auditable reasoning traces and turning LLMs into auditable design systems.\n\n- **[PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.21720)** (*2025*) `Arxiv`\n  > Proposes the training-free PosterForest framework, which uses a poster tree and multi-agent collaboration for poster generation, addressing structural and integration challenges.\n\n- **[Automated Detection of Clinical Problems from SOAP Notes Using a Collaborative Multi-Agent LLM Architecture](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.21803)** (*2025*) `Arxiv`\n  > A collaborative multi-agent system (MAS) that mimics a clinical team analyzes the subjective and objective sections of SOAP notes, charting a path toward better clinical decision-support tools.\n\n- 
**[Think in Games: Learning to Reason in Games via Reinforcement Learning with Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.21365)** (*2025*) `Arxiv`\n  > Proposes the Think-in-Games (TiG) framework, which lets LLMs acquire procedural knowledge by interacting with game environments, bridging the knowledge gap and improving transparency.\n\n- **[AlphaEvolve: A Coding Agent for Scientific and Algorithmic Discovery](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.13131)** (*2025*) `Arxiv`\n  > Introduces AlphaEvolve, an evolutionary coding agent that autonomously improves algorithms, discovers new ones, and broadens the scope of automated discovery.\n\n- **[Agent Laboratory: Using LLM Agents as Research Assistants](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.04227)** (*2025*) `Arxiv`\n  > Introduces Agent Laboratory, an LLM-based full-cycle research framework that lowers costs, benefits from human feedback, and frees researchers to focus on ideation.\n\n- **[CitySim: Modeling Urban Behaviors and City Dynamics with Large-Scale LLM-Driven Agent Simulation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.21805)** (*2025*) `Arxiv`\n  > Proposes CitySim, which uses LLMs to simulate urban behavior with agents that hold beliefs, goals, and memories: a scalable testbed for urban phenomena.\n\n- **[aiXiv: A Next-Generation Open Access Ecosystem for Scientific Discovery Generated by AI Scientists](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.15126)** (*2025*) `Arxiv`\n  > Introduces aiXiv, a multi-agent open-access platform where AI-generated research can be submitted, reviewed, and iteratively refined through seamless AI-human collaboration, addressing the lack of suitable publication venues.\n\n- **[GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.21035)** (*2025*) `Arxiv`\n  > Introduces a multi-agent system that combines structured workflows with autonomous planning for gene-expression analysis. Its core innovation is guided planning: agents dynamically adjust a shared analysis plan, keeping discovery reliable and flexible.\n\n- **[Motif: Intrinsic Motivation from Artificial Intelligence Feedback](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.00166)** (*2024*) `ICLR`\n  > Proposes Motif, which interfaces an LLM's prior knowledge with agents through intrinsic rewards, producing intuitive behaviors and progress on hard tasks.\n\n- **[Baba Is AI: Break the Rules to Beat the Benchmark](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.13729)** (*2024*) `ICML`\n  > Presents a game-based benchmark in which agents must exploit rule-breaking strategies, a novel approach to evaluating agents in games.\n\n- **[Large Language Model-Empowered Agents for Simulating Macroeconomic Activities](https:\u002F\u002Faclanthology.org\u002F2024.acl-long.829\u002F)** (*2024*) `*ACL`\n  > Uses LLM-empowered agents to simulate macroeconomic activities, offering a fresh approach to economic applications.\n\n- **[CompeteAI: Understanding the Competition Dynamics in Large Language Model-based Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.17512)** (*2024*) `ICML`\n  > Focuses on competition dynamics among LLM-based agents in economic applications, offering new insights for the field.\n\n- **[Understanding the Benefits and Challenges of Using Large Language Model-based Conversational Agents for Mental Well-being Support](https:\u002F\u002Fpmc.ncbi.nlm.nih.gov\u002Farticles\u002FPMC10785945\u002F)** (*2024*) `AMIA`\n  > Examines the benefits and challenges of LLM-based conversational agents for mental health support in psychological applications.\n\n- 
**[Exploring Collaboration Mechanisms for LLM Agents](https:\u002F\u002Faclanthology.org\u002F2024.acl-long.782\u002F)** (*2024*) `*ACL`\n  > Explores collaboration mechanisms of LLM agents in psychological applications, offering new ideas for large-model-based agents.\n\n- **[Simulating Human Society with Large Language Model Agents: City, Social Media, and Economic System](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3589335.3641253)** (*2024*) `WWW`\n  > Applies LLM agents to simulating human society across cities, social media, and economic systems, a new contribution to the field.\n\n- **[Can Large Language Models Transform Computational Social Science?](https:\u002F\u002Faclanthology.org\u002F2024.cl-1.8\u002F)** (*2024*) `*ACL`\n  > Explores whether LLMs can transform computational social science in societal applications, offering new insights.\n\n- **[AgentCF: Collaborative Learning with Autonomous Language Agents for Recommender Systems](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.09233)** (*2024*) `SIGIR`\n  > Proposes AgentCF to simulate user-item interactions, treating users and items as agents and modeling their two-sided relations with collaborative learning, inspiring behavior simulation.\n\n- **[On Generative Agents in Recommendation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.10108)** (*2024*) `SIGIR`\n  > Proposes Agent4Rec, an LLM-empowered user simulator with profile, memory, and action modules for exploring the simulation of human behavior.\n\n- **[ChatDev: Communicative Agents for Software Development](https:\u002F\u002Faclanthology.org\u002F2024.acl-long.810\u002F)** (*2024*) `*ACL`\n  > Introduces ChatDev, a chat-powered development framework in which LLM agents communicate in unified language across design, coding, and testing, bridging the phases. https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FChatDev\n\n- **[CRISPR-GPT: An LLM Agent for Automated Design of Gene-Editing Experiments](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.18021)** (*2024*) `Arxiv`\n  > Introduces CRISPR-GPT, an LLM agent with domain knowledge and tools that automates gene-editing experiment design while attending to ethics, bridging researchers and technology.\n\n- **[SciAgents: Automating Scientific Discovery Through Bioinspired Multi-Agent Intelligent Graph Reasoning](https:\u002F\u002Fadvanced.onlinelibrary.wiley.com\u002Fdoi\u002Ffull\u002F10.1002\u002Fadma.202413523)** (*2024*) `Advanced Materials`\n  > SciAgents combines ontological knowledge graphs, LLMs, and multi-agent systems to mine interdisciplinary relationships in biomaterials, autonomously generating and refining hypotheses to advance discovery.\n\n- **[Medical large language models are vulnerable to targeted misinformation attacks](https:\u002F\u002Fdoi.org\u002F10.1038\u002Fs41746-024-01282-7)** (*2024*) `npj Digital Medicine`\n  > Reveals the fragility of medical LLMs: fine-tuning just 1.1% of weights suffices to inject misinformation, underscoring the need for stronger safeguards.\n\n- **[CellAgent: An LLM-Driven Multi-Agent Framework for Automated Single-Cell Data Analysis](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.09811)** (*2024*) `Arxiv`\n  > Introduces CellAgent, an LLM-driven multi-agent framework for scRNA-seq analysis with expert roles, decision-making, and self-iteration, substantially reducing workload.\n\n- 
**[Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01560)** (*2023*) `NeurIPS`\n  > Proposes DEPS, an interactive planning method that uses LLMs to support multi-task agents, continually refining plans and proving effective across domains.\n\n- **[Language Models Meet World Models: Embodied Experiences Enhance Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.10626.pdf)** (*2023*) `NeurIPS`\n  > Combines language models with world models, using embodied experiences to boost performance; the approach applies to simulated game settings and strengthens model capability.\n\n- **[ChessGPT: Bridging Policy Learning and Language Modeling](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Fhash\u002F16b14e3f288f076e0ca73bdad6405f77-Abstract-Datasets_and_Benchmarks.html)** (*2023*) `NeurIPS`\n  > Explores the bridge between policy learning and language modeling, with applications spanning competitive games and new ideas for large-model-based agents.\n\n- **[MindAgent: Emergent Gaming Interaction](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.09971)** (*2023*) `Arxiv`\n  > Studies emergent interaction in cooperative games, opening new application scenarios for large-model-based agents.\n\n- **[Exploring Large Language Models for Communication Games: An Empirical Study on Werewolf](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.04658)** (*2023*) `Arxiv`\n  > An empirical study of LLMs in the communication game Werewolf, contributing a novel practical case to the field.\n\n- **[Language as Reality: A Co-Creative Storytelling Game Experience in 1001 Nights Using Generative AI](https:\u002F\u002Fojs.aaai.org\u002Findex.php\u002FAIIDE\u002Farticle\u002Fview\u002F27539)** (*2023*) `AAAI`\n  > Presents a generative-AI co-creative storytelling game based on 1001 Nights, opening a new direction for game generation.\n\n- **[TradingGPT: Multi-Agent System with Layered Memory and Distinct Characters for Enhanced Financial Trading Performance](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.03736)** (*2023*) `Arxiv`\n  > TradingGPT proposes a multi-agent system with layered memory and distinct roles to improve financial trading performance, an innovative attempt in the field.\n\n- **[Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies](https:\u002F\u002Fproceedings.mlr.press\u002Fv202\u002Faher23a\u002Faher23a.pdf)** (*2023*) `ICML`\n  > Applies LLMs to simulating humans in order to replicate subject studies, offering a new method for psychology applications.\n\n- **[Generative Agents: Interactive Simulacra of Human Behavior](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.03442)** (*2023*) `UIST`\n  > Proposes generative agents that simulate human behavior, an innovation in social applications and a promising complement to large-model-based agent research.\n\n- **[Self-Collaboration Code Generation via ChatGPT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.07590)** (*2023*) `TOSEM`\n  > Proposes a collaborative framework for automatic code generation with LLMs such as ChatGPT, assembling virtual agent teams that handle complex tasks more effectively without human intervention.\n\n- **[Language Models can Solve Computer Tasks](https:\u002F\u002Fopenreview.net\u002Fpdf?id=M6OmjAZ4CX)** (*2023*) `NeurIPS`\n  > Introduces RCI, a simple prompt-engineering scheme that lets pre-trained 
LLMs execute computer tasks through natural language, improving reasoning and outperforming other methods.\n\n- **[ChemCrow: Augmenting Large Language Models with Chemistry Tools](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.05376)** (*2023*) `Arxiv`\n  > Introduces ChemCrow, an LLM chemistry agent integrating 18 tools that strengthens LLMs in chemistry, automates tasks, and bridges experimental and computational chemistry.\n\n- **[AlphaFlow: Autonomous Discovery and Optimization of Multi-Step Chemistry Using a Self-Driven Fluidic Lab Guided by Reinforcement Learning](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41467-023-37139-y)** (*2023*) `Nature Communications`\n  > Introduces AlphaFlow, a reinforcement-learning-guided self-driving fluidic lab for discovering multi-step chemical reactions, showing its potential to open synthetic routes beyond cALD.\n\n- **[Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fhuang22a.html)** (*2022*) `ICML`\n  > Proposes using language models as zero-shot planners for embodied agents, applicable in simulated games; its novelty lies in extracting actionable knowledge.\n\n- **[Stress-Testing the Resilience of the Austrian Healthcare System Using Agent-Based Simulation](https:\u002F\u002Fdoi.org\u002F10.1038\u002Fs41467-022-31766-7)** (*2022*) `Nature Communications`\n  > This data-driven agent-based simulation framework quantifies regional healthcare systems' resilience to shocks, helping identify access bottlenecks and linking systemic to individual indicators.\n\n\n\n### Datasets & Benchmarks\n\n- **[AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents](https:\u002F\u002Fopenreview.net\u002Fpdf?id=AC5n7xHuR1)** (*2025*) `ICLR`\n  > Proposes AgentHarm, a new benchmark for evaluating LLM agent robustness covering 11 harm categories, usable for assessing attacks and defenses.\n\n- **[AI Hospital: Benchmarking Large Language Models in a Multi-Agent Medical Interaction Simulator](https:\u002F\u002Faclanthology.org\u002F2025.coling-main.680.pdf)** (*2025*) `*ACL`\n  > Proposes AI Hospital for simulating medical interactions, develops the MVME benchmark, and introduces a dispute-resolution mechanism to improve LLMs' clinical capabilities.\n\n- **[Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation](https:\u002F\u002Faclanthology.org\u002F2025.coling-main.223.pdf)** (*2025*) `*ACL`\n  > A self-evolving multi-agent benchmarking framework that dynamically evaluates LLMs, reframing instances and extending datasets to aid model selection and keep benchmarks evolving.\n\n- **[DCA-Bench: A Benchmark for Dataset Curation Agents](https:\u002F\u002Fopenreview.net\u002Fpdf?id=a4sknPttwV)** (*2025*) `ICLR`\n  > Establishes a benchmark for LLM agents to detect issues in raw datasets, curates test cases, and proposes an evaluation framework, advancing real-world dataset curation.\n\n- **[MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.14654)** (*2025*) `Arxiv`\n  > Introduces MedAgentBench, a benchmark for medical LLM agents built on clinical tasks and realistic data, enabling evaluation and optimization in healthcare.\n\n- **[MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering](https:\u002F\u002Fopenreview.net\u002Fpdf?id=6s5uXNWGIh)** (*2025*) `ICLR`\n  > 
Proposes MLE-Bench for evaluating AI agents on machine-learning engineering, with carefully designed tasks, baselines, and model evaluations, and open-sourced code for further research.\n\n- **[EgoLife: Towards Egocentric Life Assistant](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.03803)** (*2025*) `Arxiv`\n  > Introduces the EgoLife project for building an egocentric life assistant, creating the EgoLife dataset and EgoLifeQA tasks for everyday assistance.\n\n- **[DSBench: How Far Are Data Science Agents from Becoming Data Science Experts?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.07703)** (*2025*) `ICLR`\n  > Introduces DSBench, a comprehensive data-science agent benchmark with realistic tasks that narrows the gap between benchmarks and real applications.\n\n- **[Towards Internet-Scale Training for Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.06776)** (*2025*) `Arxiv`\n  > Develops a pipeline for internet-scale agent training without heavy human annotation, with LLMs handling task generation, execution, and review.\n\n- **[macOSWorld: An Interactive Benchmark for GUI Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04135)** (*2025*) `Arxiv`\n  > Proposes macOSWorld, the first macOS GUI-agent benchmark with multilingual tasks and a safety subset, filling a gap in OS evaluation.\n\n- **[Humanity's Last Exam](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.14249)** (*2025*) `Arxiv`\n  > Introduces Humanity's Last Exam (HLE), a multimodal, broad-coverage LLM benchmark that exposes current shortcomings and is publicly released for research.\n\n- **[MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.12806)** (*2025*) `Arxiv, *ACL`\n  > Proposes MCPEval, an open-source MCP-based framework that automates LLM agent evaluation, standardizing metrics and reducing manual effort.\n\n- **[IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.18223)** (*2025*) `Arxiv`\n  > Introduces IDA-Bench, a new benchmark for LLMs in multi-round data analysis that stresses the balance between instruction following and reasoning.\n\n- **[SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.11791)** (*2025*) `Arxiv`\n  > Proposes SEC-bench, a framework for automatically benchmarking LLM agents on real-world security tasks, with an innovative multi-agent collaboration structure for dataset generation.\n\n- **[MMSearch-Plus: A Simple Yet Challenging Benchmark for Multimodal Browsing Agents](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.21475)** (*2025*) `Arxiv`\n  > Introduces the MMSearch-Plus benchmark for multimodal browsing agents, with novel data curation and an agentic framework targeting genuine multimodal challenges.\n\n- **[MultiAgentBench: Evaluating the Collaboration and Competition of LLM Agents](https:\u002F\u002Faclanthology.org\u002F2025.acl-long.421\u002F)** (*2025*) `*ACL`\n  > Introduces MultiAgentBench for evaluating multi-agent systems across diverse scenarios, examining protocols and strategies such as cognitive planning, with code and data to be released.\n\n- **[Establishing Best Practices for Building Rigorous Agentic Benchmarks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.02825)** (*2025*) `Arxiv`\n  > 
许多智能体基准测试存在设置或奖励机制方面的问题。本文提出了ABC指南，以使智能体评估更加严谨。\n\n- **[UserBench：面向以用户为中心的智能体的交互式训练环境](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.22034)** (*2025*) `Arxiv`\n  > 介绍了UserBench，一个包含模拟用户的交互式环境，用于评估智能体在主动协作方面的能力。它通过多轮对话和工具使用来衡量智能体澄清模糊且不断变化的目标的能力，揭示了任务完成度与用户满意度之间的重要差距。\n\n- **[PillagerBench：用于评估基于LLM的智能体在Minecraft中竞争力的多智能体基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.06235)** (*2025*) `Arxiv`\n  > 本文介绍了PillagerBench，一个用于在Minecraft中评估多智能体竞争力的基准测试，以及TactiCrafter，一种使用人类可读战术并学习因果依赖关系以适应对手的智能体。\n\n- **[UnrealZoo：为具身AI丰富照片级逼真的虚拟世界](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.20977)** (*2025*) `CVPR\u002FICCV\u002FECCV`\n  > UnrealZoo是一个高保真3D虚拟世界平台，拥有丰富的实体和增强的工具，适用于具身AI。它能够高效地进行多智能体训练，并表明环境多样性对于开发能够应对开放世界复杂性的通用智能体至关重要。\n\n- **[以游戏为探针：评估LLM概念知识的游戏化基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.17512)** (*2025*) `Arxiv`\n  > CK-Arena引入了一个多智能体游戏基准，通过描述和区分概念等互动任务来评估LLM的概念推理能力，从而超越了单纯的静态事实回忆。\n\n- **[NewtonBench：评估LLM智能体在通用科学定律发现方面的基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.07172)** (*2025*) `Arxiv`\n  > NewtonBench引入了一个可扩展且不易受记忆影响的科学定律发现基准测试。它通过交互式的模型探索来评估智能体发现隐藏原理的能力，超越了静态的函数拟合，更贴近真实的科学探究过程。\n\n- **[LiveMCP-101：在复杂查询上对启用MCP的智能体进行压力测试与诊断]**（2025）`Arxiv`\n  > 本文介绍了LiveMCP-101，这是一个包含101个需要多工具协同操作的真实世界查询基准。其核心创新在于采用基于真实执行计划的新评估方法，以更准确地反映动态环境，从而严格测试智能体的能力。\n\n- **[分布式多智能体系统的阿喀琉斯之踵]**（2025）`Arxiv`\n  > 本文提出了一种分布式多智能体系统（DMAS）框架，并指出其易受搭便车行为和恶意攻击等关键可信度问题的影响，可作为未来研究的红队工具。\n\n- **[AgentBench：评估LLM作为智能体]**（2024）`ICLR`\n  > 介绍了AgentBench，该基准包含8个环境用于评估LLM智能体，分析了失败原因，并提出了多轮对齐训练等改进策略。\n\n- **[AgentQuest：衡量进展并改进LLM智能体的模块化基准框架]**（2024）`ACL`\n  > 提出了AgentQuest，一个包含模块化基准和指标的框架，并设计了两种新的评估指标来跟踪LLM智能体的进展。\n\n- **[BENCHAGENTS：通过智能体交互实现基准自动化创建]**（2024）`Arxiv`\n  > 介绍了BENCHAGENTS，一个基于LLM的框架，用于自动创建复杂能力的基准，并通过智能体交互和人工反馈确保数据质量。\n\n- **[数据科学智能体的基准测试]**（2024）`ACL`\n  > 提出了DSEval，一种针对数据科学智能体的新型评估范式及基准，采用自举方法以提高覆盖范围和全面性。\n\n- **[将大型语言模型作为AI研究智能体进行基准测试]**（2024）`ICLR`\n  > 提出了MLAgent-Bench，用于基准测试AI研究智能体，设计了一个基于LLM的智能体，并指出了此类智能体面临的关键挑战。\n\n- 
**[面向多智能体系统的大型语言模型基准测试：AutoGen、CrewAI和TaskWeaver的比较分析]**（2024）`其他`\n  > 该文对三个由LLM驱动的多智能体系统（AutoGen、CrewAI、TaskWeaver）在机器学习代码生成方面的性能进行了基准测试，推动了协作式问题解决的研究。\n\n- **[BLADE：语言模型智能体的基准测试]**（2024）`ACL`\n  > 本文提出了BLADE，一个用于自动评估智能体在开放式研究中多方面方法的基准，支持面向数据驱动科学的智能体评估。\n\n- **[CRAB：跨平台多模态具身语言模型智能体基准]**（2024）`NeurIPS`\n  > 推出了CRAB，一个基于图的评估方法和任务构建机制的跨环境智能体基准框架，并开发了CRAB Benchmark-v0。\n\n- **[CToolEval：面向中文LLM智能体在真实API交互中的基准]**（2024）`ACL`\n  > 提议了CToolEval基准，适用于中文LLM智能体，包含398个API。提出了评估框架，并公开了数据和代码，以促进智能体层面的研究。\n\n- **[DA-Code：面向大型语言模型的数据科学代码生成智能体基准]**（2024）`ACL`\n  > 介绍了DA-Code，这是一个针对基于智能体的数据科学领域的代码生成基准。它具有独特的任务、真实数据，并要求使用复杂语言，相关资源已在GitHub上发布。\n\n- **[具身智能体接口：评估LLM在具身决策中的表现]**（2024）`NeurIPS`\n  > 提出了具身智能体接口，旨在统一任务、模块和指标，全面评估LLM在具身决策中的能力。\n\n- **[GTA：通用工具智能体基准]**（2024）`NeurIPS`\n  > 提出了GTA，一个针对通用工具智能体的基准，包含真实查询、已部署工具和多模态输入，用于评估LLM在现实世界问题解决中的能力。\n\n- **[LaMPilot：基于语言模型程序的自动驾驶开放基准数据集]**（2024）`CVPR\u002FICCV\u002FECCV`\n  > 展示了LaMPilot，将LLM集成到自动驾驶系统中以遵循用户指令。同时推出了LaMPilot-Bench，并发布了代码和数据供进一步研究。\n\n- **[ML研究基准]**（2024）`Arxiv`\n  > 本文提出了ML研究基准，包含7项针对AI智能体的任务，提供了一个评估其在现实世界研究挑战中表现的框架。\n\n- **[MMAU：跨不同领域智能体能力的综合性基准]**（2024）`Arxiv`\n  > 本文介绍了MMAU基准，包含横跨5个领域和5种能力的离线任务，提升了LLM智能体的可解释性。\n\n- **[OmniACT：为桌面和Web端赋能多模态通才型自主智能体的数据集与基准]**（2024）`CVPR\u002FICCV\u002FECCV`\n  > 介绍了OmniACT，这是首个用于评估智能体在计算机任务自动化方面能力的数据集和基准，覆盖桌面应用程序。\n\n- **[OSWorld：在真实计算机环境中对多模态智能体进行开放式任务的基准测试]**（2024）`NeurIPS`\n  > 推出了OSWorld，一个可扩展的真实计算机环境，专为多模态智能体设计，并创建了包含369个任务的基准，助力多模态通才智能体的发展。\n\n- **[重新审视基准与评估：一种基于智能体的探索性动态评估框架，用于LLM]**（2024）`Arxiv`\n  > 本文引入了Benchmark+和Assessment+，提出了TestAgent框架，能够动态生成基准并对LLM进行领域自适应评估。\n\n- **[Seal-Tools：用于智能体调优及详细基准测试的自我指令式工具学习数据集]**（2024）`其他`\n  > 这是一个全新的工具学习数据集Seal-Tools，采用自我指令式生成方法，包含高难度实例和严格指标，作为LLM工具调用能力的新基准。\n\n- **[Tapilot-Crossing：面向交互式数据分析代理的LLM基准测试与演进](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.05307)** (*2024*) `Arxiv`\n  > 该论文提出了用于评估交互式数据分析中LLM代理的Tapilot-Crossing基准，并提出AIR策略，以将LLM演化为高效的代理。\n\n- 
**[TheAgentCompany：针对具有实际影响的真实世界任务的LLM代理基准测试](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.14161)** (*2024*) `Arxiv`\n  > 本文介绍了TheAgentCompany，这是一个可扩展的基准测试平台，用于在模拟软件公司环境的真实世界任务上评估AI代理。\n\n- **[Tur[k]ingBench：面向网络代理的挑战性基准测试](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.11905)** (*2024*) `Arxiv`\n  > 介绍了TurkingBench，一个基于众包自然HTML页面的基准测试。开发了一套评估框架，以推动基于Web的代理技术进步。\n\n- **[Agent-FLAN：为大型语言模型设计有效的代理微调数据与方法](https:\u002F\u002Faclanthology.org\u002F2024.findings-acl.557\u002F)** (*2024*) `*ACL`\n  > 论文指出了代理训练中的问题，提出了Agent-FLAN来微调LLM，对语料库进行分解，并使用负样本以减少幻觉现象。\n\n- **[AgentBank：通过在5万多个交互轨迹上微调，迈向通用LLM代理](https:\u002F\u002Faclanthology.org\u002F2024.findings-emnlp.116\u002F)** (*2024*) `*ACL`\n  > 介绍了AgentBank，一个大规模的交互轨迹数据集。采用了新颖的标注方式，对LLM进行微调以得到Samoyed模型，并展示了数据规模对代理能力的影响。\n\n- **[AgentOhana：设计统一的数据与训练流水线，以实现高效的代理学习](http:\u002F\u002Farxiv.org\u002Fabs\u002F2402.15506)** (*2024*) `Arxiv`\n  > 介绍了AgentOhana，旨在整合来自不同来源的代理轨迹，构建平衡的训练流水线，并提出了xLAM-v0.1用于AI代理。\n\n- **[AgentTuning：为LLM赋予通用代理能力](https:\u002F\u002Faclanthology.org\u002F2024.findings-acl.181\u002F)** (*2024*) `*ACL`\n  > 本文提出了AgentTuning方法，利用AgentInstruct和混合微调来提升LLM的代理能力，同时不牺牲其通用性，并开源了相关模型。\n\n- **[可执行代码动作能催生更优秀的LLM代理](https:\u002F\u002Fproceedings.mlr.press\u002Fv235\u002Fwang24h.html)** (*2024*) `ICML`\n  > 本文提出了CodeAct，即使用Python代码作为LLM代理的动作。创建了一个数据集，并训练了一个具备自我调试能力的微调代理，以应对复杂任务。\n\n- **[AppWorld：用于基准测试交互式编码代理的可控应用与人物世界](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.18901)** (*2024*) `*ACL`\n  > 构建了AppWorld引擎和基准测试，以填补现有工具使用基准的空白，从而支持丰富且交互式的代码生成任务。\n\n- **[SheetAgent：借助大型语言模型迈向电子表格推理与操作的通用代理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.03636)** (*2024*) `WWW`\n  > 该论文介绍了SheetRM基准，并提出了SheetAgent，一种由三个模块组成的基于LLM的代理，能够实现自主的电子表格推理与操作。\n\n- **[GenoTEX：用于自动化基因表达数据分析的LLM代理基准](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.15341)** (*2024*) `Arxiv`\n  > 本文介绍了GenoTEX，一个用于基因表达分析的LLM代理基准，以及GenoAgent，一个采用自我纠正代码生成技术的多智能体系统，用于自动化整个生物信息学流程。\n\n- **[FireAct：迈向语言代理的微调](http:\u002F\u002Farxiv.org\u002Fabs\u002F2310.05915)** (*2023*) 
`Arxiv`\n  > 该论文探讨了语言代理的LM微调问题。提出了使用多样化数据的FireAct方法，揭示了其优势并提供了实验见解。\n\n\n\n### 伦理\n\n- **[医疗大型语言模型易受数据投毒攻击](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41591-024-03445-1)** (*2025*) `Nature Medicine`\n  > 论文评估了LLM的数据投毒攻击，发现低比例的错误信息也会损害模型，并提出了一种基于图的缓解策略。\n\n- **[基础模型与合理使用](https:\u002F\u002Fwww.jmlr.org\u002Fpapers\u002Fv24\u002F23-0569.html)** (*2024*) `JMLR`\n  > 讨论了基础模型在受版权保护数据上的法律与伦理风险，提出了技术性缓解措施，并倡导法律与技术的协同发展以实现合理使用。\n\n- **[估算BLOOM语言模型的碳足迹（参数量1760亿）](https:\u002F\u002Fwww.jmlr.org\u002Fpapers\u002Fv24\u002F23-0069.html)** (*2023*) `JMLR`\n  > 该论文量化了BLOOM在其整个生命周期中的碳足迹，并研究了其推理阶段的排放情况，讨论了估算中的挑战及未来的研究方向。\n\n- **[LLaMA：开放且高效的语言基础模型](https:\u002F\u002Fai.meta.com\u002Fresearch\u002Fpublications\u002Fllama-open-and-efficient-foundation-language-models\u002F)** (*2023*)\n  > 介绍了LLaMA系列模型（70亿至650亿参数）。仅使用公开数据集进行训练，并将其发布给科研社区。\n\n- **[大型生成模型中的可预测性与意外性](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3531146.3533229)** (*2022*) `FAccT`\n  > 本文揭示了大型生成模型在可预测性与不可预测性之间的悖论，指出了其潜在危害，并列举了AI社区可以采取的干预措施。\n\n- **[关于随机鹦鹉的危险：语言模型是否可能太大？🦜](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3442188.3445922)** (*2021*) `FAccT`\n  > 该论文质疑语言模型的规模，探讨了相关的风险，并提出了超越单纯扩大规模之外的缓解建议。\n\n- **[价值观导向数据集驱动的语言模型社会适应过程（PALMS）](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2021\u002Fhash\u002F2e855f9489df0712b4bd8ea9e2848c5a-Abstract.html)** (*2021*) `NeurIPS`\n  > 提出了PALMS，一种迭代过程，利用价值观导向的数据集，通过少量精选数据来改变语言模型的行为。\n\n- **[GPT-3：其本质、范围、局限及后果](https:\u002F\u002Flink.springer.com\u002Farticle\u002F10.1007\u002Fs11023-020-09548-1)** (*2020*)\n  > 论文通过可逆与不可逆的问题分析了GPT-3，提出了它无法通过的三项测试，并阐述了人工制品工业化的后果。\n\n- **[现代深度学习研究中的能源与政策考量](https:\u002F\u002Fojs.aaai.org\u002Findex.php\u002FAAAI\u002Farticle\u002Fview\u002F7123)** (*2020*) `AAAI`\n  > 揭示了大型神经网络计算的高昂成本，量化了NLP模型的成本，并提出了降低成本与促进公平的建议。\n\n- **[防御神经网络虚假新闻](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper\u002F2019\u002Fhash\u002F3e9f0fc9b2f89e043bc6233994dfcf76-Abstract.html)** (*2019*) `NeurIPS`\n  > 
介绍了Grover，一种可控制文本生成技术，用于研究神经网络虚假新闻的风险，展示了Grover的自我防御价值，并讨论了相关伦理问题。\n\n### 安全\n\n- **[RTBAS：防御LLM代理免受提示注入与隐私泄露](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.08966)** (*2025*) `Arxiv`\n  > 论文介绍了RTBAS，它是TBAS的一种变体，结合了信息流控制技术，并使用新型筛选器自动处理工具调用，从而减轻用户负担。\n\n- **[通过通信攻击对 LLM 多智能体系统进行红队测试](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.14847)** (*2025*) `Arxiv`\n  > 提出了一种名为“中间人智能体”（AiTM）的新攻击方法，通过操纵消息对 LLM-MAS 进行攻击，揭示了多智能体系统安全性的迫切需求。\n\n- **[揭示 LLM 智能体内存中的隐私风险](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13172)** (*2025*) `Arxiv`\n  > 该论文在黑盒环境下提出了 MEXTRA 方法，用于从 LLM 智能体内存中提取隐私信息，并探讨了泄露因素，呼吁采取相应的防护措施。\n\n- **[AEIA-MN：评估多模态 LLM 驱动的移动智能体对抗主动环境注入攻击的鲁棒性](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.13053)** (*2025*) `Arxiv`\n  > 论文定义了主动环境注入攻击（AEIA），并提出 AEIA-MN 方法来评估基于 MLLM 的智能体对此类威胁的鲁棒性。\n\n- **[用于保护动态 LLM 智能体网络的防火墙](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.01822)** (*2025*) `Arxiv`\n  > 论文识别了 LLM 智能体网络的通信特性，提出了一种平衡设计，并通过仿真构建了防火墙规则。\n\n- **[AUTOHIJACKER：针对黑盒 LLM 智能体的自动间接提示注入攻击](https:\u002F\u002Fopenreview.net\u002Fpdf?id=2VmB01D9Ef)** (*2025*) `Arxiv`\n  > 提出了 AutoHijacker，一种自动化的黑盒间接提示注入攻击。它利用 LLM 作为优化器，采用批处理优化和可训练的记忆机制。\n\n- **[受威胁的 AI 智能体：关键安全挑战与未来发展方向综述](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fpdf\u002F10.1145\u002F3716628)** (*2025*) `ACM 计算综述`\n  > 本文将新兴的 AI 智能体安全威胁归纳为四大知识空白，旨在激发研究以开发更安全的智能体应用。\n\n- **[SAGA：用于治理 AI 智能体系统的安全架构](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.21034)** (*2025*) `Arxiv`\n  > 提出了 SAGA 安全架构，用于治理智能体系统，提供用户监督和细粒度访问控制，从而实现安全的智能体部署。\n\n- **[WebInject：针对 Web 智能体的提示注入攻击](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.11717)** (*2025*) `Arxiv`\n  > 提出了 WebInject，一种针对 Web 智能体的提示注入攻击。该方法引入像素扰动，训练神经网络进行映射，并解决优化问题。\n\n- **[LLM 驱动的多智能体系统中的网络欺诈攻击](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.01211)** (*2025*) `Arxiv`\n  > 本文介绍了网络欺诈攻击，这是一种新颖的方法，通过篡改域名和伪装链接，诱导 LLM 驱动的多智能体系统访问恶意网站，从而绕过复杂的越狱技术。\n\n- **[攻击 LLM 和 AI 智能体：针对大型语言模型的广告嵌入攻击](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.17674)** (*2025*) `Arxiv`\n  > 介绍了广告嵌入攻击（AEA），这是一种新型威胁，能够劫持 LLM 
将隐蔽的推广或恶意内容注入到输出中。文中详细描述了两种低成本的攻击向量，并提出了一种基于提示的防御方法，突出了 AI 安全领域的一个关键缺口。\n\n- **[超越数据隐私：大型语言模型的新隐私风险](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.14278)** (*2025*) `Arxiv`\n  > 本文认为，除了训练数据泄露之外，LLM 的部署还带来了新的隐私风险，这些风险源于其自主推理能力以及与应用程序的集成，从而可能引发复杂的大规模攻击，威胁个人和社会的安全。\n\n- **[隐私在行动：面向 LLM 驱动智能体的真实隐私缓解与评估](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.17488)** (*2025*) `Arxiv`\n  > 本文提出了 PrivacyChecker，一种与模型无关的缓解方法，以及 PrivacyLens-Live，一个动态基准测试工具。它们通过将情境完整性融入智能体协议，来应对 LLM 智能体中的新隐私风险。\n\n- **[PrivWeb：面向 Web 智能体的无感且内容感知型隐私保护](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.11939)** (*2025*) `Arxiv`\n  > 该工作提出了 PrivWeb，一种适用于 Web 智能体的隐私框架，利用本地 LLM 根据用户偏好自动匿名化屏幕上的数据，在自动化保护与用户控制之间取得平衡，并通过自适应、上下文感知的通知实现这一目标。\n\n- **[DemonAgent：针对基于 LLM 的智能体的动态加密多后门植入攻击](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12575)** (*2025*) `*ACL`\n  > 提出了动态加密多后门植入攻击，采用动态加密和子片段技术以绕过安全审计。同时发布了 AgentBackdoorEval 数据集。\n\n- **[CORBA：基于大型语言模型的多智能体系统中的传染性递归阻塞攻击](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14529)** (*2025*) `Arxiv`\n  > 提出了 LLM-MAS 中的传染性递归阻塞攻击（Corba）。该攻击在传播性和资源耗竭方面具有创新性，难以通过对齐手段缓解。\n\n- **[G-Safeguard：一种基于拓扑的 LLM 基础多智能体系统安全视角与治理方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.11127)** (*2025*) `*ACL`\n  > 论文介绍了 G-Safeguard，用于 LLM-MAS。它利用图神经网络进行异常检测和拓扑干预，具有较强的适应性，可与主流的 MAS 系统结合使用。\n\n- **[AgentHarm：评估 LLM 智能体在有害任务上的鲁棒性基准](https:\u002F\u002Fopenreview.net\u002Fforum?id=AC5n7xHuR1)** (*2025*) `ICLR`\n  > 提出了 AgentHarm，一个包含 110 个恶意任务的新基准，用于评估 LLM 智能体的攻击与防御能力。\n\n- **[商用 LLM 智能体已经容易受到简单但危险的攻击]** (*2025*) `Arxiv`\n  > 本文分析了 LLM 智能体独特的安全和隐私漏洞，提供了攻击分类，并进行了无需机器学习知识的简单攻击实验。\n\n- **[LLM（-智能体）全栈安全综合综述：数据、训练与部署](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.15585)** (*2025*) `Arxiv`\n  > 本文提出了 LLM 的“全栈”安全概念，覆盖整个生命周期，并结合大量文献和独特见解，指明了未来的研究方向。\n\n- **提示感染：多智能体系统中的 LLM 到 LLM 提示注入** (*2025*)\n  > 论文揭示了多智能体系统中 LLM 到 LLM 的提示注入现象，提出了“提示感染”方法，并建议通过 LLM 标记来缓解这一问题。\n\n- **[基于 LLM 的多智能体系统：技术与商业视角](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.14033?)** (*2024*) `Arxiv`\n  > 本文探讨了基于 LLM 
的多智能体系统（LaMAS），阐述了其优势，提供了协议，并将其视为实现人工群体智能的一种解决方案。\n\n- **[BlockAgents：通过区块链实现拜占庭容错的 LLM 基础多智能体协作](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fpdf\u002F10.1145\u002F3674399.3674445)** (*2024*) `TURC`\n  > 论文提出了 BlockAgents，将区块链技术整合到 LLM 基础多智能体系统中。该方案采用 PoT 机制和多指标评估，以缓解拜占庭行为。\n\n- **[提示感染：多智能体系统中的 LLM 到 LLM 提示注入]**（2024）`Arxiv`\n  > 揭示了多智能体系统中 LLM 到 LLM 的提示注入攻击，提出了“提示感染”机制及“LLM 标记”防御方法，以提升安全性。\n\n- **[AgentDojo：评估 LLM 智能体提示注入攻击与防御的动态环境]**（2024）`NeurIPS`\n  > 介绍了 AgentDojo，一个用于在不可信数据上评估 AI 智能体的可扩展框架，旨在促进可靠且鲁棒的智能体设计。\n\n- **[AGENTPOISON：通过污染内存或知识库对 LLM 智能体进行红队攻防演练]**（2024）`NeurIPS`\n  > 提出 AGENTPOISON，一种通过污染内存\u002F知识库对 LLM 智能体实施新型后门攻击的方法。无需额外训练，具有良好的迁移性和隐蔽性。\n\n- **[AutoDefense：针对越狱攻击的多智能体 LLM 防御框架]**（2024）`Arxiv`\n  > 提出了 AutoDefense，一个过滤 LLM 有害响应的多智能体框架，能够抵御各类攻击，并使小型 LLM 能够保护大型模型。\n\n- **[Imprompter：诱使 LLM 智能体不当使用工具]**（2024）`Arxiv`\n  > 该论文致力于智能体系统安全研究，提出了一种混淆的对抗性提示攻击，并证明其对多种智能体均有效。\n\n- **[直击核心：通过直接操控 LLM 对基于 RAG 的智能体进行简单而有效的攻击]**（2024）`Arxiv`\n  > 本文揭示了一种由对抗性前缀引发的严重 LLM 漏洞，强调了在基于智能体的架构中实施多层安全防护的必要性。\n\n- **[提示注入作为应对 LLM 驱动网络攻击的防御手段]**（2024）`Arxiv`\n  > 提出了 Mantis，一个利用提示注入来反击 LLM 驱动网络攻击的防御框架，能够自主“黑客式”回击攻击者，并且是开源的。\n\n- **[邪恶天才：深入探讨基于 LLM 的智能体安全性]**（2024）`Arxiv`\n  > 该论文从三个方面探讨了基于 LLM 的智能体安全问题。提出了一种基于模板的攻击方法以及“邪恶天才”分析法，用于深入研究。\n\n- **[智能体安全基准（ASB）：为基于 LLM 的智能体攻击与防御提供形式化与基准测试]**（2024）`Arxiv`\n  > 介绍了智能体安全基准（ASB），用于形式化和基准测试基于 LLM 的智能体攻击与防御，揭示了潜在漏洞并指明未来研究方向。\n\n- **[AGENTHARM：衡量 LLM 智能体危害性的基准测试]**（2024）`Arxiv`\n  > 提出了一项新的基准测试 AgentHarm，用于研究 LLM 智能体的滥用问题，包含多样化的恶意任务和独特的评分标准。\n\n- **[CLAS 2024：LLM 和智能体安全竞赛]**（2024）`Arxiv`\n  > CLAS 2024 通过三个赛道推动对 LLM 和智能体安全的理解，促进社区协作，以构建更安全的 AI 系统。\n\n- **[任务盾牌：通过强制任务对齐防御 LLM 智能体中的间接提示注入]**（2024）`Arxiv`\n  > 该论文提出将智能体安全重新定义为确保任务对齐，并开发了任务盾牌，用于验证指令是否真正有助于实现用户目标。\n\n- **[WIPI：LLM 驱动的 Web 智能体面临的新型网络威胁]**（2024）`Arxiv`\n  > 本文介绍了一种新型威胁 WIPI，它通过网页指令间接控制 Web 智能体，从而提高攻击效率和隐蔽性。\n\n- **[史密斯特工：一张图片即可呈指数级速度越狱百万个多模态 LLM 智能体]**（2024）`Arxiv`\n  > 论文揭示了多智能体多模态 LLM 中的“传染性越狱”现象，证明其可行性，并提出了一种限制防御传播的原则。\n\n- 
**[CORBA：基于大型语言模型的多智能体系统中的传染性递归阻断攻击]**（2024）`Arxiv`\n  > 介绍了 CORBA，一种针对 LLM 多智能体系统的新型攻击。该攻击具有传染性和递归性，难以通过对齐技术缓解，在不同拓扑结构和模型上均有效。\n\n- **[PsySafe：基于心理学的多智能体系统安全攻击、防御与评估综合框架]**（2024）`ACL`\n  > 本文从智能体心理学角度探讨多智能体系统安全问题，提出了 PsySafe 框架，并为未来研究提供了洞见。\n\n- **[破解 ReAct 智能体：先入为主攻击让你得逞]**（2024）`Arxiv`\n  > 该论文介绍了针对 ReAct 智能体的“先入为主”攻击，并提出了一种反射机制来缓解这一安全漏洞。\n\n- **[AGENT-SAFETYBENCH：评估 LLM 智能体的安全性]**（2024）`Arxiv`\n  > 该论文引入了 AGENT-SAFETYBENCH 来评估 LLM 智能体的安全性，指出了现有方法的不足，并强调需要更先进的策略，同时计划发布该基准测试。\n\n- **[INJECAGENT：工具集成型大型语言模型智能体中间接提示注入的基准测试]**（2024）`Arxiv`\n  > 介绍了 INJECAGENT，这是一个用于评估工具集成型 LLM 智能体对 IPI 攻击脆弱性的基准测试，并对攻击意图进行了分类。\n\n- **[PsySafe：基于心理学的多智能体系统安全攻击、防御与评估综合框架]**（2024）`Arxiv`\n  > 本文提出基于智能体心理学的 PsySafe 框架，以解决多智能体系统安全问题，提供了关于风险识别、评估和缓解的见解。\n\n- **[TrustAgent：迈向安全可信的 LLM 基础智能体]**（2024）`ACL`\n  > 介绍了 TrustAgent，一个基于智能体宪章的框架。该框架采用三种策略来保障 LLM 智能体的安全，同时影响其助人能力，并揭示了 LLM 推理的重要性。\n\n- **[警惕你的智能体！探究基于 LLM 的智能体中的后门威胁]**（2024）`NeurIPS`\n  > 该研究构建了一个智能体后门攻击框架，分析了多种形式，并指出需要针对性的防御措施来应对这些威胁。\n\n- **[R-Judge：评估 LLM 智能体安全风险意识的基准测试]**（2024）`ACL`\n  > 介绍了 R-Judge，一个用于评估 LLM 智能体安全风险意识的基准测试，表明仍有改进空间，并揭示了一种有效的微调方法。\n\n- **[NetSafe：探索多智能体网络的拓扑安全性]**（2024）`*ACL`\n  > 本文从拓扑视角探讨多智能体网络的安全性，提出了NetSafe框架，揭示了新的现象，为未来的安全研究提供了指导。\n\n- **[摇摇欲坠的纸牌屋？面向语言智能体的对抗攻击映射]**（2024）`Arxiv`\n  > 首次系统性地梳理了针对语言智能体的对抗攻击，提出了一套框架和12种攻击场景，强调了理解相关风险的紧迫性。\n\n\n\n### 综述\n\n- **[LLM（-Agent）全栈安全的全面综述：数据、训练与部署]**（2025）`Arxiv`\n  > 本文首次提出了LLM“全栈”安全的概念，覆盖整个生命周期，并结合丰富的文献资料和独到见解，为未来研究指明方向。\n\n- **[信任但需验证！测试时扩展的验证设计综述]**（2025）`Arxiv`\n  > 该综述涵盖了多种TTS验证方法，提出了验证器训练的统一视角，填补了现有文献中的空白。\n\n- **[科学类大语言模型综述：从数据基础到智能体前沿]**（2025）`Arxiv`\n  > 本综述以数据为核心重新审视Sci-LLM的发展，分析了数据集、评估方法，并提出将科学发现转向闭环系统的思路。\n\n- **[LLM智能体的评估与基准测试：综述]**（2025）`Arxiv`\n  > 提出了用于LLM智能体评估的二维分类法，指出了企业级应用中的挑战，并明确了系统化评估的未来研究方向。\n\n- **[自进化AI智能体的全面综述：连接基础模型与终身智能体系统的新范式]**（2025）`Arxiv`\n  > 本文对自进化AI智能体进行了全面综述，提出了统一的框架，回顾了相关技术、领域特定策略、评估方法以及安全与伦理问题，旨在打通基础模型与终身智能体系统之间的桥梁。\n\n- **[LLM智能体的评估与基准测试：综述]**（2025）`Arxiv`\n  > 
该综述引入了一个二维分类体系来组织LLM智能体的评估方法，并强调了关键的企业级挑战，为研究人员和从业者提供了一个系统化的框架，用于评估智能体在实际部署中的表现。\n\n- **[面向LLM的智能体强化学习全景：综述]**（2025）`Arxiv`\n  > 本综述正式确立了LLM从生成器向自主智能体的转变，提出了核心能力和应用场景的双重分类体系。它将强化学习定位为整合这些模块、实现适应性强且稳健的智能体行为的关键机制。\n\n- **[大型视觉语言模型的基准评估、应用与挑战：综述]**（2025）`Arxiv`\n  > 本文系统性地概述了VLM的相关内容：模型信息、架构、基准测试、应用及面临的挑战，并在GitHub仓库中提供了详细资料。\n\n- **[未来是智能体时代：多智能体推荐系统的定义、视角与开放挑战]**（2025）`Arxiv`\n  > 本文探讨了LLM智能体在推荐系统中的应用，提出了形式化框架、用例和挑战，为下一代服务奠定了基础。\n\n- **[商用LLM智能体已易受简单却危险的攻击侵害]**（2025）`Arxiv`\n  > 分析了LLM智能体特有的安全与隐私漏洞，提供了攻击分类，并对热门智能体进行了简单的攻击实验。\n\n- **[多智能体协作机制：基于LLM的综述]**（2025）`Arxiv`\n  > 本文综述了基于LLM的多智能体系统，提出了一个框架，探讨了其应用，并指出了人工智能集体智慧所面临的挑战与发展方向。\n\n- **[面临威胁的AI智能体：关键安全挑战与未来路径的综述]**（2025）`ACM Computing Survey`\n  > 该综述指出了AI智能体在安全领域的四大知识缺口。它回顾了各类威胁，展示了当前进展与局限性，并为未来研究提供了启发。\n\n- **[基于大型模型的智能体：现状、合作范式、安全与隐私及未来趋势]**（2024）`Arxiv`\n  > 本文探讨了LM智能体未来的自主协作，涵盖了当前状态、合作范式、安全风险，并提出了未来的研究方向。\n\n- **[智能体AI：多模态交互的前沿探索]**（2024）`Arxiv`\n  > 定义了用于交互式多模态系统的“智能体AI”，探讨了动作预测、缓解模型幻觉等问题，并展望了虚拟交互的未来。\n\n- **[基于大语言模型的多智能体：进展与挑战的综述]**（2024）`Arxiv`\n  > 该综述讨论了基于LLM的多智能体系统的核心要素与挑战，提供了相关数据集，并维护了一个GitHub仓库以更新最新研究成果。\n\n- **[大型多模态智能体：综述]**（2024）`Arxiv`\n  > 回顾了由LLM驱动的多模态智能体，对相关研究进行了分类，整理了评估方法，并提出了未来的发展方向。\n\n- **[理解LLM智能体的规划：综述]**（2024）`Arxiv`\n  > 该综述首次系统性地审视了基于LLM的智能体规划，对相关工作进行了分类，分析了发展方向并讨论了面临的挑战。\n\n- **[计算实验与基于大语言模型的智能体：综述与展望]**（2024）`Arxiv`\n  > 本文探讨了计算实验与基于LLM的智能体相结合的可能性，梳理了二者的历史、相互优势，并讨论了挑战与发展趋势。\n\n- **[个人LLM智能体：能力、效率与安全的洞察与综述]**（2024）`Arxiv`\n  > 本文聚焦于个人LLM智能体，讨论了其架构、挑战及解决方案，将其视为一种重要的软件范式。\n\n- **[基于大型模型的智能体：现状、合作范式、安全与隐私及未来趋势]**（2024）`Arxiv`\n  > 本文探讨了LM智能体未来的自主协作，涵盖了当前状态、关键技术、安全与隐私问题，并提出了未来的研究方向。\n\n- **[用于推理、规划和工具调用的新兴 AI 代理架构全景：综述](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.11584)** (*2024*) `Arxiv`\n  > 本综述评估了 AI 代理的实现方式，分享了其能力、洞见及设计考量，并突出了构建稳健系统的关键主题。\n\n- **[基于大型语言模型的智能代理探索：定义、方法与前景](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.03428)** (*2024*) `Arxiv`\n  > 本文综述了单智能体与多智能体系统中基于 LLM 的智能代理，涵盖其定义、组成部分、部署机制，并展望了其发展前景。\n\n- **[立场论文：迈向整体性智能的代理 
AI](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.00833)** (*2024*) `Arxiv`\n  > 论文提出了用于具身智能的代理基础模型，探讨了代理 AI 的领域能力和跨学科潜力，以指导未来研究方向。\n\n- **[基于大型语言模型的多智能体：进展与挑战的综述](https:\u002F\u002Fwww.ijcai.org\u002Fproceedings\u002F2024\u002F0890.pdf)** (*2024*) `IJCAI`\n  > 本综述深入探讨了基于 LLM 的多智能体系统，涵盖了其运行领域、智能体配置以及技能发展途径。此外，还列出了相关数据集并维护了一个 GitHub 仓库。\n\n- **[具备工具能力的 LLM：综述](http:\u002F\u002Farxiv.org\u002Fabs\u002F2409.18807)** (*2024*) `Arxiv`\n  > 提出了一种标准化的工具集成范式，探讨了其中的挑战、创新解决方案以及 LLM 自主创建工具的理念，并复现了相关实验结果。\n\n- **[基于大型语言模型的代理记忆机制综述](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.13501)** (*2024*) `Arxiv`\n  > 本文全面综述了基于 LLM 的代理记忆机制，回顾了其设计与评估方法，展示了具体应用，并提出了未来的研究方向。\n\n- **[理解 LLM 代理的规划：综述](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.02716)** (*2024*) `Arxiv`\n  > 本综述系统性地梳理了基于 LLM 的代理规划问题，对现有工作进行了分类，分析了研究方向，并讨论了当前面临的挑战。\n\n- **[基于大型语言模型的多智能体：进展与挑战的综述](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.01680)** (*2024*) `Arxiv`\n  > 本综述深入探讨了基于 LLM 的多智能体系统及其各个方面和挑战，并提供了相关数据集和开源代码库。\n\n- **[基于大型语言模型的游戏智能代理综述](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.02039)** (*2024*) `Arxiv`\n  > 本文全面概述了基于 LLM 的游戏智能代理，详细介绍了其架构，调研了不同游戏类型中的代理实例，并提出了未来研发的方向。\n\n- **[大型语言模型与游戏：综述与路线图](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.18659)** (*2024*) `Arxiv`\n  > 本文综述了 LLM 在游戏领域的应用，明确了 LLM 的角色，探讨了尚未开发的领域，并权衡了其潜力与局限性，为后续研究铺平了道路。\n\n- **[基于大型语言模型的智能代理探索：定义、方法与前景](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.03428)** (*2024*) `Arxiv`\n  > 本文综述了单智能体与多智能体系统中基于 LLM 的智能代理，内容涵盖定义、组件、部署方式、数据集，并展望了其发展前景。\n\n- **[风险导航：基于 LLM 的智能代理中的安全、隐私与伦理威胁综述](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.09523?)** (*2024*) `Arxiv`\n  > 本综述分析了基于 LLM 的智能代理所面临的安全、隐私和伦理威胁，提出了分类体系，并建议了未来的研究方向。\n\n- **[AI 代理的安全性](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.08689)** (*2024*) `Arxiv`\n  > 本文从系统角度识别了 AI 代理的安全漏洞，介绍了相应的防御措施，并提出了提升其安全性和可靠性的方法。\n\n- **[个人 LLM 代理：关于能力、效率与安全的洞察与综述](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.05459)** (*2024*) `Arxiv`\n  > 本文聚焦于个人 LLM 代理，讨论了其架构、面临的挑战及解决方案，将其视为一种重要的软件范式。\n\n- **[LLM 
代理的安全与隐私新议题：附案例研究的综述](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.19354)** (*2024*) `Arxiv`\n  > 本文全面概述了 LLM 代理在隐私与安全方面的问题，覆盖了威胁、影响、防御措施及发展趋势，并通过案例研究启发未来研究。\n\n- **[从行动与指令推断沟通智能体的目标](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.16207)** (*2024*) `ICML Workshop`\n  > 本文模拟了人类在合作中的推理能力。它使用 GPT-3 处理指令性话语，并结合多模态贝叶斯逆向规划来推断目标，从而凸显了言语交流的重要性。\n\n- **[个人 LLM 代理：关于能力、效率与安全的洞察与综述](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.05459)** (*2024*) `Arxiv`\n  > 本文聚焦于个人 LLM 代理，讨论了其架构、挑战与解决方案，将其视为一种重要的软件范式。\n\n- **[LLM 红队测试的最新进展：技术、防御与伦理考量](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.09097)** (*2024*) `Arxiv`\n  > 本文分析了 LLM 红队攻击（如基于梯度的方法、强化学习等）及其防御措施，旨在推动更安全、更可靠的语言模型发展。\n\n- **[从长期问题到新兴困境：大型语言模型伦理的解构——综述](https:\u002F\u002Fui.adsabs.harvard.edu\u002F)** (*2024*)\n  > 本文从旧有到新兴的层面综述了 LLM 面临的伦理挑战，分析了相关研究，并强调将伦理融入 LLM 开发的重要性。\n\n- **[基于大型语言模型的自主代理综述](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.11432)** (*2023*) `FCS`\n  > 本文综述了基于 LLM 的自主代理，提出了一套统一的构建框架，概述了其应用，并指出了当前的挑战及未来发展方向。\n\n- **[基于大型语言模型的代理的兴起与潜力：综述](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.07864)** (*2023*) `SCIS`\n  > 本文综述了基于 LLM 的代理，提出了一个通用框架，探讨了其应用场景，深入研究了代理社会，并讨论了关键议题与问题。\n\n- **[大型语言模型对齐：综述](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.15025)** (*2023*) `Arxiv`\n  > 本综述对 LLM 对齐方法进行了分类，探讨了相关问题，给出了基准测试，并展望了未来研究方向，以推动更具能力和更安全的 LLM 发展。\n\n- **[语言模型可能造成的伦理与社会危害风险](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.04359)** (*2021*) `Arxiv`\n  > 本文分析了大规模 LLM 带来的风险，列出了六个领域中的 21 种风险，提出了缓解措施，并指出了进一步研究的方向。\n\n- **[基础模型的机遇与风险](https:\u002F\u002Farxiv.org\u002Fabs\u002F2108.07258)** (*2021*) `Arxiv`\n  > 本文详细探讨了基础模型的机遇与风险，指出了涌现能力及同质化问题，并呼吁开展跨学科研究。\n\n- **[迈向可信的人工智能开发：支持可验证声明的机制](https:\u002F\u002Farxiv.org\u002Fabs\u002F2004.07213)** (*2020*) `Arxiv`\n  > 本文提出了不同利益相关者提升人工智能声明可验证性的步骤，分析了十种机制，并给出了相关建议。\n\n- 
**[可行动的审计：公开披露商业AI产品偏见性能结果的影响研究](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3306618.3314244?casa_token=1ogqoO70pDgAAAAA:7r8-ICJ2Ym55Fg2aaW11gpz7FR15yYHzuqBdGu7ifBfkiMRdbknxo34ItX_GwjeUZPg9k4U22tRX)** (*2019*) `AIES`\n  > 本文通过Gender Shades审计案例分析了公开披露AI偏见结果的影响，表明此举能够促使企业减少算法偏差。\n\n\n\n### 工具\n\n- **[ToolCoder：面向大型语言模型的系统性代码赋能工具学习框架](http:\u002F\u002Farxiv.org\u002Fabs\u002F2502.11404)** (*2025*) `Arxiv`\n  > 提出ToolCoder框架，将工具学习重新表述为代码生成任务。该框架能将用户查询转换为Python脚手架，实现代码复用与系统化调试。\n\n- **[VTool-R1：基于多模态工具使用的强化学习，让视觉语言模型学会借助图像思考](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19255)** (*2025*) `Arxiv`\n  > 介绍了VTool-R1，这是首个训练视觉语言模型进行多模态思维链的框架。它将视觉工具整合到强化反馈机制中，使模型能够在无需过程监督的情况下进行策略性工具使用。\n\n- **[Re-Invoke：用于零样本工具检索的工具调用重写方法](http:\u002F\u002Farxiv.org\u002Fabs\u002F2408.01875)** (*2024*) `Arxiv`\n  > 提出了Re-Invoke方法，这是一种针对大规模工具集的无监督工具检索技术，结合查询合成、意图提取和多视角排序来实现高效检索。\n\n- **[工具链：大型语言模型作为自动化的多工具学习者](http:\u002F\u002Farxiv.org\u002Fabs\u002F2405.16533)** (*2024*) `Arxiv`\n  > 提出了LLM的自动工具链（ATC）概念，将其视为多工具使用者；同时提出了一种黑盒探针方法用于工具学习，并构建了ToolFlow基准测试集。\n\n- **[EASYTOOL：通过简洁的工具指令增强基于LLM的智能体](http:\u002F\u002Farxiv.org\u002Fabs\u002F2401.06201)** (*2024*) `Arxiv`\n  > 介绍了EASYTOOL框架，能够将多样化的工具文档转化为简洁明了的指令，从而提升LLM的工具使用能力。\n\n- **[ToolGen：通过生成式方法实现统一的工具检索与调用](http:\u002F\u002Farxiv.org\u002Fabs\u002F2410.03439)** (*2024*) `Arxiv`\n  > 提出了ToolGen框架，通过独特标记将工具知识融入LLM参数中，将工具检索转化为生成任务，从而显著提升LLM的多功能性和自主性。\n\n- **[ToolNet：通过工具图谱将大型语言模型与海量工具连接起来](http:\u002F\u002Farxiv.org\u002Fabs\u002F2403.00839)** (*2024*) `Arxiv`\n  > 本文提出了ToolNet插拔式框架，将工具以图谱形式组织起来，使LLM能够更高效地处理数千种工具。\n\n- **[ToolPlanner：一种具备路径规划与反馈功能的多粒度指令增强型工具LLM](http:\u002F\u002Farxiv.org\u002Fabs\u002F2409.14826)** (*2024*) `Arxiv`\n  > 本文构建了MGToolBench以反映真实场景，并提出了ToolPlanner框架，包含路径规划与反馈机制，以更好地完成任务和遵循指令。\n\n- **[利用执行反馈提升语言模型的工具学习能力](https:\u002F\u002Faclanthology.org\u002F2024.naacl-long.195\u002F)** (*2024*) `*ACL`\n  > 提出了TRICE两阶段框架，允许模型从工具执行反馈中学习，从而决定何时以及如何有效使用工具。\n\n- **[利用大型语言模型改进REST 
API测试](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3639476.3639769)** (*2024*) `ICSE-NIER`\n  > 本文提出了RESTGPT，利用LLM从API规范中提取规则并生成测试值，以解决现有方法的局限性。\n\n- **[想象中的LLM：通过模拟试错进行工具学习](https:\u002F\u002Faclanthology.org\u002F2024.acl-long.570\u002F)** (*2024*) `*ACL`\n  > 现有的LLM在工具使用方面的准确性较低。本文提出了STE方法，这是一种受生物启发的技术，结合尝试、想象和记忆，以提升工具学习效果。\n\n- **[情境技能：解锁大型语言模型的组合能力](https:\u002F\u002Faclanthology.org\u002F2024.findings-emnlp.812\u002F)** (*2024*) `*ACL`\n  > 本文提出了情境技能（SKiC）提示方法，应用于上下文学习中，以释放LLM的组合能力，并实现零样本泛化。\n\n- **[TaskMatrix.AI：通过连接基础模型与数百万个API完成任务](https:\u002F\u002Fspj.science.org\u002Fdoi\u002F10.34133\u002Ficomputing.0063)** (*2024*) `Others`\n  > 本文提出将基础模型与API连接以完成任务，利用模型的对话和代码生成等能力实现实际应用。\n\n- **[Gorilla：连接海量API的大规模语言模型](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2024\u002Fhash\u002Fe4c61f578ff07830f5c37378dd3ecb0d-Abstract-Conference.html)** (*2024*) `NeurIPS`\n  > 开发了经过RAT训练优化的Gorilla模型，采用微调后的LLaMA架构，有效缓解幻觉问题，并通过检索机制更好地编写API调用代码，相关成果已在APIBench上得到验证。\n\n- **[大型语言模型作为工具制造者](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.17126)** (*2024*) `ICLR`\n  > 本文提出了LATM闭环框架，使LLM能够自主制造并使用工具，通过分工实现成本效益，并扩展缓存的应用范围。\n\n- **[多智能体协作：释放智能LLM代理的力量](http:\u002F\u002Farxiv.org\u002Fabs\u002F2306.03314)** (*2023*) `Arxiv`\n  > 该框架创新性地提升了LLM的能力，解决了其局限性，并通过多种案例展示了其在AGI领域的潜力。\n\n- **[推荐AI代理：整合大型语言模型实现交互式推荐](http:\u002F\u002Farxiv.org\u002Fabs\u002F2308.16505)** (*2023*) `Arxiv`\n  > 本文通过“InteRecAgent”将推荐模型与LLM桥梁连接，以LLM为大脑、推荐模型为工具，实现了交互式推荐功能。\n\n- **[ToolLLM：助力大型语言模型掌握超过16000个真实世界API](http:\u002F\u002Farxiv.org\u002Fabs\u002F2307.16789)** (*2023*) `Arxiv`\n  > 介绍了ToolLLM框架，包含ToolBench数据集、新型决策树算法和ToolEval评估工具。通过对LLaMA进行微调，得到了具有良好泛化能力的ToolLLaMA模型。\n\n- **[TPTU-v2：提升大型语言模型驱动的智能体在现实系统中任务规划与工具使用能力](http:\u002F\u002Farxiv.org\u002Fabs\u002F2311.11315)** (*2023*) `Arxiv`\n  > 本文提出了一种框架，旨在提升基于LLM的智能体在现实系统中的TPTU能力，包括API检索器、LLM微调器和演示选择器。\n\n- **[TPTU：基于大型语言模型的AI智能体，用于任务规划与工具使用](http:\u002F\u002Farxiv.org\u002Fabs\u002F2308.03427)** (*2023*) `Arxiv`\n  > 
提出了基于LLM的AI智能体框架，设计了两种类型的智能体，并评估了TPTU能力，为LLM在AI领域的应用提供了指导。\n\n- **[GPT4Tools：通过自我指令教导大型语言模型使用工具](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Fhash\u002Fe393677793767624f2821cec8bdd02f1-Abstract-Conference.html?utm_campaign=Artificial%2BIntelligence%2BWeekly&utm_medium=email&utm_source=Artificial_Intelligence_Weekly_411)** (*2023*) `NeurIPS`\n  > 提出GPT4Tools方法，通过自我指令使开源LLM具备工具使用能力，提供基准测试，并展示其广泛的适用性。\n\n- **[API-Bank：面向工具增强型LLM的综合基准测试](https:\u002F\u002Faclanthology.org\u002F2023.emnlp-main.187\u002F)** (*2023*) `*ACL`\n  > 引入API-Bank用于工具增强型LLM。开发了评估体系，构建了训练数据集，并指出了未来的研究挑战。\n\n- **[ChatCoT：基于聊天式大型语言模型的工具增强型思维链推理](https:\u002F\u002Faclanthology.org\u002F2023.findings-emnlp.985\u002F)** (*2023*) `*ACL`\n  > 提出ChatCoT框架，这是一种适用于聊天式LLM的工具增强型思维链推理方法，将思维链建模为多轮对话，统一了推理与工具使用。\n\n- **[ToolQA：用于LLM问答任务的外部工具使用数据集](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Fhash\u002F9cb2a7495900f8b602cb10159246a016-Abstract-Datasets_and_Benchmarks.html)** (*2023*) `NeurIPS`\n  > 推出了ToolQA数据集，用于评估LLM在问答任务中对外部工具的使用能力。采用可扩展的策划方式，最小化数据重叠，并提供了新的评估方向。\n\n- **[关于开源大型语言模型的工具操控能力](http:\u002F\u002Farxiv.org\u002Fabs\u002F2305.16504)** (*2023*) `Arxiv`\n  > 该文重新审视了开源LLM在工具操控方面的经典方法，创建了ToolBench，并提供了一种实用的人工监督方案。\n\n- **[RestGPT：将大型语言模型与现实世界的RESTful API连接](http:\u002F\u002Farxiv.org\u002Fabs\u002F2306.06624)** (*2023*) `Arxiv`\n  > 本文提出RestGPT，通过规划机制和API执行器将LLM与RESTful API连接起来。同时提供了RestBench用于评估。\n\n- **[Toolformer：语言模型可以自我学习使用工具](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Fhash\u002Fd842425e4bf79ba039352da0f658a906-Abstract-Conference.html)** (*2023*) `NeurIPS`\n  > 该文提出Toolformer，使LM能够通过少量示例，借助API自主学习使用外部工具，从而提升零样本任务的表现。\n\n- **[WebCPM：面向中文长文本问答的交互式网络搜索](https:\u002F\u002Faclanthology.org\u002F2023.acl-long.499\u002F)** (*2023*) `*ACL`\n  > 介绍了WebCPM，这是首个包含交互式网络搜索的中文LFQA数据集。记录了搜索行为，对模型进行了微调，并公开了相关资源。\n\n- 
**[ToolCoder：教导代码生成模型使用API搜索工具](http:\u002F\u002Farxiv.org\u002Fabs\u002F2305.04032)** (*2023*) `Arxiv`\n  > 提出ToolCoder，将API搜索工具整合到代码生成过程中。利用ChatGPT进行标注和微调，在这一过程中创新性地融入了工具。\n\n- **[ToolAlpaca：基于3000个模拟案例的语言模型通用工具学习](http:\u002F\u002Farxiv.org\u002Fabs\u002F2306.05301)** (*2023*) `Arxiv`\n  > 该文提出了ToolAlpaca框架，用于生成工具使用语料并让紧凑型模型学习通用技能，证明了此类模型的可行性。\n\n- **[ToolkenGPT：通过工具嵌入增强冻结语言模型的大规模工具使用能力](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Fhash\u002F8fd1a81c882cd45f64958da6284f4a3f-Abstract-Conference.html)** (*2023*) `NeurIPS`\n  > 该文提出ToolkenGPT，利用工具嵌入使LLM掌握类似预测token的工具使用技能，解决了现有集成方法的局限性。\n\n- **[MultiTool-CoT：GPT-3可通过思维链提示同时使用多个外部工具](https:\u002F\u002Faclanthology.org\u002F2023.acl-short.130\u002F)** (*2023*) `*ACL`\n  > 提出MultiTool-CoT框架，利用思维链提示将多个外部工具整合到推理过程中，以提升在NumGLUE上的表现。\n\n- **[CREATOR：用于解耦大型语言模型抽象与具体推理的工具创建](https:\u002F\u002Faclanthology.org\u002F2023.findings-emnlp.462\u002F)** (*2023*) `*ACL`\n  > 提出CREATOR，使LLM能够创建工具，从而分离工具的创建与执行。引入了Creation Challenge，革新了问题解决范式。\n\n- **[GEAR：用通用且高效的工具解析增强语言模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.08775)** (*2023*) `Arxiv`\n  > 介绍了GEAR，一种通用且高效的查询-工具对接算法，可委托给SLM\u002FLLM执行，在降低成本的同时提高了精度。\n\n- **[Dify](https:\u002F\u002Fgithub.com\u002Flanggenius\u002Fdify)** (*2023*)\n  > Dify是一个开源的LLM应用开发平台。其界面集成了多种功能，支持从原型快速部署到生产环境。\n\n- **[LangChain](https:\u002F\u002Fgithub.com\u002Flangchain-ai\u002Flangchain)** (*2023*)\n  > LangChain简化了LLM应用的生命周期，提供了开发组件、生产工具以及基于大模型代理的部署平台。\n\n- **[WebGPT：结合浏览器辅助与人工反馈的问答系统](http:\u002F\u002Farxiv.org\u002Fabs\u002F2112.09332)** (*2022*) `Arxiv`\n  > 对GPT-3进行微调，使其能够结合网页浏览进行长篇问答，采用模仿学习、人工反馈和参考文献收集等方法，是一种新颖的尝试。\n\n- **[Task Bench：用于评估并行运行时性能的参数化基准测试](https:\u002F\u002Fwww.computer.org\u002Fcsdl\u002Fproceedings-article\u002Fsc\u002F2020\u002F999800a864\u002F1oeOToMWZBC)** (*2020*) `SC`\n  > Task Bench是针对分布式编程系统的参数化基准测试工具。它简化了基准测试流程，并引入了一种名为METG的新指标来评估系统性能。\n\n\n\n\n## 🤝 贡献\n\n我们欢迎各位贡献者扩充本合集。您可以：\n- 提交拉取请求以添加论文或资源\n- 开启议题建议新增论文或资源\n- 
通过[我们的提交表单](https:\u002F\u002Fforms.office.com\u002Fr\u002FsW0Zzymi5b)提交您的论文，或发送邮件至luo.junyu@outlook.com\n\n我们将定期更新仓库，以纳入最新的研究成果。\n\n## 📝 引用\n\n如果您觉得本综述有所帮助，请考虑引用我们的工作：\n\n```\n\n@article{agentsurvey2025,\n  title={Large Language Model Agent: A Survey on Methodology, Applications and Challenges},\n  author={Luo, J. and Zhang, W. and Yuan, Y. and others},\n  journal={arXiv preprint arXiv:2503.21460},\n  year={2025}\n}\n\n```\n\n---\n\n\u003Cp align=\"center\">\n  \u003Ci>如有任何疑问或建议，请开启议题或联系仓库维护者。\u003C\u002Fi>\n\u003C\u002Fp>","# Awesome-Agent-Papers 快速上手指南\n\n`Awesome-Agent-Papers` 是一个专注于大语言模型（LLM）智能体研究的论文合集仓库，并非可执行的软件工具或代码库。因此，它**不需要安装环境、依赖包或运行命令**。本指南将指导开发者如何高效地浏览、检索和利用该资源列表。\n\n## 🌐 环境准备\n\n由于该项目本质是一个托管在 GitHub 上的文档集合，您只需具备以下基础条件即可开始使用：\n\n*   **操作系统**：任意支持现代浏览器的系统（Windows, macOS, Linux）。\n*   **前置依赖**：\n    *   稳定的网络连接（用于访问 GitHub 和 ArXiv 等学术网站）。\n    *   Web 浏览器（推荐 Chrome, Edge 或 Firefox）。\n*   **可选工具**：\n    *   **Git**：如果您希望将论文列表克隆到本地进行离线浏览或贡献内容。\n    *   **PDF 阅读器**：用于阅读下载的论文文件。\n\n> 💡 **国内访问提示**：\n> 如果直接访问 GitHub 或 ArXiv 速度较慢，建议配置以下加速方案：\n> *   **GitHub 加速**：使用镜像站（如 `https:\u002F\u002Fgithubfast.com\u002F`）或通过修改 hosts 文件优化连接。\n> *   **ArXiv 论文下载**：推荐使用国内镜像源，例如：\n>     *   中科大镜像：`https:\u002F\u002Farxiv.paperswithcode.com\u002F` (部分支持) 或直接访问 `https:\u002F\u002Farxiv.org\u002F` 配合学术加速器。\n>     *   许多国内高校图书馆提供 ArXiv 文献传递服务。\n\n## 📥 获取与浏览步骤\n\n您无需执行复杂的安装命令，可通过以下两种方式获取资源：\n\n### 方式一：在线浏览（推荐）\n\n直接访问 GitHub 仓库页面，利用目录结构快速查找所需领域的论文。\n\n1.  打开浏览器，访问项目主页：\n    ```text\n    https:\u002F\u002Fgithub.com\u002Fluo-junyu\u002FAwesome-Agent-Papers\n    ```\n2.  向下滚动至 **📚 Resource List** 部分。\n3.  根据研究兴趣点击对应的分类标题（如 `Agent Collaboration`, `Agent Construction` 等）。\n4.  点击论文标题链接（通常指向 ArXiv 或会议官网）查看摘要或下载 PDF。\n\n### 方式二：本地克隆（适合贡献者或离线整理）\n\n如果您希望将列表保存到本地或使用 Git 跟踪更新，可执行以下命令：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fluo-junyu\u002FAwesome-Agent-Papers.git\ncd Awesome-Agent-Papers\n```\n\n*注：克隆后，您可以直接在本地用 Markdown 编辑器打开 `README.md` 文件进行浏览。*\n\n## 🚀 基本使用示例\n\n假设您是一名开发者，想要寻找关于\"**多智能体协作（Multi-Agent Collaboration）**\"的最新研究成果，请按以下步骤操作：\n\n1.  
**定位分类**：\n    在 README 中找到 `### Agent Collaboration` 章节。\n\n2.  **筛选论文**：\n    浏览该列表，关注标题和年份。例如，若您对 2025 年的最新框架感兴趣，可以找到：\n    *   **[Foam-Agent: Towards Automated Intelligent CFD Workflows](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.04997)** (*2025*)\n    *   **[Chain of Agents: Large language models collaborating on long-context tasks](https:\u002F\u002Fresearch.google\u002Fblog\u002Fchain-of-agents-large-language-models-collaborating-on-long-context-tasks\u002F)** (*2025*)\n\n3.  **获取全文**：\n    点击论文标题链接。以 `Foam-Agent` 为例，点击后将跳转至 ArXiv 页面。\n    *   在 ArXiv 页面右侧点击 **\"View PDF\"** 或 **\"Download\"** 获取全文。\n    *   若下载缓慢，复制论文编号（如 `2505.04997`），在国内镜像站（如 `https:\u002F\u002Far5iv.org\u002Fhtml\u002F2505.04997` 或高校镜像）中搜索该编号进行下载。\n\n4.  **追踪综述**：\n    若想全面了解该领域，优先阅读仓库顶部的官方综述论文：\n    *   点击 **[Read our survey paper here](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.21460)** 获取系统性总结。\n\n通过以上步骤，您可以高效地利用 `Awesome-Agent-Papers` 追踪 LLM Agent 领域的最前沿动态，无需任何复杂的配置。","某 AI 初创团队正致力于研发一套多智能体协作系统，用于自动化处理复杂的金融数据分析任务。\n\n### 没有 Awesome-Agent-Papers 时\n- **文献检索如大海捞针**：研究人员需在 arXiv 等平台手动搜索关键词，难以区分高质量论文与过时内容，耗时数周仍无法覆盖最新进展。\n- **技术选型缺乏体系**：面对“智能体构建”与“协作机制”等碎片化概念，团队难以理清架构设计的逻辑脉络，导致原型开发反复试错。\n- **忽视潜在安全风险**：由于缺乏对“智能体安全”领域的系统性认知，初期方案未考虑对抗攻击防护，埋下严重隐患。\n- **评估标准模糊不清**：找不到权威的基准测试（Benchmarks）数据集，无法量化评估多智能体间的协作效率与竞争策略优劣。\n\n### 使用 Awesome-Agent-Papers 后\n- **前沿动态一键掌握**：直接利用其分类清晰的资源列表，快速定位到 2025 年最新的《Foam-Agent》等关键论文，将调研周期从数周缩短至两天。\n- **架构设计有据可依**：参考仓库中关于“智能体演化”与“工具集成”的结构化综述，迅速确立了分层协作架构，避免了重复造轮子。\n- **安全防线提前部署**：通过\"Security\"专栏深入理解最新威胁模型，在系统设计阶段即融入了针对性的防御协议。\n- **性能评估科学量化**：直接采用推荐的 `MultiAgentBench` 等基准测试集，精准量化了智能体在复杂任务中的协作表现，加速了迭代优化。\n\nAwesome-Agent-Papers 将分散的研究孤岛连接成完整的知识地图，让团队从盲目探索转向高效创新。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fluo-junyu_Awesome-Agent-Papers_0dcbdccd.png","luo-junyu","Junyu Luo","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fluo-junyu_842924d2.jpg","LLM researcher","Peking 
University",null,"https:\u002F\u002Fgithub.com\u002Fluo-junyu",2620,94,"2026-04-15T06:35:30",1,"","未说明",{"notes":86,"python":84,"dependencies":87},"该项目是一个学术论文和资源列表集合（Awesome List），并非可执行的软件代码库，因此没有特定的操作系统、GPU、内存或 Python 版本等运行环境需求。用户只需通过浏览器查看或通过 Git 克隆仓库即可获取内容。",[],[13,35,14],[90,91,92,93],"agent","awesome-list","llm","llmagents","2026-03-27T02:49:30.150509","2026-04-16T02:04:12.777136",[],[]]