[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-WindyLab--LLM-RL-Papers":3,"tool-WindyLab--LLM-RL-Papers":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",159267,2,"2026-04-17T11:29:14",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 
WindyLab/LLM-RL-Papers

Description: Monitoring recent cross-research on LLM & RL on arXiv for control. If there are good papers, PRs are welcome.

Summary: LLM-RL-Papers is an open-source knowledge base focused on tracking frontier research at the intersection of large language models (LLMs) and reinforcement learning (RL). It addresses a common pain point in this fast-moving, fragmented field: researchers struggle to keep up efficiently with the latest progress on control-oriented applications such as game agents and robot navigation.

The project continuously monitors new papers on arXiv and organizes the literature systematically by technical method. Its core feature is grouping papers into categories such as "direct action" and "indirect guidance," showing clearly how LLMs can strengthen an RL agent's decision-making as a policy teacher, a feedback mechanism, or a planning core. It also collects several high-quality survey articles that help readers build a complete picture, from conceptual taxonomies to concrete methods.

LLM-RL-Papers is well suited to AI researchers, algorithm engineers, and students interested in embodied intelligence. Developers looking for inspiration and scholars trying to grasp the convergence of these two fields will both find valuable references here. As a community-driven project, it welcomes submissions of high-quality new papers to keep this knowledge-sharing ecosystem active.

# LLM RL Papers

1. Monitoring recent cross-research on LLM & RL;
2. Focusing on combining their capabilities for **control** (such as game characters, robotics);
3. Feel free to open PRs if you want to share the good papers you've read.

***

## Table of Contents

* <a href="#research-review" style="color: black; text-decoration: none; font-size: 20px; font-weight: 700">Research Review</a>

   + [LLM-based Multi-Agent Reinforcement Learning: Current and Future Directions](#llm-based-multi-agent-reinforcement-learning-current-and-future-directions)
   + [A Survey on Large Language Model-Based Game Agents](#a-survey-on-large-language-model-based-game-agents)
   + [Survey on Large Language Model-Enhanced Reinforcement Learning: Concept, Taxonomy, and Methods](#survey-on-large-language-model-enhanced-reinforcement-learning-concept-taxonomy-and-methods)
   + [The RL and LLM Taxonomy Tree: Reviewing Synergies Between Reinforcement Learning and Large Language Models](#the-rl-and-llm-taxonomy-tree-reviewing-synergies-between-reinforcement-learning-and-large-language-models)

* <a href="#llm-rl-papers" style="color: black; text-decoration: none; font-size: 20px; font-weight: 700">LLM RL Papers [sorted by method]</a>

   - **Action**

     - Directly

       →[iLLM-TSC: Integration reinforcement learning and large language model for traffic signal control policy improvement](#iLLM-TSC-Integration-reinforcement-learning-and-large-language-model-for-traffic-signal-control-policy-improvement)

       →[SRLM: Human-in-Loop Interactive Social Robot Navigation with Large Language Model and Deep Reinforcement Learning](#srlm-human-in-loop-interactive-social-robot-navigation-with-large-language-model-and-deep-reinforcement-learning)

       →[Knowledgeable Agents by Offline Reinforcement Learning from Large Language Model Rollouts](#knowledgeable-agents-by-offline-reinforcement-learning-from-large-language-model-rollouts)

       →[Policy Improvement using Language Feedback Models](#policy-improvement-using-language-feedback-models)

       →[True Knowledge Comes from Practice: Aligning LLMs with Embodied Environments via Reinforcement Learning](#true-knowledge-comes-from-practice-aligning-llms-with-embodied-environments-via-reinforcement-learning)

       →[Large Language Model as a Policy Teacher for Training Reinforcement Learning Agents](#large-language-model-as-a-policy-teacher-for-training-reinforcement-learning-agents)

       →[LLM Augmented Hierarchical Agents](#llm-augmented-hierarchical-agents)

       →[Large Language Models as Generalizable Policies for Embodied Tasks](#large-language-models-as-generalizable-policies-for-embodied-tasks)

       →[Octopus: Embodied Vision-Language Programmer from Environmental Feedback](#octopus-embodied-vision-language-programmer-from-environmental-feedback)

       →[RE-MOVE: An Adaptive Policy Design for Robotic Navigation Tasks in Dynamic Environments via Language-Based Feedback](#re-move-an-adaptive-policy-design-for-robotic-navigation-tasks-in-dynamic-environments-via-language-based-feedback)

       →[Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning](#grounding-large-language-models-in-interactive-environments-with-online-reinforcement-learning)

       →[Collaborating with language models for embodied reasoning](#collaborating-with-language-models-for-embodied-reasoning)

       →[Inner Monologue: Embodied Reasoning through Planning with Language Models](#inner-monologue-embodied-reasoning-through-planning-with-language-models)

       →[Do As I Can, Not As I Say: Grounding Language in Robotic Affordances](#do-as-i-can-not-as-i-say-grounding-language-in-robotic-affordances)
       →[Keep CALM and Explore: Language Models for Action Generation in Text-based Games](#keep-calm-and-explore-language-models-for-action-generation-in-text-based-games)

     - Indirectly

       →[Large Language Model Guided Reinforcement Learning Based Six-Degree-of-Freedom Flight Control](#Large-Language-Model-Guided-Reinforcement-Learning-Based-Six-Degree-of-Freedom-Flight-Control)

       →[Enabling Intelligent Interactions between an Agent and an LLM: A Reinforcement Learning Approach](#enabling-intelligent-interactions-between-an-agent-and-an-llm-a-reinforcement-learning-approach)

       →[RL-GPT: Integrating Reinforcement Learning and Code-as-policy](#rl-gpt-integrating-reinforcement-learning-and-code-as-policy)

   - **Data Preference**

     →[Reinforcement Learning from LLM Feedback to Counteract Goal Misgeneralization](#reinforcement-learning-from-llm-feedback-to-counteract-goal-misgeneralization)

   - **Data Generation**

     →[RLingua: Improving Reinforcement Learning Sample Efficiency in Robotic Manipulations With Large Language Models](#rlingua-improving-reinforcement-learning-sample-efficiency-in-robotic-manipulations-with-large-language-models)

   - **Environment Configuration**

     →[Enhancing Autonomous Vehicle Training with Language Model Integration and Critical Scenario Generation](#enhancing-autonomous-vehicle-training-with-language-model-integration-and-critical-scenario-generation)

     →[EnvGen: Generating and Adapting Environments via LLMs for Training Embodied Agents](#envgen-generating-and-adapting-environments-via-llms-for-training-embodied-agents)

   - **Path Point**

     →[HighwayLLM: Decision-Making and Navigation in Highway Driving with RL-Informed Language Model](#HighwayLLM-Decision-Making-and-Navigation-in-Highway-Driving-with-RL-Informed-Language-Model)

   - **Prediction**

     →[Learning to Model the World with Language](#learning-to-model-the-world-with-language)

   - **Reward Function**

     →[Agentic Skill Discovery](#agentic-skill-discovery)

     →[LEAGUE++: EMPOWERING CONTINUAL ROBOT LEARNING THROUGH GUIDED SKILL ACQUISITION WITH LARGE LANGUAGE MODELS](#league-empowering-continual-robot-learning-through-guided-skill-acquisition-with-large-language-models)

     →[PREDILECT: Preferences Delineated with Zero-Shot Language-based Reasoning in Reinforcement Learning](#predilect-preferences-delineated-with-zero-shot-language-based-reasoning-in-reinforcement-learning)

     →[Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft](#auto-mc-reward-automated-dense-reward-design-with-large-language-models-for-minecraft)

     →[Accelerating Reinforcement Learning of Robotic Manipulations via Feedback from Large Language Models](#accelerating-reinforcement-learning-of-robotic-manipulations-via-feedback-from-large-language-models)

     →[Eureka: Human-Level Reward Design via Coding Large Language Models](#eureka-human-level-reward-design-via-coding-large-language-models)

     →[Motif: Intrinsic Motivation from Artificial Intelligence Feedback](#motif-intrinsic-motivation-from-artificial-intelligence-feedback)

     →[Text2Reward: Automated Dense Reward Function Generation for Reinforcement Learning](#text2reward-automated-dense-reward-function-generation-for-reinforcement-learning)
     →[Self-Refined Large Language Model as Automated Reward Function Designer for Deep Reinforcement Learning in Robotics](#self-refined-large-language-model-as-automated-reward-function-designer-for-deep-reinforcement-learning-in-robotics)

     →[Language to Rewards for Robotic Skill Synthesis](#language-to-rewards-for-robotic-skill-synthesis)

     →[Reward Design with Language Models](#reward-design-with-language-models)

     →[Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals](#read-and-reap-the-rewards-learning-to-play-atari-with-the-help-of-instruction-manuals)

   - **Skills Planning**

     →[Skill Reinforcement Learning and Planning for Open-World Long-Horizon Tasks](#skill-reinforcement-learning-and-planning-for-open-world-long-horizon-tasks)

     →[Long-horizon Locomotion and Manipulation on a Quadrupedal Robot with Large Language Model](#long-horizon-locomotion-and-manipulation-on-a-quadrupedal-robot-with-large-language-model)

   - **State Representation**

     →[LLM-Empowered State Representation for Reinforcement Learning](#LLM-Empowered-State-Representation-for-Reinforcement-Learning)

     →[Natural Language Reinforcement Learning](#natural-language-reinforcement-learning)

     →[State2Explanation: Concept-Based Explanations to Benefit Agent Learning and User Understanding](#state2explanation-concept-based-explanations-to-benefit-agent-learning-and-user-understanding)

   - **Task Suggestion**

     →[Hierarchical Continual Reinforcement Learning via Large Language Model](#hierarchical-continual-reinforcement-learning-via-large-language-model)

     →[AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents](#autort-embodied-foundation-models-for-large-scale-orchestration-of-robotic-agents)

     →[Language and Sketching: An LLM-driven Interactive Multimodal Multitask Robot Navigation Framework](#language-and-sketching-an-llm-driven-interactive-multimodal-multitask-robot-navigation-framework)

     →[LgTS: Dynamic Task Sampling using LLM-generated sub-goals for Reinforcement Learning Agents](#lgts-dynamic-task-sampling-using-llm-generated-sub-goals-for-reinforcement-learning-agents)

     →[RLAdapter: Bridging Large Language Models to Reinforcement Learning in Open Worlds](#rladapter-bridging-large-language-models-to-reinforcement-learning-in-open-worlds)

     →[ExpeL: LLM Agents Are Experiential Learners](#expel-llm-agents-are-experiential-learners)

     →[Guiding Pretraining in Reinforcement Learning with Large Language Models](#guiding-pretraining-in-reinforcement-learning-with-large-language-models)

   - **Transformers Framework**

     →[Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning](#unleashing-the-power-of-pre-trained-language-models-for-offline-reinforcement-learning)

     →[AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents](#amago-scalable-in-context-reinforcement-learning-for-adaptive-agents)

     →[Transformers are Sample-Efficient World Models](#transformers-are-sample-efficient-world-models)

* <a href="#foundational-approaches-in-reinforcement-learning" style="color: black; text-decoration: none; font-size: 20px; font-weight: 700">Foundational Approaches in Reinforcement Learning</a>

   →[Using Natural Language for Reward Shaping in Reinforcement Learning](#using-natural-language-for-reward-shaping-in-reinforcement-learning)
   →[DQN-TAMER: Human-in-the-Loop Reinforcement Learning with Intractable Feedback](#dqn-tamer-human-in-the-loop-reinforcement-learning-with-intractable-feedback)

   →[Overcoming Exploration in Reinforcement Learning with Demonstrations](#overcoming-exploration-in-reinforcement-learning-with-demonstrations)

   →[Automatic Goal Generation for Reinforcement Learning Agents](#automatic-goal-generation-for-reinforcement-learning-agents)

* <a href="#open-source-rl-environment" style="color: black; text-decoration: none; font-size: 20px; font-weight: 700">Open source RL environment</a>

***

## Research Review

##### LLM-based Multi-Agent Reinforcement Learning: Current and Future Directions

- Paper Link: [arXiv 2405.11106](https://arxiv.org/abs/2405.11106)
- Overview:

<img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_67d08d2810af.png" style="zoom: 40%;" />

Potential research directions for language-conditioned Multi-Agent Reinforcement Learning (MARL).
(a) Personality-enabled cooperation, where different robots have different personalities defined by the commands.
(b) Language-enabled human-on-the-loop frameworks, where humans supervise robots and provide feedback.
(c) Traditional co-design of MARL and LLM, where knowledge about different aspects of the LLM is distilled into smaller models that can be executed on board.

***

##### A Survey on Large Language Model-Based Game Agents

- Paper Link: [arXiv 2404.02039](https://arxiv.org/abs/2404.02039), [Homepage](https://github.com/git-disl/awesome-LLM-game-agent-papers)
- Overview:

<img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_92b41b647af3.png" style="zoom:33%;" />

The conceptual architecture of LLMGAs. At each game step, the **perception** module perceives the multimodal information from the game environment, including text, images, symbolic states, and so on. The agent retrieves essential memories from the **memory** module and takes them, along with the perceived information, as input for **thinking** (reasoning, planning, and reflection), enabling itself to formulate strategies and make informed decisions. The **role-playing** module affects the decision-making process to ensure that the agent's behavior aligns with its designated character. Then the **action** module translates generated action descriptions into executable and admissible actions for altering game states at the next game step.
Finally, the **learning** module serves to continuously improve the agent's cognitive and game-playing abilities through accumulated gameplay experience.

<img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_3884ed678e52.png" style="zoom:50%;" />

<center>Mind map for the learning module</center>

***

##### Survey on Large Language Model-Enhanced Reinforcement Learning: Concept, Taxonomy, and Methods

- Paper Link: [arXiv 2404.00282](https://arxiv.org/abs/2404.00282)
- Overview:

![Refer to caption](https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_9af618a2841f.png)

Framework of LLM-enhanced RL in classical agent-environment interactions, where the LLM plays different roles in enhancing RL.

***

##### The RL and LLM Taxonomy Tree: Reviewing Synergies Between Reinforcement Learning and Large Language Models

- Paper Link: [arXiv 2402.01874](https://arxiv.org/abs/2402.01874)

- Overview:

    <img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_0b86a03f5b9e.png" style="zoom: 67%;" />

    This study proposes a novel taxonomy of three main classes based on how RL and LLMs interact with each other:

    - RL4LLM: RL is used to improve the performance of LLMs on tasks related to Natural Language Processing.
    - LLM4RL: An LLM assists the training of an RL model that performs a task not inherently related to natural language.
    - RL+LLM: An LLM and an RL agent are embedded in a common planning framework, without either of them contributing to the training or fine-tuning of the other.

***

## LLM RL Papers

##### Language-Conditioned Offline RL for Multi-Robot Navigation

- Paper Link: [arXiv 2407.20164](https://arxiv.org/abs/2407.20164), [Homepage](https://sites.google.com/view/llm-marl)
- Overview:

![](https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_34f51116474a.png)

![](https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_125ac0e40a6f.png)

The proposed multi-robot model architecture. Each agent receives a different natural language task and a local observation. They summarize each natural language task g_i into a latent representation z_i using an LLM. The function *f* is a graph neural network that encodes local observations o_1, o_2, ... and task embeddings z_1, z_2, ... into a task-dependent state representation s_i|z for each agent *i*. They learn a local policy *π* conditioned on the state-task representation. The functions *π* and *f* are learned entirely from a fixed dataset using offline RL. Because they compute z_i only once per task, the LLM is not part of the perception-action loop, allowing the policy to act quickly.
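To make the division of labor concrete, here is a minimal, runnable sketch of the pattern the caption describes: the task string is embedded once per episode, and only a lightweight policy runs in the perception-action loop. Everything below is illustrative — `embed_task` stands in for the LLM encoder, and a toy linear model stands in for the graph neural network *f* and the policy *π*; none of it is the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_task(task: str, dim: int = 16) -> np.ndarray:
    """Stand-in for the LLM task encoder: called once per task, not once per step."""
    seed = abs(hash(task)) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

# Toy stand-ins for f (the paper uses a GNN over neighbouring agents) and the policy head.
W_f = rng.standard_normal((16 + 4, 8))     # fuses [task embedding, local observation]
W_pi = rng.standard_normal((8, 2))         # maps fused state to action logits

def act(obs: np.ndarray, z: np.ndarray) -> int:
    s = np.tanh(np.concatenate([z, obs]) @ W_f)   # task-conditioned state s_i|z
    return int(np.argmax(s @ W_pi))               # greedy action from the local policy

z = embed_task("push the crate to the loading dock")  # LLM cost paid once per task
for step in range(5):                                  # fast perception-action loop
    obs = rng.standard_normal(4)                       # local observation o_i
    print(step, act(obs, z))
```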
***

##### LLM-Empowered State Representation for Reinforcement Learning

- Paper Link: [arXiv 2407.13237](https://arxiv.org/abs/2407.13237), [Homepage](https://github.com/thu-rllab/LESR)
- Overview:

![framework](https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_1942e7c7e0e4.png)

Conventional state representations in reinforcement learning often omit critical task-related details, presenting a significant challenge for value networks in establishing accurate mappings from states to task rewards. Traditional methods typically depend on extensive sample learning to enrich state representations with task-specific information, which leads to low sample efficiency and high time costs. Recently, knowledgeable large language models (LLMs) have provided promising substitutes for prior injection with minimal human intervention. Motivated by this, the authors propose LLM-Empowered State Representation (LESR), a novel approach that uses an LLM to autonomously generate task-related state representation code, which helps to enhance the continuity of network mappings and facilitate efficient training. Experimental results demonstrate that LESR exhibits high sample efficiency and outperforms state-of-the-art baselines by an average of 29% in accumulated reward in Mujoco tasks and 30% in success rate in Gym-Robotics tasks.
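The core mechanism is easy to picture as code: the LLM emits a small state-feature function, which is compiled and used to augment the raw observation before it reaches the RL agent. The snippet below is a schematic under that assumption; the feature function is hard-coded here for a pendulum-like state rather than actually produced by an LLM, and the interface is invented for illustration.

```python
import numpy as np

# In LESR this snippet would be written by the LLM from the task description;
# here it is hard-coded for a pendulum-like state [cos(theta), sin(theta), angular velocity].
LLM_GENERATED_SRC = """
def task_features(state):
    import math
    cos_t, sin_t, vel = state
    angle = math.atan2(sin_t, cos_t)        # distance from the upright position
    return [abs(angle), vel * vel]          # features the reward actually depends on
"""

namespace: dict = {}
exec(LLM_GENERATED_SRC, namespace)          # turn the generated source into a callable
task_features = namespace["task_features"]

def augment(state):
    """Concatenate the raw state with LLM-derived task features before feeding the RL agent."""
    return np.concatenate([state, task_features(state)])

print(augment(np.array([0.9, 0.43, -0.2])))
```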
***

##### iLLM-TSC: Integration reinforcement learning and large language model for traffic signal control policy improvement

- Paper Link: [arXiv 2407.06025](https://arxiv.org/abs/2407.06025), [Homepage](https://github.com/Traffic-Alpha/iLLM-TSC)
- Overview:

<img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_dac3dfa72a8b.png" alt="img" style="zoom:45%;" />

The authors introduce a framework called iLLM-TSC that combines an LLM and an RL agent for TSC. This framework initially employs an RL agent to make decisions based on environmental observations and policies learned from the environment, thereby providing preliminary actions. Subsequently, an LLM agent refines these actions by considering real-world situations and leveraging its understanding of complex environments. This approach enhances the TSC system's adaptability to real-world conditions and improves the overall stability of the framework.

***

##### Large Language Model Guided Reinforcement Learning Based Six-Degree-of-Freedom Flight Control

- Paper Link: [IEEE 2024.3411015](https://ieeexplore.ieee.org/abstract/document/10551749)
- Overview:

![](https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_f3eb600bd774.png)

LLM-guided reinforcement learning framework.
This paper proposes an LLM-guided deep reinforcement learning framework for IFC, which uses LLM guidance to achieve intelligent flight control under limited computational resources. The LLM provides direct guidance during training based on local knowledge, which improves the quality of the data generated in agent-environment interaction within DRL, expedites training, and offers timely feedback to agents, thereby partially mitigating sparse-reward issues. Additionally, they present an effective reward function to comprehensively balance the aircraft's coupled control and ensure stable, flexible control. Finally, simulations and experiments show that the proposed techniques have good performance, robustness, and adaptability across various flight tasks, laying a foundation for future research in the intelligent air combat decision-making domain.

***

##### Agentic Skill Discovery

- Paper Link: [arXiv 2405.15019](https://arxiv.org/abs/2405.15019), [Homepage](https://agentic-skill-discovery.github.io/)
- Overview:

<img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_9af3c426a15f.png" style="zoom: 30%;" />

Agentic Skill Discovery gradually acquires contextual skills for table manipulation.

<img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_59496c593f8a.png" style="zoom:40%;" />

Contextual skill acquisition loop of ASD. Given the environment setup and the robot's current abilities, an LLM continually *proposes* tasks for the robot to complete, and successful completions are collected as acquired skills, each with several neural network variants (*options*).

***

##### HighwayLLM: Decision-Making and Navigation in Highway Driving with RL-Informed Language Model

- Paper Link: [arXiv 2405.13547](https://arxiv.org/abs/2405.13547)
- Overview:

<img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_f8f736e11a05.png" style="zoom:75%;" />

LLM-based vehicle trajectory planning structure: the RL agent observes the traffic (surrounding vehicles) and provides a high-level action for a lane change. Then, the LLM agent retrieves the highD dataset using FAISS and provides the next three trajectory points.

***

##### LEAGUE++: EMPOWERING CONTINUAL ROBOT LEARNING THROUGH GUIDED SKILL ACQUISITION WITH LARGE LANGUAGE MODELS

- Paper Link: https://openreview.net/forum?id=xXo4JL8FvV, [Homepage](https://sites.google.com/view/continuallearning)
- Overview:

<img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_e63d15a0f811.png" style="zoom: 33%;" />

The authors present a framework that utilizes LLMs to guide continual learning. They integrated LLMs to handle task decomposition and operator creation for TAMP, and to generate dense rewards for RL skill learning, which enables online autonomous learning for long-horizon tasks. They also use a semantic skills library to enhance learning efficiency for new skills.

***

##### Knowledgeable Agents by Offline Reinforcement Learning from Large Language Model Rollouts

- Paper Link: [arXiv 2404.09248](https://arxiv.org/abs/2404.09248)
- Overview:

<img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_9bacb13ef8c9.png" alt="Refer to caption" style="zoom:50%;" />

Overall procedure of KALM, consisting of three key modules:
(A) LLM grounding module that grounds the LLM in the environment and aligns the LLM with inputs of environmental data;
(B) Rollout generation module that prompts the LLM to generate data for novel skills;
(C) Skill acquisition module that trains the policy with offline RL. Finally, KALM derives a policy that is trained on both offline data and imaginary data.
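A rough sketch of how modules (B) and (C) interact, with stubs in place of the grounded LLM and the offline RL learner: LLM-generated ("imaginary") rollouts are simply mixed with the logged transitions before training. The prompt format, mixing ratio, and data schema below are assumptions made for illustration, not details from the paper.

```python
import random

# Offline transitions logged from the environment.
real_data = [{"obs": [0.1, 0.2], "action": 1, "reward": 0.0, "source": "env"}]

# "Imaginary" rollouts: the grounded LLM is prompted with a novel skill description
# and asked to emit plausible transitions. Hypothetical stub below, not the paper's prompt.
def llm_rollout(skill: str, length: int = 2):
    return [{"obs": [0.0, 0.0], "action": 0, "reward": 1.0, "source": f"llm:{skill}"}
            for _ in range(length)]

imaginary_data = llm_rollout("open the drawer")

def sample_batch(batch_size: int = 4, imaginary_ratio: float = 0.5):
    """Mix real and LLM-generated transitions; the mixed batch feeds any offline RL learner."""
    batch = []
    for _ in range(batch_size):
        pool = imaginary_data if random.random() < imaginary_ratio else real_data
        batch.append(random.choice(pool))
    return batch

print(sample_batch())
```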
***

##### Enhancing Autonomous Vehicle Training with Language Model Integration and Critical Scenario Generation

- Paper Link: [arXiv 2404.08570](https://arxiv.org/abs/2404.08570)
- Overview:

![Refer to caption](https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_66013343e98a.jpg)

An architecture diagram mapping out the various components of CRITICAL. The framework first sets up an environment configuration based on typical real-world traffic from the highD dataset. These configurations are then leveraged to generate Highway Env scenarios. At the end of each episode, the authors collect data including failure reports, risk metrics, and rewards, repeating this process multiple times to gather a collection of configuration files with associated scenario risk assessments. To enhance RL training, the authors analyze the distribution of configurations based on risk metrics, identifying those conducive to critical scenarios. The authors then either directly use these configurations for new scenarios or prompt an LLM to generate critical scenarios.

***

##### Long-horizon Locomotion and Manipulation on a Quadrupedal Robot with Large Language Model

- Paper Link: [arXiv 2404.05291](https://arxiv.org/abs/2404.05291)
- Overview:

<img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_b08f05355d26.png" alt="Refer to caption" style="zoom: 50%;" />

Overview of the hierarchical system for the long-horizon loco-manipulation task. The system is built up from a reasoning layer for task decomposition (yellow) and a controlling layer for skill execution (purple). Given the language description of the long-horizon task (top), a cascade of LLM agents performs high-level task planning and generates function calls of parameterized robot skills. The controlling layer instantiates the mid-level motion planning and low-level controlling skills with RL.

***

##### Yell At Your Robot: Improving On-the-Fly from Language Corrections

- Paper Link: [arXiv 2403.12910](https://arxiv.org/abs/2403.12910), [Homepage](https://yay-robot.github.io/)

- Framework Overview:

    ![](https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_94ac31f8f5f8.jpeg)

    The authors operate in a hierarchical setup where a high-level policy generates language instructions for a low-level policy that executes the corresponding skills. During deployment, humans can intervene through corrective language commands, temporarily overriding the high-level policy and directly influencing the low-level policy for on-the-fly adaptation. These interventions are then used to finetune the high-level policy, improving its future performance.

    ![](https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_488800f3c977.png)

    The system processes RGB images and the robot's current joint positions as inputs, outputting target joint positions for motor actions. The high-level policy uses a Vision Transformer to encode visual inputs and predicts language embeddings. The low-level policy uses ACT, a Transformer-based model, to generate precise motor actions for the robot, guided by language instructions.
This architecture enables the robot to interpret commands like "Pick up the bag" and translate them into targeted joint movements.

***

##### SRLM: Human-in-Loop Interactive Social Robot Navigation with Large Language Model and Deep Reinforcement Learning

- Paper Link: [arXiv 2403.15648](https://arxiv.org/abs/2403.15648)
- Overview:

<img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_4ce9267abf86.png" alt="Refer to caption" style="zoom:50%;" />

SRLM architecture: SRLM is implemented as a human-in-loop interactive social robot navigation framework, which executes human commands by incorporating an LM-based planner, a feedback-based planner, and a DRL-based planner. First, users' requests or real-time feedback are processed or re-planned via the LLM into high-level task guidance for three action executors. Then, the image-to-text encoder and the spatio-temporal graph HRI encoder convert the robot's local observation information into features as LNM and RLNM input, which generate an RL-based action, an LM-based action, and a feedback-based action. Lastly, the above three actions are adaptively fused by a low-level execution decoder as the robot behavior output of SRLM.

***

##### EnvGen: Generating and Adapting Environments via LLMs for Training Embodied Agents

- Paper Link: [arXiv 2403.12014](https://arxiv.org/abs/2403.12014), [Homepage](https://envgen-llm.github.io/)

- Framework Overview:

    ![](https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_1eac74536778.png)

    In the EnvGen framework, the authors generate multiple environments with an LLM to let the agent learn different skills effectively, over N training cycles, each consisting of the following four steps.

    **Step 1:** provide an LLM with a prompt composed of four components (*i.e.*, task description, environment details, output template, and feedback from the previous cycle), and ask the LLM to fill the template and output various environment configurations that can be used to train agents on different skills.

    **Step 2:** train a small RL agent in the LLM-generated environments.

    **Step 3:** train the agent in the original environment to allow for better generalization, and then measure the RL agent's training progress by letting it explore the original environment.

    **Step 4:** provide the LLM with the agent performance from the original environment (measured in step 3) as feedback for adapting the LLM environments in the next cycle to focus on the weaker-performing skills.

- Review:
    The highlight of this paper is that it uses the LLM to design the initial training environment conditions, which helps the RL agent learn a strategy for long-horizon tasks more quickly. This is a concept of decomposing long-horizon tasks into smaller tasks and then retraining, accelerating the training efficiency of RL. It also uses a feedback mechanism that allows the LLM to revise the conditions based on the training results of RL.
Only four interactions with the LLM are needed to significantly improve the training efficiency of RL and reduce the usage cost of the LLM.
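A compact sketch of the four-step cycle described above, with stub functions standing in for the LLM call, the RL training runs, and the evaluation in the original environment. The JSON config format, the `fake_llm` helper, and the 0.5 skill threshold are invented for illustration only.

```python
import json, random

def propose_configs(llm, feedback: dict) -> list[dict]:
    """Step 1: ask the LLM for training-environment configs, biased by last cycle's feedback."""
    prompt = "Return a JSON list of environment configs targeting weak skills: " + json.dumps(feedback)
    return json.loads(llm(prompt))

def fake_llm(prompt: str) -> str:                     # stand-in for a real LLM call
    return json.dumps([{"skill": "collect_wood", "difficulty": 1}])

def train_in(config: dict) -> None:                   # Step 2: RL updates in the generated env (stub)
    pass

def evaluate_original() -> dict:                      # Step 3: probe skills in the original env (stub)
    return {"collect_wood": random.random(), "craft_table": random.random()}

feedback: dict = {}
for cycle in range(4):                                # the review above notes only a few cycles are needed
    for cfg in propose_configs(fake_llm, feedback):
        train_in(cfg)
    scores = evaluate_original()
    feedback = {"weak_skills": [s for s, v in scores.items() if v < 0.5]}   # Step 4
    print(cycle, feedback)
```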
***

##### LEAGUE++: EMPOWERING CONTINUAL ROBOT LEARNING THROUGH GUIDED SKILL ACQUISITION WITH LARGE LANGUAGE MODELS

- Paper Link: https://openreview.net/forum?id=xXo4JL8FvV, [Homepage](https://sites.google.com/view/continuallearning)
- Overview:

![](https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_a9196abb32ab.png)

This paper presents a framework that utilizes LLMs to guide continual learning. It integrates LLMs to handle task decomposition and operator creation for TAMP, and to generate dense rewards for RL skill learning, which enables online autonomous learning for long-horizon tasks. It also uses a semantic skills library to enhance learning efficiency for new skills.

***

##### RLingua: Improving Reinforcement Learning Sample Efficiency in Robotic Manipulations With Large Language Models

- Paper Link: [arXiv 2403.06420](https://arxiv.org/abs/2403.06420), [Homepage](https://rlingua.github.io/)

- Framework Overview:

    <img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_a51b4be49708.png" alt="RLingua framework" style="zoom: 80%;" />

    (a) Motivation: LLMs do not need environment samples and are easy for non-experts to communicate with. However, the robot controllers generated directly by LLMs may have inferior performance. In contrast, RL can be used to train robot controllers to achieve high performance. However, the cost of RL is its high sample complexity. (b) Framework: RLingua extracts the internal knowledge of LLMs about robot motion into a coded, imperfect controller, which is then used to collect data by interacting with the environment. The robot control policy is trained with both the collected LLM demonstration data and the interaction data collected by the online training policy.

    <img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_fe06bbbd6a6b.png" alt="RLingua 2" style="zoom:50%;" />

    The framework of prompt design with human feedback. The task descriptions and coding guidelines are prompted in sequence. The human feedback is provided after observing the preliminary LLM controller's execution process on the robot.

- Review:

    The highlight of this article is the simultaneous use of the LLM and RL to generate training data for the online training policy. The control code generated by the LLM is also treated as a policy, achieving a unified mathematical form. The main function of this policy is to run on the robot and sample data. The focus of this article is on the design of the LLM side, that is, the two types of prompt processes (with human feedback and with a code template) and how to design the prompts. The design of the prompts is very detailed and worth learning from.
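The data-collection idea can be sketched in a few lines: an imperfect, LLM-written controller seeds the replay buffer early on, and the learning policy gradually takes over. The rule-based `llm_controller`, the decay schedule, and the buffer layout below are all illustrative assumptions, not RLingua's actual implementation.

```python
import random

def llm_controller(obs):
    """Stand-in for the imperfect controller whose code the LLM writes from the task prompt."""
    return 1 if obs[0] < 0.0 else 0          # crude rule; good enough to seed useful data

def learned_policy(obs):
    return random.randint(0, 1)              # placeholder for the online RL policy

buffer = []
p_llm = 1.0                                   # rely on the LLM prior early on...
for episode in range(200):
    obs = [random.uniform(-1, 1)]
    use_prior = random.random() < p_llm
    action = llm_controller(obs) if use_prior else learned_policy(obs)
    buffer.append((obs, action, use_prior))   # both kinds of transitions train the policy
    p_llm = max(0.05, p_llm * 0.98)           # ...and fade toward the agent's own experience

print(sum(1 for *_, prior in buffer if prior), "of", len(buffer), "transitions came from the LLM prior")
```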
***

##### RL-GPT: Integrating Reinforcement Learning and Code-as-policy

- Paper Link: [arXiv 2402.19299](https://arxiv.org/abs/2402.19299), [Homepage](https://sites.google.com/view/rl-gpt/)

- Framework Overview:

    <img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_8a547510f105.png" alt="RL-GPT framework" style="zoom: 50%;" />

    The overall framework consists of a slow agent (orange) and a fast agent (green). The slow agent decomposes the task and determines "which actions" to learn. The fast agent writes code and RL configurations for low-level execution.

- Review:

    This framework integrates "Code as Policies", "RL training", and "LLM planning". It first allows the LLM to decompose tasks into actions, which are then further decomposed based on their complexity. Simple actions can be directly coded, while complex actions use a combination of code and RL. The framework also applies a Critic to continuously improve the code and planning. The highlight of this paper is the integration of the LLM's code into RL's action space for training, and this interactive approach is worth learning from.

***

##### How Can LLM Guide RL? A Value-Based Approach

- Paper Link: [arXiv 2402.16181](https://arxiv.org/abs/2402.16181), [Homepage](https://github.com/agentification/Language-Integrated-VI)

- Framework Overview:

    ![](https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_b166efe03765.png)

    Demonstration of the SLINVIT algorithm in the ALFWorld environment when N=2 and the tree breadth of BFS is set to k=3. The task is to "clean a cloth and put it on the countertop". The hallucination that the LLM faces, i.e., that the towel should be taken (instead of the cloth), is addressed by the inherent exploration mechanism in the RL framework.

- Review:

    The main idea of this article is to assign the task to an LLM, explore extensively within a BFS (Breadth-First Search) framework, generate multiple policies, and propose two ways to estimate value. One approach is based on code, suitable for scenarios where achieving the goal involves fulfilling multiple preconditions. The other approach relies on Monte Carlo methods. The best policy with the highest value is then selected and combined with the RL policy to enhance data sampling and policy improvement.

***

##### PREDILECT: Preferences Delineated with Zero-Shot Language-based Reasoning in Reinforcement Learning

- Paper Link: [arXiv 2402.15420](https://arxiv.org/abs/2402.15420), [Homepage](https://sites.google.com/view/rl-predilect)
- Overview:

<img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_c0bc792b78ca.png" alt="Refer to caption" style="zoom:50%;" />

<img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_a2367d701573.png" alt="Refer to caption" style="zoom:50%;" />

An overview of PREDILECT in a social navigation scenario: initially, a human is shown two trajectories, A and B. They signal their preference for one of the trajectories and provide an additional text prompt to elaborate on their insights. Subsequently, an LLM is employed to extract feature sentiment, revealing the causal reasoning embedded in the text prompt, which is processed and mapped to a set of intrinsic values. Finally, both the preferences and the highlighted insights are utilized to more accurately define a reward function.

***

##### Policy Improvement using Language Feedback Models

- Paper Link: [arXiv 2402.07876](https://arxiv.org/abs/2402.07876)

- Framework Overview:

    ![](https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_9dfc638ebe2d.png)

***

##### Natural Language Reinforcement Learning

- Paper Link: [arXiv 2402.07157](https://arxiv.org/abs/2402.07157)

- Framework Overview:

    <img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_d5117c87694f.png" style="zoom:50%;" />

    The authors present an illustrative grid-world MDP example to show how NLRL and traditional RL differ in task objective, value function, Bellman equation, and generalized policy iteration. In this grid-world, the robot needs to reach the crown and avoid all dangers. They assume the robot policy takes the optimal action at each non-terminal state, except for a uniformly random policy at state b.

- Review:

    This paper employs RL as a pipeline for the LLM, which is an intriguing research approach. The optimal policy within the framework aligns with the task description. The quality of each state and state-action value depends on how well they align with the task description. The state-action description comprises both the reward and the description of the next state, and the state description is a summary of all possible state-action descriptions.

    During the policy evaluation step, the state description mimics either the Monte Carlo (MC) or Temporal Difference (TD) methods commonly used in RL. MC focuses on multi-step moves, evaluating based on the final state, while TD emphasizes single-step moves, returning the description of the next state. Finally, the LLM synthesizes all results to derive the current state description. In the policy improvement step, the LLM selects the best state-action pair to make decisions regarding actions.

***

##### Hierarchical Continual Reinforcement Learning via Large Language Model

- Paper Link: [arXiv 2401.15098](https://arxiv.org/abs/2401.15098)

- Framework Overview:

  <img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_4688ec15008c.png" alt="Hi_Core framework" style="zoom:67%;" />

  The illustration of the proposed framework. The middle section depicts the internal interactions (**light gray line**) and external interactions (**dark gray line**) in Hi-Core. Internally, the CRL agent is structured in two layers: the high-level policy formulation (**orange**) and the low-level policy learning (**green**). Furthermore, the policy library (**blue**) is constructed to store and retrieve policies. The three surrounding boxes illustrate the internal workflow when the agent encounters new tasks.

- Method Overview:

    The high-level LLM is used to generate a series of goals g_i. The low level is goal-directed RL, which needs to generate a policy in response to the goals. The policy library is used to store successful policies. When encountering new tasks, the library can retrieve relevant experience to assist the high-level and low-level policy agents.
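A minimal sketch of that loop, assuming the simplest possible interfaces: the high-level LLM is reduced to a canned goal decomposition, and the policy library is a plain dictionary from goal strings to stored policy identifiers. Names such as `llm_propose_goals` and the stored policy ids are hypothetical.

```python
def llm_propose_goals(task: str) -> list[str]:
    """Stand-in for the high-level LLM that decomposes a task into goals g_i."""
    return ["reach door", "open door", "enter room"] if "room" in task else [task]

policy_library: dict[str, str] = {"reach door": "policy_reach_v3"}   # goal -> stored policy id

def solve(task: str) -> None:
    for goal in llm_propose_goals(task):
        if goal in policy_library:                       # retrieve experience for known goals
            print(f"{goal}: reuse {policy_library[goal]}")
        else:                                            # otherwise train a goal-conditioned policy
            policy_library[goal] = f"policy_{goal.replace(' ', '_')}_v1"
            print(f"{goal}: trained {policy_library[goal]}")

solve("enter the locked room")
```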
***

##### True Knowledge Comes from Practice: Aligning LLMs with Embodied Environments via Reinforcement Learning

- Paper Link: [arXiv 2401.14151](https://arxiv.org/abs/2401.14151), [Homepage](https://github.com/WeihaoTan/TWOSOME)

- Framework Overview:

    <img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_cd89f9e769a4.png" style="zoom: 67%;" />

    Overview of how TWOSOME generates a policy using joint probabilities of actions. The colored areas in the token blocks indicate the probabilities of the corresponding tokens in the actions.

- Method Overview:

    The authors propose the *True knoWledge cOmeS frOM practicE* (**TWOSOME**) online framework. It deploys LLMs as embodied agents to efficiently interact and align with environments via RL to solve decision-making tasks, without a prepared dataset or prior knowledge of the environments. They use the log-likelihood scores of each token provided by LLMs to calculate the joint probabilities of each action and form valid behavior policies.
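The policy construction is concrete enough to sketch: score each admissible action phrase by its token log-likelihoods under the LLM, then normalize across actions to get a valid distribution. The stubbed `token_logprobs` and the per-token averaging below are illustrative; TWOSOME discusses normalization variants precisely so that longer action phrases are not penalized.

```python
import math

def token_logprobs(prompt: str, continuation: str) -> list[float]:
    """Stand-in for per-token log-likelihoods the LLM assigns to `continuation` given `prompt`."""
    return [-0.5 * len(tok) for tok in continuation.split()]   # fake scores for illustration

def action_policy(prompt: str, actions: list[str]) -> dict[str, float]:
    # Joint log-prob of each action = sum of its token log-probs (averaged here per token).
    scores = []
    for a in actions:
        lps = token_logprobs(prompt, a)
        scores.append(sum(lps) / len(lps))
    z = [math.exp(s) for s in scores]
    total = sum(z)
    return {a: p / total for a, p in zip(actions, z)}          # valid distribution over actions

pi = action_policy("You are in the kitchen. Next action:",
                   ["open the fridge", "pick up the pot", "go to bedroom"])
print(pi)
```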
***

##### AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents

- Paper Link: [arXiv 2401.12963](https://arxiv.org/abs/2401.12963), [Homepage](https://auto-rt.github.io/)

- Framework Overview:

    <img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_380204b98b1c.png" style="zoom:40%;" />

    AutoRT is an exploration into scaling up robots to unstructured "in the wild" settings. The authors use VLMs to produce open-vocabulary descriptions of what the robot sees, then pass that description to an LLM which proposes natural language instructions. The proposals are then critiqued by another LLM using what they call a *robot constitution*, to refine the instructions towards safer, completable behavior. This lets them run robots in more diverse environments where they do not know ahead of time which objects the robot will encounter, collecting data on self-generated tasks.

- Review:

    The main contribution of this paper is the design of a framework that uses an LLM to assign tasks to robots based on the current scene and skills. During the task execution phase, various robot learning methods, such as Reinforcement Learning (RL), can be employed. The data obtained during execution is then added to the database.

    Through this iterative process, and with the addition of multiple robots, the data collection process can be automated and accelerated. This high-quality data can be used for training more robots in the future. This work lays the foundation for robot learning based on a large amount of real physics data.

***

##### Reinforcement Learning from LLM Feedback to Counteract Goal Misgeneralization

- Paper Link: [arXiv 2401.07181](https://arxiv.org/abs/2401.07181)

<img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_741b82dcba14.png" style="zoom:67%;" />

LLM preference modelling and reward model. The RL agent is deployed on the LLM-generated dataset and its rollouts are stored. The LLM compares pairs of rollouts and provides preferences, which are used to train a new reward model. The reward model is then integrated into the remaining training timesteps of the agent.

***

##### Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft

- Paper Link: [arXiv 2312.09238](https://arxiv.org/abs/2312.09238), [Homepage](https://yangxue0827.github.io/auto_mc-reward.html)
- Overview:

![](https://yangxue0827.github.io/auto_mc-reward_files/pipeline_v3.png)

Overview of Auto MC-Reward. Auto MC-Reward consists of three key LLM-based components: Reward Designer, Reward Critic, and Trajectory Analyzer. A suitable dense reward function is iterated through continuous interaction between the agent and the environment for reinforcement learning training on specific tasks, so that the model can better complete the task. An example of exploring diamond ore is shown in the figure: i) the Trajectory Analyzer finds that the agent dies from lava in the failed trajectory, and then gives a suggestion to punish encountering lava; ii) the Reward Designer adopts the suggestion and updates the reward function; iii) the revised reward function passes the review of the Reward Critic, and finally the agent avoids the lava by turning left.

***

##### Large Language Model as a Policy Teacher for Training Reinforcement Learning Agents

- Paper Link: [arXiv 2311.13373](https://arxiv.org/abs/2311.13373), [Homepage](https://github.com/ZJLAB-AMMI/LLM4Teach)

- Framework Overview:

    <img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_d8a6de34b198.png" style="zoom:67%;" />

    An illustration of the LLM4Teach framework using the MiniGrid environment as an exemplar. The LLM-based teacher agent responds to observations of the state provided by the environment by offering soft instructions. These instructions take the form of a distribution over a set of suggested actions. The student agent is trained to optimize two objectives simultaneously. The first one is to maximize the expected return, the same as in traditional RL algorithms. The other one is to encourage the student agent to follow the guidance provided by the teacher. As the student agent's expertise increases during the training process, the weight assigned to the second objective gradually decreases over time, reducing its reliance on the teacher.
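In code, the two objectives can be combined as an ordinary RL loss plus a distillation term toward the teacher's action distribution, with the distillation weight annealed over training. The KL direction, the decay schedule, and all numbers below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def kl(p: np.ndarray, q: np.ndarray) -> float:
    """KL(p || q) between two discrete action distributions."""
    eps = 1e-8
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

teacher = np.array([0.7, 0.2, 0.1])                   # soft instruction from the LLM teacher
for step in [0, 5_000, 50_000]:
    lam = 1.0 / (1.0 + step / 10_000)                 # guidance weight decays as the student improves
    student = np.array([0.4, 0.35, 0.25])             # student policy probs at this state (toy values)
    rl_loss = 0.8                                     # placeholder for the usual actor/critic loss
    total_loss = rl_loss + lam * kl(student, teacher) # return objective + decaying guidance objective
    print(f"step={step:>6}  lambda={lam:.2f}  loss={total_loss:.3f}")
```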
***

##### Language and Sketching: An LLM-driven Interactive Multimodal Multitask Robot Navigation Framework

- Paper Link: [arXiv 2311.08244](https://arxiv.org/abs/2311.08244)

- Framework Overview:

    <img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_3bec032a0854.png" style="zoom: 50%;" />

The framework contains an LLM module, an Intelligent Sensing Module, and a Reinforcement Learning Module.

***

##### LLM Augmented Hierarchical Agents

- Paper Link: [arXiv 2311.05596](https://arxiv.org/abs/2311.05596)

- Framework Overview:

    <img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_f38a4cbf8f42.png" style="zoom: 67%;" />

The LLM guides the high-level policy and accelerates learning. It is prompted with the context, some examples, and the current task and observation. The LLM's output biases high-level action selection.

***

##### Accelerating Reinforcement Learning of Robotic Manipulations via Feedback from Large Language Models

- Paper Link: [arXiv 2311.02379](https://arxiv.org/abs/2311.02379)
- Overview:

<img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_93d6d558f088.png" style="zoom:67%;" />

Depiction of the proposed Lafite-RL framework. Before learning a task, a user provides designed prompts, including descriptions of the current task background and the desired robot behaviors, and specifications of the LLM's missions, each with several rules. Then, Lafite-RL enables an LLM to "observe" and understand the scene information, which includes the robot's past action, and to evaluate the action under the current task requirements. The language parser transforms the LLM response into evaluative feedback for constructing interactive rewards.
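A toy version of that reward path: a (stubbed) LLM judge rates the last transition from a textual scene description, a parser maps the verdict onto a scalar, and the scalar is added to the sparse environment reward. The verdict vocabulary and the 0.1 weighting are assumptions made for this sketch, not the paper's settings.

```python
def llm_judge(scene: str, action: str) -> str:
    """Stand-in for the LLM that 'observes' the described scene and rates the last action."""
    return "good" if "closer" in scene else "bad"

def parse_feedback(response: str) -> float:
    """Language parser: map the LLM's verdict onto an evaluative reward term."""
    return {"good": 1.0, "bad": -1.0}.get(response.strip().lower(), 0.0)

env_reward = 0.0                                  # sparse task reward at this step
scene = "gripper moved closer to the handle"      # textual scene description fed to the LLM
interactive = parse_feedback(llm_judge(scene, "move_arm_forward"))
shaped_reward = env_reward + 0.1 * interactive    # weighting of LLM feedback is an assumption
print(shaped_reward)
```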
***

##### Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning

- Paper Link: [arXiv 2310.20587](https://arxiv.org/abs/2310.20587), [Homepage](https://lamo2023.github.io/)
- Overview:

![](https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_50f290a075ce.png)

The overview of LaMo. LaMo mainly consists of two stages: (1) pre-training LMs on language tasks, and (2) freezing the pre-trained attention layers, replacing the linear projections with MLPs, and using LoRA to adapt to RL tasks. The authors also apply the language loss during the offline RL stage as a regularizer.

***

##### Large Language Models as Generalizable Policies for Embodied Tasks

- Paper Link: [arXiv 2310.17722](https://arxiv.org/abs/2310.17722), [Homepage](https://llm-rl.github.io/)
- Overview:

![img](https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_84987d32a982.jpg)

By utilizing reinforcement learning together with a pre-trained LLM and maximizing only sparse rewards, the method can learn a policy that generalizes to novel language rearrangement tasks. The method robustly generalizes over unseen objects and scenes, novel ways of referring to objects (either by description or by explanation of an activity), and even novel descriptions of tasks, including a variable number of rearrangements, spatial descriptions, and conditional statements.

***

##### Eureka: Human-Level Reward Design via Coding Large Language Models

- Paper Link: [arXiv 2310.12931](https://arxiv.org/abs/2310.12931), [Homepage](https://eureka-research.github.io/)

- Framework Overview:

    <img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_d3db19d18ee2.png" style="zoom:67%;" />

EUREKA takes unmodified environment source code and a language task description as context to zero-shot generate executable reward functions from a coding LLM. Then, it iterates between reward sampling, GPU-accelerated reward evaluation, and reward reflection to progressively improve its reward outputs.

- Review:

    The LLM in this article is used to design the reward function for RL. The main focus is on how to create a well-designed reward function. There are two approaches:

    1. **Evolutionary Search**: Initially, a large number of reward functions are generated, and their evaluation is done using hardcoded methods.
    2. **Reward Reflection**: During training, intermediate reward variables are saved and fed back to the LLM, allowing improvements to be made based on the original reward function.

    The first approach leans more toward static analysis, while the second approach emphasizes dynamic analysis. By combining these two methods, one can select and optimize the best reward function.

***

##### AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents

- Paper Link: [arXiv 2310.09971](https://arxiv.org/abs/2310.09971), [Homepage](https://ut-austin-rpl.github.io/amago/)
- Overview:

![img](https://ut-austin-rpl.github.io/amago/src/figure/fig1_iclr_e_notation.png)

In-context RL techniques solve memory and meta-learning problems by using sequence models to infer the identity of unknown environments from test-time experience. AMAGO addresses core technical challenges to unify the performance of end-to-end off-policy RL with long-sequence Transformers in order to push memory and adaptation to new limits.

***

##### LgTS: Dynamic Task Sampling using LLM-generated sub-goals for Reinforcement Learning Agents

- Paper Link: [arXiv 2310.09454](https://arxiv.org/pdf/2310.09454.pdf)
- Overview:

<img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_a664fe8ed8df.png" style="zoom:67%;" />

(a) Gridworld domain and descriptors. The agent (red triangle) needs to collect one of the keys and open the door to reach the goal.
(b) The prompt to the LLM contains information about the number of paths n expected from the LLM and symbolic information such as the entities, predicates, and the high-level initial and goal states of the environment (with no assumptions if the truth values of certain predicates are unknown). The output from the LLM is a set of paths in the form of ordered lists. The paths are converted into a DAG. The path chosen by LgTS is highlighted in red in the DAG in Fig. (b).

***

##### Octopus: Embodied Vision-Language Programmer from Environmental Feedback

- Paper Link: [arXiv 2310.08588](https://arxiv.org/abs/2310.08588), [Homepage](https://choiszt.github.io/Octopus/)
- Overview:

<img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_0575cf4a6bc3.jpg" style="zoom:30%;" />

GPT-4 perceives the environment through the **environmental message** and produces anticipated plans and code in accordance with the detailed **system message**. This code is subsequently executed in the simulator, directing the agent to the subsequent state. For each state, the authors gather the environmental message, wherein **observed objects** and **relations** are substituted by egocentric images to serve as the training input. The response from GPT-4 acts as the training output. Environmental feedback, specifically the determination of whether each target state is met, is documented for RLEF training.

<img src="https://oss.gittoolsai.com/images/WindyLab_LLM-RL-Papers_readme_e324e16a5bea.jpg" style="zoom:30%;" />

The provided image depicts a comprehensive pipeline for data collection and training. In the **Data Collection Pipeline**, environmental information is captured, parsed into a scene graph, and combined to generate the **environment message** and **system message**.
These messages subsequently drive agent control, culminating in executable code. For the **Octopus Training Pipeline**, the agent's vision and code are input to the Octopus model for training using both **SFT** and **RLEF** techniques. The accompanying text emphasizes the importance of a well-structured system message for GPT-4's effective code generation and notes the challenges faced due to errors, underscoring the adaptability of the model in handling a myriad of tasks. In essence, the pipeline offers a holistic approach to agent training, from environment understanding to action execution.\n\n***\n\n##### Motif: Intrinsic Motivation from Artificial Intelligence Feedback\n\n- Paper Link: [arXiv 2310.00166](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.00166), [Homepage](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fmotif)\n- Overview: \n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_781da5890c07.png\" style=\"zoom:100%;\" \u002F>\n\nA schematic representation of the three phases of Motif. In the first phase, dataset annotation, the authors extract preferences from an LLM over pairs of captions, and save the corresponding pairs of observations in a dataset alongside their annotations. In the second phase, reward training, the authors distill the preferences into an observation-based scalar reward function. In the third phase, RL training, the authors train an agent interactively with RL using the reward function extracted from the preferences, possibly together with a reward signal coming from the environment.\n\n***\n\n##### Text2Reward: Automated Dense Reward Function Generation for Reinforcement Learning\n\n- Paper Link: [arXiv 2309.11489](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.11489)\n\n- Framework Overview:\n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_fe073150965d.png)\n\n    ​\tExpert Abstraction provides an abstraction of the environment as a hierarchy of Pythonic classes. *User Instruction* describes the goal to be achieved in natural language. *User Feedback* allows users to summarize the failure mode or their preferences, which are used to improve the reward code.\n\n***\n\n##### State2Explanation: Concept-Based Explanations to Benefit Agent Learning and User Understanding\n\n- Paper Link: [arXiv 2309.12482](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.12482)\n- Overview:\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_fce4c9f00344.png\" style=\"zoom:67%;\" \u002F>\n\nS2E framework involves (a) learning a joint embedding model M from which epsilon is extracted and utilized \n(b) during agent training to inform reward shaping and benefit agent learning \n(c) at deployment to provide end-users with epsilon for agent actions\n\n***\n\n##### Self-Refined Large Language Model as Automated Reward Function Designer for Deep Reinforcement Learning in Robotics\n\n- Paper Link: [arXiv 2309.06687](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.06687) \n\n- Framework Overview: \n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_5de4fae68c84.png)\n\n    ​    The proposed self-refine LLM framework for reward function design. It consists of three steps: initial design, evaluation, and self-refinement loop. A quadruped robot forward running task is used as an example here. 
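\n\nThe three-step loop above (initial design, evaluation, self-refinement) is essentially the same generate-evaluate-reflect cycle described in the Eureka review earlier. Below is a minimal sketch of that loop; it is not taken from any paper's released code, and query_llm and train_and_evaluate are hypothetical placeholders for an LLM call and for an RL training run that scores a candidate reward function.\n\n```python\n# Minimal sketch of the LLM-driven reward-design loop (candidate generation,\n# evaluation by RL training, reflection on the results).\n# The two helpers below are hypothetical stubs, not a real API.\n\ndef query_llm(prompt):\n    '''Return Python source text for a candidate reward_fn(state, action).'''\n    raise NotImplementedError  # replace with a call to an LLM of choice\n\ndef train_and_evaluate(reward_source):\n    '''Train an RL policy with the candidate reward and return a task score.'''\n    raise NotImplementedError  # replace with an actual RL training pipeline\n\ndef design_reward(task_description, n_candidates=4, n_rounds=3):\n    best_source, best_score = None, float('-inf')\n    feedback = 'no feedback yet'\n    for _ in range(n_rounds):\n        # 1) Initial design (or re-design): sample several candidate reward functions.\n        prompt = ('Task: ' + task_description +\n                  ' Previous feedback: ' + feedback +\n                  ' Write reward_fn(state, action) in Python.')\n        candidates = [query_llm(prompt) for _ in range(n_candidates)]\n        # 2) Evaluation: score each candidate by training a policy with it.\n        scores = [train_and_evaluate(src) for src in candidates]\n        best_idx = max(range(n_candidates), key=lambda i: scores[i])\n        if scores[best_idx] > best_score:\n            best_source, best_score = candidates[best_idx], scores[best_idx]\n        # 3) Self-refinement: summarize the outcome so the next round can improve the reward.\n        feedback = 'best score so far is ' + format(best_score, '.3f')\n    return best_source\n```\n\nUnder these assumptions, the evolutionary-search step corresponds to sampling several candidates per round, and reward reflection corresponds to the feedback passed back into the next prompt; real systems such as Eureka feed back much richer training statistics (e.g., per-term reward values over training) rather than a single score.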
\n\n***\n\n##### RLAdapter: Bridging Large Language Models to Reinforcement Learning in Open Worlds\n\n- Paper Link: [arXiv 2309.17176](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.17176v1)\n\n- Framework Overview:\n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_f4f3ba2637ed.png)\n\n    ​    Overall framework of RLAdapter. In addition to receiving inputs from the environment and historical information, the prompt of the adapter model incorporates an understanding score. This score computes the semantic similarity between the agent’s recent actions and the sub-goals suggested by the LLM, determining whether the agent currently comprehends the LLM’s guidance accurately. Through the agent’s feedback and continuous fine-tuning of the adapter model, the LLM remains attuned to the actual circumstances of the task. This, in turn, ensures that the provided guidance is the most appropriate for the agents’ prioritized learning.\n\n- Review:\n\n    The paper develops the RLAdapter framework, which, in addition to the RL agent and the LLM, also includes an Adapter model. \n\n***\n\n##### ExpeL: LLM Agents Are Experiential Learners\n\n- Paper Link: [arXiv 2308.10144](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.10144), [Homepage](https:\u002F\u002Fandrewzh112.github.io\u002F#expel)\n- Overview: \n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_f8c7b2eb5aa9.png\" style=\"zoom: 50%;\" \u002F>\n\nLeft: ExpeL operates in three stages: (1) Collection of success and failure experiences into a pool. (2) Extraction\u002Fabstraction of cross-task knowledge from these experiences. (3) Application of the gained insights and recall of past successes in evaluation tasks. \nRight: (A) Illustrates the experience gathering process via Reflexion, enabling task reattempt after self-reflection on failures. (B) Illustrates the insight extraction step. When presented with success\u002Ffailure pairs or a list of L successes, the agent dynamically modifies an existing list of insights using operations ADD, UPVOTE, DOWNVOTE, and EDIT. This process has an emphasis on extracting prevalent failure patterns or best practices.\n\n***\n\n##### Language to Rewards for Robotic Skill Synthesis\n\n- Paper Link: [arXiv 2306.08647](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.08647), [Homepage](https:\u002F\u002Flanguage-to-reward.github.io\u002F)\n- Overview:\n\n![img](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_6ca32238b052.png)\n\nDetailed dataflow of the Reward Translator. A Motion Descriptor LLM takes the user input and describes the user-specified motion in natural language, and a Reward Coder translates the motion into the reward parameters.\n\n***\n\n##### Learning to Model the World with Language\n\n- Paper Link: [arXiv2308.01399](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.01399), [Homepage](https:\u002F\u002Fdynalang.github.io\u002F)\n- Overview: \n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_b4c0181febd5.png\" style=\"zoom:67%;\" \u002F>\n\nDynalang learns to use language to make predictions about future (text + image) observations and rewards, which helps it solve tasks. Here, the authors show real model predictions in the HomeGrid environment. The agent has explored various rooms while receiving video and language observations from the environment. 
From the past text “the bottle is in the living room”, the agent predicts at timesteps 61-65 that it will see the bottle in the final corner of the living room. From the text “get the bottle” describing the task, the agent predicts that it will be rewarded for picking up the bottle. The agent can also predict future text observations: given the prefix “the plates are in the” and the plates it observed on the counter at timestep 30, the model predicts the most likely next token is “kitchen.”\n\n***\n\n##### Enabling Intelligent Interactions between an Agent and an LLM: A Reinforcement Learning Approach\n\n- Paper Link: [arXiv 2306.03604](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.03604), [Homepage](https:\u002F\u002Fgithub.com\u002FZJLAB-AMMI\u002FLLM4RL)\n- Overview:\n\n![llm4rl](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_d70c1bbac770.png)\n\nAn overview of the Planner-Actor-Mediator paradigm and an example of the interactions. At each time step, the mediator takes the observation o_t as input and decides whether to ask the LLM planner for new instructions or not. When the asking policy decides to ask, as demonstrated with a red dashed line, the translator converts o_t into text descriptions, and the planner outputs a new plan accordingly for the actor to follow. On the other hand, when the mediator decides not to ask, as demonstrated with a green dashed line, the mediator returns to the actor directly, telling it to continue with the current plan.\n\n***\n\n##### Reward Design with Language Models\n\n- Paper Link: [arXiv 2303.00001](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.00001)\n\n- Framework Overview: \n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_6852bed10d5f.png\" style=\"zoom: 50%;\" \u002F>\n\n    ​\tDepiction of the framework on the DEAL OR NO DEAL negotiation task. A user provides an example and explanation of desired negotiating behavior (e.g., versatility) before training. During training, (1) they provide the LLM with a task description, a user’s description of their objective, an outcome of an episode that is converted to a string, and a question asking if the outcome episode satisfies the user objective. (2-3) They then parse the LLM’s response back into a string and use that as the reward signal for Alice, the RL agent. (4) Alice updates their weights and rolls out a new episode. (5) They parse the episode outcome into a string and continue training. During evaluation, they sample a trajectory from Alice and evaluate whether it is aligned with the user’s objective.\n\n***\n\n##### Skill Reinforcement Learning and Planning for Open-World Long-Horizon Tasks\n\n- Paper Link: [arXiv 2303.16563](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.16563) , [Homepage](https:\u002F\u002Fsites.google.com\u002Fview\u002Fplan4mc)\n\n- Framework Overview: \n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_e2b270de5a69.png)\n\n    ​\tThe authors categorize the basic skills in Minecraft into three types: Finding-skills, Manipulation-skills, and Crafting-skills. The authors train policies to acquire skills with reinforcement learning. With the help of an LLM, the authors extract relationships between skills and construct a skill graph in advance, as shown in the dashed box. 
During online planning, the skill search algorithm walks on the pre-generated graph, decomposes the task into an executable skill sequence, and interactively selects policies to solve complex tasks.\n\n- Review\n\n    ​\tThe highlight of the article lies in its use of an LLM to generate the skill graph, thereby clarifying the sequential relationship between skills. When a task is input, the framework searches the skill graph using DFS to determine the skill to be selected at each step. RL is responsible for executing the skill and updating the state, iterating this process to break down complex tasks into manageable segments. \n\n    ​\tAreas for improvement in the framework include:\n\n     1. Currently, humans need to provide the available skills first. In the future, the framework should have the ability to learn new skills autonomously.\n     2. The application of the LLM in the framework is mainly to build relationships between skills. This could potentially be achieved through hard coding instead, such as querying a Minecraft library to generate the skill graph.\n\n***\n\n##### RE-MOVE: An Adaptive Policy Design for Robotic Navigation Tasks in Dynamic Environments via Language-Based Feedback\n\n- Paper Link: [arXiv 2303.07622](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.07622), [Homepage](https:\u002F\u002Fgamma.umd.edu\u002Fresearchdirections\u002Fcrowdmultiagent\u002Fremove\u002F)\n- Overview:\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_5dcd044e7be4.png\" style=\"zoom:67%;\" \u002F>\n\n***\n\n##### Natural Language-conditioned Reinforcement Learning with Inside-out Task Language Development and Translation\n\n- Paper Link: [arXiv 2302.09368](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.09368)\n- Overview: \n\nNatural Language-conditioned reinforcement learning (RL) enables agents to follow human instructions. Previous approaches generally implemented language-conditioned RL by providing human instructions in natural language (NL) and training a following policy. In this outside-in approach, the policy needs to comprehend the NL and manage the task simultaneously. However, the unbounded NL examples often bring much extra complexity for solving concrete RL tasks, which can distract policy learning from completing the task. To ease the learning burden of the policy, the authors investigate an inside-out scheme for natural language-conditioned RL by developing a task language (TL) that is task-related and unique. The TL is used in RL to achieve highly efficient and effective policy training. Besides, a translator is trained to translate NL into TL. They implement this scheme as TALAR (TAsk Language with predicAte Representation) that learns multiple predicates to model object relationships as the TL. Experiments indicate that TALAR not only better comprehends NL instructions but also leads to a better instruction-following policy that improves the success rate by 13.4% and adapts to unseen expressions of NL instruction. The TL can also be an effective task abstraction, naturally compatible with hierarchical RL.\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_82cba77cfcbe.png\" style=\"zoom:67%;\" \u002F>\n\nAn illustration of OIL and IOL schemes in NLC-RL. \nLeft: OIL directly exposes the NL instructions to the policy. \nRight: IOL develops a task language, which is task-related and a unique representation of NL instructions. 
\nThe solid lines represent instruction following process, while the dashed lines represent TL development and translation.\n\n***\n\n##### Guiding Pretraining in Reinforcement Learning with Large Language Models\n\n- Paper Link: [arXiv 2302.06692](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.06692) , [Homepage](https:\u002F\u002Fgithub.com\u002Fyuqingd\u002Fellm)\n\n- Framework Overview: \n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_d096f77b5e21.png)\n\n    ​\tELLM uses a pretrained large language model (LLM) to suggest plausibly useful goals in a task-agnostic way. Building on LLM capabilities such as context-sensitivity and common-sense, ELLM trains RL agents to pursue goals that are likely meaningful without requiring direct human intervention.\n\n    ​\t![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_e6ffb73b6494.png)\n\n    ​    ELLM uses GPT-3 to suggest adequate exploratory goals and SentenceBERT embeddings to compute the similarity between suggested goals and demonstrated behaviors as a form of intrinsically-motivated reward.\n\n- Review: \n\n    ​    This paper is one of the earliest to use LLM for RL planning goals. The ELLM framework provides the current environmental information and available actions to the LLM, allowing it to design multiple reasonable goals based on common sense. RL then executes one of these goals. The reward function is determined based on the similarity of the embeddings of the goals and states. Since the embeddings are also generated by a  SentenceBERT model, it can also be said that the reward is generated by the LLM.\n\n***\n\n##### Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning\n\n- Paper Link: [arXiv 2302.02662](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.02662) , [Homepage](https:\u002F\u002Fgithub.com\u002Fflowersteam\u002FGrounding_LLMs_with_online_RL)\n\n- Framework Overview: \n\n    ![Main schema](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_c3fb3f61cdd5.png)\n\n    ​    The GLAM method: the authors use an LLM as agent policy in an interactive textual RL environment (BabyAI-Text) where the LLM is trained to achieve language goals using online RL (PPO), enabling functional grounding. (a) BabyAI-Text provides a goal description for the current episode as well as a description of the agent observation and a scalar reward for the current step. (b) At each step, they gather the goal description and the observation in a prompt sent to our LLM. (c) For each possible action, they use the encoder to generate a representation of the prompt and compute the conditional probability of tokens composing the action given the prompt. Once the probability of each action is estimated, they compute a softmax function over these probabilities and sample an action according to this distribution. That is, the LLM is our agent policy. (d) They use the reward returned by the environment to finetune the LLM using PPO. For this, they estimate the value of the current observation by adding a value head on top of our LLM. Finally, they backpropagate the gradient through the LLM (and its value head).\n\n- Review: \n\n  ​    This article uses BabyAI-Text to convert the goal and observation in Gridworld into text descriptions, which can then be transformed into prompts input to the LLM. 
The LLM outputs the probability of actions, and then the action probabilities output by the LLM, the value estimation obtained through MLC, and the reward are input into PPO for training. Eventually, the Agent outputs an appropriate action. In the experiment, the authors used the GFlan-T5 model, and after 250k steps of training, they achieved a success rate of 80%, which is a significant improvement compared to other methods.\n\n***\n\n##### Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals\n\n- Paper Link: [arXiv 2302.04449](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.04449)\n- Overview:\n\n\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_897fb103c22b.png\" style=\"zoom:80%;\" \u002F>\n\nAn overview of  Read and Reward framework. The system receives the current frame in the environment, and the instruction manual as input. After object detection and grounding, the QA Extraction Module extracts and summarizes relevant information from the manual, and the Reasoning Module assigns auxiliary rewards to detected in-game events by reasoning with outputs from the QA Extraction Module. The “Yes\u002FNo” answers are then mapped to +5\u002F − 5 auxiliary rewards.\n\n***\n\n##### Collaborating with language models for embodied reasoning\n\n- Paper Link: [arXiv 2302.00763](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.00763) \n\n- Framework Overview: \n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_c6b3979d8424.png)\n\n    ​    A. Schematic of the Planner-Actor-Reporter paradigm and an example of the interaction among them. B. Observation and action space of the PycoLab environment.\n\n- Review:\n\n    The framework presented in this paper is simple yet clear, and it is one of the early works on using LLM for RL policy. In this framework, the Planner is an LLM, while the Reporter and Actor are RL components. The task requires the role to first inspect the properties of an item, and then select an item with the “good” property. The framework starts with the Planner, informing it of the task description and historical execution records. The Planner then chooses an action for the Actor. After the Actor executes the action, a result is obtained. The Reporter observes the environment and provides feedback to the Planner, and this process repeats.\n\n***\n\n##### Transformers are Sample-Efficient World Models\n\n- Paper Link: [arXiv 2209.00588](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.00588), [Homepage](https:\u002F\u002Fgithub.com\u002Feloialonso\u002Firis)\n\n***\n\n##### Inner Monologue: Embodied Reasoning through Planning with Language Models\n\n- Paper Link: [arXiv 2207.05608](https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.05608), [Homepage](https:\u002F\u002Finnermonologue.github.io\u002F)\n- Overview: \n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_fcb301aae340.png)\n\n Inner Monologue enables grounded closed-loop feedback for robot planning with large language models by leveraging a collection of perception models (e.g., scene descriptors and success detectors) in tandem with pretrained language-conditioned robot skills. 
Experiments show the system can reason and replan to accomplish complex long-horizon tasks for (a) mobile manipulation and (b,c) tabletop manipulation in both simulated and real settings.\n\n***\n\n##### Do As I Can, Not As I Say: Grounding Language in Robotic Affordances\n\n- Paper Link: [arXiv 2204.01691](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.01691) , [Homepage](https:\u002F\u002Fsay-can.github.io\u002F)\n\n- Framework Overview: \n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_aba52bfdb258.png\" style=\"zoom:67%;\" \u002F>\n\n    ​\tGiven a high-level instruction, SayCan combines probabilities from an LLM (the probability that a skill is useful for the instruction) with the probabilities from a value function (the probability of successfully executing said skill) to select the skill to perform. This emits a skill that is both possible and useful. The process is repeated by appending the skill to the response and querying the models again, until the output step is to terminate. \n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_2f83ecd36c09.png)\n\n    ​\tA value function module (a) is queried to form a value function space of action primitives based on the current observation. Visualizing “pick” value functions, in (b) “Pick up the red bull can” and “Pick up the apple” have high values because both objects are in the scene, while in (c) the robot is navigating an empty space, and thus none of the pick up actions receive high values.\n\n***\n\n##### Keep CALM and Explore: Language Models for Action Generation in Text-based Games\n\n- Paper Link: [arXiv 2010.02903](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.02903), [Homepage](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002Fcalm-textgame)\n- Overview:\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_0ece76ad810d.png\" style=\"zoom: 67%;\" \u002F>\n\nCALM combined with an RL agent – DRRN – for gameplay. CALM is trained on transcripts of human gameplay for action generation. At each state, CALM generates action candidates conditioned on the game context, and the DRRN calculates the Q-values over them to select an action. Once trained, a single instance of CALM can be used to generate actions for any text-based game.\n\n***\n\n## Foundational Approaches in Reinforcement Learning\n\n>Understanding the foundational approaches in Reinforcement Learning, such as Curriculum Learning, RLHF and HITL, is crucial for our research. These methods represent the building blocks upon which modern RL techniques are built. By studying these early methods, we can gain a deeper understanding of the principles and mechanisms that underlie RL. This knowledge can then inform and inspire current work on the intersection of Large Language Models (LLMs) and RL, helping us to develop more effective and innovative solutions.\n\n***\n\n##### Using Natural Language for Reward Shaping in Reinforcement Learning\n\n- Paper Link: [arXiv 1903.02020](https:\u002F\u002Farxiv.org\u002Fabs\u002F1903.02020) \n\n- Framework Overview: \n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_184f3663ec46.png)\n\n    The framework consists of the standard RL module containing the agent-environment loop, augmented with a LanguagE Action Reward Network (LEARN) module.\n\n- Review:\n\n    ​    This article presents a method for using natural language to provide rewards. 
At that time, there were no LLMs, so this article used a large number of existing game videos and their corresponding language descriptions as the dataset. An FNN was trained to output the relatedness between the current trajectory and the language command, and this output is used as an intermediate reward. By combining it with the original sparse environment reward, the RL agent can learn the optimal policy faster, guided by both the goal and the language command.\n\n***\n\n##### DQN-TAMER: Human-in-the-Loop Reinforcement Learning with Intractable Feedback\n\n- Paper Link: [arXiv 1810.11748](https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.11748)\n- Overview: \n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_b74f31309406.png\" style=\"zoom:80%;\" \u002F>\n\nOverview of human-in-the-loop RL and the model (DQN-TAMER). The agent asynchronously interacts with a human observer in the given environment. DQN-TAMER decides actions based on two models: one (Q) estimates rewards from the environment, and the other (H) estimates feedback from the human. \n\n***\n\n##### Overcoming Exploration in Reinforcement Learning with Demonstrations\n\n- Paper Link: [arXiv 1709.10089](https:\u002F\u002Farxiv.org\u002Fabs\u002F1709.10089), [Homepage](https:\u002F\u002Fashvin.me\u002Fdemoddpg-website\u002F)\n- Overview:\n\nExploration in environments with sparse rewards has been a persistent problem in reinforcement learning (RL). Many tasks are natural to specify with a sparse reward, and manually shaping a reward function can result in suboptimal performance. However, finding a non-zero reward is exponentially more difficult with increasing task horizon or action dimensionality. This puts many real-world tasks out of practical reach of RL methods. In this work, we use demonstrations to overcome the exploration problem and successfully learn to perform long-horizon, multi-step robotics tasks with continuous control such as stacking blocks with a robot arm. Our method, which builds on top of Deep Deterministic Policy Gradients and Hindsight Experience Replay, provides an order of magnitude of speedup over RL on simulated robotics tasks. It is simple to implement and makes only the additional assumption that we can collect a small set of demonstrations. Furthermore, our method is able to solve tasks not solvable by either RL or behavior cloning alone, and often ends up outperforming the demonstrator policy.\n\n***\n\n##### Automatic Goal Generation for Reinforcement Learning Agents\n\n- Paper Link: [arXiv 1705.06366](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.06366), [Homepage](https:\u002F\u002Fsites.google.com\u002Fview\u002Fgoalgeneration4rl)\n- Overview: \n\nReinforcement learning (RL) is a powerful technique to train an agent to perform a task; however, an agent that is trained using RL is only capable of achieving the single task that is specified via its reward function. Such an approach does not scale well to settings in which an agent needs to perform a diverse set of tasks, such as navigating to varying positions in a room or moving objects to varying locations. Instead, the authors propose a method that allows an agent to automatically discover the range of tasks that it is capable of performing in its environment. The authors use a generator network to propose tasks for the agent to try to accomplish, each task being specified as reaching a certain parametrized subset of the state-space. 
The generator network is optimized using adversarial training to produce tasks that are always at the appropriate level of difficulty for the agent, thus automatically producing a curriculum.  the authors show that, by using this framework, an agent can efficiently and automatically learn to perform a wide set of tasks without requiring any prior knowledge of its environment, even when only sparse rewards are available.\n\n***\n\n## Open source RL environment \n\n- Awesome RL environments: https:\u002F\u002Fgithub.com\u002Fclvrai\u002Fawesome-rl-envs\n\n    This repository has a comprehensive list of categorized reinforcement learning environments.\n\n- Mine Dojo: https:\u002F\u002Fgithub.com\u002FMineDojo\u002FMineDojo\n\n    ​\tMineDojo features a **massive simulation suite** built on Minecraft with 1000s of diverse tasks, and provides **open access to an internet-scale knowledge base** of 730K YouTube videos, 7K Wiki pages, 340K Reddit posts.\n\n\n\u003Cdiv style=\"text-align:center;\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_19946a6f172b.gif\" alt=\"img\" style=\"zoom:67%;\" \u002F>\n\u003C\u002Fdiv> \n\n- MineRL: https:\u002F\u002Fgithub.com\u002Fminerllabs\u002Fminerl , https:\u002F\u002Fminerl.readthedocs.io\u002Fen\u002Flatest\u002F\n\n    ​\tMineRL is a rich Python 3 library which provides a [OpenAI Gym](https:\u002F\u002Fgym.openai.com\u002F) interface for interacting with the video game Minecraft, accompanied with datasets of human gameplay.\n\n\u003Cdiv style=\"text-align:center;\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_1652c5ea01c3.gif\" alt=\"img\"  \u002F>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_d4b9439ba136.gif\" alt=\"img\"  \u002F>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_0e757c43095f.gif\" alt=\"img\"  \u002F>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_e3b2e87cb961.gif\" alt=\"img\"  \u002F>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_444585e04339.gif\" alt=\"img\"  \u002F>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_9a1a40f4aa0b.gif\" alt=\"img\"  \u002F>\n\u003C\u002Fdiv>\n\n- ALFworld: https:\u002F\u002Fgithub.com\u002Falfworld\u002Falfworld?tab=readme-ov-file , https:\u002F\u002Falfworld.github.io\u002F\n\n    ​\t**ALFWorld** contains interactive TextWorld environments (Côté et. al) that parallel embodied worlds in the ALFRED dataset (Shridhar et. al). 
The aligned environments allow agents to reason and learn high-level policies in an abstract space before solving embodied tasks through low-level actuation.\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_42c8d2b4b9b3.png\" style=\"zoom:50%;\" \u002F>\n\n- Skillhack: https:\u002F\u002Fgithub.com\u002Fucl-dark\u002Fskillhack\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_e7a41cc0cffa.png\" style=\"zoom: 33%;\" \u002F>\n\n- Minigrid: https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FMiniGrid?tab=readme-ov-file\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_5fd7fa7a296e.gif\" style=\"zoom: 50%;\" \u002F>\n\n- Crafter: https:\u002F\u002Fgithub.com\u002Fdanijar\u002Fcrafter?tab=readme-ov-file\n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_5baed70b4c4f.gif)\n\n- OpenAI procgen: https:\u002F\u002Fgithub.com\u002Fopenai\u002Fprocgen\n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_9a36f7ca75ee.gif)\n\n- Petting ZOO MPE: https:\u002F\u002Fpettingzoo.farama.org\u002Fenvironments\u002Fmpe\u002F\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_b0284afd7c94.gif\" alt=\"img\" style=\"zoom: 25%;\" \u002F> \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_d88b3bb83bea.gif\" alt=\"img\" style=\"zoom:25%;\" \u002F> \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_7e82b6c86fbb.gif\" alt=\"img\" style=\"zoom:25%;\" \u002F>\n\n- OpenAI Multi Agent Particle Env: https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmultiagent-particle-envs\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_8a39b5b08932.gif\" style=\"zoom: 50%;\" \u002F>\n\n- Multi Agent RL Environment: https:\u002F\u002Fgithub.com\u002FBigpig4396\u002FMulti-Agent-Reinforcement-Learning-Environment\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_0f086f9b6021.gif\" style=\"zoom:80%;\" \u002F>\n\n- MAgent2: https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FMAgent2?tab=readme-ov-file\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_e3725d190000.gif\" style=\"zoom: 67%;\" \u002F>\n\n\n***\n\n\u003Ctable style=\"border-collapse: collapse; border: none;\" border=\"0\" cellpadding=\"5\" cellspacing=\"0\">\n\u003Ctr>\n\u003Ctd style=\"border: none; padding: 3;\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_05b8bf57f1b2.png\" width=\"200\" \u002F>\u003C\u002Ftd>\n\u003Ctd style=\"border: none; padding: 3;\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_a7083ec1a8e9.png\" width=\"160\" \u002F>\u003C\u002Ftd>\n\u003Ctd style=\"border: none; padding: 3;\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_d88ffc418c44.png\" width=\"200\" \u002F>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n","# LLM RL 论文\n\n1. 关注 LLM 与强化学习的最新交叉研究；\n2. 重点在于结合两者的功能实现 **控制**（如游戏角色、机器人）；\n3. 
如果你读到了优秀的论文，欢迎随时提交 PR 分享。\n\n\n***\n\n## 目录\n\n* \u003Ca href=\"#research-review\" style=\"color: black; text-decoration: none; font-size: 20px; bold: true; font-weight: 700\"> 研究综述\u003C\u002Fa>\n\n   + [基于 LLM 的多智能体强化学习：现状与未来方向](#llm-based-multi-agent-reinforcement-learning-current-and-future-directions)\n   + [大型语言模型驱动的游戏智能体综述](#a-survey-on-large-language-model-based-game-agents)\n   + [大型语言模型增强的强化学习综述：概念、分类与方法](#survey-on-large-language-model-enhanced-reinforcement-learning-concept-taxonomy-and-methods)\n   + [RL 和 LLM 分类树：回顾强化学习与大型语言模型之间的协同作用](#the-rl-and-llm-taxonomy-tree-reviewing-synergies-between-reinforcement-learning-and-large-language-models)\n\n* \u003Ca href=\"#llm-rl-papers\" style=\"color: black; text-decoration: none; font-size: 20px; bold: true; font-weight: 700\">LLM RL 论文 [按方法排序]\u003C\u002Fa>\n\n   - **行动**\n\n     - 直接\n\n       →[iLLM-TSC：将强化学习与大型语言模型结合用于交通信号控制策略优化](#iLLM-TSC-Integration-reinforcement-learning-and-large-language-model-for-traffic-signal-control-policy-improvement)\n\n       →[SRLM：基于大型语言模型和深度强化学习的人机交互社交机器人导航](#srlm-human-in-loop-interactive-social-robot-navigation-with-large-language-model-and-deep-reinforcement-learning)\n\n       →[通过大型语言模型回放进行离线强化学习以构建知识型智能体](#knowledgeable-agents-by-offline-reinforcement-learning-from-large-language-model-rollouts)\n\n       →[利用语言反馈模型改进策略](#policy-improvement-using-language-feedback-models)\n\n       →[实践出真知：通过强化学习使 LLM 与具身环境对齐](#true-knowledge-comes-from-practice-aligning-llms-with-embodied-environments-via-reinforcement-learning)\n\n       →[大型语言模型作为策略教师训练强化学习智能体](#large-language-model-as-a-policy-teacher-for-training-reinforcement-learning-agents)\n\n       →[LLM 增强的层次化智能体](#llm-augmented-hierarchical-agents)\n\n       →[大型语言模型作为具身任务的通用策略](#large-language-models-as-generalizable-policies-for-embodied-tasks)\n\n       →[Octopus：基于环境反馈的具身视觉-语言编程器](#octopus-embodied-vision-language-programmer-from-environmental-feedback)\n\n       →[RE-MOVE：通过语言反馈为动态环境中的机器人导航任务设计自适应策略](#re-move-an-adaptive-policy-design-for-robotic-navigation-tasks-in-dynamic-environments-via-language-based-feedback)\n\n       →[利用在线强化学习将大型语言模型嵌入交互式环境](#grounding-large-language-models-in-interactive-environments-with-online-reinforcement-learning)\n\n       →[与语言模型协作进行具身推理](#collaborating-with-language-models-for-embodied-reasoning)\n\n       →[内心独白：借助语言模型规划实现具身推理](#inner-monologue-embodied-reasoning-through-planning-with-language-models)\n\n       →[做我能做到的，而不是我说的：将语言与机器人操作特性相结合](#do-as-i-can-not-as-i-say-grounding-language-in-robotic-affordances)\n\n       →[保持冷静并探索：文本类游戏中用于动作生成的语言模型](#keep-calm-and-explore-language-models-for-action-generation-in-text-based-games)\n\n     - 间接\n       \n       →[大型语言模型引导的六自由度飞行控制强化学习](#Large-Language-Model-Guided-Reinforcement-Learning-Based-Six-Degree-of-Freedom-Flight-Control)\n       \n       →[通过强化学习实现智能体与 LLM 之间的智能交互](#enabling-intelligent-interactions-between-an-agent-and-an-llm-a-reinforcement-learning-approach)\n       \n       →[RL-GPT：整合强化学习与代码即策略](#rl-gpt-integrating-reinforcement-learning-and-code-as-policy)\n\n   - **数据偏好**\n\n     →[基于 LLM 反馈的强化学习以对抗目标泛化错误](#reinforcement-learning-from-llm-feedback-to-counteract-goal-misgeneralization)\n\n   - **数据生成**\n\n     →[RLingua：利用大型语言模型提升机器人操作任务中的强化学习样本效率](#rlingua-improving-reinforcement-learning-sample-efficiency-in-robotic-manipulations-with-large-language-models)\n\n   - **环境配置**\n\n     →[通过集成语言模型和关键场景生成来增强自动驾驶车辆的训练](#enhancing-autonomous-vehicle-training-with-language-model-integration-and-critical-scenario-generation)\n\n     →[EnvGen：利用 LLM 
生成并适配环境以训练具身智能体](#envgen-generating-and-adapting-environments-via-llms-for-training-embodied-agents)\n\n   - **路径点**\n\n     →[HighwayLLM：基于 RL 指导的语言模型在高速公路驾驶中的决策与导航](#HighwayLLM-Decision-Making-and-Navigation-in-Highway-Driving-with-RL-Informed-Language-Model)\n\n   - **预测**\n\n     →[用语言学习建模世界](#learning-to-model-the-world-with-language)\n\n   - **奖励函数**\n\n     →[代理技能发现](#agentic-skill-discovery)\n\n     →[LEAGUE++：通过大型语言模型引导的技能习得赋能持续机器人学习](#league-empowering-continual-robot-learning-through-guided-skill-acquisition-with-large-language-models)\n\n     →[PREDILECT：在强化学习中利用零样本语言推理界定偏好](#predilect-preferences-delineated-with-zero-shot-language-based-reasoning-in-reinforcement-learning)\n\n     →[Auto MC-Reward：为 Minecraft 自动设计密集型奖励的大型语言模型](#auto-mc-reward-automated-dense-reward-design-with-large-language-models-for-minecraft)\n\n     →[通过大型语言模型的反馈加速机器人操作任务的强化学习](#accelerating-reinforcement-learning-of-robotic-manipulations-via-feedback-from-large-language-models)\n\n     →[Eureka：通过编码大型语言模型设计人类水平的奖励](#eureka-human-level-reward-design-via-coding-large-language-models)\n\n     →[Motif：来自人工智能反馈的内在动机](#motif-intrinsic-motivation-from-artificial-intelligence-feedback)\n\n→[Text2Reward：面向强化学习的自动化密集奖励函数生成](#text2reward-automated-dense-reward-function-generation-for-reinforcement-learning)\n\n     →[自精炼大型语言模型作为机器人深度强化学习的自动化奖励函数设计者](#self-refined-large-language-model-as-automated-reward-function-designer-for-deep-reinforcement-learning-in-robotics)\n\n     →[面向机器人技能合成的语言到奖励](#language-to-rewards-for-robotic-skill-synthesis)\n\n     →[基于语言模型的奖励设计](#reward-design-with-language-models)\n\n     →[阅读并收获奖励：借助说明书学习玩Atari游戏](#read-and-reap-the-rewards-learning-to-play-atari-with-the-help-of-instruction-manuals)\n\n   - **技能规划**\n\n     →[面向开放世界长 horizon 任务的技能强化学习与规划](#skill-reinforcement-learning-and-planning-for-open-world-long-horizon-tasks)\n\n     →[基于大型语言模型的四足机器人长 horizon 行走与操作](#long-horizon-locomotion-and-manipulation-on-a-quadrupedal-robot-with-large-language-model)\n\n   - **状态表示**\n\n     →[强化学习中的 LLM 赋能状态表示](#LLM-Empowered-State-Representation-for-Reinforcement-Learning)\n\n     →[自然语言强化学习](#natural-language-reinforcement-learning)\n\n     →[State2Explanation：基于概念的解释，助力智能体学习与用户理解](#state2explanation-concept-based-explanations-to-benefit-agent-learning-and-user-understanding)\n\n   - **任务建议**\n\n     →[基于大型语言模型的层次化持续强化学习](#hierarchical-continual-reinforcement-learning-via-large-language-model)\n\n     →[AutoRT：用于大规模机器人智能体编排的具身基础模型](#autort-embodied-foundation-models-for-large-scale-orchestration-of-robotic-agents)\n\n     →[语言与草图：一种由 LLM 驱动的交互式多模态多任务机器人导航框架](#language-and-sketching-an-llm-driven-interactive-multimodal-multitask-robot-navigation-framework)\n\n     →[LgTS：利用 LLM 生成的子目标为强化学习智能体进行动态任务采样](#lgts-dynamic-task-sampling-using-llm-generated-sub-goals-for-reinforcement-learning-agents)\n\n     →[RLAdapter：将大型语言模型与开放世界中的强化学习相连接](#rladapter-bridging-large-language-models-to-reinforcement-learning-in-open-worlds)\n\n     →[ExpeL：LLM 智能体是体验式学习者](#expel-llm-agents-are-experiential-learners)\n\n     →[用大型语言模型指导强化学习的预训练](#guiding-pretraining-in-reinforcement-learning-with-large-language-models)\n\n   - **Transformer 框架**\n\n     →[释放预训练语言模型的力量，用于离线强化学习](#unleashing-the-power-of-pre-trained-language-models-for-offline-reinforcement-learning)\n\n     →[AMAGO：适用于自适应智能体的可扩展上下文强化学习](#amago-scalable-in-context-reinforcement-learning-for-adaptive-agents)\n\n     →[Transformers 是样本高效的 World Models](#transformers-are-sample-efficient-world-models)\n\n* \u003Ca 
href=\"#foundational-approaches-in-reinforcement-learning\" style=\"color: black; text-decoration: none; font-size: 20px; bold: true; font-weight: 700\">强化学习的基础方法\u003C\u002Fa>\n\n\t→[在强化学习中使用自然语言进行奖励塑造](#using-natural-language-for-reward-shaping-in-reinforcement-learning)\n\n\t→[DQN-TAMER：具有难以处理反馈的人机协作强化学习](#dqn-tamer-human-in-the-loop-reinforcement-learning-with-intractable-feedback)\n\n\t→[通过示范克服强化学习中的探索问题](#overcoming-exploration-in-reinforcement-learning-with-demonstrations)\n\n\t→[强化学习智能体的自动目标生成](#automatic-goal-generation-for-reinforcement-learning-agents)\n\n* \u003Ca href=\"#open-source-rl-environment\" style=\"color: black; text-decoration: none; font-size: 20px; bold: true; font-weight: 700\">开源强化学习环境\u003C\u002Fa>\n\n***\n\n## 研究综述\n\n##### 基于大语言模型的多智能体强化学习：现状与未来方向\n\n- 论文链接：[arXiv 2405.11106](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.11106)\n- 概述：\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_67d08d2810af.png\" style=\"zoom: 40%;\" \u002F>\n\n语言条件下的多智能体强化学习（MARL）的潜在研究方向。  \n(a) 具有人格特征的合作，其中不同机器人由指令定义的不同人格进行协作。  \n(b) 支持人类参与的框架，即人类监督机器人并提供反馈。  \n(c) 传统的MARL与LLM协同设计，将关于LLM各方面的知识提炼到可在设备端运行的小型模型中。\n\n***\n\n##### 基于大型语言模型的游戏智能体综述\n\n- 论文链接：[arXiv 2404.02039](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.02039)，[主页](https:\u002F\u002Fgithub.com\u002Fgit-disl\u002Fawesome-LLM-game-agent-papers)\n- 概述：\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_92b41b647af3.png\" style=\"zoom:33%;\" \u002F>\n\nLLMGAs的概念架构。在每个游戏步骤中，**感知**模块从游戏环境中获取多模态信息，包括文本、图像、符号状态等。智能体从**记忆**模块中检索关键记忆，并将其与感知到的信息一起作为输入用于**思考**（推理、规划和反思），从而制定策略并做出明智决策。**角色扮演**模块影响决策过程，以确保智能体的行为与其设定的角色一致。随后，**行动**模块将生成的行动描述转化为可执行且合法的动作，以改变下一游戏步骤中的游戏状态。最后，**学习**模块通过积累的游戏经验不断提升智能体的认知能力和游戏水平。\n\n\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_3884ed678e52.png\" style=\"zoom:50%;\" \u002F>\n\n\u003Ccenter>学习模块思维导图\u003C\u002Fcenter>\n\n***\n\n##### 大型语言模型增强的强化学习综述：概念、分类与方法\n\n- 论文链接：[arXiv 2403.00282](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.00282)\n- 概述：\n\n![参见标题](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_9af618a2841f.png)\n\n经典智能体-环境交互中LLM增强的RL框架，其中LLM在增强RL方面发挥着不同的作用。\n\n***\n\n##### 强化学习与大型语言模型的协同关系分类树\n\n- 论文链接：[arXiv 2402.01874](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.01874) \n\n- 概述：\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_0b86a03f5b9e.png\" style=\"zoom: 67%;\" \u002F>\n\n    ​本研究提出了一种基于RL和LLM相互作用方式的新型分类体系，分为三大类：\n\n    - RL4LLM：利用RL提升LLM在自然语言处理相关任务中的性能。\n    - LLM4RL：使用LLM辅助训练一个不直接涉及自然语言的任务的RL模型。\n    - RL+LLM：将LLM和RL智能体嵌入到一个共同的规划框架中，而两者互不参与对方的训练或微调。\n\n***\n\n## LLM RL论文\n\n##### 语言条件下的多机器人导航离线强化学习\n\n- 论文链接：[arXiv2407.20164](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.20164)，[主页](https:\u002F\u002Fsites.google.com\u002Fview\u002Fllm-marl)\n- 概述：\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_34f51116474a.png)\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_125ac0e40a6f.png)\n\n所提出的多机器人模型架构。每个智能体接收不同的自然语言任务和局部观测。他们使用LLM将每个自然语言任务g~i~总结为潜在表示z~i~。函数*f*是一个图神经网络，它将局部观测o~1~, o~2~, ...以及任务嵌入z~1~, z~2~, ...编码为每个智能体*i*的任务依赖状态表示s~i~|z。他们学习一种基于状态-任务表示的局部策略*π*。函数*π*和*f*完全通过离线强化学习从固定数据集中学习得到。由于他们只对每个任务计算一次z~i~，因此LLM并不参与感知-行动循环，从而使策略能够快速执行。\n\n***\n\n##### LLM赋能的强化学习状态表示\n\n- 
论文链接：[arXiv2407.15019](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.13237)，[主页](https:\u002F\u002Fgithub.com\u002Fthu-rllab\u002FLESR)\n- 概述：\n\n![框架](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_1942e7c7e0e4.png)\n\n强化学习中传统的状态表示往往忽略关键的任务相关信息，这给价值网络建立准确的状态到任务奖励映射带来了巨大挑战。传统方法通常依赖大量样本学习来丰富状态表示中的任务特定信息，但这会导致样本效率低下和时间成本高昂。最近，功能强大的大型语言模型（LLMs）为无需过多人工干预的任务相关注入提供了有前景的替代方案。受此启发，我们提出了LLM赋能的状态表示（LESR），这是一种新颖的方法，利用LLM自主生成与任务相关的状态表示代码，从而帮助增强网络映射的连续性，并促进高效训练。实验结果表明，LESR具有较高的样本效率，在Mujoco任务中累积奖励平均比现有基准高出29%，在Gym-Robotics任务中成功率则高出30%。\n\n***\n\n##### iLLM-TSC：融合强化学习与大型语言模型以改进交通信号控制策略\n\n- 论文链接：[arXiv2407.06025](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.06025)，[主页](https:\u002F\u002Fgithub.com\u002FTraffic-Alpha\u002FiLLM-TSC)\n- 概述：\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_dac3dfa72a8b.png\" alt=\"img\" style=\"zoom:45%;\" \u002F>\n\n作者介绍了一种名为iLLM-TSC的框架，将LLM与RL智能体结合用于交通信号控制。该框架首先由RL智能体根据环境观测和从环境中学习到的策略做出初步决策。随后，LLM智能体会结合实际情况，并利用其对复杂环境的理解进一步优化这些决策。这种方法增强了TSC系统对现实条件的适应能力，同时提高了整个框架的稳定性。有关RL智能体和LLM智能体组件的详细信息将在后续章节中介绍。\n\n***\n\n##### 大型语言模型引导的六自由度飞行控制强化学习\n\n- 论文链接：[IEEE 2406 2024.3411015](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F10551749)\n- 概述：\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_f3eb600bd774.png)\n\nLLM引导的强化学习框架。\n本文提出了一种用于IFC的LLM引导深度强化学习框架，利用LLM引导的深度强化学习在计算资源有限的情况下实现智能飞行控制。LLM基于局部知识在训练过程中提供直接指导，从而提升DRL中智能体与环境交互生成数据的质量，加速训练进程，并为智能体提供及时反馈，部分缓解稀疏奖励问题。此外，他们还提出了一种有效的奖励函数，以全面平衡飞机的耦合控制，确保稳定、灵活的控制效果。最后，仿真和实验表明，所提出的技术在各类飞行任务中均表现出良好的性能、鲁棒性和适应性，为未来智能空战决策领域的研究奠定了基础。\n\n***\n\n##### 代理式技能发现\n\n- 论文链接：[arXiv 2405.15019](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.15019)，[主页](https:\u002F\u002Fagentic-skill-discovery.github.io\u002F)\n- 概述：\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_9af3c426a15f.png\" style=\"zoom: 30%;\" \u002F>\n\n代理式技能发现逐步获取用于桌面操作的上下文技能。\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_59496c593f8a.png\" style=\"zoom:40%;\" \u002F>\n\nASD的上下文技能获取循环。给定环境设置和机器人当前的能力，LLM会持续*提议*机器人需要完成的任务，成功完成后这些任务将被收集为已习得的技能，每项技能都有多个神经网络变体（*选项*）。\n\n***\n\n##### HighwayLLM：基于RL启发的语言模型在高速公路驾驶中的决策与导航\n\n- 论文链接：[arXiv 2405.13547](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.13547)\n- 概述：\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_f8f736e11a05.png\" style=\"zoom:75%;\" \u002F>\n\n基于LLM的车辆轨迹规划结构：RL智能体观察交通状况（周围车辆），并给出一个换道的高层动作。随后，LLM智能体通过FAISS检索highD数据集，并提供接下来的三个轨迹点。\n\n***\n\n##### LEAGUE++：通过大型语言模型引导的技能获取赋能持续机器人学习\n\n- 论文链接：https:\u002F\u002Fopenreview.net\u002Fforum?id=xXo4JL8FvV，[主页](https:\u002F\u002Fsites.google.com\u002Fview\u002Fcontinuallearning)\n- 概述：\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_e63d15a0f811.png\" style=\"zoom: 33%;\" \u002F>\n\n作者提出了一套利用LLM引导持续学习的框架。他们将LLM集成用于TAMP的任务分解和操作符创建，并为RL技能学习生成密集奖励，从而实现针对长时程任务的在线自主学习。同时，他们还使用语义技能库来提高新技能的学习效率。\n\n***\n\n##### 基于大型语言模型回放的离线强化学习所构建的知识型智能体\n\n- 论文链接：[arXiv 2404.09248](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.09248)，\n- 概述：\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_9bacb13ef8c9.png\" alt=\"参见说明\" style=\"zoom:50%;\" \u002F>\n\nKALM的整体流程，包含三个关键模块：\n(A) LLM接地模块，负责将LLM置于环境中，并使其与环境数据输入对齐；\n(B) 回放生成模块，提示LLM生成关于新技能的数据；\n(C) 
技能获取模块，利用离线强化学习训练策略。最终，KALM得到一个既基于离线数据又基于想象数据训练的策略。\n\n\n***\n\n##### 通过语言模型集成与关键场景生成提升自动驾驶汽车训练\n\n- 论文链接：[arXiv 2404.08570](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.08570)，\n- 概述：\n\n![参见说明](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_66013343e98a.jpg)\n\nCRITICAL框架的组件架构图。该框架首先根据highD数据集中的典型真实交通情况设置环境配置。这些配置随后被用来生成高速公路环境场景。每完成一轮模拟后，作者会收集包括失败报告、风险指标和奖励在内的数据，重复这一过程多次，以积累一系列带有相应场景风险评估的配置文件。为了增强RL训练，作者会基于风险指标分析配置分布，识别出那些有利于生成关键场景的配置。然后，他们要么直接使用这些配置生成新场景，要么让LLM生成关键场景。\n\n***\n\n##### 基于大型语言模型的四足机器人长时程运动与操作\n\n- 论文链接：[arXiv 2404.05291](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.05291)\n- 概述：\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_b08f05355d26.png\" alt=\"参见说明\" style=\"zoom: 50%;\" \u002F>\n\n长时程运动-操作任务的分层系统概述。该系统由用于任务分解的推理层（黄色）和用于技能执行的控制层（紫色）组成。根据长时程任务的语言描述（顶部），一系列LLM智能体进行高层任务规划，并生成参数化的机器人技能函数调用。控制层则通过RL实例化中层运动规划和底层控制技能。\n\n***\n\n##### 对你的机器人吼一吼：通过语言纠正即时改进\n\n- 论文链接：[arXiv 2403.12910](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.12910)，[主页](https:\u002F\u002Fyay-robot.github.io\u002F)\n\n- 框架概述：\n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_94ac31f8f5f8.jpeg)\n\n    ​\t作者采用分层架构，高层策略生成语言指令，供低层策略执行相应的技能。部署过程中，人类可以通过纠正性语言命令进行干预，临时覆盖高层策略，直接影响低层策略以实现实时调整。这些干预随后会被用来微调高层策略，从而提升其未来的性能。\n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_488800f3c977.png)\n\n    ​\t系统以RGB图像和机器人当前的关节位置作为输入，输出用于电机动作的目标关节位置。高层策略使用视觉Transformer对视觉输入进行编码，并预测语言嵌入。低层策略则使用ACT——一种基于Transformer的模型——在语言指令的指导下生成精确的机器人动作。这种架构使机器人能够理解“拿起包”之类的指令，并将其转化为目标关节的动作。\n\n***\n\n##### SRLM：结合大型语言模型与深度强化学习的人机交互社交机器人导航\n\n- 论文链接：[arXiv 2403.15648](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.15648)\n- 概述：\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_4ce9267abf86.png\" alt=\"参见说明\" style=\"zoom:50%;\" \u002F>\n\nSRLM架构：SRLM被实现为一种人机协作的交互式社交机器人导航框架，它结合了基于语言模型的规划器、基于反馈的规划器以及基于深度强化学习的规划器来执行人类指令。首先，用户请求或实时反馈会通过大型语言模型处理并重新规划为高层次的任务指导，传递给三个动作执行器。接着，图像转文本编码器和时空图人机交互编码器将机器人的局部观测信息转换为特征，作为LNM和RLNM的输入，从而生成基于强化学习的动作、基于语言模型的动作以及基于反馈的动作。最后，这三个动作由一个低层执行解码器自适应地融合，作为SRLM的机器人行为输出。\n\n***\n\n##### EnvGen：利用大型语言模型生成与适配环境，用于具身智能体的训练\n\n- 论文链接：[arXiv 2403.12014](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.12014) ，[主页](https:\u002F\u002Fenvgen-llm.github.io\u002F)\n\n- 框架概述：\n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_1eac74536778.png)\n\n    ​    在EnvGen框架中，作者使用大型语言模型生成多种环境，使智能体能够有效学习不同技能。整个训练过程分为N个循环，每个循环包含以下四个步骤。\n\n    ​    **第1步：** 向大型语言模型提供一个由四个部分组成的提示（即任务描述、环境细节、输出模板以及上一循环的反馈），并要求模型填充模板，输出可用于训练智能体不同技能的各种环境配置。\n\n    ​    **第2步：** 在大型语言模型生成的环境中训练一个小规模的强化学习智能体。\n\n    ​    **第3步：** 将智能体迁移到原始环境中进行训练，以提升其泛化能力，并通过让智能体探索原始环境来评估其训练进展。\n\n    ​    **第4步：** 将智能体在原始环境中的表现（由第3步测量得到）作为反馈提供给大型语言模型，以便在下一循环中调整环境设置，重点关注智能体表现较弱的技能。\n\n- 评论：\n        本文的亮点在于利用大型语言模型设计初始训练环境条件，这有助于强化学习智能体更快地掌握长时程任务的策略。该方法通过将长时程任务分解为更小的任务并反复训练，从而加速了强化学习的效率。此外，文中还引入了一种反馈机制，允许大型语言模型根据强化学习的训练效果不断调整环境条件。仅需与大型语言模型进行四次交互，便能显著提升强化学习的训练效率，并降低对大型语言模型的使用成本。\n\n***\n\n##### LEAGUE++：通过大型语言模型引导的技能获取，赋能持续性机器人学习\n\n- 论文链接：https:\u002F\u002Fopenreview.net\u002Fforum?id=xXo4JL8FvV，[主页](https:\u002F\u002Fsites.google.com\u002Fview\u002Fcontinuallearning)\n- 
概述：\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_a9196abb32ab.png)\n\n本文提出了一种利用大型语言模型引导持续学习的框架。该框架整合了大型语言模型，用于处理TAMP中的任务分解和操作符生成，并为强化学习技能学习生成密集奖励信号，从而实现针对长时程任务的在线自主学习。同时，该框架还使用语义技能库来提升新技能的学习效率。\n\n***\n\n##### RLingua：利用大型语言模型提升机器人操控中的强化学习样本效率\n\n- 论文链接：[arXiv 2403.06420](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.06420)，[主页](https:\u002F\u002Frlingua.github.io\u002F)\n\n- 框架概述：\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_a51b4be49708.png\" alt=\"RLingua框架\" style=\"zoom: 80%;\" \u002F>\n\n    ​\t(a) 动机：大型语言模型无需环境样本，且非专业人士也能轻松与其沟通。然而，直接由大型语言模型生成的机器人控制器性能可能较差。相比之下，强化学习可以训练出高性能的机器人控制器，但其缺点是样本复杂度高。(b) 框架：RLingua将大型语言模型关于机器人运动的内部知识提取出来，形成一段不完善的代码化控制器，然后通过与环境交互收集数据。机器人控制策略则同时利用收集到的大型语言模型示范数据以及在线训练策略收集的交互数据进行训练。\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_fe06bbbd6a6b.png\" alt=\"RLingua 2\" style=\"zoom:50%;\" \u002F>\n\n    ​\t带有人类反馈的提示设计框架。任务描述和编码指南按顺序被提示，人类反馈则是在观察初步的大型语言模型控制器在机器人上的执行过程后提供的。\n\n- 评论： \n\n    ​    本文的亮点在于同时应用大型语言模型和强化学习来为在线训练策略生成训练数据。由大型语言模型生成的控制代码也被视为一种策略，实现了数学意义上的统一。该策略的主要作用是运行在机器人上并收集数据。文章的重点在于大型语言模型的提示设计，即两种类型的提示流程——带人类反馈的和带代码模板的——以及如何设计这些提示。提示的设计非常细致，值得借鉴。\n\n***\n\n##### RL-GPT：将强化学习与“代码即策略”相结合\n\n- 论文链接：[arXiv 2402.19299](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.19299) ，[主页](https:\u002F\u002Fsites.google.com\u002Fview\u002Frl-gpt\u002F)\n\n- 框架概述： \n\n    \u003Cimg src=\".\u002Fimages\u002FRL-GPT框架.png\" alt=\"RL-GPT框架\" style=\"zoom: 50%;\" \u002F>\n\n    ​\t整体框架由一个慢速智能体（橙色）和一个快速智能体（绿色）组成。慢速智能体负责分解任务并确定“需要学习哪些动作”。快速智能体则负责编写代码及强化学习配置，以供底层执行使用。\n\n- 评论： \n\n    ​    该框架整合了“代码即策略”、“强化学习训练”和“大型语言模型规划”。首先，大型语言模型将任务分解为若干动作，再根据复杂程度进一步细分。简单的动作可以直接编码实现，而复杂的动作则采用代码与强化学习相结合的方式。此外，框架还引入了一个Critic模块，用于持续优化代码和规划方案。本文的亮点在于将大型语言模型生成的代码融入强化学习的动作空间进行训练，这种互动式的做法值得借鉴。\n\n***\n\n##### 大型语言模型如何引导强化学习？一种基于价值的方法\n\n- 论文链接：[arXiv 2402.16181](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.16181) ，[主页](https:\u002F\u002Fgithub.com\u002Fagentification\u002FLanguage-Integrated-VI)\n\n- 框架概述： \n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_b166efe03765.png)\n\n    ​\t在ALFWorld环境中，当N=2且BFS树的宽度设置为k=3时，SLINVIT算法的演示。任务是“清洗一块布并将它放在台面上”。大型语言模型可能出现的幻觉问题，即本应取毛巾而非布，可通过我们强化学习框架中固有的探索机制加以解决。\n\n- 评论\n\n​    本文的核心思想是将任务分配给大语言模型，在广度优先搜索（BFS）框架下进行充分探索，生成多种策略，并提出两种估值方法。一种方法基于代码，适用于目标实现需要满足多个前提条件的场景；另一种方法则依赖蒙特卡洛方法。随后，选择价值最高的最优策略，并将其与强化学习策略相结合，以增强数据采样和策略改进。\n\n***\n\n##### PREDILECT：在强化学习中利用零样本语言推理定义偏好\n\n- 论文链接：[arXiv 2402.15420](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.15420)，[主页](https:\u002F\u002Fsites.google.com\u002Fview\u002Frl-predilect)\n- 概述：\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_c0bc792b78ca.png\" alt=\"参见说明\" style=\"zoom:50%;\" \u002F>\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_a2367d701573.png\" alt=\"参见说明\" style=\"zoom:50%;\" \u002F>\n\nPREDILECT在社交导航场景中的概览：首先，向人类展示两条轨迹A和B。他们表明对其中一条轨迹的偏好，并提供一段额外的文本提示以阐述其见解。随后，可以使用大语言模型提取特征情感，揭示其文本提示中蕴含的因果推理，并将其处理后映射为一组内在价值。最后，利用这些偏好和强调的见解更准确地定义奖励函数。\n\n***\n\n##### 基于语言反馈模型的策略改进\n\n- 论文链接：[arXiv 2402.07876](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.07876)\n\n- 框架概述：\n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_9dfc638ebe2d.png)\n\n\n***\n\n##### 自然语言强化学习\n\n- 论文链接：[arXiv 2402.07157](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.07157)\n\n- 框架概述：\n\n    
\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_d5117c87694f.png\" style=\"zoom:50%;\" \u002F>\n\n    ​ 作者通过一个网格世界MDP示例，展示了NLRL与传统RL在任务目标、价值函数、贝尔曼方程以及广义策略迭代方面的差异。在这个网格世界中，机器人需要到达皇冠位置并避开所有危险。他们假设机器人在每个非终止状态都采取最优行动，但在状态b处则采用均匀随机策略。\n\n- 评论：\n\n    ​ 本文将强化学习作为大语言模型的一个流程，这是一种引人入胜的研究方法。框架内的最优策略与任务描述高度一致。每个状态及状态-动作值的质量取决于它们与任务描述的契合程度。状态-动作描述既包含奖励，也描述了下一个状态。而状态描述则是对所有可能状态-动作描述的总结。\n\n    ​ 在策略估计阶段，状态描述模仿了强化学习中常用的蒙特卡洛（MC）或时序差分（TD）方法。MC侧重于多步移动，根据最终状态进行评估；而TD则强调单步移动，返回下一个状态的描述。最后，大语言模型综合所有结果，得出当前状态的描述。在策略改进阶段，大语言模型会选择最佳的状态-动作对来决定行动。\n\n***\n\n##### 基于大语言模型的层次化持续强化学习\n\n- 论文链接：[arXiv 2401.15098](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.15098)\n\n- 框架概述：\n\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_4688ec15008c.png\" alt=\"Hi_Core框架\" style=\"zoom:67%;\" \u002F>\n\n  ​ 所提出的框架示意图。中间部分展示了Hi-Core内部交互（浅灰色线）和外部交互（深灰色线）。在内部，CRL智能体分为两层：高层策略制定（橙色）和低层策略学习（绿色）。此外，还构建了一个策略库（蓝色），用于存储和检索策略。周围的三个方框展示了当智能体遇到新任务时的内部工作流程。\n\n- 方法概述：\n\n    ​ 高层使用大语言模型生成一系列目标g_i。低层则是一个以目标为导向的强化学习系统，需根据这些目标生成策略。策略库用于存储成功的策略。当遇到新任务时，策略库可以调用相关经验，帮助高低层策略智能体完成任务。\n\n***\n\n##### 真知源于实践：通过强化学习使大语言模型与具身环境对齐\n\n- 论文链接：[arXiv 2401.14151](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.14151)，[主页](https:\u002F\u002Fgithub.com\u002FWeihaoTan\u002FTWOSOME)\n\n- 框架概述：\n\n    \u003Cimg src=\".\u002Fimages\u002FTWOSOME框架.png\" style=\"zoom: 67%;\" \u002F>\n\n    ​ TWOSOME如何利用动作的联合概率生成策略的概述。标记块中的颜色区域表示相应标记在动作中的概率。\n\n- 方法概述：\n\n    ​ 作者提出了在线框架*True knoWledge cOmeS frOM practicE*（TWOSOME）。该框架部署具身化的大语言模型智能体，通过强化学习高效地与环境互动并对齐，从而解决无需预先准备数据集或了解环境先验知识的决策任务。他们利用大语言模型提供的每个标记的日志似然分数计算各动作的联合概率，进而形成有效的行为策略。\n\n***\n\n##### AutoRT：用于大规模机器人编排的具身基础模型\n\n- 论文链接：[arXiv 2401.12963](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.12963)，[主页](https:\u002F\u002Fauto-rt.github.io\u002F)\n\n- 框架概述：\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_380204b98b1c.png\" style=\"zoom:40%;\" \u002F>\n\n    ​ AutoRT旨在探索将机器人扩展到非结构化的“野外”环境中。作者使用视觉语言模型对机器人所见内容进行开放词汇描述，再将这些描述传递给大语言模型，由后者生成自然语言指令。随后，另一家大语言模型会依据所谓的“机器人宪章”对这些建议进行评审，以优化指令，使其更加安全且易于执行。这样一来，他们便可以在更多样化的环境中运行机器人，即使事先并不知道机器人会遇到哪些物体，也能收集关于自动生成任务的数据。\n\n- 评论：\n\n    ​ 本文的主要贡献在于设计了一种框架，该框架利用语言学习模型（LLM）根据当前场景和技能为机器人分配任务。在任务执行阶段，可以采用多种机器人学习方法，例如强化学习（RL）。执行过程中获得的数据会被添加到数据库中。\n\n    ​ 通过这一迭代过程，并随着越来越多的机器人加入，数据收集过程可以实现自动化和加速。这些高质量的数据可用于未来训练更多的机器人。这项工作为基于大量真实物理数据的机器人学习奠定了基础。\n\n***\n\n##### 基于大语言模型反馈的强化学习以对抗目标误泛化\n\n- 论文链接：[arXiv 2401.07181](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.07181)，\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_741b82dcba14.png\" style=\"zoom:67%;\" \u002F>\n\nLLM偏好建模与奖励模型。RL智能体部署在LLM生成的数据集上，并存储其轨迹数据。LLM会比较成对的轨迹并给出偏好，这些偏好用于训练一个新的奖励模型。随后，该奖励模型会被整合到智能体剩余的训练步中。\n\n***\n\n##### Auto MC-Reward：利用大语言模型为我的世界游戏设计自动化密集奖励机制\n\n- 论文链接：[arXiv 2312.09238](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.09238)，[主页](https:\u002F\u002Fyangxue0827.github.io\u002Fauto_mc-reward.html)\n- 概述：\n\n![](https:\u002F\u002Fyangxue0827.github.io\u002Fauto_mc-reward_files\u002Fpipeline_v3.png)\n\nAuto MC-Reward框架概览。Auto MC-Reward由三个基于LLM的关键组件构成：奖励设计师、奖励评论家和轨迹分析器。通过智能体与环境之间的持续交互，不断迭代出合适的密集奖励函数，用于特定任务的强化学习训练，从而使模型能够更好地完成任务。图中展示了一个探索钻石矿石的例子：i) 轨迹分析器发现失败轨迹中的智能体会因熔岩而死亡，于是建议在遇到熔岩时给予惩罚；ii) 奖励设计师采纳该建议并更新奖励函数；iii) 修改后的奖励函数经过奖励评论家的审核后，最终智能体通过左转避开了熔岩。\n\n***\n\n##### 大语言模型作为策略教师用于训练强化学习智能体\n\n- 论文链接：[arXiv 
2311.13373](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.13373)，[主页](https:\u002F\u002Fgithub.com\u002FZJLAB-AMMI\u002FLLM4Teach)\n\n- 框架概述：\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_d8a6de34b198.png\" style=\"zoom:67%;\" \u002F>\n\n    以MiniGrid环境为例的LLM4Teach框架示意图。基于LLM的教师智能体根据环境提供的状态观测结果，给出软性指导。这些指导表现为一组建议动作的概率分布。学生智能体被训练同时优化两个目标。第一个目标是最大化预期回报，这与传统RL算法相同。另一个目标则是鼓励学生智能体遵循教师提供的指导。随着训练过程中学生智能体专业技能的提升，第二个目标的权重会随时间逐渐降低，从而减少其对教师的依赖。\n\n***\n\n##### 语言与草图：一个由LLM驱动的交互式多模态多任务机器人导航框架\n\n- 论文链接：[arXiv 2311.08244](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.08244)\n\n- 框架概述：\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_3bec032a0854.png\" style=\"zoom: 50%;\" \u002F>\n\n\n该框架包含一个LLM模块、一个智能感知模块和一个强化学习模块。\n\n***\n\n##### LLM增强的层次化智能体\n\n- 论文链接：[arXiv 2311.05596](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.05596)\n\n- 框架概述：\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_f38a4cbf8f42.png\" style=\"zoom: 67%;\" \u002F>\n\nLLM用于指导高层策略并加速学习。它会根据上下文、一些示例以及当前的任务和观测结果进行提示。LLM的输出会影响高层动作的选择。\n\n***\n\n##### 通过大语言模型的反馈加速机器人操作的强化学习\n\n- 论文链接：[arXiv 2311.02379](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.02379)\n- 概述：\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_93d6d558f088.png\" style=\"zoom:67%;\" \u002F>\n\n所提出的Lafite-RL框架示意图。在开始学习一项任务之前，用户会提供设计好的提示，包括当前任务背景和期望的机器人行为描述，以及针对LLM任务的具体规则说明。随后，Lafite-RL使LLM能够“观察”并理解场景信息，其中包括机器人的过往动作，并根据当前任务要求评估这些动作。语言解析器将LLM的响应转化为评估反馈，用于构建交互式奖励。\n\n***\n\n##### 解放预训练语言模型的力量用于离线强化学习\n\n- 论文链接：[arXiv 2310.20587](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.20587)，[主页](https:\u002F\u002Flamo2023.github.io\u002F)\n- 概述：\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_50f290a075ce.png)\n\nLaMo框架概览。LaMo主要分为两个阶段：(1) 在语言任务上预训练LM；(2) 冻结预训练的注意力层，用MLP替换线性投影，并使用LoRA适配RL任务。作者还在离线RL阶段将语言损失作为正则化项应用。\n\n***\n\n##### 大语言模型作为具泛化能力的具身任务策略\n\n- 论文链接：[arXiv 2310.17722](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.17722)，[主页](https:\u002F\u002Fllm-rl.github.io\u002F)\n- 概述：\n\n![img](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_84987d32a982.jpg)\n\n通过结合强化学习与预训练的LLM，并仅最大化稀疏奖励，该方法可以学习到一种能够泛化到新型语言重排任务的策略。该方法对未见过的物体和场景、以描述或活动解释方式指代物体的新方法，甚至包括变量数量的重排、空间描述和条件语句等新型任务描述，均表现出强大的泛化能力。\n\n***\n\n##### Eureka：通过编码型大语言模型实现人类水平的奖励设计\n\n- 论文链接：[arXiv 2310.12931](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.12931)，[主页](https:\u002F\u002Feureka-research.github.io\u002F)\n\n- 框架概述：\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_d3db19d18ee2.png\" style=\"zoom:67%;\" \u002F>\n\n\nEUREKA以未经修改的环境源代码和语言任务描述为上下文，零次调用直接从编码型LLM生成可执行的奖励函数。随后，它在奖励采样、GPU加速的奖励评估和奖励反思之间反复迭代，逐步改进其奖励输出。\n\n- 评论\n\n    本文中的LLM被用于设计RL的奖励函数。主要关注点是如何创建一个精心设计的奖励函数。有两种方法：\n\n    1. **进化搜索**：最初生成大量奖励函数，并使用硬编码的方法对其进行评估。\n    2. 
**奖励反思**：在训练过程中，保存中间奖励变量并反馈给LLM，从而可以在原有奖励函数的基础上进行改进。\n\n    第一种方法更偏向于静态分析，而第二种方法则强调动态分析。通过结合这两种方法，可以选择并优化最佳的奖励函数。\n\n***\n\n##### AMAGO：面向自适应智能体的可扩展上下文强化学习\n\n- 论文链接：[arXiv 2310.09971](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.09971)，[主页](https:\u002F\u002Fut-austin-rpl.github.io\u002Famago\u002F)\n- 概述：\n\n![img](https:\u002F\u002Fut-austin-rpl.github.io\u002Famago\u002Fsrc\u002Ffigure\u002Ffig1_iclr_e_notation.png)\n\n上下文强化学习技术通过使用序列模型，从测试时的经验中推断未知环境的身份，从而解决记忆和元学习问题。AMAGO则针对核心技术挑战，旨在统一端到端离策略强化学习与长序列Transformer的性能，以将记忆能力和对新环境的适应能力推向新的极限。\n\n***\n\n##### LgTS：利用LLM生成的子目标进行动态任务采样的强化学习智能体\n\n- 论文链接：[arXiv 2310.09454](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.09454.pdf)\n- 概述：\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_a664fe8ed8df.png\" style=\"zoom:67%;\" \u002F>\n\n(a) 网格世界领域及描述符。智能体（红色三角形）需要收集其中一把钥匙并打开门才能到达目标；\n(b) 发送给LLM的提示，其中包含关于LLM应生成路径数量的信息，以及诸如实体、谓词、环境的高层初始状态和目标状态等符号信息（对于某些谓词的真值是否已知不做假设）。LLM返回的是一组以有序列表形式表示的路径。这些路径随后被转换为有向无环图（DAG）。图b中，LgTS选择的路径在DAG中以红色突出显示。\n\n****\n\n##### Octopus：基于环境反馈的具身视觉—语言编程器\n\n- 论文链接：[arXiv 2310.08588](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.08588)，[主页](https:\u002F\u002Fchoiszt.github.io\u002FOctopus\u002F)\n- 概述：\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_0575cf4a6bc3.jpg\" style=\"zoom:30%;\" \u002F>\n\nGPT-4通过**环境消息**感知环境，并根据详细的**系统消息**生成预期的计划和代码。这些代码随后在模拟器中执行，引导智能体进入下一个状态。对于每个状态，作者都会收集环境消息，其中**观察到的物体**和**关系**会被自我的视角图像所替代，作为训练输入。而GPT-4的响应则作为训练输出。环境反馈，特别是对每个目标状态是否达成的判断，会被记录下来用于RLEF训练。\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_e324e16a5bea.jpg\" style=\"zoom:30%;\" \u002F>\n\n提供的图片展示了一个全面的数据收集和训练流程。在**数据收集管道**中，环境信息被捕捉并解析为场景图，然后组合生成**环境消息**和**系统消息**。这些消息随后驱动智能体的控制，最终生成可执行的代码。对于**Octopus训练管道**，智能体的视觉和代码会被输入到Octopus模型中，采用**SFT**和**RLEF**两种技术进行训练。附带的文字强调了结构良好的系统消息对于GPT-4有效生成代码的重要性，并指出了因错误而面临的挑战，同时突出了该模型在处理各种任务时的适应性。总之，该管道提供了一种从环境理解到行动执行的全方位智能体训练方法。\n\n***\n\n##### Motif：来自人工智能反馈的内在动机\n\n- 论文链接：[arXiv 2310.00166](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.00166)，[主页](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fmotif)\n- 概述：\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_781da5890c07.png\" style=\"zoom:100%;\" \u002F>\n\nMotif三个阶段的示意图。在第一阶段，即数据集标注阶段，作者从LLM中提取出对成对说明文字的偏好，并将相应的观测对及其标注保存到数据集中。第二阶段是奖励训练阶段，作者将这些偏好提炼成基于观测的标量奖励函数。第三阶段则是强化学习训练阶段，作者使用从偏好中提取的奖励函数，可能结合来自环境的奖励信号，与智能体进行交互式强化学习训练。\n\n***\n\n##### Text2Reward：强化学习中的自动化密集奖励函数生成\n\n- 论文链接：[arXiv 2309.11489](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.11489)\n\n- 框架概述：\n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_fe073150965d.png)\n\n    ​专家抽象层将环境抽象为一个Python类的层次结构。*用户指令*用自然语言描述要实现的目标。*用户反馈*允许用户总结失败模式或偏好，这些信息可用于改进奖励代码。\n\n***\n\n##### State2Explanation：基于概念的解释，助力智能体学习和用户理解\n\n- 论文链接：[arXiv 2309.12482](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.12482)\n- 概述：\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_fce4c9f00344.png\" style=\"zoom:67%;\" \u002F>\n\nS2E框架包括(a)学习一个联合嵌入模型M，从中提取epsilon并加以利用；(b)在智能体训练过程中用于指导奖励塑造，促进智能体学习；(c)在部署阶段为终端用户提供关于智能体行为的epsilon解释。\n\n***\n\n##### 自我精炼的大规模语言模型：用于机器人深度强化学习的自动化奖励函数设计者\n\n- 论文链接：[arXiv 2309.06687](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.06687)\n\n- 框架概述：\n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_5de4fae68c84.png)\n\n    ​ 
提出的用于奖励函数设计的自我精炼LLM框架。它包括三个步骤：初始设计、评估和自我精炼循环。这里以四足机器人前向奔跑任务为例。\n\n***\n\n##### RLAdapter：连接大型语言模型与开放世界中的强化学习\n\n- 论文链接：[arXiv 2309.17176](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.17176v1)\n\n- 框架概述：\n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_f4f3ba2637ed.png)\n\n    ​ RLAdapter的整体框架。除了接收来自环境和历史信息的输入外，适配器模型的提示还包含一个理解分数。该分数计算智能体最近的动作与LLM建议的子目标之间的语义相似度，从而判断智能体当前是否准确理解了LLM的指导。通过智能体的反馈以及对适配器模型的持续微调，可以使LLM始终与任务的实际状况保持一致。这反过来又确保了所提供的指导最符合智能体优先的学习需求。\n\n- 评论：\n\n    这篇论文开发了RLAdapter框架，除了强化学习和大型语言模型之外，还额外引入了一个适配器模型。\n\n***\n\n##### ExpeL：LLM智能体是经验型学习者\n\n- 论文链接：[arXiv 2308.10144](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.10144)，[主页](https:\u002F\u002Fandrewzh112.github.io\u002F#expel)\n- 概述：\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_f8c7b2eb5aa9.png\" style=\"zoom: 50%;\" \u002F>\n\n左图：ExpeL 分为三个阶段：(1) 将成功和失败的经验收集到池中。 (2) 从这些经验中提取\u002F抽象出跨任务的知识。 (3) 在评估任务中应用所获得的洞见，并回忆过去的成功经验。\n右图：(A) 展示了通过反思进行经验收集的过程，使智能体在对失败进行自我反思后能够重新尝试任务。(B) 展示了洞见提取步骤。当面对成功\u002F失败配对或 L 个成功案例列表时，智能体可以使用 ADD、UPVOTE、DOWNVOTE 和 EDIT 等操作动态修改现有的洞见列表。这一过程的重点在于提取常见的失败模式或最佳实践。\n\n***\n\n##### 面向机器人技能合成的语言到奖励系统\n\n- 论文链接：[arXiv 2306.08647](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.08647)，[主页](https:\u002F\u002Flanguage-to-reward.github.io\u002F)\n- 概述：\n\n![img](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_6ca32238b052.png)\n\n奖励转换器的详细数据流。运动描述语言模型接收用户输入，用自然语言描述用户指定的运动；而奖励编码器则将该运动转化为奖励参数。\n\n***\n\n##### 利用语言学习世界模型\n\n- 论文链接：[arXiv2308.01399](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.01399)，[主页](https:\u002F\u002Fdynalang.github.io\u002F)\n- 概述：\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_b4c0181febd5.png\" style=\"zoom:67%;\" \u002F>\n\nDynalang 学习使用语言对未来（文本+图像）观测值和奖励进行预测，从而帮助其完成任务。在此，作者展示了在 HomeGrid 环境中的真实模型预测结果。智能体探索了多个房间，并从环境中获取视频和语言观测信息。根据过去的文本“瓶子在客厅里”，智能体在第 61 至 65 个时间步预测它将在客厅的最后一个角落看到瓶子。根据描述任务的文本“拿瓶子”，智能体预测自己会因拿起瓶子而获得奖励。此外，智能体还能预测未来的文本观测：给定前缀“盘子在”以及在第 30 个时间步观察到的台面上的盘子，模型预测最有可能出现的下一个词是“厨房”。\n\n***\n\n##### 基于强化学习实现智能体与大语言模型之间的智能交互\n\n- 论文链接：[arXiv 2306.03604](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.03604)，[主页](https:\u002F\u002Fgithub.com\u002FZJLAB-AMMI\u002FLLM4RL)\n- 概述：\n\n![llm4rl](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_d70c1bbac770.png)\n\n规划者-执行者-中介范式概览及交互示例。在每个时间步，中介接收观测 o_t 作为输入，并决定是否向 LLM 规划者请求新指令。当请求策略决定发起请求时，如红色虚线所示，翻译器会将 o_t 转换为文本描述，规划者据此输出新的计划供执行者遵循。另一方面，当中介决定不请求时，如绿色虚线所示，中介直接返回给执行者，指示其继续执行当前计划。\n\n***\n\n##### 基于语言模型的奖励设计\n\n- 论文链接：[arXiv 2303.00001](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.00001)\n\n- 框架概述：\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_6852bed10d5f.png\" style=\"zoom: 50%;\" \u002F>\n\n    ​\t框架在 DEAL OR NO DEAL 谈判任务上的示意图。用户在训练前提供期望谈判行为的示例及其解释（例如灵活性）。在训练过程中，(1) 用户向 LLM 提供任务描述、用户对其目标的描述、将一轮结果转换为字符串，以及询问该轮结果是否满足用户目标的问题。(2-3) 随后，他们将 LLM 的回复解析回字符串，并将其用作 RL 智能体 Alice 的奖励信号。(4) Alice 更新其权重并展开新一轮实验。(5) 他们再次将实验结果解析为字符串，继续训练。在评估阶段，他们从 Alice 的轨迹中采样，并评估其是否符合用户的目标。\n\n***\n\n##### 面向开放世界长时限任务的技能强化学习与规划\n\n- 论文链接：[arXiv 2303.16563](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.16563)，[主页](https:\u002F\u002Fsites.google.com\u002Fview\u002Fplan4mc)\n\n- 框架概述：\n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_e2b270de5a69.png)\n\n    ​\t作者将 Minecraft 中的基本技能分为三类：发现技能、操作技能和制作技能。他们利用强化学习训练策略来掌握这些技能。借助 
LLM，作者提前提取技能之间的关系并构建技能图，如虚线框所示。在线规划时，技能搜索算法会在预生成的图上遍历，将任务分解为可执行的技能序列，并交互式地选择相应的策略来解决复杂任务。\n\n- 评论\n\n    ​\t本文的亮点在于使用 LLM 生成技能图，从而明确各技能之间的先后顺序。当输入一项任务时，框架会使用深度优先搜索在技能图上查找，以确定每一步应选择的技能。强化学习负责执行具体技能并更新状态，通过迭代这一过程，将复杂任务分解为易于管理的部分。\n\n    ​\t该框架有待改进的地方包括：\n\n     1. 目前需要人工预先提供可用技能。未来，框架应具备自主学习新技能的能力。\n     2. 框架中 LLM 的作用主要是建立技能之间的关系。或许这一功能也可以通过硬编码实现，例如查询 Minecraft 库来生成技能图。\n\n***\n\n##### RE-MOVE：基于语言反馈的自适应策略设计，用于动态环境中的机器人导航任务\n\n- 论文链接：[arXiv 2303.07622](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.07622)，[主页](https:\u002F\u002Fgamma.umd.edu\u002Fresearchdirections\u002Fcrowdmultiagent\u002Fremove\u002F)\n- 概述：\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_5dcd044e7be4.png\" style=\"zoom:67%;\" \u002F>\n\n***\n\n##### 基于内部向外的任务语言开发与翻译的自然语言条件强化学习\n\n- 论文链接：[arXiv 2302.09368](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.09368)\n- 概述：\n\n自然语言条件强化学习（NLC-RL）使智能体能够遵循人类指令。以往的方法通常通过提供自然语言指令并训练一个遵循策略来实现自然语言条件强化学习。在这种“外向式”方法中，策略需要同时理解自然语言并完成任务。然而，不受限制的自然语言示例往往会给具体的强化学习任务带来额外的复杂性，从而分散策略学习完成任务的注意力。为了减轻策略的学习负担，作者提出了一种“内向式”自然语言条件强化学习方案，开发了一种与任务相关且独特的任务语言（TL）。该任务语言用于强化学习中，以实现高效且有效的策略训练。此外，还训练了一个翻译器将自然语言翻译成任务语言。他们将这一方案实现为TALAR（带有谓词表示的任务语言），该模型学习多个谓词来建模对象关系作为任务语言。实验表明，TALAR不仅更好地理解了自然语言指令，还生成了更好的指令遵循策略，成功率达到13.4%，并且能够适应未见过的自然语言表达方式。此外，任务语言还可以作为一种有效的任务抽象，天然地与层次化强化学习兼容。\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_82cba77cfcbe.png\" style=\"zoom:67%;\" \u002F>\n\nNLC-RL中OIL和IOL方案的示意图。\n左图：OIL直接将自然语言指令暴露给策略。\n右图：IOL开发了一种任务语言，它是与任务相关的、对自然语言指令的独特表示。\n实线表示指令执行过程，虚线表示任务语言的开发和翻译。\n\n***\n\n##### 大型语言模型在强化学习中的引导式预训练\n\n- 论文链接：[arXiv 2302.06692](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.06692) ，[主页](https:\u002F\u002Fgithub.com\u002Fyuqingd\u002Fellm)\n\n- 框架概述：\n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_d096f77b5e21.png)\n\n    ELLM利用预训练的大规模语言模型（LLM），以一种与任务无关的方式建议可能有用的目标。基于LLM的上下文敏感性和常识能力，ELLM训练强化学习智能体去追求那些很可能有意义的目标，而无需直接的人工干预。\n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_e6ffb73b6494.png)\n\n    ELLM使用GPT-3来建议合适的探索目标，并利用SentenceBERT嵌入计算建议目标与示范行为之间的相似度，以此作为内在动机奖励。\n\n- 评论：\n\n    这篇论文是最早将LLM用于强化学习目标规划的研究之一。ELLM框架会将当前环境信息和可用动作传递给LLM，使其能够根据常识设计出多个合理的目标，然后由强化学习执行其中一个目标。奖励函数则根据目标和状态嵌入的相似度来确定。由于嵌入也是由SentenceBERT模型生成的，因此可以说奖励实际上是由LLM生成的。\n\n***\n\n##### 在交互式环境中通过在线强化学习实现大型语言模型的具身化\n\n- 论文链接：[arXiv 2302.02662](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.02662) ，[主页](https:\u002F\u002Fgithub.com\u002Fflowersteam\u002FGrounding_LLMs_with_online_RL)\n\n- 框架概述：\n\n    ![主架构图](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_c3fb3f61cdd5.png)\n\n    GLAM方法：作者在文本交互式强化学习环境（BabyAI-Text）中使用LLM作为智能体策略，在该环境中LLM通过在线强化学习（PPO）被训练以实现语言目标，从而实现功能性的具身化。（a）BabyAI-Text为当前回合提供目标描述、智能体观测描述以及当前步骤的标量奖励。（b）每一步，他们将目标描述和观测整合成一个提示词发送给我们的LLM。（c）对于每个可能的动作，他们使用编码器生成提示词的表示，并计算在给定提示词条件下构成该动作的标记的条件概率。一旦估算出每个动作的概率，他们就会对这些概率进行softmax归一化，并根据该分布采样一个动作。也就是说，LLM就是我们的智能体策略。（d）他们利用环境返回的奖励，通过PPO微调LLM。为此，他们在LLM的基础上添加了一个价值头来估计当前观测的价值。最后，他们通过LLM（及其价值头）反向传播梯度。\n\n- 评论：\n\n    本文使用BabyAI-Text将Gridworld中的目标和观测转化为文本描述，这些描述可以进一步转换为输入LLM的提示词。LLM输出动作的概率，随后LLM输出的动作概率、MLC获得的值估计以及奖励会被输入到PPO中进行训练。最终，智能体会输出一个合适的动作。在实验中，作者使用了GFlan-T5模型，经过25万步的训练后，成功率达到80%，相比其他方法有了显著提升。\n\n***\n\n##### 阅读并收获奖励：借助说明书学习玩Atari游戏\n\n- 论文链接：[arXiv 2302.04449](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.04449)\n- 概述：\n\n\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_897fb103c22b.png\" 
style=\"zoom:80%;\" \u002F>\n\n“阅读与奖励”框架概览。系统接收环境中的当前帧和说明书作为输入。经过目标检测和具身化处理后，QA提取模块会从说明书中提取并总结相关信息，推理模块则根据QA提取模块的输出，为游戏中检测到的事件分配辅助奖励。“是\u002F否”答案随后被映射为+5\u002F−5的辅助奖励。\n\n***\n\n##### 与语言模型协作进行具身推理\n\n- 论文链接：[arXiv 2302.00763](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.00763)\n\n- 框架概述：\n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_c6b3979d8424.png)\n\n    A. 规划者-执行者-报告者范式的示意图及三者之间交互的示例。B. PycoLab环境的观测空间和动作空间。\n\n- 评论：\n\n    本文提出的框架简单明了，是早期将LLM用于强化学习策略的研究之一。在这个框架中，规划者是LLM，而报告者和执行者则是强化学习组件。任务要求先检查物品的属性，再选择具有“良好”属性的物品。框架从规划者开始，向其提供任务描述和历史执行记录。规划者随后为执行者选择一个行动。执行者执行动作后会得到结果。报告者观察环境并向规划者反馈，这一过程不断重复。\n\n***\n\n##### 变压器是样本高效的世界模型\n\n- 论文链接：[arXiv 2209.00588](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.00588)，[主页](https:\u002F\u002Fgithub.com\u002Feloialonso\u002Firis)\n\n***\n\n##### 内心独白：通过语言模型规划实现具身推理\n\n- 论文链接：[arXiv 2207.05608](https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.05608)，[主页](https:\u002F\u002Finnermonologue.github.io\u002F)\n- 概述：\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_fcb301aae340.png)\n\n内心独白通过结合一系列感知模型（如场景描述符和成功检测器）与预训练的语言条件机器人技能，为基于大型语言模型的机器人规划提供 grounded 的闭环反馈。实验表明，该系统能够在模拟和真实环境中，针对 (a) 移动操作和 (b,c) 桌面操作等复杂长 horizon 任务进行推理和重规划，从而完成目标。\n\n***\n\n##### 做我能做的，而不是我说的：将语言 grounding 到机器人 affordance 中\n\n- 论文链接：[arXiv 2204.01691](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.01691)，[主页](https:\u002F\u002Fsay-can.github.io\u002F)\n\n- 框架概述：\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_aba52bfdb258.png\" style=\"zoom:67%;\" \u002F>\n\n    给定一个高层指令，SayCan 将来自 LLM 的概率（即某项技能对指令有用的概率）与来自价值函数的概率（即成功执行该技能的概率）相结合，以选择要执行的技能。这样生成的技能既可行又有效。随后，将该技能附加到响应中，并再次查询模型，这一过程不断重复，直到输出步骤为终止。\n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_2f83ecd36c09.png)\n\n    ​ 会查询一个价值函数模块 (a)，根据当前观测结果构建动作基元的价值函数空间。以“pick”价值函数为例，在 (b) 场景中，“拿起红色的公牛罐头”和“拿起苹果”的值较高，因为这两个物体都在场景中；而在 (c) 场景中，机器人正在空旷的空间中导航，因此没有任何拿起动作能获得高值。\n\n***\n\n##### 保持冷静并探索：用于文本类游戏中动作生成的语言模型\n\n- 论文链接：[arXiv 2010.02903](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.02903)，[主页](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002Fcalm-textgame)\n- 概述：\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_0ece76ad810d.png\" style=\"zoom: 67%;\" \u002F>\n\nCALM 与一个强化学习智能体——DDRN——结合用于游戏玩法。CALM 基于人类游戏的转录本进行训练，以生成动作。在每个状态下，CALM 会根据游戏上下文生成候选动作，而 DRRN 则计算这些动作的 Q 值，从而选择最终的动作。一旦训练完成，单个 CALM 实例即可用于为任何文本类游戏生成动作。\n\n***\n\n## 强化学习的基础方法\n\n> 理解强化学习中的基础方法，如课程学习、RLHF 和人机协作强化学习（HITL），对我们的研究至关重要。这些方法构成了现代强化学习技术的基石。通过研究这些早期方法，我们可以更深入地理解强化学习背后的原理和机制。这些知识能够为当前语言模型学习（LLM）与强化学习交叉领域的研究提供指导和启发，帮助我们开发出更有效、更具创新性的解决方案。\n\n***\n\n##### 在强化学习中使用自然语言进行奖励塑造\n\n- 论文链接：[arXiv 1903.02020](https:\u002F\u002Farxiv.org\u002Fabs\u002F1903.02020) \n\n- 框架概述： \n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_184f3663ec46.png)\n\n    该框架由标准的强化学习模块组成，包含智能体与环境的交互循环，并在此基础上增加了一个语言动作奖励网络（LEARN）模块。\n\n- 评论：\n\n    ​    本文提出了一种利用自然语言提供奖励的方法。当时还没有大型语言模型，因此作者使用了大量的现有游戏视频及其对应的语言描述作为数据集。他们训练了一个前馈神经网络（FNN），该网络可以输出当前轨迹与语言指令之间的关系，并将这一输出作为中间奖励。通过将其与原始稀疏的环境奖励相结合，强化学习智能体能够基于目标和语言指令更快地学习到最优策略。\n\n***\n\n##### DQN-TAMER：具有难以处理反馈的人机协作强化学习\n\n- 论文链接：[arXiv 1810.11748](https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.11748)\n- 概述： \n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_b74f31309406.png\" style=\"zoom:80%;\" 
\u002F>\n\n人机协作强化学习及模型（DQNTAMER）的概览。智能体在给定环境中与人类观察者异步交互。DQN-TAMER 根据两个模型来决定行动：一个模型（Q）用于估计来自环境的奖励，另一个模型（H）则用于获取来自人类的反馈。\n\n***\n\n##### 利用示范克服强化学习中的探索问题\n\n- 论文链接：[arXiv 1709.10089](https:\u002F\u002Farxiv.org\u002Fabs\u002F1709.10089)，[主页](https:\u002F\u002Fashvin.me\u002Fdemoddpg-website\u002F)\n- 概述：\n\n在奖励稀疏的环境中进行探索一直是强化学习（RL）中一个持续存在的难题。许多任务天然适合采用稀疏奖励的形式，而手动设计奖励函数往往会导致次优性能。然而，随着任务时间跨度或动作空间维度的增加，找到非零奖励的难度会呈指数级增长。这使得许多现实世界中的任务超出了强化学习方法的实际应用范围。在本工作中，我们利用示范来克服探索问题，成功地学习执行具有长时程、多步骤的连续控制型机器人任务，例如用机械臂堆叠积木。我们的方法基于深度确定性策略梯度和事后经验回放技术，在模拟机器人任务上比纯强化学习快了一个数量级。该方法实现简单，仅需额外收集一小批示范数据。此外，我们的方法还能解决仅靠强化学习或行为克隆都无法完成的任务，并且通常表现优于示范策略。\n\n***\n\n##### 强化学习智能体的自动目标生成\n\n- 论文链接：[arXiv 1705.06366](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.06366)，[主页](https:\u002F\u002Fsites.google.com\u002Fview\u002Fgoalgeneration4rl)\n- 概述： \n\n强化学习（RL）是一种强大的技术，可用于训练智能体完成特定任务；然而，通过强化学习训练得到的智能体只能实现其奖励函数所指定的单一任务。这种方法在需要智能体执行多样化任务的场景中扩展性较差，例如导航到房间内的不同位置或将物体移动到不同的地点。为此，作者提出了一种方法，使智能体能够自动发现其在环境中所能完成的任务范围。作者使用一个生成网络为智能体提出待完成的任务，每个任务被定义为到达状态空间中的某个参数化子集。生成网络通过对抗训练进行优化，以生成始终符合智能体能力水平的任务，从而自动形成一套课程体系。作者证明，借助这一框架，智能体可以在没有任何环境先验知识的情况下，高效且自动地学习执行一系列广泛的任务，即使只有稀疏奖励可用。\n\n***\n\n## 开源强化学习环境\n\n- 优秀的强化学习环境：https:\u002F\u002Fgithub.com\u002Fclvrai\u002Fawesome-rl-envs\n\n    该仓库包含一个全面的、分类整理的强化学习环境列表。\n\n- Mine Dojo：https:\u002F\u002Fgithub.com\u002FMineDojo\u002FMineDojo\n\n    ​\tMineDojo基于Minecraft构建了一个**庞大的模拟环境套件**，包含数千种多样化的任务，并提供了对73万条YouTube视频、7000页维基百科和34万篇Reddit帖子组成的**互联网规模知识库**的**开放访问权限**。\n\n\n\u003Cdiv style=\"text-align:center;\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_19946a6f172b.gif\" alt=\"img\" style=\"zoom:67%;\" \u002F>\n\u003C\u002Fdiv> \n\n- MineRL：https:\u002F\u002Fgithub.com\u002Fminerllabs\u002Fminerl ，https:\u002F\u002Fminerl.readthedocs.io\u002Fen\u002Flatest\u002F\n\n    ​\tMineRL是一个功能丰富的Python 3库，它提供了一个与[OpenAI Gym](https:\u002F\u002Fgym.openai.com\u002F)兼容的接口，用于与电子游戏Minecraft进行交互，并附带人类玩家的游戏数据集。\n\n\u003Cdiv style=\"text-align:center;\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_1652c5ea01c3.gif\" alt=\"img\"  \u002F>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_d4b9439ba136.gif\" alt=\"img\"  \u002F>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_0e757c43095f.gif\" alt=\"img\"  \u002F>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_e3b2e87cb961.gif\" alt=\"img\"  \u002F>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_444585e04339.gif\" alt=\"img\"  \u002F>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_9a1a40f4aa0b.gif\" alt=\"img\"  \u002F>\n\u003C\u002Fdiv>\n\n- ALFworld：https:\u002F\u002Fgithub.com\u002Falfworld\u002Falfworld?tab=readme-ov-file ，https:\u002F\u002Falfworld.github.io\u002F\n\n    ​\t**ALFWorld** 包含交互式的TextWorld环境（Côté等），这些环境与ALFRED数据集中体现的身体化世界相对应（Shridhar等）。这些对齐的环境允许智能体在抽象空间中进行推理并学习高层策略，然后再通过低层执行来解决具体的任务。\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_42c8d2b4b9b3.png\" style=\"zoom:50%;\" \u002F>\n\n- Skillhack：https:\u002F\u002Fgithub.com\u002Fucl-dark\u002Fskillhack\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_e7a41cc0cffa.png\" style=\"zoom: 33%;\" \u002F>\n\n- 
Minigrid：https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FMiniGrid?tab=readme-ov-file\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_5fd7fa7a296e.gif\" style=\"zoom: 50%;\" \u002F>\n\n- Crafter：https:\u002F\u002Fgithub.com\u002Fdanijar\u002Fcrafter?tab=readme-ov-file\n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_5baed70b4c4f.gif)\n\n- OpenAI procgen：https:\u002F\u002Fgithub.com\u002Fopenai\u002Fprocgen\n\n    ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_9a36f7ca75ee.gif)\n\n- Petting ZOO MPE：https:\u002F\u002Fpettingzoo.farama.org\u002Fenvironments\u002Fmpe\u002F\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_b0284afd7c94.gif\" alt=\"img\" style=\"zoom: 25%;\" \u002F> \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_d88b3bb83bea.gif\" alt=\"img\" style=\"zoom:25%;\" \u002F> \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_7e82b6c86fbb.gif\" alt=\"img\" style=\"zoom:25%;\" \u002F>\n\n- OpenAI 多智能体粒子环境：https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmultiagent-particle-envs\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_8a39b5b08932.gif\" style=\"zoom: 50%;\" \u002F>\n\n- 多智能体强化学习环境：https:\u002F\u002Fgithub.com\u002FBigpig4396\u002FMulti-Agent-Reinforcement-Learning-Environment\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_0f086f9b6021.gif\" style=\"zoom:80%;\" \u002F>\n\n- MAgent2：https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FMAgent2?tab=readme-ov-file\n\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_e3725d190000.gif\" style=\"zoom: 67%;\" \u002F>\n\n\n***\n\n\u003Ctable style=\"border-collapse: collapse; border: none;\" border=\"0\" cellpadding=\"5\" cellspacing=\"0\">\n\u003Ctr>\n\u003Ctd style=\"border: none; padding: 3;\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_05b8bf57f1b2.png\" width=\"200\" \u002F>\u003C\u002Ftd>\n\u003Ctd style=\"border: none; padding: 3;\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_a7083ec1a8e9.png\" width=\"160\" \u002F>\u003C\u002Ftd>\n\u003Ctd style=\"border: none; padding: 3;\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_readme_d88ffc418c44.png\" width=\"200\" \u002F>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>","# LLM-RL-Papers 快速上手指南\n\n**LLM-RL-Papers** 并非一个可执行的软件库或框架，而是一个**开源论文清单与知识库**。它专注于整理大语言模型（LLM）与强化学习（RL）结合的前沿研究，特别是针对**控制领域**（如游戏角色、机器人导航等）。\n\n本指南将帮助开发者快速利用该资源进行技术调研和代码复现。\n\n## 1. 环境准备\n\n由于本项目本质是文档索引，无需特定的系统运行环境，但为了高效阅读和复现论文代码，建议准备以下基础环境：\n\n*   **操作系统**: Windows, macOS 或 Linux 均可。\n*   **核心依赖**:\n    *   **Git**: 用于克隆仓库及跟踪更新。\n    *   **Markdown 阅读器**: 推荐使用 VS Code (配合 Markdown Preview Enhanced 插件) 或 Typora，以便更好地渲染目录跳转和图片。\n    *   **Python 环境 (可选)**: 若需复现清单中的具体论文代码，建议安装 `conda` 或 `venv`，并准备好 PyTorch\u002FTensorFlow 等深度学习框架。\n*   **网络环境**: 部分论文链接指向 arXiv，国内访问可能较慢，建议配置学术加速工具或使用国内镜像站。\n\n## 2. 
获取与安装步骤\n\n该项目无需“安装”，只需克隆到本地即可随时查阅。\n\n### 步骤 1: 克隆仓库\n打开终端，执行以下命令获取最新论文列表：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fllm-rl-papers\u002Fllm-rl-papers.git\n```\n\n*(注：如果上述官方源访问缓慢，可在 GitHub 上搜索是否有国内开发者维护的 Mirror 仓库，或直接下载 ZIP 包)*\n\n### 步骤 2: 进入目录\n```bash\ncd llm-rl-papers\n```\n\n### 步骤 3: 更新内容\n该仓库通过社区 PR 持续更新，建议定期拉取最新内容：\n```bash\ngit pull origin main\n```\n\n## 3. 基本使用\n\n### 方式一：本地浏览（推荐）\n直接在本地打开 `README.md` 文件。利用 Markdown 的目录结构，你可以快速定位到感兴趣的研究方向。\n\n**使用场景示例：查找“奖励函数设计”相关论文**\n1. 在 `README.md` 中找到 **LLM RL Papers [sort by method]** 章节。\n2. 定位到 **Reward Function** 子类别。\n3. 点击链接（如 `Eureka: Human-Level Reward Design...`）直接跳转至论文详情或 arXiv 链接。\n4. 根据提供的论文标题，在 arXiv 或 Google Scholar 搜索源码进行复现。\n\n### 方式二：在线查阅\n直接访问项目的 GitHub 仓库页面，利用右侧的目录导航栏快速跳转。\n\n### 核心分类速查\n该清单按方法论对论文进行了分类，开发者可根据需求直接检索：\n\n| 分类方向 | 典型应用场景 | 代表论文关键词 |\n| :--- | :--- | :--- |\n| **Action (动作)** | 机器人控制、交通信号优化 | iLLM-TSC, Octopus, Inner Monologue |\n| **Reward Function (奖励)** | 自动化奖励设计、技能发现 | Eureka, Text2Reward, Auto MC-Reward |\n| **Data Generation (数据)** | 提升样本效率、场景生成 | RLingua, EnvGen |\n| **State Representation (状态)** | 自然语言状态表示、可解释性 | LLM-Empowered State, State2Explanation |\n| **Task Suggestion (任务)** | 开放世界任务规划、持续学习 | AutoRT, ExpeL, RLAdapter |\n\n### 贡献指南\n如果你阅读了优质的相关论文并希望分享：\n1.  Fork 本仓库。\n2.  按照现有的 Markdown 格式，将论文信息添加到对应的分类下。\n3.  提交 Pull Request (PR)。\n\n---\n*提示：本工具核心价值在于“索引”与“分类”。对于清单中列出的具体论文（如 Eureka, RT-2 等），请前往其原始仓库获取具体的代码安装和运行指令。*","某机器人实验室团队正致力于开发能理解自然语言指令并在动态环境中自主导航的服务机器人，急需融合大语言模型（LLM）的语义理解与强化学习（RL）的决策控制能力。\n\n### 没有 LLM-RL-Papers 时\n- **信息检索低效**：研究人员需在 arXiv 上手动筛选海量论文，难以快速定位同时涉及\"LLM\"与\"RL\"且专注于“控制”领域的最新成果。\n- **技术路线迷茫**：面对分散的研究，团队难以厘清如“直接动作生成”或“间接策略指导”等具体技术路径的优劣，导致实验方向反复试错。\n- **前沿案例缺失**：缺乏类似 `SRLM`（人机交互导航）或 `RE-MOVE`（动态环境自适应）等具体落地案例参考，算法设计往往脱离实际场景需求。\n- **协作壁垒高**：团队成员各自阅读文献，缺乏统一的知识库共享机制，导致重复劳动且难以形成合力突破技术瓶颈。\n\n### 使用 LLM-RL-Papers 后\n- **精准追踪前沿**：团队直接利用该工具监控 arXiv 交叉研究，瞬间获取按方法分类的最新论文列表，将文献调研时间从数天缩短至几小时。\n- **清晰技术图谱**：通过工具整理的综述（如 Taxonomy Tree）和分类（Direct\u002FIndirect Action），团队迅速锁定了适合动态导航的“基于语言反馈的策略改进”路线。\n- **复用成熟方案**：参考列表中 `Octopus` 和 `Inner Monologue` 等具体论文的实验设置，团队快速复现了基线模型，显著提升了机器人对复杂指令的响应准确率。\n- **高效知识协同**：团队成员将验证有效的论文通过 PR 贡献回仓库，构建了实验室专属的动态知识库，加速了从理论到实机部署的迭代周期。\n\nLLM-RL-Papers 通过结构化聚合跨领域前沿成果，将原本分散的探索转化为可控的技术演进，极大加速了具身智能系统的研发落地。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FWindyLab_LLM-RL-Papers_486f9160.png","WindyLab","Intelligent Unmanned Systems Laboratory","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FWindyLab_90fef6f0.png","Our lab focuses on novel theories and applications of robotic systems.","Westlake University","Hangzhou, China",null,"https:\u002F\u002Fshiyuzhao.westlake.edu.cn\u002F","https:\u002F\u002Fgithub.com\u002FWindyLab",550,36,"2026-04-17T06:06:31",1,"","未说明",{"notes":88,"python":86,"dependencies":89},"该仓库是一个论文综述列表（Awesome List），用于整理和分类大语言模型（LLM）与强化学习（RL）结合的研究论文。它不包含可执行的源代码、安装脚本或具体的运行环境配置要求。所列出的条目均为指向外部论文链接的索引，因此没有特定的操作系统、GPU、内存、Python 版本或依赖库需求。",[],[35,14],[92,93,94,95,96],"papers","control","docs","llm","reinfrocement-learning","2026-03-27T02:49:30.150509","2026-04-17T21:44:53.767386",[],[]]