[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-opendilab--awesome-model-based-RL":3,"tool-opendilab--awesome-model-based-RL":65},[4,23,32,40,49,57],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":22},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,2,"2026-04-05T10:45:23",[13,14,15,16,17,18,19,20,21],"图像","数据工具","视频","插件","Agent","其他","语言模型","开发框架","音频","ready",{"id":24,"name":25,"github_repo":26,"description_zh":27,"stars":28,"difficulty_score":29,"last_commit_at":30,"category_tags":31,"status":22},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,3,"2026-04-04T04:44:48",[17,13,20,19,18],{"id":33,"name":34,"github_repo":35,"description_zh":36,"stars":37,"difficulty_score":29,"last_commit_at":38,"category_tags":39,"status":22},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",74939,"2026-04-05T23:16:38",[19,13,20,18],{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":46,"last_commit_at":47,"category_tags":48,"status":22},3215,"awesome-machine-learning","josephmisiti\u002Fawesome-machine-learning","awesome-machine-learning 是一份精心整理的机器学习资源清单，汇集了全球优秀的机器学习框架、库和软件工具。面对机器学习领域技术迭代快、资源分散且难以甄选的痛点，这份清单按编程语言（如 Python、C++、Go 等）和应用场景（如计算机视觉、自然语言处理、深度学习等）进行了系统化分类，帮助使用者快速定位高质量项目。\n\n它特别适合开发者、数据科学家及研究人员使用。无论是初学者寻找入门库，还是资深工程师对比不同语言的技术选型，都能从中获得极具价值的参考。此外，清单还延伸提供了免费书籍、在线课程、行业会议、技术博客及线下聚会等丰富资源，构建了从学习到实践的全链路支持体系。\n\n其独特亮点在于严格的维护标准：明确标记已停止维护或长期未更新的项目，确保推荐内容的时效性与可靠性。作为机器学习领域的“导航图”，awesome-machine-learning 以开源协作的方式持续更新，旨在降低技术探索门槛，让每一位从业者都能高效地站在巨人的肩膀上创新。",72149,1,"2026-04-03T21:50:24",[20,18],{"id":50,"name":51,"github_repo":52,"description_zh":53,"stars":54,"difficulty_score":46,"last_commit_at":55,"category_tags":56,"status":22},2234,"scikit-learn","scikit-learn\u002Fscikit-learn","scikit-learn 是一个基于 Python 构建的开源机器学习库，依托于 SciPy、NumPy 等科学计算生态，旨在让机器学习变得简单高效。它提供了一套统一且简洁的接口，涵盖了从数据预处理、特征工程到模型训练、评估及选择的全流程工具，内置了包括线性回归、支持向量机、随机森林、聚类等在内的丰富经典算法。\n\n对于希望快速验证想法或构建原型的数据科学家、研究人员以及 Python 开发者而言，scikit-learn 是不可或缺的基础设施。它有效解决了机器学习入门门槛高、算法实现复杂以及不同模型间调用方式不统一的痛点，让用户无需重复造轮子，只需几行代码即可调用成熟的算法解决分类、回归、聚类等实际问题。\n\n其核心技术亮点在于高度一致的 API 
### keras-team/keras (63,927 stars, last commit 2026-04-04)

Keras is a deep-learning framework designed for humans, aiming to make building and training neural networks simple and intuitive. It addresses several pain points: the difficulty of switching between deep-learning backends, slow model-development cycles, and the tension between convenient debugging and runtime performance.

Students just getting started, researchers focused on algorithms, and engineers shipping products can all get productive quickly, across computer vision, natural language processing, audio analysis, and time-series forecasting.

The core highlight of Keras 3 is its multi-backend architecture. You write one set of model code and choose TensorFlow, JAX, PyTorch, or OpenVINO as the underlying engine. This keeps Keras's trademark high-level ergonomics while letting developers pick what fits: eager execution via JAX or PyTorch for convenient debugging, or the fastest backend for a claimed speedup of up to 350%. Keras also scales smoothly from a local laptop to large GPU or TPU clusters, making it a practical bridge between prototyping and production deployment.
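The backend switch is a single environment variable that must be set before `keras` is imported. A minimal sketch (the toy model and data are illustrative assumptions, not from the Keras docs):

```python
# Keras 3 multi-backend sketch: select the backend before importing keras.
# "jax" could equally be "tensorflow" or "torch"; the model code is unchanged.
import os
os.environ["KERAS_BACKEND"] = "jax"

import keras
import numpy as np

model = keras.Sequential([
    keras.layers.Input(shape=(8,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Toy data, illustrative only.
X = np.random.rand(256, 8).astype("float32")
y = (X.sum(axis=1) > 4).astype("float32")
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))
```

Swapping the string to `"tensorflow"` or `"torch"` re-runs the identical model code on another engine, which is the portability the description highlights.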
## opendilab/awesome-model-based-RL

> A curated list of awesome model based RL resources (continually updated)

awesome-model-based-RL is a curated resource hub for model-based reinforcement learning (model-based RL). It systematically collects and organizes the field's core research papers, a taxonomy of classic algorithms, tutorials, and open-source implementations, helping practitioners track the frontier quickly.

Traditional reinforcement-learning methods often need extensive trial and error, whereas model-based methods plan actions against a learned environment model and can be markedly more sample-efficient. The literature, however, is vast and fast-moving, and researchers struggle with scattered material and hard-to-track results. This list addresses exactly that problem: it continuously indexes new papers from top venues such as NeurIPS, ICML, and ICLR (currently through 2025) and provides a clear algorithm taxonomy that organizes the space along two axes, "learning the model" and "using the model", so that readers can build a systematic mental map.

The resource suits AI researchers, algorithm engineers, and students interested in deep reinforcement learning. Newcomers and seasoned researchers alike will find high-quality references here, and its continuous maintenance and structured organization make the path toward sample-efficient RL considerably clearer.

---

# Awesome Model-Based Reinforcement Learning

[![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome) [![docs](https://img.shields.io/badge/docs-latest-blue)](https://github.com/opendilab/awesome-model-based-RL) ![GitHub stars](https://img.shields.io/github/stars/opendilab/awesome-model-based-RL?color=yellow) ![GitHub forks](https://img.shields.io/github/forks/opendilab/awesome-model-based-RL?color=9cf) [![GitHub license](https://img.shields.io/github/license/opendilab/awesome-model-based-RL)](https://github.com/opendilab/awesome-model-based-RL/blob/main/LICENSE)

This is a collection of research papers for **model-based reinforcement learning (MBRL)**.
The repository is continuously updated to track the frontier of model-based RL.

Welcome to follow and star!

<pre name="code" class="html">
<font color="red">[2025.12.01] <b>New: We update the NeurIPS 2025 paper list of model-based rl!</b></font>

[2025.08.28] We update the ICML 2025 paper list of model-based rl.

[2025.02.06] We update the ICLR 2025 paper list of model-based rl.

[2024.10.27] We update the NeurIPS 2024 paper list of model-based rl.

[2024.05.20] We update the ICML 2024 paper list of model-based rl.

[2023.11.29] We update the ICLR 2024 paper list of model-based rl.

[2023.09.29] We update the NeurIPS 2023 paper list of model-based rl.

[2023.06.15] We update the ICML 2023 paper list of model-based rl.

[2023.02.05] We update the ICLR 2023 paper list of model-based rl.

[2022.11.03] We update the NeurIPS 2022 paper list of model-based rl.

[2022.07.06] We update the ICML 2022 paper list of model-based rl.

[2022.02.13] We update the ICLR 2022 paper list of model-based rl.

[2021.12.28] We release the awesome model-based rl.
</pre>


## Table of Contents

- [Awesome Model-Based Reinforcement Learning](#awesome-model-based-reinforcement-learning)
  - [Table of Contents](#table-of-contents)
  - [A Taxonomy of Model-Based RL Algorithms](#a-taxonomy-of-model-based-rl-algorithms)
  - [Papers](#papers)
    - [Classic Model-Based RL Papers](#classic-model-based-rl-papers)
    - [NeurIPS 2025](#neurips-2025)
    - [ICML 2025](#icml-2025)
    - [ICLR 2025](#iclr-2025)
    - [NeurIPS 2024](#neurips-2024)
    - [ICML 2024](#icml-2024)
    - [ICLR 2024](#iclr-2024)
    - [NeurIPS 2023](#neurips-2023)
    - [ICML 2023](#icml-2023)
    - [ICLR 2023](#iclr-2023)
    - [NeurIPS 2022](#neurips-2022)
    - [ICML 2022](#icml-2022)
    - [ICLR 2022](#iclr-2022)
    - [NeurIPS 2021](#neurips-2021)
    - [ICLR 2021](#iclr-2021)
    - [ICML 2021](#icml-2021)
    - [Other](#other)
  - [Tutorial](#tutorial)
  - [Codebase](#codebase)
  - [Contributing](#contributing)
  - [License](#license)


## A Taxonomy of Model-Based RL Algorithms

We'll start this section with a disclaimer: it's really quite hard to draw an accurate, all-encompassing taxonomy of algorithms in the Model-Based RL space, because the modularity of algorithms is not well-represented by a tree structure. So we will publish a series of related blogs to explain more Model-Based RL algorithms.

<p align="center">
    <img style="border-radius: 0.3125em;
    box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);"
    src="https://oss.gittoolsai.com/images/opendilab_awesome-model-based-RL_readme_553abf8de4ad.png">
    <br>
    <em style="display: inline-block;">A non-exhaustive, but useful taxonomy of algorithms in modern Model-Based RL.</em>
</p>

We simply divide `Model-Based RL` into two categories: `Learn the Model` and `Given the Model`.

- `Learn the Model` mainly focuses on how to build the environment model.

- `Given the Model` cares about how to utilize the learned model.

We give some examples in the figure above; the numbered entries below link to each algorithm. A minimal code sketch of how the two halves fit together follows the reference list.

>[1] [World Models](https://worldmodels.github.io/): Ha and Schmidhuber, 2018  
[2] [I2A](https://arxiv.org/abs/1707.06203) (Imagination-Augmented Agents): Weber et al, 2017  
[3] [MBMF](https://sites.google.com/view/mbmf) (Model-Based RL with Model-Free Fine-Tuning): Nagabandi et al, 2017  
[4] [MBVE](https://arxiv.org/abs/1803.00101) (Model-Based Value Expansion): Feinberg et al, 2018  
[5] [ExIt](https://arxiv.org/abs/1705.08439) (Expert Iteration): Anthony et al, 2017  
[6] [AlphaZero](https://arxiv.org/abs/1712.01815): Silver et al, 2017  
[7] [POPLIN](https://openreview.net/forum?id=H1exf64KwH) (Model-Based Policy Planning): Wang et al, 2019  
[8] [M2AC](https://arxiv.org/abs/2010.04893) (Masked Model-based Actor-Critic): Pan et al, 2020
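To make the two branches concrete, here is a minimal tabular Dyna-Q sketch (our illustration, not code from this repository): the inner loop *learns the model* by caching observed transitions, and the planning loop *uses the model* to generate imagined updates. The Gym-style `env` interface and hashable states are assumptions.

```python
# Minimal tabular Dyna-Q sketch illustrating the taxonomy:
# "learn the model" = store observed (s, a) -> (r, s', done) transitions;
# "use the model"   = replay imagined transitions for extra Q updates.
import random
from collections import defaultdict

def dyna_q(env, episodes=100, planning_steps=10, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)          # Q[(state, action)] -> value
    model = {}                      # learned model: (s, a) -> (r, s', done)
    actions = list(range(env.action_space.n))

    for _ in range(episodes):
        s, done = env.reset(), False   # assumed classic Gym-style API
        while not done:
            # epsilon-greedy action selection
            a = (random.choice(actions) if random.random() < eps
                 else max(actions, key=lambda a_: Q[(s, a_)]))
            s2, r, done, _ = env.step(a)

            # 1) Direct RL: one Q-learning update from the real transition.
            target = r if done else r + gamma * max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])

            # 2) Learn the model: remember what the environment did.
            model[(s, a)] = (r, s2, done)

            # 3) Use the model: extra updates on imagined transitions.
            for _ in range(planning_steps):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                ptarget = pr if pdone else pr + gamma * max(Q[(ps2, a_)] for a_ in actions)
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])

            s = s2
    return Q
```

Most entries under `Learn the Model` replace the transition cache with a learned (often neural) dynamics model, and the planning loop with imagined rollouts through it.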
## Papers

```
format:
- [title](paper link) [links]
  - author1, author2, and author3
  - Key: key problems and insights
  - OpenReview: optional
  - ExpEnv: experiment environments
```

### Classic Model-Based RL Papers

<details open>
<summary>Toggle</summary>

- [Dyna, an integrated architecture for learning, planning, and reacting](https://dl.acm.org/doi/10.1145/122344.122377)
  - Richard S. Sutton. *ACM 1991*
  - Key: dyna architecture
  - ExpEnv: None

- [PILCO: A Model-Based and Data-Efficient Approach to Policy Search](https://www.researchgate.net/publication/221345233_PILCO_A_Model-Based_and_Data-Efficient_Approach_to_Policy_Search)
  - Marc Peter Deisenroth, Carl Edward Rasmussen. *ICML 2011*
  - Key: probabilistic dynamics model
  - ExpEnv: cart-pole system, robotic unicycle

- [Learning Complex Neural Network Policies with Trajectory Optimization](https://proceedings.mlr.press/v32/levine14.html)
  - Sergey Levine, Vladlen Koltun. *ICML 2014*
  - Key: guided policy search
  - ExpEnv: [mujoco](https://github.com/openai/mujoco-py)

- [Learning Continuous Control Policies by Stochastic Value Gradients](https://arxiv.org/abs/1510.09142)
  - Nicolas Heess, Greg Wayne, David Silver, Timothy Lillicrap, Yuval Tassa, Tom Erez. *NIPS 2015*
  - Key: backpropagation through paths, gradient on real trajectory
  - ExpEnv: [mujoco](https://github.com/openai/mujoco-py)

- [Value Prediction Network](https://arxiv.org/abs/1707.03497)
  - Junhyuk Oh, Satinder Singh, Honglak Lee. *NIPS 2017*
  - Key: value-prediction model  <!-- VE? -->
  - ExpEnv: collect domain, [atari](https://github.com/openai/gym)

- [Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion](https://arxiv.org/abs/1807.01675)
  - Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, Honglak Lee. *NIPS 2018*
  - Key: ensemble model and Q-net, value expansion
  - ExpEnv: [mujoco](https://github.com/openai/mujoco-py), [roboschool](https://github.com/openai/roboschool)

- [Recurrent World Models Facilitate Policy Evolution](https://arxiv.org/abs/1809.01999)
  - David Ha, Jürgen Schmidhuber. *NIPS 2018*
  - Key: vae (representation), rnn (predictive model)
  - ExpEnv: [car racing](https://github.com/openai/gym), [vizdoom](https://github.com/mwydmuch/ViZDoom)
- [Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models](https://arxiv.org/abs/1805.12114)
  - Kurtland Chua, Roberto Calandra, Rowan McAllister, Sergey Levine. *NIPS 2018*
  - Key: probabilistic ensembles with trajectory sampling
  - ExpEnv: [cartpole](https://github.com/openai/gym), [mujoco](https://github.com/openai/mujoco-py)

- [When to Trust Your Model: Model-Based Policy Optimization](https://arxiv.org/abs/1906.08253)
  - Michael Janner, Justin Fu, Marvin Zhang, Sergey Levine. *NeurIPS 2019*
  - Key: ensemble model, sac, *k*-branched rollout
  - ExpEnv: [mujoco](https://github.com/openai/mujoco-py)

- [Algorithmic Framework for Model-based Deep Reinforcement Learning with Theoretical Guarantees](https://arxiv.org/abs/1807.03858)
  - Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, Tengyu Ma. *ICLR 2019*
  - Key: discrepancy bounds design, ME-TRPO with multi-step, entropy regularization
  - ExpEnv: [mujoco](https://github.com/openai/mujoco-py)

- [Model-Ensemble Trust-Region Policy Optimization](https://openreview.net/forum?id=SJJinbWRZ)
  - Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, Pieter Abbeel. *ICLR 2018*
  - Key: ensemble model, TRPO
  <!-- - OpenReview: 7, 7, 6 -->
  - ExpEnv: [mujoco](https://github.com/openai/mujoco-py)

- [Dream to Control: Learning Behaviors by Latent Imagination](https://arxiv.org/abs/1912.01603)
  - Danijar Hafner, Timothy Lillicrap, Jimmy Ba, Mohammad Norouzi. *ICLR 2020*
  - Key: DreamerV1, latent space imagination
  - ExpEnv: [deepmind control suite](https://github.com/deepmind/dm_control), [atari](https://github.com/openai/gym), [deepmind lab](https://github.com/deepmind/lab)

- [Exploring Model-based Planning with Policy Networks](https://openreview.net/forum?id=H1exf64KwH)
  - Tingwu Wang, Jimmy Ba. *ICLR 2020*
  - Key: model-based policy planning in action space and parameter space
  - ExpEnv: [mujoco](https://github.com/openai/mujoco-py)

- [Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model](https://arxiv.org/abs/1911.08265)
  - Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, David Silver. *Nature 2020*
  - Key: MCTS, value equivalence
  - ExpEnv: chess, shogi, go, [atari](https://github.com/openai/gym)

</details>
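Several classic entries above (PILCO, PETS, POPLIN) are at heart "plan with the learned model" methods. The sketch below shows the simplest version of that idea, random-shooting model-predictive control; the one-step `model(state, action) -> (next_state, reward)` signature is a hypothetical stand-in for a learned dynamics model, not an API from any of these papers.

```python
# Random-shooting MPC sketch: roll candidate action sequences through a
# learned model and execute only the best first action, replanning each step.
import numpy as np

def plan_first_action(model, state, action_dim, horizon=15, n_candidates=500, gamma=0.99):
    # Sample candidate action sequences uniformly in [-1, 1] (illustrative).
    candidates = np.random.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)

    for i, actions in enumerate(candidates):
        s, discount = state, 1.0
        for a in actions:            # imagined rollout, no environment interaction
            s, r = model(s, a)       # hypothetical learned one-step predictor
            returns[i] += discount * r
            discount *= gamma

    best = int(np.argmax(returns))
    return candidates[best, 0]       # MPC: commit to the first action only
```

PETS refines this by propagating particles through an ensemble of probabilistic models instead of a single deterministic rollout, and POPLIN replaces the uniform sampling with proposals from a policy network.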
### NeurIPS 2025

<details open>
<summary>Toggle</summary>

- [Stable Planning through Aligned Representations in Model-Based Reinforcement Learning](https://openreview.net/forum?id=Uv7V1gTOjK)
  - Misagh Soltani, Forest Agostinelli. *NeurIPS 2025*
  - Key: visual planning, aligned representations, discrete latent state, heuristic search
  - ExpEnv: Rubik's Cube, Sokoban

- [RLVR-World: Training World Models with Reinforcement Learning](https://openreview.net/forum?id=jpiSagi8aV)
  - Mingsheng Long, et al. *NeurIPS 2025*
  - Key: world model training, decision-aware, verifiable rewards
  - ExpEnv: text games, robot manipulation

- [Dyn-O: Building Structured World Models with Object-Centric Representations](https://arxiv.org/abs/2507.03298)
  - Microsoft Research et al. *NeurIPS 2025*
  - Key: structured world models, object-centric, physics modeling
  - ExpEnv: physical interaction, object manipulation

- [Off-policy Reinforcement Learning with Model-based Exploration Augmentation](https://openreview.net/forum?id=JGkZgEEjiM)
  - Anonymous et al. *NeurIPS 2025*
  - Key: exploration, diffusion model, synthetic experience, data augmentation
  - ExpEnv: mujoco, sparse reward tasks

- [Revisiting Multi-Agent World Modeling from a Diffusion-Inspired Perspective](https://openreview.net/forum?id=rRxFIOoEeF)
  - Xiu Li, et al. *NeurIPS 2025*
  - Key: multi-agent MBRL, diffusion-inspired, sequence modeling, joint distribution
  - ExpEnv: SMAC, MPE

- [SPiDR: A Simple Approach for Zero-Shot Safety in Sim-to-Real Transfer](https://openreview.net/forum?id=Pe1ypX9gBO)
  - Yarden As, Chengrui Qu, Benjamin Unger, Dongho Kang, Max van der Hart, Laixi Shi, Stelian Coros, Adam Wierman, Andreas Krause. *NeurIPS 2025*
  - Key: safe MBRL, sim-to-real, ensemble uncertainty, robust control
  - ExpEnv: real-world robotics, safety gym

- [Improving Model-Based Reinforcement Learning by Converging to Flatter Minima](https://openreview.net/pdf?id=vcB1OwtWUZ)
  - Shrinivas Ramasubramanian, Benjamin Freed, Alexandre Capone, Jeff Schneider. *NeurIPS 2025*
  - Key: model error, simulation lemma, model generalization
  - ExpEnv: DMC, Atari100k, HumanoidBench

</details>

### ICML 2025

<details open>
<summary>Toggle</summary>

- [Improving Transformer World Models for Data-Efficient RL](https://openreview.net/forum?id=IajCvMJw41)
  - Antoine Dedieu, Joseph Ortiz, Xinghua Lou, Carter Wendelken, Wolfgang Lehrach, J Swaroop Guntupalli, Miguel Lazaro-Gredilla, Kevin Murphy
  - Key: dyna with warmup, patch nearest-neighbor tokenization, block teacher forcing
  - OpenReview: 4, 4, 4, 3
  - ExpEnv: craftax-classic

- [Stealing That Free Lunch: Exposing the Limits of Dyna-Style Reinforcement Learning](https://openreview.net/forum?id=Zt05jXhqXx)
  - Brett Barkley, David Fridovich-Keil
  - Key: Dyna-style algorithms significantly degrade performance across most DMC environments
  - OpenReview: 4, 4, 3, 2
  - ExpEnv: gym, DeepMind Control Suite

- [Knowledge Retention in Continual Model-Based Reinforcement Learning](https://openreview.net/forum?id=DiqeZY27XK)
  - Haotian Fu, Yixiang Sun, Michael L. Littman, George Konidaris
  - Key: synthetic experience rehearsal, regaining memories through exploration
  - OpenReview: 4, 3, 3, 3
  - ExpEnv: mini-grid, deepmind control suite
- [Time-Aware World Model for Adaptive Prediction and Control](https://openreview.net/forum?id=gZ5N3TLjwv)
  - Anh N Nhu, Sanghyun Son, Ming Lin
  - Key: condition on the time-step size ∆t and train over a diverse range of ∆t values
  - OpenReview: 4, 3, 3
  - ExpEnv: meta-world control tasks, PDE-control tasks

- [Video-Enhanced Offline Reinforcement Learning: A Model-Based Approach](https://arxiv.org/abs/2505.06482)
  - Minting Pan, Yitao Zheng, Jiajian Li, Yunbo Wang, Xiaokang Yang
  - Key: behavior abstraction network, hierarchical world model
  - OpenReview: 3, 3, 3, 2
  - ExpEnv: meta-world, carla, minedojo

- [Temporal Distance-aware Transition Augmentation for Offline Model-based Reinforcement Learning](https://openreview.net/forum?id=drBVowFvqf)
  - Dongsu Lee, Minhae Kwon
  - Key: learn a latent abstraction that captures temporal distance at both the trajectory and transition levels of the state space
  - OpenReview: 4, 3, 3, 2
  - ExpEnv: D4RL, AntMaze, FrankaKitchen, CALVIN, pixel-based FrankaKitchen

- [PIGDreamer: Privileged Information Guided World Models for Safe Partially Observable Reinforcement Learning](https://openreview.net/forum?id=mtk8tTKWs0)
  - Dongchi Huang, Jiaqi Wang, Yang Li, Chunhe Xia, Tianle Zhang, Kaige Zhang
  - Key: leverage privileged information through privileged representation alignment and an asymmetric actor-critic structure
  - OpenReview: 3, 3, 3
  - ExpEnv: safety gymnasium benchmark, guard benchmark

- [Reward-free World Models for Online Imitation Learning](https://openreview.net/forum?id=owEhpoKBKC)
  - Shangzhe Li, Zhiao Huang, Hao Su
  - Key: reward-free world model, inverse soft-Q learning objective
  - OpenReview: 4, 3, 3, 3
  - ExpEnv: DMControl, MyoSuite, ManiSkill2

- [FOUNDER: Grounding Foundation Models in World Models for Open-Ended Embodied Decision Making](https://openreview.net/forum?id=UTT5OTyIWm)
  - Yucen Wang, Rui Yu, Shenghua Wan, Le Gan, De-Chuan Zhan
  - Key: ground FM representations into the WM state space, model-based goal-conditioned RL
  - OpenReview: 4, 3, 3, 3
  - ExpEnv: DMControl, Kitchen, minecraft

- [Continual Reinforcement Learning by Planning with Online World Models](https://openreview.net/forum?id=mQeZEsdODh)
  - Zichen Liu, Guoji Fu, Chao Du, Wee Sun Lee, Min Lin
  - Key: plan with online world model, regret analysis
  - OpenReview: 4, 4, 4, 3
  - ExpEnv: [ContinualBench](https://github.com/sail-sg/ContinualBench/tree/main/continual_bench/envs)

- [Scaling Laws for Pre-training Agents and World Models](https://openreview.net/pdf?id=HHwGfLOKxq)
  - Tim Pearce\*, Tabish Rashid\*, David Bignell, Raluca Georgescu, Sam Devlin, Katja Hofmann
  - Key: scaling laws, embodied AI, behavior cloning, world modeling, tokenizer, architecture
  - ExpEnv: Bleeding Edge, RT-1 (robotics), Atari, NetHack

- [DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning](https://openreview.net/pdf?id=D5RNACOZEI)
  - Gaoyue Zhou, Hengkai Pan, Yann LeCun, Lerrel Pinto
  - Key: world models, offline learning, zero-shot planning, pretrained visual features, task-agnostic reasoning
  - ExpEnv: Maze, Wall, Reach, Push-T, Rope Manipulation, Granular Manipulation
- [General agents need world models](https://openreview.net/pdf?id=dlIoumNiXt)
  - Jonathan Richens, Tom Everitt, David Abel
  - Key: world models, goal-directed behavior, model-free learning, policy analysis, regret bounds
  - ExpEnv: synthetic controlled Markov process (cMP) environments with varying sample trajectories and goal depths

- [RobustZero: Enhancing MuZero Reinforcement Learning Robustness to State Perturbations](https://openreview.net/pdf?id=DaOdkXgLvE)
  - Yushuai Li, Hengyu Liu, Torben Bach Pedersen, Yuqiang He, Kim Guldstrand Larsen, Lu Chen, Christian S. Jensen, Jiachen Xu, Tianyi Li
  - Key: MuZero, robustness, reinforcement learning, state perturbations, self-supervised learning, adaptive adjustment
  - ExpEnv: CartPole, Pendulum, IEEE 34-bus, IEEE 123-bus, IEEE 8500-node, Highway, Intersection, Racetrack, Hopper, Walker2d, HalfCheetah, Ant

- [Accurate and Efficient World Modeling with Masked Latent Transformers](https://openreview.net/pdf?id=zNUOZcAUxz)
  - Maxime Burchi, Radu Timofte
  - Key: model-based reinforcement learning, world models, MaskGIT, spatial latent space, Dreamer, Transformer, efficiency
  - ExpEnv: Crafter, Atari 100k

- [Trajectory World Models for Heterogeneous Environments](https://openreview.net/forum?id=Py2KmXaRmi)
  - Shaofeng Yin, Jialong Wu, Siqiao Huang, Xingjian Su, Xu He, Jianye Hao, Mingsheng Long
  - Key: world models, heterogeneous environments, pre-training, in-context learning, model transfer, trajectory data
  - ExpEnv: UniTraj (80 diverse environments), D4RL (HalfCheetah, Hopper, Walker2D), Cart-2-Pole, Cart-3-Pole

- [A Causal World Model Underlying Next Token Prediction: Exploring GPT in a Controlled Environment](https://openreview.net/pdf?id=qA3xHJzF6B)
  - Raanan Y. Rohekar, Yaniv Gurwicz, Sungduk Yu, Estelle Aflalo, Vasudev Lal
  - Key: GPT, causal inference, attention mechanism, structural causal model, zero-shot causal discovery
  - ExpEnv: Othello, Chess
</details>

### ICLR 2025

<details open>
<summary>Toggle</summary>

- [Learning Transformer-based World Models with Contrastive Predictive Coding](https://openreview.net/forum?id=YK9G4Htdew)
  - Maxime Burchi, Radu Timofte
  - Key: model-based reinforcement learning, transformer network, contrastive predictive coding
  - ExpEnv: Atari 100k benchmark

- [Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation](https://openreview.net/forum?id=meRCKuUpmc)
  - Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, Jiangmiao Pang
  - Key: Robotic Manipulation, Pre-training, Visual Foresight, Inverse Dynamics, Large-scale robot dataset
  - ExpEnv: LIBERO-LONG benchmark, CALVIN ABC-D, real-world tasks

- [OptionZero: Planning with Learned Options](https://openreview.net/forum?id=3IFRygQKGL)
  - Po-Wei Huang, Pei-Chiun Peng, Hung Guei, Ti-Rong Wu
  - Key: Option, Semi-MDP, MuZero, MCTS, Planning, Reinforcement Learning
  - ExpEnv: Atari

- [MAD-TD: Model-Augmented Data stabilizes High Update Ratio RL](https://openreview.net/forum?id=6RtRsg8ZV1)
  - Claas A Voelcker, Marcel Hussing, Eric Eaton, Amir-massoud Farahmand, Igor Gilitschenski
  - Key: reinforcement learning, model based reinforcement learning, data augmentation, high update ratios
  - ExpEnv: DeepMind Control Suite

- [Kinetix: Investigating the Training of General Agents through Open-Ended Physics-Based Control Tasks](https://openreview.net/forum?id=zCxGCdzreM)
  - Michael Matthews, Michael Beukman, Chris Lu, Jakob Nicolaus Foerster
  - Key: Reinforcement Learning, Open-Endedness, Unsupervised Environment Design, Automatic Curriculum Learning, Benchmark
  - ExpEnv: 2D Physics-Based Tasks, Robotic Locomotion, Grasping, Video Games, Classic RL Environments

- [Learning to Search from Demonstration Sequences](https://openreview.net/forum?id=v593OaNePQ)
  - Dixant Mittal, Liwei Kang, Wee Sun Lee
  - Key: Planning, Reasoning, Learning to Search, Reinforcement Learning, Large Language Model
  - ExpEnv: Game of 24, 2D Grid Navigation, Procgen Games

- [Open-World Reinforcement Learning over Long Short-Term Imagination](https://openreview.net/forum?id=vzItLaEoDa)
  - Jiajian Li, Qi Wang, Yunbo Wang, Xin Jin, Yang Li, Wenjun Zeng, Xiaokang Yang
  - Key: Reinforcement Learning, World Models, Visual Control
  - ExpEnv: MineDojo

- [MaestroMotif: Skill Design from Artificial Intelligence Feedback](https://openreview.net/forum?id=or8mMhmyRV)
  - Martin Klissarov, Mikael Henaff, Roberta Raileanu, Shagun Sodhani, Pascal Vincent, Amy Zhang, Pierre-Luc Bacon, Doina Precup, Marlos C. Machado, Pierluca D'Oro
  - Key: Hierarchical RL, Reinforcement Learning, LLMs
  - ExpEnv: NetHack Learning Environment (NLE)
- [Geometry-aware RL for Manipulation of Varying Shapes and Deformable Objects](https://openreview.net/forum?id=7BLXhmWvwF)
  - Tai Hoang, Huy Le, Philipp Becker, Vien Anh Ngo, Gerhard Neumann
  - Key: Robotic Manipulation, Equivariance, Graph Neural Networks, Reinforcement Learning, Deformable Objects
  - ExpEnv: Rigid insertion, rope manipulation, cloth manipulation with multiple end-effectors

- [M^3PC: Test-time Model Predictive Control using Pretrained Masked Trajectory Model](https://openreview.net/forum?id=inOwd7hZC1)
  - Kehan Wen, Yutong Hu, Yao Mu, Lei Ke
  - Key: Offline-to-Online Reinforcement Learning, Model-based Reinforcement Learning, Masked Autoencoding, Robot Learning
  - ExpEnv: D4RL, RoboMimic

- [Offline Model-Based Optimization by Learning to Rank](https://openreview.net/forum?id=sb1HgVDLjN)
  - Rong-Xi Tan, Ke Xue, Shen-Huan Lyu, Haopu Shang, Yao Wang, Yaoyuan Wang, Fu Sheng, Chao Qian
  - Key: Offline model-based optimization, black-box optimization, learning to rank, learning to optimize
  - ExpEnv: Diverse tasks across optimization scenarios

- [Monte Carlo Planning with Large Language Model for Text-Based Games](https://openreview.net/forum?id=r1KcapkzCt)
  - Zijing Shi, Meng Fang, Ling Chen
  - Key: Large language model, Monte Carlo tree search, Text-based games
  - ExpEnv: Jericho benchmark

- [Interpreting Emergent Planning in Model-Free Reinforcement Learning](https://openreview.net/forum?id=DzGe40glxs)
  - Thomas Bush, Stephen Chung, Usman Anwar, Adrià Garriga-Alonso, David Krueger
  - Key: reinforcement learning, interpretability, planning, probes, model-free, mechanistic interpretability, sokoban
  - ExpEnv: Sokoban

- [Drama: Mamba-Enabled Model-Based Reinforcement Learning Is Sample and Parameter Efficient](https://openreview.net/forum?id=7XIkRgYjK3)
  - Wenlong Wang, Ivana Dusparic, Yucheng Shi, Ke Zhang, Vinny Cahill
  - Key: Mamba-2, Model based reinforcement learning, Mamba, State space models
  - ExpEnv: Atari 100K

- [Zero-shot Model-based Reinforcement Learning using Large Language Models](https://openreview.net/forum?id=uZFXpPrwSh)
  - Abdelhakim Benechehab, Youssef Attia El Hili, Ambroise Odonnat, Oussama Zekri, Albert Thomas, Giuseppe Paolo, Maurizio Filippone, Ievgen Redko, Balázs Kégl
  - Key: Model-based Reinforcement Learning, Large language models, Zero-shot Learning, In-context Learning
  - ExpEnv: D4RL, Pendulum, HalfCheetah, Hopper

- [On Rollouts in Model-Based Reinforcement Learning](https://openreview.net/forum?id=Uh5GRmLlvt)
  - Bernd Frauenknecht, Devdutt Subhasish, Friedrich Solowjow, Sebastian Trimpe
  - Key: Model-Based Reinforcement Learning, Model Rollouts, Uncertainty Quantification
  - ExpEnv: Gym MuJoCo

- [Any-step Dynamics Model Improves Future Predictions for Online and Offline Reinforcement Learning](https://openreview.net/forum?id=JZCxlrwjZ8)
  - Haoxin Lin, Yu-Yan Xu, Yihao Sun, Zhilong Zhang, Yi-Chen Li, Chengxing Jia, Junyin Ye, Jiaji Zhang, Yang Yu
  - Key: model-based reinforcement learning, any-step dynamics model
  - ExpEnv: D4RL, NeoRL, Gym MuJoCo-v3

- [Discrete Codebook World Models for Continuous Control](https://openreview.net/forum?id=lfRYzd8ady)
  - Aidan Scannell, Mohammadreza Nakhaeinezhadfard, Kalle Kujanpää, Yi Zhao, Kevin Sebastian Luck, Arno Solin, Joni Pajarinen
  - Key: reinforcement learning, world model, representation learning, self-supervised learning, model-based reinforcement learning, continuous control
  - ExpEnv: [deepmind control suite](https://github.com/deepmind/dm_control), [Meta-World](https://github.com/Farama-Foundation/Metaworld), [myosuite](https://github.com/MyoHub/myosuite)
</details>

### NeurIPS 2024

<details open>
<summary>Toggle</summary>

- [iVideoGPT: Interactive VideoGPTs are Scalable World Models](https://arxiv.org/pdf/2405.15223)
  - Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye Hao, Mingsheng Long
  - Key: world models, video generative models, autoregressive transformer, reinforcement learning, video prediction, visual planning
  - ExpEnv: Meta-world

- [Parallelizing Model-based Reinforcement Learning Over the Sequence Length](https://openreview.net/pdf/e061517a824b90efc807dc90ac6bbd20747bd654.pdf)
  - ZiRui Wang, Yue Deng, Junfeng Long, Yin Zhang
  - Key: reinforcement learning, model-based reinforcement learning, parallelization, sequence length, world model, eligibility trace, sample efficiency
  - ExpEnv: Atari 100K, DMControl

- [Reinforcement Learning under Latent Dynamics: Toward Statistical and Algorithmic Modularity](https://openreview.net/pdf?id=qf2uZAdy1N)
  - Philip Amortila, Dylan J. Foster, Nan Jiang, Akshay Krishnamurthy, Zakaria Mhammedi
  - Key: reinforcement learning, latent dynamics, statistical modularity, algorithmic modularity, observable-to-latent reductions, self-predictive models
  - ExpEnv: None

- [SPO: Sequential Monte Carlo Policy Optimisation](https://openreview.net/pdf?id=XKvYcPPH5G)
  - Matthew V Macfarlane, Edan Toledo, Donal Byrne, Paul Duckworth, Alexandre Laterre
  - Key: reinforcement learning, rl, model-based reinforcement learning, sequential monte carlo, expectation maximisation, planning
  - ExpEnv: Brax, Boxoban, Rubik's Cube

- [Seek Commonality but Preserve Differences: Dissected Dynamics Modeling for Multi-modal Visual RL](https://openreview.net/pdf?id=4php6bGL2W)
  - Yangru Huang, Peixi Peng, Yifan Zhao, Guangyao Chen, Yonghong Tian
  - Key: multi-modal reinforcement learning, visual RL, dynamics modeling, modality consistency, modality inconsistency, DDM
  - ExpEnv: CARLA, DMControl

- [The Surprising Ineffectiveness of Pre-Trained Visual Representations for Model-Based Reinforcement Learning](https://openreview.net/pdf?id=LvAy07mCxU)
  - Moritz Schneider, Robert Krug, Narunas Vaskevicius, Luigi Palmieri, Joschka Boedecker
  - Key: reinforcement learning, rl, model-based reinforcement learning, representation learning, pvr, visual representations
  - ExpEnv: DMC, ManiSkill2, Miniworld

- [Multi-Agent Domain Calibration with a Handful of Offline Data](https://openreview.net/pdf?id=LvAy07mCxU)
  - Tao Jiang, Lei Yuan, Lihe Li, Cong Guan, Zongzhang Zhang, Yang Yu
  - Key: Multi-agent reinforcement learning, domain transfer
  - ExpEnv: D4RL

- [WorldCoder, a Model-Based LLM Agent: Building World Models by Writing Code and Interacting with the Environment](https://arxiv.org/abs/2402.12275)
  - Hao Tang, Darren Key, Kevin Ellis
  - Key: learn world models as code, LLM
  - ExpEnv: [sokoban](https://github.com/mpSchrader/gym-sokoban), [minigrid](https://github.com/Farama-Foundation/Minigrid), [alfworld](https://github.com/alfworld/alfworld)
- [The Edge-of-Reach Problem in Offline Model-Based Reinforcement Learning](https://arxiv.org/abs/2402.12527)
  - Anya Sims, Cong Lu, Jakob Foerster, Yee Whye Teh
  - Key: edge-of-reach problem, reach-aware value learning
  - ExpEnv: [d4rl](https://github.com/Farama-Foundation/D4RL), [v-d4rl](https://github.com/conglu1997/v-d4rl)

- [Deterministic Uncertainty Propagation for Improved Model-Based Offline Reinforcement Learning](https://arxiv.org/abs/2406.04088)
  - Abdullah Akgül, Manuel Haussmann, Melih Kandemir
  - Key: argues that uncertainty-based reward penalization introduces excessive conservatism, potentially resulting in suboptimal policies through underestimation
  - ExpEnv: [d4rl](https://github.com/Farama-Foundation/D4RL)

- [BECAUSE: Bilinear Causal Representation for Generalizable Offline Model-based Reinforcement Learning](https://arxiv.org/abs/2407.10967)
  - Haohong Lin, Wenhao Ding, Jian Chen, Laixi Shi, Jiacheng Zhu, Bo Li, Ding Zhao
  - Key: objective mismatch problem, capture causal representation for both states and actions
  - ExpEnv: [lift](https://github.com/ARISE-Initiative/robosuite), [unlock](https://github.com/Farama-Foundation/Minigrid), [crash](https://github.com/Farama-Foundation/HighwayEnv)

- [Model-Based Transfer Learning for Contextual Reinforcement Learning](https://arxiv.org/abs/2408.04498)
  - Jung-Hoon Cho, Vindula Jayawardana, Sirui Li, Cathy Wu
  - Key: bayesian optimization, contextual rl
  - ExpEnv: gaussian process, traffic signal, eco-driving, advisory autonomy, control tasks

- [Rethinking Model-based, Policy-based, and Value-based Reinforcement Learning via the Lens of Representation Complexity](https://arxiv.org/abs/2312.17248)
  - Guhao Feng, Han Zhong
  - Key: rl representation complexity
  - ExpEnv: [mujoco](https://github.com/openai/mujoco-py)

<!--- [Parallelizing Model-based Reinforcement Learning Over the Sequence Length]()
  - Zirui Wang, Yue DENG, Junfeng Long, Yin Zhang
  - Key:
  - ExpEnv:

- [Constrained Latent Action Policies for Model-Based Offline Reinforcement Learning]()
  - Marvin Alles, Philip Becker-Ehmck, Patrick van der Smagt, Maximilian Karl
  - Key:
  - ExpEnv:

- [Policy-shaped prediction: avoiding distractions in model-based RL]()
  - Miles Hutson, Isaac Kauvar, Nick Haber
  - Key:
  - ExpEnv: -->

</details>

### ICML 2024

<details open>
<summary>Toggle</summary>

- [HarmonyDream: Task Harmonization Inside World Models](https://arxiv.org/abs/2310.00344)
  - Haoyu Ma, Jialong Wu, Ningya Feng, Chenjun Xiao, Dong Li, Jianye Hao, Jianmin Wang, Mingsheng Long
  - Key: observation modeling and reward modeling analysis in world models
  - ExpEnv: [meta-world](https://github.com/Farama-Foundation/Metaworld), [rlbench](https://github.com/stepjam/RLBench), [deepmind control suite](https://github.com/deepmind/dm_control), [atari 100k](https://github.com/openai/gym)
- [3D-VLA: A 3D Vision-Language-Action Generative World Model](https://arxiv.org/abs/2403.09631)
  - Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, Chuang Gan
  - Key: unify 3D perception, reasoning, and action with a generative world model; create a large-scale 3D embodied instruction tuning dataset
  - ExpEnv: [rlbench](https://github.com/stepjam/RLBench), [calvin](https://github.com/mees/calvin)

- [CompeteAI: Understanding the Competition Behaviors in Large Language Model-based Agents](https://arxiv.org/abs/2310.17512)
  - Qinlin Zhao, Jindong Wang, Yixuan Zhang, Yiqiao Jin, Kaijie Zhu, Hao Chen, Xing Xie
  - Key: propose a competitive framework for LLM-based agents; build a simulated competitive environment
  - ExpEnv: a virtual town with only restaurants and customers

- [Model-based Reinforcement Learning for Parameterized Action Spaces](https://arxiv.org/abs/2404.03037)
  - Renhao Zhang, Haotian Fu, Yilin Miao, George Konidaris
  - Key: discrete-continuous hybrid action space, dynamics model with parameterized actions, MPC with parameterized actions
  - ExpEnv: [platform, goal, hard goal, catch point, hard move](https://github.com/Valarzz/Model-based-Reinforcement-Learning-for-Parameterized-Action-Spaces/tree/main/common)

- [Learning Latent Dynamic Robust Representations for World Models](https://arxiv.org/abs/2405.06263)
  - Ruixiang Sun, Hongyu Zang, Xin Li, Riashat Islam
  - Key: modified Dreamer architecture, hybrid-recurrent state space model
  - ExpEnv: [deepmind control suite](https://github.com/deepmind/dm_control), [distracted deepmind control suite](https://github.com/bit1029public/HRSSM/tree/main/env), [mani-skill2](https://github.com/haosulab/ManiSkill2)

- [AD3: Implicit Action is the Key for World Models to Distinguish the Diverse Visual Distractors](https://arxiv.org/abs/2403.09976)
  - Yucen Wang, Shenghua Wan, Le Gan, Shuai Feng, De-Chuan Zhan
  - Key: implicit action generator, action-conditioned separated world models
  - ExpEnv: [deepmind control suite](https://github.com/deepmind/dm_control)

- [Hieros: Hierarchical Imagination on Structured State Space Sequence World Models](https://arxiv.org/abs/2310.05167)
  - Paul Mattes, Rainer Schlosser, Ralf Herbrich
  - Key: state-space models, multilayered hierarchical imagination, S5-based world model
  - ExpEnv: [atari 100k](https://github.com/openai/gym)

- [Improving Token-Based World Models with Parallel Observation Prediction](https://arxiv.org/abs/2402.05643)
  - Lior Cohen, Kaixin Wang, Bingyi Kang, Shie Mannor
  - Key: pixel-based mbrl, token-based world models, retentive environment model
  - ExpEnv: [atari 100k](https://github.com/openai/gym)

- [Do Transformer World Models Give Better Policy Gradients?](https://arxiv.org/abs/2402.05290)
  - Michel Ma, Tianwei Ni, Clement Gehring, Pierluca D'Oro, Pierre-Luc Bacon
  - Key: actions world model
  - ExpEnv: [double-pendulum](https://github.com/openai/gym), [Myriad](https://github.com/nikihowe/myriad)
- [Dr. Strategy: Model-Based Generalist Agents with Strategic Dreaming](https://arxiv.org/abs/2402.18866)
  - Hany Hamed, Subin Kim, Dongyeong Kim, Jaesik Yoon, Sungjin Ahn
  - Key: during strategic dreaming, train three policies -- a highway policy, an explorer policy, and an achiever policy -- then use them to solve downstream tasks
  - ExpEnv: 2D Navigation, 3D-Maze Navigation, RoboKitchen

- [Towards Robust Model-Based Reinforcement Learning Against Adversarial Corruption](https://arxiv.org/abs/2402.08991)
  - Chenlu Ye, Jiafan He, Quanquan Gu, Tong Zhang
  - Key: theoretical analysis of adversarial corruption for model-based rl, encompassing both online and offline settings
  - ExpEnv: None

- [Model-based Reinforcement Learning for Confounded POMDPs](https://proceedings.mlr.press/v235/hong24d.html)
  - Mao Hong, Zhengling Qi, Yanxun Xu
  - Key: model-based RL, POMDP
  - ExpEnv: None

<!-- - [Trust the Model Where It Trusts Itself - Model-Based Actor-Critic with Uncertainty-Aware Rollout Adaption]()
  - Bernd Frauenknecht, Artur Eisele, Devdutt Subhasish, Friedrich Solowjow, Sebastian Trimpe
  - Key:
  - ExpEnv:

- [Efficient World Models with Time-Aware and Context-Augmented Tokenization]()
  - Vincent Micheli, Eloi Alonso, François Fleuret
  - Key:
  - ExpEnv:

- [Coprocessor Actor Critic: A Model-Based Reinforcement Learning Approach For Adaptive Deep Brain Stimulation]()
  - Michelle Pan, Mariah Schrum, Vivek Myers, Erdem Biyik, Anca Dragan
  - Key:
  - ExpEnv: -->

</details>

### ICLR 2024

<details open>
<summary>Toggle</summary>

- [Policy Rehearsing: Training Generalizable Policies for Reinforcement Learning](https://openreview.net/forum?id=m3xVPaZp6Z)
  - Chengxing Jia, Chenxiao Gao, Hao Yin, Fuxiang Zhang, Xiong-Hui Chen, Tian Xu, Lei Yuan, Zongzhang Zhang, Zhi-Hua Zhou, Yang Yu
  - Key: Reinforcement Learning, Model-based Reinforcement Learning, Offline Reinforcement Learning
  - OpenReview: 8, 8, 8, 6
  - ExpEnv: [d4rl](https://github.com/rail-berkeley/d4rl)

- [Efficient Dynamics Modeling in Interactive Environments with Koopman Theory](https://openreview.net/forum?id=fkrYDQaHOJ)
  - Arnab Kumar Mondal, Siba Smarak Panigrahi, Sai Rajeswar, Kaleem Siddiqi, Siamak Ravanbakhsh
  - Key: Koopman Theory, Reinforcement Learning, Dynamical System, Planning, Long-range dynamics prediction, Efficient forward dynamics
  - OpenReview: 8, 6, 5, 3
  - ExpEnv: [mujoco](https://github.com/openai/mujoco-py)

- [Combining Spatial and Temporal Abstraction in Planning for Better Generalization](https://openreview.net/forum?id=eo9dHwtTFt)
  - Mingde Zhao, Safa Alver, Harm van Seijen, Romain Laroche, Doina Precup, Yoshua Bengio
  - Key: Reinforcement Learning, Planning, Neural Networks, Temporal Difference Learning, Generalization, Deep Reinforcement Learning
  - OpenReview: 6, 6, 6, 5
  - ExpEnv: [MiniGrid-BabyAI framework](https://github.com/maximecb/gym-minigrid)

- [Mastering Memory Tasks with World Models](https://openreview.net/forum?id=1vDArHJ68h)
  - Mohammad Reza Samsami, Artem Zholus, Janarthanan Rajendran, Sarath Chandar
  - Key: recall to imagine module, based on DreamerV3
  - OpenReview: 10, 8, 6
  - ExpEnv: [bsuite](https://github.com/google-deepmind/bsuite), [popgym](https://github.com/proroklab/popgym), [atari](https://github.com/openai/gym), [deepmind control suite](https://github.com/deepmind/dm_control), [memory maze](https://github.com/jurgisp/memory-maze)
- [Privileged Sensing Scaffolds Reinforcement Learning](https://openreview.net/forum?id=EpVe8jAjdx)
  - Edward S. Hu, James Springer, Oleh Rybkin, Dinesh Jayaraman
  - Key: privileged information, based on DreamerV3
  - OpenReview: 10, 8, 8, 8
  - ExpEnv: [gymnasium robotics](https://github.com/Farama-Foundation/Gymnasium-Robotics)

- [TD-MPC2: Scalable, Robust World Models for Continuous Control](https://openreview.net/forum?id=Oxh5CstDJU)
  - Nicklas Hansen, Hao Su, Xiaolong Wang
  - Key: implicit world model, model predictive control, generalist td-mpc2
  - OpenReview: 8, 8, 8, 8
  - ExpEnv: [deepmind control suite](https://github.com/deepmind/dm_control), [Meta-World](https://github.com/Farama-Foundation/Metaworld), [maniskill2](https://github.com/haosulab/ManiSkill2), [myosuite](https://github.com/MyoHub/myosuite)

- [Robust Model Based Reinforcement Learning Using L1 Adaptive Control](https://openreview.net/forum?id=GaLCLvJaoF)
  - Minjun Sung, Sambhu Harimanas Karumanchi, Aditya Gahlawat, Naira Hovakimyan
  - Key: L1 Adaptive Control
  - OpenReview: 8, 6, 6, 6
  - ExpEnv: [mujoco](https://github.com/openai/mujoco-py)

- [Learning Hierarchical World Models with Adaptive Temporal Abstractions from Discrete Latent Dynamics](https://openreview.net/forum?id=TjCDNssXKU)
  - Christian Gumbsch, Noor Sajid, Georg Martius, Martin V. Butz
  - Key: Context-specific Recurrent State Space Model, hierarchical world model
  - OpenReview: 8, 6, 6
  - ExpEnv: [MiniHack](https://github.com/facebookresearch/minihack), [VisualPinPad](https://github.com/danijar/director/blob/main/embodied/envs/pinpad.py), [MultiWorld](https://github.com/vitchyr/multiworld)
- [Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion](https://arxiv.org/abs/2311.01017)
  - Lunjun Zhang, Yuwen Xiong, Ze Yang, Sergio Casas, Rui Hu, Raquel Urtasun
  - Key: discrete diffusion; world model; autonomous driving
  - OpenReview: 10, 8, 6, 6, 6
  - ExpEnv: [NuScenes](https://www.nuscenes.org/), [KITTI Odometry](https://www.cvlibs.net/datasets/kitti/eval_odometry.php), [Argoverse2 Lidar](https://www.argoverse.org/av2.html)

- [COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL](https://openreview.net/forum?id=jnFcKjtUPN)
  - Xiyao Wang, Ruijie Zheng, Yanchao Sun, Ruonan Jia, Wichayaporn Wongkamjan, Huazhe Xu, Furong Huang
  - Key: conservative model rollouts, optimistic environment exploration
  - OpenReview: 6, 6, 6
  - ExpEnv: [mujoco](https://github.com/openai/mujoco-py), [deepmind control suite](https://github.com/deepmind/dm_control)

- [Efficient Multi-agent Reinforcement Learning by Planning](https://openreview.net/forum?id=CpnKq3UJwp)
  - Qihan Liu, Jianing Ye, Xiaoteng Ma, Jun Yang, Bin Liang, Chongjie Zhang
  - Key: mcts, optimistic search lambda, advantage-weighted policy optimization
  - OpenReview: 8, 6, 6, 6
  - ExpEnv: [smac](https://github.com/oxwhirl/smac)

- [Differentiable Trajectory Optimization as a Policy Class for Reinforcement and Imitation Learning](https://openreview.net/forum?id=HL5P4H8eO2)
  - Weikang Wan, Yufei Wang, Zackory Erickson, David Held
  - Key: differentiable trajectory optimization
  - OpenReview: 10, 8, 8, 5
  - ExpEnv: [deepmind control suite](https://github.com/deepmind/dm_control), [robomimic](https://github.com/ARISE-Initiative/robomimic), [maniskill](https://github.com/haosulab/ManiSkill2)

- [DMBP: Diffusion model based predictor for robust offline reinforcement learning against state observation perturbations](https://openreview.net/forum?id=ZULjcYLWKe)
  - Zhihe Yang, Yunjian Xu
  - Key: conditional diffusion, offline RL
  - OpenReview: 8, 8, 6, 6
  - ExpEnv: [d4rl](https://github.com/rail-berkeley/d4rl)

- [MAMBA: an Effective World Model Approach for Meta-Reinforcement Learning](https://openreview.net/forum?id=1RE0H6mU7M)
  - Zohar Rimon, Tom Jurgenson, Orr Krupnik, Gilad Adler, Aviv Tamar
  - Key: context-based meta-RL, based on dreamer
  - OpenReview: 6, 6, 6, 6
  - ExpEnv: [Point Robot Navigation, Escape Room](https://github.com/Rondorf/BOReL/blob/main/environments/toy_navigation/point_robot.py), [Reacher Sparse](https://github.com/deepmind/dm_control)

- [Reward-Consistent Dynamics Models are Strongly Generalizable for Offline Reinforcement Learning](https://openreview.net/forum?id=GSBHKiw19c)
  - Fan-Ming Luo, Tian Xu, Xingchen Cao, Yang Yu
  - Key: reward learning, offline RL
  - OpenReview: 8, 6, 6, 6
  - ExpEnv: [d4rl](https://github.com/rail-berkeley/d4rl), [NeoRL](https://github.com/polixir/NeoRL)
- [DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing](https://openreview.net/forum?id=GruDNzQ4ux)
  - Vint Lee, Pieter Abbeel, Youngwoon Lee
  - Key: learn to predict a temporally-smoothed reward rather than the exact reward at each timestep
  - OpenReview: 6, 6, 6, 5
  - ExpEnv: [robodesk](https://github.com/google-research/robodesk), [hand](https://github.com/openai/gym), [earthmoving](https://www.algoryx.se/agx-dynamics/)

- [Informed POMDP: Leveraging Additional Information in Model-Based RL](https://openreview.net/forum?id=5NJzNAXAmx)
  - Gaspard Lambrechts, Adrien Bolland, Damien Ernst
  - Key: informed world model, based on DreamerV3
  - OpenReview: 6, 6, 6, 5
  - ExpEnv: [varying mountain hike](https://github.com/maximilianigl/DVRL/tree/master), [deepmind control suite](https://github.com/deepmind/dm_control), [pop gym](https://github.com/proroklab/popgym), [flickering atari and flickering control](https://github.com/openai/gym)

</details>

### NeurIPS 2023

<details open>
<summary>Toggle</summary>

- [Large Language Models as Commonsense Knowledge for Large-Scale Task Planning](https://proceedings.neurips.cc/paper_files/paper/2023/hash/65a39213d7d0e1eb5d192aa77e77eeb7-Abstract-Conference.html)
  - Zirui Zhao, Wee Sun Lee, David Hsu
  - Key: LLM-MCTS
  - ExpEnv: VirtualHome

- [Describe, Explain, Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agents](https://proceedings.neurips.cc/paper_files/paper/2023/file/6b8dfb8c0c12e6fafc6c256cb08a5ca7-Paper-Conference.pdf)
  - Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian (Shawn) Ma, Yitao Liang
  - Key: interactive planning approach based on LLM
  - ExpEnv: [minecraft](https://github.com/minerllabs/minerl)

- [Facing Off World Model Backbones: RNNs, Transformers, and S4](https://proceedings.neurips.cc/paper_files/paper/2023/file/e6c65eb9b56719c1aa45ff73874de317-Paper-Conference.pdf)
  - Fei Deng, Junyeong Park, Sungjin Ahn
  - Key: world model backbones
  - ExpEnv: [MiniGrid](https://github.com/maximecb/gym-minigrid), [memory maze](https://github.com/jurgisp/memory-maze)

- [Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning](https://proceedings.neurips.cc/paper_files/paper/2023/file/7ce1cbededb4b0d6202847ac1b484ee8-Paper-Conference.pdf)
  - Jialong Wu, Haoyu Ma, Chaoyi Deng, Mingsheng Long
  - Key: Contextualized World Models
  - ExpEnv: [CARLA](https://github.com/wayveai/mile/tree/main/carla_gym), [deepmind control suite](https://github.com/deepmind/dm_control)

- [Conformal Prediction for Uncertainty-Aware Planning with Diffusion Dynamics Model](https://proceedings.neurips.cc/paper_files/paper/2023/file/fe318a2b6c699808019a456b706cd845-Paper-Conference.pdf)
  - Jiankai Sun, Yiqi Jiang, Jianing Qiu, Parth Nobel, Mykel J. Kochenderfer, Mac Schwager
  - Key: Diffusion Dynamics Model
  - ExpEnv: [d4rl](https://github.com/rail-berkeley/d4rl), [Maze2D](https://github.com/Farama-Foundation/D4RL/tree/master/d4rl)
- [LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios](https://openreview.net/forum?id=oIUXpBnyjv)
  - Yazhe Niu, Yuan Pu, Zhenjie Yang, Xueyan Li, Tong Zhou, Jiyuan Ren, Shuai Hu, Hongsheng Li, Yu Liu
  - Key: MCTS-style benchmark
  - ExpEnv: [board games](https://github.com/opendilab/LightZero/tree/main/zoo/board_games), [atari](https://github.com/openai/gym), [mujoco](https://github.com/openai/mujoco-py), [gobigger](https://github.com/opendilab/GoBigger)

- [Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning](https://openreview.net/forum?id=fAdMly4ki5)
  - Haoran He, Chenjia Bai, Kang Xu, Zhuoran Yang, Weinan Zhang, Dong Wang, Bin Zhao, Xuelong Li
  - Key: GPT-based diffusion model for planning and data synthesizing
  - ExpEnv: [Meta-World](https://github.com/Farama-Foundation/Metaworld), [Maze2D](https://github.com/Farama-Foundation/D4RL/tree/master/d4rl)

- [MoVie: Visual Model-Based Policy Adaptation for View Generalization](https://openreview.net/forum?id=YV1MYtj2AR)
  - Sizhe Yang, Yanjie Ze, Huazhe Xu
  - Key: view generalization, spatial adaptive encoder
  - ExpEnv: [deepmind control suite](https://github.com/deepmind/dm_control), [adroit](https://github.com/aravindr93/mjrl), [xArm](https://github.com/yangsizhe/MoVie/tree/main/src/envs/xarm_env)

- [Model-Based Reparameterization Policy Gradient Methods: Theory and Practical Algorithms](https://openreview.net/forum?id=bUgqyyNo8j)
  - Shenao Zhang, Boyi Liu, Zhaoran Wang, Tuo Zhao
  - Key: model-based reparameterization policy gradient method, smoothness regularization
  - ExpEnv: [mujoco](https://github.com/openai/mujoco-py)

- [Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning](https://openreview.net/forum?id=zDbsSscmuj)
  - Lin Guan, Karthik Valmeekam, Sarath Sreedharan, Subbarao Kambhampati
  - Key: construct an explicit world (domain) model in planning domain definition language
  - ExpEnv: household-robot domain, tyreworld and logistics

- [RePo: Resilient Model-Based Reinforcement Learning by Regularizing Posterior Predictability](https://openreview.net/forum?id=OIJ3VXDy6s)
  - Chuning Zhu, Max Simchowitz, Siri Gadipudi, Abhishek Gupta
  - Key: representation resilience for visual RL
  - ExpEnv: [deepmind control suite](https://github.com/deepmind/dm_control), [maniskill](https://github.com/haosulab/ManiSkill2)

- [Model-Based Control with Sparse Neural Dynamics](https://openreview.net/forum?id=ymBG2xs9Zf)
  - Ziang Liu, Jeff He, Genggeng Zhou, Tobia Marcucci, Fei-Fei Li, Jiajun Wu, Yunzhu Li
  - Key: network sparsification, mixed-integer formulation of ReLU neural dynamics
  - ExpEnv: [gym, cartpole, reacher](https://github.com/openai/gym)
reacher](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [Optimal Exploration for Model-Based RL in Nonlinear Systems](https:\u002F\u002Fopenreview.net\u002Fforum?id=pJQu0zpKCS)\n  - Andrew Wagenmaker, Guanya Shi, Kevin Jamieson\n  - Key: optimal sample complexity for nonlinear dynamical systems\n  - ExpEnv: [affine dynamics system](https:\u002F\u002Fgithub.com\u002Fajwagen\u002Fnonlinear_sysid_for_control\u002Fblob\u002Fmain\u002Fenvironments.py)\n\n- [State2Explanation: Concept-Based Explanations to Benefit Agent Learning and User Understanding](https:\u002F\u002Fopenreview.net\u002Fforum?id=xGz0wAIJrS)\n  - Devleena Das, Sonia Chernova, Been Kim\n  - Key: a joint embedding model between state-action pairs and concept-based explanations\n  - ExpEnv: [connect4](), [lunar lander](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [Efficient Exploration in Continuous-time Model-based Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=VkhvDfY2dB)\n  - Lenart Treven, Jonas Hübotter, Bhavya, Florian Dorfler, Andreas Krause\n  - Key: nonlinear ordinary differential equations, regret bound, measurement selection strategies\n  - ExpEnv: [system’s tasks]()\n\n- [Action Inference by Maximising Evidence: Zero-Shot Imitation from Observation with World Models](https:\u002F\u002Fopenreview.net\u002Fforum?id=WjlCQxpuxU)\n  - Xingyuan Zhang, Philip Becker-Ehmck, Patrick van der Smagt, Maximilian Karl\n  - Key: pretrained world models, imitation learning from observation only\n  - ExpEnv: [deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [STORM: Efficient Stochastic Transformer based World Models for Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=WxnrX42rnS)\n  - Weipu Zhang, Gang Wang, Jian Sun, Yetian Yuan, Gao Huang\n  - Key: categorical-VAE, transformer structure, DreamerV3\n  - ExpEnv: [atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n\u003C\u002Fdetails>\n\n### ICML 2023\n\n\u003Cdetails open>\n\u003Csummary>Toggle\u003C\u002Fsummary>\n\n- [Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.12016)\n  - Sai Rajeswar Mudumba, Pietro Mazzaglia, Tim Verbelen, Alexandre Piche, Bart Dhoedt, Aaron Courville, Alexandre Lacoste\n  - Key: unsupervised pretrain, task-aware finetune, dyna-mpc\n  - ExpEnv: [URLB benchmark](https:\u002F\u002Fgithub.com\u002Frll-research\u002Furl_benchmark), [RWRL suite](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Frealworldrl_suite)\n\n- [Reparameterized Policy Learning for Multimodal Trajectory Optimization](https:\u002F\u002Fopenreview.net\u002Fforum?id=5Akrk9Ln6N)\n  - Zhiao Huang, Litian Liang, Zhan Ling, Xuanlin Li, Chuang Gan, Hao Su\n  - Key: multimodal policy learning, reparameterized policy gradient\n  - ExpEnv: [Meta-World](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FMetaworld), [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [Live in the Moment: Learning Dynamics Model Adapted to Evolving Policy](https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.12141)\n  - Xiyao Wang, Wichayaporn Wongkamjan, Ruonan Jia, Furong Huang\n  - Key: policy-adapted model learning, weight design\n  - ExpEnv: [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [Predictable MDP Abstraction for Unsupervised Model-Based RL](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.03921)\n  - Seohong Park, Sergey Levine\n  - Key: predictable MDP abstraction, tackle 
\u003Ci>model exploitation\u003C\u002Fi>\n  - ExpEnv: [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [Investigating the Role of Model-Based Learning in Exploration and Transfer](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.04009)\n  - Jacob C Walker, Eszter Vértes, Yazhe Li, Gabriel Dulac-Arnold, Ankesh Anand, Jessica Hamrick, Theophane Weber\n  - Key Insights: (1) Is there an advantage to an agent being model-based during unsupervised exploration and\u002For fine-tuning? (2) What are the contributions of each component of a model-based agent for downstream task learning? (3) How well does the model-based agent deal with environmental shift between the unsupervised and downstream phases?\n  - ExpEnv: [Crafter](https:\u002F\u002Fgithub.com\u002Fdanijar\u002Fcrafter), [RoboDesk](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Frobodesk), [Meta-World](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FMetaworld)\n\n- [The Virtues of Laziness in Model-based RL: A Unified Objective and Algorithms](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.00694)\n  - Anirudh Vemula, Yuda Song, Aarti Singh, J. Bagnell, Sanjiban Choudhury\n  - Key: objective mismatch, mbrl framework\n  - ExpEnv: [Helicopter, WideTree, Linear Dynamical System, Maze](https:\u002F\u002Fgithub.com\u002Fvvanirudh\u002FLAMPS-MBRL\u002Ftree\u002Fmaster), [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [The Benefits of Model-Based Generalization in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.02222)\n  - Kenny Young, Aditya Ramesh, Louis Kirsch, Jürgen Schmidhuber\n  - Key: experience replay, when and how learned model generalization\n  - ExpEnv: [ProcMaze, ButtonGrid, PanFlute](https:\u002F\u002Fgithub.com\u002Fkenjyoung\u002FModel_Generalization_Code_supplement\u002Fblob\u002Fmain\u002Fenvironments.py)\n\n- [STEERING: Stein Information Directed Exploration for Model-Based Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.12038)\n  - Souradip Chakraborty, Amrit Bedi, Alec Koppel, Mengdi Wang, Furong Huang, Dinesh Manocha\n  - Key: information directed sampling, kernelized Stein discrepancy\n  - ExpEnv: [DeepSea](https:\u002F\u002Fgithub.com\u002FstratisMarkou\u002Fsample-efficient-bayesian-rl\u002Fblob\u002Fmaster\u002Fcode\u002FEnvironments.py)\n\n- [Model-based Reinforcement Learning with Scalable Composite Policy Gradient Estimators](https:\u002F\u002Fopenreview.net\u002Fforum?id=rDMAJECBM2)\n  - Paavo Parmas, Takuma Seno, Yuma Aoki\n  - Key: extension of Dreamer, total propagation computation graph\n  - ExpEnv: [deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [Reinforcement Learning with History Dependent Dynamic Contexts](https:\u002F\u002Fopenreview.net\u002Fforum?id=rdOuTlTUMX)\n  - Guy Tennenholtz, Nadav Merlis, Lior Shani, Martin Mladenov, Craig Boutilier\n  - Key: non-Markov context dynamics, logistic DCMDPs, theoretical analysis, extension of MuZero\n  - ExpEnv: [MovieLens dataset](https:\u002F\u002Fwww.tensorflow.org\u002Fdatasets\u002Fcatalog\u002Fmovielens)\n\n- [Model-Bellman Inconsistency for Model-based Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=rwLwGPdzDD)\n  - Yihao Sun, Jiaji Zhang, Chengxing Jia, Haoxin Lin, Junyin Ye, Yang Yu\n  - Key: pessimistic value estimation, theoretical analysis\n  - ExpEnv: [d4rl](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl), 
[NeoRL](https:\u002F\u002Fgithub.com\u002Fpolixir\u002FNeoRL)\n\n- [Simplified Temporal Consistency Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=IkhTCX9x5i)\n  - Yi Zhao, Wenshuai Zhao, Rinu Boney, Juho Kannala, Joni Pajarinen\n  - Key: representation learning, temporal consistency\n  - ExpEnv: [deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [Curious Replay for Model-based Adaptation](https:\u002F\u002Fopenreview.net\u002Fforum?id=7p7YakZP2H)\n  - Isaac Kauvar, Chris Doyle, Linqi Zhou, Nick Haber\n  - Key: extension of DreamerV3, curious replay, count-based replay, adversarial replay\n  - ExpEnv: [Crafter](https:\u002F\u002Fgithub.com\u002Fdanijar\u002Fcrafter), [deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [On Many-Actions Policy Gradient](https:\u002F\u002Fopenreview.net\u002Fforum?id=HKfSTYLJh7)\n  - Michal Nauman, Marek Cygan\n  - Key: bias and variance, theoretical analysis\n  - ExpEnv: [deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [Posterior Sampling for Deep Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=ZwjSECgl6p)\n  - Remo Sasso, Michelangelo Conserva, Paulo Rauber\n  - Key: posterior sampling, continual value network\n  - ExpEnv: [atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [Model-based Offline Reinforcement Learning with Count-based Conservatism](https:\u002F\u002Fopenreview.net\u002Fforum?id=T5VlejGx7f)\n  - Byeongchan Kim, Min-hwan Oh\n  - Key: count estimation, theoretical analysis\n  - ExpEnv: [d4rl](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n\u003C\u002Fdetails>\n\n### ICLR 2023\n\n\u003Cdetails open>\n\u003Csummary>Toggle\u003C\u002Fsummary>\n\n- [Transformers are Sample-Efficient World Models](https:\u002F\u002Fopenreview.net\u002Fforum?id=vhFu1Acb0xb)\n  - Vincent Micheli, Eloi Alonso, François Fleuret\n  - Key: discrete autoencoder, transformer based world model\n  - OpenReview: 8, 8, 8, 8\n  - ExpEnv: [atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [Conservative Bayesian Model-Based Value Expansion for Offline Policy Optimization](https:\u002F\u002Fopenreview.net\u002Fforum?id=dNqxZgyjcYA)\n  - Jihwan Jeong, Xiaoyu Wang, Michael Gimelfarb, Hyunwoo Kim, Baher Abdulhai, Scott Sanner\n  - Key: model-based offline, bayesian posterior value estimate\n  - OpenReview: 8, 8, 6, 6\n  - ExpEnv: [d4rl](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [User-Interactive Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=a4COps0uokg)\n  - Phillip Swazinna, Steffen Udluft, Thomas Runkler\n  - Key: let the user adapt the policy behavior after training is finished\n  - OpenReview: 10, 8, 6, 3\n  - ExpEnv: [2d-world](), [industrial benchmark](https:\u002F\u002Fgithub.com\u002Fsiemens\u002Findustrialbenchmark\u002Ftree\u002Foffline_datasets\u002Fdatasets)\n\n- [CLARE: Conservative Model-Based Reward Learning for Offline Inverse Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=5aT4ganOd98)\n  - Sheng Yue, Guanbo Wang, Wei Shao, Zhaofeng Zhang, Sen Lin, Ju Ren, Junshan Zhang\n  - Key: offline IRL, reward extrapolation error\n  - OpenReview: 8, 8, 6, 6\n  - ExpEnv: [d4rl](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [Efficient Offline Policy Optimization with a Learned Model](https:\u002F\u002Fopenreview.net\u002Fforum?id=Yt-yM-JbYFO)\n  - Zichen Liu, Siyi Li, Wee Sun Lee, 
Shuicheng Yan, Zhongwen Xu\n  - Key: offline rl, analysis of MuZero Unplugged, one-step look-ahead policy improvement\n  - OpenReview: 8, 6, 5\n  - ExpEnv: [atari dataset](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdeepmind-research\u002Ftree\u002Fmaster\u002Frl_unplugged)\n\n- [Efficient Planning in a Compact Latent Action Space](https:\u002F\u002Fopenreview.net\u002Fforum?id=cA77NrVEuqn)\n  - Zhengyao Jiang, Tianjun Zhang, Michael Janner, Yueying Li, Tim Rocktäschel, Edward Grefenstette, Yuandong Tian\n  - Key: planning with VQ-VAE\n  - OpenReview: 6, 6, 6, 6\n  - ExpEnv: [d4rl dataset](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [Is Model Ensemble Necessary? Model-based RL via a Single Model with Lipschitz Regularized Value Function](https:\u002F\u002Fopenreview.net\u002Fforum?id=hNyJBk3CwR)\n  - Ruijie Zheng, Xiyao Wang, Huazhe Xu, Furong Huang\n  - Key: lipschitz regularization\n  - OpenReview: 8, 8, 6, 6\n  - ExpEnv: [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [MoDem: Accelerating Visual Model-Based Reinforcement Learning with Demonstrations](https:\u002F\u002Fopenreview.net\u002Fforum?id=JdTnc9gjVfJ)\n  - Nicklas Hansen, Yixin Lin, Hao Su, Xiaolong Wang, Vikash Kumar, Aravind Rajeswaran\n  - Key: three phases -- policy pretraining, targeted exploration, interactive learning\n  - OpenReview: 8, 6, 6, 6\n  - ExpEnv: [adroit](https:\u002F\u002Fgithub.com\u002Faravindr93\u002Fmjrl), [meta-world](https:\u002F\u002Fgithub.com\u002Frlworkgroup\u002Fmetaworld), [deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective](https:\u002F\u002Fopenreview.net\u002Fforum?id=MQcmfgRxf7a)\n  - Raj Ghugare, Homanga Bharadhwaj, Benjamin Eysenbach, Sergey Levine, Ruslan Salakhutdinov\n  - Key: Aligned Latent Models\n  - OpenReview: 8, 6, 6, 6, 6\n  - ExpEnv: [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n\u003C!-- - [The Benefits of Model-Based Generalization in Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=w1w4dGJ4qV)\n  - Kenny Young, Aditya Ramesh, Louis Kirsch, Jürgen Schmidhuber\n  - Key: model generalization can be considered more useful than value function generalization\n  - OpenReview: 8, 6, 5, 5\n  - ExpEnv: [ProcMaze, ButtonGrid, PanFlute]() -->\n\n- [Diminishing Return of Value Expansion Methods in Model-Based Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=H4Ncs5jhTCu)\n  - Daniel Palenicek, Michael Lutter, Joao Carvalho, Jan Peters\n  - Key: longer horizons yield diminishing returns in terms of sample efficiency\n  - OpenReview: 8, 6, 6, 6\n  - ExpEnv: [brax](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fbrax)\n\n- [Planning Goals for Exploration](https:\u002F\u002Fopenreview.net\u002Fforum?id=6qeBuZSo7Pr)\n  - Edward S. 
Hu, Richard Chang, Oleh Rybkin, Dinesh Jayaraman\n  - Key: sampling-based planning, set goals for each training episode to directly optimize an intrinsic exploration reward\n  - OpenReview: 8, 8, 8, 8, 6\n  - ExpEnv: [point maze](), [walker](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control), [ant maze, 3-block stack](https:\u002F\u002Fgithub.com\u002Fspitis\u002Fmrl\u002Ftree\u002Fmaster\u002Fenvs)\n\n- [Making Better Decision by Directly Planning in Continuous Control](https:\u002F\u002Fopenreview.net\u002Fforum?id=r8Mu7idxyF)\n  - Jinhua Zhu, Yue Wang, Lijun Wu, Tao Qin, Wengang Zhou, Tie-Yan Liu, Houqiang Li\n  - Key: deep differentiable dynamic programming planner\n  - OpenReview: 8, 8, 8, 6\n  - ExpEnv: [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [Latent Variable Representation for Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=mQpmZVzXK1h)\n  - Tongzheng Ren, Chenjun Xiao, Tianjun Zhang, Na Li, Zhaoran Wang, Sujay Sanghavi, Dale Schuurmans, Bo Dai\n  - Key: variational learning, representation learning\n  - OpenReview: 8, 6, 6, 3\n  - ExpEnv: [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py), [deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [SpeedyZero: Mastering Atari with Limited Data and Time](https:\u002F\u002Fopenreview.net\u002Fforum?id=Mg5CLXZgvLJ)\n  - Yixuan Mei, Jiaxuan Gao, Weirui Ye, Shaohuai Liu, Yang Gao, Yi Wu\n  - Key: distributed model-based rl, speed up EfficientZero\n  - OpenReview: 6, 6, 5\n  - ExpEnv: [atari 100k](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [Transformer-based World Models Are Happy With 100k Interactions](https:\u002F\u002Fopenreview.net\u002Fforum?id=TdBaDGCpjly)\n  - Jan Robine, Marc Höftmann, Tobias Uelwer, Stefan Harmeling\n  - Key: autoregressive world model, Transformer-XL, balanced cross-entropy loss, balanced dataset sampling\n  - OpenReview: 8, 6, 6, 6\n  - ExpEnv: [atari 100k](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [On the Feasibility of Cross-Task Transfer with Model-Based Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=KB1sc5pNKFv)\n  - Yifan Xu, Nicklas Hansen, Zirui Wang, Yung-Chieh Chan, Hao Su, Zhuowen Tu\n  - Key: offline multi-task pretraining, online finetuning\n  - OpenReview: 6, 6, 6, 6\n  - ExpEnv: [atari 100k](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [Become a Proficient Player with Limited Data through Watching Pure Videos](https:\u002F\u002Fopenreview.net\u002Fforum?id=Sy-o2N0hF4f)\n  - Weirui Ye, Yunsheng Zhang, Pieter Abbeel, Yang Gao\n  - Key: unsupervised pre-training, finetune with down-stream tasks\n  - OpenReview: 8, 6, 6, 5\n  - ExpEnv: [atari 100k](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [EUCLID: Towards Efficient Unsupervised Reinforcement Learning with Multi-choice Dynamics Model](https:\u002F\u002Fopenreview.net\u002Fforum?id=xQAjSr64PTc)\n  - Yifu Yuan, Jianye Hao, Fei Ni, Yao Mu, Yan Zheng, Yujing Hu, Jinyi Liu, Yingfeng Chen, Changjie Fan\n  - Key: jointly pretrain the multi-headed dynamics model and unsupervised exploration policy, finetune to downstream tasks\n  - OpenReview: 6, 6, 6, 6\n  - ExpEnv: [URLB benchmark](https:\u002F\u002Fgithub.com\u002Frll-research\u002Furl_benchmark)\n\n- [Choreographer: Learning and Adapting Skills in Imagination](https:\u002F\u002Fopenreview.net\u002Fforum?id=PhkWyijGi5b)\n  - Pietro Mazzaglia, Tim Verbelen, Bart Dhoedt, Alexandre Lacoste, Sai Rajeswar\n  - Key: world model, 
skill discovery, skill learning, skill adaptation\n  - OpenReview: 8, 8, 6, 6\n  - ExpEnv: [deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control), [Meta-World](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FMetaworld)\n\n\u003C\u002Fdetails>\n\n### NeurIPS 2022\n\n\u003Cdetails open>\n\u003Csummary>Toggle\u003C\u002Fsummary>\n\n- [Bidirectional Learning for Offline Infinite-width Model-based Optimization](https:\u002F\u002Fopenreview.net\u002Fforum?id=_j8yVIyp27Q)\n  - Can Chen, Yingxue Zhang, Jie Fu, Xue Liu, Mark Coates\n  - Key: model-based, offline\n  - OpenReview: 7, 6, 5\n  - ExpEnv: [design-bench](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fdesign-bench)\n\n- [A Unified Framework for Alternating Offline Model Training and Policy Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=5yjM1sQ1uKZ)\n  - Shentao Yang, Shujian Zhang, Yihao Feng, Mingyuan Zhou\n  - Key: model-based, offline, marginal importance weight\n  - OpenReview: 7, 6, 6, 5\n  - ExpEnv: [d4rl dataset](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [Model-Based Offline Reinforcement Learning with Pessimism-Modulated Dynamics Belief](https:\u002F\u002Fopenreview.net\u002Fforum?id=oDWyVsHBzNT)\n  - Kaiyang Guo, Yunfeng Shao, Yanhui Geng\n  - Key: model-based, offline\n  - OpenReview: 8, 8, 7, 7\n  - ExpEnv: [d4rl dataset](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [Double Check Your State Before Trusting It: Confidence-Aware Bidirectional Offline Model-Based Imagination](https:\u002F\u002Fopenreview.net\u002Fforum?id=3e3IQMLDSLP)\n  - Jiafei Lyu, Xiu Li, Zongqing Lu\n  - Key: double check mechanism, bidirectional modeling, offline RL\n  - OpenReview: 7, 6, 6\n  - ExpEnv: [d4rl dataset](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [Model-Based Opponent Modeling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2108.01843)\n  - XiaoPeng Yu, Jiechuan Jiang, Wanpeng Zhang, Haobin Jiang, Zongqing Lu\n  - Key: multi-agent, model-based\n  - OpenReview: 7, 6, 4, 3\n  - ExpEnv: [mpe](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmultiagent-particle-envs), [google research football](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ffootball)\n\n- [Mingling Foresight with Imagination: Model-Based Cooperative Multi-Agent Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.09418)\n  - Zhiwei Xu, Dapeng Li, Bin Zhang, Yuan Zhan, Yunpeng Bai, Guoliang Fan\n  - Key: multi-agent, model-based\n  - OpenReview: 6, 5\n  - ExpEnv: [StarCraft II](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fpysc2), [Google Research Football](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ffootball), [Multi-Agent Discrete MuJoCo](https:\u002F\u002Fgithub.com\u002Fschroederdewitt\u002Fmultiagent_mujoco)\n\n- [MoCoDA: Model-based Counterfactual Data Augmentation](https:\u002F\u002Fopenreview.net\u002Fforum?id=w6tBOjPCrIO)\n  - Silviu Pitis, Elliot Creager, Ajay Mandlekar, Animesh Garg\n  - Key: data augmentation framework, offline RL\n  - OpenReview: 7, 7, 7, 6\n  - ExpEnv: [2D Navigation](https:\u002F\u002Fgithub.com\u002Fspitis\u002Fmocoda\u002Fblob\u002Fmain\u002Faugment_offline_toy.py#L45), [Hook-Sweep](https:\u002F\u002Fgithub.com\u002Fspitis\u002Fmrl\u002Fblob\u002Fmaster\u002Fenvs\u002Fcustomfetch\u002Fcustom_fetch.py#L1699)\n\n- [When to Update Your Model: Constrained Model-based Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=9a1oV7UunyP)\n  - Tianying Ji, Yu Luo, Fuchun Sun, Mingxuan Jing, 
Fengxiang He, Wenbing Huang\n  - Key: event-triggered mechanism, constrained model-shift lower-bound optimization\n  - OpenReview: 6, 6, 5, 5\n  - ExpEnv: [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [Model-based Safe Deep Reinforcement Learning via a Constrained Proximal Policy Optimization Algorithm](https:\u002F\u002Fopenreview.net\u002Fforum?id=hYa_lseXK8)\n  - Ashish Jayant, Shalabh Bhatnagar\n  - Key: constrained RL, model-based\n  - OpenReview: 7, 6, 5, 5\n  - ExpEnv: [safety gym](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fsafety-gym)\n\n- [Learning to Attack Federated Learning: A Model-based Reinforcement Learning Attack Framework](https:\u002F\u002Fopenreview.net\u002Fforum?id=4OHRr7gmhd4)\n  - Henger Li, Xiaolin Sun, Zizhan Zheng\n  - Key: attack & defense, federated learning, model-based\n  - OpenReview: 6, 6, 6, 5\n  - ExpEnv: MNIST, FashionMNIST, EMNIST, CIFAR-10 and synthetic dataset\n\n- [Model-Based Imitation Learning for Urban Driving](https:\u002F\u002Fopenreview.net\u002Fforum?id=Zk1SbbdZwS)\n  - Anthony Hu, Gianluca Corrado, Nicolas Griffiths, Zachary Murez, Corina Gurau, Hudson Yeo, Alex Kendall, Roberto Cipolla, Jamie Shotton\n  - Key: model-based, imitation learning, autonomous driving\n  - OpenReview: 7, 6, 6\n  - ExpEnv: [CARLA](https:\u002F\u002Fgithub.com\u002Fwayveai\u002Fmile\u002Ftree\u002Fmain\u002Fcarla_gym)\n\n- [Data-Driven Model-Based Optimization via Invariant Representation Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=gKe_A-DxzkH)\n  - Han Qi, Yi Su, Aviral Kumar, Sergey Levine\n  - Key: domain adaptation, invariant objective models, representation learning (not about model-based RL)\n  - OpenReview: 7, 6, 6, 5, 5\n  - ExpEnv: [design-bench](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fdesign-bench)\n\n- [Model-based Lifelong Reinforcement Learning with Bayesian Exploration](https:\u002F\u002Fopenreview.net\u002Fforum?id=6I3zJn9Slsb)\n  - Haotian Fu, Shangqun Yu, Michael Littman, George Konidaris\n  - Key: lifelong RL, variational bayesian\n  - OpenReview: 7, 6, 6\n  - ExpEnv: [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py), [meta-world](https:\u002F\u002Fgithub.com\u002Frlworkgroup\u002Fmetaworld)\n\n- [Plan To Predict: Learning an Uncertainty-Foreseeing Model For Model-Based Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=L9YayWPcHA_)\n  - Zifan Wu, Chao Yu, Chen Chen, Jianye Hao, Hankz Hankui Zhuo\n  - Key: treat the model rollout process as a sequential decision making problem\n  - OpenReview: 7, 7, 6, 6\n  - ExpEnv: [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py), [d4rl](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [Joint Model-Policy Optimization of a Lower Bound for Model-Based RL](https:\u002F\u002Fopenreview.net\u002Fforum?id=LYfFj-Vk6lt)\n  - Benjamin Eysenbach, Alexander Khazatsky, Sergey Levine, Russ Salakhutdinov\n  - Key: unified objective for model-based RL\n  - OpenReview: 8, 8, 7, 6\n  - ExpEnv: [gridworld](https:\u002F\u002Fgithub.com\u002Fdennybritz\u002Freinforcement-learning\u002Fblob\u002Fmaster\u002Flib\u002Fenvs\u002Fgridworld.py), [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py), [ROBEL manipulation](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Frobel)\n\n- [RAMBO-RL: Robust Adversarial Model-Based Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=nrksGSRT7kX)\n  - Marc Rigter, Bruno Lacerda, Nick Hawes\n  - Key: offline rl, model-based rl, 
two-player game, adversarial model training\n  - OpenReview: 6, 6, 6, 4\n  - ExpEnv: [d4rl](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [Conservative Dual Policy Optimization for Efficient Model-Based Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=xL7B5axplIe)\n  - Shenao Zhang\n  - Key: posterior sampling RL, referential update, constrained conservative update\n  - OpenReview: 7, 7, 5, 5\n  - ExpEnv: [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py), [N-Chain MDPs](https:\u002F\u002Fgithub.com\u002FstratisMarkou\u002Fsample-efficient-bayesian-rl\u002Fblob\u002Fmaster\u002Fcode\u002FEnvironments.py)\n\n- [Bayesian Optimistic Optimization: Optimistic Exploration for Model-based Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=GdHVClGh9N)\n  - Chenyang Wu, Tianci Li, Zongzhang Zhang, Yang Yu\n  - Key: optimism in the face of uncertainty (OFU), BOO Regret\n  - OpenReview: 6, 6, 5\n  - ExpEnv: [RiverSwim, Chain, Random MDPs]()\n\n- [Model-based RL with Optimistic Posterior Sampling: Structural Conditions and Sample Complexity](https:\u002F\u002Fopenreview.net\u002Fforum?id=bEMrmaw8gOB)\n  - Alekh Agarwal, Tong Zhang\n  - Key: posterior sampling RL, Bellman error decoupling framework\n  - OpenReview: 7, 7, 7, 6\n  - ExpEnv: None\n\n- [Exponential Family Model-Based Reinforcement Learning via Score Matching](https:\u002F\u002Fopenreview.net\u002Fforum?id=G1uywu6vNZe)\n  - Gene Li, Junbo Li, Nathan Srebro, Zhaoran Wang, Zhuoran Yang\n  - Key: optimistic model-based, score matching\n  - OpenReview: 7, 7, 6\n  - ExpEnv: None\n\n- [Deep Hierarchical Planning from Pixels](https:\u002F\u002Fopenreview.net\u002Fforum?id=wZk69kjy9_d)\n  - Danijar Hafner, Kuang-Huei Lee, Ian Fischer, Pieter Abbeel\n  - Key: hierarchical RL, long-horizon and sparse reward tasks\n  - OpenReview: 6, 6, 5\n  - ExpEnv: [atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym), [deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control), [deepmind lab](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Flab), [crafter](https:\u002F\u002Fgithub.com\u002Fdanijar\u002Fcrafter)\n\n- [Continuous MDP Homomorphisms and Homomorphic Policy Gradient](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.07364)\n  - Sahand Rezaei-Shoshtari, Rosie Zhao, Prakash Panangaden, David Meger, Doina Precup\n  - Key: Homomorphic Policy Gradient, Continuous MDP Homomorphisms, Lax Bisimulation Loss\n  - OpenReview: 7, 7, 7\n  - ExpEnv: [deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n\u003C\u002Fdetails>\n\n### ICML 2022\n\n\u003Cdetails open>\n\u003Csummary>Toggle\u003C\u002Fsummary>\n\n- [DreamerPro: Reconstruction-Free Model-Based Reinforcement Learning with Prototypical Representations](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.14565)\n  - Fei Deng, Ingook Jang, Sungjin Ahn\n  - Key: dreamer, prototypes\n  - ExpEnv: [deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [Denoised MDPs: Learning World Models Better Than the World Itself](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.15477.pdf)\n  - Tongzhou Wang, Simon Du, Antonio Torralba, Phillip Isola, Amy Zhang, Yuandong Tian\n  - Key: representation learning, denoised model\n  - ExpEnv: [deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control), [RoboDesk](https:\u002F\u002Fgithub.com\u002FSsnL\u002Frobodesk)\n\n- [Model-based Meta Reinforcement Learning using Graph Structured Surrogate Models 
and Amortized Policy Search](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.08291.pdf)\n  - Qi Wang, Herke van Hoof\n  - Key: graph structured surrogate model, meta training\n  - ExpEnv: [atari, mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [Towards Adaptive Model-Based Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.11464.pdf)\n  - Yi Wan, Ali Rahimi-Kalahroudi, Janarthanan Rajendran, Ida Momennejad, Sarath Chandar, Harm van Seijen\n  - Key: local change adaptation\n  - ExpEnv: [GridWorldLoCA, ReacherLoCA, MountaincarLoCA](https:\u002F\u002Fgithub.com\u002Fchandar-lab\u002FLoCA2)\n\n- [Efficient Model-based Multi-agent Reinforcement Learning via Optimistic Equilibrium Computation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07322.pdf)\n  - Pier Giuseppe Sessa, Maryam Kamgarpour, Andreas Krause\n  - Key: model-based multi-agent, confidence bound\n  - ExpEnv: [SMART](https:\u002F\u002Fgithub.com\u002Fhuawei-noah\u002FSMARTS)\n\n- [Regularizing a Model-based Policy Stationary Distribution to Stabilize Offline Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07166.pdf)\n  - Shentao Yang, Yihao Feng, Shujian Zhang, Mingyuan Zhou\n  - Key: offline rl, model-based rl, stationary distribution regularization\n  - ExpEnv: [d4rl](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [Design-Bench: Benchmarks for Data-Driven Offline Model-Based Optimization](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.08450.pdf)\n  - Brandon Trabucco, Xinyang Geng, Aviral Kumar, Sergey Levine\n  - Key: benchmark, offline MBO\n  - ExpEnv: [Design-Bench Benchmark Tasks](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fdesign-bench)\n\n- [Temporal Difference Learning for Model Predictive Control](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04955.pdf)\n  - Nicklas Hansen, Hao Su, Xiaolong Wang\n  - Key: td-learning, MPC\n  - ExpEnv: [deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control), [Meta-World](https:\u002F\u002Fgithub.com\u002Frlworkgroup\u002Fmetaworld)\n\n\u003C\u002Fdetails>\n\n### ICLR 2022\n\n\u003Cdetails open>\n\u003Csummary>Toggle\u003C\u002Fsummary>\n\n- [Revisiting Design Choices in Offline Model Based Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=zz9hXVhf40)\n  - Cong Lu, Philip Ball, Jack Parker-Holder, Michael Osborne, Stephen J. 
Roberts\n  - Key: model-based offline, uncertainty quantification\n  - OpenReview: 8, 8, 6, 6, 6\n  - ExpEnv: [d4rl dataset](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [Value Gradient weighted Model-Based Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=4-D6CZkRXxI)\n  - Claas A Voelcker, Victor Liao, Animesh Garg, Amir-massoud Farahmand\n  - Key: Value-Gradient weighted Model loss\n  - OpenReview: 8, 8, 6, 6\n  - ExpEnv: [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [Planning in Stochastic Environments with a Learned Model](https:\u002F\u002Fopenreview.net\u002Fforum?id=X6D9bAHhBQ1)\n  - Ioannis Antonoglou, Julian Schrittwieser, Sherjil Ozair, Thomas K Hubert, David Silver\n  - Key: MCTS, stochastic MuZero\n  - OpenReview: 10, 8, 8, 5\n  - ExpEnv: 2048 game, Backgammon, Go\n\n- [Policy improvement by planning with Gumbel](https:\u002F\u002Fopenreview.net\u002Fforum?id=bERaNdoegnO)\n  - Ivo Danihelka, Arthur Guez, Julian Schrittwieser, David Silver\n  - Key: Gumbel AlphaZero, Gumbel MuZero\n  - OpenReview: 8, 8, 8, 6\n  - ExpEnv: go, chess, [atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [Model-Based Offline Meta-Reinforcement Learning with Regularization](https:\u002F\u002Fopenreview.net\u002Fforum?id=EBn0uInJZWh)\n  - Sen Lin, Jialin Wan, Tengyu Xu, Yingbin Liang, Junshan Zhang\n  - Key: model-based offline Meta-RL\n  - OpenReview: 8, 6, 6, 6\n  - ExpEnv: [d4rl dataset](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [On-Policy Model Errors in Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=81e1aeOt-sd)\n  - Lukas Froehlich, Maksym Lefarov, Melanie Zeilinger, Felix Berkenkamp\n  - Key: model errors, on-policy corrections\n  - OpenReview: 8, 6, 6, 5\n  - ExpEnv: [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py), [pybullet](https:\u002F\u002Fgithub.com\u002Fbenelot\u002Fpybullet-gym)\n\n- [A Relational Intervention Approach for Unsupervised Dynamics Generalization in Model-Based Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=YRq0ZUnzKoZ)\n  - Jiaxian Guo, Mingming Gong, Dacheng Tao\n  - Key: relational intervention, dynamics generalization\n  - OpenReview: 8, 8, 6, 6\n  - ExpEnv: [Pendulum](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym), [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [Information Prioritization through Empowerment in Visual Model-based RL](https:\u002F\u002Fopenreview.net\u002Fforum?id=DfUjyyRW90)\n  - Homanga Bharadhwaj, Mohammad Babaeizadeh, Dumitru Erhan, Sergey Levine\n  - Key: mutual information, visual model-based RL\n  - OpenReview: 8, 8, 8, 6\n  - ExpEnv: [deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control), [Kinetics dataset](https:\u002F\u002Fgithub.com\u002Fcvdfoundation\u002Fkinetics-dataset)\n\n- [Transfer RL across Observation Feature Spaces via Model-Based Regularization](https:\u002F\u002Fopenreview.net\u002Fforum?id=7KdAoOsI81C)\n  - Yanchao Sun, Ruijie Zheng, Xiyao Wang, Andrew E Cohen, Furong Huang\n  - Key: latent dynamics model, transfer RL\n  - OpenReview: 8, 6, 5, 5\n  - ExpEnv: [CartPole, Acrobot and Cheetah-Run](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym), [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py), [3DBall](https:\u002F\u002Fgithub.com\u002FUnity-Technologies\u002Fml-agents)\n\n- [Learning State Representations via Retracing in Reinforcement 
Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=CLpxpXqqBV)\n  - Changmin Yu, Dong Li, Jianye HAO, Jun Wang, Neil Burgess\n  - Key: representation learning, learning via retracing\n  - OpenReview: 8, 6, 5, 3\n  - ExpEnv: [deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [Model-augmented Prioritized Experience Replay](https:\u002F\u002Fopenreview.net\u002Fforum?id=WuEiafqdy9H)\n  - Youngmin Oh, Jinwoo Shin, Eunho Yang, Sung Ju Hwang\n  - Key: prioritized experience replay, mbrl\n  - OpenReview: 8, 8, 6, 5\n  - ExpEnv: [pybullet](https:\u002F\u002Fgithub.com\u002Fbenelot\u002Fpybullet-gym)\n\n- [Evaluating Model-Based Planning and Planner Amortization for Continuous Control](https:\u002F\u002Fopenreview.net\u002Fforum?id=SS8F6tFX3-)\n  - Arunkumar Byravan, Leonard Hasenclever, Piotr Trochim, Mehdi Mirza, Alessandro Davide Ialongo, Yuval Tassa, Jost Tobias Springenberg, Abbas Abdolmaleki, Nicolas Heess, Josh Merel, Martin Riedmiller\n  - Key: model predictive control\n  - OpenReview: 8, 6, 6, 6\n  - ExpEnv: [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [Gradient Information Matters in Policy Optimization by Back-propagating through Model](https:\u002F\u002Fopenreview.net\u002Fforum?id=rzvOQrnclO0)\n  - Chongchong Li, Yue Wang, Wei Chen, Yuting Liu, Zhi-Ming Ma, Tie-Yan Liu\n  - Key: two-model-based method, analyze model error and policy gradient\n  - OpenReview: 8, 8, 6, 6\n  - ExpEnv: [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [Pareto Policy Pool for Model-based Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=OqcZu8JIIzS)\n  - Yijun Yang, Jing Jiang, Tianyi Zhou, Jie Ma, Yuhui Shi\n  - Key: model-based offline, model return-uncertainty trade-off\n  - OpenReview: 8, 8, 6, 5\n  - ExpEnv: [d4rl dataset](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [Pessimistic Model-based Offline Reinforcement Learning under Partial Coverage](https:\u002F\u002Fopenreview.net\u002Fforum?id=tyrJsbKAe6)\n  - Masatoshi Uehara, Wen Sun\n  - Key: model-based offline theory, PAC bounds\n  - OpenReview: 8, 6, 6, 5\n  - ExpEnv: None\n\n- [Know Thyself: Transferable Visual Control Policies Through Robot-Awareness](https:\u002F\u002Fopenreview.net\u002Fforum?id=o0ehFykKVtr)\n  - Edward S. 
Hu, Kun Huang, Oleh Rybkin, Dinesh Jayaraman\n  - Key: world models that transfer to new robots\n  - OpenReview: 8, 6, 6, 5\n  - ExpEnv: [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py), WidowX and Franka Panda robot\n\n\u003C\u002Fdetails>\n\n### NeurIPS 2021\n\n\u003Cdetails open>\n\u003Csummary>Toggle\u003C\u002Fsummary>\n\n- [On Effective Scheduling of Model-based Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.08550)\n  - Hang Lai, Jian Shen, Weinan Zhang, Yimin Huang, Xing Zhang, Ruiming Tang, Yong Yu, Zhenguo Li\n  - Key: extension of mbpo, hyper-controller learning\n  - OpenReview: 8, 6, 6\n  - ExpEnv: [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py), [pybullet](https:\u002F\u002Fgithub.com\u002Fbenelot\u002Fpybullet-gym)\n\n- [COMBO: Conservative Offline Model-Based Policy Optimization](https:\u002F\u002Fopenreview.net\u002Fpdf?id=dUEpGV2mhf)\n  - Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, Chelsea Finn\n  - Key: offline reinforcement learning, model-based reinforcement learning, deep reinforcement learning\n  - OpenReview: 6, 7, 6, 8\n  - ExpEnv: [d4rl dataset](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [Safe Reinforcement Learning by Imagining the Near Future](https:\u002F\u002Farxiv.org\u002Fabs\u002F2202.07789)\n  - Garrett Thomas, Yuping Luo, Tengyu Ma\n  - Key: safe rl, reward penalty, theory about model-based rollouts\n  - OpenReview: 8, 6, 6\n  - ExpEnv: [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [Model-Based Reinforcement Learning via Imagination with Derived Memory](https:\u002F\u002Fopenreview.net\u002Fforum?id=jeATherHHGj)\n  - Yao Mu, Yuzheng Zhuang, Bin Wang, Guangxiang Zhu, Wulong Liu, Jianyu Chen, Ping Luo, Shengbo Eben Li, Chongjie Zhang, Jianye HAO\n  - Key: extension of dreamer, prediction-reliability weight\n  - OpenReview: 6, 6, 6, 6\n  - ExpEnv: [deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [MobILE: Model-Based Imitation Learning From Observation Alone](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.10769)\n  - Rahul Kidambi, Jonathan Chang, Wen Sun\n  - Key: imitation learning from observations alone, mbrl\n  - OpenReview: 6, 6, 6, 4\n  - ExpEnv: [cartpole](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym), [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [Model-Based Episodic Memory Induces Dynamic Hybrid Controls](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.02104)\n  - Hung Le, Thommen Karimpanal George, Majid Abdolshah, Truyen Tran, Svetha Venkatesh\n  - Key: model-based, episodic control\n  - OpenReview: 7, 7, 6, 6\n  - ExpEnv: [2D maze navigation](https:\u002F\u002Fgithub.com\u002FMattChanTK\u002Fgym-maze), [cartpole, mountainCar and lunarlander](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym), [atari](https:\u002F\u002Fgym.openai.com\u002Fenvs\u002Fatari), [3D navigation: gym-miniworld](https:\u002F\u002Fgithub.com\u002Fmaximecb\u002Fgym-miniworld)\n\n- [A Consciousness-Inspired Planning Agent for Model-Based Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.02097)\n  - Mingde Zhao, Zhen Liu, Sitao Luan, Shuyuan Zhang, Doina Precup, Yoshua Bengio\n  - Key: mbrl, set representation\n  - OpenReview: 7, 7, 7, 6\n  - ExpEnv: [MiniGrid-BabyAI framework](https:\u002F\u002Fgithub.com\u002Fmaximecb\u002Fgym-minigrid)\n\n- [Mastering Atari Games with Limited Data](https:\u002F\u002Fopenreview.net\u002Fforum?id=OKrNPg3xR3T)\n  - 
Weirui Ye, Shaohuai Liu, Thanard Kurutach, Pieter Abbeel, Yang Gao\n  - Key: muzero, self-supervised consistency loss\n  - OpenReview: 7, 7, 7, 5\n  - ExpEnv: [atari 100k](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym), [deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [Online and Offline Reinforcement Learning by Planning with a Learned Model](https:\u002F\u002Fopenreview.net\u002Fforum?id=HKtsGW-lNbw)\n  - Julian Schrittwieser, Thomas K Hubert, Amol Mandhane, Mohammadamin Barekatain, Ioannis Antonoglou, David Silver\n  - Key: muzero, reanalyse, offline\n  - OpenReview: 8, 8, 7, 6\n  - ExpEnv: [atari dataset, deepmind control suite dataset](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdeepmind-research\u002Ftree\u002Fmaster\u002Frl_unplugged)\n\n- [Self-Consistent Models and Values](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.12840)\n  - Gregory Farquhar, Kate Baumli, Zita Marinho, Angelos Filos, Matteo Hessel, Hado van Hasselt, David Silver\n  - Key: new model learning way\n  - OpenReview: 7, 7, 7, 6\n  - ExpEnv: tabular MDP, Sokoban, [atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [Proper Value Equivalence](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.10316)\n  - Christopher Grimm, Andre Barreto, Gregory Farquhar, David Silver, Satinder Singh\n  - Key: value equivalence, value-based planning, muzero\n  - OpenReview: 8, 7, 7, 6\n  - ExpEnv: [four rooms](https:\u002F\u002Fgithub.com\u002Fmaximecb\u002Fgym-minigrid), [atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [MOPO: Model-based Offline Policy Optimization](https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.13239)\n  - Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, Tengyu Ma\n  - Key: model-based, offline\n  - OpenReview: None\n  - ExpEnv: [d4rl dataset](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl), halfcheetah-jump and ant-angle\n\n- [RoMA: Robust Model Adaptation for Offline Model-based Optimization](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.14188)\n  - Sihyun Yu, Sungsoo Ahn, Le Song, Jinwoo Shin\n  - Key: model-based, offline\n  - OpenReview: 7, 6, 6\n  - ExpEnv: [design-bench](https:\u002F\u002Fgithub.com\u002Fbrandontrabucco\u002Fdesign-bench)\n\n- [Offline Reinforcement Learning with Reverse Model-based Imagination](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.00188)\n  - Jianhao Wang, Wenzhe Li, Haozhe Jiang, Guangxiang Zhu, Siyuan Li, Chongjie Zhang\n  - Key: model-based, offline\n  - OpenReview: 7, 6, 6, 5\n  - ExpEnv: [d4rl dataset](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [Offline Model-based Adaptable Policy Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=lrdXc17jm6)\n  - Xiong-Hui Chen, Yang Yu, Qingyang Li, Fan-Ming Luo, Zhiwei Tony Qin, Shang Wenjie, Jieping Ye\n  - Key: model-based, offline\n  - OpenReview: 6, 6, 6, 4\n  - ExpEnv: [d4rl dataset](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [Weighted model estimation for offline model-based reinforcement learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=zdC5eXljMPy)\n  - Toru Hishinuma, Kei Senda\n  - Key: model-based, offline, off-policy evaluation\n  - OpenReview: 7, 6, 6, 6\n  - ExpEnv: pendulum, [d4rl dataset](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.06394)\n  - Weitong Zhang, Dongruo Zhou, Quanquan Gu\n  - Key: 
learning theory, model-based reward-free RL, linear function approximation\n  - OpenReview: 6, 6, 5, 5\n  - ExpEnv: None\n\n- [Provable Model-based Nonlinear Bandit and Reinforcement Learning: Shelve Optimism, Embrace Virtual Curvature](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.04168)\n  - Kefan Dong, Jiaqi Yang, Tengyu Ma\n  - Key: learning theory, model-based bandit RL, nonlinear function approximation\n  - OpenReview: 7, 7, 7, 6\n  - ExpEnv: None\n\n- [Discovering and Achieving Goals via World Models](https:\u002F\u002Fopenreview.net\u002Fforum?id=6vWuYzkp8d)\n  - Russell Mendonca, Oleh Rybkin, Kostas Daniilidis, Danijar Hafner, Deepak Pathak\n  - Key: unsupervised goal reaching, goal-conditioned RL\n  - OpenReview: 6, 6, 6, 6, 6\n  - ExpEnv: [walker, quadruped, bins, kitchen](https:\u002F\u002Fgithub.com\u002Forybkin\u002Flexa-benchmark)\n\n\u003C\u002Fdetails>\n\n### ICLR 2021\n\n\u003Cdetails open>\n\u003Csummary>Toggle\u003C\u002Fsummary>\n\n- [Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization](https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.03647)\n  - Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo, Ofir Nachum, Shixiang Gu\n  - Key: model-based, behavior cloning (warmup), trpo\n  - OpenReview: 8, 7, 7, 5\n  - ExpEnv: [d4rl dataset](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [Control-Aware Representations for Model-based Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.13408)\n  - Brandon Cui, Yinlam Chow, Mohammad Ghavamzadeh\n  - Key: representation learning, model-based soft actor-critic\n  - OpenReview: 6, 6, 6\n  - ExpEnv: planar system, inverted pendulum swingup, cartpole, 3-link manipulator swingup & balance\n\n- [Mastering Atari with Discrete World Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.02193)\n  - Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, Jimmy Ba\n  - Key: DreamerV2, many tricks (multiple categorical variables, KL balancing, etc.)\n  - OpenReview: 9, 8, 5, 4\n  - ExpEnv: [atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [Model-Based Visual Planning with Self-Supervised Functional Distances](https:\u002F\u002Fopenreview.net\u002Fforum?id=UcoXdfrORC)\n  - Stephen Tian, Suraj Nair, Frederik Ebert, Sudeep Dasari, Benjamin Eysenbach, Chelsea Finn, Sergey Levine\n  - Key: goal-reaching task, dynamics learning, distance learning (goal-conditioned Q-function)\n  - OpenReview: 7, 7, 7, 7\n  - ExpEnv: [sawyer](https:\u002F\u002Fgithub.com\u002Frlworkgroup\u002Fmetaworld\u002Ftree\u002Fmaster\u002Fmetaworld\u002Fenvs), door sliding\n\n- [Model-Based Offline Planning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2008.05556)\n  - Arthur Argenson, Gabriel Dulac-Arnold\n  - Key: model-based, offline\n  - OpenReview: 8, 7, 5, 5\n  - ExpEnv: [RL Unplugged (RLU)](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdeepmind-research\u002Ftree\u002Fmaster\u002Frl_unplugged), [d4rl dataset](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [Offline Model-Based Optimization via Normalized Maximum Likelihood Estimation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.07970)\n  - Justin Fu, Sergey Levine\n  - Key: model-based, offline\n  - OpenReview: 8, 6, 6\n  - ExpEnv: [design-bench](https:\u002F\u002Fgithub.com\u002Fbrandontrabucco\u002Fdesign-bench)\n\n- [On the role of planning in model-based deep reinforcement learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2011.04021)\n  - Jessica B. Hamrick, Abram L. 
Friesen, Feryal Behbahani, Arthur Guez, Fabio Viola, Sims Witherspoon, Thomas Anthony, Lars Buesing, Petar Veličković, Théophane Weber\n  - Key: discussion about planning in MuZero\n  - OpenReview: 7, 7, 6, 5\n  - ExpEnv: [atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym), go, [deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [Representation Balancing Offline Model-based Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fforum?id=QpNz8r_Ri2Y)\n  - Byung-Jun Lee, Jongmin Lee, Kee-Eung Kim\n  - Key: Representation Balancing MDP, model-based, offline\n  - OpenReview: 7, 7, 7, 6\n  - ExpEnv: [d4rl dataset](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [Model-based micro-data reinforcement learning: what are the crucial model properties and which model to choose?](https:\u002F\u002Fopenreview.net\u002Fforum?id=p5uylG94S68)\n  - Balázs Kégl, Gabriel Hurtado, Albert Thomas\n  - Key: mixture density nets, heteroscedasticity\n  - OpenReview: 7, 7, 7, 6, 5\n  - ExpEnv: [acrobot system](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n\u003C\u002Fdetails>\n\n### ICML 2021\n\n\u003Cdetails open>\n\u003Csummary>Toggle\u003C\u002Fsummary>\n\n- [Conservative Objective Models for Effective Offline Model-Based Optimization](https:\u002F\u002Farxiv.org\u002Fabs\u002F2107.06882)\n  - Brandon Trabucco, Aviral Kumar, Xinyang Geng, Sergey Levine\n  - Key: conservative objective model, offline mbrl\n  - ExpEnv: [design-bench](https:\u002F\u002Fgithub.com\u002Fbrandontrabucco\u002Fdesign-bench)\n\n- [Continuous-Time Model-Based Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.04764)\n  - Çağatay Yıldız, Markus Heinonen, Harri Lähdesmäki\n  - Key: continuous-time\n  - ExpEnv: [pendulum, cartpole and acrobot](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [Model-Based Reinforcement Learning via Latent-Space Collocation](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.13229)\n  - Oleh Rybkin, Chuning Zhu, Anusha Nagabandi, Kostas Daniilidis, Igor Mordatch, Sergey Levine\n  - Key: latent space collocation\n  - ExpEnv: [sparse metaworld tasks](https:\u002F\u002Fgithub.com\u002Frlworkgroup\u002Fmetaworld\u002Ftree\u002Fmaster\u002Fmetaworld\u002Fenvs)\n\n- [Model-Free and Model-Based Policy Evaluation when Causality is Uncertain](http:\u002F\u002Fproceedings.mlr.press\u002Fv139\u002Fbruns-smith21a.html)\n  - David A Bruns-Smith\n  - Key: worst-case bounds\n  - ExpEnv: [ope-tools](https:\u002F\u002Fgithub.com\u002Fclvoloshin\u002FCOBS)\n\n- [Muesli: Combining Improvements in Policy Optimization](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.06159)\n  - Matteo Hessel, Ivo Danihelka, Fabio Viola, Arthur Guez, Simon Schmitt, Laurent Sifre, Theophane Weber, David Silver, Hado van Hasselt\n  - Key: value equivalence\n  - ExpEnv: [atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [Vector Quantized Models for Planning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.04615.pdf)\n  - Sherjil Ozair, Yazhe Li, Ali Razavi, Ioannis Antonoglou, Aäron van den Oord, Oriol Vinyals\n  - Key: VQVAE, MCTS\n  - ExpEnv: [chess datasets](https:\u002F\u002Fwww.ficsgames.org\u002Fdownload.html), [DeepMind Lab](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Flab)\n\n- [PC-MLP: Model-based Reinforcement Learning with Policy Cover Guided Exploration](https:\u002F\u002Farxiv.org\u002Fabs\u002F2107.07410)\n  - Yuda Song, Wen Sun\n  - Key: sample complexity, kernelized nonlinear regulators, linear MDPs\n  - ExpEnv: [mountain 
car, antmaze](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym), [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [Temporal Predictive Coding For Model-Based Planning In Latent Space](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.07156)\n  - Tung Nguyen, Rui Shu, Tuan Pham, Hung Bui, Stefano Ermon\n  - Key: temporal predictive coding with an RSSM, latent space\n  - ExpEnv: [deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [Model-based Reinforcement Learning for Continuous Control with Posterior Sampling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2012.09613)\n  - Ying Fan, Yifei Ming\n  - Key: regret bound of psrl, mpc\n  - ExpEnv: [continuous cartpole, pendulum swingup](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym), [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [A Sharp Analysis of Model-based Reinforcement Learning with Self-Play](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.01604)\n  - Qinghua Liu, Tiancheng Yu, Yu Bai, Chi Jin\n  - Key: learning theory, multi-agent, model-based self play, two-player zero-sum Markov games\n  - ExpEnv: None\n\n\u003C\u002Fdetails>\n\n### Other\n\n- [UniZero: Generalized and Efficient Planning with Scalable Latent World Models](https:\u002F\u002Fopenreview.net\u002Fforum?id=Gl6dF9soQo)\n  - Yuan Pu, Yazhe Niu, Zhenjie Yang, Jiyuan Ren, Hongsheng Li, Yu Liu *TMLR 2025*\n  - Key: world model, MCTS, model-based reinforcement learning, transformer, latent planning, multitask learning\n  - ExpEnv: Atari, DMControl, VisualMatch\n\n- [Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2024\u002Fhtml\u002FWang_Driving_into_the_Future_Multiview_Visual_Forecasting_and_Planning_with_CVPR_2024_paper.html)\n  - Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, Zhaoxiang Zhang *CVPR 2024*\n  - Key: AutoDrive world modeling\n  - ExpEnv: [nuScenes]()\n\n- [DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving](https:\u002F\u002Fopenreview.net\u002Fpdf?id=tT3LUdmzbd)\n  - Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, Liping Jing, Yiming Nie, Bin Dai *CVPR 2024*\n  - Key: AutoDrive world modeling\n  - ExpEnv: [nuScenes](), [OpenScene]()\n\n- [Masked Trajectory Models for Prediction, Representation, and Control](https:\u002F\u002Fopenreview.net\u002Fpdf?id=tT3LUdmzbd)\n  - Philipp Wu, Arjun Majumdar, Kevin Stone, Yixin Lin, Igor Mordatch, Pieter Abbeel, Aravind Rajeswaran *ICLR 2023 Workshop RRL*\n  - Key: offline RL, learning for control, sequence modeling\n  - ExpEnv: [d4rl](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [World Models via Policy-Guided Trajectory Diffusion](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.08533)\n  - Marc Rigter, Jun Yamada, Ingmar Posner *Arxiv 2023*\n  - Key: Diffusion model, world model\n  - ExpEnv: [deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control), [gridworld](https:\u002F\u002Fgithub.com\u002Fdennybritz\u002Freinforcement-learning\u002Fblob\u002Fmaster\u002Flib\u002Fenvs\u002Fgridworld.py)\n\n- [Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.04386)\n  - Carlos E. Luis, Alessandro G. 
Bottero, Julia Vinogradska, Felix Berkenkamp, Jan Peters *Arxiv 2023*\n  - Key: uncertainty estimation of cumulative rewards in MBRL\n  - ExpEnv: [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [Sample-Efficient Learning to Solve a Real-World Labyrinth Game Using Data-Augmented Model-Based Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.09906)\n  - Thomas Bi, Raffaello D'Andrea. *Arxiv 2023*\n  - Key: data augmentation, DreamerV3\n  - ExpEnv: [Real-World Labyrinth Game]()\n\n- [Mastering Diverse Domains through World Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.04104)\n  - Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, Timothy Lillicrap. *Arxiv 2023*\n  - Key: DreamerV3, scaling properties of world models\n  - ExpEnv: [deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control), [atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym), [DMLab](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Flab), [minecraft](https:\u002F\u002Fgithub.com\u002Fminerllabs\u002Fminerl)\n\n- [Theoretically Guaranteed Policy Improvement Distilled from Model-Based Planning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.12933)\n  - Chuming Li, Ruonan Jia, Jiawei Yao, Jie Liu, Yinmin Zhang, Yazhe Niu, Yaodong Yang, Yu Liu, Wanli Ouyang. *IJCAI Workshop 2023*\n  - Key: extended policy improvement, model regularization, planning theorem\n  - ExpEnv: [mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n\n## Tutorial\n\n- [Video] [Csaba Szepesvári - The challenges of model-based reinforcement learning and how to overcome them](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=-Y-fHsPIQ_Q)\n- [Blog] [Model-Based Reinforcement Learning: Theory and Practice](https:\u002F\u002Fbair.berkeley.edu\u002Fblog\u002F2019\u002F12\u002F12\u002Fmbpo\u002F)\n\n\n## Codebase\n\n- [mbrl-lib](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fmbrl-lib) - Meta: Library for Model Based RL\n- [DI-engine](https:\u002F\u002Fgithub.com\u002Fopendilab\u002FDI-engine) - OpenDILab: Decision AI Engine\n
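\nTo make the 'Learn the Model' \u002F 'Given the Model' split from the taxonomy above concrete, here is a minimal, self-contained Dyna-style sketch in the spirit of the classic Dyna paper listed above. It is purely illustrative: the toy `ChainEnv` and every name in it are invented for this example, and nothing here is taken from mbrl-lib, DI-engine, or any paper in this list.\n\n```python\nimport random\nfrom collections import defaultdict\n\n# Toy deterministic chain 0..5 (illustrative only): action 1 moves right,\n# action 0 moves left; reaching state 5 yields reward 1 and ends the episode.\nclass ChainEnv:\n    n_actions = 2\n    def reset(self):\n        self.s = 0\n        return self.s\n    def step(self, a):\n        self.s = min(5, self.s + 1) if a == 1 else max(0, self.s - 1)\n        return self.s, float(self.s == 5), self.s == 5\n\n# Tabular Dyna-Q: the `model` dict is the 'learn the model' half; the simulated\n# updates in the planning loop are the 'given (utilize) the model' half.\ndef dyna_q(env, episodes=50, n_planning=10, alpha=0.1, gamma=0.95, eps=0.1):\n    q = defaultdict(float)  # Q-values keyed by (state, action)\n    model, seen = {}, []    # learned transition table and observed (s, a) keys\n\n    def act(s):  # epsilon-greedy with random tie-breaking\n        if random.random() < eps:\n            return random.randrange(env.n_actions)\n        qs = [q[(s, a)] for a in range(env.n_actions)]\n        return random.choice([a for a, v in enumerate(qs) if v == max(qs)])\n\n    for _ in range(episodes):\n        s, done = env.reset(), False\n        while not done:\n            a = act(s)\n            s2, r, done = env.step(a)\n            # Direct RL: one-step Q-learning update from the real transition.\n            target = r + (0.0 if done else gamma * max(q[(s2, b)] for b in range(env.n_actions)))\n            q[(s, a)] += alpha * (target - q[(s, a)])\n            # Model learning: memorize what the environment did.\n            if (s, a) not in model:\n                seen.append((s, a))\n            model[(s, a)] = (r, s2, done)\n            # Planning: extra updates from simulated (model) experience.\n            for _ in range(n_planning):\n                ps, pa = random.choice(seen)\n                pr, ps2, pdone = model[(ps, pa)]\n                t = pr + (0.0 if pdone else gamma * max(q[(ps2, b)] for b in range(env.n_actions)))\n                q[(ps, pa)] += alpha * (t - q[(ps, pa)])\n            s = s2\n    return q\n\nq = dyna_q(ChainEnv())\nprint(max(q[(0, a)] for a in range(ChainEnv.n_actions)))  # value estimate at the start state\n```\n\nReal codebases such as the two above replace the transition table with a learned neural dynamics model and the planning loop with imagined rollouts or tree search, but the division of labor stays the same.\n\n\n## Contributing\n\nOur purpose is to make this repo even better. 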
If you are interested in contributing, please refer to [HERE](CONTRIBUTING.md) for instructions on contributing.\n\n\n## License\n\nAwesome Model-Based RL is released under the Apache 2.0 license.\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">Back to top\u003C\u002Fa>)\u003C\u002Fp>\n","# 优秀的基于模型的强化学习\n\n[![Awesome](https:\u002F\u002Fcdn.rawgit.com\u002Fsindresorhus\u002Fawesome\u002Fd7305f38d29fed78fa85652e3a63e154dd8e8829\u002Fmedia\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fsindresorhus\u002Fawesome) [![文档](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdocs-latest-blue)](https:\u002F\u002Fgithub.com\u002Fopendilab\u002Fawesome-model-based-RL) ![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopendilab\u002Fawesome-model-based-RL?color=yellow) ![GitHub Fork 数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fforks\u002Fopendilab\u002Fawesome-model-based-RL?color=9cf) [![GitHub 许可证](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fopendilab\u002Fawesome-model-based-RL)](https:\u002F\u002Fgithub.com\u002Fopendilab\u002Fawesome-model-based-RL\u002Fblob\u002Fmain\u002FLICENSE)\n\n这是一个关于**基于模型的强化学习（mbrl）**的研究论文合集。\n该仓库将持续更新，以跟踪基于模型强化学习领域的最新进展。\n\n欢迎关注并加星标！\n\n\u003Cpre name=\"code\" class=\"html\">\n\u003Cfont color=\"red\">[2025.12.01] \u003Cb>新增：我们更新了基于模型强化学习的 NeurIPS 2025 论文列表！\u003C\u002Fb>\u003C\u002Ffont>\n\n[2025.08.28] 我们更新了基于模型强化学习的 ICML 2025 论文列表。\n\n[2025.02.06] 我们更新了基于模型强化学习的 ICLR 2025 论文列表。\n\n[2024.10.27] 我们更新了基于模型强化学习的 NeurIPS 2024 论文列表。\n\n[2024.05.20] 我们更新了基于模型强化学习的 ICML 2024 论文列表。\n\n[2023.11.29] 我们更新了基于模型强化学习的 ICLR 2024 论文列表。\n\n[2023.09.29] 我们更新了基于模型强化学习的 NeurIPS 2023 论文列表。\n\n[2023.06.15] 我们更新了基于模型强化学习的 ICML 2023 论文列表。\n\n[2023.02.05] 我们更新了基于模型强化学习的 ICLR 2023 论文列表。\n\n[2022.11.03] 我们更新了基于模型强化学习的 NeurIPS 2022 论文列表。\n\n[2022.07.06] 我们更新了基于模型强化学习的 ICML 2022 论文列表。\n\n[2022.02.13] 我们更新了基于模型强化学习的 ICLR 2022 论文列表。\n\n[2021.12.28] 我们发布了优秀的基于模型强化学习资源。\n\u003C\u002Fpre>\n\n\n## 目录\n\n- [优秀的基于模型的强化学习](#awesome-model-based-reinforcement-learning)\n  - [目录](#table-of-contents)\n  - [基于模型强化学习算法分类](#a-taxonomy-of-model-based-rl-algorithms)\n  - [论文](#papers)\n    - [经典基于模型强化学习论文](#classic-model-based-rl-papers)\n    - [NeurIPS 2025](#neurips-2025)\n    - [ICML 2025](#icml-2025)\n    - [ICLR 2025](#iclr-2025)\n    - [NeurIPS 2024](#neurips-2024)\n    - [ICML 2024](#icml-2024)\n    - [ICLR 2024](#iclr-2024)\n    - [NeurIPS 2023](#neurips-2023)\n    - [ICML 2023](#icml-2023)\n    - [ICLR 2023](#iclr-2023)\n    - [NeurIPS 2022](#neurips-2022)\n    - [ICML 2022](#icml-2022)\n    - [ICLR 2022](#iclr-2022)\n    - [NeurIPS 2021](#neurips-2021)\n    - [ICLR 2021](#iclr-2021)\n    - [ICML 2021](#icml-2021)\n    - [其他](#other)\n  - [教程](#tutorial)\n  - [代码库](#codebase)\n  - [贡献](#contributing)\n  - [许可证](#license)\n\n\n## 基于模型强化学习算法分类\n\n在开始本节之前，我们先声明一点：要绘制一个准确且全面的基于模型强化学习算法分类体系确实非常困难，因为算法的模块化特性很难用树状结构来完整表达。因此，我们将发布一系列相关博客，以更深入地介绍各种基于模型强化学习算法。\n\n\u003Cp align=\"center\">\n    \u003Cimg style=\"border-radius: 0.3125em;\n    box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);\"\n    src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopendilab_awesome-model-based-RL_readme_553abf8de4ad.png\">\n    \u003Cbr>\n    \u003Cem style=\"display: inline-block;\">现代基于模型强化学习中一种非详尽但实用的算法分类。\u003C\u002Fem>\n\u003C\u002Fp>\n\n我们简单地将“基于模型强化学习”分为两大类：“学习模型”和“给定模型”。\n\n- “学习模型”主要关注如何构建环境模型。\n- “给定模型”则侧重于如何利用已学习到的模型。\n\n如上图所示，我们给出了一些示例，并附上了相关算法的链接。
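\n\n下面给出一段极简、可直接运行的示意代码（仅作说明，并非上述任何仓库或论文的实现，示例中的玩具 MDP 与所有命名均为虚构）：第一步对应“学习模型”，即从交互数据中估计转移与奖励；第二步对应“给定模型”，即只在学到的模型上做价值迭代规划，不再访问真实环境。\n\n```python\nimport random\n\nN_S, N_A, GAMMA = 4, 2, 0.9\n\n# 玩具确定性 MDP（充当真实环境，仅为示意）：动作 1 前进，动作 0 后退，走到末端得奖励 1\ndef true_step(s, a):\n    s2 = min(N_S - 1, s + 1) if a == 1 else max(0, s - 1)\n    return s2, 1.0 if s2 == N_S - 1 else 0.0\n\n# 1) 学习模型：随机交互并记录 (s, a) -> (s2, r)；确定性环境下观测一次即可记住\nmodel = {}\nwhile len(model) < N_S * N_A:\n    s, a = random.randrange(N_S), random.randrange(N_A)\n    model[(s, a)] = true_step(s, a)\n\n# 2) 给定模型：在学到的模型上做价值迭代（纯规划，不再与真实环境交互）\nV = [0.0] * N_S\nfor _ in range(100):\n    V = [max(r + GAMMA * V[s2] for a in range(N_A) for (s2, r) in [model[(s, a)]])\n         for s in range(N_S)]\nprint(V)  # 越靠近末端，状态价值越高\n```\n\n>[1] [World Models](https:\u002F\u002Fworldmodels.github.io\u002F)：Ha 和 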
Schmidhuber，2018年  \n[2] [I2A](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.06203)（想象增强智能体）：Weber 等，2017年  \n[3] [MBMF](https:\u002F\u002Fsites.google.com\u002Fview\u002Fmbmf)（结合无模型微调的基于模型强化学习）：Nagabandi 等，2017年  \n[4] [MBVE](https:\u002F\u002Farxiv.org\u002Fabs\u002F1803.00101)（基于模型的价值扩展）：Feinberg 等，2018年  \n[5] [ExIt](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.08439)（专家迭代）：Anthony 等，2017年  \n[6] [AlphaZero](https:\u002F\u002Farxiv.org\u002Fabs\u002F1712.01815)：Silver 等，2017年  \n[7] [POPLIN](https:\u002F\u002Fopenreview.net\u002Fforum?id=H1exf64KwH)（基于模型的策略规划）：Wang 等，2019年  \n[8] [M2AC](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.04893)（掩码式基于模型的演员-评论家）：Pan 等，2020年\n\n\n## 论文\n\n```\n格式：\n- [标题](论文链接) [链接]\n  - 作者1、作者2和作者3\n  - 关键点：关键问题和见解\n  - OpenReview：可选\n  - 实验环境：实验使用的环境\n```\n\n### 经典基于模型的强化学习论文\n\n\u003Cdetails open>\n\u003Csummary>展开\u002F折叠\u003C\u002Fsummary>\n\n- [Dyna：一种集成的学习、规划与反应架构](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F122344.122377)\n  - 理查德·萨顿。*ACM 1991*\n  - 关键点：dyna架构\n  - 实验环境：无\n\n- [PILCO：一种基于模型且数据高效的策略搜索方法](https:\u002F\u002Fwww.researchgate.net\u002Fpublication\u002F221345233_PILCO_A_Model-Based_and_Data-Efficient_Approach_to_Policy_Search)\n  - 马克·彼得·戴森罗斯特，卡尔·爱德华·拉斯穆森。*ICML 2011*\n  - 关键点：概率动力学模型\n  - 实验环境：倒立摆系统，机器人独轮车\n\n- [利用轨迹优化学习复杂的神经网络策略](https:\u002F\u002Fproceedings.mlr.press\u002Fv32\u002Flevine14.html)\n  - 谢尔盖·列维涅，弗拉季伦·科尔屯。*ICML 2014*\n  - 关键点：引导式策略搜索\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [通过随机价值梯度学习连续控制策略](https:\u002F\u002Farxiv.org\u002Fabs\u002F1510.09142)\n  - 尼古拉斯·海斯，格雷格·韦恩，大卫·西尔弗，蒂莫西·利利克拉普，尤瓦尔·塔萨，汤姆·埃雷兹。*NIPS 2015*\n  - 关键点：路径反向传播，真实轨迹上的梯度\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [价值预测网络](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.03497)\n  - 欧俊赫，萨廷德·辛格，李洪洛克。*NIPS 2017*\n  - 关键点：价值预测模型  \u003C!-- VE? 
-->\n  - 实验环境：收集领域，[atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [基于随机集成价值扩展的样本高效强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1807.01675)\n  - 雅各布·巴克曼，达尼雅尔·哈夫纳，乔治·塔克，尤金·布雷夫多，李洪洛克。*NIPS 2018*\n  - 关键点：集成模型与Q网络，价值扩展\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)，[roboschool](https:\u002F\u002Fgithub.com\u002Fopenai\u002Froboschool)\n\n- [循环世界模型促进策略进化](https:\u002F\u002Farxiv.org\u002Fabs\u002F1809.01999)\n  - 大卫·哈，尤尔根·施密德胡伯。*NIPS 2018*\n  - 关键点：vae(表征)，rnn(预测模型)\n  - 实验环境：[赛车](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)，[vizdoom](https:\u002F\u002Fgithub.com\u002Fmwydmuch\u002FViZDoom)\n\n- [利用概率动力学模型在少数几次试验中实现深度强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1805.12114)\n  - 库特兰·丘阿，罗伯托·卡兰德拉，罗温·麦卡利斯特，谢尔盖·列维涅。*NIPS 2018*\n  - 关键点：带有轨迹采样的概率集成\n  - 实验环境：[cartpole](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)，[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [何时信任你的模型：基于模型的策略优化](https:\u002F\u002Farxiv.org\u002Fabs\u002F1906.08253)\n  - 迈克尔·詹纳，贾斯汀·傅，马文·张，谢尔盖·列维涅。*NeurIPS 2019*\n  - 关键点：集成模型，sac，*k*-分支展开\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [具有理论保证的基于模型深度强化学习算法框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F1807.03858)\n  - 周平，许华哲，李元智，田元东，特雷弗·达雷尔，马腾宇。*ICLR 2019*\n  - 关键点：差异界设计，多步ME-TRPO，熵正则化\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [模型集成信任区域策略优化](https:\u002F\u002Fopenreview.net\u002Fforum?id=SJJinbWRZ)\n  - 塔纳德·库鲁塔奇，伊格纳西·克拉韦拉，严端，阿维夫·塔马尔，皮特·阿贝尔。*ICLR 2018*\n  - 关键点：集成模型，TRPO\n  \u003C!-- - OpenReview: 7, 7, 6 -->\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [从梦想到控制：通过潜在空间想象学习行为](https:\u002F\u002Farxiv.org\u002Fabs\u002F1912.01603)\n  - 达尼雅尔·哈夫纳，蒂莫西·利利克拉普，吉米·巴，穆罕默德·诺鲁齐。*ICLR 2020*\n  - 关键点：DreamerV1，潜在空间想象\n  - 实验环境：[deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)，[atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)，[deepmind lab](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Flab)\n\n- [利用策略网络探索基于模型的规划](https:\u002F\u002Fopenreview.net\u002Fforum?id=H1exf64KwH)\n  - 王婷武，吉米·巴。*ICLR 2020*\n  - 关键点：在动作空间和参数空间中进行基于模型的策略规划\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [通过规划已学习的模型掌握Atari、围棋、国际象棋和将棋](https:\u002F\u002Farxiv.org\u002Fabs\u002F1911.08265)\n  - 朱利安·施里特维瑟，伊万尼斯·安东格卢，托马斯·休伯特，卡伦·西蒙扬，洛朗·西弗，西蒙·施密特，阿瑟·格茨，爱德华·洛克哈特，德米斯·哈萨比斯，托雷·格雷佩尔，蒂莫西·利利克拉普，大卫·西尔弗。*Nature 2020*\n  - 关键点：MCTS，价值等价\n  - 实验环境：国际象棋，将棋，围棋，[atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n\u003C\u002Fdetails>\n\n### NeurIPS 2025\n\n\u003Cdetails open>\n\u003Csummary>展开\u002F折叠\u003C\u002Fsummary>\n\n- [通过基于模型的强化学习中的对齐表征实现稳定规划](https:\u002F\u002Fopenreview.net\u002Fforum?id=Uv7V1gTOjK)\n  - 米萨格·索尔塔尼，福雷斯特·阿戈斯蒂内利。*NeurIPS 2025*\n  - 关键点：视觉规划，对齐表征，离散潜在状态，启发式搜索\n  - 实验环境：魔方，推箱子\n\n- [RLVR-World：使用强化学习训练世界模型](https:\u002F\u002Fopenreview.net\u002Fforum?id=jpiSagi8aV)\n  - 龙明生等人。*NeurIPS 2025*\n  - 关键点：世界模型训练，决策感知，可验证奖励\n  - 实验环境：文字游戏，机器人操作\n\n- [Dyn-O：利用以物体为中心的表征构建结构化世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.03298)\n  - 微软研究院等。*NeurIPS 2025*\n  - 关键点：结构化世界模型，以物体为中心，物理建模\n  - 实验环境：物理交互，物体操作\n\n- [基于模型的探索增强的离策略强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=JGkZgEEjiM)\n  - 匿名作者等。*NeurIPS 2025*\n  - 关键点：探索，扩散模型，合成经验，数据增强\n  - 实验环境：mujoco，稀疏奖励任务\n\n- [从扩散启发的角度重新审视多智能体世界建模](https:\u002F\u002Fopenreview.net\u002Fforum?id=rRxFIOoEeF)\n  - 李秀等人。*NeurIPS 2025*\n  - 关键点：多智能体MBRL，扩散启发，序列建模，联合分布\n  - 实验环境：SMAC，MPE\n\n- 
[SPiDR：一种用于模拟到现实迁移中零样本安全性的简单方法](https:\u002F\u002Fopenreview.net\u002Fforum?id=Pe1ypX9gBO)\n  - 亚登·阿斯，程锐·邱，本杰明·昂格尔，董浩·康，马克斯·范德哈特，莱希·史，斯特利安·科罗斯，亚当·维尔曼，安德烈亚斯·克劳斯。*NeurIPS 2025*\n  - 关键点：安全的MBRL，模拟到现实，集成不确定性，鲁棒控制\n  - 实验环境：现实世界机器人，safety gym\n- [通过收敛到更平坦的极小值来改进基于模型的强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=vcB1OwtWUZ)\n  - 施里尼瓦斯·拉马苏布拉马尼安，本杰明·弗里德，亚历山大·卡波内，杰夫·施奈德。*NeurIPS 2025*\n  - 关键点：模型误差，仿真引理，模型泛化\n  - 实验环境：DMC，Atari100k，HumanoidBench\n\n\u003C\u002Fdetails>\n\n### ICML 2025\n\n\u003Cdetails open>\n\u003Csummary>展开\u002F折叠\u003C\u002Fsummary>\n\n- [提升用于数据高效强化学习的 Transformer 世界模型](https:\u002F\u002Fopenreview.net\u002Fforum?id=IajCvMJw41)\n  - 作者：Antoine Dedieu、Joseph Ortiz、Xinghua Lou、Carter Wendelken、Wolfgang Lehrach、J Swaroop Guntupalli、Miguel Lazaro-Gredilla、Kevin Murphy\n  - 关键点：带预热的 Dyna 策略、最近邻补丁标记化、块教师强制\n  - OpenReview 评分：4, 4, 4, 3\n  - 实验环境：craftax-classic\n\n- [窃取那顿免费午餐：揭示 Dyna 式强化学习的局限性](https:\u002F\u002Fopenreview.net\u002Fforum?id=Zt05jXhqXx)\n  - 作者：Brett Barkley、David Fridovich-Keil\n  - 关键点：Dyna 式算法在大多数 DMC 环境中会显著降低性能。\n  - OpenReview 评分：4, 4, 3, 2\n  - 实验环境：gym、DeepMind Control Suite\n\n- [持续基于模型的强化学习中的知识保留](https:\u002F\u002Fopenreview.net\u002Fforum?id=DiqeZY27XK) \n  - 作者：Haotian Fu、Yixiang Sun、Michael L. Littman、George Konidaris\n  - 关键点：合成经验回放、通过探索恢复记忆\n  - OpenReview 评分：4, 3, 3, 3\n  - 实验环境：mini-grid、deepmind control suite\n\n- [面向自适应预测与控制的时间感知世界模型](https:\u002F\u002Fopenreview.net\u002Fforum?id=gZ5N3TLjwv) \n  - 作者：Anh N Nhu、Sanghyun Son、Ming Lin\n  - 关键点：根据时间步长 ∆t 进行条件建模，并在多种不同的 ∆t 值上进行训练\n  - OpenReview 评分：4, 3, 3\n  - 实验环境：meta-world 控制任务、PDE 控制任务\n\n- [视频增强的离线强化学习：一种基于模型的方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.06482)\n  - 作者：Minting Pan、Yitao Zheng、Jiajian Li、Yunbo Wang、Xiaokang Yang\n  - 关键点：行为抽象网络、分层世界模型\n  - OpenReview 评分：3, 3, 3, 2\n  - 实验环境：meta-world、carla、minedojo\n\n- [面向离线基于模型强化学习的时距感知转移增强](https:\u002F\u002Fopenreview.net\u002Fforum?id=drBVowFvqf)\n  - 作者：Dongsu Lee、Minhae Kwon\n  - 关键点：学习一种潜在抽象，从轨迹和状态空间的转移层面捕捉时间距离。\n  - OpenReview 评分：4, 3, 3, 2\n  - 实验环境：D4RL、AntMaze、FrankaKitchen、CALVIN、基于像素的 FrankaKitchen。\n\n- [PIGDreamer：面向安全部分可观测强化学习的特权信息引导世界模型](https:\u002F\u002Fopenreview.net\u002Fforum?id=mtk8tTKWs0) \n  - 作者：Dongchi Huang、Jiaqi WANG、Yang Li、Chunhe Xia、Tianle Zhang、Kaige Zhang\n  - 关键点：通过特权表示对齐和非对称的演员-评论家结构来利用特权信息\n  - OpenReview 评分：3, 3, 3\n  - 实验环境：safety gymnasium benchmark、guard benchmark\n\n- [用于在线模仿学习的无奖励世界模型](https:\u002F\u002Fopenreview.net\u002Fforum?id=owEhpoKBKC)\n  - 作者：Shangzhe Li、Zhiao Huang、Hao Su\n  - 关键点：无奖励世界模型、逆向软 Q 学习目标\n  - OpenReview 评分：4, 3, 3, 3\n  - 实验环境：DMControl、MyoSuite、ManiSkill2\n\n- [FOUNDER：将基础模型嵌入世界模型，用于开放式具身决策](https:\u002F\u002Fopenreview.net\u002Fforum?id=UTT5OTyIWm)\n  - 作者：Yucen Wang、Rui Yu、Shenghua Wan、Le Gan、De-Chuan Zhan\n  - 关键点：将 FM 表征嵌入 WM 状态空间，基于模型的目标条件强化学习\n  - OpenReview 评分：4, 3, 3, 3\n  - 实验环境：DMControl、Kitchen、minecraft\n\n- [通过在线世界模型规划实现持续强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=mQeZEsdODh)\n  - 作者：Zichen Liu、Guoji Fu、Chao Du、Wee Sun Lee、Min Lin\n  - 关键点：使用在线世界模型进行规划、后悔分析\n  - OpenReview 评分：4, 4, 4, 3\n  - 实验环境：[ContinualBench](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002FContinualBench\u002Ftree\u002Fmain\u002Fcontinual_bench\u002Fenvs)\n\n- [预训练智能体和世界模型的规模定律](https:\u002F\u002Fopenreview.net\u002Fpdf?id=HHwGfLOKxq) \n  - 作者：Tim Pearce*、Tabish Rashid*、David Bignell、Raluca Georgescu、Sam Devlin、Katja Hofmann\n  - 关键点：规模定律、具身 AI、行为克隆、世界建模、分词器、架构\n  - 实验环境：Bleeding Edge、RT-1（机器人）、Atari、NetHack\n\n- 
[DINO-WM：基于预训练视觉特征的世界模型实现零样本规划](https:\u002F\u002Fopenreview.net\u002Fpdf?id=D5RNACOZEI)\n  - 作者：Gaoyue Zhou、Hengkai Pan、Yann LeCun、Lerrel Pinto\n  - 关键点：世界模型、离线学习、零样本规划、预训练视觉特征、任务无关推理\n  - 实验环境：Maze、Wall、Reach、Push-T、绳索操作、颗粒物操作\n\n- [通用智能体需要世界模型](https:\u002F\u002Fopenreview.net\u002Fpdf?id=dlIoumNiXt) \n  - 作者：Jonathan Richens、Tom Everitt、David Abel\n  - 关键点：世界模型、目标导向行为、无模型学习、策略分析、后悔界\n  - 实验环境：具有不同采样轨迹和目标深度的合成受控马尔可夫过程（cMP）环境\n\n- [RobustZero：提升 MuZero 强化学习对状态扰动的鲁棒性](https:\u002F\u002Fopenreview.net\u002Fpdf?id=DaOdkXgLvE)\n  - 作者：Yushuai Li、Hengyu Liu、Torben Bach Pedersen、Yuqiang He、Kim Guldstrand Larsen、Lu Chen、Christian S. Jensen、Jiachen Xu、Tianyi Li\n  - 关键点：MuZero、鲁棒性、强化学习、状态扰动、自监督学习、适应性调整\n  - 实验环境：CartPole、Pendulum、IEEE 34-bus、IEEE 123-bus、IEEE 8500-node、Highway、Intersection、Racetrack、Hopper、Walker2d、HalfCheetah、Ant\n\n- [使用掩码潜在变换器实现准确高效的世界建模](https:\u002F\u002Fopenreview.net\u002Fpdf?id=zNUOZcAUxz)\n  - 作者：Maxime Burchi、Radu Timofte\n  - 关键点：基于模型的强化学习、世界模型、MaskGIT、空间潜在空间、Dreamer、Transformer、效率\n  - 实验环境：Crafter、Atari 100k\n\n- [用于异构环境的轨迹世界模型](https:\u002F\u002Fopenreview.net\u002Fforum?id=Py2KmXaRmi)  \n  - 作者：Shaofeng Yin、Jialong Wu、Siqiao Huang、Xingjian Su、Xu He、Jianye Hao、Mingsheng Long  \n  - 关键点：世界模型、异构环境、预训练、上下文学习、模型迁移、轨迹数据  \n  - 实验环境：UniTraj（80 种不同环境）、D4RL（HalfCheetah、Hopper、Walker2D）、Cart-2-Pole、Cart-3-Pole\n\n- [作为下一令牌预测基础的因果世界模型：在受控环境中探索 GPT](https:\u002F\u002Fopenreview.net\u002Fpdf?id=qA3xHJzF6B)\n  - 作者：Raanan Y. Rohekar、Yaniv Gurwicz、Sungduk Yu、Estelle Aflalo、Vasudev Lal\n  - 关键点：GPT、因果推断、注意力机制、结构化因果模型、零样本因果发现\n  - 实验环境：Othello、Chess\n\n\u003C\u002Fdetails>\n\n### ICLR 2025\n\n\u003Cdetails open>\n\u003Csummary>展开\u002F折叠\u003C\u002Fsummary>\n\n- [基于对比预测编码的 Transformer 世界模型学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=YK9G4Htdew)  \n  - Maxime Burchi, Radu Timofte  \n  - 关键词：基于模型的强化学习、Transformer 网络、对比预测编码  \n  - 实验环境：Atari 100k 基准\n\n- [预测性逆动力学模型是可扩展的机器人操作学习器](https:\u002F\u002Fopenreview.net\u002Fforum?id=meRCKuUpmc)  \n  - Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, Jiangmiao Pang  \n  - 关键词：机器人操作、预训练、视觉预见、逆动力学、大规模机器人数据集  \n  - 实验环境：LIBERO-LONG 基准、CALVIN ABC-D、真实世界任务\n\n- [OptionZero：基于学习到的选项进行规划](https:\u002F\u002Fopenreview.net\u002Fforum?id=3IFRygQKGL)  \n  - Po-Wei Huang, Pei-Chiun Peng, Hung Guei, Ti-Rong Wu  \n  - 关键词：选项、半马尔可夫决策过程、MuZero、蒙特卡洛树搜索、规划、强化学习  \n  - 实验环境：Atari\n\n- [MAD-TD：模型增强数据稳定高更新率强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=6RtRsg8ZV1)  \n  - Claas A Voelcker, Marcel Hussing, Eric Eaton, Amir-massoud Farahmand, Igor Gilitschenski  \n  - 关键词：强化学习、基于模型的强化学习、数据增强、高更新率  \n  - 实验环境：DeepMind Control Suite\n\n- [Kinetix：通过开放式物理控制任务探究通用智能体的训练](https:\u002F\u002Fopenreview.net\u002Fforum?id=zCxGCdzreM)  \n  - Michael Matthews, Michael Beukman, Chris Lu, Jakob Nicolaus Foerster  \n  - 关键词：强化学习、开放性、无监督环境设计、自动课程学习、基准  \n  - 实验环境：2D 物理任务、机器人运动、抓取、视频游戏、经典强化学习环境\n\n- [从示范序列中学习搜索](https:\u002F\u002Fopenreview.net\u002Fforum?id=v593OaNePQ)  \n  - Dixant Mittal, Liwei Kang, Wee Sun Lee  \n  - 关键词：规划、推理、学习搜索、强化学习、大型语言模型  \n  - 实验环境：24 点游戏、2D 格点导航、Procgen 游戏\n\n- [基于长短时想象的开放世界强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=vzItLaEoDa)  \n  - Jiajian Li, Qi Wang, Yunbo Wang, Xin Jin, Yang Li, Wenjun Zeng, Xiaokang Yang  \n  - 关键词：强化学习、世界模型、视觉控制  \n  - 实验环境：MineDojo\n\n- [MaestroMotif：基于人工智能反馈的设计技能](https:\u002F\u002Fopenreview.net\u002Fforum?id=or8mMhmyRV)  \n  - Martin Klissarov, Mikael Henaff, Roberta Raileanu, Shagun Sodhani, Pascal Vincent, Amy Zhang, Pierre-Luc Bacon, Doina Precup, Marlos C. 
Machado, Pierluca D'Oro  \n  - 关键词：层次化强化学习、强化学习、大型语言模型  \n  - 实验环境：NetHack 学习环境 (NLE)\n\n- [面向形状多变与可变形物体操作的几何感知强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=7BLXhmWvwF)  \n  - Tai Hoang, Huy Le, Philipp Becker, Vien Anh Ngo, Gerhard Neumann  \n  - 关键词：机器人操作、等变性、图神经网络、强化学习、可变形物体  \n  - 实验环境：刚性插入、绳索操作、使用多个末端执行器进行布料操作\n\n- [M^3PC：利用预训练掩码轨迹模型进行测试时模型预测控制](https:\u002F\u002Fopenreview.net\u002Fforum?id=inOwd7hZC1)  \n  - Kehan Wen, Yutong Hu, Yao Mu, Lei Ke  \n  - 关键词：离线到在线强化学习、基于模型的强化学习、掩码自编码、机器人学习  \n  - 实验环境：D4RL、RoboMimic\n\n- [通过学习排序实现基于模型的离线优化](https:\u002F\u002Fopenreview.net\u002Fforum?id=sb1HgVDLjN)  \n  - Rong-Xi Tan, Ke Xue, Shen-Huan Lyu, Haopu Shang, yaowang, Yaoyuan Wang, Fu Sheng, Chao Qian  \n  - 关键词：基于模型的离线优化、黑箱优化、学习排序、学习优化  \n  - 实验环境：涵盖多种优化场景的任务\n\n- [基于大型语言模型的蒙特卡洛规划在文本类游戏中的应用](https:\u002F\u002Fopenreview.net\u002Fforum?id=r1KcapkzCt)  \n  - Zijing Shi, Meng Fang, Ling Chen  \n  - 关键词：大型语言模型、蒙特卡洛树搜索、文本类游戏  \n  - 实验环境：Jericho 基准\n\n- [对无模型强化学习中涌现式规划的解释](https:\u002F\u002Fopenreview.net\u002Fforum?id=DzGe40glxs)  \n  - Thomas Bush, Stephen Chung, Usman Anwar, Adrià Garriga-Alonso, David Krueger  \n  - 关键词：强化学习、可解释性、规划、探针、无模型、机制性可解释性、推箱子  \n  - 实验环境：推箱子\n\n- [Drama：Mamba 加速的基于模型的强化学习具有样本和参数效率](https:\u002F\u002Fopenreview.net\u002Fforum?id=7XIkRgYjK3)  \n  - Wenlong Wang, Ivana Dusparic, Yucheng Shi, Ke Zhang, Vinny Cahill  \n  - 关键词：Mamba-2、基于模型的强化学习、Mamba、状态空间模型  \n  - 实验环境：Atari 100K\n\n- [利用大型语言模型的零样本基于模型的强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=uZFXpPrwSh)  \n  - Abdelhakim Benechehab, Youssef Attia El Hili, Ambroise Odonnat, Oussama Zekri, Albert Thomas, Giuseppe Paolo, Maurizio Filippone, Ievgen Redko, Balázs Kégl  \n  - 关键词：基于模型的强化学习、大型语言模型、零样本学习、上下文学习  \n  - 实验环境：D4RL、摆、HalfCheetah、Hopper\n\n- [关于基于模型的强化学习中的模拟运行](https:\u002F\u002Fopenreview.net\u002Fforum?id=Uh5GRmLlvt)  \n  - Bernd Frauenknecht, Devdutt Subhasish, Friedrich Solowjow, Sebastian Trimpe  \n  - 关键词：基于模型的强化学习、模型模拟、不确定性量化  \n  - 实验环境：Gym MuJoCo\n\n- [任意步长动力学模型提升在线和离线强化学习的未来预测能力](https:\u002F\u002Fopenreview.net\u002Fforum?id=JZCxlrwjZ8)  \n  - Haoxin Lin, Yu-Yan Xu, Yihao Sun, Zhilong Zhang, Yi-Chen Li, Chengxing Jia, Junyin Ye, Jiaji Zhang, Yang Yu  \n  - 关键词：基于模型的强化学习、任意步长动力学模型  \n  - 实验环境：D4RL、NeoRL、Gym MuJoCo-v3\n\n- [用于连续控制的离散码本世界模型](https:\u002F\u002Fopenreview.net\u002Fforum?id=lfRYzd8ady)  \n  - Aidan Scannell, Mohammadreza Nakhaeinezhadfard, Kalle Kujanpää, Yi Zhao, Kevin Sebastian Luck, Arno Solin, Joni Pajarinen  \n  - 关键词：强化学习、世界模型、表征学习、自监督学习、基于模型的强化学习、连续控制  \n  - 实验环境：[DeepMind 控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)、[Meta-World](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FMetaworld)、[Myosuite](https:\u002F\u002Fgithub.com\u002FMyoHub\u002Fmyosuite)\n\n\u003C\u002Fdetails>\n\n### NeurIPS 2024\n\n\u003Cdetails open>\n\u003Csummary>展开\u002F折叠\u003C\u002Fsummary>\n\n- [iVideoGPT：交互式 VideoGPT 是可扩展的世界模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.15223)\n  - 吴嘉龙、尹绍峰、冯宁雅、何旭、李栋、郝建业、龙明生\n  - 关键词：世界模型、视频生成模型、自回归 Transformer、强化学习、视频预测、视觉规划\n  - 实验环境：Meta-world\n  \n- [基于模型的强化学习在序列长度上的并行化](https:\u002F\u002Fopenreview.net\u002Fpdf\u002Fe061517a824b90efc807dc90ac6bbd20747bd654.pdf)\n  - 王子睿、邓悦、龙俊峰、张寅\n  - 关键词：强化学习、基于模型的强化学习、并行化、序列长度、世界模型、资格迹、样本效率\n  - 实验环境：Atari 100K、DMControl\n\n- [潜在动力学下的强化学习：迈向统计与算法模块化](https:\u002F\u002Fopenreview.net\u002Fpdf?id=qf2uZAdy1N)  \n  - 菲利普·阿莫蒂拉、迪伦·J·福斯特、南江、阿克沙伊·克里希纳穆提、扎卡里亚·姆哈梅迪  \n  - 关键词：强化学习、潜在动力学、统计模块化、算法模块化、可观测到潜在的约简、自预测模型  \n  - 实验环境：无  \n\n- 
[SPO：序贯蒙特卡洛策略优化](https:\u002F\u002Fopenreview.net\u002Fpdf?id=XKvYcPPH5G)  \n  - 马修·V·麦克法兰、埃丹·托莱多、唐纳尔·伯恩、保罗·达克沃斯、亚历山大·拉特尔  \n  - 关键词：强化学习、RL、基于模型的强化学习、序贯蒙特卡洛、期望最大化、规划  \n  - 实验环境：Brax、Boxoban、鲁比克魔方\n\n- [寻求共性但保留差异：用于多模态视觉 RL 的分解动力学建模](https:\u002F\u002Fopenreview.net\u002Fpdf?id=4php6bGL2W)  \n  - 黄扬儒、彭培熙、赵一凡、陈光耀、田永宏  \n  - 关键词：多模态强化学习、视觉 RL、动力学建模、模态一致性、模态不一致性、DDM  \n  - 实验环境：CARLA、DMControl\n\n- [预训练视觉表征在基于模型的强化学习中的惊人无效性](https:\u002F\u002Fopenreview.net\u002Fpdf?id=LvAy07mCxU)\n  - 莫里茨·施奈德、罗伯特·克鲁格、纳鲁纳斯·瓦斯凯维丘斯、路易吉·帕尔米耶里、约什卡·博德克尔\n  - 关键词：强化学习、RL、基于模型的强化学习、表征学习、PVR、视觉表征\n  - 实验环境：DMC、ManiSkill2、Miniworld\n\n- [仅用少量离线数据进行多智能体领域校准](https:\u002F\u002Fopenreview.net\u002Fpdf?id=LvAy07mCxU)\n  - 姜涛、袁磊、李和、关聪、张宗章、于洋\n  - 关键词：多智能体强化学习、领域迁移\n  - 实验环境：D4RL\n\n- [WorldCoder，一个基于模型的 LLM 智能体：通过编写代码并与环境交互来构建世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.12275)\n  - 唐浩、达伦·基、凯文·埃利斯\n  - 关键词：以代码形式学习世界模型、LLM\n  - 实验环境：[sokoban](https:\u002F\u002Fgithub.com\u002FmpSchrader\u002Fgym-sokoban)、[minigrid](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FMinigrid)、[alfworld](https:\u002F\u002Fgithub.com\u002Falfworld\u002Falfworld)\n\n- [离线基于模型的强化学习中的可达边界问题](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.12527)\n  - 安雅·西姆斯、卢聪、雅各布·福斯特、叶伟哲\n  - 关键词：可达边界问题、可达感知的价值学习\n  - 实验环境：[d4rl](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FD4RL)、[v-r4rl](https:\u002F\u002Fgithub.com\u002Fconglu1997\u002Fv-d4rl)\n\n- [用于改进基于模型的离线强化学习的确定性不确定性传播](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.04088)\n  - 阿卜杜拉·阿克居尔、曼努埃尔·豪斯曼、梅利赫·坎德米尔\n  - 关键词：论文认为，基于不确定性的奖励惩罚会引入过度保守主义，可能导致因低估而产生次优策略。\n  - 实验环境：[d4rl](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FD4RL)\n\n- [BECAUSE：双线性因果表征，用于可泛化的离线基于模型的强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.10967)\n  - 林浩鸿、丁文浩、陈健、石来喜、朱家成、李波、赵丁\n  - 关键词：目标不匹配问题、捕捉状态和动作的因果表征\n  - 实验环境：[robosuite](https:\u002F\u002Fgithub.com\u002FARISE-Initiative\u002Frobosuite)、[解锁](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FMinigrid)、[碰撞](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FHighwayEnv)\n\n- [基于模型的上下文强化学习迁移学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.04498)\n  - 曹正勋、温杜拉·贾亚瓦达纳、李思睿、凯茜·吴\n  - 关键词：贝叶斯优化、上下文 RL\n  - 实验环境：[高斯过程、交通信号、生态驾驶、辅助自主、控制任务]()\n\n- [通过表征复杂度的视角重新思考基于模型、基于策略和基于价值的强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.17248)\n  - 冯古浩、钟翰\n  - 关键词：RL 表征复杂度\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n\u003C!--- [基于模型的强化学习在序列长度上的并行化]()\n  - 王子睿、邓悦、龙俊峰、张寅\n  - 关键词：\n  - 实验环境：\n\n- [基于模型的离线强化学习中的受限潜在动作策略]()\n  - 马文·阿莱斯、菲利普·贝克-埃姆克、帕特里克·范德斯马赫特、马克西米利安·卡尔\n  - 关键词：\n  - 实验环境：\n\n- [策略形状预测：避免基于模型的 RL 中的干扰]()\n  - 迈尔斯·哈特森、艾萨克·考瓦尔、尼克·哈伯\n  - 关键词：\n  - 实验环境： -->\n\n\u003C\u002Fdetails>\n\n### ICML 2024\n\n\u003Cdetails open>\n\u003Csummary>切换\u003C\u002Fsummary>\n\n- [HarmonyDream: 世界模型中的任务协调](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.00344)\n  - 马浩宇、吴嘉龙、冯宁雅、肖晨俊、李东、郝建业、王建民、龙明生\n  - 关键点：世界模型中的观测建模和奖励建模分析\n  - 实验环境：[meta-world](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FMetaworld)、[rlbench](https:\u002F\u002Fgithub.com\u002Fstepjam\u002FRLBench)、[deepmind控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)、[atari 100k](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [3D-VLA：一种基于3D视觉-语言-动作的生成式世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.09631)\n  - 甄浩宇、邱晓文、陈培豪、杨锦程、严欣、杜一伦、洪怡宁、甘闯\n  - 关键点：利用生成式世界模型统一3D感知、推理和行动；构建大规模3D具身指令调优数据集\n  - 实验环境：[rlbench](https:\u002F\u002Fgithub.com\u002Fstepjam\u002FRLBench)、[calvin](https:\u002F\u002Fgithub.com\u002Fmees\u002Fcalvin)\n\n- 
[CompeteAI：理解基于大语言模型的智能体的竞争行为](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.17512)\n  - 赵钦林、王金东、张艺轩、金一乔、朱凯杰、陈浩、谢星\n  - 关键点：提出面向LLM智能体的竞争框架；构建模拟竞争环境\n  - 实验环境：一个仅包含餐厅和顾客的虚拟小镇\n\n- [面向参数化动作空间的基于模型的强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.03037)\n  - 张仁昊、傅浩天、缪怡琳、乔治·科尼达里斯\n  - 关键点：离散-连续混合动作空间、带参数化动作的动力学模型、参数化动作的MPC\n  - 实验环境：[platform, goal, hard goal, catch point, hard move](https:\u002F\u002Fgithub.com\u002FValarzz\u002FModel-based-Reinforcement-Learning-for-Parameterized-Action-Spaces\u002Ftree\u002Fmain\u002Fcommon)\n\n- [为世界模型学习鲁棒的潜在动态表征](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.06263)\n  - 孙瑞翔、臧宏宇、李欣、里亚沙特·伊斯兰\n  - 关键点：改进的Dreamer架构、混合循环状态空间模型\n  - 实验环境：[deepmind控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)、[分心版deepmind控制套件](https:\u002F\u002Fgithub.com\u002Fbit1029public\u002FHRSSM\u002Ftree\u002Fmain\u002Fenv)、[mani-skill2](https:\u002F\u002Fgithub.com\u002Fhaosulab\u002FManiSkill2)\n\n- [AD3：隐式动作是世界模型区分多样化视觉干扰的关键](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.09976)\n  - 王宇辰、万圣华、甘乐、冯帅、詹德川\n  - 关键点：隐式动作生成器、动作条件分离的世界模型\n  - 实验环境：[deepmind控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [Hieros：基于结构化状态序列的世界模型中的层次化想象](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.05167)\n  - 保罗·马特斯、赖纳·施洛瑟、拉尔夫·赫布里希\n  - 关键点：状态空间模型、多层层次化想象、基于S5的世界模型\n  - 实验环境：[atari 100k](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [通过并行观测预测改进基于token的世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.05643)\n  - 利奥尔·科恩、王凯欣、康炳义、希·曼诺尔\n  - 关键点：基于像素的MBRL、基于token的世界模型、保留型环境模型\n  - 实验环境：[atari 100k](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [Transformer世界模型能否提供更好的策略梯度？](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.05290)\n  - 米歇尔·马、倪天伟、克莱门特·格灵、皮埃尔卢卡·多罗、皮埃尔-吕克·培肯\n  - 关键点：动作世界模型\n  - 实验环境：[双摆](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)、[Myriad](https:\u002F\u002Fgithub.com\u002Fnikihowe\u002Fmyriad)\n\n- [Dr. 
Strategy：具有战略梦的基于模型的通用智能体](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.18866)\n  - 哈尼·哈梅德、金秀彬、金东英、尹在植、安成镇\n  - 关键点：在战略梦中训练三种策略——高速公路策略、探索者策略和成就者策略，进而完成下游任务\n  - 实验环境：2D导航、3D迷宫导航、RoboKitchen\n\n- [迈向对抗性破坏下的稳健基于模型的强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.08991)\n  - 叶晨露、何佳凡、顾全全、张彤\n  - 关键点：对基于模型强化学习中对抗性破坏的理论分析，涵盖在线和离线设置\n  - 实验环境：无\n\n- [面向混杂POMDP的基于模型的强化学习](https:\u002F\u002Fproceedings.mlr.press\u002Fv235\u002Fhong24d.html)\n  - 洪茂、齐正玲、徐燕勋\n  - 关键点：基于模型的强化学习、POMDP\n  - 实验环境：无\n\n\u003C!-- - [信任模型所信任的地方——基于模型的不确定性感知回放适应演员-评论家]()\n  - 伯恩德·弗劳恩克内希特、阿图尔·艾泽勒、德夫杜特·苏巴斯什、弗里德里希·索洛维约夫、塞巴斯蒂安·特林佩\n  - 关键点：\n  - 实验环境：\n\n- [具有时间感知和上下文增强的高效token化世界模型]()\n  - 文森特·米凯利、埃洛伊·阿隆索、弗朗索瓦·弗勒雷\n  - 关键点：\n  - 实验环境：\n\n- [协处理器演员-评论家：用于自适应深部脑刺激的基于模型强化学习方法]()\n  - 米歇尔·潘、玛丽亚·施鲁姆、维韦克·迈尔斯、埃尔德姆·比耶克、安卡·德拉甘\n  - 关键点：\n  - 实验环境：] -->\n\n\u003C\u002Fdetails>\n\n### ICLR 2024\n\n\u003Cdetails open>\n\u003Csummary>切换\u003C\u002Fsummary>\n\n- [策略排练：训练可泛化的强化学习策略](https:\u002F\u002Fopenreview.net\u002Fforum?id=m3xVPaZp6Z)\n  - 贾成兴、高晨晓、殷浩、张福祥、陈雄辉、许田、袁磊、张宗章、周志华、于洋\n  - 关键点：强化学习、基于模型的强化学习、离线强化学习\n  - OpenReview评分：8、8、8、6\n  - 实验环境：[d4rl](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [利用Koopman理论在交互环境中高效建模动力学](https:\u002F\u002Fopenreview.net\u002Fforum?id=fkrYDQaHOJ)\n  - 阿尔纳布·库马尔·蒙达尔、西巴·斯马拉克·帕尼格拉希、赛·拉杰斯瓦尔、卡利姆·西迪奇、西亚马克·拉万巴赫什\n  - 关键点：Koopman理论、强化学习、动力系统、规划、长距离动力学预测模型、高效的前向动力学\n  - OpenReview评分：8、6、5、3\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [结合空间与时间抽象进行规划以提升泛化能力](https:\u002F\u002Fopenreview.net\u002Fforum?id=eo9dHwtTFt)\n  - 赵明德、萨法·阿尔维尔、哈姆·范·塞延、罗曼·拉罗什、多伊娜·普雷库普、约书亚·本吉奥\n  - 关键点：强化学习、规划、神经网络、时序差分学习、泛化、深度强化学习\n  - OpenReview评分：6、6、6、5\n  - 实验环境：[MiniGrid-BabyAI框架](https:\u002F\u002Fgithub.com\u002Fmaximecb\u002Fgym-minigrid)\n\n- [用世界模型掌握记忆任务](https:\u002F\u002Fopenreview.net\u002Fforum?id=1vDArHJ68h)\n  - Mohammad Reza Samsami、Artem Zholus、Janarthanan Rajendran、Sarath Chandar\n  - 关键点：基于DreamerV3的回忆与想象模块\n  - OpenReview评分：10、8、6\n  - 实验环境：[bsuite](https:\u002F\u002Fgithub.com\u002Fgoogle-deepmind\u002Fbsuite)、[popgym](https:\u002F\u002Fgithub.com\u002Fproroklab\u002Fpopgym)、[atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)、[deepmind控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)、[记忆迷宫](https:\u002F\u002Fgithub.com\u002Fjurgisp\u002Fmemory-maze)\n\n- [特权感知支撑强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=EpVe8jAjdx)\n  - Edward S. 
Hu、James Springer、Oleh Rybkin、Dinesh Jayaraman\n  - 关键点：基于DreamerV3的特权信息\n  - OpenReview评分：10、8、8、8\n  - 实验环境：[gymnasium机器人学](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FGymnasium-Robotics)\n\n- [TD-MPC2：用于连续控制的可扩展、鲁棒的世界模型](https:\u002F\u002Fopenreview.net\u002Fforum?id=Oxh5CstDJU)\n  - Nicklas Hansen、Hao Su、Xiaolong Wang\n  - 关键点：隐式世界模型、模型预测控制、通用型td-mpc2\n  - OpenReview评分：8、8、8、8\n  - 实验环境：[deepmind控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)、[Meta-World](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FMetaworld)、[maniskill2](https:\u002F\u002Fgithub.com\u002Fhaosulab\u002FManiSkill2)、[myosuite](https:\u002F\u002Fgithub.com\u002FMyoHub\u002Fmyosuite)\n\n- [利用L1自适应控制的鲁棒基于模型的强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=GaLCLvJaoF)\n  - Minjun Sung、Sambhu Harimanas Karumanchi、Aditya Gahlawat、Naira Hovakimyan\n  - 关键点：L1自适应控制\n  - OpenReview评分：8、6、6、6\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [从离散潜在动力学中学习具有自适应时间抽象的层次化世界模型](https:\u002F\u002Fopenreview.net\u002Fforum?id=TjCDNssXKU)\n  - Christian Gumbsch、Noor Sajid、Georg Martius、Martin V. Butz\n  - 关键点：上下文特定的循环状态空间模型、层次化世界模型\n  - OpenReview评分：8、6、6\n  - 实验环境：[MiniHack](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fminihack)、[VisualPinPad](https:\u002F\u002Fgithub.com\u002Fdanijar\u002Fdirector\u002Fblob\u002Fmain\u002Fembodied\u002Fenvs\u002Fpinpad.py)、[MultiWorld](https:\u002F\u002Fgithub.com\u002Fvitchyr\u002Fmultiworld)\n\n- [通过离散扩散学习用于自动驾驶的无监督世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.01017)\n  - Lunjun Zhang、Yuwen Xiong、Ze Yang、Sergio Casas、Rui Hu、Raquel Urtasun\n  - 关键点：离散扩散；世界模型；自动驾驶\n  - OpenReview评分：10、8、6、6、6\n  - 实验环境：[NuScenes](https:\u002F\u002Fwww.nuscenes.org\u002F)、[KITTI里程计](https:\u002F\u002Fwww.cvlibs.net\u002Fdatasets\u002Fkitti\u002Feval_odometry.php)、[Argoverse2激光雷达](https:\u002F\u002Fwww.argoverse.org\u002Fav2.html)\n\n- [COPlanner：为基于模型的RL制定保守滚动但乐观探索的计划](https:\u002F\u002Fopenreview.net\u002Fforum?id=jnFcKjtUPN)\n  - Xiyao Wang、Ruijie Zheng、Yanchao Sun、Ruonan Jia、Wichayaporn Wongkamjan、Huazhe Xu、Furong Huang\n  - 关键点：保守的世界模型滚动、乐观的环境探索\n  - OpenReview评分：6、6、6\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)、[deepmind控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [通过规划实现高效的多智能体强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=CpnKq3UJwp)\n  - Qihan Liu、Jianing Ye、Xiaoteng Ma、Jun Yang、Bin Liang、Chongjie Zhang\n  - 关键点：mcts、乐观搜索lambda、优势加权策略优化\n  - OpenReview评分：8、6、6、6\n  - 实验环境：[smac](https:\u002F\u002Fgithub.com\u002Foxwhirl\u002Fsmac)\n\n- [可微轨迹优化作为强化学习和模仿学习的策略类](https:\u002F\u002Fopenreview.net\u002Fforum?id=HL5P4H8eO2)\n  - Weikang Wan、Yufei Wang、Zackory Erickson、David Held\n  - 关键点：可微轨迹优化\n  - OpenReview评分：10、8、8、5\n  - 实验环境：[deepmind控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)、[robomimic](https:\u002F\u002Fgithub.com\u002FARISE-Initiative\u002Frobomimic)、[maniskill](https:\u002F\u002Fgithub.com\u002Fhaosulab\u002FManiSkill2)\n\n- [DMBP：基于扩散模型的预测器，用于对抗状态观测扰动的稳健离线强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=ZULjcYLWKe)\n  - Zhihe YANG、Yunjian Xu\n  - 关键点：条件扩散、离线RL\n  - OpenReview评分：8、8、6、6\n  - 实验环境：[d4rl](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [MAMBA：一种用于元强化学习的有效世界模型方法](https:\u002F\u002Fopenreview.net\u002Fforum?id=1RE0H6mU7M)\n  - Zohar Rimon、Tom Jurgenson、Orr Krupnik、Gilad Adler、Aviv Tamar\n  - 关键点：基于dreamer的上下文元RL\n  - OpenReview评分：6、6、6、6\n  - 
实验环境：[点机器人导航、逃生室](https:\u002F\u002Fgithub.com\u002FRondorf\u002FBOReL\u002Fblob\u002Fmain\u002Fenvironments\u002Ftoy_navigation\u002Fpoint_robot.py)、[Reacher Sparse](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [奖励一致的动力学模型对离线强化学习具有强大的泛化能力](https:\u002F\u002Fopenreview.net\u002Fforum?id=GSBHKiw19c)\n  - Fan-Ming Luo、Tian Xu、Xingchen Cao、Yang Yu\n  - 关键点：奖励学习、离线RL\n  - OpenReview评分：8、6、6、6\n  - 实验环境：[d4rl](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)、[NeoRL](https:\u002F\u002Fgithub.com\u002Fpolixir\u002FNeoRL)\n\n- [DreamSmooth：通过奖励平滑改进基于模型的强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=GruDNzQ4ux)\n  - Vint Lee、Pieter Abbeel、Youngwoon Lee\n  - 关键点：学习预测时间平滑的奖励，而非每个时间步的精确奖励\n  - OpenReview评分：6、6、6、5\n  - 实验环境：[robodesk](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Frobodesk)、[hand](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)、[土方工程](https:\u002F\u002Fwww.algoryx.se\u002Fagx-dynamics\u002F)\n\n- [知情POMDP：在基于模型的RL中利用额外信息](https:\u002F\u002Fopenreview.net\u002Fforum?id=5NJzNAXAmx)\n  - Gaspard Lambrechts、Adrien Bolland、Damien Ernst\n  - 关键点：基于DreamerV3的知情世界模型\n  - OpenReview评分：6、6、6、5\n  - 实验环境：[变化的登山路线](https:\u002F\u002Fgithub.com\u002Fmaximilianigl\u002FDVRL\u002Ftree\u002Fmaster)、[deepmind控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)、[pop gym](https:\u002F\u002Fgithub.com\u002Fproroklab\u002Fpopgym)、[闪烁的atari和闪烁的控制](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n\u003C\u002Fdetails>\n\n### NeurIPS 2023\n\n\u003Cdetails open>\n\u003Csummary>展开\u002F折叠\u003C\u002Fsummary>\n\n- [大型语言模型作为大规模任务规划中的常识知识](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Fhash\u002F65a39213d7d0e1eb5d192aa77e77eeb7-Abstract-Conference.html)\n  - 赵子睿、李伟顺、许大卫\n  - 关键词：LLM-MCTS\n  - 实验环境：[VirtualHome]()\n\n- [描述、解释、规划与选择：基于 LLM 的交互式规划赋能开放世界多任务智能体](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Ffile\u002F6b8dfb8c0c12e6fafc6c256cb08a5ca7-Paper-Conference.pdf)\n  - 王子豪、蔡绍飞、陈冠州、刘安吉、马晓健（Shawn）、梁义涛\n  - 关键词：基于 LLM 的交互式规划方法\n  - 实验环境：[Minecraft](https:\u002F\u002Fgithub.com\u002Fminerllabs\u002Fminerl)\n\n- [对决世界模型骨干网络：RNN、Transformer 和 S4](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Ffile\u002Fe6c65eb9b56719c1aa45ff73874de317-Paper-Conference.pdf)\n  - 邓飞、朴俊英、安成镇\n  - 关键词：世界模型骨干网络\n  - 实验环境：[MiniGrid](https:\u002F\u002Fgithub.com\u002Fmaximecb\u002Fgym-minigrid)、[记忆迷宫](https:\u002F\u002Fgithub.com\u002Fjurgisp\u002Fmemory-maze)\n\n- [利用野外视频预训练情境化世界模型用于强化学习](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Ffile\u002F7ce1cbededb4b0d6202847ac1b484ee8-Paper-Conference.pdf)\n  - 吴嘉龙、马浩宇、邓超毅、龙明生\n  - 关键词：情境化世界模型\n  - 实验环境：[CARLA](https:\u002F\u002Fgithub.com\u002Fwayveai\u002Fmile\u002Ftree\u002Fmain\u002Fcarla_gym)、[DeepMind 控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [基于扩散动力学模型的不确定性感知规划的置信预测](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2023\u002Ffile\u002Ffe318a2b6c699808019a456b706cd845-Paper-Conference.pdf)\n  - 孙建凯、蒋一奇、邱佳宁、帕斯·诺布尔、迈克尔·J·科亨德费尔、麦克·施瓦格\n  - 关键词：扩散动力学模型\n  - 实验环境：[D4RL](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)、[Maze2D](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FD4RL\u002Ftree\u002Fmaster\u002Fd4rl)\n\n- [LightZero：面向通用序列决策场景的蒙特卡洛树搜索统一基准](https:\u002F\u002Fopenreview.net\u002Fforum?id=oIUXpBnyjv)\n  - 牛亚哲、蒲源、杨振杰、李雪燕、周彤、任继元、胡帅、李洪胜、刘宇\n  - 关键词：MCTS 风格的基准测试\n  - 
实验环境：[棋类游戏](https:\u002F\u002Fgithub.com\u002Fopendilab\u002FLightZero\u002Ftree\u002Fmain\u002Fzoo\u002Fboard_games)、[Atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)、[MuJoCo](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)、[Gobigger](https:\u002F\u002Fgithub.com\u002Fopendilab\u002FGoBigger)\n\n- [扩散模型是多任务强化学习中有效的规划器和数据合成器](https:\u002F\u002Fopenreview.net\u002Fforum?id=fAdMly4ki5)\n  - 何浩然、白晨嘉、徐康、杨卓然、张维安、王东、赵斌、李学龙\n  - 关键词：基于 GPT 的扩散模型用于规划和数据合成\n  - 实验环境：[Meta-World](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FMetaworld)、[Maze2D](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FD4RL\u002Ftree\u002Fmaster\u002Fd4rl)\n\n- [MoVie：基于视觉模型的策略适应，实现视角泛化](https:\u002F\u002Fopenreview.net\u002Fforum?id=YV1MYtj2AR)\n  - 杨思哲、泽延杰、徐华哲\n  - 关键词：视角泛化、空间自适应编码器\n  - 实验环境：[DeepMind 控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)、[Adroit](https:\u002F\u002Fgithub.com\u002Faravindr93\u002Fmjrl)、[XArm](https:\u002F\u002Fgithub.com\u002Fyangsizhe\u002FMoVie\u002Ftree\u002Fmain\u002Fsrc\u002Fenvs\u002Fxarm_env)\n\n- [基于模型的再参数化策略梯度方法：理论与实用算法](https:\u002F\u002Fopenreview.net\u002Fforum?id=bUgqyyNo8j)\n  - 张申奥、刘博艺、王兆然、赵拓\n  - 关键词：基于模型的再参数化策略梯度方法、平滑正则化\n  - 实验环境：[MuJoCo](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [利用预训练大型语言模型构建并使用世界模型进行基于模型的任务规划](https:\u002F\u002Fopenreview.net\u002Fforum?id=zDbsSscmuj)\n  - 关林、卡尔蒂克·瓦尔米坎、萨拉特·斯里达兰、苏巴拉奥·坎巴姆帕蒂\n  - 关键词：在规划领域定义语言中构建显式的世界（领域）模型\n  - 实验环境：[家用机器人领域]()、[轮胎世界与物流]()\n\n- [RePo：通过正则化后验可预测性实现稳健的基于模型的强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=OIJ3VXDy6s)\n  - 朱春宁、麦克斯·辛乔维茨、西丽·加迪普迪、阿比谢克·古普塔\n  - 关键词：视觉强化学习中的表征鲁棒性\n  - 实验环境：[DeepMind 控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)、[ManiSkill](https:\u002F\u002Fgithub.com\u002Fhaosulab\u002FManiSkill2)\n\n- [基于稀疏神经动力学的模型控制](https:\u002F\u002Fopenreview.net\u002Fforum?id=ymBG2xs9Zf)\n  - 刘子昂、何杰夫、周耿耿、托比亚·马库奇、李飞飞、吴家俊、李云竹\n  - 关键词：网络稀疏化、ReLU 神经动力学的混合整数规划\n  - 实验环境：[Gym, CartPole, Reacher](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [非线性系统中基于模型强化学习的最优探索](https:\u002F\u002Fopenreview.net\u002Fforum?id=pJQu0zpKCS)\n  - 安德鲁·瓦根迈克尔、石冠雅、凯文·杰米森\n  - 关键词：非线性动力系统的最优样本复杂度\n  - 实验环境：[仿射动力系统](https:\u002F\u002Fgithub.com\u002Fajwagen\u002Fnonlinear_sysid_for_control\u002Fblob\u002Fmain\u002Fenvironments.py)\n\n- [状态2解释：基于概念的解释助力智能体学习与用户理解](https:\u002F\u002Fopenreview.net\u002Fforum?id=xGz0wAIJrS)\n  - 德夫利娜·达斯、索尼娅·切尔诺娃、彬·金\n  - 关键词：状态-动作对与基于概念的解释之间的联合嵌入模型\n  - 实验环境：[四子棋]()、[月球着陆器](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [连续时间基于模型强化学习中的高效探索](https:\u002F\u002Fopenreview.net\u002Fforum?id=VkhvDfY2dB)\n  - 莱纳特·特雷文、约纳斯·休博特、巴维亚、弗洛里安·多尔夫勒、安德烈亚斯·克劳斯\n  - 关键词：非线性常微分方程、后悔界、测量选择策略\n  - 实验环境：[系统任务]()\n\n- [通过最大化证据进行动作推理：仅凭观察利用世界模型实现零样本模仿](https:\u002F\u002Fopenreview.net\u002Fforum?id=WjlCQxpuxU)\n  - 张兴远、菲利普·贝克-埃姆克、帕特里克·范德斯马赫特、马克西米利安·卡尔\n  - 关键词：预训练世界模型、仅基于观察的模仿学习\n  - 实验环境：[DeepMind 控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [STORM：高效的基于随机 Transformer 的世界模型用于强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=WxnrX42rnS)\n  - 张伟璞、王刚、孙健、袁业田、黄高\n  - 关键词：分类 VAE、Transformer 结构、DreamerV3\n  - 实验环境：[Atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n\u003C\u002Fdetails>\n\n### ICML 2023\n\n\u003Cdetails open>\n\u003Csummary>切换\u003C\u002Fsummary>\n\n- [从像素中掌握无监督强化学习基准](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.12016)\n  - Sai Rajeswar Mudumba, Pietro Mazzaglia, Tim Verbelen, Alexandre Piche, Bart Dhoedt, Aaron Courville, Alexandre Lacoste\n  - 关键词：无监督预训练、任务感知微调、dyna-mpc\n  - 
实验环境：[URLB基准](https:\u002F\u002Fgithub.com\u002Frll-research\u002Furl_benchmark)、[RWRL套件](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Frealworldrl_suite)\n\n- [用于多模态轨迹优化的重参数化策略学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=5Akrk9Ln6N)\n  - Zhiao Huang, Litian Liang, Zhan Ling, Xuanlin Li, Chuang Gan, Hao Su\n  - 关键词：多模态策略学习、重参数化策略梯度\n  - 实验环境：[Meta-World](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FMetaworld)、[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [活在当下：适应不断变化策略的动力学模型学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.12141)\n  - Xiyao Wang, Wichayaporn Wongkamjan, Ruonan Jia, Furong Huang\n  - 关键词：策略自适应模型学习、权重设计\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [面向无监督基于模型强化学习的可预测MDP抽象](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.03921)\n  - Seohong Park, Sergey Levine\n  - 关键词：可预测MDP抽象、解决\u003Ci>模型利用\u003C\u002Fi>问题\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [探究基于模型学习在探索与迁移中的作用](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.04009)\n  - Jacob C Walker, Eszter Vértes, Yazhe Li, Gabriel Dulac-Arnold, Ankesh Anand, Jessica Hamrick, Theophane Weber\n  - 主要见解：(1) 在无监督探索和\u002F或微调过程中，基于模型的智能体是否具有优势？(2) 基于模型智能体的各个组件对下游任务学习有何贡献？(3) 基于模型的智能体如何应对无监督阶段与下游阶段之间的环境变化？\n  - 实验环境：[Crafter](https:\u002F\u002Fgithub.com\u002Fdanijar\u002Fcrafter)、[RoboDesk](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Frobodesk)、[Meta-World](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FMetaworld)\n\n- [基于模型强化学习中懒惰的优点：统一目标与算法](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.00694)\n  - Anirudh Vemula, Yuda Song, Aarti Singh, J. Bagnell, Sanjiban Choudhury\n  - 关键词：目标不匹配、mbrl框架\n  - 实验环境：[Helicopter、WideTree、线性动力系统、Maze](https:\u002F\u002Fgithub.com\u002Fvvanirudh\u002FLAMPS-MBRL\u002Ftree\u002Fmaster)、[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [强化学习中基于模型泛化的益处](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.02222)\n  - Kenny Young, Aditya Ramesh, Louis Kirsch, Jürgen Schmidhuber\n  - 关键词：经验回放、已学习模型泛化的时机与方式\n  - 实验环境：[ProcMaze、ButtonGrid、PanFlute](https:\u002F\u002Fgithub.com\u002Fkenjyoung\u002FModel_Generalization_Code_supplement\u002Fblob\u002Fmain\u002Fenvironments.py)\n\n- [STEERING：面向基于模型强化学习的施泰因信息导向探索](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.12038)\n  - Souradip Chakraborty, Amrit Bedi, Alec Koppel, Mengdi Wang, Furong Huang, Dinesh Manocha\n  - 关键词：信息导向采样、核化施泰因散度\n  - 实验环境：[DeepSea](https:\u002F\u002Fgithub.com\u002FstratisMarkou\u002Fsample-efficient-bayesian-rl\u002Fblob\u002Fmaster\u002Fcode\u002FEnvironments.py)\n\n- [具有可扩展复合策略梯度估计器基于模型的强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=rDMAJECBM2)\n  - Paavo Parmas, Takuma Seno, Yuma Aoki\n  - 关键词：Dreamer的扩展、全传播计算图\n  - 实验环境：[deepmind控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [具有历史依赖动态上下文的强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=rdOuTlTUMX)\n  - Guy Tennenholtz, Nadav Merlis, Lior Shani, Martin Mladenov, Craig Boutilier\n  - 关键词：非马尔可夫上下文动态、逻辑DCMDPs、理论分析、MuZero的扩展\n  - 实验环境：[MovieLens数据集](https:\u002F\u002Fwww.tensorflow.org\u002Fdatasets\u002Fcatalog\u002Fmovielens)\n\n- [面向基于模型离线强化学习的模型贝尔曼不一致性](https:\u002F\u002Fopenreview.net\u002Fforum?id=rwLwGPdzDD)\n  - Yihao Sun, Jiaji Zhang, Chengxing Jia, Haoxin Lin, Junyin Ye, Yang Yu\n  - 关键词：悲观价值估计、理论分析\n  - 实验环境：[d4rl](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)、[NeoRL](https:\u002F\u002Fgithub.com\u002Fpolixir\u002FNeoRL)\n\n- 
[简化的时序一致性强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=IkhTCX9x5i)\n  - Yi Zhao, Wenshuai Zhao, Rinu Boney, Juho Kannala, Joni Pajarinen\n  - 关键词：表征学习、时序一致性\n  - 实验环境：[deepmind控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [面向基于模型适应的好奇心回放](https:\u002F\u002Fopenreview.net\u002Fforum?id=7p7YakZP2H)\n  - Isaac Kauvar, Chris Doyle, Linqi Zhou, Nick Haber\n  - 关键词：DreamerV3的扩展、好奇心回放、基于计数的回放、对抗性回放\n  - 实验环境：[Crafter](https:\u002F\u002Fgithub.com\u002Fdanijar\u002Fcrafter)、[deepmind控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [关于多动作策略梯度](https:\u002F\u002Fopenreview.net\u002Fforum?id=HKfSTYLJh7)\n  - Michal Nauman, Marek Cygan\n  - 关键词：偏差与方差、理论分析\n  - 实验环境：[deepmind控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [面向深度强化学习的后验采样](https:\u002F\u002Fopenreview.net\u002Fforum?id=ZwjSECgl6p)\n  - Remo Sasso, Michelangelo Conserva, Paulo Rauber\n  - 关键词：后验采样、持续价值网络\n  - 实验环境：[atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [基于计数保守性的基于模型离线强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=T5VlejGx7f)\n  - Byeongchan Kim, Min-hwan Oh\n  - 关键词：计数估计、理论分析\n  - 实验环境：[d4rl](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n\u003C\u002Fdetails>\n\n### ICLR 2023\n\n\u003Cdetails open>\n\u003Csummary>切换\u003C\u002Fsummary>\n\n- [Transformer是样本高效的世界模型](https:\u002F\u002Fopenreview.net\u002Fforum?id=vhFu1Acb0xb)\n  - Vincent Micheli, Eloi Alonso, François Fleuret\n  - 关键词：离散自编码器、基于Transformer的世界模型\n  - OpenReview评分：8, 8, 8, 8\n  - 实验环境：[atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [面向离线策略优化的保守贝叶斯基于模型价值扩展](https:\u002F\u002Fopenreview.net\u002Fforum?id=dNqxZgyjcYA)\n  - Jihwan Jeong, Xiaoyu Wang, Michael Gimelfarb, Hyunwoo Kim, Baher Abdulhai, Scott Sanner\n  - 关键词：基于模型的离线学习、贝叶斯后验价值估计\n  - OpenReview评分：8, 8, 6, 6\n  - 实验环境：[d4rl](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [用户交互式离线强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=a4COps0uokg)\n  - Phillip Swazinna, Steffen Udluft, Thomas Runkler\n  - 关键词：允许用户在训练完成后调整策略行为\n  - OpenReview评分：10, 8, 6, 3\n  - 实验环境：[2D世界]()、[工业基准](https:\u002F\u002Fgithub.com\u002Fsiemens\u002Findustrialbenchmark\u002Ftree\u002Foffline_datasets\u002Fdatasets)\n\n- [CLARE：用于离线逆强化学习的保守的基于模型奖励学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=5aT4ganOd98)\n  - Sheng Yue, Guanbo Wang, Wei Shao, Zhaofeng Zhang, Sen Lin, Ju Ren, Junshan Zhang\n  - 关键词：离线逆强化学习，奖励外推误差\n  - OpenReview评分：8, 8, 6, 6\n  - 实验环境：[d4rl](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [基于学习模型的高效离线策略优化](https:\u002F\u002Fopenreview.net\u002Fforum?id=Yt-yM-JbYFO)\n  - Zichen Liu, Siyi Li, Wee Sun Lee, Shuicheng YAN, Zhongwen Xu\n  - 关键词：离线强化学习，MuZero Unplugged分析，单步前瞻策略改进\n  - OpenReview评分：8, 6, 5\n  - 实验环境：[atari数据集](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdeepmind-research\u002Ftree\u002Fmaster\u002Frl_unplugged)\n\n- [在紧凑的潜在动作空间中进行高效规划](https:\u002F\u002Fopenreview.net\u002Fforum?id=cA77NrVEuqn)\n  - zhengyao jiang, Tianjun Zhang, Michael Janner, Yueying Li, Tim Rocktäschel, Edward Grefenstette, Yuandong Tian\n  - 关键词：使用VQ-VAE进行规划\n  - OpenReview评分：6, 6, 6, 6\n  - 实验环境：[d4rl数据集](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [模型集成真的必要吗？通过带有Lipschitz正则化的值函数的单个模型实现基于模型的强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=hNyJBk3CwR)\n  - Ruijie Zheng, Xiyao Wang, Huazhe Xu, Furong Huang\n  - 关键词：Lipschitz正则化\n  - OpenReview评分：8, 8, 6, 6\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- 
[MoDem：利用示范加速视觉基于模型的强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=JdTnc9gjVfJ)\n  - Nicklas Hansen, Yixin Lin, Hao Su, Xiaolong Wang, Vikash Kumar, Aravind Rajeswaran\n  - 关键词：三个阶段——策略预训练、目标导向探索、交互式学习\n  - OpenReview评分：8, 6, 6, 6\n  - 实验环境：[adroit](https:\u002F\u002Fgithub.com\u002Faravindr93\u002Fmjrl)、[meta-world](https:\u002F\u002Fgithub.com\u002Frlworkgroup\u002Fmetaworld)、[deepmind控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [简化基于模型的强化学习：用一个目标同时学习表征、潜在空间模型和策略](https:\u002F\u002Fopenreview.net\u002Fforum?id=MQcmfgRxf7a)\n  - Raj Ghugare, Homanga Bharadhwaj, Benjamin Eysenbach, Sergey Levine, Ruslan Salakhutdinov\n  - 关键词：对齐的潜在模型\n  - OpenReview评分：8, 6, 6, 6, 6\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n\u003C!-- - [基于模型的泛化在强化学习中的优势](https:\u002F\u002Fopenreview.net\u002Fforum?id=w1w4dGJ4qV)\n  - Kenny Young, Aditya Ramesh, Louis Kirsch, Jürgen Schmidhuber\n  - 关键词：模型泛化被认为比值函数泛化更有用\n  - OpenReview评分：8, 6, 5, 5\n  - 实验环境：[ProcMaze, ButtonGrid, PanFlute]() -->\n\n- [基于模型的强化学习中价值扩展方法的边际收益递减](https:\u002F\u002Fopenreview.net\u002Fforum?id=H4Ncs5jhTCu)\n  - Daniel Palenicek, Michael Lutter, Joao Carvalho, Jan Peters\n  - 关键词：更长的规划时域在样本效率方面带来递减的回报\n  - OpenReview评分：8, 6, 6, 6\n  - 实验环境：[brax](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fbrax)\n\n- [为探索而规划目标](https:\u002F\u002Fopenreview.net\u002Fforum?id=6qeBuZSo7Pr)\n  - Edward S. Hu, Richard Chang, Oleh Rybkin, Dinesh Jayaraman\n  - 关键词：基于采样的规划，为每个训练回合设定目标以直接优化内在探索奖励\n  - OpenReview评分：8, 8, 8, 8, 6\n  - 实验环境：[point maze](), [walker](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control), [ant maze, 3-block stack](https:\u002F\u002Fgithub.com\u002Fspitis\u002Fmrl\u002Ftree\u002Fmaster\u002Fenvs)\n\n- [通过在连续控制中直接规划做出更好决策](https:\u002F\u002Fopenreview.net\u002Fforum?id=r8Mu7idxyF)\n  - Jinhua Zhu, Yue Wang, Lijun Wu, Tao Qin, Wengang Zhou, Tie-Yan Liu, Houqiang Li\n  - 关键词：深度可微分动态规划规划器\n  - OpenReview评分：8, 8, 8, 6\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [用于强化学习的潜在变量表示](https:\u002F\u002Fopenreview.net\u002Fforum?id=mQpmZVzXK1h)\n  - Tongzheng Ren, Chenjun Xiao, Tianjun Zhang, Na Li, Zhaoran Wang, sujay sanghavi, Dale Schuurmans, Bo Dai\n  - 关键词：变分学习，表征学习\n  - OpenReview评分：8, 6, 6, 3\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)、[deepmind控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [SpeedyZero：用有限的数据和时间掌握Atari游戏](https:\u002F\u002Fopenreview.net\u002Fforum?id=Mg5CLXZgvLJ)\n  - Yixuan Mei, Jiaxuan Gao, Weirui Ye, Shaohuai Liu, Yang Gao, Yi Wu\n  - 关键词：分布式基于模型的强化学习，加速EfficientZero\n  - OpenReview评分：6, 6, 5\n  - 实验环境：[atari 100k](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [基于Transformer的世界模型只需10万次交互即可](https:\u002F\u002Fopenreview.net\u002Fforum?id=TdBaDGCpjly)\n  - Jan Robine, Marc Höftmann, Tobias Uelwer, Stefan Harmeling\n  - 关键词：自回归世界模型，Transformer-XL，平衡交叉熵损失，平衡数据集采样\n  - OpenReview评分：8, 6, 6, 6\n  - 实验环境：[atari 100k](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [关于基于模型的强化学习跨任务迁移的可行性](https:\u002F\u002Fopenreview.net\u002Fforum?id=KB1sc5pNKFv)\n  - Yifan Xu, Nicklas Hansen, Zirui Wang, Yung-Chieh Chan, Hao Su, Zhuowen Tu\n  - 关键词：离线多任务预训练，在线微调\n  - OpenReview评分：6, 6, 6, 6\n  - 实验环境：[atari 100k](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [通过观看纯视频，在数据有限的情况下成为熟练玩家](https:\u002F\u002Fopenreview.net\u002Fforum?id=Sy-o2N0hF4f)\n  - Weirui Ye, Yunsheng Zhang, Pieter Abbeel, Yang Gao\n  - 关键词：无监督预训练，再用下游任务进行微调\n  - OpenReview评分：8, 6, 
6, 5\n  - 实验环境：[atari 100k](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [EUCLID：迈向高效的多选动力学模型无监督强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=xQAjSr64PTc)\n  - Yifu Yuan, Jianye HAO, Fei Ni, Yao Mu, YAN ZHENG, Yujing Hu, Jinyi Liu, Yingfeng Chen, Changjie Fan\n  - 关键词：联合预训练多头动力学模型和无监督探索策略，再针对下游任务进行微调\n  - OpenReview评分：6, 6, 6, 6\n  - 实验环境：[URLB基准测试](https:\u002F\u002Fgithub.com\u002Frll-research\u002Furl_benchmark)\n\n- [编舞者：在想象中学习与适应技能](https:\u002F\u002Fopenreview.net\u002Fforum?id=PhkWyijGi5b)\n  - Pietro Mazzaglia, Tim Verbelen, Bart Dhoedt, Alexandre Lacoste, Sai Rajeswar\n  - 关键词：世界模型、技能发现、技能学习、技能适应\n  - OpenReview评分：8, 8, 6, 6\n  - 实验环境：[deepmind控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)、[Meta-World](https:\u002F\u002Fgithub.com\u002FFarama-Foundation\u002FMetaworld)\n\n\u003C\u002Fdetails>\n\n\n\n### NeurIPS 2022\n\n\u003Cdetails open>\n\u003Csummary>切换\u003C\u002Fsummary>\n\n- [用于离线无限宽度模型优化的双向学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=_j8yVIyp27Q)\n  - Can Chen, Yingxue Zhang, Jie Fu, Xue Liu, Mark Coates\n  - 关键词：基于模型，离线\n  - OpenReview评分：7, 6, 5\n  - 实验环境：[design-bench](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fdesign-bench)\n\n- [用于交替进行离线模型训练和策略学习的统一框架](https:\u002F\u002Fopenreview.net\u002Fforum?id=5yjM1sQ1uKZ)\n  - 杨申涛、张书健、冯一豪、周明远\n  - 关键词：基于模型、离线、边际重要性权重\n  - OpenReview评分：7, 6, 6, 5\n  - 实验环境：[d4rl数据集](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [基于悲观调节动力学信念的基于模型的离线强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=oDWyVsHBzNT)\n  - 郭凯阳、邵云峰、耿彦辉\n  - 关键词：基于模型、离线\n  - OpenReview评分：8, 8, 7, 7\n  - 实验环境：[d4rl数据集](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [在信任状态之前先双重检查：基于双向建模的自信感知离线强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=3e3IQMLDSLP)\n  - 吕家飞、李修、陆宗庆\n  - 关键词：双重检查机制、双向建模、离线强化学习\n  - OpenReview评分：7, 6, 6\n  - 实验环境：[d4rl数据集](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [基于模型的对手建模](https:\u002F\u002Farxiv.org\u002Fabs\u002F2108.01843)\n  - 于晓鹏、蒋杰川、张万鹏、姜浩斌、陆宗庆\n  - 关键词：多智能体、基于模型\n  - OpenReview评分：7, 6, 4, 3\n  - 实验环境：[mpe](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmultiagent-particle-envs)、[google research football](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ffootball)\n\n- [将预见与想象相结合：基于模型的合作式多智能体强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.09418)\n  - 徐志伟、李大鹏、张彬、詹源、白云鹏、范国梁\n  - 关键词：多智能体、基于模型\n  - OpenReview评分：6, 5\n  - 实验环境：[星际争霸II](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fpysc2)、[Google Research Football](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ffootball)、[多智能体离散MuJoCo](https:\u002F\u002Fgithub.com\u002Fschroederdewitt\u002Fmultiagent_mujoco)\n\n- [MoCoDA：基于模型的反事实数据增强](https:\u002F\u002Fopenreview.net\u002Fforum?id=w6tBOjPCrIO)\n  - 西尔维乌·皮蒂斯、埃利奥特·克雷格、阿贾伊·曼德尔卡、阿尼梅什·加格\n  - 关键词：数据增强框架、离线强化学习\n  - OpenReview评分：7, 7, 7, 6\n  - 实验环境：[2D导航](https:\u002F\u002Fgithub.com\u002Fspitis\u002Fmocoda\u002Fblob\u002Fmain\u002Faugment_offline_toy.py#L45)、[Hook-Sweep](https:\u002F\u002Fgithub.com\u002Fspitis\u002Fmrl\u002Fblob\u002Fmaster\u002Fenvs\u002Fcustomfetch\u002Fcustom_fetch.py#L1699)\n\n- [何时更新你的模型：约束下的基于模型的强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=9a1oV7UunyP)\n  - 季天英、罗宇、孙富春、景明轩、何丰熙、黄文兵\n  - 关键词：事件触发机制、约束模型转移下界优化\n  - OpenReview评分：6, 6, 5, 5\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [通过约束型近端策略优化算法实现基于模型的安全深度强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=hYa_lseXK8)\n  - 阿希什·贾扬特、沙拉布·巴特纳加尔\n  - 关键词：约束强化学习、基于模型\n  - OpenReview评分：7, 6, 5, 5\n  - 实验环境：[safety 
gym](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fsafety-gym)\n\n- [学习攻击联邦学习：基于模型的强化学习攻击框架](https:\u002F\u002Fopenreview.net\u002Fforum?id=4OHRr7gmhd4)\n  - 李恒格、孙晓林、郑子涵\n  - 关键词：攻击与防御、联邦学习、基于模型\n  - OpenReview评分：6, 6, 6, 5\n  - 实验环境：MNIST、FashionMNIST、EMNIST、CIFAR-10以及合成数据集\n\n- [基于模型的城市驾驶模仿学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=Zk1SbbdZwS)\n  - 安东尼·胡、吉安卢卡·科拉多、尼古拉斯·格里菲斯、扎卡里·穆雷斯、科丽娜·古劳、哈德森·叶、亚历克斯·肯德尔、罗伯托·西波拉、杰米·肖顿\n  - 关键词：基于模型、模仿学习、自动驾驶\n  - OpenReview评分：7, 6, 6\n  - 实验环境：[CARLA](https:\u002F\u002Fgithub.com\u002Fwayveai\u002Fmile\u002Ftree\u002Fmain\u002Fcarla_gym)\n\n- [通过不变表示学习实现数据驱动的基于模型优化](https:\u002F\u002Fopenreview.net\u002Fforum?id=gKe_A-DxzkH)\n  - 韩琪、苏毅、阿维拉尔·库马尔、谢尔盖·列维涅\n  - 关键词：领域适应、不变目标模型、表示学习（与基于模型的强化学习无关）\n  - OpenReview评分：7, 6, 6, 5, 5\n  - 实验环境：[design-bench](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fdesign-bench)\n\n- [基于贝叶斯探索的终身强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=6I3zJn9Slsb)\n  - 傅浩天、于尚群、迈克尔·利特曼、乔治·科尼达里斯\n  - 关键词：终身强化学习、变分贝叶斯\n  - OpenReview评分：7, 6, 6\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)、[meta-world](https:\u002F\u002Fgithub.com\u002Frlworkgroup\u002Fmetaworld)\n\n- [计划以预测：为基于模型的强化学习学习一种预见不确定性的模型](https:\u002F\u002Fopenreview.net\u002Fforum?id=L9YayWPcHA_)\n  - 吴子凡、余超、陈晨、郝建业、卓汉兹·汉奎\n  - 关键词：将模型滚动过程视为一个序列决策问题\n  - OpenReview评分：7, 7, 6, 6\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)、[d4rl](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [基于模型强化学习下界的联合模型-策略优化](https:\u002F\u002Fopenreview.net\u002Fforum?id=LYfFj-Vk6lt)\n  - 本杰明·艾森巴赫、亚历山大·哈扎茨基、谢尔盖·列维涅、鲁斯·萨拉胡丁诺夫\n  - 关键词：基于模型强化学习的统一目标函数\n  - OpenReview评分：8, 8, 7, 6\n  - 实验环境：[gridworld](https:\u002F\u002Fgithub.com\u002Fdennybritz\u002Freinforcement-learning\u002Fblob\u002Fmaster\u002Flib\u002Fenvs\u002Fgridworld.py)、[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)、[ROBEL操作](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Frobel)\n\n- [RAMBO-RL：鲁棒对抗性基于模型的离线强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=nrksGSRT7kX)\n  - 马克·里格特、布鲁诺·拉塞尔达、尼克·霍斯\n  - 关键词：离线强化学习、基于模型的强化学习、双人游戏、对抗性模型训练\n  - OpenReview评分：6, 6, 6, 4\n  - 实验环境：[d4rl](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [高效基于模型强化学习的保守双策略优化](https:\u002F\u002Fopenreview.net\u002Fforum?id=xL7B5axplIe)\n  - 张绍昂\n  - 关键词：后验采样强化学习、参考更新、约束性保守更新\n  - OpenReview评分：7, 7, 5, 5\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)、[N-Chain MDPs](https:\u002F\u002Fgithub.com\u002FstratisMarkou\u002Fsample-efficient-bayesian-rl\u002Fblob\u002Fmaster\u002Fcode\u002FEnvironments.py)\n\n- [贝叶斯乐观优化：基于模型强化学习的乐观探索](https:\u002F\u002Fopenreview.net\u002Fforum?id=GdHVClGh9N)\n  - 武晨阳、李天赐、张宗章、于洋\n  - 关键词：面对不确定性时的乐观态度(OFU)、BOO后悔\n  - OpenReview评分：6, 6, 5\n  - 实验环境：[RiverSwim、Chain、随机MDPs]()\n\n- [基于模型强化学习的乐观后验采样：结构条件与样本复杂度](https:\u002F\u002Fopenreview.net\u002Fforum?id=bEMrmaw8gOB)\n  - 阿莱克·阿加瓦尔、张彤\n  - 关键词：后验采样强化学习、贝尔曼误差解耦框架\n  - OpenReview评分：7, 7, 7, 6\n  - 实验环境：无\n\n- [基于指数族模型的强化学习：分数匹配方法](https:\u002F\u002Fopenreview.net\u002Fforum?id=G1uywu6vNZe)\n  - Gene Li、Junbo Li、Nathan Srebro、Zhaoran Wang、Zhuoran Yang\n  - 关键词：乐观模型、分数匹配\n  - OpenReview评分：7, 7, 6\n  - 实验环境：无\n\n- [从像素出发的深度层次化规划](https:\u002F\u002Fopenreview.net\u002Fforum?id=wZk69kjy9_d)\n  - Danijar Hafner、Kuang-Huei Lee、Ian Fischer、Pieter Abbeel\n  - 关键词：层次化强化学习、长时程任务、稀疏奖励任务\n  - OpenReview评分：6, 6, 5\n  - 实验环境：[atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)、[deepmind control 
suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)、[deepmind lab](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Flab)、[crafter](https:\u002F\u002Fgithub.com\u002Fdanijar\u002Fcrafter)\n\n- [连续 MDP 同态与同态策略梯度](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.07364)\n  - Sahand Rezaei-Shoshtari、Rosie Zhao、Prakash Panangaden、David Meger、Doina Precup\n  - 关键词：同态策略梯度、连续 MDP 同态、松弛双模拟损失\n  - OpenReview评分：7, 7, 7\n  - 实验环境：[deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n\u003C\u002Fdetails>\n\n\n\n### ICML 2022\n\n\u003Cdetails open>\n\u003Csummary>展开\u003C\u002Fsummary>\n\n- [DreamerPro：利用原型表示实现无重建的基于模型强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.14565)\n  - Fei Deng、Ingook Jang、Sungjin Ahn\n  - 关键词：dreamer、原型\n  - 实验环境：[deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [去噪 MDP：学习比真实世界更好的世界模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.15477.pdf)\n  - Tongzhou Wang、Simon Du、Antonio Torralba、Phillip Isola、Amy Zhang、Yuandong Tian\n  - 关键词：表征学习、去噪模型\n  - 实验环境：[deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)、[RoboDesk](https:\u002F\u002Fgithub.com\u002FSsnL\u002Frobodesk)\n\n- [利用图结构代理模型和摊销策略搜索的基于模型元强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.08291.pdf)\n  - Qi Wang、Herke van Hoof\n  - 关键词：图结构代理模型、元训练\n  - 实验环境：[atari、mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [迈向自适应的基于模型强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.11464.pdf)\n  - Yi Wan、Ali Rahimi-Kalahroudi、Janarthanan Rajendran、Ida Momennejad、Sarath Chandar、Harm van Seijen\n  - 关键词：局部变化适应\n  - 实验环境：[GridWorldLoCA、ReacherLoCA、MountaincarLoCA](https:\u002F\u002Fgithub.com\u002Fchandar-lab\u002FLoCA2)\n\n- [通过乐观均衡计算实现高效的基于模型多智能体强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.07322.pdf)\n  - Pier Giuseppe Sessa、Maryam Kamgarpour、Andreas Krause\n  - 关键词：基于模型的多智能体、置信区间\n  - 实验环境：[SMART](https:\u002F\u002Fgithub.com\u002Fhuawei-noah\u002FSMARTS)\n\n- [通过正则化基于模型策略的平稳分布来稳定离线强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07166.pdf)\n  - Shentao Yang、Yihao Feng、Shujian Zhang、Mingyuan Zhou\n  - 关键词：离线强化学习、基于模型强化学习、平稳分布正则化\n  - 实验环境：[d4rl](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [Design-Bench：数据驱动的离线基于模型优化基准测试](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.08450.pdf)\n  - Brandon Trabucco、Xinyang Geng、Aviral Kumar、Sergey Levine\n  - 关键词：基准测试、离线 MBO\n  - 实验环境：[Design-Bench 基准任务](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fdesign-bench)\n\n- [用于模型预测控制的时间差分学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04955.pdf)\n  - Nicklas Hansen、Hao Su、Xiaolong Wang\n  - 关键词：TD 学习、MPC\n  - 实验环境：[deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)、[Meta-World](https:\u002F\u002Fgithub.com\u002Frlworkgroup\u002Fmetaworld)\n\n\u003C\u002Fdetails>\n\n### ICLR 2022\n\n\u003Cdetails open>\n\u003Csummary>展开\u002F收起\u003C\u002Fsummary>\n\n- [重新审视离线基于模型强化学习中的设计选择](https:\u002F\u002Fopenreview.net\u002Fforum?id=zz9hXVhf40)\n  - Cong Lu, Philip Ball, Jack Parker-Holder, Michael Osborne, Stephen J. 
Roberts\n  - 关键词：基于模型的离线、不确定性量化\n  - OpenReview评分：8, 8, 6, 6, 6\n  - 实验环境：[d4rl 数据集](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [价值梯度加权的基于模型强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=4-D6CZkRXxI)\n  - Claas A Voelcker, Victor Liao, Animesh Garg, Amir-massoud Farahmand\n  - 关键词：价值梯度加权的模型损失\n  - OpenReview评分：8, 8, 6, 6\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [使用学习模型在随机环境中进行规划](https:\u002F\u002Fopenreview.net\u002Fforum?id=X6D9bAHhBQ1)\n  - Ioannis Antonoglou, Julian Schrittwieser, Sherjil Ozair, Thomas K Hubert, David Silver\n  - 关键词：MCTS、随机MuZero\n  - OpenReview评分：10, 8, 8, 5\n  - 实验环境：2048 游戏、西洋双陆棋、围棋\n\n- [通过 Gumbel 规划改进策略](https:\u002F\u002Fopenreview.net\u002Fforum?id=bERaNdoegnO)\n  - Ivo Danihelka, Arthur Guez, Julian Schrittwieser, David Silver\n  - 关键词：Gumbel AlphaZero、Gumbel MuZero\n  - OpenReview评分：8, 8, 8, 6\n  - 实验环境：围棋、国际象棋、[atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [带有正则化的基于模型的离线元强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=EBn0uInJZWh)\n  - Sen Lin, Jialin Wan, Tengyu Xu, Yingbin Liang, Junshan Zhang\n  - 关键词：基于模型的离线元强化学习\n  - OpenReview评分：8, 6, 6, 6\n  - 实验环境：[d4rl 数据集](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [强化学习中的在线策略模型误差](https:\u002F\u002Fopenreview.net\u002Fforum?id=81e1aeOt-sd)\n  - Lukas Froehlich, Maksym Lefarov, Melanie Zeilinger, Felix Berkenkamp\n  - 关键词：模型误差、在线策略修正\n  - OpenReview评分：8, 6, 6, 5\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)、[pybullet](https:\u002F\u002Fgithub.com\u002Fbenelot\u002Fpybullet-gym)\n\n- [一种面向基于模型强化学习中无监督动态泛化的关系干预方法](https:\u002F\u002Fopenreview.net\u002Fforum?id=YRq0ZUnzKoZ)\n  - Jiaxian Guo, Mingming Gong, Dacheng Tao\n  - 关键词：关系干预、动态泛化\n  - OpenReview评分：8, 8, 6, 6\n  - 实验环境：[单摆](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)、[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [通过赋能进行信息优先级排序的视觉基于模型强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=DfUjyyRW90)\n  - Homanga Bharadhwaj, Mohammad Babaeizadeh, Dumitru Erhan, Sergey Levine\n  - 关键词：互信息、视觉基于模型强化学习\n  - OpenReview评分：8, 8, 8, 6\n  - 实验环境：[deepmind 控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)、[Kinetics 数据集](https:\u002F\u002Fgithub.com\u002Fcvdfoundation\u002Fkinetics-dataset)\n\n- [通过基于模型的正则化实现观测特征空间之间的迁移强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=7KdAoOsI81C)\n  - Yanchao Sun, Ruijie Zheng, Xiyao Wang, Andrew E Cohen, Furong Huang\n  - 关键词：潜在动力学模型、迁移强化学习\n  - OpenReview评分：8, 6, 5, 5\n  - 实验环境：[CartPole、Acrobot 和 Cheetah-Run](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)、[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)、[3DBall](https:\u002F\u002Fgithub.com\u002FUnity-Technologies\u002Fml-agents)\n\n- [通过回溯学习强化学习中的状态表示](https:\u002F\u002Fopenreview.net\u002Fforum?id=CLpxpXqqBV)\n  - Changmin Yu, Dong Li, Jianye HAO, Jun Wang, Neil Burgess\n  - 关键词：表示学习、通过回溯学习\n  - OpenReview评分：8, 6, 5, 3\n  - 实验环境：[deepmind 控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [模型增强的优先经验回放](https:\u002F\u002Fopenreview.net\u002Fforum?id=WuEiafqdy9H)\n  - Youngmin Oh, Jinwoo Shin, Eunho Yang, Sung Ju Hwang\n  - 关键词：优先经验回放、mbrl\n  - OpenReview评分：8, 8, 6, 5\n  - 实验环境：[pybullet](https:\u002F\u002Fgithub.com\u002Fbenelot\u002Fpybullet-gym)\n\n- [评估用于连续控制的基于模型的规划及规划器摊销](https:\u002F\u002Fopenreview.net\u002Fforum?id=SS8F6tFX3-)\n  - Arunkumar Byravan, Leonard Hasenclever, Piotr Trochim, Mehdi Mirza, Alessandro Davide Ialongo, Yuval Tassa, Jost Tobias 
Springenberg, Abbas Abdolmaleki, Nicolas Heess, Josh Merel, Martin Riedmiller\n  - 关键词：模型预测控制\n  - OpenReview评分：8, 6, 6, 6\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [在通过模型反向传播进行策略优化时，梯度信息至关重要](https:\u002F\u002Fopenreview.net\u002Fforum?id=rzvOQrnclO0)\n  - Chongchong Li, Yue Wang, Wei Chen, Yuting Liu, Zhi-Ming Ma, Tie-Yan Liu\n  - 关键词：双模型方法、分析模型误差和策略梯度\n  - OpenReview评分：8, 8, 6, 6\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [基于模型的离线强化学习中的帕累托策略池](https:\u002F\u002Fopenreview.net\u002Fforum?id=OqcZu8JIIzS)\n  - Yijun Yang, Jing Jiang, Tianyi Zhou, Jie Ma, Yuhui Shi\n  - 关键词：基于模型的离线、模型回报与不确定性权衡\n  - OpenReview评分：8, 8, 6, 5\n  - 实验环境：[d4rl 数据集](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [部分覆盖下的悲观主义基于模型的离线强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=tyrJsbKAe6)\n  - Masatoshi Uehara, Wen Sun\n  - 关键词：基于模型的离线理论、PAC界\n  - OpenReview评分：8, 6, 6, 5\n  - 实验环境：无\n\n- [认识自我：通过机器人感知实现可迁移的视觉控制策略](https:\u002F\u002Fopenreview.net\u002Fforum?id=o0ehFykKVtr)\n  - Edward S. Hu, Kun Huang, Oleh Rybkin, Dinesh Jayaraman\n  - 关键词：可在新机器人上迁移的世界模型\n  - OpenReview评分：8, 6, 6, 5\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)、WidowX 和 Franka Panda 机器人\n\n\u003C\u002Fdetails>\n\n### NeurIPS 2021\n\n\u003Cdetails open>\n\u003Csummary>展开\u002F折叠\u003C\u002Fsummary>\n\n- [关于基于模型的强化学习的有效调度](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.08550)\n  - 作者：Hang Lai, Jian Shen, Weinan Zhang, Yimin Huang, Xing Zhang, Ruiming Tang, Yong Yu, Zhenguo Li\n  - 关键词：MBPO的扩展，超控制器学习\n  - OpenReview评分：8, 6, 6\n  - 实验环境：[MuJoCo](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py), [PyBullet](https:\u002F\u002Fgithub.com\u002Fbenelot\u002Fpybullet-gym)\n\n- [COMBO：保守的离线基于模型策略优化](https:\u002F\u002Fopenreview.net\u002Fpdf?id=dUEpGV2mhf)\n  - 作者：Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, Chelsea Finn\n  - 关键词：离线强化学习，基于模型的强化学习，深度强化学习\n  - OpenReview评分：6, 7, 6, 8\n  - 实验环境：[D4RL数据集](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [通过想象近期未来实现安全强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2202.07789)\n  - 作者：Garrett Thomas, Yuping Luo, Tengyu Ma\n  - 关键词：安全RL，奖励惩罚，基于模型rollout的理论\n  - OpenReview评分：8, 6, 6\n  - 实验环境：[MuJoCo](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [通过基于派生记忆的想象进行基于模型的强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=jeATherHHGj)\n  - 作者：Yao Mu, Yuzheng Zhuang, Bin Wang, Guangxiang Zhu, Wulong Liu, Jianyu Chen, Ping Luo, Shengbo Eben Li, Chongjie Zhang, Jianye HAO\n  - 关键词：Dreamer的扩展，预测可靠性权重\n  - OpenReview评分：6, 6, 6, 6\n  - 实验环境：[DeepMind控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [MobILE：仅从观察中进行基于模型的模仿学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.10769)\n  - 作者：Rahul Kidambi, Jonathan Chang, Wen Sun\n  - 关键词：仅从观察中进行模仿学习，MBRL\n  - OpenReview评分：6, 6, 6, 4\n  - 实验环境：[CartPole](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym), [MuJoCo](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [基于模型的情景记忆诱导动态混合控制](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.02104)\n  - 作者：Hung Le, Thommen Karimpanal George, Majid Abdolshah, Truyen Tran, Svetha Venkatesh\n  - 关键词：基于模型，情景控制\n  - OpenReview评分：7, 7, 6, 6\n  - 实验环境：[2D迷宫导航](https:\u002F\u002Fgithub.com\u002FMattChanTK\u002Fgym-maze), [CartPole、MountainCar和LunarLander](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym), [Atari](https:\u002F\u002Fgym.openai.com\u002Fenvs\u002Fatari), 
[3D导航：Gym-MiniWorld](https:\u002F\u002Fgithub.com\u002Fmaximecb\u002Fgym-miniworld)\n\n- [受意识启发的基于模型强化学习规划智能体](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.02097)\n  - 作者：Mingde Zhao, Zhen Liu, Sitao Luan, Shuyuan Zhang, Doina Precup, Yoshua Bengio\n  - 关键词：MBRL，集合表示\n  - OpenReview评分：7, 7, 7, 6\n  - 实验环境：[MiniGrid-BabyAI框架](https:\u002F\u002Fgithub.com\u002Fmaximecb\u002Fgym-minigrid)\n\n- [利用有限数据掌握Atari游戏](https:\u002F\u002Fopenreview.net\u002Fforum?id=OKrNPg3xR3T)\n  - 作者：Weirui Ye, Shaohuai Liu, Thanard Kurutach, Pieter Abbeel, Yang Gao\n  - 关键词：MuZero，自监督一致性损失\n  - OpenReview评分：7, 7, 7, 5\n  - 实验环境：[Atari 100k](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym), [DeepMind控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [通过使用学习到的模型进行规划的在线和离线强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=HKtsGW-lNbw)\n  - 作者：Julian Schrittwieser, Thomas K Hubert, Amol Mandhane, Mohammadamin Barekatain, Ioannis Antonoglou, David Silver\n  - 关键词：MuZero，重新分析，离线\n  - OpenReview评分：8, 8, 7, 6\n  - 实验环境：[Atari数据集、DeepMind控制套件数据集](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdeepmind-research\u002Ftree\u002Fmaster\u002Frl_unplugged)\n\n- [自洽模型与价值](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.12840)\n  - 作者：Gregory Farquhar, Kate Baumli, Zita Marinho, Angelos Filos, Matteo Hessel, Hado van Hasselt, David Silver\n  - 关键词：新的模型学习方式\n  - OpenReview评分：7, 7, 7, 6\n  - 实验环境：表格MDP、Sokoban、[Atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [适当的价值等价性](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.10316)\n  - 作者：Christopher Grimm, Andre Barreto, Gregory Farquhar, David Silver, Satinder Singh\n  - 关键词：价值等价性，基于价值的规划，MuZero\n  - OpenReview评分：8, 7, 7, 6\n  - 实验环境：[四房间](https:\u002F\u002Fgithub.com\u002Fmaximecb\u002Fgym-minigrid), [Atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [MOPO：基于模型的离线策略优化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.13239)\n  - 作者：Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, Tengyu Ma\n  - 关键词：基于模型，离线\n  - OpenReview暂无评分\n  - 实验环境：[D4RL数据集](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)，halfcheetah-jump和ant-angle\n\n- [RoMA：用于离线基于模型优化的鲁棒模型适应](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.14188)\n  - 作者：Sihyun Yu, Sungsoo Ahn, Le Song, Jinwoo Shin\n  - 关键词：基于模型，离线\n  - OpenReview评分：7, 6, 6\n  - 实验环境：[Design-Bench](https:\u002F\u002Fgithub.com\u002Fbrandontrabucco\u002Fdesign-bench)\n\n- [基于反向模型想象的离线强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.00188)\n  - 作者：Jianhao Wang, Wenzhe Li, Haozhe Jiang, Guangxiang Zhu, Siyuan Li, Chongjie Zhang\n  - 关键词：基于模型，离线\n  - OpenReview评分：7, 6, 6, 5\n  - 实验环境：[D4RL数据集](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [离线基于模型的可适应策略学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=lrdXc17jm6)\n  - 作者：Xiong-Hui Chen, Yang Yu, Qingyang Li, Fan-Ming Luo, Zhiwei Tony Qin, Shang Wenjie, Jieping Ye\n  - 关键词：基于模型，离线\n  - OpenReview评分：6, 6, 6, 4\n  - 实验环境：[D4RL数据集](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [用于离线基于模型强化学习的加权模型估计](https:\u002F\u002Fopenreview.net\u002Fpdf?id=zdC5eXljMPy)\n  - 作者：Toru Hishinuma, Kei Senda\n  - 关键词：基于模型，离线，off-policy评估\n  - OpenReview评分：7, 6, 6, 6\n  - 实验环境：单摆、[D4RL数据集](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [具有线性函数近似的免奖励基于模型强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.06394)\n  - 作者：Weitong Zhang, Dongruo Zhou, Quanquan Gu\n  - 关键词：学习理论，基于模型的免奖励RL，线性函数近似\n  - OpenReview评分：6, 6, 5, 5\n  - 实验环境：无\n\n- 
[可证明的基于模型的非线性多臂老虎机与强化学习：摒弃乐观主义，拥抱虚拟曲率](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.04168)\n  - 作者：Kefan Dong, Jiaqi Yang, Tengyu Ma\n  - 关键词：学习理论，基于模型的多臂老虎机RL，非线性函数近似\n  - OpenReview评分：7, 7, 7, 6\n  - 实验环境：无\n\n- [通过世界模型发现并实现目标](https:\u002F\u002Fopenreview.net\u002Fforum?id=6vWuYzkp8d)\n  - 作者：Russell Mendonca, Oleh Rybkin, Kostas Daniilidis, Danijar Hafner, Deepak Pathak\n  - 关键词：无监督目标达成，目标条件强化学习\n  - OpenReview评分：6, 6, 6, 6, 6\n  - 实验环境：[Walker、四足机器人、箱子、厨房](https:\u002F\u002Fgithub.com\u002Forybkin\u002Flexa-benchmark)\n\n\u003C\u002Fdetails>\n\n### ICLR 2021\n\n\u003Cdetails open>\n\u003Csummary>展开\u002F折叠\u003C\u002Fsummary>\n\n- [通过基于模型的离线优化实现高效部署的强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.03647)\n  - 作者：Tatsuya Matsushima、Hiroki Furuta、Yutaka Matsuo、Ofir Nachum、Shixiang Gu\n  - 关键词：基于模型、行为克隆（预热）、TRPO\n  - OpenReview评分：8, 7, 7, 5\n  - 实验环境：[d4rl数据集](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [面向控制的基于模型强化学习表示](https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.13408)\n  - 作者：Brandon Cui、Yinlam Chow、Mohammad Ghavamzadeh\n  - 关键词：表示学习、基于模型的软演员-评论家算法\n  - OpenReview评分：6, 6, 6\n  - 实验环境：平面系统、倒立摆——摆起、小车倒立摆、三连杆机械臂——摆起与平衡\n\n- [使用离散世界模型掌握Atari游戏](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.02193)\n  - 作者：Danijar Hafner、Timothy Lillicrap、Mohammad Norouzi、Jimmy Ba\n  - 关键词：DreamerV2、多种技巧（多个分类变量、KL平衡等）\n  - OpenReview评分：9, 8, 5, 4\n  - 实验环境：[Atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [基于自监督功能距离的视觉规划](https:\u002F\u002Fopenreview.net\u002Fforum?id=UcoXdfrORC)\n  - 作者：Stephen Tian、Suraj Nair、Frederik Ebert、Sudeep Dasari、Benjamin Eysenbach、Chelsea Finn、Sergey Levine\n  - 关键词：目标达成任务、动力学学习、距离学习（目标条件Q函数）\n  - OpenReview评分：7, 7, 7, 7\n  - 实验环境：[sawyer](https:\u002F\u002Fgithub.com\u002Frlworkgroup\u002Fmetaworld\u002Ftree\u002Fmaster\u002Fmetaworld\u002Fenvs)、门滑动\n\n- [基于模型的离线规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F2008.05556)\n  - 作者：Arthur Argenson、Gabriel Dulac-Arnold\n  - 关键词：基于模型、离线\n  - OpenReview评分：8, 7, 5, 5\n  - 实验环境：[RL Unplugged(RLU)](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdeepmind-research\u002Ftree\u002Fmaster\u002Frl_unplugged)、[d4rl数据集](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [基于归一化最大似然估计的离线模型优化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.07970)\n  - 作者：Justin Fu、Sergey Levine\n  - 关键词：基于模型、离线\n  - OpenReview评分：8, 6, 6\n  - 实验环境：[design-bench](https:\u002F\u002Fgithub.com\u002Fbrandontrabucco\u002Fdesign-bench)\n\n- [关于规划在基于模型深度强化学习中的作用](https:\u002F\u002Farxiv.org\u002Fabs\u002F2011.04021)\n  - 作者：Jessica B. Hamrick、Abram L. 
Friesen、Feryal Behbahani、Arthur Guez、Fabio Viola、Sims Witherspoon、Thomas Anthony、Lars Buesing、Petar Veličković、Théophane Weber\n  - 关键词：讨论MuZero中的规划问题\n  - OpenReview评分：7, 7, 6, 5\n  - 实验环境：[Atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)、围棋、[Deepmind控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [基于表示平衡的离线强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=QpNz8r_Ri2Y)\n  - 作者：Byung-Jun Lee、Jongmin Lee、Kee-Eung Kim\n  - 关键词：表示平衡MDP、基于模型、离线\n  - OpenReview评分：7, 7, 7, 6\n  - 实验环境：[d4rl数据集](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [基于模型的微观数据强化学习：关键的模型属性是什么？应选择哪种模型？](https:\u002F\u002Fopenreview.net\u002Fforum?id=p5uylG94S68)\n  - 作者：Balázs Kégl、Gabriel Hurtado、Albert Thomas\n  - 关键词：混合密度网络、异方差性\n  - OpenReview评分：7, 7, 7, 6, 5\n  - 实验环境：[acrobot系统](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n\u003C\u002Fdetails>\n\n### ICML 2021\n\n\u003Cdetails open>\n\u003Csummary>展开\u002F折叠\u003C\u002Fsummary>\n\n- [用于有效离线模型优化的保守目标模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2107.06882)\n  - 作者：Brandon Trabucco、Aviral Kumar、Xinyang Geng、Sergey Levine\n  - 关键词：保守目标模型、离线MBO（基于模型的优化）\n  - 实验环境：[design-bench](https:\u002F\u002Fgithub.com\u002Fbrandontrabucco\u002Fdesign-bench)\n\n- [连续时间基于模型的强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.04764)\n  - 作者：Çağatay Yıldız、Markus Heinonen、Harri Lähdesmäki\n  - 关键词：连续时间\n  - 实验环境：[单摆、小车倒立摆和acrobot](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [基于潜在空间配点法的强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.13229)\n  - 作者：Oleh Rybkin、Chuning Zhu、Anusha Nagabandi、Kostas Daniilidis、Igor Mordatch、Sergey Levine\n  - 关键词：潜在空间配点法\n  - 实验环境：[稀疏的metaworld任务](https:\u002F\u002Fgithub.com\u002Frlworkgroup\u002Fmetaworld\u002Ftree\u002Fmaster\u002Fmetaworld\u002Fenvs)\n\n- [因果关系不确定时的无模型与基于模型策略评估](http:\u002F\u002Fproceedings.mlr.press\u002Fv139\u002Fbruns-smith21a.html)\n  - 作者：David A Bruns-Smith\n  - 关键词：最坏情况下的界\n  - 实验环境：[ope-tools](https:\u002F\u002Fgithub.com\u002Fclvoloshin\u002FCOBS)\n\n- [Muesli：策略优化中多种改进的结合](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.06159)\n  - 作者：Matteo Hessel、Ivo Danihelka、Fabio Viola、Arthur Guez、Simon Schmitt、Laurent Sifre、Theophane Weber、David Silver、Hado van Hasselt\n  - 关键词：价值等价性\n  - 实验环境：[Atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)\n\n- [用于规划的向量量化模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.04615.pdf)\n  - 作者：Sherjil Ozair、Yazhe Li、Ali Razavi、Ioannis Antonoglou、Aäron van den Oord、Oriol Vinyals\n  - 关键词：VQVAE、蒙特卡洛树搜索\n  - 实验环境：[国际象棋数据集](https:\u002F\u002Fwww.ficsgames.org\u002Fdownload.html)、[DeepMind Lab](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Flab)\n\n- [PC-MLP：带策略覆盖引导探索的基于模型强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2107.07410)\n  - 作者：Yuda Song、Wen Sun\n  - 关键词：样本复杂度、核化非线性调节器、线性MDP\n  - 实验环境：[山地车、蚂蚁迷宫](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)、[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [潜在空间中基于模型的时序预测编码规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.07156)\n  - 作者：Tung Nguyen、Rui Shu、Tuan Pham、Hung Bui、Stefano Ermon\n  - 关键词：带有RSSM的时序预测编码、潜在空间\n  - 实验环境：[Deepmind控制套件](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)\n\n- [基于后验采样的连续控制强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2012.09613)\n  - 作者：Ying Fan、Yifei Ming\n  - 关键词：PSRL的后悔界、MPC\n  - 实验环境：[连续的小车倒立摆、单摆摆起](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)、[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [基于自我对弈的强化学习的严格分析](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.01604)\n  - 作者：Qinghua Liu、Tiancheng Yu、Yu 
Bai、Chi Jin\n  - 关键词：学习理论、多智能体、基于模型的自我对弈、双人零和马尔可夫博弈\n  - 实验环境：无\n\n\u003C\u002Fdetails>\n\n### 其他\n\n- [UniZero：基于可扩展潜在世界模型的通用高效规划](https:\u002F\u002Fopenreview.net\u002Fforum?id=Gl6dF9soQo) \n  - Pu Yuan, Niu Yazhe, Yang Zhenjie, Ren Jiyuan, Li Hongsheng, Liu Yu *TMLR2025*\n  - 关键词：世界模型、MCTS、基于模型的强化学习、Transformer、潜在规划、多任务学习  \n  - 实验环境：Atari、DMControl、VisualMatch\n\n- [驶向未来：面向自动驾驶的世界模型多视角视觉预测与规划](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2024\u002Fhtml\u002FWang_Driving_into_the_Future_Multiview_Visual_Forecasting_and_Planning_with_CVPR_2024_paper.html)\n  - Wang Yuqi, He Jiawei, Fan Lue, Li Hongxin, Chen Yuntao, Zhang Zhaoxiang *CVPR 2024*\n  - 关键词：自动驾驶世界建模\n  - 实验环境：nuScenes\n\n- [DriveWorld：基于世界模型的4D预训练场景理解用于自动驾驶](https:\u002F\u002Fopenreview.net\u002Fpdf?id=tT3LUdmzbd)\n  - Min Chen, Zhao Dawei, Xiao Liang, Zhao Jian, Xu Xinli, Zhu Zheng, Jin Lei, Li Jianshu, Guo Yulan, Xing Junliang, Jing Liping, Nie Yiming, Dai Bin *CVPR 2024*\n  - 关键词：自动驾驶世界建模\n  - 实验环境：nuScenes, OpenScene\n\n- [用于预测、表征和控制的掩码轨迹模型](https:\u002F\u002Fopenreview.net\u002Fpdf?id=tT3LUdmzbd)\n  - Wu Philipp, Majumdar Arjun, Stone Kevin, Lin Yixin, Mordatch Igor, Abbeel Pieter, Rajeswaran Aravind *ICLR 2023 Workshop RRL*\n  - 关键词：离线RL、控制学习、序列建模\n  - 实验环境：[d4rl](https:\u002F\u002Fgithub.com\u002Frail-berkeley\u002Fd4rl)\n\n- [基于策略引导轨迹扩散的世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.08533)\n  - Rigter Marc, Yamada Jun, Posner Ingmar *Arxiv 2023*\n  - 关键词：扩散模型、世界模型\n  - 实验环境：[deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)、[gridworld](https:\u002F\u002Fgithub.com\u002Fdennybritz\u002Freinforcement-learning\u002Fblob\u002Fmaster\u002Flib\u002Fenvs\u002Fgridworld.py)\n\n- [用于风险感知策略优化的基于模型价值认知方差](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.04386)\n  - Luis Carlos E., Bottero Alessandro G., Vinogradska Julia, Berkenkamp Felix, Peters Jan *Arxiv 2023*\n  - 关键词：MBRL中的累积奖励不确定性估计\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n- [使用数据增强的基于模型强化学习高效求解现实世界迷宫游戏](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.09906)\n  - Bi Thomas, D'Andrea Raffaello *Arxiv 2023*\n  - 关键词：数据增强、DreamerV3\n  - 实验环境：现实世界迷宫游戏\n\n- [通过世界模型掌握多样化领域](https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.04104)\n  - Hafner Danijar, Pasukonis Jurgis, Ba Jimmy, Lillicrap Timothy *Arxiv 2023*\n  - 关键词：DreamerV3、世界模型的可扩展性\n  - 实验环境：[deepmind control suite](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Fdm_control)、[atari](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgym)、[DMLab](https:\u002F\u002Fgithub.com\u002Fdeepmind\u002Flab)、[minecraft](https:\u002F\u002Fgithub.com\u002Fminerllabs\u002Fminerl)\n\n- [从基于模型的规划中提炼出理论保证的策略改进](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.12933)\n  - Li Chuming, Jia Ruonan, Yao Jiawei, Liu Jie, Zhang Yinmin, Niu Yazhe, Yang Yaodong, Liu Yu, Ouyang Wanli *IJCAI Workshop 2023*\n  - 关键词：扩展的策略改进、模型正则化、规划定理\n  - 实验环境：[mujoco](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmujoco-py)\n\n\n## 教程\n\n- [视频] [Csaba Szepesvári - 基于模型的强化学习面临的挑战及其克服方法](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=-Y-fHsPIQ_Q)\n- [博客] [基于模型的强化学习：理论与实践](https:\u002F\u002Fbair.berkeley.edu\u002Fblog\u002F2019\u002F12\u002F12\u002Fmbpo\u002F)\n\n\n## 代码库\n\n- [mbrl-lib](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fmbrl-lib) - Meta：基于模型强化学习的库\n- [DI-engine](https:\u002F\u002Fgithub.com\u002Fopendilab\u002FDI-engine) - OpenDILab：决策AI引擎\n\n\n## 贡献\n\n我们的目标是让这个仓库变得更好。如果您有兴趣贡献，请参阅[此处](CONTRIBUTING.md)以获取贡献说明。\n\n\n## 许可证\n\nAwesome Model-Based RL 根据 Apache 2.0 许可证发布。\n\n\u003Cp align=\"right\">(\u003Ca href=\"#top\">返回顶部\u003C\u002Fa>)\u003C\u002Fp>","# awesome-model-based-RL 快速上手指南\n\n`awesome-model-based-RL` 并非一个可直接安装运行的软件库或框架，而是一个**基于模型强化学习（Model-Based RL, MBRL）领域的精选论文与资源清单**。它由 OpenDILab 社区维护，旨在帮助研究者和开发者追踪该领域的前沿进展、经典算法及开源代码实现。\n\n本指南将指导你如何获取、浏览并利用该资源库进行研究和学习。\n\n## 环境准备\n\n由于本项目本质是一个文档仓库（Awesome List），无需复杂的系统环境或深度学习框架即可浏览内容。但为了运行列表中链接到的具体算法代码，建议准备以下基础环境：\n\n*   **操作系统**：Linux (推荐 Ubuntu 18.04+), macOS 或 Windows (WSL2)\n*   **版本控制**：已安装 `git`\n*   **浏览器**：现代浏览器（Chrome, Firefox, Edge 等）用于查看在线文档\n*   **可选依赖（用于复现论文代码）**：\n    *   Python 3.8+\n    *   PyTorch 或 TensorFlow (视具体论文实现而定)\n    *   MuJoCo, Gym\u002FDm_Control 等仿真环境 (如需运行实验)\n\n## 安装步骤\n\n你可以通过克隆仓库到本地或直接在线浏览两种方式使用该资源。\n\n### 方式一：在线浏览（推荐快速查阅）\n\n直接访问 GitHub 仓库页面或国内镜像（如有）查看最新整理的论文列表和分类。\n\n*   **GitHub 主站**: [https:\u002F\u002Fgithub.com\u002Fopendilab\u002Fawesome-model-based-RL](https:\u002F\u002Fgithub.com\u002Fopendilab\u002Fawesome-model-based-RL)\n*   **国内加速访问**: 如果访问 GitHub 较慢，可使用国内代码托管平台的镜像功能，或通过代理工具访问。\n\n### 方式二：克隆到本地（推荐深度阅读与离线查找）\n\n在终端执行以下命令将仓库克隆至本地：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fopendilab\u002Fawesome-model-based-RL.git\ncd awesome-model-based-RL\n```\n\n若需国内加速下载，可使用 Gitee 镜像（如果存在）或配置 git 代理：\n\n```bash\n# 示例：使用 git 代理加速（需自行配置代理地址）\ngit clone https:\u002F\u002Fgithub.com\u002Fopendilab\u002Fawesome-model-based-RL.git --config http.proxy=http:\u002F\u002F127.0.0.1:7890\n```\n\n克隆完成后，使用 Markdown 阅读器（如 VS Code, Typora）打开 `README.md` 文件即可浏览完整内容。\n\n## 基本使用\n\n本项目的核心用法是**按图索骥**：通过分类索引找到感兴趣的论文，然后前往其对应的开源代码库进行复现或学习。\n\n### 1. 浏览算法分类体系\n打开 `README.md`，首先查看 **\"A Taxonomy of Model-Based RL Algorithms\"** 章节。这里提供了 MBRL 算法的分类图谱，将算法分为两大类：\n*   **Learn the Model**: 专注于如何构建环境模型（如 World Models, PILCO）。\n*   **Given the Model**: 专注于如何利用已学到的模型进行规划或策略优化（如 AlphaZero, MCTS）。\n\n
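为直观理解“学习模型”（Learn the Model）与“利用模型”（Given the Model）两个环节在一个算法中如何衔接，下面给出一个极简的两阶段示意（纯 NumPy 玩具示例，与清单中任何论文或代码库无关，环境动力学及 `step`、`model`、`plan` 等名称均为本文虚构）：先用随机交互数据以最小二乘拟合一个线性动力学模型，再用随机打靶（random shooting）MPC 在学到的模型中规划动作。\n\n```python\nimport numpy as np\n\nrng = np.random.default_rng(0)\n\ndef step(s, a):\n    # 玩具环境的真实动力学（对智能体不可见）：下一状态 = s + 0.1*a + 噪声\n    return s + 0.1 * a + 0.01 * rng.normal()\n\n# 阶段一（学习模型）：用随机动作采集转移数据，并用最小二乘拟合线性动力学\nS, A, S2 = [], [], []\ns = 0.0\nfor _ in range(200):\n    a = rng.uniform(-1.0, 1.0)\n    s2 = step(s, a)\n    S.append(s)\n    A.append(a)\n    S2.append(s2)\n    s = s2\nX = np.stack([np.array(S), np.array(A)], axis=1)\nw, *_ = np.linalg.lstsq(X, np.array(S2), rcond=None)  # 拟合 下一状态 ≈ w0*s + w1*a\n\ndef model(s, a):\n    # 学到的动力学模型：规划阶段只查询它，不再接触真实环境\n    return w[0] * s + w[1] * a\n\n# 阶段二（利用模型）：随机打靶 MPC，在模型中展开候选动作序列，选终态代价最小者\ngoal = 1.0\n\ndef plan(s, horizon=5, n_candidates=64):\n    seqs = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))\n    costs = np.zeros(n_candidates)\n    for i in range(n_candidates):\n        sim = s\n        for a in seqs[i]:\n            sim = model(sim, a)\n        costs[i] = (sim - goal) ** 2\n    return seqs[np.argmin(costs), 0]  # 只执行最优序列的第一个动作\n\ns = 0.0\nfor t in range(30):\n    s = step(s, plan(s))\nprint('最终状态:', s, '目标:', goal)\n```\n\n真实算法（如清单中的 MBPO、MuZero 等）在模型结构、不确定性估计与规划方式上远比这复杂，此处仅示意“先学模型、再用模型规划”的基本骨架。\n\n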
### 2. 查找特定论文\n利用目录（Table of Contents）快速定位到你关注的会议年份或类别，例如：\n*   **经典论文**: 查看 `Classic Model-Based RL Papers` 了解 Dyna, PILCO 等奠基性工作。\n*   **最新前沿**: 查看 `NeurIPS 2025`, `ICML 2025`, `ICLR 2025` 等章节获取最新研究成果。\n\n每条记录包含：\n*   **标题与链接**: 点击可直接跳转论文原文。\n*   **核心贡献 (Key)**: 一句话概括算法的关键创新点。\n*   **实验环境 (ExpEnv)**: 标注了论文使用的测试环境（如 Mujoco, Atari），便于评估复现难度。\n\n
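若你已按上文将仓库克隆到本地，也可以用一段简单脚本按关键词过滤论文条目（示意脚本，并非仓库自带工具，条目的具体格式以仓库当前 README 为准）：\n\n```python\n# 在克隆下来的仓库根目录运行：按关键词过滤 README.md 中的论文条目\nkeyword = 'muzero'  # 要搜索的关键词（小写），可换成 offline、world model 等\nblocks, cur = [], []\nwith open('README.md', encoding='utf-8') as f:\n    for line in f:\n        # 每条论文是以 '- [' 开头、内部不含空行的列表块，块之间以空行分隔\n        if line.strip():\n            cur.append(line.rstrip())\n        else:\n            if cur:\n                blocks.append(cur)\n            cur = []\nif cur:\n    blocks.append(cur)\nfor b in blocks:\n    if b[0].lstrip().startswith('- [') and keyword in ' '.join(b).lower():\n        print()\n        for line in b:\n            print(line)\n```\n\n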
### 3. 获取代码并运行\n该项目本身不提供统一的可执行代码，但每个论文条目通常隐含或明确指向其开源实现。\n*   **步骤**:\n    1. 在列表中找到目标论文（例如 `DreamerV1`）。\n    2. 点击论文标题阅读摘要，确认是否符合需求。\n    3. 搜索该论文名称 + \"github\" (或在项目 `Codebase` 章节查找)，找到官方或第三方复现代码库。\n    4. 进入对应的代码库，按照其独立的 `README` 进行安装和运行。\n\n**示例：复现 DreamerV1**\n1. 在本列表中找到 `[Dream to Control: Learning Behaviors by Latent Imagination]`。\n2. 记录关键信息：使用 DeepMind Control Suite 环境，基于潜在空间想象。\n3. 前往 GitHub 搜索 `dreamerv1 pytorch` 找到高星实现（如 `danijar\u002Fdreamerv2` 通常包含 v1 分支或参考实现）。\n4. 在该代码库中执行（入口脚本与参数以该仓库 README 为准，下述命令为 dreamerv2 的典型用法）：\n   ```bash\n   git clone https:\u002F\u002Fgithub.com\u002Fdanijar\u002Fdreamerv2.git\n   cd dreamerv2\n   pip install -r requirements.txt\n   python3 dreamerv2\u002Ftrain.py --logdir ~\u002Flogdir\u002Fdmc_walker_walk --configs dmc_vision --task dmc_walker_walk\n   ```\n\n通过这种方式，`awesome-model-based-RL` 成为了你进入基于模型强化学习世界的导航图，帮助你高效定位高质量的研究成果与代码资源。","某自动驾驶初创公司的算法团队正致力于开发基于模型的强化学习（MBRL）策略，以在仿真环境中高效训练车辆应对复杂路况。\n\n### 没有 awesome-model-based-RL 时\n- **文献检索如大海捞针**：研究人员需手动在 arXiv 及 NeurIPS、ICML 等各大会议中筛选论文，耗时数周仍难以覆盖最新的前沿成果，极易遗漏关键突破。\n- **技术路线梳理困难**：面对“学习模型”与“利用模型”等不同流派，缺乏系统的分类指引，团队难以快速构建清晰的技术演进图谱，导致选型盲目。\n- **复现成本高昂**：找不到官方代码库或权威教程，新手往往需要从零摸索算法细节，大量时间浪费在调试基础环境而非核心创新上。\n- **信息更新滞后**：由于缺乏持续维护的渠道，团队无法及时获取如 2025 年最新顶会论文列表，技术栈容易与社区前沿脱节。\n\n### 使用 awesome-model-based-RL 后\n- **一站式获取前沿资源**：团队直接查阅按年份和顶会（如 NeurIPS 2025、ICLR 2025）整理的论文清单，几分钟内即可锁定领域内最新的 SOTA 方法。\n- **清晰的技术导航**：借助仓库提供的算法分类图谱，研究人员迅速理清了 World Models、I2A 等经典与新兴算法的逻辑关系，精准定位适合自动驾驶场景的技术路线。\n- **加速落地验证**：通过集成的 Codebase 和 Tutorial 链接，工程师直接复用成熟的代码框架，将算法复现周期从数周缩短至几天，大幅降低试错成本。\n- **同步社区脉搏**：依托仓库的持续更新机制，团队能第一时间掌握每月新增的研究成果，确保技术方案始终处于行业领先地位。\n\nawesome-model-based-RL 将原本分散、滞后的科研资源转化为结构化的知识引擎，极大提升了团队在基于模型强化学习领域的研发效率与创新速度。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopendilab_awesome-model-based-RL_f2f7c68c.png","opendilab","OpenDILab","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fopendilab_83f31d72.png","Open-source Decision Intelligence (DI) Platform",null,"opendilab@pjlab.org.cn","https:\u002F\u002Fgithub.com\u002Fopendilab",1326,76,"2026-04-04T17:05:40","Apache-2.0","","未说明",{"notes":90,"python":88,"dependencies":91},"该仓库是一个基于模型强化学习（Model-Based RL）的研究论文列表和分类整理，并非可执行的软件工具或代码库。因此，它没有特定的操作系统、GPU、内存、Python 版本或依赖库要求。用户仅需浏览器即可访问内容，若需运行列表中提及的具体算法代码，请参考各论文对应的原始代码仓库。",[],[18],[94,95,96,97,98,99],"reinforcement-learning","reinforcement-learning-algorithms","model-based-reinforcement-learning","model-based-rl","awesome","awesome-list","2026-03-27T02:49:30.150509","2026-04-06T08:40:50.797053",[],[]]