[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-yingchengyang--Reinforcement-Learning-Papers":3,"tool-yingchengyang--Reinforcement-Learning-Papers":65},[4,23,32,40,49,57],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":22},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85267,2,"2026-04-18T11:00:28",[13,14,15,16,17,18,19,20,21],"图像","数据工具","视频","插件","Agent","其他","语言模型","开发框架","音频","ready",{"id":24,"name":25,"github_repo":26,"description_zh":27,"stars":28,"difficulty_score":29,"last_commit_at":30,"category_tags":31,"status":22},5784,"funNLP","fighting41love\u002FfunNLP","funNLP 是一个专为中文自然语言处理（NLP）打造的超级资源库，被誉为\"NLP 民工的乐园”。它并非单一的软件工具，而是一个汇集了海量开源项目、数据集、预训练模型和实用代码的综合性平台。\n\n面对中文 NLP 领域资源分散、入门门槛高以及特定场景数据匮乏的痛点，funNLP 提供了“一站式”解决方案。这里不仅涵盖了分词、命名实体识别、情感分析、文本摘要等基础任务的标准工具，还独特地收录了丰富的垂直领域资源，如法律、医疗、金融行业的专用词库与数据集，甚至包含古诗词生成、歌词创作等趣味应用。其核心亮点在于极高的全面性与实用性，从基础的字典词典到前沿的 BERT、GPT-2 模型代码，再到高质量的标注数据和竞赛方案，应有尽有。\n\n无论是刚刚踏入 NLP 领域的学生、需要快速验证想法的算法工程师，还是从事人工智能研究的学者，都能在这里找到急需的“武器弹药”。对于开发者而言，它能大幅减少寻找数据和复现模型的时间；对于研究者，它提供了丰富的基准测试资源和前沿技术参考。funNLP 以开放共享的精神，极大地降低了中文自然语言处理的开发与研究成本，是中文 AI 社区不可或缺的宝藏仓库。",79857,1,"2026-04-08T20:11:31",[19,14,18],{"id":33,"name":34,"github_repo":35,"description_zh":36,"stars":37,"difficulty_score":29,"last_commit_at":38,"category_tags":39,"status":22},5773,"cs-video-courses","Developer-Y\u002Fcs-video-courses","cs-video-courses 是一个精心整理的计算机科学视频课程清单，旨在为自学者提供系统化的学习路径。它汇集了全球知名高校（如加州大学伯克利分校、新南威尔士大学等）的完整课程录像，涵盖从编程基础、数据结构与算法，到操作系统、分布式系统、数据库等核心领域，并深入延伸至人工智能、机器学习、量子计算及区块链等前沿方向。\n\n面对网络上零散且质量参差不齐的教学资源，cs-video-courses 解决了学习者难以找到成体系、高难度大学级别课程的痛点。该项目严格筛选内容，仅收录真正的大学层级课程，排除了碎片化的简短教程或商业广告，确保用户能接触到严谨的学术内容。\n\n这份清单特别适合希望夯实计算机基础的开发者、需要补充特定领域知识的研究人员，以及渴望像在校生一样系统学习计算机科学的自学者。其独特的技术亮点在于分类极其详尽，不仅包含传统的软件工程与网络安全，还细分了生成式 AI、大语言模型、计算生物学等新兴学科，并直接链接至官方视频播放列表，让用户能一站式获取高质量的教育资源，免费享受世界顶尖大学的课堂体验。",79792,"2026-04-08T22:03:59",[18,13,14,20],{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":46,"last_commit_at":47,"category_tags":48,"status":22},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,3,"2026-04-04T04:44:48",[17,13,20,19,18],{"id":50,"name":51,"github_repo":52,"description_zh":53,"stars":54,"difficulty_score":46,"last_commit_at":55,"category_tags":56,"status":22},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 
Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",75872,"2026-04-18T10:54:57",[19,13,20,18],{"id":58,"name":59,"github_repo":60,"description_zh":61,"stars":62,"difficulty_score":29,"last_commit_at":63,"category_tags":64,"status":22},3215,"awesome-machine-learning","josephmisiti\u002Fawesome-machine-learning","awesome-machine-learning 是一份精心整理的机器学习资源清单，汇集了全球优秀的机器学习框架、库和软件工具。面对机器学习领域技术迭代快、资源分散且难以甄选的痛点，这份清单按编程语言（如 Python、C++、Go 等）和应用场景（如计算机视觉、自然语言处理、深度学习等）进行了系统化分类，帮助使用者快速定位高质量项目。\n\n它特别适合开发者、数据科学家及研究人员使用。无论是初学者寻找入门库，还是资深工程师对比不同语言的技术选型，都能从中获得极具价值的参考。此外，清单还延伸提供了免费书籍、在线课程、行业会议、技术博客及线下聚会等丰富资源，构建了从学习到实践的全链路支持体系。\n\n其独特亮点在于严格的维护标准：明确标记已停止维护或长期未更新的项目，确保推荐内容的时效性与可靠性。作为机器学习领域的“导航图”，awesome-machine-learning 以开源协作的方式持续更新，旨在降低技术探索门槛，让每一位从业者都能高效地站在巨人的肩膀上创新。",72149,"2026-04-03T21:50:24",[20,18],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":81,"owner_email":82,"owner_twitter":82,"owner_website":83,"owner_url":84,"languages":82,"stars":85,"forks":86,"last_commit_at":87,"license":88,"difficulty_score":29,"env_os":89,"env_gpu":90,"env_ram":90,"env_deps":91,"category_tags":94,"github_topics":95,"view_count":10,"oss_zip_url":82,"oss_zip_packed_at":82,"status":22,"created_at":115,"updated_at":116,"faqs":117,"releases":118},9232,"yingchengyang\u002FReinforcement-Learning-Papers","Reinforcement-Learning-Papers","Related papers for reinforcement learning, including classic papers and latest papers in top conferences","Reinforcement-Learning-Papers 是一个专注于强化学习领域的开源论文精选库，旨在帮助研究者高效追踪该方向的核心进展。面对每年顶级会议（如 ICLR、ICML、NeurIPS）涌现的海量新论文，手动筛选高价值内容往往耗时费力，而该项目通过人工阅读与甄别，整理出了一份涵盖经典奠基之作与前沿最新成果的清单，有效解决了信息过载与优质资源难寻的痛点。\n\n该资源特别适合强化学习领域的研究人员、算法工程师以及相关专业的学生使用。无论是希望系统梳理知识体系的初学者，还是急需把握最新技术风向的资深专家，都能从中获益。其独特亮点在于不仅收录了从 DQN、策略梯度等经典方法到基于 Transformer\u002FLLM 的序列生成等前沿探索，还细致地按“无模型\u002F有模型”、“在线\u002F离线”、“元学习”及“对抗学习”等维度进行了结构化分类。此外，项目持续更新直至 2026 年的会议论文，并特别关注单智能体场景，为使用者提供了一条清晰、高质量的技术演进路径，是深入理解强化学习不可或缺的参考指南。","# Reinforcement Learning Papers\n[![PRs Welcome](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPRs-welcome-brightgreen.svg?style=flat-square)](http:\u002F\u002Fmakeapullrequest.com)\n\nRelated papers for Reinforcement Learning (we mainly focus on single-agent).\n\nSince there are tens of thousands of new papers on reinforcement learning at each conference every year, we are only able to list those we read and consider as insightful.\n\n**We have added some ICLR22, ICML22, NeurIPS22, ICLR23, ICML23, NeurIPS23, ICLR24, ICML24, NeurIPS24, ICLR25, ICML25, NeurIPS25, ICLR26 papers on RL.**\n\u003C!-- neurips 24 page 31-->\n\u003C!-- iclr 25 page 11-->\n\u003C!-- icml 25 page 1-->\n\u003C!-- nsueips 25 page 21-->\n\n## Contents \n* [Model Free (Online) RL](#Model-Free-Online)\n    - [Classic Methods](#model-free-classic)\n    - [Exploration](#exploration)\n    - [Representation Learning](#Representation-RL)\n    - [Unsupervised Learning](#Unsupervised-RL)\n    - [Current methods](#current)\n* [Model Based (Online) RL](#Model-Based-Online)\n    - [Classic Methods](#model-based-classic)\n    - [World Models](#dreamer)\n    - [CodeBase](#model-based-code)\n* [(Model Free) Offline RL](#Model-Free-Offline)\n    - [Current methods](#offline-current)\n    - [Combined with Diffusion Models](#offline-diffusion)\n* [Model Based Offline 
<a id='Model-Free-Online'></a>
## Model Free (Online) RL


<a id='model-free-classic'></a>
### Classic Methods

|  Title | Method | Conference | on/off policy | Action Space | Policy | Description |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| [Human-level control through deep reinforcement learning](https://www.nature.com/articles/nature14236.pdf), [\[other link\]](http://www.kreimanlab.com/academia/classes/BAI/pdfs/MnihEtAlHassibis15NatureControlDeepRL.pdf) | DQN | Nature15 | off | Discrete | based on value function | use a deep neural network to train Q-learning and reach human level in the Atari games; two main tricks: a replay buffer to improve sample efficiency, and decoupling the target network from the behavior network |
| [Deep reinforcement learning with double q-learning](https://arxiv.org/pdf/1509.06461.pdf) | Double DQN | AAAI16 | off | Discrete | based on value function | find that the Q function in DQN may overestimate; decouple action selection from Q-value estimation with two neural networks |
| [Dueling network architectures for deep reinforcement learning](https://arxiv.org/pdf/1511.06581.pdf) | Dueling DQN | ICML16 | off | Discrete | based on value function | split a shared network into separate value and advantage streams and combine them to estimate the Q function |
| [Prioritized Experience Replay](https://arxiv.org/pdf/1511.05952.pdf) | Priority Sampling | ICLR16 | off | Discrete | based on value function | give different weights to the samples in the replay buffer (e.g. by TD error) |
| [Rainbow: Combining Improvements in Deep Reinforcement Learning](https://arxiv.org/pdf/1710.02298.pdf) | Rainbow | AAAI18 | off | Discrete | based on value function | combine different improvements to DQN: Double DQN, Dueling DQN, Priority Sampling, Multi-step learning, Distributional RL, Noisy Nets |
| [Policy Gradient Methods for Reinforcement Learning with Function Approximation](https://proceedings.neurips.cc/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf) | PG | NeurIPS99 | on/off | Continuous or Discrete | function approximation | propose the Policy Gradient Theorem: how to calculate the gradient of the expected cumulative return with respect to the policy |
| ---- | AC/A2C | ---- | on/off | Continuous or Discrete | parameterized neural network | AC: replace the return in PG with a Q-function approximator to reduce variance; A2C: replace the Q function in AC with the advantage function to further reduce variance |
| [Asynchronous Methods for Deep Reinforcement Learning](https://arxiv.org/pdf/1602.01783.pdf) | A3C | ICML16 | on/off | Continuous or Discrete | parameterized neural network | propose three tricks to improve performance: (i) use multiple parallel agents to interact with the environment; (ii) share network parameters between the value function and the policy; (iii) modify the loss function (MSE of the value function + PG loss + policy entropy) |
| [Trust Region Policy Optimization](https://arxiv.org/pdf/1502.05477.pdf) | TRPO | ICML15 | on | Continuous or Discrete | parameterized neural network | introduce a trust region into policy optimization for guaranteed monotonic improvement |
| [Proximal Policy Optimization Algorithms](https://arxiv.org/pdf/1707.06347.pdf) | PPO | arxiv17 | on | Continuous or Discrete | parameterized neural network | replace the hard constraint of TRPO with a clipped surrogate objective that clips the probability ratio (or with a KL penalty) |
| [Deterministic Policy Gradient Algorithms](http://proceedings.mlr.press/v32/silver14.pdf) | DPG | ICML14 | off | Continuous | function approximation | consider deterministic policies for continuous action spaces and prove the Deterministic Policy Gradient Theorem; use a stochastic behaviour policy to encourage exploration |
| [Continuous Control with Deep Reinforcement Learning](https://arxiv.org/pdf/1509.02971.pdf) | DDPG | ICLR16 | off | Continuous | parameterized neural network | adapt the ideas of DQN to DPG: (i) deep neural network function approximators, (ii) replay buffer, (iii) a target Q function kept fixed between updates |
| [Addressing Function Approximation Error in Actor-Critic Methods](https://arxiv.org/pdf/1802.09477.pdf) | TD3 | ICML18 | off | Continuous | parameterized neural network | adapt the idea of Double DQN to DDPG: take the minimum value between a pair of critics to limit overestimation |
| [Reinforcement Learning with Deep Energy-Based Policies](https://arxiv.org/pdf/1702.08165.pdf) | SQL | ICML17 | off | mainly Continuous | parameterized neural network | consider max-entropy RL and propose soft Q iteration as well as soft Q-learning |
| [Soft Actor-Critic Algorithms and Applications](https://arxiv.org/pdf/1812.05905.pdf), [Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor](http://proceedings.mlr.press/v80/haarnoja18b/haarnoja18b.pdf), [\[appendix\]](http://proceedings.mlr.press/v80/haarnoja18b/haarnoja18b-supp.pdf) | SAC | ICML18 | off | mainly Continuous | parameterized neural network | build on the theoretical analysis of SQL and extend soft Q iteration to soft policy iteration (soft policy evaluation + soft policy improvement); reparameterize the policy and use two parameterized value functions; propose SAC |
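The DQN row above credits two tricks, a replay buffer and a decoupled target network, for making Q-learning stable with deep networks. Below is a minimal, illustrative PyTorch sketch of that update; the `QNet` architecture, sizes, and hyperparameters are placeholders rather than the paper's settings.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    """Placeholder Q-network: state in, one Q-value per discrete action out."""
    def __init__(self, obs_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))

    def forward(self, x):
        return self.net(x)

q, q_target = QNet(), QNet()
q_target.load_state_dict(q.state_dict())   # trick 2: frozen target network
buffer = deque(maxlen=100_000)             # trick 1: replay buffer of (s, a, r, s', done)
opt = torch.optim.Adam(q.parameters(), lr=1e-4)
gamma = 0.99

def td_update(batch_size=32):
    # Sample i.i.d. transitions; states are assumed stored as tensors.
    batch = random.sample(list(buffer), batch_size)
    s  = torch.stack([t[0] for t in batch])
    a  = torch.tensor([t[1] for t in batch])
    r  = torch.tensor([t[2] for t in batch], dtype=torch.float32)
    s2 = torch.stack([t[3] for t in batch])
    done = torch.tensor([t[4] for t in batch], dtype=torch.float32)
    with torch.no_grad():  # bootstrap from the frozen network, not the one being trained
        target = r + gamma * (1.0 - done) * q_target(s2).max(dim=1).values
    pred = q(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad(); loss.backward(); opt.step()
    # Periodically: q_target.load_state_dict(q.state_dict())
```

A Double DQN variant, as the second row notes, would instead select the argmax action with `q` and evaluate it with `q_target` to limit overestimation.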
<a id='exploration'></a>
### Exploration
|  Title | Method | Conference |  Description |
| ---- | ---- | ---- | ---- |
| [Curiosity-driven Exploration by Self-supervised Prediction](https://arxiv.org/pdf/1705.05363.pdf) | ICM | ICML17 | propose that curiosity can serve as an intrinsic reward signal to enable the agent to explore its environment and learn skills when rewards are sparse; formulate curiosity as the error in an agent's ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model |
| [Curiosity-Driven Exploration via Latent Bayesian Surprise](https://arxiv.org/pdf/2104.07495.pdf) | LBS | AAAI22 | apply Bayesian surprise in a latent space representing the agent's current understanding of the dynamics of the system |
| [Automatic Intrinsic Reward Shaping for Exploration in Deep Reinforcement Learning](https://arxiv.org/pdf/2301.10886.pdf) | AIRS | ICML23 | select the shaping function from a predefined set based on the estimated task return in real time, providing reliable exploration incentives and alleviating the biased objective problem; develop a toolkit that provides high-quality implementations of various intrinsic reward modules based on PyTorch |
| [Curiosity in Hindsight: Intrinsic Exploration in Stochastic Environments](https://arxiv.org/pdf/2211.10515.pdf) | Curiosity in Hindsight | ICML23 | consider exploration in stochastic environments; learn representations of the future that capture precisely the unpredictable aspects of each outcome, which are used as additional input for predictions, such that intrinsic rewards only reflect the predictable aspects of world dynamics |
| Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration || NeurIPS23 spotlight ||
| [MIMEx: Intrinsic Rewards from Masked Input Modeling](https://arxiv.org/pdf/2305.08932.pdf) | MIMEx | NeurIPS23 | propose that the mask distribution can be flexibly tuned to control the difficulty of the underlying conditional prediction task |
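Several of these methods, ICM most directly, define the intrinsic reward as the prediction error of a learned forward model in feature space. Here is a minimal, illustrative PyTorch sketch under assumed dimensions; the encoder, sizes, and `eta` scale are placeholders, and ICM's inverse-dynamics loss for shaping the features is omitted.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, feat_dim = 8, 2, 32                               # assumed sizes
encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())    # phi(s)
forward_model = nn.Linear(feat_dim + act_dim, feat_dim)             # (phi(s), a) -> phi(s')

def curiosity_bonus(s, a, s_next, eta=0.1):
    """Intrinsic reward = forward-model prediction error in feature space."""
    with torch.no_grad():
        target = encoder(s_next)        # phi(s') treated as a fixed regression target
    pred = forward_model(torch.cat([encoder(s), a], dim=-1))
    err = 0.5 * (pred - target).pow(2).sum(dim=-1)
    # Bonus added to the extrinsic reward, plus the loss used to train the model.
    return eta * err.detach(), err.mean()
```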
<!--<a id='off-policy-evaluation'></a>
### Off-Policy Evaluation
|  Title | Method | Conference |  Description |
| ---- | ---- | ---- | ---- |
| [Weighted importance sampling for off-policy learning with linear function approximation](https://proceedings.neurips.cc/paper/2014/file/be53ee61104935234b174e62a07e53cf-Paper.pdf) | WIS-LSTD | NeurIPS14 |  |
| [Importance Sampling Policy Evaluation with an Estimated Behavior Policy](https://arxiv.org/pdf/1806.01347.pdf) | RIS | ICML19 |  |
| [On the Reuse Bias in Off-Policy Reinforcement Learning](https://arxiv.org/pdf/2209.07074.pdf) | BIRIS | IJCAI23 | discuss the bias of off-policy evaluation due to reusing the replay buffer; derive a high-probability bound of the Reuse Bias; introduce the concept of stability for off-policy algorithms and provide an upper bound for stable off-policy algorithms |


<a id='soft-rl'></a>
### Soft RL

|  Title | Method | Conference |  Description |
| ---- | ---- | ---- | ---- |
| [A Max-Min Entropy Framework for Reinforcement Learning](https://arxiv.org/pdf/2106.10517.pdf) | MME | NeurIPS21 | find that SAC may fail to explore states with low entropy (it tends to reach high-entropy states and further increase their entropy); propose a max-min entropy framework to address this issue |
| [Maximum Entropy RL (Provably) Solves Some Robust RL Problems](https://arxiv.org/pdf/2103.06257.pdf) | ---- | ICLR22 | theoretically prove that standard maximum entropy RL is robust to some disturbances in the dynamics and the reward function |


<a id='data-augmentation'></a>
### Data Augmentation
|  Title | Method | Conference |  Description |
| ---- | ---- | ---- | ---- |
| [Reinforcement Learning with Augmented Data](https://arxiv.org/pdf/2004.14990.pdf) | RAD | NeurIPS20 | propose the first extensive study of general data augmentations for RL on both pixel-based and state-based inputs |
| [Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels](https://arxiv.org/pdf/2004.13649.pdf) | DrQ | ICLR21 Spotlight | propose to regularize the value function when applying data augmentation with model-free methods and reach state-of-the-art performance in image-pixels tasks | -->

<!-- | [Equivalence notions and model minimization in Markov decision processes](https://www.ics.uci.edu/~dechter/courses/ics-295/winter-2018/papers/givan-dean-greig.pdf) |  | Artificial Intelligence, 2003 |  |
| [Metrics for Finite Markov Decision Processes](https://arxiv.org/ftp/arxiv/papers/1207/1207.4114.pdf) || UAI04 ||
| [Bisimulation metrics for continuous Markov decision processes](https://www.normferns.com/assets/documents/sicomp2011.pdf) || SIAM Journal on Computing, 2011 ||
| [Scalable methods for computing state similarity in deterministic Markov Decision Processes](https://arxiv.org/pdf/1911.09291.pdf) || AAAI20 ||
| [Learning Invariant Representations for Reinforcement Learning without Reconstruction](https://arxiv.org/pdf/2006.10742.pdf) | DBC | ICLR21 || -->
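The maximum-entropy line of work above (SQL and SAC in the classic table, and the Soft RL entries kept commented out here) bootstraps from a soft value that subtracts a scaled log-probability. A minimal sketch of that target under assumed tensor shapes; the names and the fixed `alpha` are placeholders (SAC's later version tunes `alpha` automatically).

```python
import torch

def soft_td_target(reward, done, next_q1, next_q2, next_logp,
                   alpha=0.2, gamma=0.99):
    """SAC-style soft target: min of two target critics minus the entropy term.

    next_q1/next_q2: target-critic values at actions sampled from the current
    policy at s'; next_logp: log pi(a'|s') for those actions; all shape [batch].
    """
    soft_v = torch.min(next_q1, next_q2) - alpha * next_logp  # soft state value at s'
    return reward + gamma * (1.0 - done) * soft_v
```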
<a id='Representation-RL'></a>
## Representation Learning

Note: representation learning with MBRL is in the part [World Models](#dreamer)

|  Title | Method | Conference | Description |
| ---- | ---- | ---- | ---- |
| [CURL: Contrastive Unsupervised Representations for Reinforcement Learning](https://arxiv.org/pdf/2004.04136.pdf) | CURL | ICML20 | extract high-level features from raw pixels using contrastive learning and perform off-policy control on top of the extracted features |
| [Learning Invariant Representations for Reinforcement Learning without Reconstruction](https://arxiv.org/pdf/2006.10742.pdf) | DBC | ICLR21 | propose using bisimulation to learn robust latent representations which encode only the task-relevant information from observations |
| [Reinforcement Learning with Prototypical Representations](https://arxiv.org/pdf/2102.11271.pdf) | Proto-RL | ICML21 | pre-train task-agnostic representations and prototypes on environments without downstream task information |
| [Understanding the World Through Action](https://arxiv.org/pdf/2110.12543.pdf) | ---- | CoRL21 | discuss how self-supervised reinforcement learning combined with offline RL can enable scalable representation learning |
| [Flow-based Recurrent Belief State Learning for POMDPs](https://proceedings.mlr.press/v162/chen22q/chen22q.pdf) | FORBES | ICML22 | incorporate normalizing flows into the variational inference to learn general continuous belief states for POMDPs |
| [Contrastive Learning as Goal-Conditioned Reinforcement Learning](https://arxiv.org/pdf/2206.07568.pdf) | Contrastive RL | NeurIPS22 | show (contrastive) representation learning methods can be cast as RL algorithms in their own right |
| [Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels?](https://arxiv.org/pdf/2206.05266.pdf) | ---- | NeurIPS22 | conduct an extensive comparison of various self-supervised losses under the existing joint learning framework for pixel-based reinforcement learning in many environments from different benchmarks, including one real-world environment |
| [Reinforcement Learning with Automated Auxiliary Loss Search](https://arxiv.org/pdf/2210.06041.pdf) | A2LS | NeurIPS22 | propose to automatically search top-performing auxiliary loss functions for learning better representations in RL; define a general auxiliary loss space of size 7.5 × 10^20 based on the collected trajectory data and explore the space with an efficient evolutionary search strategy |
| [Mask-based Latent Reconstruction for Reinforcement Learning](https://arxiv.org/pdf/2201.12096.pdf) | MLR | NeurIPS22 | propose an effective self-supervised method to predict complete state representations in the latent space from observations with spatially and temporally masked pixels |
| [Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training](https://arxiv.org/pdf/2210.00030.pdf) | VIP | ICLR23 Spotlight | cast representation learning from human videos as an offline goal-conditioned reinforcement learning problem; derive a self-supervised dual goal-conditioned value-function objective that does not depend on actions, enabling pre-training on unlabeled human videos |
| [Latent Variable Representation for Reinforcement Learning](https://arxiv.org/pdf/2212.08765.pdf) | ---- | ICLR23 | provide a representation view of the latent variable models for state-action value functions, which allows both a tractable variational learning algorithm and effective implementation of the optimism/pessimism principle in the face of uncertainty for exploration |
| Spectral Decomposition Representation for Reinforcement Learning || ICLR23 ||
| [Become a Proficient Player with Limited Data through Watching Pure Videos](https://openreview.net/pdf?id=Sy-o2N0hF4f) | FICC | ICLR23 | consider the setting where the pre-training data are action-free videos; introduce a two-phase training pipeline; pre-training phase: implicitly extract the hidden action embedding from videos and pre-train the visual representation and the environment dynamics network based on vector quantization; downstream tasks: fine-tune with a small amount of task data based on the learned models |
| [Bootstrapped Representations in Reinforcement Learning](https://arxiv.org/pdf/2306.10171.pdf) | ---- | ICML23 | provide a theoretical characterization of the state representation learnt by temporal difference learning; find that this representation differs from the features learned by Monte Carlo and residual gradient algorithms for most transition structures of the environment in the policy evaluation setting |
| [Representation-Driven Reinforcement Learning](https://arxiv.org/pdf/2305.19922.pdf) | RepRL | ICML23 | reduce the policy search problem to a contextual bandit problem, using a mapping from policy space to a linear feature space |
| [Conditional Mutual Information for Disentangled Representations in Reinforcement Learning](https://arxiv.org/pdf/2305.14133.pdf) | CMID | NeurIPS23 spotlight | propose an auxiliary task for RL algorithms that learns a disentangled representation of high-dimensional observations with correlated features by minimising the conditional mutual information between features in the representation |
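CURL and several later entries build contrastive objectives over pairs of views of the same observation. A minimal, illustrative InfoNCE loss in PyTorch; CURL itself uses a bilinear similarity and a momentum-updated key encoder, so this generic cosine-similarity version is a simplification.

```python
import torch
import torch.nn.functional as F

def info_nce(queries, keys, temperature=0.1):
    """Batch-wise InfoNCE: row i of `keys` is the positive for row i of
    `queries`; every other row in the batch acts as a negative."""
    q = F.normalize(queries, dim=1)
    k = F.normalize(keys, dim=1)
    logits = q @ k.t() / temperature     # [B, B] cosine-similarity matrix
    labels = torch.arange(q.size(0))     # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```

In the RL setting, `queries` and `keys` would typically be embeddings of two random augmentations of the same stacked-frame observation.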
<a id='Unsupervised-RL'></a>
## Unsupervised Learning

|  Title | Method | Conference | Description |
| ---- | ---- | ---- | ---- |
| [Variational Intrinsic Control](https://arxiv.org/pdf/1611.07507.pdf) | ---- | arXiv1611 | introduce a new unsupervised reinforcement learning method for discovering the set of intrinsic options available to an agent, which is learned by maximizing the number of different states an agent can reliably reach, as measured by the mutual information between the set of options and option termination states |
| [Diversity is All You Need: Learning Skills without a Reward Function](https://arxiv.org/pdf/1802.06070.pdf) | DIAYN | ICLR19 | learn diverse skills in environments without any rewards by maximizing an information-theoretic objective |
| Unsupervised Control Through Non-Parametric Discriminative Rewards || ICLR19 ||
| [Dynamics-Aware Unsupervised Discovery of Skills](https://arxiv.org/pdf/1907.01657.pdf) | DADS | ICLR20 | propose to learn low-level skills using model-free RL with the explicit aim of making model-based control easy |
| [Fast task inference with variational intrinsic successor features](https://arxiv.org/pdf/1906.05030.pdf) | VISR | ICLR20 ||
| [Decoupling representation learning from reinforcement learning](https://arxiv.org/pdf/2009.08319.pdf) | ATC | ICML21 | propose a new unsupervised task tailored to reinforcement learning named Augmented Temporal Contrast (ATC), which borrows ideas from contrastive learning; benchmark several leading unsupervised learning algorithms by pre-training encoders on expert demonstrations and using them in RL agents |
| [Unsupervised Skill Discovery with Bottleneck Option Learning](https://arxiv.org/pdf/2106.14305.pdf) | IBOL | ICML21 | propose a novel skill discovery method with an information bottleneck, which provides multiple benefits, including learning skills in a more disentangled and interpretable way with respect to skill latents and being robust to nuisance information |
| [APS: Active Pretraining with Successor Features](https://arxiv.org/pdf/2108.13956.pdf) | APS | ICML21 | address the issues of APT and VISR by combining them together in a novel way |
| [Behavior From the Void: Unsupervised Active Pre-Training](https://arxiv.org/pdf/2103.04551.pdf) | APT | NeurIPS21 | propose a non-parametric entropy computed in an abstract representation space; compute the average of the Euclidean distance of each particle to its nearest neighbors for a set of samples |
| [Pretraining representations for data-efficient reinforcement learning](https://arxiv.org/pdf/2106.04799.pdf) | SGI | NeurIPS21 | consider pretraining with unlabeled data and fine-tuning on a small amount of task-specific data to improve the data efficiency of RL; employ a combination of latent dynamics modelling and unsupervised goal-conditioned RL |
| [URLB: Unsupervised Reinforcement Learning Benchmark](https://arxiv.org/pdf/2110.15191.pdf) | URLB | NeurIPS21 | a benchmark for unsupervised reinforcement learning |
| [Discovering and Achieving Goals via World Models](https://arxiv.org/pdf/2110.09514.pdf) | LEXA | NeurIPS21 | train both an explorer and an achiever policy unsupervised via imagined rollouts in world models; after the unsupervised phase, solve tasks specified as goal images zero-shot without any additional learning |
| [The Information Geometry of Unsupervised Reinforcement Learning](https://arxiv.org/pdf/2110.02719.pdf) | ---- | ICLR22 oral | show that unsupervised skill discovery algorithms based on mutual information maximization do not learn skills that are optimal for every possible reward function; provide a geometric perspective on some skill learning methods |
| [Lipschitz Constrained Unsupervised Skill Discovery](https://arxiv.org/pdf/2202.00914.pdf) | LSD | ICLR22 | argue that MI-based skill discovery methods can easily maximize the MI objective with only slight differences in the state space; propose a novel objective based on a Lipschitz-constrained state representation function so that maximizing the objective in the latent space always entails an increase in traveled distances (or variations) in the state space |
| [Learning more skills through optimistic exploration](https://arxiv.org/pdf/2107.14226.pdf) | DISDAIN | ICLR22 | derive an information-gain auxiliary objective that involves training an ensemble of discriminators and rewarding the policy for their disagreement; the objective directly estimates the epistemic uncertainty that comes from the discriminator not having seen enough training examples |
| [A Mixture of Surprises for Unsupervised Reinforcement Learning](https://arxiv.org/pdf/2210.06702.pdf) | MOSS | NeurIPS22 | train one mixture component whose objective is to maximize the surprise and another whose objective is to minimize the surprise, for handling the setting where the entropy of the environment's dynamics may be unknown |
| [Unsupervised Reinforcement Learning with Contrastive Intrinsic Control](https://arxiv.org/pdf/2202.00161.pdf) | CIC | NeurIPS22 | propose to maximize the mutual information between state transitions and latent skill vectors |
| [Unsupervised Skill Discovery via Recurrent Skill Training](https://openreview.net/pdf?id=sYDX_OxNNjh) | ReST | NeurIPS22 | encourage later-trained skills to avoid entering the same states covered by the previous skills |
| [Choreographer: Learning and Adapting Skills in Imagination](https://arxiv.org/pdf/2211.13350.pdf) | Choreographer | ICLR23 Spotlight | decouple the exploration and skill learning processes; use a meta-controller to evaluate and adapt the learned skills efficiently by deploying them in parallel in imagination |
| Provable Unsupervised Data Sharing for Offline Reinforcement Learning || ICLR23 ||
| Discovering Policies with DOMiNO: Diversity Optimization Maintaining Near Optimality || ICLR23 ||
| [Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels](https://arxiv.org/pdf/2209.12016.pdf) | Dyna-MPC | ICML23 oral | utilize unsupervised model-based RL for pre-training the agent; fine-tune downstream tasks via a task-aware fine-tuning strategy combined with a hybrid planner, Dyna-MPC |
| [On the Importance of Feature Decorrelation for Unsupervised Representation Learning in Reinforcement Learning](https://arxiv.org/pdf/2306.05637.pdf) | SimTPR | ICML23 | propose a novel URL framework that causally predicts future states while increasing the dimension of the latent manifold by decorrelating the features in the latent space |
| CLUTR: Curriculum Learning via Unsupervised Task Representation Learning || ICML23 ||
| [Controllability-Aware Unsupervised Skill Discovery](https://arxiv.org/pdf/2302.05103.pdf) | CSD | ICML23 | train a controllability-aware distance function based on the current skill repertoire and combine it with distance-maximizing skill discovery |
| [Behavior Contrastive Learning for Unsupervised Skill Discovery](https://arxiv.org/pdf/2305.04477.pdf) | BeCL | ICML23 | propose a novel unsupervised skill discovery method through contrastive learning among behaviors, which makes the agent produce similar behaviors for the same skill and diverse behaviors for different skills |
| Variational Curriculum Reinforcement Learning for Unsupervised Discovery of Skills || ICML23 ||
| [Learning to Discover Skills through Guidance](https://arxiv.org/pdf/2310.20178.pdf) | DISCO-DANCE | NeurIPS23 | select the guide skill that possesses the highest potential to reach unexplored states; guide other skills to follow the guide skill; the guided skills are dispersed to maximize their discriminability in unexplored states |
| Creating Multi-Level Skill Hierarchies in Reinforcement Learning || NeurIPS23 ||
| Unsupervised Behavior Extraction via Random Intent Priors || NeurIPS23 ||
| [METRA: Scalable Unsupervised RL with Metric-Aware Abstraction](https://arxiv.org/pdf/2310.08887.pdf) | METRA | ICLR24 oral |  |
| [Language Guided Skill Discovery](https://arxiv.org/pdf/2406.06615) | LGSD | arXiv2406 | take user prompts as input and output a set of semantically distinctive skills |
| [PEAC: Unsupervised Pre-training for Cross-Embodiment Reinforcement Learning](https://arxiv.org/pdf/2405.14073) | CEURL, PEAC | NeurIPS24 | consider unsupervised pre-training across a distribution of embodiments, namely CEURL; propose PEAC for handling CEURL |
| [Exploratory Diffusion Model for Unsupervised Reinforcement Learning](https://arxiv.org/pdf/2502.07279) | ExDM | ICLR26 oral | utilize diffusion models for boosting unsupervised exploration and for fine-tuning pre-trained diffusion policies |
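The mutual-information skill-discovery line (DIAYN onward) typically rewards the agent for visiting states from which a discriminator can recover the active skill. A minimal sketch of the DIAYN-style bonus, log q(z|s) − log p(z), assuming a uniform skill prior and a placeholder `discriminator` module mapping states to skill logits:

```python
import math

import torch
import torch.nn.functional as F

def diayn_bonus(discriminator, state, skill, n_skills):
    """Intrinsic reward log q(z|s) - log p(z), with p(z) uniform over n_skills.

    state: [batch, obs_dim]; skill: [batch] integer skill indices.
    """
    logq = F.log_softmax(discriminator(state), dim=1)   # log q(z|s) for all skills
    logq_z = logq.gather(1, skill.unsqueeze(1)).squeeze(1)
    return logq_z + math.log(n_skills)                  # - log(1/n) = + log(n)
```

The discriminator itself is trained with a standard cross-entropy loss on (state, skill) pairs collected by the skill-conditioned policy.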
<!-- ### <span id='current'>Current methods</span> -->
<a id='current'></a>
### Current methods

|  Title | Method | Conference |  Description |
| ---- | ---- | ---- | ---- |
| [Weighted importance sampling for off-policy learning with linear function approximation](https://proceedings.neurips.cc/paper/2014/file/be53ee61104935234b174e62a07e53cf-Paper.pdf) | WIS-LSTD | NeurIPS14 |  |
| [Importance Sampling Policy Evaluation with an Estimated Behavior Policy](https://arxiv.org/pdf/1806.01347.pdf) | RIS | ICML19 |  |
| [Provably efficient RL with Rich Observations via Latent State Decoding](https://arxiv.org/pdf/1901.09018.pdf) | Block MDP | ICML19 ||
| [Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO](https://arxiv.org/pdf/2005.12729.pdf) | ---- | ICLR20 | show that the performance improvements are largely due to code-level optimizations |
| [Reinforcement Learning with Augmented Data](https://arxiv.org/pdf/2004.14990.pdf) | RAD | NeurIPS20 | propose the first extensive study of general data augmentations for RL on both pixel-based and state-based inputs |
| [Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels](https://arxiv.org/pdf/2004.13649.pdf) | DrQ | ICLR21 Spotlight | propose to regularize the value function when applying data augmentation with model-free methods and reach state-of-the-art performance in image-pixels tasks |
| [What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study](https://arxiv.org/pdf/2006.05990.pdf) | ---- | ICLR21 | conduct a large-scale empirical study to evaluate different tricks for on-policy algorithms on MuJoCo |
| [Mirror Descent Policy Optimization](https://arxiv.org/pdf/2005.09814.pdf) | MDPO | ICLR21 |  |
| [Learning Invariant Representations for Reinforcement Learning without Reconstruction](https://arxiv.org/pdf/2006.10742.pdf) | DBC | ICLR21 ||
| [Randomized Ensemble Double Q-Learning: Learning Fast Without a Model](https://arxiv.org/pdf/2101.05982.pdf) | REDQ | ICLR21 | consider three ingredients: (i) update the Q functions many times at every epoch; (ii) use an ensemble of Q functions; (iii) take the minimum across a random subset of Q functions from the ensemble to avoid overestimation; propose REDQ and achieve performance similar to model-based methods |
| [Deep Reinforcement Learning at the Edge of the Statistical Precipice](https://arxiv.org/pdf/2108.13264.pdf) | ---- | NeurIPS21 outstanding paper | advocate for reporting interval estimates of aggregate performance and propose performance profiles to account for the variability in results, as well as present more robust and efficient aggregate metrics, such as interquartile mean scores, to achieve small uncertainty in results; [\[rliable\]](https://github.com/google-research/rliable/) |
| [Generalizable Episodic Memory for Deep Reinforcement Learning](https://arxiv.org/pdf/2103.06469.pdf) | GEM | ICML21 | propose to integrate the generalization ability of neural networks and the fast retrieval of episodic memory |
| [A Max-Min Entropy Framework for Reinforcement Learning](https://arxiv.org/pdf/2106.10517.pdf) | MME | NeurIPS21 | find that SAC may fail to explore states with low entropy (it tends to reach high-entropy states and further increase their entropy); propose a max-min entropy framework to address this issue |
| [SO(2)-Equivariant Reinforcement Learning](https://arxiv.org/pdf/2203.04439.pdf) | Equi DQN, Equi SAC | ICLR22 Spotlight | consider learning transformation-invariant policies and value functions; define and analyze group equivariant MDPs |
| [CoBERL: Contrastive BERT for Reinforcement Learning](https://arxiv.org/pdf/2107.05431.pdf) | CoBERL | ICLR22 Spotlight | propose Contrastive BERT for RL (CoBERL), which combines a new contrastive loss and a hybrid LSTM-transformer architecture to tackle the challenge of improving data efficiency |
| [Understanding and Preventing Capacity Loss in Reinforcement Learning](https://openreview.net/pdf?id=ZkC8wKoLbQ7) | InFeR | ICLR22 Spotlight | propose that deep RL agents lose some of their capacity to quickly fit new prediction tasks during training; propose InFeR to regularize a set of network outputs towards their initial values |
| [On Lottery Tickets and Minimal Task Representations in Deep Reinforcement Learning](https://arxiv.org/pdf/2105.01648.pdf) | ---- | ICLR22 Spotlight | consider the lottery ticket hypothesis in deep reinforcement learning |
| [Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration](https://arxiv.org/pdf/2202.04628.pdf) | LOGO | ICLR22 Spotlight | consider the sparse-reward challenge in RL; propose LOGO, which exploits offline demonstration data generated by a sub-optimal behavior policy; each step of LOGO contains a policy improvement step via TRPO and an additional policy guidance step using the sub-optimal behavior policy |
| [Sample Efficient Deep Reinforcement Learning via Uncertainty Estimation](https://arxiv.org/pdf/2201.01666.pdf) | IV-RL | ICLR22 Spotlight | analyze the sources of uncertainty in the supervision of model-free DRL algorithms, and show that the variance of the supervision noise can be estimated with negative log-likelihood and variance ensembles |
| [Generative Planning for Temporally Coordinated Exploration in Reinforcement Learning](https://arxiv.org/pdf/2201.09765.pdf) | GPM | ICLR22 Spotlight | focus on generating consistent actions for model-free RL, borrowing ideas from model-based planning and action-repeat; use the policy to generate multi-step actions |
| [When should agents explore?](https://arxiv.org/pdf/2108.11811.pdf) | ---- | ICLR22 Spotlight | consider when to explore and propose to choose a heterogeneous mode-switching behavior policy |
| [Maximizing Ensemble Diversity in Deep Reinforcement Learning](https://openreview.net/pdf?id=hjd-kcpDpf2) | MED-RL | ICLR22 |  |
| [Maximum Entropy RL (Provably) Solves Some Robust RL Problems](https://arxiv.org/pdf/2103.06257.pdf) | ---- | ICLR22 | theoretically prove that standard maximum entropy RL is robust to some disturbances in the dynamics and the reward function |
| [Learning Generalizable Representations for Reinforcement Learning via Adaptive Meta-learner of Behavioral Similarities](https://openreview.net/pdf?id=zBOI9LFpESK) | AMBS | ICLR22 |  |
| [Large Batch Experience Replay](https://arxiv.org/pdf/2110.01528.pdf) | LaBER | ICML22 oral | cast the replay buffer sampling problem as an importance sampling one for estimating the gradient and derive the theoretically optimal sampling distribution |
| [Do Differentiable Simulators Give Better Gradients for Policy Optimization?](https://arxiv.org/pdf/2202.00817.pdf) | ---- | ICML22 oral | consider whether differentiable simulators give better policy gradients; show some pitfalls of first-order estimates and propose alpha-order estimates |
| Federated Reinforcement Learning: Communication-Efficient Algorithms and Convergence Analysis || ICML22 oral ||
| [An Analytical Update Rule for General Policy Optimization](https://arxiv.org/pdf/2112.02045.pdf) | ---- | ICML22 oral | provide a tighter bound for trust-region methods |
| [Generalised Policy Improvement with Geometric Policy Composition](https://arxiv.org/pdf/2206.08736.pdf) | GSPs | ICML22 oral | propose the concept of geometric switching policy (GSP): given a set of policies, actions are taken by following each policy in turn for a number of steps sampled from a geometric distribution; consider policy improvement over non-Markov GSPs |
| [Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error](https://arxiv.org/pdf/2201.12417.pdf) | ---- | ICML22 | aim to better understand the relationship between the Bellman error and the accuracy of value functions through theoretical analysis and empirical study; point out that the Bellman error is a poor replacement for value error: (i) the magnitude of the Bellman error hides bias, (ii) missing transitions break the Bellman equation |
| [Adaptive Model Design for Markov Decision Process](https://proceedings.mlr.press/v162/chen22ab/chen22ab.pdf) | ---- | ICML22 | consider regularized Markov decision processes and formulate the problem as a bi-level optimization |
| [Stabilizing Off-Policy Deep Reinforcement Learning from Pixels](https://proceedings.mlr.press/v162/cetin22a/cetin22a.pdf) | A-LIX | ICML22 | propose that temporal-difference learning with a convolutional encoder and low-magnitude rewards causes instabilities, named catastrophic self-overfitting; propose to provide adaptive regularization to the encoder's gradients that explicitly prevents the occurrence of catastrophic self-overfitting |
| [Understanding Policy Gradient Algorithms: A Sensitivity-Based Approach](https://proceedings.mlr.press/v162/wu22i/wu22i.pdf) | ---- | ICML22 | study PG from a perturbation perspective |
| [Mirror Learning: A Unifying Framework of Policy Optimisation](https://arxiv.org/pdf/2201.02373.pdf) | Mirror Learning | ICML22 | propose a novel unified theoretical framework named Mirror Learning to provide theoretical guarantees for General Policy Improvement (GPI) and Trust-Region Learning (TRL); propose an interesting, graph-theoretical perspective on mirror learning |
| [Continuous Control with Action Quantization from Demonstrations](https://proceedings.mlr.press/v162/dadashi22a/dadashi22a.pdf) | AQuaDem | ICML22 | leverage the prior of human demonstrations to reduce a continuous action space to a discrete set of meaningful actions; point out that using a set of actions rather than a single one (behavioral cloning) captures the multimodality of behaviors in the demonstrations |
| [Off-Policy Fitted Q-Evaluation with Differentiable Function Approximators: Z-Estimation and Inference Theory](https://proceedings.mlr.press/v162/zhang22al/zhang22al.pdf) | ---- | ICML22 | analyze Fitted Q Evaluation (FQE) with general differentiable function approximators, including neural function approximators, using Z-estimation theory |
| [The Primacy Bias in Deep Reinforcement Learning](https://arxiv.org/pdf/2205.07802.pdf) | primacy bias | ICML22 | find that deep RL agents incur a risk of overfitting to earlier experiences, which negatively affects the rest of the learning process; propose a simple yet generally applicable mechanism that tackles the primacy bias by periodically resetting a part of the agent |
| [Optimizing Sequential Experimental Design with Deep Reinforcement Learning](https://arxiv.org/pdf/2202.00821.pdf) |  | ICML22 | use DRL to solve the optimal design of sequential experiments |
| [The Geometry of Robust Value Functions](https://proceedings.mlr.press/v162/wang22k/wang22k.pdf) |  | ICML22 | study the geometry of the robust value space for the more general robust MDPs |
| [Utility Theory for Markovian Sequential Decision Making](https://arxiv.org/pdf/2206.13637.pdf) | Affine-Reward MDPs | ICML22 | extend the von Neumann-Morgenstern (VNM) utility theorem to the decision-making setting |
| [Reducing Variance in Temporal-Difference Value Estimation via Ensemble of Deep Networks](https://proceedings.mlr.press/v162/liang22c/liang22c.pdf) | MeanQ | ICML22 | consider variance reduction in temporal-difference value estimation; propose MeanQ to estimate target values by ensembling |
| Unifying Approximate Gradient Updates for Policy Optimization || ICML22 ||
| [Reinforcement Learning with Neural Radiance Fields](https://arxiv.org/pdf/2206.01634.pdf) | NeRF-RL | NeurIPS22 | propose to train an encoder that maps multiple image observations to a latent space describing the objects in the scene |
| [On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting](https://arxiv.org/pdf/2206.00761.pdf) | ---- | NeurIPS22 | explore the theoretical connections between Reward Maximization (RM) and Distribution Matching (DM) |
| [Faster Deep Reinforcement Learning with Slower Online Network](https://assets.amazon.science/31/ca/0c09418b4055a7536ced1b218d72/faster-deep-reinforcement-learning-with-slower-online-network.pdf) | DQN Pro, Rainbow Pro | NeurIPS22 | incentivize the online network to remain in the proximity of the target network |
| [Reincarnating Reinforcement Learning: Reusing Prior Computation to Accelerate Progress](https://arxiv.org/pdf/2206.01626.pdf) | PVRL | NeurIPS22 | focus on reincarnating RL from any agent to any other agent; present reincarnating RL as an alternative workflow or class of problem settings, where prior computational work (e.g., learned policies) is reused or transferred between design iterations of an RL agent, or from one RL agent to another |
| [Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier](https://openreview.net/pdf?id=OpC-9aBBVJe) | SR-SAC, SR-SPR | ICLR23 oral | show that fully or partially resetting the parameters of deep reinforcement learning agents causes better replay ratio scaling capabilities to emerge |
| [Guarded Policy Optimization with Imperfect Online Demonstrations](https://arxiv.org/pdf/2303.01728.pdf) | TS2C | ICLR23 Spotlight | incorporate teacher intervention based on trajectory-based value estimation |
| [Towards Interpretable Deep Reinforcement Learning with Human-Friendly Prototypes](https://openreview.net/pdf?id=hWwY_Jq0xsN) | PW-Net | ICLR23 Spotlight | focus on making an "interpretable-by-design" deep reinforcement learning agent that is forced to use human-friendly prototypes in its decisions, making its reasoning process clear; train a "wrapper" model called PW-Net that can be added to any pre-trained agent to make it interpretable |
| [DEP-RL: Embodied Exploration for Reinforcement Learning in Overactuated and Musculoskeletal Systems](https://arxiv.org/pdf/2206.00484.pdf) | DEP-RL | ICLR23 Spotlight | identify the DEP controller, known from the field of self-organizing behavior, as generating more effective exploration than other commonly used noise processes; first to control the 7-degree-of-freedom (DoF) human arm model with RL at the muscle-stimulation level |
| [Efficient Deep Reinforcement Learning Requires Regulating Statistical Overfitting](https://arxiv.org/pdf/2304.10466.pdf) | AVTD | ICLR23 | propose a simple active model selection method (AVTD) that attempts to automatically select regularization schemes by hill-climbing on validation TD error |
| [Greedy Actor-Critic: A New Conditional Cross-Entropy Method for Policy Improvement](https://arxiv.org/pdf/1810.09103.pdf) | CCEM, GreedyAC | ICLR23 | propose to iteratively take the top percentile of actions, ranked according to the learned action-values; leverage theory for CEM to validate that CCEM concentrates on maximally valued actions across states over time |
| [Reward Design with Language Models](https://openreview.net/pdf?id=10uNUgI5Kl) | ---- | ICLR23 | explore how to simplify reward design by prompting a large language model (LLM) such as GPT-3 as a proxy reward function, where the user provides a textual prompt containing a few examples (few-shot) or a description (zero-shot) of the desired behavior |
| [Solving Continuous Control via Q-learning](https://arxiv.org/pdf/2210.12566.pdf) | DecQN | ICLR23 | combine value decomposition with bang-bang action space discretization in DQN to handle continuous control tasks; evaluate on DMControl, Meta-World, and Isaac Gym |
| [Wasserstein Auto-encoded MDPs: Formal Verification of Efficiently Distilled RL Policies with Many-sided Guarantees](https://arxiv.org/pdf/2303.12558.pdf) | WAE-MDP | ICLR23 | minimize a penalized form of the optimal transport between the behaviors of the agent executing the original policy and the distilled policy |
| [Human-level Atari 200x faster](https://arxiv.org/pdf/2209.07550.pdf) | MEME | ICLR23 | outperform the human baseline across all 57 Atari games in 390M frames; four key components: (1) an approximate trust region method which enables stable bootstrapping from the online network, (2) a normalisation scheme for the loss and priorities which improves robustness when learning a set of value functions with a wide range of scales, (3) an improved architecture employing techniques from NFNets in order to leverage deeper networks without the need for normalization layers, and (4) a policy distillation method which serves to smooth out the instantaneous greedy policy over time |
| [Improving Deep Policy Gradients with Value Function Search](https://arxiv.org/pdf/2302.10145.pdf) | VFS | ICLR23 | focus on improving value approximation and analyzing the effects on Deep PG primitives such as value prediction, variance reduction, and correlation of gradient estimates with the true gradient; show that value functions with better predictions improve Deep PG primitives, leading to better sample efficiency and policies with higher returns |
| [Memory Gym: Partially Observable Challenges to Memory-Based Agents](https://openreview.net/pdf?id=jHc8dCx6DDr) | Memory Gym | ICLR23 | a benchmark for challenging Deep Reinforcement Learning agents to memorize events across long sequences, be robust to noise, and generalize; consists of the partially observable 2D and discrete control environments Mortar Mayhem, Mystery Path, and Searing Spotlights; [\[code\]](https://github.com/MarcoMeter/drl-memory-gym/) |
| [Hybrid RL: Using both offline and online data can make RL efficient](https://arxiv.org/pdf/2210.06718.pdf) | Hy-Q | ICLR23 | focus on a hybrid setting named Hybrid RL, where the agent has both an offline dataset and the ability to interact with the environment; extend the fitted Q-iteration algorithm |
| [POPGym: Benchmarking Partially Observable Reinforcement Learning](https://arxiv.org/pdf/2303.01859.pdf) | POPGym | ICLR23 | a two-part library containing (1) a diverse collection of 15 partially observable environments, each with multiple difficulties, and (2) implementations of 13 memory model baselines; [\[code\]](https://github.com/proroklab/popgym) |
| [Critic Sequential Monte Carlo](https://arxiv.org/pdf/2205.15460.pdf) | CriticSMC | ICLR23 | combine sequential Monte Carlo with learned Soft-Q function heuristic factors |
| [Planning-oriented Autonomous Driving](https://arxiv.org/pdf/2212.10156.pdf) || CVPR23 best paper ||
| [On the Reuse Bias in Off-Policy Reinforcement Learning](https://arxiv.org/pdf/2209.07074.pdf) | BIRIS | IJCAI23 | discuss the bias of off-policy evaluation due to reusing the replay buffer; derive a high-probability bound of the Reuse Bias; introduce the concept of stability for off-policy algorithms and provide an upper bound for stable off-policy algorithms |
| [The Dormant Neuron Phenomenon in Deep Reinforcement Learning](https://arxiv.org/pdf/2302.12902.pdf) | ReDo | ICML23 oral | understand the underlying reasons behind the loss of expressivity during the training of RL agents; demonstrate the existence of the dormant neuron phenomenon in deep RL; propose Recycling Dormant neurons (ReDo) to reduce the number of dormant neurons and maintain network expressivity during training |
| [Efficient RL via Disentangled Environment and Agent Representations](https://openreview.net/pdf?id=kWS8mpioS9) | SEAR | ICML23 oral | consider building a representation that can disentangle a robotic agent from its environment to improve the learning efficiency of RL; augment the RL loss with an agent-centric auxiliary loss |
| [On the Statistical Benefits of Temporal Difference Learning](https://arxiv.org/pdf/2301.13289.pdf) | ---- | ICML23 oral | provide crisp insight into the statistical benefits of TD |
| [Settling the Reward Hypothesis](https://arxiv.org/pdf/2212.10420.pdf) | ---- | ICML23 oral | provide a treatment of the reward hypothesis both in the setting where goals are the subjective desires of the agent and in the setting where goals are the objective desires of an agent designer |
| [Learning Belief Representations for Partially Observable Deep RL](https://openreview.net/pdf?id=4IzEmHLono) | Believer | ICML23 | decouple belief state modelling (via unsupervised learning) from policy optimization (via RL); propose a representation learning approach to capture a compact set of reward-relevant features of the state |
| [Internally Rewarded Reinforcement Learning](https://arxiv.org/pdf/2302.00270.pdf) | IRRL | ICML23 | study a class of reinforcement learning problems where the reward signals for policy learning are generated by an internal reward model that is dependent on and jointly optimized with the policy; theoretically derive and empirically analyze the effect of the reward function in IRRL and, based on these analyses, propose the clipped linear reward function |
| [Hyperparameters in Reinforcement Learning and How To Tune Them](https://arxiv.org/pdf/2306.01324.pdf) | ---- | ICML23 | explore the hyperparameter landscape for commonly used RL algorithms and environments; compare different types of HPO methods on state-of-the-art RL algorithms and challenging RL environments |
| Langevin Thompson Sampling with Logarithmic Communication: Bandits and Reinforcement Learning || ICML23 ||
| [Correcting discount-factor mismatch in on-policy policy gradient methods](https://arxiv.org/pdf/2306.13284.pdf) | ---- | ICML23 | introduce a novel distribution correction to account for the discounted stationary distribution |
| [Reinforcement Learning Can Be More Efficient with Multiple Rewards](https://openreview.net/pdf?id=skDVsmXjPR) | ---- | ICML23 | theoretically analyze multi-reward extensions of action-elimination algorithms and prove more favorable instance-dependent regret bounds compared to their single-reward counterparts, both in multi-armed bandits and in tabular Markov decision processes |
| [Performative Reinforcement Learning](https://arxiv.org/pdf/2207.00046.pdf) | ---- | ICML23 | introduce the framework of performative reinforcement learning, where the policy chosen by the learner affects the underlying reward and transition dynamics of the environment |
| [Reinforcement Learning with History Dependent Dynamic Contexts](https://arxiv.org/pdf/2302.02061.pdf) | DCMDPs | ICML23 | introduce DCMDPs, a novel reinforcement learning framework for history-dependent environments that handles non-Markov environments, where contexts change over time; derive an upper-confidence-bound style algorithm for logistic DCMDPs |
| [On Many-Actions Policy Gradient](https://arxiv.org/pdf/2210.13011.pdf) | MBMA | ICML23 | propose MBMA, an approach leveraging dynamics models for many-actions sampling in the context of stochastic policy gradient (SPG), which yields lower bias and comparable variance to SPG estimated from states in model-simulated rollouts |
| [Scaling Laws for Reward Model Overoptimization](https://openreview.net/attachment?id=bBLjms8nZE&name=pdf) | ---- | ICML23 | study overoptimization in the context of large language models fine-tuned as reward models trained to predict which of two options a human will prefer; study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-n sampling |
| [Bigger, Better, Faster: Human-level Atari with human-level efficiency](https://arxiv.org/pdf/2305.19452.pdf) | BBF | ICML23 | rely on scaling the neural networks used for value estimation and a number of other design choices, such as resetting |
| [Synthetic Experience Replay](https://arxiv.org/pdf/2303.06614.pdf) | SynthER | NeurIPS23 | utilize diffusion to augment data in the replay buffer; evaluate in both online RL and offline RL |
| [OMPO: A Unified Framework for RL under Policy and Dynamics Shifts](https://openreview.net/attachment?id=R83VIZtHXA&name=pdf) | OMPO | ICML24 oral | consider the distribution discrepancies induced by policy or dynamics shifts; propose a surrogate policy learning objective by considering the transition occupancy discrepancies and then cast it into a tractable min-max optimization problem through dual reformulation |
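Among the recurring ideas in this table, REDQ's in-target minimization over a random subset of an ensemble of critics is easy to isolate. A minimal, illustrative sketch under assumed shapes; the ensemble values are presumed already evaluated at policy-sampled next actions, and REDQ's high update-to-data ratio and entropy term are not shown.

```python
import random

import torch

def redq_target(reward, done, next_q_ensemble, gamma=0.99, subset_size=2):
    """REDQ-style bootstrap: min over a random subset of target critics.

    next_q_ensemble: list of [batch] tensors, one per ensemble member,
    each evaluated at s' with an action sampled from the current policy.
    """
    picked = random.sample(next_q_ensemble, subset_size)
    min_q = torch.min(torch.stack(picked, dim=0), dim=0).values
    return reward + gamma * (1.0 - done) * min_q
```

All critics in the ensemble are then regressed toward this shared target, which is what keeps the overestimation bias controlled while allowing many gradient updates per environment step.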
This yields lower bias and comparable variance relative to SPG estimated from states in model-simulated rollouts |\n| [Scaling Laws for Reward Model Overoptimization](https:\u002F\u002Fopenreview.net\u002Fattachment?id=bBLjms8nZE&name=pdf) | ---- | ICML23 | study overoptimization in the context of large language models fine-tuned as reward models trained to predict which of two options a human will prefer; study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-n sampling |\n| [Bigger, Better, Faster: Human-level Atari with human-level efficiency](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.19452.pdf) | BBF | ICML23 | rely on scaling the neural networks used for value estimation and a number of other design choices like resetting |\n| [Synthetic Experience Replay](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.06614.pdf) | SynthER | NeurIPS23 | utilize diffusion to augment data in the replay buffer; evaluate in both online RL and offline RL |\n| [OMPO: A Unified Framework for RL under Policy and Dynamics Shifts](https:\u002F\u002Fopenreview.net\u002Fattachment?id=R83VIZtHXA&name=pdf) | OMPO | ICML24 oral | consider the distribution discrepancies induced by policy or dynamics shifts; propose a surrogate policy learning objective by considering the transition occupancy discrepancies and then cast it into a tractable min-max optimization problem through dual reformulation |
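\n\nAs a concrete illustration of the ReDo row above, a minimal PyTorch sketch of the dormant-neuron recycling idea, loosely following the paper's description (the threshold `tau`, the `layer`, `next_layer`, and the batch of `activations` are illustrative assumptions):\n\n```python\nimport torch\nimport torch.nn as nn\n\n@torch.no_grad()\ndef recycle_dormant(layer: nn.Linear, next_layer: nn.Linear,\n                    activations: torch.Tensor, tau: float = 0.025) -> int:\n    # Score each neuron by its mean absolute activation over a batch,\n    # normalized by the layer-wide average; a neuron counts as dormant\n    # if its normalized score falls below tau.\n    score = activations.abs().mean(dim=0)\n    score = score.div(score.mean() + 1e-9)\n    dormant = score <= tau\n    if dormant.any():\n        # Re-initialize the incoming weights of dormant neurons ...\n        fresh = torch.empty_like(layer.weight)\n        nn.init.kaiming_uniform_(fresh)\n        layer.weight[dormant] = fresh[dormant]\n        if layer.bias is not None:\n            layer.bias[dormant] = 0.0\n        # ... and zero their outgoing weights, so the recycled neurons do\n        # not perturb the current policy until they are trained again.\n        next_layer.weight[:, dormant] = 0.0\n    return int(dormant.sum())\n```\n\n\u003Ca id='Model-Based-Online'>\u003C\u002Fa>\n## Model Based (Online) RL\n\n\u003Ca id='model-based-classic'>\u003C\u002Fa>\n### Classic Methods\n|  Title | Method | Conference |  Description |\n| ----  | ----   | ----       |   ----  |\n| [Value-Aware Loss Function for Model-based Reinforcement Learning](http:\u002F\u002Fproceedings.mlr.press\u002Fv54\u002Ffarahmand17a\u002Ffarahmand17a-supp.pdf) | VAML | AISTATS17 | propose to train the model with a value-aware loss, the difference between the expected next-state values under the learned and the true dynamics, rather than KL-divergence |\n| [Model-Ensemble Trust-Region Policy Optimization](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1802.10592.pdf) | ME-TRPO | ICLR18 | analyze the behavior of vanilla MBRL methods with DNN; propose ME-TRPO with two ideas: (i) use an ensemble of models, (ii) use likelihood ratio derivatives; significantly reduce the sample complexity compared to model-free methods |\n| [Model-Based Value Expansion for Efficient Model-Free Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1803.00101.pdf) | MVE | ICML18 | use a dynamics model to simulate the short-term horizon and Q-learning to estimate the long-term value beyond the simulation horizon; use the trained model and the policy to estimate a k-step value function for updating the value function |\n| [Iterative value-aware model learning](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper\u002F2018\u002Ffile\u002F7a2347d96752880e3d58d72e9813cc14-Paper.pdf) | IterVAML | NeurIPS18 | replace the supremum in VAML with the current estimate of the value function |\n| [Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1807.01675.pdf) | STEVE | NeurIPS18 | an extension to MVE; dynamically interpolate between rollout horizons so as to only utilize roll-outs that do not introduce significant errors |\n| [Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1805.12114.pdf) | PETS | NeurIPS18 | propose PETS that incorporates uncertainty via an ensemble of bootstrapped models |\n| [Algorithmic Framework for 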
Model-based Deep Reinforcement Learning with Theoretical Guarantees](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1807.03858.pdf)  | SLBO | ICLR19 | propose a novel algorithmic framework for designing and analyzing model-based RL algorithms with theoretical guarantees: provide a lower bound of the true return satisfying some properties s.t. optimizing this lower bound can actually optimize the true return |\n| [When to Trust Your Model: Model-Based Policy Optimization](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1906.08253.pdf) | MBPO | NeurIPS19  | propose MBPO with monotonic model-based improvement; theoretically discuss how to choose k for model rollouts |\n| [Model Based Reinforcement Learning for Atari](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1903.00374.pdf) | SimPLe | ICLR20 | first successfully handle the ALE benchmark with a model-based method via several designs: (i) deterministic model; (ii) well-designed loss functions; (iii) scheduled sampling; (iv) stochastic models |\n| [Bidirectional Model-based Policy Optimization](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2007.01995.pdf) | BMPO | ICML20 | an extension to MBPO; consider both forward dynamics model and backward dynamics model |\n| [Context-aware Dynamics Model for Generalization in Model-Based Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2005.06800.pdf) | CaDM | ICML20 |  develop a context-aware dynamics model (CaDM) capable of generalizing across a distribution of environments with varying transition dynamics; introduce a backward dynamics model that predicts a previous state by utilizing a context latent vector |\n| [A Game Theoretic Framework for Model Based Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2004.07804.pdf) | PAL, MAL | ICML20 | develop a novel framework that casts MBRL as a game between a policy player and a model player; set up a Stackelberg game between the two players |\n| [Planning to Explore via Self-Supervised World Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2005.05960.pdf) | Plan2Explore | ICML20 | propose a self-supervised reinforcement learning agent for addressing two challenges: quick adaptation and expected future novelty |\n| [Trust the Model When It Is Confident: Masked Model-based Actor-Critic](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2010.04893.pdf)| M2AC | NeurIPS20 | an extension to MBPO; use model rollouts only when the model is confident |\n| [The LoCA Regret: A Consistent Metric to Evaluate Model-Based Behavior in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2007.03158.pdf) | LoCA | NeurIPS20 | propose LoCA to measure how quickly a method adapts its policy after the environment is changed from the first task to the second |\n| [Generative Temporal Difference Learning for Infinite-Horizon Prediction](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2010.14496.pdf) | GHM, or gamma-model | NeurIPS20 | propose gamma-model to make long-horizon predictions without the need to repeatedly apply a single-step model |\n| [Models, Pixels, and Rewards: Evaluating Design Trade-offs in Visual Model-Based Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2012.04603.pdf) | ---- | arXiv2012 | study a number of design decisions for the predictive model in visual MBRL algorithms, focusing specifically on methods that use a predictive model for planning |\n| [Mastering Atari Games with Limited Data](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.00210.pdf) | EfficientZero | NeurIPS21 | first achieve super-human performance on Atari games with 
limited data; propose EfficientZero with three components: (i) use self-supervised learning to learn a temporally consistent environment model, (ii) learn the value prefix in an end-to-end manner, (iii) use the learned model to correct off-policy value targets |\n| [On Effective Scheduling of Model-based Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.08550.pdf) | AutoMBPO | NeurIPS21 | an extension to MBPO; automatically schedule the real data ratio as well as other hyperparameters for MBPO |\n| [Model-Advantage and Value-Aware Models for Model-Based Reinforcement Learning: Bridging the Gap in Theory and Practice](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.14080.pdf) | ---- | arxiv22 | bridge the gap in theory and practice of value-aware model learning (VAML) for model-based RL |\n| [Value Gradient weighted Model-Based Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01464.pdf) | VaGraM | ICLR22 Spotlight | consider the objective mismatch problem in MBRL; propose VaGraM by rescaling the MSE loss function with gradient information from the current value function estimate |\n| [Constrained Policy Optimization via Bayesian World Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.09802.pdf) | LAMBDA | ICLR22 Spotlight | consider Bayesian model-based methods for CMDP |\n| [On-Policy Model Errors in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.07985.pdf) | OPC | ICLR22 | consider combining real-world data and a learned model in order to get the best of both worlds; propose to exploit the real-world data for on-policy predictions and use the learned model only to generalize to different actions; propose to use on-policy transition data on top of a separately learned model to enable accurate long-term predictions for MBRL |\n| [Temporal Difference Learning for Model Predictive Control](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04955.pdf) | TD-MPC | ICML22 | propose to use the model only to predict reward; use a policy to accelerate the planning |\n| [Causal Dynamics Learning for Task-Independent State Abstraction](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.13452.pdf) |  | ICML22 |  |\n| [Mismatched no More: Joint Model-Policy Optimization for Model-Based RL](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.02758.pdf) | MnM | NeurIPS22 | propose a model-based RL algorithm where the model and policy are jointly optimized with respect to the same objective, which is a lower bound on the expected return under the true environment dynamics, and becomes tight under certain assumptions |\n| [Reinforcement Learning with Non-Exponential Discounting](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13413.pdf) | ---- | NeurIPS22 | propose a theory for continuous-time model-based reinforcement learning generalized to arbitrary discount functions; derive a Hamilton–Jacobi–Bellman (HJB) equation characterizing the optimal policy and describe how it can be solved using a collocation method |\n| [Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08466.pdf) | ALM | ICLR23 | propose a single objective which jointly optimizes the policy, the latent-space model, and the representations produced by the encoder using the same objective: maximize predicted rewards while minimizing the errors in the predicted representations |\n| [SpeedyZero: Mastering Atari with Limited Data and 
Time](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Mg5CLXZgvLJ) | SpeedyZero | ICLR23 | a distributed RL system built upon EfficientZero with Priority Refresh and Clipped LARS; lead to human-level performance on the Atari benchmark within 35 minutes using only 300k samples |\n| Investigating the role of model-based learning in exploration and transfer || ICML23 ||\n| [STEERING: Stein Information Directed Exploration for Model-Based Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12038.pdf) | STEERING | ICML23 |  |\n| [Predictable MDP Abstraction for Unsupervised Model-Based RL](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03921.pdf) | PMA | ICML23 | apply model-based RL on top of an abstracted, simplified MDP, by restricting unpredictable actions |\n| The Virtues of Laziness in Model-based RL: A Unified Objective and Algorithms || ICML23 ||\n| [Stop Regressing: Training Value Functions via Classification for Scalable Deep RL](https:\u002F\u002Fopenreview.net\u002Fattachment?id=dVpFKfqF3R&name=pdf) | HL-Gauss | ICML24 oral | show that training value functions with categorical cross-entropy significantly enhances performance and scalability across various domains, including single-task RL on Atari 2600 games, multi-task RL on Atari with large-scale ResNets, robotic manipulation with Q-transformers, playing Chess without search, and a language-agent Wordle task with high-capacity Transformers, achieving state-of-the-art results on these domains |\n| [Trust the Model Where It Trusts Itself: Model-Based Actor-Critic with Uncertainty-Aware Rollout Adaption](https:\u002F\u002Fopenreview.net\u002Fattachment?id=N0ntTjTfHb&name=pdf) | MACURA | ICML24 | propose an easy-to-tune mechanism for model-based rollout length scheduling |
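\n\nAs a worked example of the short-horizon value expansion used by MVE and STEVE above, a small Python sketch (the callables `model`, `reward_fn`, `value_fn` and `policy` are hypothetical stand-ins for learned components):\n\n```python\ndef value_expansion_target(model, reward_fn, value_fn, policy,\n                           state, horizon=3, gamma=0.99):\n    # Roll the learned dynamics model forward for `horizon` steps under the\n    # current policy, accumulating predicted rewards, then bootstrap with\n    # the value function beyond the simulated horizon (an MVE-style target).\n    target, discount = 0.0, 1.0\n    for _ in range(horizon):\n        action = policy(state)\n        target = target + discount * reward_fn(state, action)\n        state = model(state, action)\n        discount = discount * gamma\n    return target + discount * value_fn(state)\n```\n\n\n\u003Ca id='dreamer'>\u003C\u002Fa>\n### World Models\n\n|  Title | Method | Conference |  Description |\n| ----  | ----   | ----       |   ----  |\n| [World Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1803.10122.pdf), [\[NeurIPS version\]](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper\u002F2018\u002Ffile\u002F2de5d16682c3c35007e4e92982f1a2ba-Paper.pdf) | World Models | NeurIPS18 | learn a compressed spatial and temporal representation of the environment in an unsupervised manner and use the world model to train a very compact and simple policy for solving the required task |\n| [Learning latent dynamics for planning from pixels](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1811.04551.pdf) | PlaNet | ICML19 | propose PlaNet to learn the environment dynamics from images; the dynamics model consists of a transition model, observation model, reward model and encoder; use the cross entropy method for selecting actions for planning |\n| [Dream to Control: Learning Behaviors by Latent Imagination](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1912.01603.pdf) | Dreamer | ICLR20 | solve long-horizon tasks from images purely by latent imagination; test in image-based MuJoCo; propose to use a learned agent to replace the control algorithm in PlaNet |\n| [Bridging Imagination and Reality for Model-Based Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2010.12142.pdf) | BIRD | NeurIPS20 | propose to maximize the mutual information between imaginary and real trajectories so that the policy improvement learned from imaginary trajectories can be easily generalized to real trajectories |\n| [Planning to Explore via Self-Supervised World Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2005.05960.pdf) | Plan2Explore | ICML20 | propose 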
Plan2Explore for self-supervised exploration and fast adaptation to new tasks |\n| [Mastering Atari with Discrete World Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2010.02193.pdf) | Dreamerv2 | ICLR21 | solve long-horizon tasks from images purely by latent imagination; test in image-based Atari |\n| [Temporal Predictive Coding For Model-Based Planning In Latent Space](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.07156.pdf) | TPC | ICML21 | propose a temporal predictive coding approach for planning from high-dimensional observations and theoretically analyze its ability to prioritize the encoding of task-relevant information |\n| [Learning Task Informed Abstractions](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.15612.pdf) | TIA | ICML21 | introduce the formalism of Task Informed MDP (TiMDP) that is realized by training two models that learn visual features via cooperative reconstruction, but one model is adversarially dissociated from the reward signal |\n| [Dreaming: Model-based Reinforcement Learning by Latent Imagination without Reconstruction](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2007.14535.pdf) | Dreaming | ICRA21 | propose a decoder-free extension of Dreamer since the autoencoding-based approach often causes object vanishing |\n| [Model-Based Reinforcement Learning via Imagination with Derived Memory](https:\u002F\u002Fopenreview.net\u002Fpdf?id=jeATherHHGj) | IDM | NeurIPS21 | hope to improve the diversity of imagination for model-based policy optimization with the derived memory; point out that current methods cannot effectively enrich the imagination if the latent state is disturbed by random noises |\n| [Maximum Entropy Model-based Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01195.pdf) |  MaxEnt Dreamer | NeurIPS21 | create a connection between exploration methods and model-based reinforcement learning; apply maximum-entropy exploration to Dreamer |\n| [Discovering and Achieving Goals via World Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.09514.pdf) | LEXA | NeurIPS21 | train both an explorer and an achiever policy in an unsupervised manner via imagined rollouts in world models; after the unsupervised phase, solve tasks specified as goal images zero-shot without any additional learning |\n| [TransDreamer: Reinforcement Learning with Transformer World Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.09481.pdf) | TransDreamer | arxiv2202 | replace the RNN in the RSSM with a transformer |\n| [DreamerPro: Reconstruction-Free Model-Based Reinforcement Learning with Prototypical Representations](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fdeng22a\u002Fdeng22a.pdf) | DreamerPro | ICML22 | consider reconstruction-free MBRL; propose to learn the prototypes from the recurrent states of the world model, thereby distilling temporal structures from past observations and actions into the prototypes. 
|\n| [Towards Evaluating Adaptivity of Model-Based Reinforcement Learning Methods](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fwan22d\u002Fwan22d.pdf) | ---- | ICML22 | introduce an improved version of the LoCA setup and use it to evaluate PlaNet and Dreamerv2 |\n| [Reinforcement Learning with Action-Free Pre-Training from Videos](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13880.pdf) | APV | ICML22 | pre-train an action-free latent video prediction model using videos from different domains, and then fine-tune the pre-trained model on target domains |\n| [Denoised MDPs: Learning World Models Better Than the World Itself](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.15477.pdf) | Denoised MDP | ICML22 | divide information into four categories: controllable\u002Funcontrollable (whether affected by the action) and reward-relevant\u002Firrelevant (whether it affects the return); propose to only consider information which is controllable and reward-relevant |\n| [DreamingV2: Reinforcement Learning with Discrete World Models without Reconstruction](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.00494.pdf) | Dreamingv2 | arxiv2203 | adopt both the discrete representation of DreamerV2 and the reconstruction-free objective of Dreaming |\n| [Masked World Models for Visual Control](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.14244.pdf) | MWM | arxiv2206 | decouple visual representation learning and dynamics learning for visual model-based RL and use a masked autoencoder to train visual representations |\n| [DayDreamer: World Models for Physical Robot Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.14176.pdf) | DayDreamer | arxiv2206 | apply Dreamer to 4 robots to learn online and directly in the real world, without any simulators |\n| [Iso-Dream: Isolating Noncontrollable Visual Dynamics in World Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13817.pdf) | Iso-Dream | NeurIPS22 | consider noncontrollable dynamics independent of the action signals; encourage the world model to learn controllable and noncontrollable sources of spatiotemporal changes on isolated state transition branches; optimize the behavior of the agent on the decoupled latent imaginations of the world model |\n| [Learning General World Models in a Handful of Reward-Free Deployments](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12719.pdf) | CASCADE | NeurIPS22 | introduce the reward-free deployment efficiency setting to facilitate generalization (exploration should be task agnostic) and scalability (exploration policies should collect large quantities of data without costly centralized retraining); propose an information theoretic objective inspired by Bayesian Active Learning by specifically maximizing the diversity of trajectories sampled by the population through a novel cascading objective |\n| [Learning Robust Dynamics through Variational Sparse Gating](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11698.pdf) | VSG, SVSG, BBS | NeurIPS22 | consider sparsely updating the latent states at each step; develop a new partially-observable and stochastic environment, called BringBackShapes (BBS) |\n| [Transformers are Sample Efficient World Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00588.pdf) | IRIS | ICLR23 oral | use a discrete autoencoder and an autoregressive Transformer to build world models and significantly improve the data efficiency in Atari (2 hours of real-time experience); [\[code\]](https:\u002F\u002Fgithub.com\u002Feloialonso\u002Firis) |\n| [Transformer-based World Models 
Are Happy With 100k Interactions](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.07109.pdf) | TWM | ICLR23 | present a new autoregressive world model based on the Transformer-XL; obtain excellent results on the Atari 100k benchmark; [\[code\]](https:\u002F\u002Fgithub.com\u002Fjrobine\u002Ftwm) |\n| [Dynamic Update-to-Data Ratio: Minimizing World Model Overfitting](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.10144.pdf) | DUTD | ICLR23 | propose a new general method that dynamically adjusts the update-to-data (UTD) ratio during training based on under- and overfitting detection on a small subset of the continuously collected experience not used for training; apply this method in DreamerV2 |\n| [Evaluating Long-Term Memory in 3D Mazes](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13383.pdf) | Memory Maze | ICLR23 | introduce the Memory Maze, a 3D domain of randomized mazes specifically designed for evaluating long-term memory in agents, including an online reinforcement learning benchmark, a diverse offline dataset, and an offline probing evaluation; [\[code\]](https:\u002F\u002Fgithub.com\u002Fjurgisp\u002Fmemory-maze) |\n| [Mastering Diverse Domains through World Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.04104.pdf) | DreamerV3 | arxiv2301 | propose DreamerV3 to handle a wide range of domains, including continuous and discrete actions, visual and low-dimensional inputs, 2D and 3D worlds, different data budgets, reward frequencies, and reward scales |\n| [Task Aware Dreamer for Task Generalization in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.05092.pdf) | TAD | arXiv2303 | propose Task Distribution Relevance to capture the relevance of the task distribution quantitatively; propose TAD to use world models to improve task generalization via encoding reward signals into policies |\n| [Reparameterized Policy Learning for Multimodal Trajectory Optimization](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.10710.pdf) | RPG | ICML23 oral | propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories; present RPG to leverage the multimodal policy parameterization and learned world model to achieve strong exploration capabilities and high data efficiency |\n| [Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12016.pdf) | Dyna-MPC | ICML23 oral | utilize unsupervised model-based RL for pre-training the agent; finetune downstream tasks via a task-aware finetuning strategy combined with a hybrid planner, Dyna-MPC |\n| [Posterior Sampling for Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.00477.pdf) | PSDRL | ICML23 | combine efficient uncertainty quantification over latent state space models with a specially tailored continual planning algorithm based on value-function approximation |\n| [Model-based Reinforcement Learning with Scalable Composite Policy Gradient Estimators](https:\u002F\u002Fopenreview.net\u002Fpdf?id=rDMAJECBM2) | TPX | ICML23 | propose Total Propagation X, the first composite gradient estimation algorithm using inverse variance weighting that is demonstrated to be applicable at scale; combine TPX with Dreamer |\n| [Go Beyond Imagination: Maximizing Episodic Reachability with World Models](https:\u002F\u002Fopenreview.net\u002Fpdf?id=JsAMuzA9o2) | GoBI | ICML23 | combine the traditional lifelong novelty motivation with an episodic intrinsic reward that is designed to maximize the stepwise 
reachability expansion; apply learned world models to generate predicted future states with random actions |\n| [Simplified Temporal Consistency Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.09466.pdf) | TCRL | ICML23 | propose a simple representation learning approach relying only on a latent dynamics model trained with latent temporal consistency, which is sufficient for high-performance RL |\n| [Do Embodied Agents Dream of Pixelated Sheep: Embodied Decision Making using Language Guided World Modelling](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12050.pdf) | DECKARD | ICML23 | hypothesize an Abstract World Model (AWM) over subgoals by few-shot prompting an LLM |\n| Demonstration-free Autonomous Reinforcement Learning via Implicit and Bidirectional Curriculum || ICML23 ||\n| [Curious Replay for Model-based Adaptation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.15934.pdf) | CR | ICML23 | aid model-based RL agent adaptation by prioritizing replay of experiences the agent knows the least about |\n| [Multi-View Masked World Models for Visual Robotic Manipulation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02408.pdf) | MV-MWM | ICML23 | train a multi-view masked autoencoder that reconstructs pixels of randomly masked viewpoints and then learn a world model operating on the representations from the autoencoder |\n| [Facing off World Model Backbones: RNNs, Transformers, and S4](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.02064.pdf) | S4WM | NeurIPS23 | propose the first S4-based world model that can generate high-dimensional image sequences through latent imagination |
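\n\nA minimal sketch of the latent-imagination loop shared by the Dreamer family above, assuming hypothetical `rssm_step` (recurrent state-space transition) and `actor` networks:\n\n```python\nimport torch\n\ndef imagine_rollout(rssm_step, actor, h, z, horizon=15):\n    # Dreamer-style imagination: roll the world model forward purely in\n    # latent space with actions from the current policy; no decoder or\n    # real environment steps are needed for policy optimization.\n    trajectory = []\n    for _ in range(horizon):\n        feat = torch.cat([h, z], dim=-1)\n        action = actor(feat)\n        h, z = rssm_step(h, z, action)  # hypothetical RSSM transition\n        trajectory.append((feat, action))\n    return trajectory\n```\n\n\n\n\u003Ca id='model-based-code'>\u003C\u002Fa>\n### CodeBase\n\n|  Title | Conference | Methods |  Github |\n| ---- | ---- | ---- | ---- |\n| [MBRL-Lib: A Modular Library for Model-based Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2104.10159.pdf) | arxiv21 | MBPO, PETS, PlaNet | [link](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fmbrl-lib) |\n\n\n\n\u003Ca id='Model-Free-Offline'>\u003C\u002Fa>\n## (Model Free) Offline RL\n\n\u003Ca id='offline-current'>\u003C\u002Fa>\n### Current Methods\n\n|  Title | Method | Conference | Description |\n| ----  | ----   | ----       |   ----  |\n| [Off-Policy Deep Reinforcement Learning without Exploration](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1812.02900.pdf) | BCQ | ICML19 | show that off-policy methods perform badly because of extrapolation error; propose batch-constrained reinforcement learning: maximizing the return as well as minimizing the mismatch between the state-action visitation of the policy and the state-action pairs contained in the batch |\n| [Conservative Q-Learning for Offline Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2006.04779.pdf) | CQL | NeurIPS20 | propose CQL with a conservative Q-function, which is a lower bound of its true value, since standard off-policy methods will overestimate the value function |\n| [Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2005.01643.pdf) | ---- | arxiv20 | tutorial about methods, applications and open problems of offline rl |\n| [Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.01548.pdf) |  | NeurIPS21 |  |\n| [A Minimalist Approach to Offline Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.06860.pdf) | TD3+BC | NeurIPS21 | propose to add a behavior cloning term to 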
regularize the policy, and normalize the states over the dataset |\n| [DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04716.pdf) | DR3 | ICLR22 Spotlight | consider the implicit regularization effect of SGD in RL; based on theoretical analyses, propose an explicit regularizer, called DR3, and combine it with offline RL methods |\n| [Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning ](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.11566.pdf) | PBRL | ICLR22 Spotlight | consider the distributional shift and extrapolation error in offline RL; propose PBRL with bootstrapping for uncertainty quantification and an OOD sampling method as a regularizer |\n| [COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation](https:\u002F\u002Fopenreview.net\u002Fpdf?id=FLA55mBee6Q) | COptiDICE | ICLR22 Spotlight | consider offline constrained reinforcement learning; propose COptiDICE to directly optimize the distribution of state-action pairs with constraints |\n| [Offline Reinforcement Learning with Value-based Episodic Memory](https:\u002F\u002Fopenreview.net\u002Fpdf?id=RCZqv9NXlZ) | EVL, VEM | ICLR22 | present a new offline V-learning method to learn the value function through the trade-offs between imitation learning and optimal value learning; use a memory-based planning scheme to enhance advantage estimation and conduct policy learning in a regression manner |\n| [Offline reinforcement learning with implicit Q-learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.06169.pdf) | IQL | ICLR22 | propose to learn an optimal policy with in-sample learning, without ever querying the values of any unseen actions |\n| [Offline RL Policies Should Be Trained to be Adaptive](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02200.pdf) | APE-V | ICML22 oral | show that learning from an offline dataset does not fully specify the environment; formally demonstrate the necessity of adaptation in offline RL by using the Bayesian formalism and provide a practical algorithm for learning optimally adaptive policies; propose an ensemble-based offline RL algorithm that imbues policies with the ability to adapt within an episode |\n| [When Data Geometry Meets Deep Function: Generalizing Offline Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.11027.pdf) | DOGE | ICLR23 | train a state-conditioned distance function that can be readily plugged into standard actor-critic methods as a policy constraint |\n| [Jump-Start Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.02372.pdf) | JSRL | ICML23 | consider the setting that employs two policies to solve tasks: a guide-policy, and an exploration-policy; bootstrap an RL algorithm by gradually “rolling in” with the guide-policy |
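\n\nThe TD3+BC entry above is simple enough to state directly in code; a sketch of its actor loss (`critic` and `actor` are assumed networks, and `alpha=2.5` is the paper's default coefficient):\n\n```python\nimport torch\nimport torch.nn.functional as F\n\ndef td3_bc_actor_loss(critic, actor, states, actions, alpha=2.5):\n    # Maximize Q under the policy while a behavior-cloning MSE term keeps\n    # the policy close to the dataset actions; lam rescales the Q term so\n    # both terms have comparable magnitude.\n    pi = actor(states)\n    q = critic(states, pi)\n    lam = alpha * q.abs().mean().detach().reciprocal()\n    return -(lam * q).mean() + F.mse_loss(pi, actions)\n```\n\n\u003Ca id='offline-diffusion'>\u003C\u002Fa>\n### Combined with Diffusion Models\n\n|  Title | Method | Conference | Description |\n| ----  | ----   | ----       |   ----  |\n| [Planning with Diffusion for Flexible Behavior Synthesis](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09991.pdf) | Diffuser | ICML22 oral | first propose a denoising diffusion model designed for trajectory data and an associated probabilistic framework for behavior synthesis; demonstrate that Diffuser has a number of useful properties and is particularly effective in offline control settings that require long-horizon reasoning and test-time flexibility |\n| Is Conditional 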
Generative Modeling all you need for Decision Making? || ICLR23 oral ||\n| [Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.06193.pdf) | Diffusion-QL | ICLR23 | perform policy regularization using diffusion (or score-based) models; utilize a conditional diffusion model to represent the policy |\n| [Offline Reinforcement Learning via High-Fidelity Generative Behavior Modeling](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14548.pdf) | SfBC | ICLR23 | decouple the learned policy into two parts: an expressive generative behavior model and an action evaluation model |\n| [AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01877.pdf) | AdaptDiffuser | ICML23 oral | propose AdaptDiffuser, an evolutionary planning method with diffusion that can self-evolve to improve the diffusion model and hence the planner, which can also adapt to unseen tasks |\n| [Energy-Guided Diffusion Sampling for Offline-to-Online Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=hunSEjeCPE&name=pdf) | EDIS | ICML24 | utilize a diffusion model to extract prior knowledge from the offline dataset and employ energy functions to distill this knowledge for enhanced data generation in the online phase; formulate three distinct energy functions to guide the diffusion sampling process for the distribution alignment |\n| [DIDI: Diffusion-Guided Diversity for Offline Behavioral Generation](https:\u002F\u002Fopenreview.net\u002Fattachment?id=8296yUBoXr&name=pdf) | DIDI | ICML24 | propose to learn a diverse set of skills from a mixture of label-free offline data |\n\n\u003Ca id='Model-Based-Offline'>\u003C\u002Fa>\n## Model Based Offline RL\n\n|  Title | Method | Conference | Description |\n| ----  | ----   | ----       |   ----  |\n| [Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2006.03647.pdf) | BREMEN | ICLR20 | propose deployment efficiency, to count the number of changes in the data-collection policy during learning (offline: 1, online: no limit); propose BREMEN with an ensemble of dynamics models for off-policy and offline rl |\n| [MOPO: Model-based Offline Policy Optimization](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2005.13239.pdf) | MOPO | NeurIPS20 | observe that existing model-based RL algorithms can improve the performance of offline RL compared with model-free RL algorithms; design MOPO by extending MBPO on uncertainty-penalized MDPs (new_reward = reward - uncertainty) |\n| [MOReL: Model-Based Offline Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2005.05951.pdf) | MOReL | NeurIPS20 | present MOReL for model-based offline RL, including two steps: (a) learning a pessimistic MDP, (b) learning a near-optimal policy in this P-MDP |\n| [Model-Based Offline Planning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2008.05556.pdf) | MBOP | ICLR21 | learn a model for planning |\n| [Representation Balancing Offline Model-Based Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=QpNz8r_Ri2Y) | RepB-SDE | ICLR21 | focus on learning the representation for a robust model of the environment under the distribution shift and extend RepBM to deal with the curse of horizon; propose RepB-SDE framework for off-policy evaluation and offline rl |\n| [Conservative Objective Models for Effective Offline Model-Based 
Optimization](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.06882.pdf) | COMs | ICML21 | consider offline model-based optimization (MBO: optimize an unknown function given only some samples); add a regularizer (resembling adversarial training methods) to the objective for learning conservative objective models |\n| [COMBO: Conservative Offline Model-Based Policy Optimization](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.08363v1.pdf) | COMBO | NeurIPS21 | try to optimize a lower bound of performance without considering uncertainty quantification; extend CQL with model-based methods |\n| [Weighted Model Estimation for Offline Model-Based Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=zdC5eXljMPy) | ---- | NeurIPS21 | address the covariate shift issue by re-weighting the model losses for different datapoints |\n| [Revisiting Design Choices in Model-Based Offline Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.04135.pdf) | ---- | ICLR22 Spotlight | conduct a rigorous investigation into a series of design choices for Model-based Offline RL |\n| [Planning with Diffusion for Flexible Behavior Synthesis](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09991.pdf) | Diffuser | ICML22 oral | first design a denoising diffusion model for trajectory data and an associated probabilistic framework for behavior synthesis |\n| [Learning Temporally Abstract World Models without Online Experimentation](https:\u002F\u002Fopenreview.net\u002Fpdf?id=YeTYJz7th5) | OPOSM | ICML23 | present an approach for simultaneously learning sets of skills and temporally abstract, skill-conditioned world models purely from offline data, enabling agents to perform zero-shot online planning of skill sequences for new tasks |
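\n\nThe MOPO row above already gives the recipe (new_reward = reward - uncertainty); a sketch that uses ensemble disagreement as the uncertainty proxy (MOPO itself penalizes with the maximal standard deviation of its learned Gaussian models, and `lam` and the tensor shapes here are illustrative):\n\n```python\nimport torch\n\ndef penalized_reward(ensemble_predictions, reward, lam=1.0):\n    # MOPO-style pessimism on model-generated data: subtract an uncertainty\n    # estimate from the predicted reward. Here uncertainty is the largest\n    # per-dimension disagreement across an ensemble of dynamics models.\n    stacked = torch.stack(ensemble_predictions)  # [ensemble, batch, state_dim]\n    uncertainty = stacked.std(dim=0).max(dim=-1).values\n    return reward - lam * uncertainty\n```\n\n\n\u003Ca id='Meta-RL'>\u003C\u002Fa>\n## Meta RL\n\n|  Title | Method | Conference | Description |\n| ----  | ----   | ----       |   ----  |\n| [RL2: Fast reinforcement learning via slow reinforcement learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1611.02779.pdf) | RL2 | arxiv16 | view the learning process of the agent itself as an objective; structure the agent as a recurrent neural network to store past rewards, actions, observations and termination flags for adapting to the task at hand when deployed |\n| [Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks](https:\u002F\u002Fwww.cs.utexas.edu\u002Fusers\u002Fsniekum\u002Fclasses\u002FRL-F17\u002Fpapers\u002FMeta.pdf) | MAML | ICML17 | propose a general framework for different learning problems, including classification, regression and reinforcement learning; the main idea is to optimize the parameters to quickly adapt to new tasks (with a few steps of gradient descent) |\n| [Meta reinforcement learning with latent variable gaussian processes](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1803.07551.pdf) | ---- | arxiv18 |  |\n| [Learning to adapt in dynamic, real-world environments through meta-reinforcement learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1803.11347.pdf) | ReBAL, GrBAL | ICLR18 | consider learning online adaptation in the context of model-based reinforcement learning |\n| [Meta-Learning by Adjusting Priors Based on Extended PAC-Bayes Theory](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1711.01244.pdf) | ---- | ICML18 | extend various PAC-Bayes bounds to meta learning |\n| [Meta reinforcement learning of structured exploration strategies](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1802.07245.pdf) |  | NeurIPS18 |  |\n| [Meta-learning surrogate models for sequential 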
decision making](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1903.11907.pdf) |  | arxiv19 |  |\n| [Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1903.08254.pdf) | PEARL | ICML19 | encode past tasks’ experience with probabilistic latent context and use an inference network to estimate the posterior |\n| [Fast context adaptation via meta-learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1810.03642.pdf) | CAVIA | ICML19 | propose CAVIA as an extension to MAML that is less prone to meta-overfitting, easier to parallelise, and more interpretable; partition the model parameters into two parts: context parameters and shared parameters, and only update the former one in the test stage |\n| [Taming MAML: Efficient Unbiased Meta-Reinforcement Learning](http:\u002F\u002Fproceedings.mlr.press\u002Fv97\u002Fliu19g\u002Fliu19g.pdf) |  | ICML19 |  |\n| [Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning](http:\u002F\u002Fproceedings.mlr.press\u002Fv100\u002Fyu20a\u002Fyu20a.pdf) | Meta World | CoRL19 | an environment for meta RL as well as multi-task RL |\n| [Guided meta-policy search](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1904.00956.pdf) | GMPS | NeurIPS19 | consider the sample efficiency during the meta-training process by using supervised imitation learning |\n| [Meta-Q-Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1910.00125.pdf) | MQL | ICLR20 | an off-policy algorithm for meta RL that builds upon three simple ideas: (i) Q Learning with a context variable represented by past trajectories is competitive with SOTA; (ii) Multi-task objective is useful for meta RL; (iii) Past data from the meta-training replay buffer can be recycled |\n| [Varibad: A very good method for bayes-adaptive deep RL via meta-learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1910.08348.pdf) | variBAD | ICLR20 | represent a single MDP M using a learned, low-dimensional stochastic latent variable m; jointly meta-train a variational auto-encoder that can infer the posterior distribution over m in a new task, and a policy that conditions on this posterior belief over MDP embeddings |\n| [On the global optimality of model-agnostic meta-learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2006.13182.pdf), [ICML version](http:\u002F\u002Fproceedings.mlr.press\u002Fv119\u002Fwang20b\u002Fwang20b-supp.pdf) | ---- | ICML20 | characterize the optimality gap of the stationary points attained by MAML for both rl and sl |\n| [Meta-reinforcement learning robust to distributional shift via model identification and experience relabeling](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2006.07178.pdf) | MIER | arxiv20 |  |\n| [FOCAL: Efficient fully-offline meta-reinforcement learning via distance metric learning and behavior regularization](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2010.01112.pdf) | FOCAL | ICLR21 | first consider offline meta-reinforcement learning; propose FOCAL based on PEARL |\n| [Offline meta reinforcement learning with advantage weighting](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2008.06043.pdf) | MACAW | ICML21 | introduce the offline meta reinforcement learning problem setting; propose an optimization-based meta-learning algorithm named MACAW that uses simple, supervised regression objectives for both the inner and outer loop of meta-training |\n| [Improving Generalization in Meta-RL with Imaginary Tasks from Latent Dynamics Mixture](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2105.13524.pdf) | LDM | NeurIPS21 
| aim to train an agent that prepares for unseen test tasks during training; propose to train a policy on mixture tasks along with the original training tasks to prevent the agent from overfitting to the training tasks |\n| [Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.13125.pdf) | ---- | NeurIPS21 | present a unified framework for estimating higher-order derivatives of value functions, based on the concept of off-policy evaluation, for gradient-based meta rl |\n| [Generalization of Model-Agnostic Meta-Learning Algorithms: Recurring and Unseen Tasks](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.03832.pdf) | ---- | NeurIPS21 |  |\n| [Offline Meta Learning of Exploration](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2008.02598.pdf), [Offline Meta Reinforcement Learning -- Identifiability Challenges and Effective Data Collection Strategies](https:\u002F\u002Fopenreview.net\u002Fpdf?id=IBdEfhLveS) | BOReL | NeurIPS21 |  |\n| [On the Convergence Theory of Debiased Model-Agnostic Meta-Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2002.05135.pdf) | SG-MRL | NeurIPS21 |  |\n| [Hindsight Task Relabelling: Experience Replay for Sparse Reward Meta-RL](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00901.pdf) | ---- | NeurIPS21 |  |\n| [Generalization Bounds for Meta-Learning via PAC-Bayes and Uniform Stability](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.06589.pdf) | ---- | NeurIPS21 | provide generalization bounds on meta-learning by combining the PAC-Bayes technique and uniform stability |\n| [Bootstrapped Meta-Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04504.pdf) | BMG | ICLR22 Oral | propose BMG to let the meta-learner teach itself for tackling ill-conditioning problems and myopic meta-objectives in meta learning; BMG introduces meta-bootstrap to mitigate myopia and formulates the meta-objective in terms of minimising distance to control curvature |\n| [Model-Based Offline Meta-Reinforcement Learning with Regularization](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.02929.pdf) | MerPO, RAC | ICLR22 | empirically point out that offline Meta-RL could be outperformed by offline single-task RL methods on tasks with high-quality datasets; consider how to learn an informative offline meta-policy in order to achieve the optimal tradeoff between “exploring” the out-of-distribution state-actions by following the meta-policy and “exploiting” the offline dataset by staying close to the behavior policy; propose MerPO which learns a meta-model for efficient task structure inference and an informative meta-policy for safe exploration of out-of-distribution state-actions |\n| [Skill-based Meta-Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=jeLW-Fh9bV) | SiMPL | ICLR22 | propose a method that jointly leverages (i) a large offline dataset of prior experience collected across many tasks without reward or task annotations and (ii) a set of meta-training tasks to learn how to quickly solve unseen long-horizon tasks. 
|\n| [Hindsight Foresight Relabeling for Meta-Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.09031.pdf) | HFR | ICLR22 | focus on improving the sample efficiency of the meta-training phase via data sharing; combine relabeling techniques with meta-RL algorithms in order to boost both sample efficiency and asymptotic performance |\n| [CoMPS: Continual Meta Policy Search](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04467.pdf) | CoMPS | ICLR22 | first formulate the continual meta-RL setting, where the agent interacts with a single task at a time and, once finished with a task, never interacts with it again |\n| [Learning a subspace of policies for online adaptation in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.05169.pdf) | ---- | ICLR22 | consider the setting with just a single train environment; propose an approach where we learn a subspace of policies within the parameter space |\n| [An Adaptive Deep RL Method for Non-stationary Environments with Piecewise Stable Context](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.12735.pdf) | SeCBAD | NeurIPS22 | introduce the latent situational MDP with piecewise-stable context; jointly infer the belief distribution over latent context with the posterior over segment length and perform more accurate belief context inference with observed data within the current context segment |\n| [Model-based Meta Reinforcement Learning using Graph Structured Surrogate Models and Amortized Policy Search](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.08291.pdf) | GSSM | ICML22 | consider model-based meta reinforcement learning, which consists of dynamics model learning and policy optimization; develop a graph structured dynamics model with superior generalization capability across tasks |\n| [Meta-Learning Hypothesis Spaces for Sequential Decision-making](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00602.pdf) | Meta-KeL | ICML22 | propose Meta-KeL for meta-learning the hypothesis space of a sequential decision task |\n| [Transformers are Meta-Reinforcement Learners](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06614.pdf) | TrMRL | ICML22 | argue that two critical capabilities of transformers, reasoning over long-term dependencies and context-dependent weights from self-attention, compose the central role of a Meta-Reinforcement Learner; propose TrMRL, a memory-based meta-Reinforcement Learner which uses the transformer architecture to formulate the learning process |\n| [ContraBAR: Contrastive Bayes-Adaptive Deep RL](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.02418.pdf) | ContraBAR | ICML23 | investigate whether contrastive methods, like contrastive predictive coding, can be used for learning Bayes-optimal behavior |
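\n\nSeveral rows above build on MAML, whose core is a differentiable inner adaptation step; a minimal sketch (the `task_loss` callable mapping parameters to a scalar loss is an assumed stand-in, and `create_graph=True` is what lets an outer meta-update differentiate through the adaptation):\n\n```python\nimport torch\n\ndef maml_inner_step(params, task_loss, inner_lr=0.1):\n    # One gradient step of task adaptation; keeping the graph makes the\n    # adapted parameters differentiable w.r.t. the meta-parameters.\n    loss = task_loss(params)\n    grads = torch.autograd.grad(loss, params, create_graph=True)\n    return [p - inner_lr * g for p, g in zip(params, grads)]\n```\n\n\n\n\u003Ca id='Adversarial-RL'>\u003C\u002Fa>\n## Adversarial RL\n\n|  Title | Method | Conference | Description |\n| ----  | ----   | ----       |   ----  |\n| [Adversarial Attacks on Neural Network Policies](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1702.02284.pdf) | ---- | ICLR 2017 workshop | first show that existing rl policies coupled with deep neural networks are vulnerable to adversarial noises in white-box and black-box settings | \n| [Delving into Adversarial Attacks on Deep Policies](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1705.06452.pdf) | ---- | ICLR 2017 workshop | show rl algorithms are vulnerable to adversarial noises; show adversarial training can improve robustness |\n| [Robust Adversarial Reinforcement 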
Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1703.02702.pdf) | RARL | ICML17 | formulate the robust policy learning as a zero-sum, minimax objective function |\n| [Stealthy and Efficient Adversarial Attacks against Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2005.07099.pdf) | Critical Point Attack, Antagonist Attack | AAAI20 | critical point attack: build a model to predict the future environmental states and agent’s actions for attacking; antagonist attack: automatically learn a domain-agnostic model for attacking |\n| [Safe Reinforcement Learning in Constrained Markov Decision Processes](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2008.06626.pdf) | SNO-MDP | ICML20 | explore and optimize Markov decision processes under unknown safety constraints |\n| [Robust Deep Reinforcement Learning Against Adversarial Perturbations on State Observations](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2003.08938.pdf) | SA-MDP | NeurIPS20 | formalize adversarial attack on state observation as SA-MDP; propose some novel attack methods: Robust SARSA and Maximal Action Difference; propose a defence framework and some practical methods: SA-DQN, SA-PPO and SA-DDPG |\n| [Robust Reinforcement Learning on State Observations with Learned Optimal Adversary](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2101.08452.pdf) | ATLA | ICLR21 | use rl algorithms to train an \"optimal\" adversary; alternately train the \"optimal\" adversary and the robust agent |\n| [Robust Deep Reinforcement Learning through Adversarial Loss](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2008.01976.pdf) | RADIAL-RL | NeurIPS21 | propose a robust rl framework, which penalizes the overlap between output bounds of actions; propose a more efficient evaluation method (GWC) to measure attack agnostic robustness | \n| [Policy Smoothing for Provably Robust Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.11420.pdf) | Policy Smoothing | ICLR22 | introduce randomized smoothing into RL; propose an adaptive Neyman-Pearson Lemma |\n| [CROP: Certifying Robust Policies for Reinforcement Learning through Functional Smoothing](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.09292.pdf) | CROP | ICLR22 | present a framework of Certifying Robust Policies for RL (CROP) against adversarial state perturbations with two certification criteria: robustness of per-state actions and lower bound of cumulative rewards; theoretically prove the certification radius; conduct experiments to provide certification for six empirically robust RL algorithms on Atari |\n| [Understanding Adversarial Attacks on Observations in Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.15860.pdf) | ---- | SCIS 2023 | summarize current optimization-based adversarial attacks in RL; propose a two-stage method: train a deceptive policy and mislead the victim to imitate the deceptive policy |\n| [Consistent Attack: Universal Adversarial Perturbation on Embodied Vision Navigation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.05751.pdf) | Reward UAP, Trajectory UAP | PRL 2023 | extend universal adversarial perturbations into sequential decision making and propose Reward UAP as well as Trajectory UAP via utilizing the dynamics; experiment in Embodied Vision Navigation tasks |
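\n\nMost of the attacks above start from the same primitive, a gradient-based perturbation of the observation; a minimal fast-gradient-sign sketch (the `loss_fn` mapping an observation to the attacker's objective is an assumed stand-in):\n\n```python\nimport torch\n\ndef fgsm_observation_attack(loss_fn, observation, epsilon=0.01):\n    # Classic white-box attack: perturb the observation along the sign of\n    # the gradient of the attacker's objective, within an epsilon ball.\n    obs = observation.clone().detach().requires_grad_(True)\n    loss_fn(obs).backward()\n    return (obs + epsilon * obs.grad.sign()).detach()\n```\n\n\u003Ca id='Genaralization-in-RL'>\u003C\u002Fa>\n## Generalisation in RL\n\n\u003Ca id='Gene-Environments'>\u003C\u002Fa>\n### Environments\n\n| Title | Method | Conference | Description | \n| ----  | ----   | ----       |   ----  |\n| [Quantifying Generalization in 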
Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1812.02341.pdf) | CoinRun | ICML19 | introduce a new environment called CoinRun for generalisation in RL; empirically show L2 regularization, dropout, data augmentation and batch normalization can improve generalization in RL |\n| [Leveraging Procedural Generation to Benchmark Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1912.01588.pdf) | Procgen Benchmark | ICML20 | introduce Procgen Benchmark, a suite of 16 procedurally generated game-like environments designed to benchmark both sample efficiency and generalization in reinforcement learning |\n\n\u003Ca id='Gene-Methods'>\u003C\u002Fa>\n### Methods\n\n| Title | Method | Conference | Description | \n| ----  | ----   | ----       |   ----  |\n| [Towards Generalization and Simplicity in Continuous Control](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1703.02660.pdf) | ---- | NeurIPS17 | policies with simple linear and RBF parameterizations can be trained to solve a variety of widely studied continuous control tasks; training with a diverse initial state distribution induces more global policies with better generalization |\n| [Universal Planning Networks](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1804.00645.pdf) | UPN | ICML18 |  study a model-based architecture that performs a differentiable planning computation in a latent space jointly learned with forward dynamics, trained end-to-end to encode what is necessary for solving tasks by gradient-based planning |\n| [On the Generalization Gap in Reparameterizable Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1905.12654.pdf) | ---- | ICML19 | theoretically provide guarantees on the gap between the expected and empirical return for both intrinsic and external errors in reparameterizable RL |\n| [Investigating Generalisation in Continuous Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1902.07015.pdf) | ---- | arxiv19 | study generalisation in Deep RL for continuous control |\n| [Generalization in Reinforcement Learning with Selective Noise Injection and Information Bottleneck](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1910.12911.pdf) | SNI | NeurIPS19 | consider regularization techniques relying on the injection of noise into the learned function for improving generalization; hope to maintain the regularizing effect of the injected noise and mitigate its adverse effects on the gradient quality |\n| [Network randomization: A simple technique for generalization in deep reinforcement learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1910.05396.pdf) | Network Randomization | ICLR20 | introduce a randomized (convolutional) neural network that randomly perturbs input observations, which enables trained agents to adapt to new domains by learning robust features invariant across varied and randomized environments |\n| [Observational Overfitting in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1912.02975.pdf) | observational overfitting | ICLR20 | discuss realistic instances where observational overfitting may occur and its difference from other confounding factors, and design a parametric theoretical framework to induce observational overfitting that can be applied to any underlying MDP |\n| [Context-aware Dynamics Model for Generalization in Model-Based Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2005.06800.pdf) | CaDM | ICML20 | decompose the task of learning a global dynamics model into two stages: (a) learning a context latent 
vector that captures the local dynamics, then (b) predicting the next state conditioned on it |\n| [Improving Generalization in Reinforcement Learning with Mixture Regularization](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2010.10814.pdf) | mixreg | NeurIPS20 | train agents on a mixture of observations from different training environments and impose linearity constraints on the observation interpolations and the supervision (e.g. associated reward) interpolations |\n| [Instance based Generalization in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2011.01089.pdf) | IPAE | NeurIPS20 | formalize the concept of training levels as instances and show that this instance-based view is fully consistent with the standard POMDP formulation; provide generalization bounds on the value gap in train and test environments based on the number of training instances, and use insights based on these to improve performance on unseen levels |\n| [Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2101.05265.pdf) | PSM | ICLR21 | incorporate the inherent sequential structure in reinforcement learning into the representation learning process to improve generalization; introduce a theoretically motivated policy similarity metric (PSM) for measuring behavioral similarity between states |\n| [Generalization in Reinforcement Learning by Soft Data Augmentation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2011.13389.pdf) | SODA | ICRA21 | impose a soft constraint on the encoder that aims to maximize the mutual information between latent representations of augmented and non-augmented data |\n| [Augmented World Models Facilitate Zero-Shot Dynamics Generalization From a Single Offline Environment](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2104.05632.pdf) | AugWM | ICML21 | consider the setting named \"dynamics generalization from a single offline environment\" and concentrate on the zero-shot performance to unseen dynamics; propose dynamics augmentation for model based offline RL; propose a simple self-supervised context adaptation reward-free algorithm |\n| [Decoupling Value and Policy for Generalization in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.10330.pdf) | IDAAC | ICML21 | decouple the optimization of the policy and value function, using separate networks to model them; introduce an auxiliary loss which encourages the representation to be invariant to task-irrelevant properties of the environment |\n| [Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.06277.pdf) | LEEP | NeurIPS21 | generalisation in RL induces implicit partial observability; propose LEEP to use an ensemble of policies to approximately learn the Bayes-optimal policy for maximizing test-time performance |\n| [Automatic Data Augmentation for Generalization in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2006.12862.pdf) | DrAC | NeurIPS21 | focus on automatic data augmentation based on two novel regularization terms for the policy and value function |\n| [When Is Generalizable Reinforcement Learning Tractable?](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2101.00300.pdf) | ---- | NeurIPS21 | propose Weak Proximity and Strong Proximity for theoretically analyzing the generalisation of RL |\n| [A Survey of Generalisation in Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.09794.pdf) | ---- | 
arxiv21 | provide a unifying formalism and terminology for discussing different generalisation problems |\n| [Cross-Trajectory Representation Learning for Zero-Shot Generalization in RL](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.02193.pdf) | CTRL | ICLR22 | consider zero-shot generalization (ZSG); use self-supervised learning to learn a representation across tasks |\n| [The Role of Pretrained Representations for the OOD Generalization of RL Agents](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.05686.pdf) | ---- | ICLR22 | train 240 representations and 11,520 downstream policies and systematically investigate their performance under a diverse range of distribution shifts; find that a specific representation metric that measures the generalization of a simple downstream proxy task reliably predicts the generalization of downstream RL agents under the broad spectrum of OOD settings considered here |\n| [Generalisation in Lifelong Reinforcement Learning through Logical Composition](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ZOcX-eybqoL) | ---- | ICLR22 | leverage logical composition in reinforcement learning to create a framework that enables an agent to autonomously determine whether a new task can be immediately solved using its existing abilities, or whether a task-specific skill should be learned |\n| [Local Feature Swapping for Generalization in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.06355.pdf) | CLOP | ICLR22 | introduce a new regularization technique consisting of channel-consistent local permutations of the feature maps |\n| [A Generalist Agent](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.06175.pdf) | Gato | arxiv2205 | [slide](https:\u002F\u002Fml.cs.tsinghua.edu.cn\u002F~chengyang\u002Freading_meeting\u002FReading_Meeting_20220607.pdf) |\n| [Towards Safe Reinforcement Learning via Constraining Conditional Value at Risk](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04436.pdf) | CPPO | IJCAI22 | find the connection between modifying observations and dynamics, which are structurally different |\n| [CtrlFormer: Learning Transferable State Representation for Visual Control via Transformer](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08883.pdf) | CtrlFormer | ICML22 | jointly learns self-attention mechanisms between visual tokens and policy tokens among different control tasks, where multitask representation can be learned and transferred without catastrophic forgetting |\n| [Learning Dynamics and Generalization in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02126.pdf) | ---- | ICML22 | show theoretically that temporal difference learning encourages agents to fit non-smooth components of the value function early in training, and at the same time induces the second-order effect of discouraging generalization |\n| [Improving Policy Optimization with Generalist-Specialist Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.12984.pdf) | GSL | ICML22 | hope to utilize experiences from the specialists to aid the policy optimization of the generalist; propose the phenomenon “catastrophic ignorance” in multi-task learning |\n| [DRIBO: Robust Deep Reinforcement Learning via Multi-View Information Bottleneck](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.13268.pdf) | DRIBO | ICML22 | learn robust representations that encode only task-relevant information from observations based on the unsupervised multi-view setting; introduce a novel contrastive version of the Multi-View Information Bottleneck (MIB) objective for 
temporal data |\n| [Generalizing Goal-Conditioned Reinforcement Learning with Variational Causal Reasoning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09081.pdf) | GRADER | NeurIPS22 | use the causal graph as a latent variable to reformulate the GCRL problem and then derive an iterative training framework from solving this problem |\n| [Rethinking Value Function Learning for Generalization in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09960.pdf) | DCPG, DDCPG | NeurIPS22 | consider training agents on multiple training environments to improve observational generalization performance; identify that the value network in the multiple-environment setting is more challenging to optimize; propose regularization methods that penalize large estimates of the value network for preventing overfitting |\n| [Masked Autoencoding for Scalable and Generalizable Decision Making](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12740.pdf) | MaskDP | NeurIPS22 | employ a masked autoencoder (MAE) to state-action trajectories for reinforcement learning (RL) and behavioral cloning (BC) and gain the capability of zero-shot transfer to new tasks |\n| [Pre-Trained Image Encoder for Generalizable Visual Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08860.pdf) | PIE-G | NeurIPS22 | find that the early layers in an ImageNet pre-trained ResNet model could provide rather generalizable representations for visual RL |\n| [Look where you look! Saliency-guided Q-networks for visual RL tasks](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09203.pdf) | SGQN | NeurIPS22 | propose that a good visual policy should be able to identify which pixels are important for its decision; preserve this identification of important sources of information across images |\n| [Human-Timescale Adaptation in an Open-Ended Task Space](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07608.pdf) | AdA | arXiv 2301 | demonstrate that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans |\n| [In-context Reinforcement Learning with Algorithm Distillation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14215.pdf) | AD | ICLR23 oral | propose Algorithm Distillation for distilling reinforcement learning (RL) algorithms into neural networks by modeling their training histories with a causal sequence model |\n| [Performance Bounds for Model and Policy Transfer in Hidden-parameter MDPs](https:\u002F\u002Fopenreview.net\u002Fpdf?id=sSt9fROSZRO) | ---- | ICLR23 | show that, given a fixed amount of pretraining data, agents trained with more variations are able to generalize better; find that increasing the capacity of the value and policy network is critical to achieve good performance |\n| [Investigating Multi-task Pretraining and Generalization in Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=sSt9fROSZRO) | ---- | ICLR23 | find that, given a fixed amount of pretraining data, agents trained with more variations are able to generalize better; this advantage is still present after fine-tuning for 200M environment frames, compared to zero-shot transfer |\n| [Cross-domain Random Pre-training with Prototypes for Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.05614.pdf) | CRPTpro | arXiv2302 | use prototypical representation learning with a novel intrinsic loss to pre-train an effective and generic encoder across different domains |\n| [Task Aware Dreamer for 
Task Generalization in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.05092.pdf) | TAD | arXiv2303 | propose Task Distribution Relevance to capture the relevance of the task distribution quantitatively; propose TAD to use world models to improve task generalization via encoding reward signals into policies |\n| [The Benefits of Model-Based Generalization in Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Vue1ulwlPD) | ---- | ICML23 | provide theoretical and empirical insight into when, and how, we can expect data generated by a learned model to be useful |\n| [Multi-Environment Pretraining Enables Transfer to Action Limited Datasets](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13337.pdf) | ALPT | ICML23 | given n source environments with fully action labelled dataset, consider offline RL in the target environment with a small action labelled dataset and a large dataset without action labels; utilize inverse dynamics model to learn a representation that generalizes well to the limited action data from the target environment |\n| [In-Context Reinforcement Learning for Variable Action Spaces](https:\u002F\u002Fopenreview.net\u002Fattachment?id=pp3v2ch5Sd&name=pdf) | Headless-AD | ICML24 | extend Algorithm Distillation to environments with variable discrete action spaces |
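\n\nSeveral of the methods above (mixreg, DrAC, SODA) regularize by augmenting observations while keeping the supervision consistent. As a concrete reference, here is a minimal PyTorch-style sketch of mixreg-style observation mixing; the function name and batch layout are illustrative assumptions, not the authors' code.\n\n```python\nimport torch\n\ndef mixreg_batch(obs, targets, alpha=0.2):\n    # mixreg-style regularization: convexly mix observations drawn from\n    # different training environments, and mix the associated supervision\n    # (e.g. rewards or value targets) with the same coefficient.\n    lam = torch.distributions.Beta(alpha, alpha).sample()\n    perm = torch.randperm(obs.size(0))  # pair each sample with a random partner\n    mixed_obs = lam * obs + (1 - lam) * obs[perm]\n    mixed_targets = lam * targets + (1 - lam) * targets[perm]\n    return mixed_obs, mixed_targets\n```\n\nTraining then proceeds on the mixed batch, which encourages the learned function to behave linearly between training observations.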
\n\n\u003Ca id='Sequence-Generation'>\u003C\u002Fa>\n## RL with Transformer\n\n| Title | Method | Conference | Description |\n| ----  | ----   | ----       |   ----  |\n| [Stabilizing transformers for reinforcement learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1910.06764.pdf) | GTrXL | ICML20 | stabilizing training with a reordering of the layer normalization coupled with the addition of a new gating mechanism to key points in the submodules of the transformer |\n| [Decision Transformer: Reinforcement Learning via Sequence Modeling](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.01345.pdf) | DT | NeurIPS21 | regard RL as a sequence generation task and use a transformer to generate (return-to-go, state, action, return-to-go, ...); there is no explicit optimization process; evaluate on Offline RL (see the sketch after this table) |\n| [Offline Reinforcement Learning as One Big Sequence Modeling Problem](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.02039.pdf) | TT | NeurIPS21 | regard RL as a sequence generation task and use a transformer to generate (s_0^0, ..., s_0^N, a_0^0, ..., a_0^M, r_0, ...); use beam search for inference; evaluate on imitation learning, goal-conditioned RL and Offline RL |\n| [Can Wikipedia Help Offline Reinforcement Learning?](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12122.pdf) | ChibiT | arxiv2201 | demonstrate that pre-training on autoregressively modeling natural language provides consistent performance gains when compared to the Decision Transformer on both the popular OpenAI Gym and Atari |\n| [Online Decision Transformer](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.05607.pdf) | ODT | ICML22 oral | blends offline pretraining with online finetuning in a unified framework; use sequence-level entropy regularizers in conjunction with autoregressive modeling objectives for sample-efficient exploration and finetuning |\n| Prompting Decision Transformer for Few-shot Policy Generalization | ---- | ICML22 | ---- |\n| [Multi-Game Decision Transformers](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.15241.pdf) | ---- | NeurIPS22 | show that a single transformer-based model trained purely offline can play a suite of up to 46 Atari games simultaneously at close-to-human performance |\n| [Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02662.pdf) | GLAM | ICML23 | consider an agent using an LLM as a policy that is progressively updated as the agent interacts with the environment, leveraging online Reinforcement Learning to improve its performance on the goals it must solve |
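\n\nTo make the sequence-modeling view concrete, here is a minimal sketch of a Decision Transformer style evaluation loop, where actions are generated autoregressively conditioned on a target return-to-go that is decremented by the observed rewards; the model and env interfaces are assumed placeholders, not an actual API from the DT codebase.\n\n```python\ndef dt_rollout(model, env, target_return, max_steps=1000):\n    # Decision Transformer style inference: condition on the\n    # (return-to-go, state, action) history and predict the next action.\n    state = env.reset()\n    rtgs, states, actions = [target_return], [state], []\n    total_reward = 0.0\n    for _ in range(max_steps):\n        action = model.predict(rtgs, states, actions)  # next-action token\n        state, reward, done, _ = env.step(action)\n        actions.append(action)\n        states.append(state)\n        rtgs.append(rtgs[-1] - reward)  # decrement the return-to-go\n        total_reward += reward\n        if done:\n            break\n    return total_reward\n```\n\nConditioning on a high target return at test time is what lets the same trained sequence model act as a return-seeking policy rather than a pure behavior-cloning one.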
\n\n\u003Ca id='Tutorial-and-Lesson'>\u003C\u002Fa>\n## Tutorial and Lesson\n\n| Tutorial and Lesson |\n| ---- |\n| [Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto](https:\u002F\u002Fd1wqtxts1xzle7.cloudfront.net\u002F54674740\u002FReinforcement_Learning-with-cover-page-v2.pdf?Expires=1641130151&Signature=eYy7kmTVqTXFcANS-9GZJUyb86cDqKeh2QX8VvzjouEM-QSfuiCm1WHhP~bW5C57Mecj6en~YRoTvxekzU5lq~UaHSBoc-7xP8dXBp91shcwdfJ8M0LUkktpqcQjXQi7ZzhGn33qZeah0p8S06ARzjimF5coL5arvp9yANAsy4KigXSZwAZNXxksKwqUAult2QseLL~Bv1p2locjYahRzTuex3vMxdBLhT9HOGFF0qOdKYxsWiaITUKnVYl8AvePDHEEXgfmuqEfjqjF5p~FHOsYl3gEDZOvUp1eUzPg2~i0MQXY49nUpzsThL5~unTRIsYJiBghnkYl8py0r~UelQ__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA) |\n| [Introduction to Reinforcement Learning with David Silver](https:\u002F\u002Fdeepmind.com\u002Flearning-resources\u002F-introduction-reinforcement-learning-david-silver) |\n| [Deep Reinforcement Learning, CS285](https:\u002F\u002Frail.eecs.berkeley.edu\u002Fdeeprlcourse\u002F) |\n| [Deep Reinforcement Learning and Control, CMU 10703](https:\u002F\u002Fkatefvision.github.io\u002F) |\n| [RLChina](http:\u002F\u002Frlchina.org\u002Ftopic\u002F9) |\n\n\u003Ca id='ICLR22'>\u003C\u002Fa>\n## ICLR22\n| Paper | Type |\n| ---- | ---- |\n| [Bootstrapped Meta-Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04504.pdf) | oral |\n| [The Information Geometry of Unsupervised Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.02719.pdf) | oral |\n| [SO(2)-Equivariant Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04439.pdf) | spotlight |\n| [CoBERL: Contrastive BERT for Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.05431.pdf) | spotlight |\n| [Understanding and Preventing Capacity Loss in Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ZkC8wKoLbQ7) | spotlight |\n| [On Lottery Tickets and Minimal Task Representations in Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2105.01648.pdf) | spotlight |\n| [Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04628.pdf) | spotlight |\n| [Sample Efficient Deep Reinforcement Learning via Uncertainty Estimation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.01666.pdf) | spotlight |\n| [Generative Planning for Temporally Coordinated Exploration in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.09765.pdf) | spotlight |\n| [When should agents explore?](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.11811.pdf) | spotlight |\n| [Revisiting Design Choices in Model-Based Offline Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.04135.pdf) | spotlight |\n| [DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04716.pdf) | spotlight |\n| [Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.11566.pdf) | spotlight |\n| [COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation](https:\u002F\u002Fopenreview.net\u002Fpdf?id=FLA55mBee6Q) | spotlight |\n| [Value Gradient weighted Model-Based Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01464.pdf) | spotlight |\n| [Constrained Policy Optimization via Bayesian World Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.09802.pdf) | spotlight |\n| [Cross-Trajectory Representation Learning for Zero-Shot Generalization in RL](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.02193.pdf) | poster |\n| [The Role of Pretrained Representations for the OOD Generalization of RL Agents](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.05686.pdf) | poster |\n| [Generalisation in Lifelong Reinforcement Learning through Logical Composition](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ZOcX-eybqoL) | poster |\n| [Local Feature Swapping for Generalization in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.06355.pdf) | poster |\n| [Policy Smoothing for Provably Robust Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.11420.pdf) | poster |\n| [CROP: Certifying Robust Policies for Reinforcement Learning through Functional Smoothing](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.09292.pdf) | poster |\n| [Model-Based Offline Meta-Reinforcement Learning with Regularization](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.02929.pdf) | poster |\n| [Skill-based Meta-Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=jeLW-Fh9bV) | poster |\n| [Hindsight Foresight Relabeling for Meta-Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.09031.pdf) | poster |\n| [CoMPS: Continual Meta Policy Search](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04467.pdf) | poster |\n| [Learning a subspace of policies for online adaptation in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.05169.pdf) | poster |\n| [Pessimistic Model-based Offline Reinforcement Learning under Partial Coverage](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.06226.pdf) | poster |\n| [Pareto Policy Pool for Model-based Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=OqcZu8JIIzS) | poster |\n| [Offline Reinforcement Learning with Value-based Episodic Memory](https:\u002F\u002Fopenreview.net\u002Fpdf?id=RCZqv9NXlZ) | poster |\n| [Offline reinforcement learning with implicit Q-learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.06169.pdf) | poster |\n| [On-Policy Model Errors in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.07985.pdf) | poster |\n| [Maximum Entropy RL (Provably) Solves Some Robust RL Problems](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2103.06257.pdf) | poster |\n| [Maximizing Ensemble Diversity in Deep Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=hjd-kcpDpf2) | poster |\n| [Learning Generalizable Representations for Reinforcement Learning via Adaptive Meta-learner of Behavioral Similarities](https:\u002F\u002Fopenreview.net\u002Fpdf?id=zBOI9LFpESK) | poster |\n| [Lipschitz Constrained Unsupervised Skill Discovery](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00914.pdf) | poster |\n| [Learning more skills through optimistic exploration](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.14226.pdf) | poster |\n\n\u003Ca id='ICML22'>\u003C\u002Fa>\n## ICML22\n| Paper | Type |\n| ---- | ---- |\n| [Online Decision 
Transformer](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.05607.pdf) | oral |\n| The Unsurprising Effectiveness of Pre-Trained Vision Models for Control | oral |\n| The Importance of Non-Markovianity in Maximum State Entropy Exploration | oral |\n| [Planning with Diffusion for Flexible Behavior Synthesis](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09991.pdf) | oral |\n| Adversarially Trained Actor Critic for Offline Reinforcement Learning | oral |\n| Learning Bellman Complete Representations for Offline Policy Evaluation | oral |\n| [Offline RL Policies Should Be Trained to be Adaptive](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02200.pdf) | oral |\n| [Large Batch Experience Replay](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.01528.pdf) | oral |\n| [Do Differentiable Simulators Give Better Gradients for Policy Optimization?](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00817.pdf) | oral |\n| Federated Reinforcement Learning: Communication-Efficient Algorithms and Convergence Analysis | oral |\n| [An Analytical Update Rule for General Policy Optimization](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.02045.pdf) | oral |\n| [Generalised Policy Improvement with Geometric Policy Composition](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08736.pdf) | oral |\n| Prompting Decision Transformer for Few-shot Policy Generalization | poster |\n| [CtrlFormer: Learning Transferable State Representation for Visual Control via Transformer](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08883.pdf) | poster |\n| [Learning Dynamics and Generalization in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02126.pdf) | poster |\n| [Improving Policy Optimization with Generalist-Specialist Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.12984.pdf) | poster |\n| [DRIBO: Robust Deep Reinforcement Learning via Multi-View Information Bottleneck](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.13268.pdf) | poster |\n| [Policy Gradient Method For Robust Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.07344.pdf) | poster |\n| SAUTE RL: Toward Almost Surely Safe Reinforcement Learning Using State Augmentation | poster |\n| Constrained Variational Policy Optimization for Safe Reinforcement Learning | poster |\n| Robust Deep Reinforcement Learning through Bootstrapped Opportunistic Curriculum | poster |\n| Distributionally Robust Q-Learning | poster |\n| Robust Meta-learning with Sampling Noise and Label Noise via Eigen-Reptile | poster |\n| [Model-based Meta Reinforcement Learning using Graph Structured Surrogate Models and Amortized Policy Search](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.08291.pdf) | poster |\n| [Meta-Learning Hypothesis Spaces for Sequential Decision-making](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00602.pdf) | poster |\n| Biased Gradient Estimate with Drastic Variance Reduction for Meta Reinforcement Learning | poster |\n| [Transformers are Meta-Reinforcement Learners](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06614.pdf) | poster |\n| Offline Meta-Reinforcement Learning with Online Self-Supervision | poster |\n| Regularizing a Model-based Policy Stationary Distribution to Stabilize Offline Reinforcement Learning | poster |\n| Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity | poster |\n| How to Leverage Unlabeled Data in Offline Reinforcement Learning? 
| poster |\n| On the Role of Discount Factor in Offline Reinforcement Learning | poster |\n| Model Selection in Batch Policy Optimization | poster |\n| Koopman Q-learning: Offline Reinforcement Learning via Symmetries of Dynamics | poster |\n| Robust Task Representations for Offline Meta-Reinforcement Learning via Contrastive Learning | poster |\n| Pessimism meets VCG: Learning Dynamic Mechanism Design via Offline Reinforcement Learning | poster |\n| Showing Your Offline Reinforcement Learning Work: Online Evaluation Budget Matters | poster |\n| Constrained Offline Policy Optimization | poster |\n| [DreamerPro: Reconstruction-Free Model-Based Reinforcement Learning with Prototypical Representations](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fdeng22a\u002Fdeng22a.pdf) | poster |\n| [Towards Evaluating Adaptivity of Model-Based Reinforcement Learning Methods](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fwan22d\u002Fwan22d.pdf) | poster |\n| [Reinforcement Learning with Action-Free Pre-Training from Videos](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13880.pdf) | poster |\n| [Denoised MDPs: Learning World Models Better Than the World Itself](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.15477.pdf) | poster |\n| [Temporal Difference Learning for Model Predictive Control](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04955.pdf) | poster |\n| [Causal Dynamics Learning for Task-Independent State Abstraction](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.13452.pdf) | poster |\n| [Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12417.pdf) | poster |\n| [Adaptive Model Design for Markov Decision Process](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fchen22ab\u002Fchen22ab.pdf) | poster |\n| [Stabilizing Off-Policy Deep Reinforcement Learning from Pixels](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fcetin22a\u002Fcetin22a.pdf) | poster |\n| [Understanding Policy Gradient Algorithms: A Sensitivity-Based Approach](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fwu22i\u002Fwu22i.pdf) | poster |\n| [Mirror Learning: A Unifying Framework of Policy Optimisation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02373.pdf) | poster |\n| [Continuous Control with Action Quantization from Demonstrations](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fdadashi22a\u002Fdadashi22a.pdf) | poster |\n| [Off-Policy Fitted Q-Evaluation with Differentiable Function Approximators: Z-Estimation and Inference Theory](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fzhang22al\u002Fzhang22al.pdf) | poster |\n| [A Temporal-Difference Approach to Policy Gradient Estimation](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Ftosatto22a\u002Ftosatto22a.pdf) | poster |\n| [The Primacy Bias in Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.07802.pdf) | poster |\n| [Optimizing Sequential Experimental Design with Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00821.pdf) | poster |\n| [The Geometry of Robust Value Functions](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fwang22k\u002Fwang22k.pdf) | poster |\n| Direct Behavior Specification via Constrained Reinforcement Learning | poster |\n| [Utility Theory for Markovian Sequential Decision 
Making](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.13637.pdf) | poster |\n| [Reducing Variance in Temporal-Difference Value Estimation via Ensemble of Deep Networks](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fliang22c\u002Fliang22c.pdf) | poster |\n| Unifying Approximate Gradient Updates for Policy Optimization | poster |\n| [EqR: Equivariant Representations for Data-Efficient Reinforcement Learning](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fmondal22a\u002Fmondal22a.pdf) | poster |\n| [Provable Reinforcement Learning with a Short-Term Memory](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fefroni22a\u002Fefroni22a.pdf) | poster |\n| Optimal Estimation of Off-Policy Policy Gradient via Double Fitted Iteration | poster |\n| Cliff Diving: Exploring Reward Surfaces in Reinforcement Learning Environments | poster |\n| Lagrangian Method for Q-Function Learning (with Applications to Machine Translation) | poster |\n| Learning to Assemble with Large-Scale Structured Reinforcement Learning | poster |\n| Addressing Optimism Bias in Sequence Modeling for Reinforcement Learning | poster |\n| Off-Policy Reinforcement Learning with Delayed Rewards | poster |\n| Reachability Constrained Reinforcement Learning | poster |\n| [Flow-based Recurrent Belief State Learning for POMDPs](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fchen22q\u002Fchen22q.pdf) | poster |\n| [Off-Policy Evaluation for Large Action Spaces via Embeddings](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06317.pdf) | poster |\n| [Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fkallus22a\u002Fkallus22a.pdf) | poster |\n| [On Well-posedness and Minimax Optimal Rates of Nonparametric Q-function Estimation in Off-policy Evaluation](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fchen22u\u002Fchen22u.pdf) | poster |\n| Communicating via Maximum Entropy Reinforcement Learning | poster |\n\n\u003Ca id='NeurIPS22'>\u003C\u002Fa>\n## NeurIPS22\n| Paper | Type |\n| ---- | ---- |\n| [Multi-Game Decision Transformers](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.15241.pdf) | poster |\n| [Generalizing Goal-Conditioned Reinforcement Learning with Variational Causal Reasoning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09081.pdf) | poster |\n| [Rethinking Value Function Learning for Generalization in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09960.pdf) | poster |\n| [Masked Autoencoding for Scalable and Generalizable Decision Making](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12740.pdf) | poster |\n| [Pre-Trained Image Encoder for Generalizable Visual Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08860.pdf) | poster |\n| [GALOIS: Boosting Deep Reinforcement Learning via Generalizable Logic Synthesis](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13728.pdf) | poster |\n| [Look where you look! 
Saliency-guided Q-networks for visual RL tasks](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09203.pdf) | poster |\n| [An Adaptive Deep RL Method for Non-Stationary Environments with Piecewise Stable Context](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.12735.pdf) | poster |\n| [Model-Based Offline Reinforcement Learning with Pessimism-Modulated Dynamics Belief](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06692.pdf) | poster |\n| [A Unified Framework for Alternating Offline Model Training and Policy Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05922.pdf) | poster |\n| [Bidirectional Learning for Offline Infinite-width Model-based Optimization](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07507.pdf) | poster |\n| DASCO: Dual-Generator Adversarial Support Constrained Offline Reinforcement Learning | poster |\n| [Supported Policy Optimization for Offline Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06239.pdf) | poster |\n| [Why So Pessimistic? Estimating Uncertainties for Offline RL through Ensembles, and Why Their Independence Matters](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13703.pdf) | poster |\n| [Mildly Conservative Q-Learning for Offline Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04745.pdf) | poster |\n| [A Policy-Guided Imitation Approach for Offline Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08323.pdf) | poster |\n| [Bootstrapped Transformer for Offline Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08569.pdf) | poster |\n| [LobsDICE: Offline Learning from Observation via Stationary Distribution Correction Estimation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.13536.pdf) | poster |\n| [Latent-Variable Advantage-Weighted Policy Optimization for Offline RL](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.08949.pdf) | poster |\n| [How Far I'll Go: Offline Goal-Conditioned Reinforcement Learning via f-Advantage Regression](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.03023.pdf) | poster |\n| [NeoRL: A Near Real-World Benchmark for Offline Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.00714.pdf) | poster |\n| [When does return-conditioned supervised learning work for offline reinforcement learning?](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01079.pdf) | poster |\n| [Bellman Residual Orthogonalization for Offline Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12786.pdf) | poster |\n| [Oracle Inequalities for Model Selection in Offline Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.02016.pdf) | poster |\n| [Mismatched no More: Joint Model-Policy Optimization for Model-Based RL](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.02758.pdf) | poster |\n| When to Update Your Model: Constrained Model-based Reinforcement Learning | poster |\n| Bayesian Optimistic Optimization: Optimistic Exploration for Model-Based Reinforcement Learning | poster |\n| [Model-based Lifelong Reinforcement Learning with Bayesian Exploration](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11579.pdf) | poster |\n| Plan to Predict: Learning an Uncertainty-Foreseeing Model for Model-Based Reinforcement Learning | poster |\n| Data-Driven Model-Based Optimization via Invariant Representation Learning | poster |\n| [Reinforcement Learning with Non-Exponential 
Discounting](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13413.pdf) | poster |\n| [Reinforcement Learning with Neural Radiance Fields](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01634.pdf) | poster |\n| [Recursive Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.11430.pdf) | poster |\n| [Challenging Common Assumptions in Convex Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.01511.pdf) | poster |\n| Explicable Policy Search | poster |\n| [On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00761.pdf) | poster |\n| [When to Ask for Help: Proactive Interventions in Autonomous Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10765.pdf) | poster |\n| Adaptive Bio-Inspired Fish Simulation with Deep Reinforcement Learning | poster |\n| Reinforcement Learning in a Birth and Death Process: Breaking the Dependence on the State Space | poster |\n| [Discovered Policy Optimisation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05639.pdf) | poster |\n| Exploration-Guided Reward Shaping for Reinforcement Learning under Sparse Rewards | poster |\n| [Large-Scale Retrieval for Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.05314.pdf) | poster |\n| [Sustainable Online Reinforcement Learning for Auto-bidding](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07006.pdf) | poster |\n| [LECO: Learnable Episodic Count for Task-Specific Intrinsic Reward](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05409.pdf) | poster |\n| [DNA: Proximal Policy Optimization with a Dual Network Architecture](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.10027.pdf) | poster |\n| [Faster Deep Reinforcement Learning with Slower Online Network](https:\u002F\u002Fassets.amazon.science\u002F31\u002Fca\u002F0c09418b4055a7536ced1b218d72\u002Ffaster-deep-reinforcement-learning-with-slower-online-network.pdf) | poster |\n| [Online Reinforcement Learning for Mixed Policy Scopes](https:\u002F\u002Fcausalai.net\u002Fr84.pdf) | poster |\n| [ProtoX: Explaining a Reinforcement Learning Agent via Prototyping](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.03162.pdf) | poster |\n| [Hardness in Markov Decision Processes: Theory and Practice](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13075.pdf) | poster |\n| [Robust Phi-Divergence MDPs](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14202.pdf) | poster |\n| [On the convergence of policy gradient methods to Nash equilibria in general stochastic games](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08857.pdf) | poster |\n| [A Unified Off-Policy Evaluation Approach for General Value Function](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.02711.pdf) | poster |\n| [Robust On-Policy Sampling for Data-Efficient Policy Evaluation in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.14552.pdf) | poster |\n| Continuous Deep Q-Learning in Optimal Control Problems: Normalized Advantage Functions Analysis | poster |\n| [Parametrically Retargetable Decision-Makers Tend To Seek Power](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.13477.pdf) | poster |\n| [Batch size-invariance for policy optimization](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.00641.pdf) | poster |\n| [Trust Region Policy Optimization with Optimal Transport Discrepancies: Duality and Algorithm for Continuous 
Actions](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11137.pdf) | poster |\n| Adaptive Interest for Emphatic Reinforcement Learning | poster |\n| [The Nature of Temporal Difference Errors in Multi-step Distributional Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07570.pdf) | poster |\n| [Reincarnating Reinforcement Learning: Reusing Prior Computation to Accelerate Progress](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01626.pdf) | poster |\n| [Bayesian Risk Markov Decision Processes](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.02558.pdf) | poster |\n| [Explainable Reinforcement Learning via Model Transforms](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12006.pdf) | poster |\n| PDSketch: Integrated Planning Domain Programming and Learning | poster |\n| [Contrastive Learning as Goal-Conditioned Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07568.pdf) | poster |\n| [Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels?](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.05266.pdf) | poster |\n| [Reinforcement Learning with Automated Auxiliary Loss Search](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06041.pdf) | poster |\n| [Mask-based Latent Reconstruction for Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12096.pdf) | poster |\n| [Iso-Dream: Isolating Noncontrollable Visual Dynamics in World Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13817.pdf) | poster |\n| [Learning General World Models in a Handful of Reward-Free Deployments](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12719.pdf) | poster |\n| [Learning Robust Dynamics through Variational Sparse Gating](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11698.pdf) | poster |\n| [A Mixture of Surprises for Unsupervised Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06702.pdf) | poster |\n| [Unsupervised Reinforcement Learning with Contrastive Intrinsic Control](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00161.pdf) | poster |\n| [Unsupervised Skill Discovery via Recurrent Skill Training](https:\u002F\u002Fopenreview.net\u002Fpdf?id=sYDX_OxNNjh) | poster |\n| The Pitfalls of Regularizations in Off-Policy TD Learning | poster |\n| Off-Policy Evaluation for Action-Dependent Non-Stationary Environments | poster |\n| [Local Metric Learning for Off-Policy Evaluation in Contextual Bandits with Continuous Actions](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13373.pdf) | poster |\n| [Off-Policy Evaluation with Policy-Dependent Optimization Response](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.12958.pdf) | poster |\n\n\u003Ca id='ICLR23'>\u003C\u002Fa>\n## ICLR23\n| Paper | Type |\n| ---- | ---- |\n| Dichotomy of Control: Separating What You Can Control from What You Cannot | oral |\n| [In-context Reinforcement Learning with Algorithm Distillation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14215.pdf) | oral |\n| Is Conditional Generative Modeling all you need for Decision Making? 
| oral |\n| Offline Q-learning on Diverse Multi-Task Data Both Scales And Generalizes | oral |\n| Confidence-Conditioned Value Functions for Offline Reinforcement Learning | oral |\n| Extreme Q-Learning: MaxEnt RL without Entropy | oral |\n| Sparse Q-Learning: Offline Reinforcement Learning with Implicit Value Regularization | oral |\n| [Transformers are Sample Efficient World Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00588.pdf) | oral | \n| [Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier](https:\u002F\u002Fopenreview.net\u002Fpdf?id=OpC-9aBBVJe) | oral |\n| [Guarded Policy Optimization with Imperfect Online Demonstrations](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.01728.pdf) | spotlight |\n| [Towards Interpretable Deep Reinforcement Learning with Human-Friendly Prototypes](https:\u002F\u002Fopenreview.net\u002Fpdf?id=hWwY_Jq0xsN) | spotlight | \n| Pink Noise Is All You Need: Colored Noise Exploration in Deep Reinforcement Learning | spotlight |\n| [DEP-RL: Embodied Exploration for Reinforcement Learning in Overactuated and Musculoskeletal Systems](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00484.pdf) | spotlight |\n| The In-Sample Softmax for Offline Reinforcement Learning | spotlight |\n| Benchmarking Offline Reinforcement Learning on Real-Robot Hardware | spotlight |\n| [Choreographer: Learning and Adapting Skills in Imagination](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13350.pdf) | spotlight | \n| [Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00030.pdf) | spotlight | \n| Decision Transformer under Random Frame Dropping | poster |\n| Hyper-Decision Transformer for Efficient Online Policy Adaptation  | poster |\n| Preference Transformer: Modeling Human Preferences using Transformers for RL | poster |\n| On the Data-Efficiency with Contrastive Image Transformation in Reinforcement Learning  | poster |\n| Can Agents Run Relay Race with Strangers? 
Generalization of RL to Out-of-Distribution Trajectories | poster |\n| [Performance Bounds for Model and Policy Transfer in Hidden-parameter MDPs](https:\u002F\u002Fopenreview.net\u002Fpdf?id=sSt9fROSZRO) | poster |\n| [Investigating Multi-task Pretraining and Generalization in Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=sSt9fROSZRO) | poster |\n| Priors, Hierarchy, and Information Asymmetry for Skill Transfer in Reinforcement Learning | poster |\n| On the Robustness of Safe Reinforcement Learning under Observational Perturbations | poster |\n| Distributional Meta-Gradient Reinforcement Learning | poster |\n| Conservative Bayesian Model-Based Value Expansion for Offline Policy Optimization | poster |\n| Value Memory Graph: A Graph-Structured World Model for Offline Reinforcement Learning | poster |\n| Efficient Offline Policy Optimization with a Learned Model | poster |\n| [Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.06193.pdf) | poster |\n| [Offline Reinforcement Learning via High-Fidelity Generative Behavior Modeling](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14548.pdf) | poster |\n| Decision S4: Efficient Sequence-Based RL via State Spaces Layers | poster |\n| Behavior Proximal Policy Optimization | poster |\n| Learning Achievement Structure for Structured Exploration in Domains with Sparse Reward | poster |\n| Explaining RL Decisions with Trajectories | poster |\n| User-Interactive Offline Reinforcement Learning | poster |\n| Pareto-Efficient Decision Agents for Offline Multi-Objective Reinforcement Learning | poster |\n| Offline RL for Natural Language Generation with Implicit Language Q Learning | poster |\n| In-sample Actor Critic for Offline Reinforcement Learning | poster |\n| Harnessing Mixed Offline Reinforcement Learning Datasets via Trajectory Weighting | poster |\n| Mind the Gap: Offline Policy Optimization for Imperfect Rewards | poster |\n| [When Data Geometry Meets Deep Function: Generalizing Offline Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.11027.pdf) | poster |\n| MAHALO: Unifying Offline Reinforcement Learning and Imitation Learning from Observations | poster |\n| [Transformer-based World Models Are Happy With 100k Interactions](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.07109.pdf) | poster |\n| [Dynamic Update-to-Data Ratio: Minimizing World Model Overfitting](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.10144.pdf) | poster |\n| [Evaluating Long-Term Memory in 3D Mazes](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13383.pdf) | poster |\n| Making Better Decision by Directly Planning in Continuous Control | poster |\n| HiT-MDP: Learning the SMDP option framework on MDPs with Hidden Temporal Embeddings | poster |\n| Diminishing Return of Value Expansion Methods in Model-Based Reinforcement Learning | poster |\n| [Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08466.pdf) | poster |\n| [SpeedyZero: Mastering Atari with Limited Data and Time](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Mg5CLXZgvLJ) | poster |\n| [Efficient Deep Reinforcement Learning Requires Regulating Statistical Overfitting](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.10466.pdf) | poster |\n| Replay Memory as An Empirical MDP: Combining Conservative Estimation with Experience Replay | poster |\n| [Greedy Actor-Critic: A New Conditional 
Cross-Entropy Method for Policy Improvement](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1810.09103.pdf) | poster |\n| [Reward Design with Language Models](https:\u002F\u002Fopenreview.net\u002Fpdf?id=10uNUgI5Kl) | poster |\n| [Solving Continuous Control via Q-learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12566.pdf) | poster |\n| [Wasserstein Auto-encoded MDPs: Formal Verification of Efficiently Distilled RL Policies with Many-sided Guarantees](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.12558.pdf) | poster |\n| Quality-Similar Diversity via Population Based Reinforcement Learning | poster |\n| [Human-level Atari 200x faster](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07550.pdf) | poster |\n| Policy Expansion for Bridging Offline-to-Online Reinforcement Learning  | poster |\n| [Improving Deep Policy Gradients with Value Function Search](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.10145.pdf) | poster |\n| [Memory Gym: Partially Observable Challenges to Memory-Based Agents](https:\u002F\u002Fopenreview.net\u002Fpdf?id=jHc8dCx6DDr) | poster |\n| [Hybrid RL: Using both offline and online data can make RL efficient](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06718.pdf) | poster |\n| [POPGym: Benchmarking Partially Observable Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.01859.pdf) | poster |\n| [Critic Sequential Monte Carlo](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.15460.pdf) | poster |\n| Revocable Deep Reinforcement Learning with Affinity Regularization for Outlier-Robust Graph Matching | poster |\n| Provable Unsupervised Data Sharing for Offline Reinforcement Learning | poster |\n| Discovering Policies with DOMiNO: Diversity Optimization Maintaining Near Optimality | poster |\n| [Latent Variable Representation for Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08765.pdf) | poster |\n| Spectral Decomposition Representation for Reinforcement Learning | poster |\n| Behavior Prior Representation learning for Offline Reinforcement Learning | poster |\n| [Become a Proficient Player with Limited Data through Watching Pure Videos](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Sy-o2N0hF4f) | poster |\n| Variational Latent Branching Model for Off-Policy Evaluation | poster |\n\n\u003Ca id='ICML23'>\u003C\u002Fa>\n## ICML23\n| Paper | Type |\n| ---- | ---- |\n| [On the Power of Pre-training for Generalization in RL: Provable Benefits and Hardness](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10464.pdf) | oral |\n| [AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01877.pdf) | oral |\n| [Reparameterized Policy Learning for Multimodal Trajectory Optimization](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.10710.pdf) | oral |\n| [Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12016.pdf) | oral |\n| [The Dormant Neuron Phenomenon in Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12902.pdf) | oral |\n| [Efficient RL via Disentangled Environment and Agent Representations](https:\u002F\u002Fopenreview.net\u002Fpdf?id=kWS8mpioS9) | oral |\n| [On the Statistical Benefits of Temporal Difference Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13289.pdf) | oral |\n| Warm-Start Actor-Critic: From Approximation Error to Sub-optimality Gap | oral |\n| Reinforcement Learning from Passive Data via Latent Intentions | oral |\n| Subequivariant Graph Reinforcement 
Learning in 3D Environments | oral |\n| Representation Learning with Multi-Step Inverse Kinematics: An Efficient and Optimal Approach to Rich-Observation RL | oral |\n| Flipping Coins to Estimate Pseudocounts for Exploration in Reinforcement Learning | oral |\n| [Settling the Reward Hypothesis](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10420.pdf) | oral |\n| Information-Theoretic State Space Model for Multi-View Reinforcement Learning | oral |\n| [Learning Belief Representations for Partially Observable Deep RL](https:\u002F\u002Fopenreview.net\u002Fpdf?id=4IzEmHLono) | poster |\n| [Internally Rewarded Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00270.pdf) | poster |\n| Active Policy Improvement from Multiple Black-box Oracles | poster |\n| When is Realizability Sufficient for Off-Policy Reinforcement Learning? | poster |\n| The Statistical Benefits of Quantile Temporal-Difference Learning for Value Estimation | poster |\n| [Hyperparameters in Reinforcement Learning and How To Tune Them](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.01324.pdf) | poster |\n| Langevin Thompson Sampling with Logarithmic Communication: Bandits and Reinforcement Learning | poster |\n| [Correcting discount-factor mismatch in on-policy policy gradient methods](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.13284.pdf) | poster |\n| Masked Trajectory Models for Prediction, Representation, and Control | poster |\n| Off-Policy Average Reward Actor-Critic with Deterministic Policy Search | poster |\n| TGRL: An Algorithm for Teacher Guided Reinforcement Learning | poster |\n| LIV: Language-Image Representations and Rewards for Robotic Control | poster |\n| Stein Variational Goal Generation for adaptive Exploration in Multi-Goal Reinforcement Learning | poster |\n| Emergence of Adaptive Circadian Rhythms in Deep Reinforcement Learning | poster |\n| Explaining Reinforcement Learning with Shapley Values | poster |\n| [Reinforcement Learning Can Be More Efficient with Multiple Rewards](https:\u002F\u002Fopenreview.net\u002Fpdf?id=skDVsmXjPR) | poster |\n| [Performative Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00046.pdf) | poster |\n| Truncating Trajectories in Monte Carlo Reinforcement Learning | poster |\n| ReLOAD: Reinforcement Learning with Optimistic Ascent-Descent for Last-Iterate Convergence in Constrained MDPs | poster |\n| Low-Switching Policy Gradient with Exploration via Online Sensitivity Sampling | poster |\n| Hyperbolic Diffusion Embedding and Distance for Hierarchical Representation Learning | poster |\n| Revisiting Domain Randomization via Relaxed State-Adversarial Policy Optimization | poster |\n| Parallel $Q$-Learning: Scaling Off-policy Reinforcement Learning under Massively Parallel Simulation | poster |\n| LESSON: Learning to Integrate Exploration Strategies for Reinforcement Learning via an Option Framework | poster |\n| Graph Reinforcement Learning for Network Control via Bi-Level Optimization | poster |\n| Stochastic Policy Gradient Methods: Improved Sample Complexity for Fisher-non-degenerate Policies | poster |\n| [Reinforcement Learning with History Dependent Dynamic Contexts](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02061.pdf) | poster |\n| Efficient Online Reinforcement Learning with Offline Data | poster |\n| Variance Control for Distributional Reinforcement Learning | poster |\n| 
Hindsight Learning for MDPs with Exogenous Inputs | poster |\n| RLang: A Declarative Language for Describing Partial World Knowledge to Reinforcement Learning Agents | poster |\n| Scalable Safe Policy Improvement via Monte Carlo Tree Search | poster |\n| Bayesian Reparameterization of Reward-Conditioned Reinforcement Learning with Energy-based Models | poster |\n| Understanding the Complexity Gains of Single-Task RL with a Curriculum | poster |\n| PPG Reloaded: An Empirical Study on What Matters in Phasic Policy Gradient | poster |\n| [On Many-Actions Policy Gradient](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13011.pdf) | poster |\n| Multi-task Hierarchical Adversarial Inverse Reinforcement Learning | poster |\n| Cell-Free Latent Go-Explore | poster |\n| Trustworthy Policy Learning under the Counterfactual No-Harm Criterion | poster |\n| Reachability-Aware Laplacian Representation in Reinforcement Learning | poster |\n| Interactive Object Placement with Reinforcement Learning | poster |\n| Leveraging Offline Data in Online Reinforcement Learning | poster |\n| Reinforcement Learning with General Utilities: Simpler Variance Reduction and Large State-Action Space | poster |\n| DoMo-AC: Doubly Multi-step Off-policy Actor-Critic Algorithm | poster |\n| [Scaling Laws for Reward Model Overoptimization](https:\u002F\u002Fopenreview.net\u002Fattachment?id=bBLjms8nZE&name=pdf) |  poster  |\n| SNeRL: Semantic-aware Neural Radiance Fields for Reinforcement Learning | poster |\n| Set-membership Belief State-based Reinforcement Learning for POMDPs | poster |\n| Robust Satisficing MDPs | poster |\n| Off-Policy Evaluation for Large Action Spaces via Conjunct Effect Modeling | poster |\n| Quantum Policy Gradient Algorithm with Optimized Action Decoding | poster |\n| For Pre-Trained Vision Models in Motor Control, Not All Policy Learning Methods are Created Equal | poster |\n| Model-Free Robust Average-Reward Reinforcement Learning | poster |\n| Fair and Robust Estimation of Heterogeneous Treatment Effects for Policy Learning | poster |\n| Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning | poster |\n| Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons | poster |\n| Social learning spontaneously emerges by searching optimal heuristics with deep reinforcement learning | poster |\n| [Bigger, Better, Faster: Human-level Atari with human-level efficiency](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.19452.pdf) | poster |\n| [Posterior Sampling for Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.00477.pdf) | poster |\n| [Model-based Reinforcement Learning with Scalable Composite Policy Gradient Estimators](https:\u002F\u002Fopenreview.net\u002Fpdf?id=rDMAJECBM2) | poster |\n| [Go Beyond Imagination: Maximizing Episodic Reachability with World Models](https:\u002F\u002Fopenreview.net\u002Fpdf?id=JsAMuzA9o2) | poster |\n| [Simplified Temporal Consistency Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.09466.pdf) | poster |\n| [Do Embodied Agents Dream of Pixelated Sheep: Embodied Decision Making using Language Guided World Modelling](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12050.pdf) | poster |\n| Demonstration-free Autonomous Reinforcement Learning via Implicit and Bidirectional Curriculum | poster |\n| [Curious Replay for Model-based Adaptation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.15934.pdf) | poster |\n| [Multi-View Masked World Models for Visual Robotic 
Manipulation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02408.pdf) | poster |\n| [Automatic Intrinsic Reward Shaping for Exploration in Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.10886.pdf) | poster |\n| [Curiosity in Hindsight: Intrinsic Exploration in Stochastic Environments](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10515.pdf) | poster |\n| Representations and Exploration for Deep Reinforcement Learning using Singular Value Decomposition | poster |\n| [Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02662.pdf) | poster |\n| Distilling Internet-Scale Vision-Language Models into Embodied Agents | poster |\n| VIMA: Robot Manipulation with Multimodal Prompts | poster |\n| Future-conditioned Unsupervised Pretraining for Decision Transformer | poster |\n| Emergent Agentic Transformer from Chain of Hindsight Experience | poster |\n| [The Benefits of Model-Based Generalization in Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Vue1ulwlPD) | poster |\n| [Multi-Environment Pretraining Enables Transfer to Action Limited Datasets](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13337.pdf) | poster |\n| [On Pre-Training for Visuo-Motor Control: Revisiting a Learning-from-Scratch Baseline](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.05749.pdf) | poster |\n| Unsupervised Skill Discovery for Learning Shared Structures across Changing Environments | poster |\n| An Investigation into Pre-Training Object-Centric Representations for Reinforcement Learning | poster |\n| Guiding Pretraining in Reinforcement Learning with Large Language Models | poster |\n| What is Essential for Unseen Goal Generalization of Offline Goal-conditioned RL? 
| poster |\n| Online Prototype Alignment for Few-shot Policy Transfer | poster |\n| Detecting Adversarial Directions in Deep Reinforcement Learning to Make Robust Decisions | poster |\n| Robust Situational Reinforcement Learning in Face of Context Disturbances | poster |\n| Adversarial Learning of Distributional Reinforcement Learning | poster |\n| Towards Robust and Safe Reinforcement Learning with Benign Off-policy Data | poster |\n| Simple Embodied Language Learning as a Byproduct of Meta-Reinforcement Learning | poster |\n| [ContraBAR: Contrastive Bayes-Adaptive Deep RL](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.02418.pdf) | poster |\n| Model-based Offline Reinforcement Learning with Count-based Conservatism | poster |\n| Model-Bellman Inconsistency for Model-based Offline Reinforcement Learning | poster |\n| [Learning Temporally Abstract World Models without Online Experimentation](https:\u002F\u002Fopenreview.net\u002Fpdf?id=YeTYJz7th5) | poster |\n| Contrastive Energy Prediction for Exact Energy-Guided Diffusion Sampling in Offline Reinforcement Learning | poster |\n| MetaDiffuser: Diffusion Model as Conditional Planner for Offline Meta-RL | poster |\n| Actor-Critic Alignment for Offline-to-Online Reinforcement Learning | poster |\n| Semi-Supervised Offline Reinforcement Learning with Action-Free Trajectories | poster |\n| Principled Offline RL in the Presence of Rich Exogenous Information | poster |\n| Offline Meta Reinforcement Learning with In-Distribution Online Adaptation | poster |\n| Policy Regularization with Dataset Constraint for Offline Reinforcement Learning | poster |\n| Supported Trust Region Optimization for Offline Reinforcement Learning | poster |\n| Constrained Decision Transformer for Offline Safe Reinforcement Learning | poster |\n| PAC-Bayesian Offline Contextual Bandits With Guarantees | poster |\n| Beyond Reward: Offline Preference-guided Policy Optimization | poster |\n| Offline Reinforcement Learning with Closed-Form Policy Improvement Operators | poster |\n| ChiPFormer: Transferable Chip Placement via Offline Decision Transformer | poster |\n| Boosting Offline Reinforcement Learning with Action Preference Query | poster |\n| [Jump-Start Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.02372.pdf) | poster |\n| Investigating the role of model-based learning in exploration and transfer | poster |\n| [STEERING : Stein Information Directed Exploration for Model-Based Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12038.pdf) | poster |\n| [Predictable MDP Abstraction for Unsupervised Model-Based RL](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03921.pdf) | poster |\n| The Virtues of Laziness in Model-based RL: A Unified Objective and Algorithms | poster |\n| [On the Importance of Feature Decorrelation for Unsupervised Representation Learning in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05637.pdf) | poster |\n| CLUTR: Curriculum Learning via Unsupervised Task Representation Learning | poster |\n| [Controllability-Aware Unsupervised Skill Discovery](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.05103.pdf) | poster |\n| [Behavior Contrastive Learning for Unsupervised Skill Discovery](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.04477.pdf) | poster |\n| Variational Curriculum Reinforcement Learning for Unsupervised Discovery of Skills | poster |\n| [Bootstrapped Representations in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.10171.pdf) | poster |\n| 
[Representation-Driven Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.19922.pdf) | poster |\n| Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation | poster |\n| An Instrumental Variable Approach to Confounded Off-Policy Evaluation | poster |\n| Semiparametrically Efficient Off-Policy Evaluation in Linear Markov Decision Processes | poster |\n\n\u003Ca id='NeurIPS23'>\u003C\u002Fa>\n## NeurIPS23\n| Paper | Type |\n| ---- | ---- |\n| Learning Generalizable Agents via Saliency-guided Features Decorrelation | oral |\n| Understanding Expertise through Demonstrations: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning | oral |\n| When Demonstrations meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning | oral |\n| DiffuseBot: Breeding Soft Robots With Physics-Augmented Generative Diffusion Models | oral |\n| When Do Transformers Shine in RL? Decoupling Memory from Credit Assignment | oral |\n| Bridging RL Theory and Practice with the Effective Horizon | oral |\n| SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks | spotlight |\n| RePo: Resilient Model-Based Reinforcement Learning by Regularizing Posterior Predictability | spotlight |\n| Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration | spotlight |\n| [Conditional Mutual Information for Disentangled Representations in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.14133.pdf) | spotlight |\n| Optimistic Natural Policy Gradient: a Simple Efficient Policy Optimization Framework for Online RL | spotlight |\n| Double Gumbel Q-Learning | spotlight |\n| Future-Dependent Value-Based Off-Policy Evaluation in POMDPs | spotlight |\n| Supervised Pretraining Can Learn In-Context Reinforcement Learning | spotlight |\n| Train Once, Get a Family: State-Adaptive Balances for Offline-to-Online Reinforcement Learning | spotlight |\n| Constraint-Conditioned Policy Optimization for Versatile Safe Reinforcement Learning | poster |\n| Explore to Generalize in Zero-Shot RL | poster |\n| Dynamics Generalisation in Reinforcement Learning via Adaptive Context-Aware Policies | poster |\n| Reining Generalization in Offline Reinforcement Learning via Representation Distinction | poster |\n| Contrastive Retrospection: honing in on critical steps for rapid learning and generalization in RL | poster |\n| Doubly Robust Augmented Transfer for Meta-Reinforcement Learning | poster |\n| Recurrent Hypernetworks are Surprisingly Strong in Meta-RL | poster |\n| Parameterizing Non-Parametric Meta-Reinforcement Learning Tasks via Subtask Decomposition | poster |\n| One Risk to Rule Them All: A Risk-Sensitive Perspective on Model-Based Offline Reinforcement Learning | poster |\n| Efficient Diffusion Policies For Offline Reinforcement Learning | poster |\n| Learning to Influence Human Behavior with Offline Reinforcement Learning | poster |\n| Design from Policies: Conservative Test-Time Adaptation for Offline Policy Optimization | poster |\n| SafeDICE: Offline Safe Imitation Learning with Non-Preferred Demonstrations | poster |\n| Constrained Policy Optimization with 
Explicit Behavior Density For Offline Reinforcement Learning | poster |\n| Conservative State Value Estimation for Offline Reinforcement Learning | poster |\n| Offline RL with Discrete Proxy Representations for Generalizability in POMDPs | poster |\n| Context Shift Reduction for Offline Meta-Reinforcement Learning | poster |\n| Mutual Information Regularized Offline Reinforcement Learning | poster |\n| Recovering from Out-of-sample States via Inverse Dynamics in Offline Reinforcement Learning | poster |\n| Percentile Criterion Optimization in Offline Reinforcement Learning | poster |\n| Language Models Meet World Models: Embodied Experiences Enhance Language Models | poster |\n| Action Inference by Maximising Evidence: Zero-Shot Imitation from Observation with World Models | poster |\n| [Facing off World Model Backbones: RNNs, Transformers, and S4](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.02064.pdf) | poster |\n| Efficient Exploration in Continuous-time Model-based Reinforcement Learning | poster |\n| Model-Based Reparameterization Policy Gradient Methods: Theory and Practical Algorithms | poster |\n| [Learning to Discover Skills through Guidance](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.20178.pdf) | poster |\n| Creating Multi-Level Skill Hierarchies in Reinforcement Learning | poster |\n| Unsupervised Behavior Extraction via Random Intent Priors | poster |\n| [MIMEx: Intrinsic Rewards from Masked Input Modeling](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.08932.pdf) | poster |\n| f-Policy Gradients: A General Framework for Goal-Conditioned RL using f-Divergences | poster |\n| Prediction and Control in Continual Reinforcement Learning | poster |\n| Residual Q-Learning: Offline and Online Policy Customization without Value | poster |\n| Small batch deep reinforcement learning | poster |\n| Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs | poster |\n| Is RLHF More Difficult than Standard RL? 
A Theoretical Perspective | poster |\n| Reflexion: language agents with verbal reinforcement learning | poster |\n| Generative Modelling of Stochastic Actions with Arbitrary Constraints in Reinforcement Learning | poster |\n| Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning | poster |\n| Direct Preference-based Policy Optimization without Reward Modeling | poster |\n| Learning to Modulate pre-trained Models in RL | poster |\n| Ignorance is Bliss: Robust Control via Information Gating | poster |\n| Marginal Density Ratio for Off-Policy Evaluation in Contextual Bandits | poster |\n| Model-Free Reinforcement Learning with the Decision-Estimation Coefficient | poster |\n| Optimal and Fair Encouragement Policy Evaluation and Learning | poster |\n| BIRD: Generalizable Backdoor Detection and Removal for Deep Reinforcement Learning | poster |\n| Probabilistic Inference in Reinforcement Learning Done Right | poster |\n| Reference-Based POMDPs | poster |\n| Persuading Farsighted Receivers in MDPs: the Power of Honesty | poster |\n| Distributional Policy Evaluation: a Maximum Entropy approach to Representation Learning | poster |\n| Structured State Space Models for In-Context Reinforcement Learning | poster |\n| An Alternative to Variance: Gini Deviation for Risk-averse Policy Gradient | poster |\n| Distributional Model Equivalence for Risk-Sensitive Reinforcement Learning | poster |\n| PLASTIC: Improving Input and Label Plasticity for Sample Efficient Reinforcement Learning | poster |\n| Hybrid Policy Optimization from Imperfect Demonstrations | poster |\n| Policy Optimization in a Noisy Neighborhood: On Return Landscapes in Continuous Control | poster |\n| Semantic HELM: A Human-Readable Memory for Reinforcement Learning | poster |\n| A Definition of Continual Reinforcement Learning | poster |\n| Fast Bellman Updates for Wasserstein Distributionally Robust MDPs | poster |\n| Policy Gradient for Rectangular Robust Markov Decision Processes | poster |\n| Discovering Hierarchical Achievements in Reinforcement Learning via Contrastive Learning | poster |\n| Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach | poster |\n| Model-Free Active Exploration in Reinforcement Learning | poster |\n| Self-Supervised Reinforcement Learning that Transfers using Random Features | poster |\n| FlowPG: Action-constrained Policy Gradient with Normalizing Flows | poster |\n| Flexible Attention-Based Multi-Policy Fusion for Efficient Deep Reinforcement Learning | poster |\n| ODE-based Recurrent Model-free Reinforcement Learning for POMDPs | poster |\n| Suggesting Variable Order for Cylindrical Algebraic Decomposition via Reinforcement Learning | poster |\n| SPQR: Controlling Q-ensemble Independence with Spiked Random Model for Reinforcement Learning | poster |\n| CaMP: Causal Multi-policy Planning for Interactive Navigation in Multi-room Scenes | poster |\n| POMDP Planning for Object Search in Partially Unknown Environment | poster |\n| Unified Off-Policy Learning to Rank: a Reinforcement Learning Perspective | poster |\n| Natural Actor-Critic for Robust Reinforcement Learning with Function Approximation | poster |\n| A Long $N$-step Surrogate Stage Reward for Deep Reinforcement Learning | poster |\n| State-Action Similarity-Based Representations for Off-Policy Evaluation | poster |\n| Weakly Coupled Deep Q-Networks | poster |\n| Large Language Models Are Semi-Parametric Reinforcement Learning Agents | poster |\n| The Benefits of Being 
Distributional: Small-Loss Bounds for Reinforcement Learning | poster |\n| Online Nonstochastic Model-Free Reinforcement Learning | poster |\n| When is Agnostic Reinforcement Learning Statistically Tractable? | poster |\n| Bayesian Risk-Averse Q-Learning with Streaming Observations | poster |\n| Resetting the Optimizer in Deep RL: An Empirical Study | poster |\n| Optimistic Exploration in Reinforcement Learning Using Symbolic Model Estimates | poster |\n| Performance Bounds for Policy-Based Average Reward Reinforcement Learning Algorithms | poster |\n| Regularity as Intrinsic Reward for Free Play | poster |\n| TACO: Temporal Latent Action-Driven Contrastive Loss for Visual Reinforcement Learning | poster |\n| Policy Optimization for Continuous Reinforcement Learning | poster |\n| Active Observing in Continuous-time Control | poster |\n| Replicable Reinforcement Learning | poster |\n| On the Importance of Exploration for Generalization in Reinforcement Learning | poster |\n| Monte Carlo Tree Search with Boltzmann Exploration | poster |\n| Iterative Reachability Estimation for Safe Reinforcement Learning | poster |\n| Discovering General Reinforcement Learning Algorithms with Adversarial Environment Design | poster |\n| Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence? | poster |\n| Inverse Dynamics Pretraining Learns Good Representations for Multitask Imitation | poster |\n| Interpretable Reward Redistribution in Reinforcement Learning: A Causal Approach | poster |\n| Contrastive Modules with Temporal Attention for Multi-Task Reinforcement Learning | poster |\n| Sample-Efficient and Safe Deep Reinforcement Learning via Reset Deep Ensemble Agents | poster |\n| Distributional Pareto-Optimal Multi-Objective Reinforcement Learning | poster |\n| Efficient Policy Adaptation with Contrastive Prompt Ensemble for Embodied Agents | poster |\n| Efficient Potential-based Exploration in Reinforcement Learning using Inverse Dynamic Bisimulation Metric | poster |\n| Iteratively Learn Diverse Strategies with State Distance Information | poster |\n| Accelerating Reinforcement Learning with Value-Conditional State Entropy Exploration | poster |\n| Gradient Informed Proximal Policy Optimization | poster |\n| The Curious Price of Distributional Robustness in Reinforcement Learning with a Generative Model | poster |\n| [Optimal Treatment Allocation for Efficient Policy Evaluation in Sequential Decision Making](https:\u002F\u002Fopenreview.net\u002Fattachment?id=EcReRm7q9p&name=pdf) | poster |\n| [Thinker: Learning to Plan and Act](https:\u002F\u002Fopenreview.net\u002Fattachment?id=mumEBl0arj&name=pdf) | poster |\n| [Learning Better with Less: Effective Augmentation for Sample-Efficient Visual Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=jRL6ErxMVB&name=pdf) | poster |\n| [Reinforcement Learning with Simple Sequence Priors](https:\u002F\u002Fopenreview.net\u002Fattachment?id=qxF8Pge6vM&name=pdf) | poster |\n| [Can Pre-Trained Text-to-Image Models Generate Visual Goals for Reinforcement Learning?](https:\u002F\u002Fopenreview.net\u002Fattachment?id=kChEBODIx9&name=pdf) | poster |\n| [Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets](https:\u002F\u002Fopenreview.net\u002Fattachment?id=TW99HrZCJU&name=pdf) | poster |\n| [CQM: Curriculum Reinforcement Learning with a Quantized World Model](https:\u002F\u002Fopenreview.net\u002Fattachment?id=tcotyjon2a&name=pdf) | poster |\n| [H-InDex: Visual Reinforcement Learning 
with Hand-Informed Representations for Dexterous Manipulation](https:\u002F\u002Fopenreview.net\u002Fattachment?id=lvvaNwnP6M&name=pdf) | poster |\n| [Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=GcEIvidYSw&name=pdf) | poster |\n| [Anytime-Competitive Reinforcement Learning with Policy Prior](https:\u002F\u002Fopenreview.net\u002Fattachment?id=FCwfZj1bQl&name=pdf) | poster |\n| [Budgeting Counterfactual for Offline RL](https:\u002F\u002Fopenreview.net\u002Fattachment?id=1MUxtSBUox&name=pdf) | poster |\n| [Fractal Landscapes in Policy Optimization](https:\u002F\u002Fopenreview.net\u002Fattachment?id=QQidjdmyPp&name=pdf) | poster |\n| [Goal-Conditioned Predictive Coding for Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=IJblKO45YU&name=pdf) | poster |\n| [For SALE: State-Action Representation Learning for Deep Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=xZvGrzRq17&name=pdf) | poster |\n| [Inverse Reinforcement Learning with the Average Reward Criterion](https:\u002F\u002Fopenreview.net\u002Fattachment?id=YFSrf8aciU&name=pdf) | poster |\n| [Revisiting the Minimalist Approach to Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=vqGWslLeEw&name=pdf) | poster |\n| [Adversarial Model for Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=6UCMa0Qgej&name=pdf) | poster |\n| [Supported Value Regularization for Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=fze7P9oy6l) | poster |\n| [PID-Inspired Inductive Biases for Deep Reinforcement Learning in Partially Observable Control Tasks](https:\u002F\u002Fopenreview.net\u002Fattachment?id=pKnhUWqZTJ&name=pdf) | poster |\n| [How to Fine-tune the Model: Unified Model Shift and Model Bias Policy Optimization](https:\u002F\u002Fopenreview.net\u002Fattachment?id=d7a5TpePV7&name=pdf) | poster |\n| [Learning from Visual Observation via Offline Pretrained State-to-Go Transformer](https:\u002F\u002Fopenreview.net\u002Fattachment?id=E58gaxJN1d&name=pdf) | poster |\n| [Describe, Explain, Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agents](https:\u002F\u002Fopenreview.net\u002Fattachment?id=KtvPdGb31Z&name=pdf) | poster |\n| [Robust Knowledge Transfer in Tiered Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=1WMdoiVMov&name=pdf) | poster |\n| [Train Hard, Fight Easy: Robust Meta Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=JvOZ4IIjwP&name=pdf) | poster |\n| [Task-aware world model learning with meta weighting via bi-level optimization](https:\u002F\u002Fopenreview.net\u002Fattachment?id=IN3hQx1BrC&name=pdf) | poster |\n| [Video Prediction Models as Rewards for Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=HWNl9PAYIP&name=pdf) | poster |\n| [Synthetic Experience Replay](https:\u002F\u002Fopenreview.net\u002Fattachment?id=6jNQ1AY1Uf&name=pdf) | poster |\n| [Policy Finetuning in Reinforcement Learning via Design of Experiments using Offline Data](https:\u002F\u002Fopenreview.net\u002Fattachment?id=fjXTcUUgaC&name=pdf) | poster |\n| [Learning Dynamic Attribute-factored World Models for Efficient Multi-object Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=bsNslV3Ahe&name=pdf) | poster |\n| [Learning World Models with Identifiable 
Factorization](https:\u002F\u002Fopenreview.net\u002Fattachment?id=6JJq5TW9Mc&name=pdf) | poster |\n| [Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=8GuEVzAUQS&name=pdf) | poster |\n| [Inverse Preference Learning: Preference-based RL without a Reward Function](https:\u002F\u002Fopenreview.net\u002Fattachment?id=gAP52Z2dar&name=pdf) | poster |\n| [Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL](https:\u002F\u002Fopenreview.net\u002Fattachment?id=71P7ugOGCV&name=pdf) | poster |\n| [Latent exploration for Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=JSVXZKqfLU&name=pdf) | poster |\n| [Large Language Models can Implement Policy Iteration](https:\u002F\u002Fopenreview.net\u002Fattachment?id=LWxjWoBTsr&name=pdf) | poster |\n| [Generalized Weighted Path Consistency for Mastering Atari Games](https:\u002F\u002Fopenreview.net\u002Fattachment?id=vHRLS8HhK1&name=pdf) | poster |\n| [Learning Environment-Aware Affordance for 3D Articulated Object Manipulation under Occlusions](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Re2NHYoZ5l&name=pdf) | poster |\n| [Accelerating Value Iteration with Anchoring](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Wn82NbmvJy&name=pdf) | poster |\n| [Reduced Policy Optimization for Continuous Control with Hard Constraints](https:\u002F\u002Fopenreview.net\u002Fattachment?id=fKVEMNmWqU&name=pdf) | poster |\n| [State Regularized Policy Optimization on Data with Dynamics Shift](https:\u002F\u002Fopenreview.net\u002Fattachment?id=I8t9RKDnz2&name=pdf) | poster |\n| [Offline Reinforcement Learning with Differential Privacy](https:\u002F\u002Fopenreview.net\u002Fattachment?id=YVMc3KiWBQ&name=pdf) | poster |\n| [Understanding and Addressing the Pitfalls of Bisimulation-based Representations in Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=sQyRQjun46&name=pdf) | poster |\n\n\n\u003Ca id='ICLR24'>\u003C\u002Fa>\n## ICLR24\n| Paper | Type |\n| ---- | ---- |\n| [Predictive auxiliary objectives in deep RL mimic learning in the brain](https:\u002F\u002Fopenreview.net\u002Fpdf?id=agPpmEgf8C) | oral |\n| [Pre-Training Goal-based Models for Sample-Efficient Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=o2IEmeLL9r) | oral |\n| [METRA: Scalable Unsupervised RL with Metric-Aware Abstraction](https:\u002F\u002Fopenreview.net\u002Fpdf?id=c5pwL0Soay) | oral |\n| [ASID: Active Exploration for System Identification and Reconstruction in Robotic Manipulation](https:\u002F\u002Fopenreview.net\u002Fpdf?id=jNR6s6OSBT) | oral |\n| [Mastering Memory Tasks with World Models](https:\u002F\u002Fopenreview.net\u002Fpdf?id=1vDArHJ68h) | oral |\n| [Generalized Policy Iteration using Tensor Approximation for Hybrid Control](https:\u002F\u002Fopenreview.net\u002Fpdf?id=csukJcpYDe) | spotlight |\n| [Selective Visual Representations Improve Convergence and Generalization for Embodied AI](https:\u002F\u002Fopenreview.net\u002Fpdf?id=kC5nZDU5zf) | spotlight |\n| [AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents](https:\u002F\u002Fopenreview.net\u002Fpdf?id=M6XWoEdmwf) | spotlight |\n| [Confronting Reward Model Overoptimization with Constrained RLHF](https:\u002F\u002Fopenreview.net\u002Fpdf?id=gkfUvn0fLU) | spotlight |\n| [Proximal Policy Gradient Arborescence for Quality Diversity Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=TFKIfhvdmZ) 
| spotlight |\n| [Improving Offline RL by Blending Heuristics](https:\u002F\u002Fopenreview.net\u002Fpdf?id=MCl0TLboP1) | spotlight |\n| [Decision ConvFormer: Local Filtering in MetaFormer is Sufficient for Decision Making](https:\u002F\u002Fopenreview.net\u002Fpdf?id=af2c8EaKl8) | spotlight |\n| [Tool-Augmented Reward Modeling](https:\u002F\u002Fopenreview.net\u002Fpdf?id=d94x0gWTUX) | spotlight |\n| [Reward-Consistent Dynamics Models are Strongly Generalizable for Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=GSBHKiw19c) | spotlight |\n| [Dual RL: Unification and New Methods for Reinforcement and Imitation Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=xt9Bu66rqv) | spotlight |\n| [Stabilizing Contrastive RL: Techniques for Robotic Goal Reaching from Offline Data](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Xkf2EBj4w3) | spotlight |\n| [Safe RLHF: Safe Reinforcement Learning from Human Feedback](https:\u002F\u002Fopenreview.net\u002Fpdf?id=TyFrPOKYXw) | spotlight |\n| [Cross$Q$: Batch Normalization in Deep Reinforcement Learning for Greater Sample Efficiency and Simplicity](https:\u002F\u002Fopenreview.net\u002Fpdf?id=PczQtTsTIX) | spotlight |\n| [Blending Imitation and Reinforcement Learning for Robust Policy Improvement](https:\u002F\u002Fopenreview.net\u002Fpdf?id=eJ0dzPJq1F) | spotlight |\n| [Unlocking the Power of Representations in Long-term Novelty-based Exploration](https:\u002F\u002Fopenreview.net\u002Fpdf?id=OwtMhMSybu) | spotlight |\n| [Spatially-Aware Transformers for Embodied Agents](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Ts95eXsPBc) | spotlight |\n| [Learning to Act without Actions](https:\u002F\u002Fopenreview.net\u002Fpdf?id=rvUq3cxpDF) | spotlight |\n| [Towards Principled Representation Learning from Videos for Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=3mnWvUZIXt) | spotlight |\n| [TorchRL: A data-driven decision-making library for PyTorch](https:\u002F\u002Fopenreview.net\u002Fpdf?id=QxItoEAVMb) | spotlight |\n| [Towards Robust Offline Reinforcement Learning under Diverse Data Corruption](https:\u002F\u002Fopenreview.net\u002Fpdf?id=5hAMmCU0bK) | spotlight |\n| [Learning Hierarchical World Models with Adaptive Temporal Abstractions from Discrete Latent Dynamics](https:\u002F\u002Fopenreview.net\u002Fpdf?id=TjCDNssXKU) | spotlight |\n| [Text2Reward: Reward Shaping with Language Models for Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=tUM39YTRxH) | spotlight |\n| [Robotic Task Generalization via Hindsight Trajectory Sketches](https:\u002F\u002Fopenreview.net\u002Fpdf?id=F1TKzG8LJO) | spotlight |\n| [Submodular Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=loYSzjSaAK) | spotlight |\n| [Query-Policy Misalignment in Preference-Based Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=UoBymIwPJR) | spotlight |\n| [Kernel Metric Learning for In-Sample Off-Policy Evaluation of Deterministic RL Policies](https:\u002F\u002Fopenreview.net\u002Fpdf?id=plebgsdiiV) | spotlight |\n| [GenSim: Generating Robotic Simulation Tasks via Large Language Models](https:\u002F\u002Fopenreview.net\u002Fpdf?id=OI3RoHoWAN) | spotlight |\n| [Entity-Centric Reinforcement Learning for Object Manipulation from Pixels](https:\u002F\u002Fopenreview.net\u002Fpdf?id=uDxeSZ1wdI) | spotlight |\n| [Illusory Attacks: Detectability Matters in Adversarial Attacks on Sequential Decision-Makers](https:\u002F\u002Fopenreview.net\u002Fpdf?id=F5dhGCdyYh) | spotlight |\n| [Addressing 
Signal Delay in Deep Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Z8UfDs4J46) | spotlight |\n| [DrM: Mastering Visual Reinforcement Learning through Dormant Ratio Minimization](https:\u002F\u002Fopenreview.net\u002Fpdf?id=MSe8YFbhUE) | spotlight |\n| [Task Adaptation from Skills: Information Geometry, Disentanglement, and New Objectives for Unsupervised Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=zSxpnKh1yS) | spotlight |\n| [$\\mathcal{B}$-Coder: On Value-Based Deep Reinforcement Learning for Program Synthesis](https:\u002F\u002Fopenreview.net\u002Fpdf?id=fLf589bx1f) | spotlight |\n| [Physics-Regulated Deep Reinforcement Learning: Invariant Embeddings](https:\u002F\u002Fopenreview.net\u002Fpdf?id=5Dwqu5urzs) | spotlight |\n| [Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization](https:\u002F\u002Fopenreview.net\u002Fpdf?id=KOZu91CzbK) | spotlight |\n| [Learning to Act from Actionless Videos through Dense Correspondences](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Mhb5fpA1T0) | spotlight |\n| [CivRealm: A Learning and Reasoning Odyssey in Civilization for Decision-Making Agents](https:\u002F\u002Fopenreview.net\u002Fpdf?id=UBVNwD3hPN) | spotlight |\n| [TD-MPC2: Scalable, Robust World Models for Continuous Control](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Oxh5CstDJU) | spotlight |\n| [Universal Humanoid Motion Representations for Physics-Based Control](https:\u002F\u002Fopenreview.net\u002Fpdf?id=OrOd8PxOO2) | spotlight |\n| [Adaptive Rational Activations to Boost Deep Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=g90ysX1sVs) | spotlight |\n| [Robust Adversarial Reinforcement Learning via Bounded Rationality Curricula](https:\u002F\u002Fopenreview.net\u002Fpdf?id=pFOoOdaiue) | spotlight |\n| [Locality Sensitive Sparse Encoding for Learning World Models Online](https:\u002F\u002Fopenreview.net\u002Fpdf?id=i8PjQT3Uig) | poster |\n| [On Representation Complexity of Model-based and Model-free Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=3K3s9qxSn7) | poster |\n| [Policy Rehearsing: Training Generalizable Policies for Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=m3xVPaZp6Z) | poster |\n| [What Matters to You? 
Towards Visual Representation Alignment for Robot Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=CTlUHIKF71) | poster |\n| [Improving Language Models with Advantage-based Offline Policy Gradients](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ZDGKPbF0VQ) | poster |\n| [Training Diffusion Models with Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=YCWjhGrJFD) | poster |\n| [The Trickle-down Impact of Reward Inconsistency on RLHF](https:\u002F\u002Fopenreview.net\u002Fpdf?id=MeHmwCDifc) | poster |\n| [Maximum Entropy Model Correction in Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=kNpSUN0uCc) | poster |\n| [Tree Search-Based Policy Optimization under Stochastic Execution Delay](https:\u002F\u002Fopenreview.net\u002Fpdf?id=RaqZX9LSGA) | poster |\n| [Offline RL with Observation Histories: Analyzing and Improving Sample Complexity](https:\u002F\u002Fopenreview.net\u002Fpdf?id=GnOLWS4Llt) | poster |\n| [Understanding Hidden Context in Preference Learning: Consequences for RLHF](https:\u002F\u002Fopenreview.net\u002Fpdf?id=0tWTxYYPnW) | poster |\n| [Eureka: Human-Level Reward Design via Coding Large Language Models](https:\u002F\u002Fopenreview.net\u002Fpdf?id=IEduRUO55F) | poster |\n| [Retrieval-Guided Reinforcement Learning for Boolean Circuit Minimization](https:\u002F\u002Fopenreview.net\u002Fpdf?id=0t1O8ziRZp) | poster |\n| [Score Models for Offline Goal-Conditioned Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=oXjnwQLcTA) | poster |\n| [Contrastive Difference Predictive Coding](https:\u002F\u002Fopenreview.net\u002Fpdf?id=0akLDTFR9x) | poster |\n| [Hindsight PRIORs for Reward Learning from Human Preferences](https:\u002F\u002Fopenreview.net\u002Fpdf?id=NLevOah0CJ) | poster |\n| [Reward Model Ensembles Help Mitigate Overoptimization](https:\u002F\u002Fopenreview.net\u002Fpdf?id=dcjtMYkpXx) | poster |\n| [Safe Offline Reinforcement Learning with Feasibility-Guided Diffusion Model](https:\u002F\u002Fopenreview.net\u002Fpdf?id=j5JvZCaDM0) | poster |\n| [Compositional Conservatism: A Transductive Approach in Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=HRkyLbBRHI) | poster |\n| [Flow to Better: Offline Preference-based Reinforcement Learning via Preferred Trajectory Generation](https:\u002F\u002Fopenreview.net\u002Fpdf?id=EG68RSznLT) | poster |\n| [PAE: Reinforcement Learning from External Knowledge for Efficient Exploration](https:\u002F\u002Fopenreview.net\u002Fpdf?id=R7rZUSGOPD) | poster |\n| [Identifying Policy Gradient Subspaces](https:\u002F\u002Fopenreview.net\u002Fpdf?id=iPWxqnt2ke) | poster |\n| [PARL: A Unified Framework for Policy Alignment in Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ByR3NdDSZB) | poster |\n| [SafeDreamer: Safe Reinforcement Learning with World Models](https:\u002F\u002Fopenreview.net\u002Fpdf?id=tsE5HLYtYg) | poster |\n| [Vanishing Gradients in Reinforcement Finetuning of Language Models](https:\u002F\u002Fopenreview.net\u002Fpdf?id=IcVNBR7qZi) | poster |\n| [Goodhart's Law in Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=5o9G4XF1LI) | poster |\n| [Score Regularized Policy Optimization through Diffusion Behavior](https:\u002F\u002Fopenreview.net\u002Fpdf?id=xCRr9DrolJ) | poster |\n| [Making RL with Preference-based Feedback Efficient via Randomization](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Pe2lo3QOvo) | poster |\n| [Consistency Models as a Rich and Efficient Policy Class for Reinforcement 
Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=v8jdwkUNXb) | poster |\n| [Contrastive Preference Learning: Learning from Human Feedback without Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=iX1RjVQODj) | poster |\n| [Privileged Sensing Scaffolds Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=EpVe8jAjdx) | poster |\n| [Learning Planning Abstractions from Language](https:\u002F\u002Fopenreview.net\u002Fpdf?id=3UWuFoksGb) | poster |\n| [CrossLoco: Human Motion Driven Control of Legged Robots via Guided Unsupervised Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=UCfz492fM8) | poster |\n| [Efficient Dynamics Modeling in Interactive Environments with Koopman Theory](https:\u002F\u002Fopenreview.net\u002Fpdf?id=fkrYDQaHOJ) | poster |\n| [Jumanji: a Diverse Suite of Scalable Reinforcement Learning Environments in JAX](https:\u002F\u002Fopenreview.net\u002Fpdf?id=C4CxQmp9wc) | poster |\n| [Searching for High-Value Molecules Using Reinforcement Learning and Transformers](https:\u002F\u002Fopenreview.net\u002Fpdf?id=nqlymMx42E) | poster |\n| [Privately Aligning Language Models with Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=3d0OmYTNui) | poster |\n| [The HIM Solution for Legged Locomotion: Minimal Sensors, Efficient Learning, and Substantial Agility](https:\u002F\u002Fopenreview.net\u002Fpdf?id=93LoCyww8o) | poster |\n| [S$^2$AC: Energy-Based Reinforcement Learning with Stein Soft Actor Critic](https:\u002F\u002Fopenreview.net\u002Fpdf?id=rAHcTCMaLc) | poster |\n| [Replay across Experiments: A Natural Extension of Off-Policy RL](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Nf4Lm6fXN8) | poster |\n| [Piecewise Linear Parametrization of Policies: Towards Interpretable Deep Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=hOMVq57Ce0) | poster |\n| [Time-Efficient Reinforcement Learning with Stochastic Stateful Policies](https:\u002F\u002Fopenreview.net\u002Fpdf?id=5liV2xUdJL) | poster |\n| [Open the Black Box: Step-based Policy Updates for Temporally-Correlated Episodic Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=mnipav175N) | poster |\n| [On Trajectory Augmentations for Off-Policy Evaluation](https:\u002F\u002Fopenreview.net\u002Fpdf?id=eMNN0wIyVw) | poster |\n| [Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation](https:\u002F\u002Fopenreview.net\u002Fpdf?id=NxoFmGgWC9) | poster |\n| [Understanding the Effects of RLHF on LLM Generalisation and Diversity](https:\u002F\u002Fopenreview.net\u002Fpdf?id=PXD3FAVHJT) | poster |\n| [Delphic Offline Reinforcement Learning under Nonidentifiable Hidden Confounding](https:\u002F\u002Fopenreview.net\u002Fpdf?id=lUYY2qsRTI) | poster |\n| [Prioritized Soft Q-Decomposition for Lexicographic Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=c0MyyXyGfn) | poster |\n| [The Curse of Diversity in Ensemble-Based Exploration](https:\u002F\u002Fopenreview.net\u002Fpdf?id=M3QXCOTTk4) | poster |\n| [Off-Policy Primal-Dual Safe Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=vy42bYs1Wo) | poster |\n| [STARC: A General Framework For Quantifying Differences Between Reward Functions](https:\u002F\u002Fopenreview.net\u002Fpdf?id=wPhbtwlCDa) | poster |\n| [Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=AY6aM13gGF) | poster |\n| [Discovering Temporally-Aware Reinforcement Learning 
Algorithms](https:\u002F\u002Fopenreview.net\u002Fpdf?id=MJJcs3zbmi) | poster |\n| [Revisiting Data Augmentation in Deep Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=EGQBpkIEuu) | poster |\n| [Reward-Free Curricula for Training Robust World Models](https:\u002F\u002Fopenreview.net\u002Fpdf?id=eCGpNGDeNu) | poster |\n| [CPPO: Continual Learning for Reinforcement Learning with Human Feedback](https:\u002F\u002Fopenreview.net\u002Fpdf?id=86zAUE80pP) | poster |\n| [A Study of Generalization in Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=3w6xuXDOdY) | poster |\n| [RLIF: Interactive Imitation Learning as Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=oLLZhbBSOU) | poster |\n| [Uncertainty-aware Constraint Inference in Inverse Constrained Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ILYjDvUM6U) | poster |\n| [Towards Imitation Learning to Branch for MIP: A Hybrid Reinforcement Learning based Sample Augmentation Approach](https:\u002F\u002Fopenreview.net\u002Fpdf?id=NdcQQ82mfy) | poster |\n| [Towards Assessing and Benchmarking Risk-Return Tradeoff of Off-Policy Evaluation](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ycF7mKfVGO) | poster |\n| [Uni-O4: Unifying Online and Offline Deep Reinforcement Learning with Multi-Step On-Policy Optimization](https:\u002F\u002Fopenreview.net\u002Fpdf?id=tbFBh3LMKi) | poster |\n| [Free from Bellman Completeness: Trajectory Stitching via Model-based Return-conditioned Supervised Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=7zY781bMDO) | poster |\n| [Revisiting Plasticity in Visual Reinforcement Learning: Data, Modules and Training Stages](https:\u002F\u002Fopenreview.net\u002Fpdf?id=0aR1s9YxoL) | poster |\n| [Robot Fleet Learning via Policy Merging](https:\u002F\u002Fopenreview.net\u002Fpdf?id=IL71c1z7et) | poster |\n| [Improving Intrinsic Exploration by Creating Stationary Objectives](https:\u002F\u002Fopenreview.net\u002Fpdf?id=YbZxT0SON4) | poster |\n| [Motif: Intrinsic Motivation from Artificial Intelligence Feedback](https:\u002F\u002Fopenreview.net\u002Fpdf?id=tmBKIecDE9) | poster |\n| [Understanding when Dynamics-Invariant Data Augmentations Benefit Model-free Reinforcement Learning Updates](https:\u002F\u002Fopenreview.net\u002Fpdf?id=sVEu295o70) | poster |\n| [RLCD: Reinforcement Learning from Contrastive Distillation for LM Alignment](https:\u002F\u002Fopenreview.net\u002Fpdf?id=v3XXtxWKi6) | poster |\n| [Reasoning with Latent Diffusion in Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=tGQirjzddO) | poster |\n| [Belief-Enriched Pessimistic Q-Learning against Adversarial State Perturbations](https:\u002F\u002Fopenreview.net\u002Fpdf?id=7gDENzTzw1) | poster |\n| [Reward Design for Justifiable Sequential Decision-Making](https:\u002F\u002Fopenreview.net\u002Fpdf?id=OUkZXbbwQr) | poster |\n| [MAMBA: an Effective World Model Approach for Meta-Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=1RE0H6mU7M) | poster |\n| [LOQA: Learning with Opponent Q-Learning Awareness](https:\u002F\u002Fopenreview.net\u002Fpdf?id=FDQF6A1s6M) | poster |\n| [Intelligent Switching for Reset-Free RL](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Nq45xeghcL) | poster |\n| [True Knowledge Comes from Practice: Aligning Large Language Models with Embodied Environments via Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=hILVmJ4Uvu) | poster |\n| [Skill Machines: Temporal Logic Skill Composition in 
Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=qiduMcw3CU) | poster |\n| [Uni-RLHF: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback](https:\u002F\u002Fopenreview.net\u002Fpdf?id=WesY0H9ghM) | poster |\n| [Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=N0I2RtD8je) | poster |\n| [DittoGym: Learning to Control Soft Shape-Shifting Robots](https:\u002F\u002Fopenreview.net\u002Fpdf?id=MpyFAhH9CK) | poster |\n| [Decoupling regularization from the action space](https:\u002F\u002Fopenreview.net\u002Fpdf?id=UaMgmoKEBj) | poster |\n| [Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks](https:\u002F\u002Fopenreview.net\u002Fpdf?id=hQVCCxQrYN) | poster |\n| [Robust Model Based Reinforcement Learning Using $\\mathcal{L}_1$ Adaptive Control](https:\u002F\u002Fopenreview.net\u002Fpdf?id=GaLCLvJaoF) | poster |\n| [DMBP: Diffusion model-based predictor for robust offline reinforcement learning against state observation perturbations](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ZULjcYLWKe) | poster |\n| [LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ADSxCpCu9s) | poster |\n| [Integrating Planning and Deep Reinforcement Learning via Automatic Induction of Task Substructures](https:\u002F\u002Fopenreview.net\u002Fpdf?id=PR6RMsxuW7) | poster |\n| [Closing the Gap between TD Learning and Supervised Learning - A Generalisation Point of View.](https:\u002F\u002Fopenreview.net\u002Fpdf?id=qg5JENs0N4) | poster |\n| [COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL](https:\u002F\u002Fopenreview.net\u002Fpdf?id=jnFcKjtUPN) | poster |\n| [$\\pi$2vec: Policy Representation with Successor Features](https:\u002F\u002Fopenreview.net\u002Fpdf?id=o5Bqa4o5Mi) | poster |\n| [Task Planning for Visual Room Rearrangement under Partial Observability](https:\u002F\u002Fopenreview.net\u002Fpdf?id=jJvXNpvOdM) | poster |\n| [DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing](https:\u002F\u002Fopenreview.net\u002Fpdf?id=GruDNzQ4ux) | poster |\n| [Meta Inverse Constrained Reinforcement Learning: Convergence Guarantee and Generalization Analysis](https:\u002F\u002Fopenreview.net\u002Fpdf?id=bJ3gFiwRgi) | poster |\n| [Cleanba: A Reproducible and Efficient Distributed Reinforcement Learning Platform](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Diq6urt3lS) | poster |\n| [Consciousness-Inspired Spatio-Temporal Abstractions for Better Generalization in Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=eo9dHwtTFt) | poster |\n| [Function-space Parameterization of Neural Networks for Sequential Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=2dhxxIKhqz) | poster |\n| [When should we prefer Decision Transformers for Offline Reinforcement Learning?](https:\u002F\u002Fopenreview.net\u002Fpdf?id=vpV7fOFQy4) | poster |\n| [Bridging State and History Representations: Understanding Self-Predictive RL](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ms0VgzSGF2) | poster |\n| [Embodied Active Defense: Leveraging Recurrent Feedback to Counter Adversarial Patches](https:\u002F\u002Fopenreview.net\u002Fpdf?id=uXjfOmTiDt) | poster |\n| [Stylized Offline Reinforcement Learning: Extracting Diverse High-Quality Behaviors from Heterogeneous Datasets](https:\u002F\u002Fopenreview.net\u002Fpdf?id=rnHNDihrIT) | poster 
|\n| [Pre-training with Synthetic Data Helps Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=PcxQgtHGj2) | poster |\n| [Query-Dependent Prompt Evaluation and Optimization with Offline Inverse RL](https:\u002F\u002Fopenreview.net\u002Fpdf?id=N6o0ZtPzTg) | poster |\n| [A Simple Solution for Offline Imitation from Observations and Examples with Possibly Incomplete Trajectories](https:\u002F\u002Fopenreview.net\u002Fattachment?id=805CW5w2CY&name=pdf) | poster |\n| [Offline Imitation Learning with Variational Counterfactual Reasoning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=6d9Yxttb3w&name=pdf) | poster |\n| [Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals](https:\u002F\u002Fopenreview.net\u002Fattachment?id=qP0Drg2HuH&name=pdf) | poster |\n| [Reinforcement Learning with Fast and Forgetful Memory](https:\u002F\u002Fopenreview.net\u002Fattachment?id=KTfAtro6vP&name=pdf) | poster |\n| [Active Vision Reinforcement Learning under Limited Visual Observability](https:\u002F\u002Fopenreview.net\u002Fattachment?id=j2oYaFpbrB&name=pdf) | poster |\n| [Sequential Preference Ranking for Efficient Reinforcement Learning from Human Feedback](https:\u002F\u002Fopenreview.net\u002Fattachment?id=MIYBTjCVjR&name=pdf) | poster |\n| [Hierarchical Adaptive Value Estimation for Multi-modal Visual Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=jB4wsc1DQW&name=pdf) | poster |\n| [Elastic Decision Transformer](https:\u002F\u002Fopenreview.net\u002Fattachment?id=RMeQjexaRj&name=pdf) | poster |\n| [Importance-aware Co-teaching for Offline Model-based Optimization](https:\u002F\u002Fopenreview.net\u002Fattachment?id=OvPnc5kVsb&name=pdf) | poster |\n| [Parallel-mentoring for Offline Model-based Optimization](https:\u002F\u002Fopenreview.net\u002Fattachment?id=tJwyg9Zg9G&name=pdf) | poster |\n| [Accountability in Offline Reinforcement Learning: Explaining Decisions with a Corpus of Examples](https:\u002F\u002Fopenreview.net\u002Fattachment?id=kmbG9iBRIb&name=pdf) | poster |\n\n\n\n\u003Ca id='ICML24'>\u003C\u002Fa>\n## ICML24\n| Paper | Type |\n| ---- | ---- |\n| [Stop Regressing: Training Value Functions via Classification for Scalable Deep RL](https:\u002F\u002Fopenreview.net\u002Fattachment?id=dVpFKfqF3R&name=pdf) | oral |\n| [Position: Automatic Environment Shaping is the Next Frontier in RL](https:\u002F\u002Fopenreview.net\u002Fattachment?id=dslUyy1rN4&name=pdf) | oral |\n| [ACE: Off-Policy Actor-Critic with Causality-Aware Entropy Regularization](https:\u002F\u002Fopenreview.net\u002Fattachment?id=1puvYh729M&name=pdf) | oral |\n| [Is DPO Superior to PPO for LLM Alignment? 
A Comprehensive Study](https:\u002F\u002Fopenreview.net\u002Fattachment?id=6XH8R7YrSk&name=pdf) | oral |\n| [SAPG: Split and Aggregate Policy Gradients](https:\u002F\u002Fopenreview.net\u002Fattachment?id=4dOJAfXhNV&name=pdf) | oral |\n| [Environment Design for Inverse Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Ar0dsOMStE&name=pdf) | oral |\n| [OMPO: A Unified Framework for RL under Policy and Dynamics Shifts](https:\u002F\u002Fopenreview.net\u002Fattachment?id=R83VIZtHXA&name=pdf) | oral |\n| [Learning to Model the World With Language](https:\u002F\u002Fopenreview.net\u002Fattachment?id=7dP6Yq9Uwv&name=pdf) | oral |\n| [Offline Actor-Critic Reinforcement Learning Scales to Large Models](https:\u002F\u002Fopenreview.net\u002Fattachment?id=tl2qmO5kpD&name=pdf) | oral |\n| [Self-Composing Policies for Scalable Continual Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=f5gtX2VWSB&name=pdf) | oral |\n| [Genie: Generative Interactive Environments](https:\u002F\u002Fopenreview.net\u002Fattachment?id=bJbSbJskOS&name=pdf) | oral |\n| [Unsupervised Zero-Shot Reinforcement Learning via Functional Reward Encodings](https:\u002F\u002Fopenreview.net\u002Fattachment?id=a6wCNfIj8E&name=pdf) | spotlight |\n| [Craftax: A Lightning-Fast Benchmark for Open-Ended Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=hg4wXlrQCV&name=pdf) | spotlight |\n| [Mixtures of Experts Unlock Parameter Scaling for Deep RL](https:\u002F\u002Fopenreview.net\u002Fattachment?id=X9VMhfFxwn&name=pdf) | spotlight |\n| [RICE: Breaking Through the Training Bottlenecks of Reinforcement Learning with Explanation](https:\u002F\u002Fopenreview.net\u002Fattachment?id=PKJqsZD5nQ&name=pdf) | spotlight |\n| [Code as Reward: Empowering Reinforcement Learning with VLMs](https:\u002F\u002Fopenreview.net\u002Fattachment?id=6P88DMUDvH&name=pdf) | spotlight |\n| [EfficientZero V2: Mastering Discrete and Continuous Control with Limited Data](https:\u002F\u002Fopenreview.net\u002Fattachment?id=LHGMXcr6zx&name=pdf) | spotlight |\n| [Behavior Generation with Latent Actions](https:\u002F\u002Fopenreview.net\u002Fattachment?id=hoVwecMqV5&name=pdf) | spotlight |\n| [Overestimation, Overfitting, and Plasticity in Actor-Critic: the Bitter Lesson of Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=5vZzmCeTYu&name=pdf) | poster |\n| [Hard Tasks First: Multi-Task Reinforcement Learning Through Task Scheduling](https:\u002F\u002Fopenreview.net\u002Fattachment?id=haUOhXo70o&name=pdf) | poster |\n| [Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=XwnABAdH5y&name=pdf) | poster |\n| [Meta-Learners for Partially-Identified Treatment Effects Across Multiple Environments](https:\u002F\u002Fopenreview.net\u002Fattachment?id=s5PLISyNyP&name=pdf) | poster |\n| [How to Explore with Belief: State Entropy Maximization in POMDPs](https:\u002F\u002Fopenreview.net\u002Fattachment?id=LbcNAIgNnB&name=pdf) | poster |\n| [PIPER: Primitive-Informed Preference-based Hierarchical Reinforcement Learning via Hindsight Relabeling](https:\u002F\u002Fopenreview.net\u002Fattachment?id=l6Hef6FVd0&name=pdf) | poster |\n| [Iterative Regularized Policy Optimization with Imperfect Demonstrations](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Gp5F6qzwGK&name=pdf) | poster |\n| [Fourier Controller Networks for Real-Time Decision-Making in Embodied 
Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=nDps3Q8j2l) | poster |\n| [Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=t82Y3fmRtk&name=pdf) | poster |\n| [AD3: Implicit Action is the Key for World Models to Distinguish the Diverse Visual Distractors](https:\u002F\u002Fopenreview.net\u002Fattachment?id=ZwrfsrCduj&name=pdf) | poster |\n| [DRED: Zero-Shot Transfer in Reinforcement Learning via Data-Regularised Environment Design](https:\u002F\u002Fopenreview.net\u002Fattachment?id=uku9r6RROl&name=pdf) | poster |\n| [Adapting Pretrained ViTs with Convolution Injector for Visuo-Motor Control](https:\u002F\u002Fopenreview.net\u002Fattachment?id=CuiRGtVI55&name=pdf) | poster |\n| [Degeneration-free Policy Optimization: RL Fine-Tuning for Language Models without Degeneration](https:\u002F\u002Fopenreview.net\u002Fattachment?id=lwTshcWlmB&name=pdf) | poster |\n| [Energy-Guided Diffusion Sampling for Offline-to-Online Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=hunSEjeCPE&name=pdf) | poster |\n| [RVI-SAC: Average Reward Off-Policy Deep Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=xB6YJZOKyT&name=pdf) | poster |\n| [Offline Transition Modeling via Contrastive Energy Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=dqpg8jdA2w&name=pdf) | poster |\n| [Model-based Reinforcement Learning for Confounded POMDPs](https:\u002F\u002Fopenreview.net\u002Fattachment?id=DlR8fWgJRl&name=pdf) | poster |\n| [Revisiting Scalable Hessian Diagonal Approximations for Applications in Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=yrFUJzcTsk&name=pdf) | poster |\n| [Absolute Policy Optimization: Enhancing Lower Probability Bound of Performance with High Confidence](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Ss3h1ixJAU&name=pdf) | poster |\n| [Meta-Reinforcement Learning Robust to Distributional Shift Via Performing Lifelong In-Context Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=laIOUtstMs&name=pdf) | poster |\n| [DIDI: Diffusion-Guided Diversity for Offline Behavioral Generation](https:\u002F\u002Fopenreview.net\u002Fattachment?id=8296yUBoXr&name=pdf) | poster |\n| [When Do Skills Help Reinforcement Learning? 
A Theoretical Analysis of Temporal Abstractions](https:\u002F\u002Fopenreview.net\u002Fattachment?id=39UqOkTjFn&name=pdf) | poster |\n| [BeigeMaps: Behavioral Eigenmaps for Reinforcement Learning from Images](https:\u002F\u002Fopenreview.net\u002Fattachment?id=myCgfQZzbc&name=pdf) | poster |\n| [Physics-Informed Neural Network Policy Iteration: Algorithms, Convergence, and Verification](https:\u002F\u002Fopenreview.net\u002Fattachment?id=sZla6SnooP&name=pdf) | poster |\n| [RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback](https:\u002F\u002Fopenreview.net\u002Fattachment?id=YSoMmNWZZx&name=pdf) | poster |\n| [RoboDreamer: Learning Compositional World Models for Robot Imagination](https:\u002F\u002Fopenreview.net\u002Fattachment?id=kHjOmAUfVe&name=pdf) | poster |\n| [Investigating Pre-Training Objectives for Generalization in Vision-Based Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=OiI12sNbgD&name=pdf) | poster |\n| [RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation](https:\u002F\u002Fopenreview.net\u002Fattachment?id=SQIDlJd3hN&name=pdf) | poster |\n| [Coprocessor Actor Critic: A Model-Based Reinforcement Learning Approach For Adaptive Brain Stimulation](https:\u002F\u002Fopenreview.net\u002Fattachment?id=t3SEfoTaYQ&name=pdf) | poster |\n| [GFlowNet Training by Policy Gradients](https:\u002F\u002Fopenreview.net\u002Fattachment?id=G1igwiBBUj&name=pdf) | poster |\n| [Value-Evolutionary-Based Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=XobPpcN4yZ&name=pdf) | poster |\n| [PEARL: Zero-shot Cross-task Preference Alignment and Robust Reward Learning for Robotic Manipulation](https:\u002F\u002Fopenreview.net\u002Fattachment?id=0urN0PnNDj&name=pdf) | poster |\n| [Feasibility Consistent Representation Learning for Safe Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=JNHK11bAGl&name=pdf) | poster |\n| [Distilling Morphology-Conditioned Hypernetworks for Efficient Universal Morphology Control](https:\u002F\u002Fopenreview.net\u002Fattachment?id=WjvEvYTy3w&name=pdf) | poster |\n| [Constrained Reinforcement Learning Under Model Mismatch](https:\u002F\u002Fopenreview.net\u002Fattachment?id=GcW9pg4P9x&name=pdf) | poster |\n| [Discovering Multiple Solutions from a Single Task in Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=j6rG1ETRyu&name=pdf) | poster |\n| [Learning to Stabilize Online Reinforcement Learning in Unbounded State Spaces](https:\u002F\u002Fopenreview.net\u002Fattachment?id=64fdhmogiD&name=pdf) | poster |\n| [Learning to Play Atari in a World of Tokens](https:\u002F\u002Fopenreview.net\u002Fattachment?id=w8BnKGFIYN&name=pdf) | poster |\n| [Breaking the Barrier: Enhanced Utility and Robustness in Smoothed DRL Agents](https:\u002F\u002Fopenreview.net\u002Fattachment?id=WJ5fJhwvCl&name=pdf) | poster |\n| [Probabilistic Constrained Reinforcement Learning with Formal Interpretability](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Zo9zXdVhW2&name=pdf) | poster |\n| [Hieros: Hierarchical Imagination on Structured State Space Sequence World Models](https:\u002F\u002Fopenreview.net\u002Fattachment?id=IUBhvyJ9Sr&name=pdf) | poster |\n| [Random Latent Exploration for Deep Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Y9qzwNlKVU&name=pdf) | poster |\n| [Model-based Reinforcement Learning for Parameterized Action 
Spaces](https:\u002F\u002Fopenreview.net\u002Fattachment?id=xW79geE0RA&name=pdf) | poster |\n| [Confidence Aware Inverse Constrained Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=6TCeizkLJV&name=pdf) | poster |\n| [Averaging n-step Returns Reduces Variance in Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=jM9A3Kz6Ki&name=pdf) | poster |\n| [Position: A Call for Embodied AI](https:\u002F\u002Fopenreview.net\u002Fattachment?id=e5admkWKgV&name=pdf) | poster |\n| [Just Cluster It: An Approach for Exploration in High-Dimensions using Clustering and Pre-Trained Representations](https:\u002F\u002Fopenreview.net\u002Fattachment?id=cXBPPfNUZJ&name=pdf) | poster |\n| [The Max-Min Formulation of Multi-Objective Reinforcement Learning: From Theory to a Model-Free Algorithm](https:\u002F\u002Fopenreview.net\u002Fattachment?id=cY9g0bwiZx&name=pdf) | poster |\n| [Skill Set Optimization: Reinforcing Language Model Behavior via Transferable Skills](https:\u002F\u002Fopenreview.net\u002Fattachment?id=9laB7ytoMp&name=pdf) | poster |\n| [Offline-Boosted Actor-Critic: Adaptively Blending Optimal Historical Behaviors in Deep Off-Policy RL](https:\u002F\u002Fopenreview.net\u002Fattachment?id=7joG3i2pUR&name=pdf) | poster |\n| [Sequence Compression Speeds Up Credit Assignment in Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=NlM4gp8hyO&name=pdf) | poster |\n| [Seizing Serendipity: Exploiting the Value of Past Success in Off-Policy Actor-Critic](https:\u002F\u002Fopenreview.net\u002Fattachment?id=9Tq4L3Go9f&name=pdf) | poster |\n| [Generalization to New Sequential Decision Making Tasks with In-Context Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=lVQ4FUZ6dp&name=pdf) | poster |\n| [Simple Ingredients for Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=japBn31gXC&name=pdf) | poster |\n| [Efficient World Models with Context-Aware Tokenization](https:\u002F\u002Fopenreview.net\u002Fattachment?id=BiWIERWBFX&name=pdf) | poster |\n| [In value-based deep reinforcement learning, a pruned network is a good network](https:\u002F\u002Fopenreview.net\u002Fattachment?id=seo9V9QRZp&name=pdf) | poster |\n| [Probabilistic Subgoal Representations for Hierarchical Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=b6AwZauZPV&name=pdf) | poster |\n| [Premier-TACO is a Few-Shot Policy Learner: Pretraining Multitask Representation via Temporal Action-Driven Contrastive Loss](https:\u002F\u002Fopenreview.net\u002Fattachment?id=KSNl7VgeVr&name=pdf) | poster |\n| [Understanding and Diagnosing Deep Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=s9RKqT7jVM&name=pdf) | poster |\n| [To the Max: Reinventing Reward in Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=4KQ0VwqPg8&name=pdf) | poster |\n| [ReLU to the Rescue: Improve Your On-Policy Actor-Critic with Positive Advantages](https:\u002F\u002Fopenreview.net\u002Fattachment?id=3umNqxjFad&name=pdf) | poster |\n| [Stochastic Q-learning for Large Discrete Action Spaces](https:\u002F\u002Fopenreview.net\u002Fattachment?id=HPQaMmABgK&name=pdf) | poster |\n| [Feasible Reachable Policy Iteration](https:\u002F\u002Fopenreview.net\u002Fattachment?id=ks8qSwkkuZ&name=pdf) | poster |\n| [Position: Video as the New Language for Real-World Decision Making](https:\u002F\u002Fopenreview.net\u002Fattachment?id=EZH4CsKV6O&name=pdf) | poster |\n| [Learning Latent Dynamic Robust 
Representations for World Models](https:\u002F\u002Fopenreview.net\u002Fattachment?id=C4jkx6AgWc&name=pdf) | poster |\n| [Reinformer: Max-Return Sequence Modeling for Offline RL](https:\u002F\u002Fopenreview.net\u002Fattachment?id=mBc8Pestd5&name=pdf) | poster |\n| [Rethinking Transformers in Solving POMDPs](https:\u002F\u002Fopenreview.net\u002Fpdf?id=SyY7ScNpGL) | poster |\n| [Single-Trajectory Distributionally Robust Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=3B6vmW2L80&name=pdf) | poster |\n| [Trust the Model Where It Trusts Itself - Model-Based Actor-Critic with Uncertainty-Aware Rollout Adaption](https:\u002F\u002Fopenreview.net\u002Fattachment?id=N0ntTjTfHb&name=pdf) | poster |\n| [A Minimaximalist Approach to Reinforcement Learning from Human Feedback](https:\u002F\u002Fopenreview.net\u002Fattachment?id=5kVgd2MwMY&name=pdf) | poster |\n| [EvoRainbow: Combining Improvements in Evolutionary Reinforcement Learning for Policy Search](https:\u002F\u002Fopenreview.net\u002Fattachment?id=75Hes6Zse4&name=pdf) | poster |\n| [SeMOPO: Learning High-quality Model and Policy from Low-quality Offline Visual Datasets](https:\u002F\u002Fopenreview.net\u002Fattachment?id=ZtOXZCTgBa&name=pdf) | poster |\n| [Adaptive-Gradient Policy Optimization: Enhancing Policy Learning in Non-Smooth Differentiable Simulations](https:\u002F\u002Fopenreview.net\u002Fattachment?id=S9DV6ZP4eE&name=pdf) | poster |\n| [Dense Reward for Free in Reinforcement Learning from Human Feedback](https:\u002F\u002Fopenreview.net\u002Fattachment?id=eyxVRMrZ4m&name=pdf) | poster |\n| [Configurable Mirror Descent: Towards a Unification of Decision Making](https:\u002F\u002Fopenreview.net\u002Fpdf?id=U841CrDUx9) | poster |\n| [Policy Learning for Balancing Short-Term and Long-Term Rewards](https:\u002F\u002Fopenreview.net\u002Fattachment?id=7Qf1uHTahP&name=pdf) | poster |\n| [Reward Shaping for Reinforcement Learning with An Assistant Reward Agent](https:\u002F\u002Fopenreview.net\u002Fattachment?id=a3XFF0PGLU&name=pdf) | poster |\n| [Distributional Bellman Operators over Mean Embeddings](https:\u002F\u002Fopenreview.net\u002Fattachment?id=j2pLfsBm4J&name=pdf) | poster |\n| [SiT: Symmetry-invariant Transformers for Generalisation in Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=SWrwurHAeq&name=pdf) | poster |\n| [Geometric Active Exploration in Markov Decision Processes: the Benefit of Abstraction](https:\u002F\u002Fopenreview.net\u002Fattachment?id=2JYOxcGlRe&name=pdf) | poster |\n| [Learning a Diffusion Model Policy from Rewards via Q-Score Matching](https:\u002F\u002Fopenreview.net\u002Fattachment?id=35ahHydjXo&name=pdf) | poster |\n| [ACPO: A Policy Optimization Algorithm for Average MDPs with Constraints](https:\u002F\u002Fopenreview.net\u002Fattachment?id=dmfvHU1LNF&name=pdf) | poster |\n| [Position: Benchmarking is Limited in Reinforcement Learning Research](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Xe7n2ZqpBP&name=pdf) | poster |\n| [Learning Constraints from Offline Demonstrations via Superior Distribution Correction Estimation](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Ax90jQPbgF&name=pdf) | poster |\n| [Augmenting Decision with Hypothesis in Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=NeO2hoSexj&name=pdf) | poster |\n| [SHINE: Shielding Backdoors in Deep Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=nMWxLnSBGW&name=pdf) | poster |\n| [Learning Coverage Paths in Unknown 
Environments with Deep Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=nCZYRBK1J4&name=pdf) | poster |\n| [Improving Token-Based World Models with Parallel Observation Prediction](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Lfp5Dk1xb6&name=pdf) | poster |\n| [Learning to Explore in POMDPs with Informational Rewards](https:\u002F\u002Fopenreview.net\u002Fattachment?id=oTD3WoQyFR&name=pdf) | poster |\n| [Stealthy Imitation: Reward-guided Environment-free Policy Stealing](https:\u002F\u002Fopenreview.net\u002Fattachment?id=H5FDHzrWe2&name=pdf) | poster |\n| [FuRL: Visual-Language Models as Fuzzy Rewards for Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=BmPWtzL7Eq&name=pdf) | poster |\n| [Enhancing Value Function Estimation through First-Order State-Action Dynamics in Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=nSGnx8lNJ6&name=pdf) | poster |\n| [In-Context Reinforcement Learning for Variable Action Spaces](https:\u002F\u002Fopenreview.net\u002Fattachment?id=pp3v2ch5Sd&name=pdf) | poster |\n| [Information-Directed Pessimism for Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=JOKOsJHSao&name=pdf) | poster |\n| [PcLast: Discovering Plannable Continuous Latent States](https:\u002F\u002Fopenreview.net\u002Fattachment?id=AaTYLZQPyC&name=pdf) | poster |\n| [Bayesian Design Principles for Offline-to-Online Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=HLHQxMydFk&name=pdf) | poster |\n| [Adaptive Advantage-Guided Policy Regularization for Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=FV3kY9FBW6&name=pdf) | poster |\n| [ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL](https:\u002F\u002Fopenreview.net\u002Fattachment?id=b6rA0kAHT1&name=pdf) | poster |\n| [RLAIF vs. 
RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback](https:\u002F\u002Fopenreview.net\u002Fattachment?id=uydQ2W41KO&name=pdf) | poster |\n| [Langevin Policy for Safe Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=xgoilgLPGD&name=pdf) | poster |\n| [Reflective Policy Optimization](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Cs0Xy6WETl&name=pdf) | poster |\n| [Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint](https:\u002F\u002Fopenreview.net\u002Fattachment?id=c1AKcA6ry1&name=pdf) | poster |\n| [Contrastive Representation for Data Filtering in Cross-Domain Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=rReWhol66R&name=pdf) | poster |\n| [Position: Foundation Agents as the Paradigm Shift for Decision Making](https:\u002F\u002Fopenreview.net\u002Fattachment?id=jzHmElqpPe&name=pdf) | poster |\n| [Principled Penalty-based Methods for Bilevel Reinforcement Learning and RLHF](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Xb3IXEBYuw&name=pdf) | poster |\n| [Do Transformer World Models Give Better Policy Gradients?](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Uoved2xD81&name=pdf) | poster |\n| [Boosting Reinforcement Learning with Strongly Delayed Feedback Through Auxiliary Short Delays](https:\u002F\u002Fopenreview.net\u002Fattachment?id=0IDaPnY5d5&name=pdf) | poster |\n| [Zero-Shot Reinforcement Learning via Function Encoders](https:\u002F\u002Fopenreview.net\u002Fattachment?id=tHBLwSYnLf&name=pdf) | poster |\n| [3D-VLA: A 3D Vision-Language-Action Generative World Model](https:\u002F\u002Fopenreview.net\u002Fattachment?id=EZcFK8HupF&name=pdf) | poster |\n| [SF-DQN: Provable Knowledge Transfer using Successor Feature for Deep Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=pDoAjdrMf0&name=pdf) | poster |\n| [In-Context Decision Transformer: Reinforcement Learning via Hierarchical Chain-of-Thought](https:\u002F\u002Fopenreview.net\u002Fattachment?id=jmmji1EU3g&name=pdf) | poster |\n| [Quality-Diversity Actor-Critic: Learning High-Performing and Diverse Behaviors via Value and Successor Features Critics](https:\u002F\u002Fopenreview.net\u002Fattachment?id=ISG3l8nXrI&name=pdf) | poster |\n| [Listwise Reward Estimation for Offline Preference-based Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=If6Q9OYfoJ&name=pdf) | poster |\n| [Position: Scaling Simulation is Neither Necessary Nor Sufficient for In-the-Wild Robot Manipulation](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Jtjurj7oIJ&name=pdf) | poster |\n| [Hybrid Reinforcement Learning from Offline Observation Alone](https:\u002F\u002Fopenreview.net\u002Fattachment?id=c6rVlTKpb5&name=pdf) | poster |\n| [Is Inverse Reinforcement Learning Harder than Standard Reinforcement Learning? 
A Theoretical Perspective](https:\u002F\u002Fopenreview.net\u002Fattachment?id=6dKUu2EkZy&name=pdf) | poster |\n| [Regularized Q-learning through Robust Averaging](https:\u002F\u002Fopenreview.net\u002Fattachment?id=07f24ya6eX&name=pdf) | poster |\n| [Cross-Domain Policy Adaptation by Capturing Representation Mismatch](https:\u002F\u002Fopenreview.net\u002Fattachment?id=3uPSQmjXzd&name=pdf) | poster |\n| [HarmoDT: Harmony Multi-Task Decision Transformer for Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=2Asakozn3Z&name=pdf) | poster |\n| [Foundation Policies with Hilbert Representations](https:\u002F\u002Fopenreview.net\u002Fattachment?id=LhNsSaAKub&name=pdf) | poster |\n| [Subequivariant Reinforcement Learning in 3D Multi-Entity Physical Environments](https:\u002F\u002Fopenreview.net\u002Fattachment?id=hQpUhySEJi&name=pdf) | poster |\n| [LLM-Empowered State Representation for Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=xJMZbdiQnf&name=pdf) | poster |\n| [Prompt-based Visual Alignment for Zero-shot Policy Transfer](https:\u002F\u002Fopenreview.net\u002Fattachment?id=PPoQz8K4GZ&name=pdf) | poster |\n| [An Embodied Generalist Agent in 3D World](https:\u002F\u002Fopenreview.net\u002Fattachment?id=V4qV08Vk6S&name=pdf) | poster |\n| [Q-value Regularized Transformer for Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=ojtddicekd&name=pdf) | poster |\n| [Highway Value Iteration Networks](https:\u002F\u002Fopenreview.net\u002Fattachment?id=rORsGuE2hV&name=pdf) | poster |\n| [Robust Inverse Constrained Reinforcement Learning under Model Misspecification](https:\u002F\u002Fopenreview.net\u002Fattachment?id=pkUl39b0in&name=pdf) | poster |\n| [Exploration and Anti-Exploration with Distributional Random Network Distillation](https:\u002F\u002Fopenreview.net\u002Fattachment?id=rIrpzmqRBk&name=pdf) | poster |\n| [Policy-conditioned Environment Models are More Generalizable](https:\u002F\u002Fopenreview.net\u002Fattachment?id=g9mYBdooPA&name=pdf) | poster |\n| [Constrained Ensemble Exploration for Unsupervised Skill Discovery](https:\u002F\u002Fopenreview.net\u002Fattachment?id=AOJCCFTlfJ&name=pdf) | poster |\n| [DiffStitch: Boosting Offline Reinforcement Learning with Diffusion-based Trajectory Stitching](https:\u002F\u002Fopenreview.net\u002Fattachment?id=phGHQOKmaU&name=pdf) | poster |\n| [Rethinking Decision Transformer via Hierarchical Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=WsM4TVsZpJ&name=pdf) | poster |\n| [Learning Cognitive Maps from Transformer Representations for Efficient Planning in Partially Observed Environments](https:\u002F\u002Fopenreview.net\u002Fattachment?id=JUa5XNXuoT&name=pdf) | poster |\n| [HarmonyDream: Task Harmonization Inside World Models](https:\u002F\u002Fopenreview.net\u002Fattachment?id=x0yIaw2fgk&name=pdf) | poster |\n| [Advancing DRL Agents in Commercial Fighting Games: Training, Integration, and Agent-Human Alignment](https:\u002F\u002Fopenreview.net\u002Fattachment?id=eN1T7I7OpZ&name=pdf) | poster |\n| [Offline Imitation from Observation via Primal Wasserstein State Occupancy Matching](https:\u002F\u002Fopenreview.net\u002Fattachment?id=4Zr7T6UrBS&name=pdf) | poster |\n| [Fine-Grained Causal Dynamics Learning with Quantization for Improving Robustness in Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=mrd4e8ZJjm) | poster |\n| [Switching the Loss Reduces the Cost in Batch Reinforcement 
Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=7PXSc5fURu&name=pdf) | poster |\n| [Think Before You Act: Decision Transformers with Working Memory](https:\u002F\u002Fopenreview.net\u002Fattachment?id=PSQ5Z920M8&name=pdf) | poster |\n\n\n\u003Ca id='NeurIPS24'>\u003C\u002Fa>\n## NeurIPS24\n| Paper | Type |\n| ---- | ---- |\n| [Maximum Entropy Inverse Reinforcement Learning of Diffusion Models with Energy-Based Models](https:\u002F\u002Fopenreview.net\u002Fattachment?id=V0oJaLqY4E&name=pdf) | oral |\n| [Improving Environment Novelty Quantification for Effective Unsupervised Environment Design](https:\u002F\u002Fopenreview.net\u002Fattachment?id=UdxpjKO2F9&name=pdf) | oral |\n| [RL-GPT: Integrating Reinforcement Learning and Code-as-policy](https:\u002F\u002Fopenreview.net\u002Fattachment?id=LEzx6QRkRH&name=pdf) | oral |\n| [Optimizing Automatic Differentiation with Deep Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=hVmi98a0ki&name=pdf) | spotlight |\n| [Bigger, Regularized, Optimistic: scaling for compute and sample efficient continuous control](https:\u002F\u002Fopenreview.net\u002Fattachment?id=fu0xdh4aEJ&name=pdf) | spotlight |\n| [Can Learned Optimization Make Reinforcement Learning Less Difficult?](https:\u002F\u002Fopenreview.net\u002Fattachment?id=YbxFwaSA9Z&name=pdf) | spotlight |\n| [Goal Reduction with Loop-Removal Accelerates RL and Models Human Brain Activity in Goal-Directed Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Y0EfJJeb4V&name=pdf) | spotlight |\n| [BricksRL: A Platform for Democratizing Robotics and Reinforcement Learning Research and Education with LEGO](https:\u002F\u002Fopenreview.net\u002Fattachment?id=8iytZCnXIu&name=pdf) | spotlight |\n| [Humanoid Locomotion as Next Token Prediction](https:\u002F\u002Fopenreview.net\u002Fattachment?id=GrMczQGTlA&name=pdf) | spotlight |\n| [Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=gRG6SzbW9p&name=pdf) | spotlight |\n| [Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control](https:\u002F\u002Fopenreview.net\u002Fattachment?id=KY07A73F3Y&name=pdf) | spotlight |\n| [A Study of Plasticity Loss in On-Policy Deep Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=MsUf8kpKTF&name=pdf) | spotlight |\n| [Diffusion for World Modeling: Visual Details Matter in Atari](https:\u002F\u002Fopenreview.net\u002Fattachment?id=NadTwTODgC&name=pdf) | spotlight |\n| [Exclusively Penalized Q-learning for Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=2bdSnxeQcW&name=pdf) | spotlight |\n| [DiffTORI: Differentiable Trajectory Optimization for Deep Reinforcement and Imitation Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Mwj57TcHWX&name=pdf) | spotlight |\n| [Variational Delayed Policy Optimization](https:\u002F\u002Fopenreview.net\u002Fattachment?id=DAtNDZHbqj&name=pdf) | spotlight |\n| [Rethinking Exploration in Reinforcement Learning with Effective Metric-Based Exploration Bonus](https:\u002F\u002Fopenreview.net\u002Fattachment?id=QpKWFLtZKi&name=pdf) | spotlight |\n| [Towards an Information Theoretic Framework of Context-Based Offline Meta-Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=QFUsZvw9mx&name=pdf) | spotlight |\n| [Reinforcement Learning Gradients as Vitamin for Online Finetuning Decision 
Transformers](https:\u002F\u002Fopenreview.net\u002Fattachment?id=5l5bhYexYO&name=pdf) | spotlight |\n| [The Value of Reward Lookahead in Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=URyeU8mwz1&name=pdf) | spotlight |\n| [PEAC: Unsupervised Pre-training for Cross-Embodiment Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.14073) | poster |\n| [Reward Machines for Deep RL in Noisy and Uncertain Environments](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Cc0ckJlJF2&name=pdf) | poster |\n| [Provable Partially Observable Reinforcement Learning with Privileged Information](https:\u002F\u002Fopenreview.net\u002Fattachment?id=o3i1JEfzKw&name=pdf) | poster|\n| [Artificial Generational Intelligence: Cultural Accumulation in Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=pMaCRgu8GV&name=pdf) | poster|\n| [SimPO: Simple Preference Optimization with a Reference-Free Reward](https:\u002F\u002Fopenreview.net\u002Fattachment?id=3Tzcot1LKb&name=pdf) | poster|\n| [Subwords as Skills: Tokenization for Sparse-Reward Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=WfpvtH7oC1&name=pdf) | poster|\n| [Model-based Diffusion for Trajectory Optimization](https:\u002F\u002Fopenreview.net\u002Fattachment?id=BJndYScO6o&name=pdf) | poster|\n| [Operator World Models for Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=kbBjVMcJ7G&name=pdf) | poster|\n| [Foundations of Multivariate Distributional Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=aq3I5B6GLG&name=pdf) | poster|\n| [Imitating Language via Scalable Inverse Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=5d2eScRiRC&name=pdf) | poster|\n| [Beyond Optimism: Exploration With Partially Observable Rewards](https:\u002F\u002Fopenreview.net\u002Fattachment?id=k6ZHvF1vkg&name=pdf) | poster|\n| [SleeperNets: Universal Backdoor Poisoning Attacks Against Reinforcement Learning Agents](https:\u002F\u002Fopenreview.net\u002Fattachment?id=HkC4OYee3Q&name=pdf) | poster|\n| [Learning World Models for Unconstrained Goal Navigation](https:\u002F\u002Fopenreview.net\u002Fattachment?id=aYqTwcDlCG&name=pdf) | poster|\n| [Exploring the Edges of Latent State Clusters for Goal-Conditioned Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=9hKN99RNdR&name=pdf) | poster|\n| [Off-Dynamics Reinforcement Learning via Domain Adaptation and Reward Augmented Imitation](https:\u002F\u002Fopenreview.net\u002Fattachment?id=k2hS5Rt1N0&name=pdf) | poster|\n| [Constrained Latent Action Policies for Model-Based Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=pEhvscmSgG&name=pdf) | poster|\n| [Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback](https:\u002F\u002Fopenreview.net\u002Fattachment?id=JMBWTlazjW&name=pdf) | poster|\n| [Rethinking Inverse Reinforcement Learning: from Data Alignment to Task Alignment](https:\u002F\u002Fopenreview.net\u002Fattachment?id=VFRyS7Wx08&name=pdf) | poster|\n| [Normalization and effective learning rates in reinforcement learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=ZbjJE6Nq5k&name=pdf) | poster|\n| [ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search](https:\u002F\u002Fopenreview.net\u002Fattachment?id=8rcFOqEud5&name=pdf) | poster|\n| [Deep Policy Gradient Methods Without Batch Updates, Target Networks, or Replay 
Buffers](https:\u002F\u002Fopenreview.net\u002Fattachment?id=DX5GUwMFFb&name=pdf) | poster|\n| [Text-Aware Diffusion for Policy Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=nK6OnCpd3n&name=pdf) | poster|\n| [A Tractable Inference Perspective of Offline RL](https:\u002F\u002Fopenreview.net\u002Fattachment?id=UZIHW8eFRp&name=pdf) | poster|\n| [Reinforcing LLM Agents via Policy Optimization with Action Decomposition](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Hz6cSigMyU&name=pdf) | poster|\n| [Parseval Regularization for Continual Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=RB1F2h5YEx&name=pdf) | poster|\n| [The Surprising Ineffectiveness of Pre-Trained Visual Representations for Model-Based Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=LvAy07mCxU&name=pdf) | poster |\n| [Speculative Monte-Carlo Tree Search](https:\u002F\u002Fopenreview.net\u002Fattachment?id=g1HxCIc0wi&name=pdf) | poster |\n| [Safety through feedback in Constrained RL](https:\u002F\u002Fopenreview.net\u002Fattachment?id=WSsht66fbC&name=pdf) | poster |\n| [Test Where Decisions Matter: Importance-driven Testing for Deep Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=TwrnhZfD6a&name=pdf) | poster |\n| [Skill-aware Mutual Information Optimisation for Zero-shot Generalisation in Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=GtbwJ6mruI&name=pdf) | poster |\n| [Entropy-regularized Diffusion Policy with Q-Ensembles for Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=hWRVbdAWiS&name=pdf) | poster |\n| [An Analytical Study of Utility Functions in Multi-Objective Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=K3h2kZFz8h&name=pdf) | poster |\n| [Diffusion-DICE: In-Sample Diffusion Guidance for Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=EIl9qmMmvy&name=pdf) | poster |\n| [Efficient Recurrent Off-Policy RL Requires a Context-Encoder-Specific Learning Rate](https:\u002F\u002Fopenreview.net\u002Fattachment?id=tSWoT8ttkO&name=pdf) | poster |\n| [Uncertainty-based Offline Variational Bayesian Reinforcement Learning for Robustness under Diverse Data Corruptions](https:\u002F\u002Fopenreview.net\u002Fattachment?id=rTxCIWsfsD&name=pdf) | poster |\n| [Any2Policy: Learning Visuomotor Policy with Any-Modality](https:\u002F\u002Fopenreview.net\u002Fattachment?id=8lcW9ltJx9&name=pdf) | poster |\n| [Reinforcement Learning with Adaptive Regularization for Safe Control of Critical Systems](https:\u002F\u002Fopenreview.net\u002Fattachment?id=MRO2QhydPF&name=pdf) | poster |\n| [Adam on Local Time: Addressing Nonstationarity in RL with Relative Adam Timesteps](https:\u002F\u002Fopenreview.net\u002Fattachment?id=biAqUbAuG7&name=pdf) | poster |\n| [ROIDICE: Offline Return on Investment Maximization for Efficient Decision Making](https:\u002F\u002Fopenreview.net\u002Fattachment?id=6Kg26g1quR&name=pdf) | poster |\n| [Prediction with Action: Visual Policy Learning via Joint Denoising Process](https:\u002F\u002Fopenreview.net\u002Fattachment?id=teVxVdy8R2&name=pdf) | poster |\n| Aligning Diffusion Behaviors with Q-functions for Efficient Continuous Control | poster |\n\n\n\u003Ca id='ICLR25'>\u003C\u002Fa>\n## ICLR25\n| Paper | Type |\n| ---- | ---- |\n| [RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style](https:\u002F\u002Fopenreview.net\u002Fattachment?id=QEHrmQPBdd&name=pdf) | oral 
|\n| [Diffusion-Based Planning for Autonomous Driving with Flexible Guidance](https:\u002F\u002Fopenreview.net\u002Fattachment?id=wM2sfVgMDH&name=pdf) | oral |\n| [Learning to Search from Demonstration Sequences](https:\u002F\u002Fopenreview.net\u002Fattachment?id=v593OaNePQ&name=pdf) | oral |\n| [Spread Preference Annotation: Direct Preference Judgment for Efficient LLM Alignment](https:\u002F\u002Fopenreview.net\u002Fattachment?id=BPgK5XW1Nb&name=pdf) | oral |\n| [Interpreting Emergent Planning in Model-Free Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=DzGe40glxs&name=pdf) | oral |\n| [Kinetix: Investigating the Training of General Agents through Open-Ended Physics-Based Control Tasks](https:\u002F\u002Fopenreview.net\u002Fattachment?id=zCxGCdzreM&name=pdf) | oral |\n| [OptionZero: Planning with Learned Options](https:\u002F\u002Fopenreview.net\u002Fattachment?id=3IFRygQKGL&name=pdf) | oral |\n| [Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation](https:\u002F\u002Fopenreview.net\u002Fattachment?id=meRCKuUpmc&name=pdf) | oral |\n| [Data Scaling Laws in Imitation Learning for Robotic Manipulation](https:\u002F\u002Fopenreview.net\u002Fattachment?id=pISLZG7ktL&name=pdf) | oral |\n| [More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness](https:\u002F\u002Fopenreview.net\u002Fattachment?id=FpiCLJrSW8&name=pdf) | oral |\n| [Geometry-aware RL for Manipulation of Varying Shapes and Deformable Objects](https:\u002F\u002Fopenreview.net\u002Fattachment?id=7BLXhmWvwF&name=pdf) | oral |\n| [DeepLTL: Learning to Efficiently Satisfy Complex LTL Specifications for Multi-Task RL](https:\u002F\u002Fopenreview.net\u002Fattachment?id=9pW2J49flQ&name=pdf) | oral |\n| [Training Language Models to Self-Correct via Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=CjwERcAU7w&name=pdf) | oral |\n| [Prioritized Generative Replay](https:\u002F\u002Fopenreview.net\u002Fattachment?id=5IkDAfabuo&name=pdf) | oral |\n| [Flat Reward in Policy Parameter Space Implies Robust Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=4OaO3GjP7k&name=pdf) | oral |\n| [Open-World Reinforcement Learning over Long Short-Term Imagination](https:\u002F\u002Fopenreview.net\u002Fattachment?id=vzItLaEoDa&name=pdf) | oral |\n| [Online Preference Alignment for Language Models via Count-based Exploration](https:\u002F\u002Fopenreview.net\u002Fattachment?id=cfKZ5VrhXt&name=pdf) | spotlight |\n| [Joint Reward and Policy Learning with Demonstrations and Human Feedback Improves Alignment](https:\u002F\u002Fopenreview.net\u002Fattachment?id=VCbqXtS5YY&name=pdf) | spotlight |\n| [Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking](https:\u002F\u002Fopenreview.net\u002Fattachment?id=msEr27EejF&name=pdf) | spotlight |\n| [Online Reinforcement Learning in Non-Stationary Context-Driven Environments](https:\u002F\u002Fopenreview.net\u002Fattachment?id=l6QnSQizmN&name=pdf) | spotlight |\n| [DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback](https:\u002F\u002Fopenreview.net\u002Fattachment?id=00SnKBGTsz&name=pdf) | spotlight |\n| [Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization](https:\u002F\u002Fopenreview.net\u002Fattachment?id=hXm0Wu2U9K&name=pdf) | spotlight |\n| [TOP-ERL: Transformer-based Off-Policy Episodic Reinforcement 
Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=N4NhVN30ph&name=pdf) | spotlight |\n| [VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=QOfswj7hij&name=pdf) | spotlight |\n| [Multi-Robot Motion Planning with Diffusion Models](https:\u002F\u002Fopenreview.net\u002Fattachment?id=AUCYptvAf3&name=pdf) | spotlight |\n| [Simplifying Deep Temporal Difference Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=7IzeL0kflu&name=pdf) | spotlight |\n| [ODE-based Smoothing Neural Network for Reinforcement Learning Tasks](https:\u002F\u002Fopenreview.net\u002Fattachment?id=S5Yo6w3n3f&name=pdf) | spotlight |\n| [MAD-TD: Model-Augmented Data stabilizes High Update Ratio RL](https:\u002F\u002Fopenreview.net\u002Fattachment?id=6RtRsg8ZV1&name=pdf) | spotlight |\n| [Stabilizing Reinforcement Learning in Differentiable Multiphysics Simulation](https:\u002F\u002Fopenreview.net\u002Fattachment?id=DRiLWb8bJg&name=pdf) | spotlight |\n| [Don't flatten, tokenize! Unlocking the key to SoftMoE's efficacy in deep RL](https:\u002F\u002Fopenreview.net\u002Fattachment?id=8oCrlOaYcc&name=pdf) | spotlight |\n| [Learning Transformer-based World Models with Contrastive Predictive Coding](https:\u002F\u002Fopenreview.net\u002Fattachment?id=YK9G4Htdew&name=pdf) | spotlight |\n| [Towards General-Purpose Model-Free Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=R1hIXdST22&name=pdf) | spotlight |\n| [Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree?](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Cnwz9jONi5&name=pdf) | spotlight |\n| [Accelerating Goal-Conditioned Reinforcement Learning Algorithms and Research](https:\u002F\u002Fopenreview.net\u002Fattachment?id=4gaySj8kvX&name=pdf) | spotlight |\n| [SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=jXLiDKsuDo&name=pdf) | spotlight |\n| [Test-time Alignment of Diffusion Models without Reward Over-optimization](https:\u002F\u002Fopenreview.net\u002Fattachment?id=vi3DjUhFVm&name=pdf) | spotlight |\n| [Mitigating Information Loss in Tree-Based Reinforcement Learning via Direct Optimization](https:\u002F\u002Fopenreview.net\u002Fattachment?id=qpXctF2aLZ&name=pdf) | spotlight |\n| [What Makes a Good Diffusion Planner for Decision Making?](https:\u002F\u002Fopenreview.net\u002Fattachment?id=7BQkXXM8Fy&name=pdf) | spotlight |\n| [ADAM: An Embodied Causal Agent in Open-World Environments](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Ouu3HnIVBc&name=pdf) | poster |\n| [How to Evaluate Reward Models for RLHF](https:\u002F\u002Fopenreview.net\u002Fattachment?id=cbttLtO94Q&name=pdf) | poster |\n| [SafeDiffuser: Safe Planning with Diffusion Probabilistic Models](https:\u002F\u002Fopenreview.net\u002Fattachment?id=ig2wk7kK9J&name=pdf) | poster |\n| [Efficient Online Reinforcement Learning Fine-Tuning Need Not Retain Offline Data](https:\u002F\u002Fopenreview.net\u002Fattachment?id=HN0CYZbAPw&name=pdf) | poster |\n| [Efficient Reinforcement Learning with Large Language Model Priors](https:\u002F\u002Fopenreview.net\u002Fattachment?id=e2NRNQ0sZe&name=pdf) | poster |\n| [Langevin Soft Actor-Critic: Efficient Exploration through Uncertainty-Driven Critic Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=FvQsk3la17&name=pdf) | poster |\n| [Efficient Policy Evaluation with Safety Constraint for Reinforcement 
Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Dem5LyVk8R&name=pdf) | poster |\n| [Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity](https:\u002F\u002Fopenreview.net\u002Fattachment?id=lOi6FtIwR8&name=pdf) | poster |\n| [Safety-Prioritizing Curricula for Constrained Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=f3QR9TEERH&name=pdf) | poster |\n| [Neural Stochastic Differential Equations for Uncertainty-Aware Offline RL](https:\u002F\u002Fopenreview.net\u002Fattachment?id=hxUMQ4fic3&name=pdf) | poster |\n| [MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization](https:\u002F\u002Fopenreview.net\u002Fattachment?id=R4q3cY3kQf&name=pdf) | poster |\n| [SEMDICE: Off-policy State Entropy Maximization via Stationary Distribution Correction Estimation](https:\u002F\u002Fopenreview.net\u002Fattachment?id=rJ5g8ueQaI&name=pdf) | poster |\n| [Strategist: Self-improvement of LLM Decision Making via Bi-Level Tree Search](https:\u002F\u002Fopenreview.net\u002Fattachment?id=gfI9v7AbFg&name=pdf) | poster |\n\n\n\u003Ca id='ICML25'>\u003C\u002Fa>\n## ICML25\n| Paper | Type |\n| ---- | ---- |\n| [EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents](https:\u002F\u002Fopenreview.net\u002Fattachment?id=DgGF2LEBPS&name=pdf) | oral |\n| [Network Sparsity Unlocks the Scaling Potential of Deep Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=mIomqOskaa&name=pdf) | oral |\n| [Multi-Turn Code Generation Through Single-Step Rewards](https:\u002F\u002Fopenreview.net\u002Fattachment?id=aJeLhLcsh0&name=pdf) | spotlight |\n| [Policy-labeled Preference Learning: Is Preference Enough for RLHF?](https:\u002F\u002Fopenreview.net\u002Fattachment?id=qLfo1sef50&name=pdf) | spotlight |\n| [Monte Carlo Tree Diffusion for System 2 Planning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=XrCbBdycDc&name=pdf) | spotlight |\n| [RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=PzSG5nKe1q&name=pdf) | spotlight |\n| [Decision Making under the Exponential Family: Distributionally Robust Optimisation with Bayesian Ambiguity Sets](https:\u002F\u002Fopenreview.net\u002Fattachment?id=r9HlTuCQfr&name=pdf) | spotlight |\n| [Hyperspherical Normalization for Scalable Deep Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=kfYxyvCYQ4&name=pdf) | spotlight |\n| [The Synergy of LLMs & RL Unlocks Offline Learning of Generalizable Language-Conditioned Policies with Low-fidelity Data](https:\u002F\u002Fopenreview.net\u002Fattachment?id=5hyfZ2jYfI&name=pdf) | spotlight |\n| [Penalizing Infeasible Actions and Reward Scaling in Reinforcement Learning with Offline Data](https:\u002F\u002Fopenreview.net\u002Fattachment?id=FSVdEzR4To&name=pdf) | spotlight |\n| [A Unified Theoretical Analysis of Private and Robust Offline Alignment: from RLHF to DPO](https:\u002F\u002Fopenreview.net\u002Fattachment?id=XEyGcrhxB8&name=pdf) | spotlight |\n| [Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations](https:\u002F\u002Fopenreview.net\u002Fattachment?id=c0dhw1du33&name=pdf) | spotlight |\n| [Continual Reinforcement Learning by Planning with Online World Models](https:\u002F\u002Fopenreview.net\u002Fattachment?id=mQeZEsdODh&name=pdf) | spotlight |\n| [Policy Regularization on Globally Accessible States in Cross-Dynamics 
Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Go0DdhjATH&name=pdf) | spotlight |\n| [Latent Diffusion Planning for Imitation Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=vhACnRfuYh&name=pdf) | spotlight |\n| [Novelty Detection in Reinforcement Learning with World Models](https:\u002F\u002Fopenreview.net\u002Fattachment?id=xtlixzbcfV&name=pdf) | spotlight |\n| [DPO Meets PPO: Reinforced Token Optimization for RLHF](https:\u002F\u002Fopenreview.net\u002Fattachment?id=IfWKVF6LfY&name=pdf) | spotlight |\n\n\n\u003Ca id='NeurIPS25'>\u003C\u002Fa>\n## NeurIPS25\n| Paper | Type |\n| ---- | ---- |\n| [State Entropy Regularization for Robust Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=rtG7n93Ru8&name=pdf) | oral |\n| [PRIMT: Preference-based Reinforcement Learning with Multimodal Feedback and Trajectory Synthesis from Foundation Models](https:\u002F\u002Fopenreview.net\u002Fattachment?id=4xvE6Iy77Y&name=pdf) | oral |\n| [A Clean Slate for Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=8P3QNSckMp&name=pdf) | oral |\n| [1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities](https:\u002F\u002Fopenreview.net\u002Fattachment?id=s0JVsx3bx1&name=pdf) | oral |\n| [QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training](https:\u002F\u002Fopenreview.net\u002Fattachment?id=ZwCVFBFUFb&name=pdf) | oral |\n| [Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?](https:\u002F\u002Fopenreview.net\u002Fattachment?id=4OsgYD7em5&name=pdf) | oral |\n| [AceSearcher: Bootstrapping Reasoning and Search for LLMs via Reinforced Self-Play](https:\u002F\u002Fopenreview.net\u002Fattachment?id=jSgCM0uZn3&name=pdf) | spotlight |\n| [Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems](https:\u002F\u002Fopenreview.net\u002Fattachment?id=W6WC6047X2&name=pdf) | spotlight |\n| [Forecasting in Offline Reinforcement Learning for Non-stationary Environments](https:\u002F\u002Fopenreview.net\u002Fattachment?id=24UJqxw1kv&name=pdf) | spotlight |\n| [Counteractive RL: Rethinking Core Principles for Efficient and Scalable Deep Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=qaHrpITIvB&name=pdf) | spotlight |\n| [Reverse Engineering Human Preferences with Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=heY0zzGvYm&name=pdf) | spotlight |\n| [SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=dt940loCBT&name=pdf) | spotlight |\n| [d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=7ZVRlBFuEv&name=pdf) | spotlight |\n| [SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation](https:\u002F\u002Fopenreview.net\u002Fattachment?id=kmv7yg6QXv&name=pdf) | spotlight |\n| [Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=9eIntNc69t&name=pdf) | spotlight |\n| [DeepDiver: Adaptive Web-Search Intensity Scaling via Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=CqLWckpTbG&name=pdf) | spotlight |\n| [DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion 
Models](https:\u002F\u002Fopenreview.net\u002Fattachment?id=YFa7eULIeN&name=pdf) | spotlight |\n| [Composite Flow Matching for Reinforcement Learning with Shifted-Dynamics Data](https:\u002F\u002Fopenreview.net\u002Fattachment?id=7cPDOBWTbM&name=pdf) | spotlight |\n| [Reinforcement Learning with Imperfect Transition Predictions: A Bellman-Jensen Approach](https:\u002F\u002Fopenreview.net\u002Fattachment?id=DYuPwwDy9n&name=pdf) | spotlight |\n| [Stable Gradients for Stable Learning at Scale in Deep Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Vqj65VeDOu&name=pdf) | spotlight |\n| [AlphaZero Neural Scaling and Zipf's Law: a Tale of Board Games and Power Laws](https:\u002F\u002Fopenreview.net\u002Fattachment?id=IMmkDMqFMU&name=pdf) | spotlight |\n| [DAPO: Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage-Based Policy Optimization](https:\u002F\u002Fopenreview.net\u002Fattachment?id=77eEDRhPkQ&name=pdf) | spotlight |\n| [VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=4oYxzssbVg&name=pdf) | spotlight |\n| [CURE: Co-Evolving Coders and Unit Testers via Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=wPdBe9zxNr&name=pdf) | spotlight |\n| [To Distill or Decide? Understanding the Algorithmic Trade-off in Partially Observable RL](https:\u002F\u002Fopenreview.net\u002Fattachment?id=iEgaS6wbLa&name=pdf) | spotlight |\n| [Adaptive Neighborhood-Constrained Q Learning for Offline Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=qgi5TfBXBw&name=pdf) | spotlight |\n| [To Think or Not To Think: A Study of Thinking in Rule-Based Visual Reinforcement Fine-Tuning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=YexxvBGwQM&name=pdf) | spotlight |\n| [DexGarmentLab: Dexterous Garment Manipulation Environment with Generalizable Policy](https:\u002F\u002Fopenreview.net\u002Fattachment?id=ZZ09oX2Xpo&name=pdf) | spotlight |\n| [Q-Insight: Understanding Image Quality via Visual Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Bds54EfR9x&name=pdf) | spotlight |\n| [Novel Exploration via Orthogonality](https:\u002F\u002Fopenreview.net\u002Fattachment?id=yJS1eZSNUv&name=pdf) | poster |\n| [Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=DWf4vroKWJ&name=pdf) | poster |\n| [Reinforcement Learning with Backtracking Feedback](https:\u002F\u002Fopenreview.net\u002Fattachment?id=14B5d6NEaH&name=pdf) | poster |\n| [Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback](https:\u002F\u002Fopenreview.net\u002Fattachment?id=OIH3T5ZPBW&name=pdf) | poster |\n| [World-aware Planning Narratives Enhance Large Vision-Language Model Planner](https:\u002F\u002Fopenreview.net\u002Fattachment?id=fggSyPPk0K&name=pdf) | poster |\n| [STAR: Efficient Preference-based Reinforcement Learning via Dual Regularization](https:\u002F\u002Fopenreview.net\u002Fattachment?id=E9EwDc45f8&name=pdf) | poster |\n| [FairDICE: Fairness-Driven Offline Multi-Objective Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=2jQJ7aNdT1&name=pdf) | poster |\n| [Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics](https:\u002F\u002Fopenreview.net\u002Fattachment?id=N2bLuwofZ0&name=pdf) | poster |\n| [On Evaluating Policies for Robust 
POMDPs](https:\u002F\u002Fopenreview.net\u002Fattachment?id=l2Wl77TSYY&name=pdf) | poster |\n| [Periodic Skill Discovery](https:\u002F\u002Fopenreview.net\u002Fattachment?id=BPSU46emit&name=pdf) | poster |\n| [Reinforcement Learning with Action Chunking](https:\u002F\u002Fopenreview.net\u002Fattachment?id=XUks1Y96NR&name=pdf) | poster |\n| [ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=OuGAwwAT8G&name=pdf) | poster |\n| [UFO-RL: Uncertainty-Focused Optimization for Efficient Reinforcement Learning Data Selection](https:\u002F\u002Fopenreview.net\u002Fattachment?id=sH0ZwzDJZn&name=pdf) | poster |\n| [Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models](https:\u002F\u002Fopenreview.net\u002Fattachment?id=9osvTOYbT4&name=pdf) | poster |\n| [DISCOVER: Automated Curricula for Sparse-Reward Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=guZBnsKPsw) | poster |\n| [EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation](https:\u002F\u002Fopenreview.net\u002Fattachment?id=igtjRQfght&name=pdf) | poster |\n| [Dynamics-Aligned Latent Imagination in Contextual World Models for Zero-Shot Generalization](https:\u002F\u002Fopenreview.net\u002Fattachment?id=41bIzD5sit&name=pdf) | poster |\n| [Off-policy Reinforcement Learning with Model-based Exploration Augmentation](https:\u002F\u002Fopenreview.net\u002Fattachment?id=JGkZgEEjiM&name=pdf) | poster |\n| [IOSTOM: Offline Imitation Learning from Observations via State Transition Occupancy Matching](https:\u002F\u002Fopenreview.net\u002Fattachment?id=OEp1J4V2fN&name=pdf) | poster |\n| [Tree-Guided Diffusion Planner](https:\u002F\u002Fopenreview.net\u002Fattachment?id=I1C0a01BZu&name=pdf) | poster |\n| [Real-Time Execution of Action Chunking Flow Policies](https:\u002F\u002Fopenreview.net\u002Fattachment?id=UkR2zO5uww&name=pdf) | poster |\n| [Prompted Policy Search: Reinforcement Learning through Linguistic and Numerical Reasoning in LLMs](https:\u002F\u002Fopenreview.net\u002Fattachment?id=95plu1Mo20&name=pdf) | poster |\n| [Behavior Injection: Preparing Language Models for Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=mzlwDAQkgJ&name=pdf) | poster |\n| [ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=D1PeGJtVEu&name=pdf) | poster |\n| [Coarse-to-fine Q-Network with Action Sequence for Data-Efficient Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=VoFXUNc9Zh&name=pdf) | poster |\n| [Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=A0T3piHiis&name=pdf) | poster |\n| [Deep RL Needs Deep Behavior Analysis: Exploring Implicit Planning by Model-Free Agents in Open-Ended Environments](https:\u002F\u002Fopenreview.net\u002Fattachment?id=QD06Qv7O0P&name=pdf) | poster |\n\n\u003Ca id='ICLR26'>\u003C\u002Fa>\n## ICLR26\n| Paper | Type |\n| ---- | ---- |\n| [Exploratory Diffusion Model for Unsupervised Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fpdf?id=k0Kb1ynFbt) | oral |\n| [Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search](https:\u002F\u002Fopenreview.net\u002Fattachment?id=kMuQBgPIdg&name=pdf) | oral |\n| [Why DPO is a Misspecified Estimator and How to Fix 
It](https:\u002F\u002Fopenreview.net\u002Fattachment?id=btEiAfnLsX&name=pdf) | oral |\n| [SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety](https:\u002F\u002Fopenreview.net\u002Fattachment?id=PJdw4VBsXD&name=pdf) | oral |\n| [Compositional Diffusion with Guided search for Long-Horizon Planning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=b8avf4F2hn&name=pdf) | oral |\n| [LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts](https:\u002F\u002Fopenreview.net\u002Fattachment?id=o29E01Q6bv&name=pdf) | oral |\n| [GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=RQm2KQTM5r&name=pdf) | oral |\n| [Reasoning without Training: Your Base Model is Smarter Than You Think](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Vsgq2ldr4K&name=pdf) | oral |\n| [Rodrigues Network for Learning Robot Actions](https:\u002F\u002Fopenreview.net\u002Fattachment?id=IZHk6BXBST&name=pdf) | oral |\n| [Mean Flow Policy with Instantaneous Velocity Constraint for One-step Action Generation]() | oral |\n| [TD-JEPA: Latent-predictive Representations for Zero-Shot Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=SzXDuBN8M1&name=pdf) | oral |\n| [WoW!: World Models in a Closed-Loop World](https:\u002F\u002Fopenreview.net\u002Fattachment?id=yDmb7xAfeb&name=pdf) | oral |\n| [DiffusionNFT: Online Diffusion Reinforcement with Forward Process](https:\u002F\u002Fopenreview.net\u002Fattachment?id=VJZ477R89F&name=pdf) | oral |\n| [Mastering Sparse CUDA Generation through Pretrained Models and Deep Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=VdLEaGPYWT&name=pdf) | oral |\n| [LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=JWx4DI2N8k&name=pdf) | oral |\n| [MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Models for Embodied Task Planning](https:\u002F\u002Fopenreview.net\u002Fattachment?id=mIeKe74W43&name=pdf) | oral |\n","# 强化学习论文\n[![欢迎提交PR](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPRs-welcome-brightgreen.svg?style=flat-square)](http:\u002F\u002Fmakeapullrequest.com)\n\n强化学习相关论文（我们主要关注单智能体强化学习）。\n\n由于每年各大会议上都有数以万计的强化学习新论文，我们只能列出那些我们阅读过并认为具有启发性的论文。\n\n**我们已添加了一些ICLR22、ICML22、NeurIPS22、ICLR23、ICML23、NeurIPS23、ICLR24、ICML24、NeurIPS24、ICLR25、ICML25、NeurIPS25、ICLR26的强化学习论文。**\n\u003C!-- neurips 24 page 31-->\n\u003C!-- iclr 25 page 11-->\n\u003C!-- icml 25 page 1-->\n\u003C!-- nsueips 25 page 21-->\n\n## 目录 \n* [无模型（在线）强化学习](#Model-Free-Online)\n    - [经典方法](#model-free-classic)\n    - [探索](#exploration)\n    - [表示学习](#Representation-RL)\n    - [无监督学习](#Unsupervised-RL)\n    - [当前方法](#current)\n* [基于模型（在线）强化学习](#Model-Based-Online)\n    - [经典方法](#model-based-classic)\n    - [世界模型](#dreamer)\n    - [代码库](#model-based-code)\n* [(无模型) 离线强化学习](#Model-Free-Offline)\n    - [当前方法](#offline-current)\n    - [与扩散模型结合](#offline-diffusion)\n* [基于模型的离线强化学习](#Model-Based-Offline)\n* [元强化学习](#Meta-RL)\n* [对抗性强化学习](#Adversarial-RL)\n* [强化学习中的泛化](#Genaralization-in-RL)\n    - [环境](#Gene-Environments)\n    - [方法](#Gene-Methods)\n* [使用Transformer\u002FLLM的强化学习](#Sequence-Generation)\n* [教程与课程](#Tutorial-and-Lesson)\n* [ICLR22](#ICLR22)\n* [ICML22](#ICML22)\n* [NeurIPS22](#NeurIPS22)\n* [ICLR23](#ICLR23)\n* [ICML23](#ICML23)\n* [NeurIPS23](#NeurIPS23)\n* [ICLR24](#ICLR24)\n* [ICML24](#ICML24)\n* [NeurIPS24](#NeurIPS24)\n* [ICLR25](#ICLR25)\n* 
[ICML25](#ICML25)\n* [NeurIPS25](#NeurIPS25)\n* [ICLR26](#ICLR26)\n\n\u003Ca id='Model-Free-Online'>\u003C\u002Fa>\n## 无模型（在线）强化学习\n\n\n\u003Ca id='model-free-classic'>\u003C\u002Fa>\n\n### 经典方法\n\n| 标题 | 方法 | 会议 | 策略类型 | 动作空间 | 策略 | 描述 |\n| ----  | ----   | ----       |   ----  | ----  |  ---- |  ---- | \n| [通过深度强化学习实现人类水平控制](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fnature14236.pdf), [\\[其他链接\\]](http:\u002F\u002Fwww.kreimanlab.com\u002Facademia\u002Fclasses\u002FBAI\u002Fpdfs\u002FMnihEtAlHassibis15NatureControlDeepRL.pdf) | DQN | Nature15 | 离策略 | 离散 | 基于值函数 | 使用深度神经网络训练Q学习，在Atari游戏中达到人类水平；主要技巧包括：用于提高样本效率的回放缓冲区，以及目标网络与行为网络的解耦 |\n| [基于双重Q学习的深度强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1509.06461.pdf) | Double DQN | AAAI16 | 离策略 | 离散 | 基于值函数 | 发现DQN中的Q值可能被高估；通过两个神经网络分别负责Q值计算和动作选择来解耦 |\n| [用于深度强化学习的决斗网络架构](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1511.06581.pdf) | Dueling DQN | ICML16 | 离策略 | 离散 | 基于值函数 | 使用同一神经网络同时近似Q值和状态价值，以计算优势函数 |\n| [优先级经验回放](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1511.05952.pdf) | 优先采样 | ICLR16 | 离策略 | 离散 | 基于值函数 | 为回放缓冲区中的样本赋予不同权重（例如TD误差） |\n| [Rainbow：结合深度强化学习中的多项改进](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1710.02298.pdf) | Rainbow | AAAI18 | 离策略 | 离散 | 基于值函数 | 将多种DQN改进整合在一起：Double DQN、Dueling DQN、优先采样、多步学习、分布型RL、噪声网络 |\n| [带有函数逼近的强化学习策略梯度方法](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper\u002F1999\u002Ffile\u002F464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf) | PG | NeurIPS99 | 在策略\u002F离策略 | 连续或离散 | 函数逼近 | 提出策略梯度定理：如何计算策略期望累积回报的梯度 |\n| ---- | AC\u002FA2C | ---- | 在策略\u002F离策略 | 连续或离散 | 参数化神经网络 | AC：用Q值近似器替代PG中的回报，以降低方差；A2C：用优势函数替代AC中的Q值，以降低方差 |\n| [深度强化学习的异步方法](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1602.01783.pdf) | A3C | ICML16 | 在策略\u002F离策略 | 连续或离散 | 参数化神经网络 | 提出三项提升性能的技巧：(i) 使用多个智能体与环境交互；(ii) 价值函数和策略共享网络参数；(iii) 修改损失函数（价值函数的均方误差 + PG损失 + 策略熵）|\n| [信任域策略优化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1502.05477.pdf) | TRPO | ICML15 | 在策略 | 连续或离散 | 参数化神经网络 | 在策略优化中引入信任域，以保证单调性改进 |\n| [近端策略优化算法](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1707.06347.pdf) | PPO | arXiv17 | 在策略 | 连续或离散 | 参数化神经网络 | 用剪裁系数的惩罚项替代TRPO的硬约束 |\n| [确定性策略梯度算法](http:\u002F\u002Fproceedings.mlr.press\u002Fv32\u002Fsilver14.pdf) | DPG | ICML14 | 离策略 | 连续 | 函数逼近 | 针对连续动作空间考虑确定性策略，并证明确定性策略梯度定理；同时使用随机行为策略以鼓励探索 |\n| [基于深度强化学习的连续控制](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1509.02971.pdf) | DDPG | ICLR16 | 离策略 | 连续 | 参数化神经网络 | 将DQN的思想应用于DPG：(i) 使用深度神经网络作为函数逼近器，(ii) 使用回放缓冲区，(iii) 每个epoch固定目标Q值 |\n| [解决演员-评论家方法中的函数逼近误差](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1802.09477.pdf) | TD3 | ICML18 | 离策略 | 连续 | 参数化神经网络 | 将Double DQN的思想应用到DDPG中：取一对评论家网络输出的最小值，以限制高估现象 |\n| [基于深度能量模型的强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1702.08165.pdf) | SQL | ICML17 | 离策略 | 主要针对连续动作 | 参数化神经网络 | 考虑最大熵强化学习，并提出软Q迭代及软Q学习 |\n| [软演员-评论家算法及其应用](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1812.05905.pdf), [软演员-评论家：具有随机演员的离策略最大熵深度强化学习](http:\u002F\u002Fproceedings.mlr.press\u002Fv80\u002Fhaarnoja18b\u002Fhaarnoja18b.pdf), [\\[附录\\]](http:\u002F\u002Fproceedings.mlr.press\u002Fv80\u002Fhaarnoja18b\u002Fhaarnoja18b-supp.pdf) | SAC | ICML18 | 离策略 | 主要针对连续动作 | 参数化神经网络 | 基于SQL的理论分析，扩展软Q迭代（软Q评估+软Q改进）；对策略进行重参数化，并使用两个参数化的价值函数；提出SAC |
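\n\n上表反复出现两类实现技巧：一是 DQN 的“目标网络与行为网络解耦”，二是 Double DQN 与 TD3 的“用两个估计取最小值以抑制高估”。下面给出一个极简的 PyTorch 风格草图，示意 TD3 式“剪裁双Q”目标值的计算；其中 `q1_target`、`q2_target`、`policy_target`、`batch` 等名称均为说明而假设的组件，并非任何论文的官方实现：\n\n```python\nimport torch\n\ndef td3_target(q1_target, q2_target, policy_target, batch,\n               gamma=0.99, noise_std=0.2, noise_clip=0.5):\n    # TD3 式“剪裁双Q”目标值（示意草图，超参取论文常用默认值）\n    with torch.no_grad():\n        # 目标策略平滑：对目标动作加入截断噪声\n        noise = (torch.randn_like(batch['action']) * noise_std).clamp(-noise_clip, noise_clip)\n        next_action = (policy_target(batch['next_state']) + noise).clamp(-1.0, 1.0)\n        # 取两个目标评论家输出的最小值，以抑制自举带来的高估\n        q_next = torch.min(q1_target(batch['next_state'], next_action),\n                           q2_target(batch['next_state'], next_action))\n        # 贝尔曼目标：终止状态处用 (1 - done) 截断自举\n        return batch['reward'] + gamma * (1.0 - batch['done']) * q_next\n```\n\n两个评论家随后分别对这一共享目标做均方误差回归，大致对应上表 TD3 行所述“取一对评论家网络输出的最小值，以限制高估现象”。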
\n\n\u003Ca id='exploration'>\u003C\u002Fa>\n\n### 探索\n| 标题 | 方法 | 会议 | 描述 |\n| ----  | ----   | ----       |   ----  |\n| [由自监督预测驱动的好奇心探索](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1705.05363.pdf) | ICM | ICML17 | 提出好奇心可以作为内在奖励信号，使智能体在奖励稀疏的情况下探索环境并学习技能；将好奇心定义为智能体在由自监督逆动力学模型学习到的视觉特征空间中，预测自身行为后果的能力误差 |\n| [基于潜在贝叶斯惊讶的好奇心驱动探索](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2104.07495.pdf) | LBS | AAAI22 | 在表示智能体当前对系统动态理解的潜在空间中应用贝叶斯惊讶 |\n| [深度强化学习中用于探索的自动内在奖励塑造](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.10886.pdf) | AIRS | ICML23 | 根据实时估计的任务回报从预定义集合中选择塑造函数，提供可靠的探索激励并缓解目标偏差问题；开发了一个基于 PyTorch 的工具包，提供多种高质量的内在奖励模块实现 |\n| [事后好奇心：随机环境中的内在探索](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10515.pdf) | 事后好奇心 | ICML23 | 考虑随机环境中的探索；学习能够精确捕捉每个结果不可预测方面的未来表征，并将其用作预测的额外输入，从而使内在奖励仅反映世界动态中可预测的部分 |\n| 最大化以探索：融合估计、规划和探索的单一目标函数 || NeurIPS23 Spotlight ||\n| [MIMEx：来自掩码输入建模的内在奖励](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.08932.pdf) | MIMEx | NeurIPS23 | 提出可以通过灵活调整掩码分布来控制底层条件预测任务的难度 |
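\n\n上表中 ICM 的核心思想可以概括为：在学习到的特征空间里做前向预测，用预测误差充当内在奖励。下面是一个极简草图（PyTorch 风格；`encoder` 与 `forward_model` 为说明而假设的网络，并非官方实现）：\n\n```python\nimport torch\n\ndef icm_intrinsic_reward(encoder, forward_model, state, action, next_state, eta=0.01):\n    # ICM 式内在奖励草图：特征空间中前向模型的预测误差\n    with torch.no_grad():\n        phi = encoder(state)                   # 当前状态特征 φ(s)\n        phi_next = encoder(next_state)         # 下一状态特征 φ(s')\n        phi_pred = forward_model(phi, action)  # 预测的下一状态特征\n        # 预测越不准，奖励越大，从而鼓励智能体探索“学不准”的区域\n        return eta * 0.5 * (phi_pred - phi_next).pow(2).sum(dim=-1)\n```\n\n按上表 ICM 行的描述，特征空间由自监督逆动力学模型（从一对状态特征预测动作）训练得到，从而只保留智能体可影响的信息。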
\n\n\u003C!--\u003Ca id='off-policy-evaluation'>\u003C\u002Fa>\n### 离策略评估\n| 标题 | 方法 | 会议 | 描述 |\n| ----  | ----   | ----       |   ----  |\n| [带有线性函数近似的离策略学习的加权重要性采样](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper\u002F2014\u002Ffile\u002Fbe53ee61104935234b174e62a07e53cf-Paper.pdf) | WIS-LSTD | NeurIPS14 |  |\n| [基于估计行为策略的重要性采样策略评估](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1806.01347.pdf) | RIS | ICML19 |  |\n| [关于离策略强化学习中的重用偏差](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07074.pdf) | BIRIS | IJCAI23 | 讨论由于重用回放缓冲区而导致的离策略评估偏差；推导出重用偏差的高概率界；引入离策略算法稳定性的概念，并给出稳定离策略算法的上界 | \n\n\n\u003Ca id='soft-rl'>\u003C\u002Fa>\n### 软强化学习\n\n| 标题 | 方法 | 会议 | 描述 |\n| ----  | ----   | ----       |   ----  |\n| [强化学习中的最大—最小熵框架](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.10517.pdf) | MME | NeurIPS21 | 发现 SAC 可能在低熵状态中无法有效探索（到达高熵状态并提高其熵）；提出最大—最小熵框架来解决这一问题 |\n| [最大熵 RL（理论上）解决部分鲁棒 RL 问题](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2103.06257.pdf) | ---- | ICLR22 | 从理论上证明，标准的最大熵强化学习对动力学和奖励函数中的一些扰动具有鲁棒性 | \n\n\n\u003Ca id='data-augmentation'>\u003C\u002Fa>\n### 数据增强\n| 标题 | 方法 | 会议 | 描述 |\n| ----  | ----   | ----       |   ----  |\n| [基于增强数据的强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2004.14990.pdf) | RAD | NeurIPS20 | 提出首次对 RL 中针对像素和状态输入的通用数据增强进行全面研究 |\n| [图像增强就够了：从像素出发正则化深度强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2004.13649.pdf) | DrQ | ICLR21 Spotlight | 提出在使用无模型方法进行数据增强时对价值函数进行正则化，并在图像像素任务中达到最先进水平 | -->\n\n\u003C!-- | [马尔可夫决策过程中的等价性概念和模型约简](https:\u002F\u002Fwww.ics.uci.edu\u002F~dechter\u002Fcourses\u002Fics-295\u002Fwinter-2018\u002Fpapers\u002Fgivan-dean-greig.pdf) |  | 人工智能, 2003 |  |\n| [有限马尔可夫决策过程的度量](https:\u002F\u002Farxiv.org\u002Fftp\u002Farxiv\u002Fpapers\u002F1207\u002F1207.4114.pdf) || UAI04 ||\n| [连续马尔可夫决策过程的双仿真度量](https:\u002F\u002Fwww.normferns.com\u002Fassets\u002Fdocuments\u002Fsicomp2011.pdf) || SIAM 计算期刊, 2011 ||\n| [用于计算确定性马尔可夫决策过程中状态相似性的可扩展方法](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1911.09291.pdf) || AAAI20 ||\n| [无需重建即可学习强化学习的不变表示](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2006.10742.pdf) | DBC | ICLR21 || -->\n\n\n\u003Ca id='Representation-RL'>\u003C\u002Fa>\n\n## 表征学习\n\n注意：基于MBRL的表征学习位于[世界模型](#dreamer)部分。\n\n| 标题 | 方法 | 会议 | 描述 |\n| ----  | ----   | ----       |   ----  |\n| [CURL: 强化学习中的对比无监督表征](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2004.04136.pdf) | CURL | ICML20 | 使用对比学习从原始像素中提取高层次特征，并在这些特征之上进行离策略控制 |\n| [无需重建的强化学习不变表征学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2006.10742.pdf) | DBC | ICLR21 | 提出使用双仿真度量来学习鲁棒的潜在表征，该表征仅编码观测中的任务相关信息 |\n| [基于原型表征的强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.11271.pdf) | Proto-RL | ICML21 | 在没有下游任务信息的环境中预训练与任务无关的表征和原型 |\n| [通过行动理解世界](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.12543.pdf) | ---- | CoRL21 | 讨论自监督强化学习与离线强化学习相结合如何实现可扩展的表征学习 |\n| [基于流的POMDP递归信念状态学习](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fchen22q\u002Fchen22q.pdf) | FORBES | ICML22 | 将归一化流融入变分推断中，以学习适用于POMDP的一般连续信念状态 |\n| [作为目标条件强化学习的对比学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07568.pdf) | 对比RL | NeurIPS22 | 表明（对比）表征学习方法本身就可以被视为强化学习算法 |\n| [自监督学习真的能提升基于像素的强化学习吗？](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.05266.pdf) | ---- | NeurIPS22 | 在现有的像素强化学习联合学习框架下，对多种自监督损失函数进行了广泛比较，测试环境涵盖了不同基准中的多个环境，其中包括一个真实环境 |\n| [自动化辅助损失搜索的强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06041.pdf) | A2LS | NeurIPS22 | 提出自动搜索表现最佳的辅助损失函数，以在强化学习中学习更好的表征；基于收集到的轨迹数据定义了一个大小为7.5 × 10^20的通用辅助损失空间，并采用高效的进化搜索策略对该空间进行探索 |\n| [基于掩码的强化学习潜在空间重建](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12096.pdf) | MLR | NeurIPS22 | 提出一种有效的自监督方法，通过时空掩码后的像素观测来预测潜在空间中的完整状态表征 |\n| [通过价值隐式预训练实现通用视觉奖励与表征](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00030.pdf) | VIP | ICLR23 Spotlight | 将从人类视频中进行表征学习视为一个离线的目标条件强化学习问题；推导出一种不依赖于动作的自监督双重目标条件价值函数目标，从而能够在未标注的人类视频上进行预训练 |\n| [强化学习中的潜在变量表征](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08765.pdf) | ---- | ICLR23 | 为状态-动作价值函数提供了一种基于潜在变量模型的表征视角，该视角既允许可处理的变分学习算法，又能在面对不确定性时有效实施乐观\u002F悲观原则以促进探索 |\n| 强化学习中的谱分解表征 || ICLR23 ||\n| [通过观看纯视频，在有限数据下成为熟练玩家](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Sy-o2N0hF4f) | FICC | ICLR23 | 考虑预训练数据仅为无动作视频的情况；引入两阶段训练流程：预训练阶段——从视频中隐式提取隐藏的动作嵌入，并基于向量量化预训练视觉表征和环境动力学网络；下游任务阶段——基于已学习的模型，用少量任务数据进行微调 |\n| [强化学习中的自举表征](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.10171.pdf) | ---- | ICML23 | 对时序差分学习所学到的状态表征进行了理论刻画；发现该表征与蒙特卡洛算法和残差梯度算法在策略评估场景下对于大多数环境转移结构所学到的特征存在差异 |\n| [表征驱动的强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.19922.pdf) | RepRL | ICML23 | 通过将策略空间映射到线性特征空间，将策略搜索问题简化为上下文相关的多臂赌博机问题 |\n| [用于强化学习中解耦表征的条件互信息](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.14133.pdf) | CMID | NeurIPS23 Spotlight | 为强化学习算法提出一项辅助任务，通过最小化表征中各特征之间的条件互信息来学习具有相关特征的高维观测的解耦表征 |\n\n\u003Ca id='Unsupervised-RL'>\u003C\u002Fa>\n## 无监督学习\n\n| 标题 | 方法 | 会议 | 描述 |\n| ----  | ----   | ----       |   ----  |\n| [变分内在控制](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1611.07507.pdf) | ---- | arXiv1611 | 提出一种新的无监督强化学习方法，用于发现智能体可用的内在选项集，该方法通过最大化智能体能够可靠到达的不同状态数量来学习，而这一数量由选项集与选项终止状态之间的互信息衡量 |\n| [多样性就是一切：无需奖励函数的学习技能](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1802.06070.pdf) | DIAYN | ICLR19 | 通过最大化信息论目标，在没有任何奖励的环境中学习多样化的技能 |\n| 通过非参数判别奖励实现无监督控制 || ICLR19 ||\n| [动态感知的无监督技能发现](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1907.01657.pdf) | DADS | ICLR20 | 提议使用无模型RL学习低级技能，其明确目标是使基于模型的控制更加容易 |\n| [基于变分内在后续特征的快速任务推断](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1906.05030.pdf) | VISR | ICLR20 ||\n| [将表征学习与强化学习解耦](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2009.08319.pdf) | ATC | ICML21 | 提出一种名为增强时间对比（ATC）的新无监督任务，专为强化学习设计，借鉴了对比学习的思想；通过在专家演示上预训练编码器，并将其用于强化学习智能体中，对几种领先的无监督学习算法进行基准测试 |\n| [瓶颈选项学习下的无监督技能发现](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.14305.pdf) | IBOL | ICML21 | 提出一种基于信息瓶颈的新型技能发现方法，具有多重优势，包括以更解耦和可解释的方式学习技能，同时对干扰信息具有鲁棒性 |\n| [APS：基于后续特征的主动预训练](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.13956.pdf) | APS | ICML21 | 通过一种新颖的方式将APT和VISR结合起来，以解决两者的不足 |\n| [从虚空中涌现的行为：无监督主动预训练](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2103.04551.pdf) | APT | NeurIPS21 | 提出在抽象表征空间中计算非参数熵；对于一组样本，计算每个粒子与其最近邻点之间欧氏距离的平均值 |\n| [为数据高效的强化学习预训练表征](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.04799.pdf) | SGI | NeurIPS21 | 考虑使用未标记数据进行预训练，并在少量特定任务数据上进行微调，以提高强化学习的数据效率；采用潜在动力学建模与无监督目标条件RL相结合的方法 |\n| [URLB：无监督强化学习基准](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.15191.pdf) | URLB | NeurIPS21 | 一个用于无监督强化学习的基准 |\n| [通过世界模型发现并实现目标](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.09514.pdf) | LEXA | NeurIPS21 | 在世界模型中通过想象的rollout无监督地训练探索者和实现者策略；在无监督阶段结束后，无需任何额外学习，即可零样本地解决以目标图像形式指定的任务 |\n| 
[无监督强化学习的信息几何](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.02719.pdf) | ---- | ICLR22 口头报告 | 表明基于互信息最大化的无监督技能发现算法并不能学习到对所有可能奖励函数都最优的技能；为一些技能学习方法提供了几何视角 |\n| [利普希茨约束下的无监督技能发现](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00914.pdf) | LSD | ICLR22 | 认为基于MI的技能发现方法很容易仅通过状态空间中的微小差异就最大化MI目标；提出一种基于利普希茨约束状态表示函数的新目标，使得在潜在空间中最大化该目标时，总是伴随着状态空间中行进距离（或变化）的增加 |\n| [通过乐观探索学习更多技能](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.14226.pdf) | DISDAIN | ICLR22 | 推导出一种涉及训练判别器集成并奖励策略使其产生分歧的信息增益辅助目标；该目标直接估计由于判别器未见过足够训练样本而产生的认识论不确定性 |\n| [无监督强化学习的惊喜混合](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06702.pdf) | MOSS | NeurIPS22 | 训练一个以最大化惊喜为目标的组件，另一个以最小化惊喜为目标的组件，以应对环境动态熵未知的情况 |\n| [基于对比内在控制的无监督强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00161.pdf) | CIC | NeurIPS22 | 提议最大化状态转移与潜在技能向量之间的互信息 |\n| [通过循环技能训练进行无监督技能发现](https:\u002F\u002Fopenreview.net\u002Fpdf?id=sYDX_OxNNjh) | ReST | NeurIPS22 | 鼓励后训练的技能避免进入先前技能已覆盖的状态 |\n| [编舞者：在想象中学习和适应技能](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13350.pdf) | 编舞者 | ICLR23 Spotlight | 将探索和技能学习过程解耦；利用元控制器高效评估和调整所学技能，通过在想象中并行部署它们来实现 |\n| 可证明的离线强化学习无监督数据共享 || ICLR23 ||\n| 通过DOMiNO发现策略：保持近似最优性的多样性优化 || ICLR23 ||\n| [从像素开始掌握无监督强化学习基准](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12016.pdf) | Dyna-MPC | ICML23 口头报告 | 利用无监督基于模型的RL对智能体进行预训练；通过结合混合规划器Dyna-MPC的任务感知微调策略，对下游任务进行微调 |\n| [特征去相关性在强化学习无监督表征学习中的重要性](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05637.pdf) | SimTPR | ICML23 | 提出一种新的URL框架，能够在因果预测未来状态的同时，通过去相关潜在空间中的特征来增加潜在流形的维度 |\n| CLUTR：基于无监督任务表征学习的课程学习 || ICML23 ||\n| [考虑可控性的无监督技能发现](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.05103.pdf) | CSD | ICML23 | 基于当前技能库训练一个考虑可控性的距离函数，并将其与最大化距离的技能发现相结合 |\n| [行为对比学习用于无监督技能发现](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.04477.pdf) | BeCL | ICML23 | 提出一种通过行为间对比学习来进行无监督技能发现的新方法，使智能体对同一技能产生相似行为，而对不同技能则产生多样化行为 |\n| 用于无监督技能发现的变分课程强化学习 || ICML23 ||\n| [通过引导发现技能](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.20178.pdf) | DISCO-DANCE | NeurIPS23 | 选择具有最高潜力到达未探索状态的引导技能，引导其他技能跟随该引导技能；被引导的技能会被分散开来，以最大化其在未探索状态中的可区分性 |\n| 在强化学习中创建多层级技能层次结构 || NeurIPS23 ||\n| 通过随机意图先验进行无监督行为提取 || NeurIPS23 ||\n| [METRA：基于度量感知抽象的可扩展无监督RL](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.08887.pdf) | METRA | ICLR24 口头报告 |  |\n| [语言引导的技能发现](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.06615) | LGSD | arXiv2406 | 以用户提示作为输入，输出一组语义上独特的技能 |\n| [PEAC：跨化身强化学习的无监督预训练](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.14073.pdf) | CEURL, PEAC | NeurIPS24 | 考虑在多种化身分布上进行无监督预训练，即CEURL；并提出PEAC来处理CEURL |\n| [无监督强化学习的探索性扩散模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.07279.pdf) | ExDM | ICLR26 口头报告 | 利用扩散模型来增强无监督探索，并对预训练的扩散策略进行微调 |
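\n\n上表里 DIAYN、CIC、BeCL 等方法的共同骨架，是用一个判别器从状态反推技能变量，并以“技能的可区分性”作为内在奖励。下面给出一个极简草图（PyTorch 风格；`discriminator` 为说明而假设的技能判别网络，技能先验取均匀分布，并非任何论文的官方实现）：\n\n```python\nimport torch\nimport torch.nn.functional as F\n\ndef diayn_intrinsic_reward(discriminator, state, skill_id, num_skills):\n    # DIAYN 式内在奖励草图：log q(z|s) - log p(z)，技能越可区分奖励越高\n    with torch.no_grad():\n        log_q = F.log_softmax(discriminator(state), dim=-1)  # [batch, num_skills]\n        log_q_z = log_q.gather(-1, skill_id.unsqueeze(-1)).squeeze(-1)\n        log_p_z = -torch.log(torch.tensor(float(num_skills)))  # 均匀先验\n        return log_q_z - log_p_z\n```\n\n判别器以交叉熵在（状态，技能）数据上训练；技能策略最大化该奖励，判别器越容易从状态认出技能，不同技能的行为就越多样。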
在无模型方法中应用数据增强时，提出对价值函数进行正则化，并在图像像素任务上达到最先进水平 |\n| [在策略强化学习中什么最重要？一项大规模实证研究](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2006.05990.pdf) | ---- | ICLR21 | 对MuJoCo上的不同技巧进行大规模实证研究，以评估在策略算法的效果 |\n| [镜像下降策略优化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2005.09814.pdf) | MDPO | ICLR21 |  |\n| [无需重建即可学习强化学习的不变表征](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2006.10742.pdf) | DBC | ICLR21 ||\n| [随机集成双Q学习：无需模型也能快速学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2101.05982.pdf) | REDQ | ICLR21 | 考虑三个要素：(i) 每个epoch多次更新Q函数；(ii) 使用Q函数集成；(iii) 从集成的随机子集中取最小值以避免过估计；提出REDQ并达到与基于模型的方法相当的性能 |\n| [处于统计临界点边缘的深度强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.13264.pdf) | ---- | NeurIPS21 杰出论文 | 倡导报告聚合性能的区间估计，并提出性能轮廓来考虑结果的变异性，同时引入更稳健、高效的聚合指标，如四分位数均值，以减小结果的不确定性；[\\[rliable\\]](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Frliable\u002F) |\n| [适用于深度强化学习的可泛化情景记忆](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2103.06469.pdf) | GEM | ICML21 | 提出将神经网络的泛化能力和情景记忆的快速检索方式相结合 |\n| [强化学习中的最大-最小熵框架](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.10517.pdf) | MME | NeurIPS21 | 发现SAC可能无法探索低熵状态（到达高熵状态并提高其熵）；提出最大-最小熵框架来解决这一问题 |\n| [SO(2)等变强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04439.pdf) | Equi DQN, Equi SAC | ICLR22 Spotlight | 考虑学习变换不变的策略和价值函数；定义并分析群等变MDP |\n| [CoBERL：用于强化学习的对比BERT](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.05431.pdf) | CoBERL | ICLR22 Spotlight | 提出用于强化学习的对比BERT（COBERL），结合新的对比损失和混合LSTM-Transformer架构，以应对提高数据效率的挑战 |\n| [理解和预防强化学习中的容量损失](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ZkC8wKoLbQ7) | InFeR | ICLR22 Spotlight | 提出深度RL智能体在训练过程中会失去快速拟合新预测任务的能力；提出InFeR，将一组网络输出正则化回初始值 |\n| [关于深度强化学习中的彩票假设与最小任务表示](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2105.01648.pdf) | ---- | ICLR22 Spotlight | 探讨深度强化学习中的彩票假设 |\n| [利用离线演示引导的稀疏奖励强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04628.pdf) | LOGO | ICLR22 Spotlight | 针对强化学习中的稀疏奖励挑战；提出LOGO，利用次优行为策略生成的离线演示数据；每一步LOGO包括通过TRPO进行策略改进，以及使用次优行为策略进行额外的策略引导 |\n| [通过不确定性估计实现样本高效的深度强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.01666.pdf) | IV-RL | ICLR22 Spotlight | 分析无模型DRL算法监督中的不确定性来源，并表明可以通过负对数似然和方差集成来估计监督噪声的方差 |\n| [用于强化学习中时间协调探索的生成式规划](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.09765.pdf) | GPM | ICLR22 Spotlight | 专注于为无模型RL生成一致的动作，借鉴基于模型的规划和重复动作的思想；使用策略生成多步动作 |\n| [智能体何时应该探索？](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.11811.pdf) | ---- | ICLR22 Spotlight | 探讨何时进行探索，并提出选择异质模式切换的行为策略 |\n| [最大化深度强化学习中集成的多样性](https:\u002F\u002Fopenreview.net\u002Fpdf?id=hjd-kcpDpf2) | MED-RL | ICLR22 |  |\n| [最大熵RL（理论上）可解决部分鲁棒RL问题](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2103.06257.pdf) | ---- | ICLR22 | 理论证明标准的最大熵RL对动力学和奖励函数中的某些扰动具有鲁棒性 |\n| [通过行为相似性的自适应元学习器学习强化学习的可泛化表征](https:\u002F\u002Fopenreview.net\u002Fpdf?id=zBOI9LFpESK) | AMBS | ICLR22 |  |\n| [大批量经验回放](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.01528.pdf) | LaBER | ICML22 口头报告 | 将回放缓冲区采样问题视为一种重要性采样问题，用于估计梯度，并推导出理论最优采样分布 |\n| [可微分模拟器是否能提供更好的策略梯度？](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00817.pdf) | ---- | ICML22 口头报告 | 探讨可微分模拟器是否能提供更好的策略梯度；指出一阶估计的一些陷阱，并提出α阶估计 |\n| 联邦强化学习：通信高效的算法及收敛性分析 || ICML22 口头报告 ||\n| [通用策略优化的解析更新规则](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.02045.pdf) | ---- | ICML22 口头报告 | 为信赖域方法提供更紧的边界 |\n| [基于几何策略组合的广义策略改进](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08736.pdf) | GSPs | ICML22 口头报告 | 提出几何切换策略（GSP）的概念，即我们有一组策略，轮流使用它们采取行动；对于每个策略，从几何分布中抽取一个数字，执行该策略相应次数；探讨非马尔可夫GSPs上的策略改进 |\n| 
[我为什么要相信你，贝尔曼？贝尔曼误差不能替代价值误差](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12417.pdf) | ---- | ICML22 | 旨在通过理论分析和实证研究更好地理解贝尔曼误差与价值函数准确性之间的关系；指出贝尔曼误差不能很好地替代价值误差，包括(i) 贝尔曼误差的大小掩盖了偏差，(ii) 缺失的转移打破了贝尔曼方程 |\n| [马尔可夫决策过程的自适应模型设计](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fchen22ab\u002Fchen22ab.pdf) | ---- | ICML22 | 考虑正则化马尔可夫决策过程，并将其建模为双层问题 |\n| [稳定基于像素的离策略深度强化学习](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fcetin22a\u002Fcetin22a.pdf) | A-LIX | ICML22 | 指出带有卷积编码器和低幅度奖励的时序差分学习会导致不稳定，称为灾难性自我过拟合；建议对编码器的梯度进行自适应正则化，以明确防止灾难性自我过拟合的发生 |\n| [基于敏感性分析理解策略梯度算法](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fwu22i\u002Fwu22i.pdf) | ---- | ICML22 | 从扰动的角度研究PG |\n| [镜像学习：统一的策略优化框架](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02373.pdf) | 镜像学习 | ICML22 | 提出一种新颖的统一理论框架——镜像学习，为广义策略改进（GPI）和信赖域学习（TRL）提供理论保证；并从图论角度提出了对镜像学习的有趣见解 |\n| [基于演示的连续控制动作量化](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fdadashi22a\u002Fdadashi22a.pdf) | AQuaDem | ICML22 | 利用人类演示的先验知识，将连续动作空间简化为一组有意义的离散动作；指出使用一组动作而非单一动作（行为克隆）能够捕捉演示中行为的多模态性 |\n| [使用可微分函数近似的离策略拟合Q评估：Z估计与推断理论](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fzhang22al\u002Fzhang22al.pdf) | ---- | ICML22 | 使用Z估计理论分析使用一般可微分函数近似的拟合Q评估（FQE），包括基于神经网络的函数近似 |\n| [深度强化学习中的首因效应](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.07802.pdf) | 首因效应 | ICML22 | 发现深度RL智能体有过度拟合早期经验的风险，这会对后续学习过程产生负面影响；提出一种简单但普遍适用的机制，通过定期重置智能体的一部分来缓解首因效应 |\n| [利用深度强化学习优化序列实验设计](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00821.pdf) |  | ICML22 | 使用DRL解决序列实验的最佳设计问题 |\n| [鲁棒价值函数的几何结构](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fwang22k\u002Fwang22k.pdf) |  | ICML22 | 研究更一般的鲁棒MDP中鲁棒价值空间的几何结构 |\n| [马尔可夫序列决策中的效用理论](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.13637.pdf) | 仿射奖励MDP | ICML22 | 将冯·诺依曼-摩根斯特恩（VNM）效用定理扩展到决策场景 |\n| [通过深度网络集成降低时序差分价值估计的方差](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fliang22c\u002Fliang22c.pdf) | MeanQ | ICML22 | 考虑降低时序差分价值估计的方差；提出通过集成来估计目标值的MeanQ |\n| 统一策略优化的近似梯度更新 || ICML22 ||\n| [基于神经辐射场的强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01634.pdf) | NeRF-RL | NeurIPS22 | 提出训练一个编码器，将多张图像观测映射到描述场景中物体的潜在空间 |\n| [关于在无灾难性遗忘的情况下用强化学习与分布匹配微调语言模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00761.pdf) | ---- | NeurIPS22 | 探索奖励最大化（RM）与分布匹配（DM）之间的理论联系 |\n| [用较慢的在线网络加速深度强化学习](https:\u002F\u002Fassets.amazon.science\u002F31\u002Fca\u002F0c09418b4055a7536ced1b218d72\u002Ffaster-deep-reinforcement-learning-with-slower-online-network.pdf) | DQN Pro, Rainbow Pro | NeurIPS22 | 鼓励在线网络保持在目标网络附近 |\n| [强化学习的重生：复用先前计算以加速进展](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01626.pdf) | PVRL | NeurIPS22 | 专注于将任何智能体的强化学习成果迁移到其他智能体；将重生的强化学习作为一种替代工作流或一类问题设置，其中先前的计算工作（例如学到的策略）可以在强化学习智能体的设计迭代之间，或从一个智能体转移到另一个智能体 |\n| [突破回放比例障碍实现样本高效强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=OpC-9aBBVJe) | SR-SAC, SR-SPR | ICLR23 口头报告 | 表明完全或部分重置深度强化学习智能体的参数后，会出现更好的回放比例扩展能力 |\n| [利用不完美的在线演示进行受保护策略优化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.01728.pdf) | TS2C | ICLR23 Spotlight | 结合基于轨迹的价值估计进行教师干预 |\n| [迈向人机友好的原型驱动的可解释深度强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=hWwY_Jq0xsN) | PW-Net | ICLR23 Spotlight | 专注于打造“设计之初就可解释”的深度强化学习智能体，使其在决策时必须使用人机友好的原型，从而清晰展示其推理过程；训练一种名为PW-Net的“包装”模型，可添加到任何预训练智能体上，使其具备可解释性 |\n| [DEP-RL：用于过度驱动和肌肉骨骼系统的具身式强化学习探索](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00484.pdf) | DEP-RL | ICLR23 Spotlight | 引入来自自组织行为领域的DEP控制器，以生成比其他常用噪声过程更有效的探索；首次在肌肉刺激层面使用RL控制7自由度的人臂模型 |\n| [高效的深度强化学习需要调节统计过拟合](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.10466.pdf) | AVTD | ICLR23 | 提出一种简单的主动模型选择方法（AVTD），通过在验证TD误差上进行爬山搜索，自动选择正则化方案 |\n| 
[贪婪演员-评论家：一种用于策略改进的新条件交叉熵方法](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1810.09103.pdf) | CCEM, GreedyAC | ICLR23 | 提议根据学习到的动作价值对动作进行排名，然后迭代选取前百分之一的动作；利用CEM的理论验证CCEM能够随着时间的推移集中在各状态下价值最高的动作上 |\n| [利用语言模型进行奖励设计](https:\u002F\u002Fopenreview.net\u002Fpdf?id=10uNUgI5Kl) | ---- | ICLR23 | 探讨如何通过大型语言模型（LLM）如GPT-3作为代理奖励函数来简化奖励设计，用户只需提供包含少量示例（少次）或对期望行为的描述（零次）的文本提示 |\n| [通过Q学习解决连续控制问题](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12566.pdf) | DecQN | ICLR23 | 将价值分解与bang-bang动作空间离散化结合应用于DQN，以处理连续控制任务；在DMControl、Meta World和Isaac Gym上进行了评估 |\n| [瓦瑟斯坦自编码MDP：以多方保障高效蒸馏RL策略的形式验证](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.12558.pdf) | WAE-MDP | ICLR23 | 将执行原始策略的智能体行为与蒸馏策略行为之间的最优传输惩罚形式最小化 |\n| [人类水平Atari游戏速度提升200倍](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07550.pdf) | MEME | ICLR23 | 在3.9亿帧内超越所有57款Atari游戏的人类基准；四个关键组件：(1) 一种近似信赖域方法，可从在线网络稳定启动，(2) 一套用于损失和优先级的归一化方案，可在学习一系列尺度广泛的价值函数时提高鲁棒性，(3) 改进的架构，采用NFNets的技术，以便在无需归一化层的情况下使用更深的网络，(4) 一种策略蒸馏方法，可随时间平滑瞬时的贪婪策略。|\n| [通过价值函数搜索改进深度策略梯度](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.10145.pdf) | VFS | ICLR23 | 专注于改进价值近似，并分析其对深度PG原语的影响，如价值预测、方差减少和梯度估计与真实梯度的相关性；表明具有良好预测能力的价值函数可以改善深度PG原语，从而提高样本效率和策略回报 |\n| [记忆健身房：基于记忆的智能体面临的部分可观测挑战](https:\u002F\u002Fopenreview.net\u002Fpdf?id=jHc8dCx6DDr) | 智能体记忆健身房 | ICLR23 | 一个用于挑战深度强化学习智能体的记忆基准，要求其能够记住长序列中的事件、抵抗噪声干扰并进行泛化；由部分可观测的2D和离散控制环境组成，包括Mortar Mayhem、Mystery Path和Searing Spotlights；[\\[代码\\]](https:\u002F\u002Fgithub.com\u002FMarcoMeter\u002Fdrl-memory-gym\u002F) |\n| [混合RL：同时使用离线和在线数据可以使RL更高效](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06718.pdf) | Hy-Q | ICLR23 | 专注于一种名为混合RL的设置，在这种设置下，智能体既拥有离线数据集，又能够与环境互动；扩展拟合Q迭代算法 |\n| [POPGym：部分可观测强化学习的基准测试](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.01859.pdf) | POPGym | ICLR23 | 一个包含两部分的库：(1) 包含15种部分可观测环境的多样化集合，每种环境都有多种难度级别，(2) 实现了13种记忆模型基线；[\\[代码\\]](https:\u002F\u002Fgithub.com\u002Fproroklab\u002Fpopgym) |\n| [评论家顺序蒙特卡洛](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.15460.pdf) | CriticSMC | ICLR23 | 将顺序蒙特卡洛与学习到的Soft-Q函数启发因子结合起来 |\n| [面向规划的自动驾驶](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10156.pdf) || CVPR23 最佳论文 ||\n| [关于离策略强化学习中的重用偏见](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07074.pdf) | BIRIS | IJCAI23 | 讨论由于重用回放缓冲区而导致的离策略评估偏见；推导出重用偏见的高概率边界；引入离策略算法稳定性的概念，并给出稳定离策略算法的上限 |\n| [深度强化学习中的休眠神经元现象](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12902.pdf) | ReDo | ICML23 口头报告 | 研究RL智能体在训练过程中表达力下降的根本原因；证明深度RL中存在休眠神经元现象；提出“回收休眠神经元”（ReDo）以减少休眠神经元数量，并在训练过程中维持网络的表达力 |\n| [通过解耦环境和智能体表征实现高效RL](https:\u002F\u002Fopenreview.net\u002Fpdf?id=kWS8mpioS9) | SEAR | ICML23 口头报告 | 考虑构建一种能够将机器人智能体与其环境解耦的表征，以提高RL的学习效率；通过以智能体为中心的辅助损失来增强RL损失 |\n| [关于时序差分学习的统计益处](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13289.pdf) | ---- | ICML23 口头报告 | 对TD的统计益处进行了清晰阐述 |\n| [解决奖励假说](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10420.pdf) | ---- | ICML23 口头报告 | 从目标是智能体主观愿望和目标是智能体设计师客观愿望两种情境分别探讨奖励假说 |\n| [为部分可观测的深度RL学习信念表征](https:\u002F\u002Fopenreview.net\u002Fpdf?id=4IzEmHLono) | Believer | ICML23 | 将信念状态建模（通过无监督学习）与策略优化（通过RL）分离；提出一种表征学习方法，以捕捉状态中与奖励相关的紧凑特征集 |\n| [内部奖励强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00270.pdf) | IRRL | ICML23 | 研究一类强化学习问题，其中用于策略学习的奖励信号是由依赖于策略并与之联合优化的内部奖励模型产生的；从理论和实践上分析IRRL中奖励函数的影响，并基于这些分析提出剪切线性奖励函数 |\n| [强化学习中的超参数及其调优方法](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.01324.pdf) | ---- | ICML23 | 探索常用RL算法和环境的超参数景观；比较不同类型HPO方法在最先进的RL算法和具有挑战性的RL环境中表现 |\n| 朗之万汤普森采样与对数通信：多臂老虎机与强化学习 || ICML23 ||\n| [纠正在策略策略梯度方法中的折扣因子不匹配](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.13284.pdf) | ---- | ICML23 | 引入一种新的分布校正，以考虑折现后的平稳分布 |\n| 
[强化学习若采用多重奖励可更高效](https:\u002F\u002Fopenreview.net\u002Fpdf?id=skDVsmXjPR) | ---- | ICML23 | 理论分析动作消除算法的多奖励扩展版本，并证明与单奖励版本相比，在多臂老虎机和表格马尔可夫决策过程中，其实例依赖型遗憾边界更为有利 |\n| [表演性强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00046.pdf) | ---- | ICML23 | 引入表演性强化学习框架，其中学习者选择的策略会影响环境的基础奖励和转移动态 |\n| [具有历史依赖动态上下文的强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02061.pdf) | DCMDPs | ICML23 | 引入DCMDPs，一种针对历史依赖环境的新型强化学习框架，可处理非马尔可夫环境，其中上下文会随时间变化；为逻辑DCMDPs推导出类似上置信区间风格的算法 |\n| [关于多动作策略梯度](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13011.pdf) | MBMA | ICML23 | 提出MBMA，一种利用动力学模型在随机策略梯度（SPG）背景下进行多动作采样的方法，其偏差低于从模型模拟滚动中估算的SPG，而方差则与之相当 |\n| [奖励模型过度优化的缩放规律](https:\u002F\u002Fopenreview.net\u002Fattachment?id=bBLjms8nZE&name=pdf) | ---- | ICML23 | 研究针对奖励模型微调大型语言模型时的过度优化问题，这些奖励模型被训练用来预测人类会偏好两个选项中的哪一个；研究当使用强化学习或最佳n次抽样方法对抗代理奖励模型进行优化时，黄金奖励模型得分如何变化 |\n| [更大、更好、更快：人类水平Atari游戏，同时具备人类水平效率](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.19452.pdf) | BBF | ICML23 | 依靠扩大用于价值估计的神经网络规模以及其他一些设计选择，如重置等 |\n| [合成经验回放](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.06614.pdf) | SynthER | NeurIPS23 | 利用扩散技术增强回放缓冲区的数据；在在线RL和离线RL中均进行了评估 |\n| [OMPO：一种应对策略和动力学变化的统一RL框架](https:\u002F\u002Fopenreview.net\u002Fattachment?id=R83VIZtHXA&name=pdf) | OMPO | ICML24 口头报告 | 考虑由策略或动力学变化引起的分布差异；提出通过考虑转移占用率差异来设定一个替代策略学习目标，然后通过双重重构将其转化为一个易于处理的最小-最大优化问题 |
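\n\n上表提到的 REDQ（ICLR21）通过“维护Q函数集成，并在计算目标值时从集成的随机子集中取最小”来抑制过估计。下面给出一个极简的目标值计算示意（假设性代码，仅为说明思路，函数名与超参数均为示例，并非论文官方实现）：\n\n```python\nimport numpy as np\n\ndef redq_target(q_values, reward, done, gamma=0.99, subset_size=2, rng=None):\n    '''REDQ风格的目标值示意：q_values 形状为 (N,)，是 N 个目标Q网络对同一 (s', a') 的估计。'''\n    rng = rng or np.random.default_rng()\n    # 要素(iii)：从集成中随机抽取 subset_size 个成员并取最小，以缓解过估计\n    idx = rng.choice(len(q_values), size=subset_size, replace=False)\n    min_q = float(np.min(q_values[idx]))\n    return reward + gamma * (1.0 - done) * min_q\n```\n\n实际算法还包括要素(i)（每个epoch多次更新，即高UTD比例）与要素(ii)（训练完整的Q函数集成），此处仅演示目标值部分。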
\n\n\u003Ca id='Model-Based-Online'>\u003C\u002Fa>\n\n\n## 基于模型的（在线）强化学习\n\n\u003Ca id='model-based-classic'>\u003C\u002Fa>\n\n### 经典方法\n| 标题 | 方法 | 会议 | 描述 |\n| ----  | ----   | ----       |   ----  |\n| [基于模型的强化学习中的值感知损失函数](http:\u002F\u002Fproceedings.mlr.press\u002Fv54\u002Ffarahmand17a\u002Ffarahmand17a-supp.pdf) | VAML | AISTATS17 | 提出使用TD误差之差而非KL散度来训练模型 |\n| [模型集成信任区域策略优化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1802.10592.pdf) | ME-TRPO | ICLR18 | 分析使用深度神经网络的普通MBRL方法的行为；提出ME-TRPO，包含两个思想：(i) 使用模型集成，(ii) 使用似然比导数；与无模型方法相比显著降低样本复杂度 |\n| [用于高效无模型强化学习的基于模型的价值扩展](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1803.00101.pdf) | MVE | ICML18 | 使用动力学模型模拟短期 horizon，并用Q-learning估计超出模拟 horizon 的长期价值；利用训练好的模型和策略估计k步价值函数以更新价值函数 |\n| [迭代式值感知模型学习](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper\u002F2018\u002Ffile\u002F7a2347d96752880e3d58d72e9813cc14-Paper.pdf) | IterVAML | NeurIPS18 | 用当前价值函数的估计值替换VAML中的上确界 |\n| [基于随机集成价值扩展的高效强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1807.01675.pdf) | STEVE | NeurIPS18 | MVE的扩展；仅利用roll-out而不会引入显著误差 |\n| [使用概率动力学模型在少量试验中实现深度强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1805.12114.pdf) | PETS | NeurIPS18 | 提出PETS，通过自助模型集成来纳入不确定性 |\n| [具有理论保证的基于模型深度强化学习算法框架](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1807.03858.pdf) | SLBO | ICLR19 | 提出一种新的算法框架，用于设计和分析具有理论保证的基于模型RL算法：提供满足某些性质的真实回报下界，使得优化该下界实际上可以优化真实回报 |\n| [何时信任你的模型：基于模型的策略优化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1906.08253.pdf) | MBPO | NeurIPS19 | 提出具有单调改进保证的基于模型算法MBPO；从理论上讨论如何选择模型rollout的k值 |\n| [Atari游戏的基于模型强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1903.00374.pdf) | SimPLe | ICLR20 | 首次成功使用基于模型的方法处理ALE基准测试，并采用以下设计：(i) 确定性模型；(ii) 设计良好的损失函数；(iii) 调度采样；(iv) 随机模型 |\n| [双向基于模型的策略优化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2007.01995.pdf) | BMPO | ICML20 | MBPO的扩展；同时考虑前向和逆向动力学模型 |\n| [面向基于模型强化学习中泛化的上下文感知动力学模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2005.06800.pdf) | CaDM | ICML20 | 开发了一种能够跨具有不同转移动态的环境分布进行泛化的上下文感知动力学模型（CaDM）；引入一种逆向动力学模型，可通过利用上下文潜在向量预测先前状态 |\n| [基于模型强化学习的游戏理论框架](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2004.07804.pdf) | PAL, MAL | ICML20 | 开发了一种新颖的框架，将MBRL视为策略玩家与模型玩家之间的博弈；在两者之间设置斯塔克尔伯格博弈 |\n| [通过自监督世界模型规划探索](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2005.05960.pdf) | Plan2Explore | ICML20 | 提出一种自监督强化学习智能体，以应对快速适应和预期未来新奇性两大挑战 |\n| [当模型自信时就相信它：掩码式基于模型的演员-评论家](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2010.04893.pdf) | M2AC | NeurIPS20 | MBPO的扩展；仅在模型自信时才使用模型rollout |\n| [LoCA遗憾：评估强化学习中基于模型行为的一致性指标](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2007.03158.pdf) | LoCA | NeurIPS20 | 提出LoCA来衡量方法在环境从第一个任务切换到第二个任务后调整策略的速度 |\n| [用于无限 horizon 预测的生成时序差分学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2010.14496.pdf) | GHM，或gamma模型 | NeurIPS20 | 提出gamma模型，无需反复应用单步模型即可进行长horizon预测 |\n| [模型、像素和奖励：视觉基于模型强化学习中设计权衡的评估](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2012.04603.pdf) | ---- | arXiv2012 | 研究视觉MBRL算法中预测模型的若干设计决策，特别关注那些使用预测模型进行规划的方法 |\n| [用有限数据掌握Atari游戏](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.00210.pdf) | EfficientZero | NeurIPS21 | 首次在有限数据条件下实现Atari游戏的超人水平表现；提出EfficientZero，包含三个组成部分：(i) 使用自监督学习来学习时间一致的环境模型，(ii) 以端到端方式学习价值前缀，(iii) 使用学习到的模型来修正离策略价值目标 |\n| [关于基于模型强化学习的有效调度](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.08550.pdf) | AutoMBPO | NeurIPS21 | MBPO的扩展；自动调度真实数据比例以及其他MBPO的超参数 |\n| [基于模型强化学习中的模型优势与值感知模型：弥合理论与实践之间的鸿沟](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.14080.pdf) | ---- | arxiv22 | 弥合基于模型RL中值感知模型学习（VAML）的理论与实践差距 |\n| [基于价值梯度加权的模型化强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01464.pdf) | VaGraM | ICLR22 Spotlight | 考虑MBRL中的目标不匹配问题；通过用当前价值函数估计的梯度信息对MSE损失函数进行重新缩放，提出VaGraM |\n| [通过贝叶斯世界模型进行约束策略优化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.09802.pdf) | LAMBDA | ICLR22 Spotlight | 考虑CMDP中的贝叶斯基于模型方法 |\n| [强化学习中的在线策略模型误差](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.07985.pdf) | OPC | ICLR22 | 考虑将真实世界数据与学习到的模型相结合，以兼得两者的优点；建议利用真实世界数据进行在线策略预测，而仅使用学习到的模型来泛化到不同的动作；提议在单独学习的模型基础上使用在线策略转移数据，以实现MBRL的准确长期预测 |\n| [用于模型预测控制的时间差分学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04955.pdf) | TD-MPC | ICML22 | 提出仅使用模型来预测奖励；用策略加速规划过程 |\n| [用于任务无关状态抽象的因果动力学学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.13452.pdf) |  | ICML22 |  |\n| [不再不匹配：基于模型RL的联合模型-策略优化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.02758.pdf) | MnM | NeurIPS22 | 提出一种基于模型RL算法，其中模型和策略针对同一目标进行联合优化，该目标是真实环境动态下预期回报的下界，并在特定假设下变得紧致 |\n| [非指数贴现下的强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13413.pdf) | ---- | NeurIPS22 | 提出一种适用于任意贴现函数的连续时间基于模型强化学习理论；推导出刻画最优策略的汉密尔顿-雅可比-贝尔曼方程，并描述如何使用配点法求解 |\n| [简化基于模型RL：用一个目标学习表征、隐空间模型和策略](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08466.pdf) | ALM | ICLR23 | 提出单一目标，即使用相同的目标联合优化策略、隐空间模型和编码器产生的表征：最大化预测奖励，同时最小化预测表征中的误差 |\n| [SpeedyZero：用有限的数据和时间掌握Atari](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Mg5CLXZgvLJ) | SpeedyZero | ICLR23 | 基于EfficientZero构建的分布式RL系统，结合优先刷新和截断LARS；仅用30万次采样，在35分钟内达到Atari基准测试的人类水平表现 |\n| 探讨基于模型学习在探索和迁移中的作用 || ICML23 ||\n| [STEERING：基于模型强化学习的斯坦因信息导向探索](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12038.pdf) | STEERING | ICML23 |  |\n| [用于无监督基于模型RL的可预测MDP抽象](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03921.pdf) | PMA | ICML23 | 通过限制不可预测的动作，在抽象简化后的MDP之上应用基于模型RL |\n| 基于模型RL中的懒惰之美：统一目标和算法 || ICML23 ||\n| [停止回归：通过分类训练价值函数以实现可扩展的深度RL](https:\u002F\u002Fopenreview.net\u002Fattachment?id=dVpFKfqF3R&name=pdf) | HL-Gauss | ICML24口头报告 | 表明使用分类交叉熵训练价值函数可显著提升性能和可扩展性，涵盖多个领域，包括Atari 2600游戏的单任务RL、使用大型ResNet的Atari多任务RL、使用Q-transformer的机器人操作、无需搜索的国际象棋对弈，以及使用高容量Transformer的语言代理Wordle任务，在这些领域均取得了最先进的结果 |\n| [在模型自信的地方就信任它：具有不确定性感知rollout自适应的基于模型演员-评论家](https:\u002F\u002Fopenreview.net\u002Fattachment?id=N0ntTjTfHb&name=pdf) | MACURA | ICML24 | 提出一种易于调优的基于模型rollout长度调度机制 |\n\n\u003Ca id='dreamer'>\u003C\u002Fa>\n\n\n### 世界模型\n\n| 标题 | 方法 | 会议 | 描述 |\n| ----  | ----   | ----       |   ----  |\n| [世界模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1803.10122.pdf), 
[\\[NeurIPS版本\\]](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper\u002F2018\u002Ffile\u002F2de5d16682c3c35007e4e92982f1a2ba-Paper.pdf) | 世界模型 | NeurIPS18 | 使用无监督方式学习环境的压缩时空表征，并利用世界模型训练一个非常紧凑且简单的策略来解决目标任务 |\n| [从像素中学习潜在动力学以进行规划](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1811.04551.pdf) | PlaNet | ICML19 | 提出PlaNet，从图像中学习环境动力学；该动力学模型由转移模型、观测模型、奖励模型和编码器组成；采用交叉熵方法选择动作以进行规划 |\n| [从梦想中控制：通过潜在想象学习行为](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1912.01603.pdf) | Dreamer | ICLR20 | 完全依靠潜在想象从图像中解决长 horizon 任务；在基于图像的MuJoCo环境中测试；提出用智能体替代PlaNet中的控制算法 |\n| [为基于模型的深度强化学习架起想象与现实之间的桥梁](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2010.12142.pdf) | BIRD | NeurIPS20 | 提出最大化想象轨迹与真实轨迹之间的互信息，从而使从想象轨迹中学到的策略改进能够更容易地泛化到真实轨迹上 |\n| [通过自监督的世界模型进行探索式规划](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2005.05960.pdf) | Plan2Explore | ICML20 | 提出Plan2Explore，用于自监督的探索和快速适应新任务 |\n| [使用离散世界模型掌握Atari游戏](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2010.02193.pdf) | Dreamerv2 | ICLR21 | 完全依靠潜在想象从图像中解决长 horizon 任务；在基于图像的Atari游戏中测试 |\n| [用于潜在空间中基于模型规划的时序预测编码](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.07156.pdf) | TPC | ICML21 | 提出一种基于时序预测编码的方法，用于从高维观测中进行规划，并从理论上分析其优先编码任务相关信息的能力 |\n| [学习任务感知抽象](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.15612.pdf) | TIA | ICML21 | 引入任务感知MDP（TiMDP）的形式化框架，通过训练两个通过协同重建学习视觉特征的模型来实现，但其中一个模型被对抗性地与奖励信号分离 |\n| [Dreaming：无需重建的基于潜在想象的模型强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2007.14535.pdf) | Dreaming | ICRA21 | 提出Dreamer的无解码器扩展版本，因为基于自动编码的方法常常会导致物体消失 |\n| [通过衍生记忆进行想象的基于模型的强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=jeATherHHGj) | IDM | NeurIPS21 | 希望通过衍生记忆提高基于模型策略优化的想象力多样性；指出当前方法在潜在状态受到随机噪声干扰时无法有效丰富想象力 |\n| [最大熵基于模型的强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.01195.pdf) | MaxEnt Dreamer | NeurIPS21 | 将探索方法与基于模型的强化学习联系起来；将最大熵探索应用于Dreamer |\n| [通过世界模型发现并达成目标](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.09514.pdf) | LEXA | NeurIPS21 | 通过世界模型中的想象回放无监督地训练探索者和达成者策略；无监督阶段结束后，无需任何额外学习即可零样本地解决以目标图像指定的任务 |\n| [TransDreamer：使用Transformer世界模型的强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.09481.pdf) | TransDreamer | arxiv2202 | 用Transformer替换RSSM中的RNN |\n| [DreamerPro：无重建的基于原型表示的模型强化学习](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fdeng22a\u002Fdeng22a.pdf) | DreamerPro | ICML22 | 考虑无重建的MBRL；提出从世界模型的循环状态中学习原型，从而将过去观测和动作中的时序结构提炼到原型中。|\n| [迈向评估基于模型强化学习方法的适应性](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fwan22d\u002Fwan22d.pdf) | ---- | ICML22 | 引入LoCA设置的改进版本，并用其评估PlaNet和Dreamerv2 |\n| [通过视频进行无动作预训练的强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13880.pdf) | APV | ICML22 | 使用来自不同领域的视频预训练一个无动作的潜在视频预测模型，然后在目标领域对预训练模型进行微调 |\n| [去噪MDP：学习比世界本身更好的世界模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.15477.pdf) | Denoised MDP | ICML22 | 将信息分为四类：可控\u002F不可控（是否受动作影响）以及与奖励相关\u002F无关（是否影响回报）；建议仅考虑可控且与奖励相关的信息 |\n| [DreamingV2：无需重建的离散世界模型强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.00494.pdf) | Dreamingv2 | arxiv2203 | 同时采用DreamerV2的离散表征和Dreaming的无重建目标 |\n| [用于视觉控制的掩码世界模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.14244.pdf) | MWM | arxiv2206 | 将视觉表征学习与动力学学习解耦，用于基于视觉模型的强化学习，并使用掩码自编码器训练视觉表征 |\n| [DayDreamer：用于物理机器人学习的世界模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.14176.pdf) | DayDreamer | arxiv2206 | 将Dreamer应用于4个机器人，在真实世界中直接在线学习，无需任何模拟器 |\n| [Iso-Dream：在世界模型中隔离不可控的视觉动力学](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13817.pdf) | Iso-Dream | NeurIPS22 | 将独立于动作信号的不可控动力学单独考虑；鼓励世界模型在隔离的状态转移分支上学习可控和不可控的时空变化来源；优化智能体在世界模型解耦的潜在想象上的行为 |\n| [通过少量无奖励部署学习通用世界模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12719.pdf) | CASCADE | NeurIPS22 | 
引入无奖励部署效率设定，以促进泛化（探索应与任务无关）和可扩展性（探索策略应在不进行昂贵集中再训练的情况下收集大量数据）；提出一种受贝叶斯主动学习启发的信息论目标，通过新颖的级联目标专门最大化群体采样的轨迹多样性 |\n| [通过变分稀疏门控学习鲁棒动力学](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11698.pdf) | VSG、SVSG、BBS | NeurIPS22 | 考虑每一步稀疏更新潜在状态；开发了一种新的部分可观测且随机的环境，称为BringBackShapes（BBS） |\n| [Transformer是样本高效的世界模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00588.pdf) | IRIS | ICLR23口头报告 | 使用离散自编码器和自回归Transformer进行世界模型训练，显著提高了Atari游戏中的数据效率（2小时实时经验）；[\\[代码\\]](https:\u002F\u002Fgithub.com\u002Feloialonso\u002Firis) |\n| [基于Transformer的世界模型只需10万次交互即可](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.07109.pdf) | TWM | ICLR23 | 展示了一种基于Transformer-XL的新自回归世界模型；在Atari 10万次基准测试中取得了优异的成绩；[\\[代码\\]](https:\u002F\u002Fgithub.com\u002Fjrobine\u002Ftwm) |\n| [动态更新-数据比例：最小化世界模型过拟合](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.10144.pdf) | DUTD | ICLR23 | 提出一种新的通用方法，在训练过程中根据对持续收集但未用于训练的小样本子集的欠拟合和过拟合检测，动态调整更新-数据（UTD）比例；并将该方法应用于DreamerV2 |\n| [在3D迷宫中评估长期记忆](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13383.pdf) | Memory Maze | ICLR23 | 引入Memory Maze，这是一个专为评估智能体长期记忆而设计的随机迷宫3D领域，包括在线强化学习基准、多样化的离线数据集以及离线探针评估；[\\[代码\\]](https:\u002F\u002Fgithub.com\u002Fjurgisp\u002Fmemory-maze) |\n| [通过世界模型掌握多样化领域](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.04104.pdf) | DreamerV3 | arxiv2301 | 提出DreamerV3，用于处理广泛的领域，包括连续和离散动作、视觉和低维输入、2D和3D世界、不同的数据预算、奖励频率和奖励规模 |\n| [面向强化学习中任务泛化的任务感知Dreamer](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.05092.pdf) | TAD | arXiv2303 | 提出任务分布相关性，以定量捕捉任务分布的相关性；建议使用世界模型通过将奖励信号编码到策略中来提升任务泛化能力 |\n| [用于多模态轨迹优化的重参数化策略学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.10710.pdf) | RPG | ICML23口头报告 | 提出一种原则性的框架，将连续RL策略建模为最优轨迹的生成模型；介绍RPG，利用多模态策略参数化和已学习的世界模型，实现强大的探索能力和高数据效率 |\n| [从像素中掌握无监督强化学习基准](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12016.pdf) | Dyna-MPC | ICML23口头报告 | 利用无监督的基于模型的强化学习对智能体进行预训练；通过结合混合规划器Dyna-MPC的任务感知微调策略对下游任务进行微调 |\n| [用于深度强化学习的后验采样](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.00477.pdf) | PSDRL | ICML23 | 将对潜在状态空间模型的有效不确定性量化与基于价值函数近似的定制连续规划算法相结合 |\n| [具有可扩展复合策略梯度估计器的基于模型的强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=rDMAJECBM2) | TPX | ICML23 | 提出Total Propagation X，这是首个采用逆方差加权的复合梯度估计算法，已被证明可在大规模应用；将TPX与Dreamer结合使用 |\n| [超越想象：利用世界模型最大化情节可达性](https:\u002F\u002Fopenreview.net\u002Fpdf?id=JsAMuzA9o2) | GoBI | ICML23 | 将传统的终身新颖性动机与旨在最大化逐步可达性扩展的情节内在奖励相结合；利用已学习的世界模型生成随机行动下的预测未来状态 |\n| [简化的时序一致性强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.09466.pdf) | TCRL | ICML23 | 提出一种简单的表征学习方法，仅依赖于通过潜在时序一致性训练的潜在动力学模型，即可实现高性能的强化学习 |\n| [具身智能体是否会梦见像素化的羊：使用语言引导的世界建模进行具身决策](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12050.pdf) | DECKARD | ICML23 | 通过少量提示调用LLM，假设存在一个关于子目标的抽象世界模型（AWM） |\n| 无需演示的自主强化学习：通过隐式和双向课程 || ICML23 ||\n| [用于基于模型适应的奇趣回放](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.15934.pdf) | CR | ICML23 | 通过优先回放智能体最不了解的经验来帮助基于模型的RL智能体适应 |\n| [用于视觉机器人操作的多视角掩码世界模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02408.pdf) | MV-MWM | ICML23 | 训练一个多视角掩码自编码器，该自编码器可以重建随机遮挡视角的像素，然后基于自编码器的表征学习世界模型 |\n| [世界模型骨干对决：RNN、Transformer和S4](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.02064.pdf) | S4WM | NeurIPS23 | 提出首个基于S4的世界模型，可通过潜在想象生成高维图像序列 |\n\n\u003Ca id='model-based-code'>\u003C\u002Fa>\n\n\n### 代码库\n\n| 标题 | 会议 | 方法 | GitHub |\n| ---- | ---- | ---- | ---- |\n| [MBRL-Lib: 基于模型的强化学习模块化库](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2104.10159.pdf) | arxiv21 | MBPO, PETS, PlaNet | [链接](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fmbrl-lib) |\n\n\n\n\u003Ca id='Model-Free-Offline'>\u003C\u002Fa>\n## （无模型）离线强化学习\n\n\u003Ca id='offline-current'>\u003C\u002Fa>\n### 当前方法\n\n| 标题 | 方法 
| 会议 | 描述 |\n| ----  | ----   | ----       |   ----  |\n| [无需探索的离线深度强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1812.02900.pdf) | BCQ | ICML19 | 表明标准的离策略方法在离线设定下会因外推误差表现不佳；提出批约束强化学习：在最大化回报的同时，最小化策略的状态-动作访问分布与批次中包含的状态-动作对之间的不匹配 |\n| [用于离线强化学习的保守Q学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2006.04779.pdf) | CQL | NeurIPS20 | 提出使用保守Q函数的CQL，该Q函数是其真实值的下界，因为标准的离线方法会高估价值函数 |\n| [离线强化学习：教程、综述及开放问题展望](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2005.01643.pdf) | ---- | arxiv20 | 关于离线RL的方法、应用和开放问题的教程 |\n| [基于不确定性的离线强化学习与多样化Q集合](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.01548.pdf) |  | NeurIPS21 |  |\n| [离线强化学习的极简主义方法](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.06860.pdf) | TD3+BC | NeurIPS21 | 提出添加行为克隆项以正则化策略，并对数据集中的状态进行归一化 |\n| [DR3：基于价值的深度强化学习需要显式正则化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04716.pdf) | DR3 | ICLR22 Spotlight | 考虑SGD在RL中的隐式正则化作用；基于理论分析，提出一种称为DR3的显式正则化器，并将其与离线RL方法结合 |\n| [用于不确定性驱动的离线强化学习的悲观自举法](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.11566.pdf) | PBRL | ICLR22 Spotlight | 考虑离线RL中的分布偏移和外推误差；提出带有自举的PBRL，用于不确定性量化，并采用OOD采样方法作为正则化手段 |\n| [COptiDICE：通过稳态分布修正估计进行离线约束强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=FLA55mBee6Q) | COptiDICE | ICLR22 Spotlight | 考虑离线约束强化学习；提出COptiDICE直接优化受约束的状态-动作分布 |\n| [基于价值的 episodic memory 的离线强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=RCZqv9NXlZ) | EVL, VEM | ICLR22 | 提出一种新的离线V-learning方法，通过模仿学习与最优价值学习之间的权衡来学习价值函数；使用基于记忆的规划方案来增强优势估计，并以回归方式执行策略学习 |\n| [基于隐式Q学习的离线强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.06169.pdf) | IQL | ICLR22 | 提出仅通过样本内学习来学习最优策略，而无需查询任何未见动作的价值 |\n| [离线RL策略应被训练为具有适应性](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02200.pdf) | APE-V | ICML22口头报告 | 表明从离线数据集中学习并不能完全指定环境；利用贝叶斯形式化正式证明了离线RL中适应性的必要性，并提供了一种学习最优适应性策略的实用算法；提出一种基于集成的离线RL算法，使策略具备在单个episode内适应的能力 |\n| [当数据几何遇上深度函数：泛化离线强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.11027.pdf) | DOGE | ICLR23 | 训练一个状态条件下的距离函数，可直接插入到标准的actor-critic方法中作为策略约束 |\n| [跳跃式启动强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.02372.pdf) | JSRL | ICML23 | 考虑使用两种策略解决问题的设置：引导策略和探索策略；通过逐步“滚动”引入引导策略来启动RL算法 |
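\n\n上表中 TD3+BC（NeurIPS21）的描述可以浓缩为一行损失：在最大化Q值的同时，加入行为克隆项把策略拉向数据集中的动作。下面是一个示意性的策略损失（假设性代码，λ 的取法沿用论文思路但细节从略，critic 与 policy 为任意可调用网络）：\n\n```python\nimport torch\n\ndef td3_bc_policy_loss(critic, policy, state, action, alpha=2.5):\n    '''TD3+BC风格的策略损失示意：state 与 action 来自离线数据批次。'''\n    pi = policy(state)\n    q = critic(state, pi)\n    # 把Q项缩放到与行为克隆项相当的量级（论文做法的简化版）\n    lam = alpha \u002F q.abs().mean().detach()\n    # 最大化Q，同时用均方误差正则把策略约束在数据动作附近（状态归一化从略）\n    return -(lam * q).mean() + ((pi - action) ** 2).mean()\n```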
\n\n\n\u003Ca id='offline-diffusion'>\u003C\u002Fa>\n### 与扩散模型结合\n\n| 标题 | 方法 | 会议 | 描述 |\n| ----  | ----   | ----       |   ----  |\n| [用于灵活行为合成的扩散规划](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09991.pdf) | Diffuser | ICML22口头报告 | 首次提出专为轨迹数据设计的去噪扩散模型以及相关的概率框架用于行为合成；证明Diffuser具有一系列有用特性，尤其适用于需要长 horizon 推理和测试时灵活性的离线控制场景 |\n| 条件生成建模是否足以支持决策？ || ICLR23口头报告 ||\n| [扩散策略作为离线强化学习中富有表现力的策略类别](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.06193.pdf) | Diffusion-QL | ICLR23 | 使用扩散（或基于分数）模型进行策略正则化；利用条件扩散模型来表示策略 |\n| [通过高保真生成式行为建模进行离线强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14548.pdf) | SfBC | ICLR23 | 将学习到的策略解耦为两部分：一个富有表现力的生成式行为模型和一个动作评估模型 |\n| [AdaptDiffuser：扩散模型作为自进化规划者](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01877.pdf) | AdaptDiffuser | ICML23口头报告 | 提出AdaptDiffuser，一种基于扩散的进化规划方法，能够自我进化以改进扩散模型，从而成为更好的规划者，同时也能适应未见过的任务 |\n| [用于离线到在线强化学习的能量引导扩散采样](https:\u002F\u002Fopenreview.net\u002Fattachment?id=hunSEjeCPE&name=pdf) | EDIS | ICML24 | 利用扩散模型从离线数据集中提取先验知识，并借助能量函数提炼这些知识，以在在线阶段实现更高质量的数据生成；制定三种不同的能量函数来指导扩散采样过程，以实现分布对齐 |\n| [DIDI：扩散引导的离线行为多样性](https:\u002F\u002Fopenreview.net\u002Fattachment?id=8296yUBoXr&name=pdf) | DIDI | ICML24 | 提出从混合的无标签离线数据中学习多样化的技能 |\n\n\n\u003Ca id='Model-Based-Offline'>\u003C\u002Fa>\n\n## 基于模型的离线强化学习\n\n| 标题 | 方法 | 会议 | 描述 |\n| ----  | ----   | ----       |   ----  |\n| [基于模型的离线优化实现部署高效的强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2006.03647.pdf) | BREMEN | ICLR20 | 提出部署效率的概念，用于统计学习过程中数据收集策略的变化次数（离线：1次，在线：无限制）；提出使用动力学模型集成的BREMEN算法，用于离线和异策略强化学习 |\n| [MOPO：基于模型的离线策略优化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2005.13239.pdf) | MOPO | NeurIPS20 | 观察到在离线设定下，现有的基于模型的强化学习算法已优于无模型强化学习算法；通过在不确定性惩罚的MDP上扩展MBPO来设计MOPO（新奖励 = 奖励 - 不确定性） |\n| [MOReL：基于模型的离线强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2005.05951.pdf) | MOReL | NeurIPS20 | 提出用于基于模型的离线强化学习的MOReL方法，包括两个步骤：(a) 学习一个悲观MDP，(b) 在这个P-MDP中学习近似最优策略 |\n| [基于模型的离线规划](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2008.05556.pdf) | MBOP | ICLR21 | 学习用于规划的模型 |\n| [基于表示平衡的离线基于模型的强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=QpNz8r_Ri2Y) | RepB-SDE | ICLR21 | 专注于在分布偏移条件下学习环境的鲁棒表示，并扩展RepBM以应对“horizon curse”问题；提出RepB-SDE框架，用于异策略评估和离线强化学习 |\n| [用于有效离线基于模型优化的保守目标模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.06882.pdf) | COMs | ICML21 | 考虑离线基于模型的优化问题（MBO，仅利用少量样本优化未知函数）；在目标函数中加入正则项（类似于对抗训练方法），以学习保守的目标模型 |\n| [COMBO：保守的离线基于模型策略优化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.08363v1.pdf) | COMBO | NeurIPS21 | 尝试在不考虑不确定性量化的情况下优化性能下界；将CQL与基于模型的方法相结合 |\n| [用于离线基于模型强化学习的加权模型估计](https:\u002F\u002Fopenreview.net\u002Fpdf?id=zdC5eXljMPy) | ---- | NeurIPS21 | 通过为不同数据点重新加权模型损失来解决协变量偏移问题 |\n| [重新审视基于模型的离线强化学习中的设计选择](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.04135.pdf) | ---- | ICLR22 Spotlight | 对一系列基于模型的离线强化学习的设计选择进行了严谨的调查研究 |\n| [利用扩散过程进行灵活的行为合成规划](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09991.pdf) | Diffuser | ICML22 口头报告 | 首先为轨迹数据设计了一个去噪扩散模型，并构建了相应的行为合成概率框架 |\n| [无需在线实验即可学习时间抽象的世界模型](https:\u002F\u002Fopenreview.net\u002Fpdf?id=YeTYJz7th5) | OPOSM | ICML23 | 提出一种仅从离线数据中同时学习技能集合和时间抽象、受技能条件约束的世界模型的方法，使智能体能够针对新任务进行零样本在线技能序列规划 |
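\n\n上表中 MOPO（NeurIPS20）的核心可以写成一行：新奖励 = 奖励 - λ·不确定性，其中不确定性常用动力学模型集成预测间的分歧来近似。下面给出一个示意（假设性代码，集成接口与惩罚形式均为常见实现之一，并非论文官方代码）：\n\n```python\nimport numpy as np\n\ndef mopo_reward(reward, next_state_preds, lam=1.0):\n    '''MOPO风格的惩罚奖励示意：next_state_preds 形状为 (E, D)，是 E 个动力学模型对下一状态的预测。'''\n    # 用集成预测间标准差的最大分量近似模型不确定性\n    uncertainty = float(np.std(next_state_preds, axis=0).max())\n    return reward - lam * uncertainty\n```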
\n\n\n\u003Ca id='Meta-RL'>\u003C\u002Fa>\n## 元强化学习\n\n| 标题 | 方法 | 会议 | 描述 |\n| ----  | ----   | ----       |   ----  |\n| [RL2：通过慢速强化学习实现快速强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1611.02779.pdf) | RL2 | arxiv16 | 将智能体自身的学习过程视为目标；将智能体构建为循环神经网络，以存储过去的奖励、动作、观测和终止标志，从而在部署时适应当前任务 |\n| [用于深度网络快速适应的模型无关元学习](https:\u002F\u002Fwww.cs.utexas.edu\u002Fusers\u002Fsniekum\u002Fclasses\u002FRL-F17\u002Fpapers\u002FMeta.pdf) | MAML | ICML17 | 提出一个适用于分类、回归和强化学习等不同学习问题的通用框架；核心思想是优化参数，使其能够快速适应新任务（只需几步梯度下降） |\n| [基于潜在变量高斯过程的元强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1803.07551.pdf) | ---- | arxiv18 |  |\n| [通过元强化学习在动态的真实环境中学习适应](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1803.11347.pdf) | ReBAL, GrBAL | ICLR19 | 在基于模型的强化学习框架下考虑在线适应性学习 |\n| [基于扩展PAC-Bayes理论调整先验的元学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1711.01244.pdf) | ---- | ICML18 | 将各种PAC-Bayes界推广到元学习领域 |\n| [结构化探索策略的元强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1802.07245.pdf) |  | NeurIPS18 |  |\n| [用于序列决策的元学习代理模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1903.11907.pdf) |  | arxiv19 |  |\n| [基于概率上下文变量的高效离策略元强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1903.08254.pdf) | PEARL | ICML19 | 使用概率潜在上下文编码过去任务的经验，并利用推理网络估计后验分布 |\n| [通过元学习实现快速上下文适应](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1810.03642.pdf) | CAVIA | ICML19 | 提出CAVIA作为MAML的扩展，它更不易发生元过拟合，更容易并行化且更具可解释性；将模型参数分为两部分：上下文参数和共享参数，在测试阶段仅更新前者 |\n| [驯服MAML：高效无偏的元强化学习](http:\u002F\u002Fproceedings.mlr.press\u002Fv97\u002Fliu19g\u002Fliu19g.pdf) |  | ICML19 |  |\n| [Meta-World：多任务与元强化学习的基准与评估](http:\u002F\u002Fproceedings.mlr.press\u002Fv100\u002Fyu20a\u002Fyu20a.pdf) | Meta World | CoRL19 | 一个用于元强化学习以及多任务强化学习的环境 |\n| [引导式元策略搜索](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1904.00956.pdf) | GMPS | NeurIPS19 | 通过监督模仿学习来提高元训练过程中的样本效率 |\n| [元Q学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1910.00125.pdf) | MQL | ICLR20 | 一种用于元强化学习的离策略算法，其基于三个简单想法：(i) 使用过去轨迹表示的上下文变量进行Q学习的表现已可与SOTA相媲美；(ii) 多任务目标对元强化学习很有用；(iii) 元训练回放缓存中的历史数据可以被重复利用 |\n| [Varibad：一种通过元学习实现贝叶斯自适应深度强化学习的优秀方法](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1910.08348.pdf) | variBAD | ICLR20 | 使用一个学习到的低维随机潜在变量m来表示单个MDP M；联合元训练一个变分自编码器，该编码器能够在新任务中推断出关于m的后验分布，以及一个条件于这种关于MDP嵌入的后验信念的策略 |\n| [关于模型无关元学习全局最优性的研究](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2006.13182.pdf)，[ICML版本](http:\u002F\u002Fproceedings.mlr.press\u002Fv119\u002Fwang20b\u002Fwang20b-supp.pdf) | ---- | ICML20 | 对MAML在强化学习和监督学习中所达到的驻点的最优性差距进行刻画 |\n| [通过模型识别和经验重标记实现对分布漂移鲁棒的元强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2006.07178.pdf) | MIER | arxiv20 |  |\n| [FOCAL：基于距离度量学习和行为正则化的高效全离线元强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2010.01112.pdf) | FOCAL | ICLR21 | 首先探讨全离线元强化学习问题；在PEARL的基础上提出FOCAL |\n| [带有优势加权的离线元强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2008.06043.pdf) | MACAW | ICML21 | 引入离线元强化学习这一问题设定；提出一种基于优化的元学习算法MACAW，该算法在元训练的内层和外层循环中均使用简单的监督回归目标 |\n| [通过潜在动力学混合的虚拟任务提升元强化学习的泛化能力](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2105.13524.pdf) | LDM | NeurIPS21 | 目的是训练一个在训练过程中就能为未见测试任务做好准备的智能体，建议在原始训练任务之外，还训练混合任务，以防止智能体过拟合训练任务 |\n| [通过离策略评估统一元强化学习的梯度估计器](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.13125.pdf) | ---- | NeurIPS21 | 基于离策略评估的概念，提出一个统一框架，用于估计基于梯度的元强化学习中价值函数的高阶导数 |\n| [模型无关元学习算法的泛化：重复与未见任务](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.03832.pdf) | ---- | NeurIPS21 |  |\n| [离线元探索学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2008.02598.pdf)，[离线元强化学习——可识别性挑战与有效数据收集策略](https:\u002F\u002Fopenreview.net\u002Fpdf?id=IBdEfhLveS) | BOReL | NeurIPS21 |  |\n| [关于去偏模型无关元强化学习收敛理论的研究](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2002.05135.pdf) | SG-MRL | NeurIPS21 |  |\n| [事后任务重标记：稀疏奖励元强化学习中的经验回放](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.00901.pdf) | ---- | NeurIPS21 |  |\n| [基于PAC-Bayes和一致稳定性的元学习泛化界](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.06589.pdf) | ---- | NeurIPS21 | 结合PAC-Bayes技术和一致稳定性，为元学习提供泛化界 |\n| [自举式元学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04504.pdf) | BMG | ICLR22 口头报告 | 提出BMG，旨在让元学习者自我指导，以应对元学习中的病态问题和短视的元目标；BMG引入元自举机制，以缓解短视问题，并将元目标表述为最小化与控制曲率的距离 |\n| [带有正则化的基于模型的离线元强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.02929.pdf) | MerPO, RAC | ICLR22 | 实证指出，在数据质量较好的任务上，离线元强化学习的表现可能不如离线单任务强化学习方法；探讨如何学习一个信息丰富的元策略，以在“探索”元策略所引导的分布外状态-动作对与“利用”离线数据集、贴近行为策略之间取得最佳平衡；提出MerPO，该方法学习一个高效的任务结构推理模型，以及一个安全探索分布外状态-动作对的信息丰富的元策略 |\n| [基于技能的元强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=jeLW-Fh9bV) | SiMPL | ICLR22 | 提出一种方法，同时利用(i)一个包含大量跨任务、无奖励或任务标注的历史经验的大型离线数据集，以及(ii)一组元训练任务，以学习如何快速解决未见的长 horizon 任务 |\n| [元强化学习中的事后-事前重标记](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.09031.pdf) | HFR | ICLR22 | 专注于通过数据共享提高元训练阶段的样本效率；将重标记技术与元强化学习算法相结合，以同时提升样本效率和渐近性能 |\n| [CoMPS：持续元策略搜索](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04467.pdf) | CoMPS | ICLR22 | 首先提出了持续元强化学习的设置，即智能体一次只与一个任务交互，完成任务后便不再与其互动 |\n| [在强化学习中学习用于在线适应的策略子空间](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.05169.pdf) | ---- | ICLR22 | 考虑仅有一个训练环境的情况；提出一种方法，即在参数空间中学习一个策略子空间 |\n| [一种适用于非平稳环境、具有分段稳定上下文的自适应深度强化学习方法](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.12735.pdf) | SeCBAD | NeurIPS22 | 引入具有分段稳定上下文的潜在情境MDP；联合推断潜在上下文的信念分布与各片段长度的后验分布，并利用当前上下文片段内的观测数据进行更精确的上下文信念推断 |\n| [使用图结构代理模型和摊销策略搜索的基于模型的元强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.08291.pdf) | GSSM | ICML22 | 考虑基于模型的元强化学习，包括动力学模型学习和策略优化；开发了一种具有更强跨任务泛化能力的图结构动力学模型 |\n| [用于序列决策的元学习假设空间](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00602.pdf) | Meta-KeL | ICML22 | 提出Meta-KeL，用于元学习序列决策任务的假设空间 |\n| 
[Transformer是元强化学习者](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06614.pdf) | TrMRL | ICML22 | 认为Transformer的两项关键能力——处理长期依赖关系以及通过自注意力机制呈现上下文相关的权重——构成了元强化学习者的核心角色；提出TrMRL，一种基于记忆的元强化学习者，利用Transformer架构来构建学习过程 |\n| [ContraBAR：对比式贝叶斯自适应深度强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.02418.pdf) | ContraBAR | ICML23 | 探究对比式方法，如对比预测编码，是否可用于学习贝叶斯最优行为 |
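\n\n上表中 MAML（ICML17）的核心是“优化初始参数，使其只需几步梯度下降即可适应新任务”。下面用十几行代码给出内外层循环的极简示意（假设性实现：loss_fn 与任务数据均为占位，params 需为开启梯度的张量列表，并非论文官方代码）：\n\n```python\nimport torch\n\ndef maml_step(params, tasks, loss_fn, inner_lr=0.01, outer_lr=0.001):\n    '''MAML单步元更新示意：tasks 为 (支撑集, 查询集) 对的列表。'''\n    meta_grads = [torch.zeros_like(p) for p in params]\n    for support, query in tasks:\n        # 内层：在支撑集上做一步梯度下降，得到任务自适应参数\n        inner_loss = loss_fn(params, support)\n        grads = torch.autograd.grad(inner_loss, params, create_graph=True)\n        adapted = [p - inner_lr * g for p, g in zip(params, grads)]\n        # 外层：用自适应参数在查询集上求损失，并对原始参数累积元梯度\n        outer_loss = loss_fn(adapted, query)\n        for mg, g in zip(meta_grads, torch.autograd.grad(outer_loss, params)):\n            mg += g\n    # 对任务取平均后更新初始参数\n    return [p - outer_lr * mg \u002F len(tasks) for p, mg in zip(params, meta_grads)]\n```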
\n\n\u003Ca id='Adversarial-RL'>\u003C\u002Fa>\n\n\n## 对抗性强化学习\n\n| 标题 | 方法 | 会议 | 描述 |\n| ----  | ----   | ----       |   ----  |\n| [神经网络策略的对抗攻击](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1702.02284.pdf) | ---- | ICLR 2017研讨会 | 首次表明，现有的结合深度神经网络的强化学习策略在白盒和黑盒设置下都容易受到对抗噪声的影响 |\n| [深入研究深度策略的对抗攻击](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1705.06452.pdf) | ---- | ICLR 2017研讨会 | 表明强化学习算法易受对抗噪声影响；同时指出对抗训练可以提高鲁棒性 |\n| [鲁棒的对抗强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1703.02702.pdf) | RARL | ICML17 | 将鲁棒策略学习形式化为一个零和、极小极大目标函数 |\n| [针对深度强化学习的隐蔽且高效的对抗攻击](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2005.07099.pdf) | 临界点攻击、对抗者攻击 | AAAI20 | 临界点攻击：构建模型预测未来的环境状态和智能体动作以进行攻击；对抗者攻击：自动学习一个领域无关的攻击模型 |\n| [约束马尔可夫决策过程中的安全强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2008.06626.pdf) | SNO-MDP | ICML20 | 探索并优化在未知安全约束下的马尔可夫决策过程 |\n| [针对状态观测中对抗扰动的鲁棒深度强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2003.08938.pdf) | SA-MDP | NeurIPS20 | 将状态观测上的对抗攻击形式化为SA-MDP；提出几种新颖的攻击方法：鲁棒SARSA和最大动作差异；并提出防御框架及若干实用方法：SA-DQN、SA-PPO和SA-DDPG |\n| [基于学习到的最优对手的状态观测鲁棒强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2101.08452.pdf) | ATLA | ICLR21 | 使用强化学习算法训练“最优”对手；交替训练“最优”对手和鲁棒智能体 |\n| [通过对抗损失实现鲁棒深度强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2008.01976.pdf) | RADIAL-RL | NeurIPS21 | 提出一种鲁棒强化学习框架，该框架会惩罚不同动作输出边界之间的重叠；同时提出一种更高效的评估方法（GWC）来衡量攻击不可知的鲁棒性 |\n| [用于可证明鲁棒强化学习的策略平滑](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.11420.pdf) | 策略平滑 | ICLR22 | 将随机平滑引入强化学习；提出自适应的奈曼-皮尔逊引理 |\n| [CROP：通过功能平滑认证强化学习的鲁棒策略](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.09292.pdf) | CROP | ICLR22 | 提出一个用于认证强化学习鲁棒策略的框架（CROP），以应对对抗性的状态扰动，并设定两个认证标准：单步动作的鲁棒性和累积奖励的下界；从理论上证明了认证半径；并通过实验为Atari游戏中的六种经验上鲁棒的强化学习算法提供了认证 |\n| [理解深度强化学习中观测的对抗攻击](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.15860.pdf) | ---- | SCIS 2023 | 总结当前基于优化的强化学习对抗攻击；提出两阶段方法：训练一个欺骗性策略，并诱使受害者模仿该策略 |\n| [一致性攻击：具身视觉导航中的通用对抗扰动](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.05751.pdf) | 奖励UAP、轨迹UAP | PRL 2023 | 将通用对抗扰动扩展到序列决策中，并利用动态特性提出了奖励UAP和轨迹UAP；并在具身视觉导航任务中进行了实验 |\n\n\u003Ca id='Genaralization-in-RL'>\u003C\u002Fa>\n## 强化学习中的泛化\n\n\u003Ca id='Gene-Environments'>\u003C\u002Fa>\n### 环境\n\n| 标题 | 方法 | 会议 | 描述 |\n| ----  | ----   | ----       |   ----  |\n| [量化强化学习中的泛化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1812.02341.pdf) | CoinRun | ICML19 | 引入名为CoinRun的新环境用于强化学习中的泛化；实证表明L2正则化、丢弃法、数据增强和批归一化都能提升强化学习的泛化能力 |\n| [利用程序化生成技术基准测试强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1912.01588.pdf) | Procgen基准 | ICML20 | 引入Procgen基准，这是一套由16个程序化生成的游戏类环境组成的集合，旨在同时评估强化学习的样本效率和泛化能力 |\n\n\u003Ca id='Gene-Methods'>\u003C\u002Fa>\n### 方法\n\n| 标题 | 方法 | 会议 | 描述 |\n| ----  | ----   | ----       |   ----  |\n| [面向连续控制中的泛化与简洁性](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1703.02660.pdf) | ---- | NeurIPS17 | 采用简单线性和RBF参数化的策略可以被训练来解决多种广泛研究的连续控制任务；通过在多样化的初始状态分布上进行训练，能够得到更具全局性的策略，并实现更好的泛化能力 |\n| [通用规划网络](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1804.00645.pdf) | UPN | ICML18 | 研究一种基于模型的架构，在与前向动力学联合学习的潜在空间中执行可微分的规划计算，并以端到端的方式进行训练，从而通过基于梯度的规划来编码解决问题所需的信息 |\n| [再参数化强化学习中的泛化差距问题](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1905.12654.pdf) | ---- | ICML19 | 从理论上为再参数化强化学习中内生和外生误差的期望回报与经验回报之间的差距提供保证 |\n| [连续深度强化学习中的泛化研究](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1902.07015.pdf) | ---- | arxiv19 | 研究连续控制领域深度强化学习的泛化问题 |\n| [选择性噪声注入与信息瓶颈在强化学习泛化中的应用](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1910.12911.pdf) | SNI | NeurIPS19 | 考虑利用向学习函数中注入噪声来提升泛化性能的正则化技术；旨在保持注入噪声的正则化效果，同时减轻其对梯度质量的负面影响 |\n| [网络随机化：深度强化学习中一种简单的泛化技术](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1910.05396.pdf) | 网络随机化 | ICLR20 | 提出一种随机化的（卷积）神经网络，该网络会随机扰动输入观测，从而使训练好的智能体能够通过学习在不同且随机化的环境中保持不变的鲁棒特征来适应新领域 |\n| [强化学习中的观测过拟合](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1912.02975.pdf) | 观测过拟合 | ICLR20 | 讨论观测过拟合可能发生的实际场景及其与其他混淆因素的区别，并设计一个参数化的理论框架来诱导观测过拟合，该框架可应用于任何底层的MDP |\n| [基于上下文的动力学模型用于基于模型的强化学习中的泛化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2005.06800.pdf) | CaDM | ICML20 | 将学习全局动力学模型的任务分解为两个阶段：(a) 学习捕捉局部动力学的上下文隐向量，然后 (b) 在其条件下来预测下一个状态 |\n| [混合正则化提升强化学习泛化能力](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2010.10814.pdf) | mixreg | NeurIPS20 | 在来自不同训练环境的混合观测数据上训练智能体，并对观测插值及监督信号（如相关奖励）插值施加线性约束 |\n| [基于实例的强化学习泛化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2011.01089.pdf) | IPAE | NeurIPS20 | 将训练关卡形式化为实例，并证明这种基于实例的观点与标准的部分可观测马尔可夫决策过程表述完全一致；基于训练实例的数量给出训练环境与测试环境之间价值差距的泛化界，并利用这些见解来提升智能体在未见关卡上的表现 |\n| [对比行为相似性嵌入用于强化学习中的泛化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2101.05265.pdf) | PSM | ICLR21 | 将强化学习中固有的序列结构融入表示学习过程中，以提升泛化能力；提出一种理论驱动的策略相似性度量（PSM），用于衡量状态间的行为相似性 |\n| [软数据增强提升强化学习泛化能力](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2011.13389.pdf) | SODA | ICRA21 | 对编码器施加软约束，旨在最大化增强数据与非增强数据潜在表示之间的互信息 |\n| [增强世界模型助力从单一离线环境中实现零样本动力学泛化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2104.05632.pdf) | AugWM | ICML21 | 考虑“从单一离线环境中进行动力学泛化”的设定，并重点关注对未见动力学的零样本性能；提出针对基于模型的离线强化学习的动力学增强方法；并设计了一种简单的自监督、无需奖励的上下文适应算法 |\n| [解耦价值与策略以提升强化学习泛化能力](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.10330.pdf) | IDAAC | ICML21 | 将策略和价值函数的优化解耦，分别使用独立的网络对其进行建模；引入辅助损失项，鼓励表示对环境的无关属性保持不变 |\n| [为什么强化学习中的泛化如此困难：认识论POMDP与隐式部分可观测性](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.06277.pdf) | LEEP | NeurIPS21 | 强化学习中的泛化会引发隐式的部分可观测性；提出LEEP方法，利用策略集合近似学习贝叶斯最优策略，以最大化测试时的性能 |\n| [强化学习中的自动数据增强](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2006.12862.pdf) | DrAC | NeurIPS21 | 专注于基于两种新型正则化项的策略和价值函数的自动数据增强 |\n| [何时可实现可泛化的强化学习？](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2101.00300.pdf) | ---- | NeurIPS21 | 提出弱邻近性和强邻近性，用于从理论上分析强化学习的泛化能力 |\n| [深度强化学习中泛化问题综述](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.09794.pdf) | ---- | arxiv21 | 提供统一的理论框架和术语体系，用于讨论不同的泛化问题 |\n| [跨轨迹表示学习用于强化学习中的零样本泛化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.02193.pdf) | CTRL | ICLR22 | 考虑零样本泛化（ZSG）；利用自监督学习跨任务学习表示 |\n| [预训练表示在RL智能体OOD泛化中的作用](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.05686.pdf) | ---- | ICLR22 | 训练了240个表示和11,520个下游策略，并系统地考察它们在各种分布偏移下的表现；发现一个特定的表示指标，该指标衡量简单下游代理任务的泛化能力，能够可靠地预测下游RL智能体在所考虑的广泛OOD设置下的泛化能力 |\n| [通过逻辑组合实现终身强化学习中的泛化](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ZOcX-eybqoL) | ---- | ICLR22 | 利用强化学习中的逻辑组合创建一个框架，使智能体能够自主判断新任务是否可直接利用现有能力解决，或者是否需要学习特定技能 |\n| [局部特征交换用于强化学习中的泛化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.06355.pdf) | CLOP | ICLR22 | 提出一种新的正则化技术，即在特征图中进行通道一致的局部置换 |\n| [通才智能体](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.06175.pdf) | Gato | arxiv2205 | [幻灯片](https:\u002F\u002Fml.cs.tsinghua.edu.cn\u002F~chengyang\u002Freading_meeting\u002FReading_Meeting_20220607.pdf) |\n| [通过约束条件风险价值实现安全强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04436.pdf) | CPPO | IJCAI22 | 发现修改观测与改变动力学之间存在联系，尽管两者在结构上截然不同 |\n| [CtrlFormer：通过Transformer学习用于视觉控制的可迁移状态表示](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08883.pdf) | CtrlFormer | ICML22 | 在不同控制任务之间联合学习视觉token与策略token之间的自注意力机制，从而能够在不发生灾难性遗忘的情况下学习和迁移多任务表示 |\n| [强化学习中的动力学学习与泛化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02126.pdf) | ---- | ICML22 | 
从理论上表明，时序差分学习会在训练早期促使智能体拟合价值函数中的非平滑成分，同时还会产生抑制泛化的二阶效应 |\n| [通过通才-专才学习改进策略优化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.12984.pdf) | GSL | ICML22 | 希望利用专家的经验来帮助通才的策略优化；提出了多任务学习中的“灾难性无知”现象 |\n| [DRIBO：基于多视角信息瓶颈的鲁棒深度强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.13268.pdf) | DRIBO | ICML22 | 在无监督的多视角设置下，从观测中学习仅包含任务相关信息的鲁棒表示；为时序数据引入了一种新颖的多视角信息瓶颈（MIB）目标的对比版本 |\n| [利用变分因果推理泛化目标条件强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09081.pdf) | GRADER | NeurIPS22 | 使用因果图作为隐变量重新表述GCRL问题，进而从解决该问题中推导出迭代式训练框架 |\n| [重新思考强化学习中的价值函数学习以促进泛化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09960.pdf) | DCPG、DDCPG | NeurIPS22 | 考虑在多个训练环境中训练智能体以提升观测泛化性能；指出在多环境设置下，价值网络的优化难度更大；提出通过惩罚价值网络的大规模估计来防止过拟合的正则化方法 |\n| [掩码自编码用于可扩展且可泛化的决策](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12740.pdf) | MaskDP | NeurIPS22 | 将掩码自编码器（MAE）应用于强化学习（RL）和行为克隆（BC）的状态-动作轨迹，从而获得零样本迁移到新任务的能力 |\n| [预训练图像编码器用于可泛化的视觉强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08860.pdf) | PIE-G | NeurIPS22 | 发现ImageNet预训练ResNet模型的早期层可以为视觉强化学习提供相当具有泛化能力的表示 |\n| [关注你所关注的地方！基于显著性引导的Q网络用于视觉强化学习任务](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09203.pdf) | SGQN | NeurIPS22 | 提出优秀的视觉策略应能识别对其决策至关重要的像素；并在不同图像之间保持对重要信息来源的识别 |\n| [开放任务空间中的人类时间尺度适应](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.07608.pdf) | AdA | arXiv 2301 | 表明大规模训练RL智能体可以得到一种情境感知的学习算法，该算法能够像人类一样快速适应开放式的新颖具身3D问题 |\n| [通过算法蒸馏实现情境感知强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14215.pdf) | AD | ICLR23口头报告 | 提出算法蒸馏方法，通过使用因果序列模型模拟训练历史，将强化学习（RL）算法蒸馏进神经网络 |\n| [隐藏参数MDP中模型与策略迁移的性能边界](https:\u002F\u002Fopenreview.net\u002Fpdf?id=sSt9fROSZRO) | ---- | ICLR23 | 表明在预训练数据量固定的情况下，经过更多变化训练的智能体能够更好地泛化；同时指出提高价值和策略网络的容量对于取得良好性能至关重要 |\n| [强化学习中多任务预训练与泛化的研究](https:\u002F\u002Fopenreview.net\u002Fpdf?id=sSt9fROSZRO) | ---- | ICLR23 | 发现，在预训练数据量固定的情况下，经过更多变化训练的智能体能够更好地泛化；即使在进行了2亿环境帧的微调之后，这种优势仍然比零样本迁移时更为明显 |\n| [基于原型的跨域随机预训练用于强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.05614.pdf) | CRPTpro | arXiv2302 | 利用一种新颖的内在损失进行原型表示学习，从而在不同领域之间预训练出高效且通用的编码器 |\n| [任务感知梦想家用于强化学习中的任务泛化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.05092.pdf) | TAD | arXiv2303 | 提出任务分布相关性，以定量方式捕捉任务分布的相关性；并建议使用世界模型通过将奖励信号编码进策略来提升任务泛化能力 |\n| [基于模型的强化学习泛化的优势](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Vue1ulwlPD) | ---- | ICML23 | 提供理论和实证见解，说明我们何时以及如何能够预期由学习模型生成的数据是有用的 |\n| [多环境预训练支持迁移到行动受限的数据集](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13337.pdf) | ALPT | ICML23 | 给定n个具有完整动作标签数据集的源环境，考虑在目标环境中进行离线强化学习，该环境中仅有少量带动作标签的数据，而大部分数据则没有动作标签；利用逆动力学模型学习一种能够很好地泛化到目标环境中有限动作数据的表示 |\n| [面向可变动作空间的情境感知强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=pp3v2ch5Sd&name=pdf) | 无头AD | ICML24 | 将算法蒸馏扩展到具有可变离散动作空间的环境中 |\n\n\u003Ca id='Sequence-Generation'>\u003C\u002Fa>\n\n\n## 基于Transformer的强化学习\n\n| 标题 | 方法 | 会议 | 描述 |\n| ----  | ----   | ----       |   ----  |\n| [用于强化学习的稳定化Transformer](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1910.06764.pdf) | GTrXL | ICML20 | 通过重新排列层归一化，并在Transformer子模块的关键位置添加新的门控机制，来稳定训练过程 |\n| [决策Transformer：基于序列建模的强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.01345.pdf) | DT | NeurIPS21 | 将强化学习视为一个序列生成任务，使用Transformer生成（未来回报、状态、动作、未来回报，...）；没有显式的优化过程；在离线强化学习上进行评估 |\n| [将离线强化学习视为一个大型序列建模问题](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.02039.pdf) | TT | NeurIPS21 | 将强化学习视为一个序列生成任务，使用Transformer生成（s_0^0, ..., s_0^N, a_0^0, ..., a_0^M, r_0, ...）；采用束搜索进行推理；在模仿学习、目标条件强化学习和离线强化学习上进行评估 |\n| [维基百科能否帮助离线强化学习？](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12122.pdf) | ChibiT | arxiv2201 | 表明与决策Transformer相比，在自然语言自回归建模上进行预训练，无论是在流行的OpenAI Gym还是Atari环境中，都能带来持续的性能提升 |\n| 
[在线决策Transformer](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.05607.pdf) | ODT | ICML22口头报告 | 将离线预训练与在线微调融合在一个统一框架中；结合序列级别的熵正则化项和自回归建模目标，实现高效采样探索与微调 |\n| 针对少量样本策略泛化的提示式决策Transformer || ICML22 ||\n| [多游戏决策Transformer](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.15241.pdf) | ---- | NeurIPS22 | 表明仅通过离线训练的一个基于Transformer的模型，就能以接近人类水平的表现同时玩多达46款Atari游戏 |\n| [利用在线强化学习将大型语言模型嵌入交互式环境](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02662.pdf) | GLAM | ICML23 | 考虑让一个使用LLM作为策略的智能体，在与环境交互的过程中逐步更新其策略，借助在线强化学习不断提升其解决问题的能力 |\n\n\n\u003Ca id='Tutorial-and-Lesson'>\u003C\u002Fa>\n## 教程与课程\n\n| 教程与课程 |\n| ---- |\n| [强化学习：导论，理查德·S·萨顿和安德鲁·G·巴托](https:\u002F\u002Fd1wqtxts1xzle7.cloudfront.net\u002F54674740\u002FReinforcement_Learning-with-cover-page-v2.pdf?Expires=1641130151&Signature=eYy7kmTVqTXFcANS-9GZJUyb86cDqKeh2QX8VvzjouEM-QSfuiCm1WHhP~bW5C57Mecj6en~YRoTvxekzU5lq~UaHSBoc-7xP8dXBp91shcwdfJ8M0LUkktpqcQjXQi7ZzhGn33qZeah0p8S06ARzjimF5coL5arvp9yANAsy4KigXSZwAZNXxksKwqUAult2QseLL~Bv1p2locjYahRzTuex3vMxdBLhT9HOGFF0qOdKYxsWiaITUKnVYl8AvePDHEEXgfmuqEfjqjF5p~FHOsYl3gEDZOvUp1eUzPg2~i0MQXY49nUpzsThL5~unTRIsYJiBghnkYl8py0r~UelQ__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA) |\n| [戴维·西尔弗的强化学习入门](https:\u002F\u002Fdeepmind.com\u002Flearning-resources\u002F-introduction-reinforcement-learning-david-silver) | \n| [深度强化学习，CS285](https:\u002F\u002Frail.eecs.berkeley.edu\u002Fdeeprlcourse\u002F) |\n| [深度强化学习与控制，CMU 10703](https:\u002F\u002Fkatefvision.github.io\u002F) |\n| [RLChina](http:\u002F\u002Frlchina.org\u002Ftopic\u002F9) |\n\n\u003Ca id='ICLR22'>\u003C\u002Fa>\n\n## ICLR22\n| 论文 | 类型 |\n| ---- | ---- |\n| [自举元学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.04504.pdf) | 口头报告 |\n| [无监督强化学习的信息几何](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.02719.pdf) | 口头报告 |\n| [SO(2)等变强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04439.pdf) | 焦点论文 |\n| [CoBERL：用于强化学习的对比BERT](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.05431.pdf) | 焦点论文 |\n| [理解和防止强化学习中的容量损失](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ZkC8wKoLbQ7) | 焦点论文 |\n| [深度强化学习中的彩票假设与最小任务表示](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2105.01648.pdf) | 焦点论文 |\n| [利用离线示范指导进行稀疏奖励强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04628.pdf) | 焦点论文 |\n| [通过不确定性估计实现样本高效的深度强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.01666.pdf) | 焦点论文 |\n| [强化学习中基于生成式规划的时间协调探索](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.09765.pdf) | 焦点论文 |\n| [智能体何时应该探索？](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2108.11811.pdf) | 焦点论文 |\n| [再探基于模型的离线强化学习中的设计选择](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.04135.pdf) | 焦点论文 |\n| [DR3：基于值的深度强化学习需要显式正则化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04716.pdf) | 焦点论文 |\n| [用于不确定性驱动的离线强化学习的悲观自举法](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.11566.pdf) | 焦点论文 |\n| [COptiDICE：通过稳态分布校正估计实现约束离线强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=FLA55mBee6Q) | 焦点论文 |\n| [基于价值梯度加权的模型化强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.01464.pdf) | 焦点论文 |\n| [通过贝叶斯世界模型进行约束策略优化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.09802.pdf) | 焦点论文 |\n| [用于RL零样本泛化的跨轨迹表征学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.02193.pdf) | 海报展示 |\n| [预训练表征在RL智能体OOD泛化中的作用](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.05686.pdf) | 海报展示 |\n| [通过逻辑组合实现在终身强化学习中的泛化](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ZOcX-eybqoL) | 海报展示 |\n| [强化学习中的局部特征交换以促进泛化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.06355.pdf) | 海报展示 |\n| [用于可证明鲁棒强化学习的策略平滑](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.11420.pdf) | 海报展示 |\n| 
[CROP：通过函数平滑认证强化学习的鲁棒策略](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.09292.pdf) | 海报展示 |\n| [带有正则化的基于模型的离线元强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.02929.pdf) | 海报展示 |\n| [基于技能的元强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=jeLW-Fh9bV) | 海报展示 |\n| [元强化学习中的事后预见重标记](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2109.09031.pdf) | 海报展示 |\n| [CoMPS：持续的元策略搜索](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.04467.pdf) | 海报展示 |\n| [为强化学习中的在线适应学习策略子空间](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.05169.pdf) | 海报展示 |\n| [部分覆盖下的悲观基于模型的离线强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.06226.pdf) | 海报展示 |\n| [基于模型的离线强化学习的帕累托策略池](https:\u002F\u002Fopenreview.net\u002Fpdf?id=OqcZu8JIIzS) | 海报展示 |\n| [基于值的 episodic memory 的离线强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=RCZqv9NXlZ) | 海报展示 |\n| [隐式Q-learning的离线强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.06169.pdf) | 海报展示 |\n| [强化学习中的策略内模型误差](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.07985.pdf) | 海报展示 |\n| [最大熵RL（可证明地）解决某些鲁棒RL问题](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2103.06257.pdf) | 海报展示 |\n| [最大化深度强化学习中的集成多样性](https:\u002F\u002Fopenreview.net\u002Fpdf?id=hjd-kcpDpf2) | 海报展示 |\n| [通过行为相似性的自适应元学习器学习强化学习的可泛化表征](https:\u002F\u002Fopenreview.net\u002Fpdf?id=zBOI9LFpESK) | 海报展示 |\n| [利普希茨约束下的无监督技能发现](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00914.pdf) | 海报展示 |\n| [通过乐观探索学习更多技能](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.14226.pdf) | 海报展示 |\n\n\u003Ca id='ICML22'>\u003C\u002Fa>\n\n## ICML22\n| 论文 | 类型 |\n| ---- | ---- |\n| [在线决策 Transformer](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.05607.pdf) | 口头报告 |\n| 预训练视觉模型在控制任务中出人意料的有效性 | 口头报告 |\n| 最大状态熵探索中非马尔可夫性的重要性 | 口头报告 |\n| [基于扩散的规划用于灵活的行为合成](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.09991.pdf) | 口头报告 |\n| 用于离线强化学习的对抗训练演员-评论家 | 口头报告 |\n| 学习用于离线策略评估的贝尔曼完备表示 | 口头报告 |\n| [离线 RL 策略应当被训练为自适应的](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.02200.pdf) | 口头报告 |\n| [大规模批量经验回放](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.01528.pdf) | 口头报告 |\n| [可微分模拟器是否能为策略优化提供更好的梯度？](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00817.pdf) | 口头报告 |\n| 联邦强化学习：通信高效算法及收敛性分析 | 口头报告 |\n| [通用策略优化的解析更新规则](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2112.02045.pdf) | 口头报告 |\n| [基于几何策略组合的广义策略改进](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08736.pdf) | 口头报告 |\n| 通过提示引导决策 Transformer 实现少样本策略泛化 | 海报展示 |\n| [CtrlFormer：通过 Transformer 学习用于视觉控制的可迁移状态表示](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08883.pdf) | 海报展示 |\n| [强化学习中的动力学学习与泛化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.02126.pdf) | 海报展示 |\n| [通过通才-专才学习改进策略优化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.12984.pdf) | 海报展示 |\n| [DRIBO：基于多视角信息瓶颈的鲁棒深度强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.13268.pdf) | 海报展示 |\n| [用于鲁棒强化学习的策略梯度方法](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.07344.pdf) | 海报展示 |\n| SAUTE RL：利用状态增强实现几乎确定安全的强化学习 | 海报展示 |\n| 用于安全强化学习的约束变分策略优化 | 海报展示 |\n| 通过自助式机会主义课程进行鲁棒深度强化学习 | 海报展示 |\n| 分布鲁棒 Q 学习 | 海报展示 |\n| 基于采样噪声和标签噪声的鲁棒元学习——Eigen-Reptile 方法 | 海报展示 |\n| [基于图结构代理模型和摊销策略搜索的基于模型的元强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.08291.pdf) | 海报展示 |\n| [面向序列决策的元学习假设空间](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00602.pdf) | 海报展示 |\n| 元强化学习中具有剧烈方差缩减的有偏梯度估计 | 海报展示 |\n| [Transformer 就是元强化学习者](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.06614.pdf) | 海报展示 |\n| 带在线自监督的离线元强化学习 | 海报展示 |\n| 通过正则化基于模型策略的平稳分布来稳定离线强化学习 | 海报展示 |\n| 离线强化学习中的悲观 Q 学习：迈向最优样本复杂度 | 海报展示 |\n| 
如何在离线强化学习中利用未标注数据？ | 海报展示 |\n| 关于折扣因子在离线强化学习中作用的研究 | 海报展示 |\n| 批量策略优化中的模型选择 | 海报展示 |\n| 库普曼 Q 学习：基于动力学对称性的离线强化学习 | 海报展示 |\n| 通过对比学习为离线元强化学习构建鲁棒的任务表示 | 海报展示 |\n| 悲观主义与 VCG 机制的结合：利用离线强化学习学习动态机制设计 | 海报展示 |\n| 展示你的离线强化学习工作：在线评估预算很重要 | 海报展示 |\n| 约束下的离线策略优化 | 海报展示 |\n| [DreamerPro：基于原型表示的无重建基于模型的强化学习](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fdeng22a\u002Fdeng22a.pdf) | 海报展示 |\n| [迈向评估基于模型强化学习方法的自适应性](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fwan22d\u002Fwan22d.pdf) | 海报展示 |\n| [基于视频的无动作预训练强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.13880.pdf) | 海报展示 |\n| [去噪 MDP：学习比世界本身更好的世界模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.15477.pdf) | 海报展示 |\n| [用于模型预测控制的时间差分学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.04955.pdf) | 海报展示 |\n| [用于任务无关状态抽象的因果动力学学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.13452.pdf) | 海报展示 |\n| [我为什么要信任你，贝尔曼？贝尔曼误差并不能很好地替代价值误差](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12417.pdf) | 海报展示 |\n| [马尔可夫决策过程的自适应模型设计](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fchen22ab\u002Fchen22ab.pdf) | 海报展示 |\n| [从像素出发稳定离策略深度强化学习](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fcetin22a\u002Fcetin22a.pdf) | 海报展示 |\n| [理解策略梯度算法：基于灵敏度的方法](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fwu22i\u002Fwu22i.pdf) | 海报展示 |\n| [镜像学习：统一的策略优化框架](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.02373.pdf) | 海报展示 |\n| [基于演示的行动量化连续控制](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fdadashi22a\u002Fdadashi22a.pdf) | 海报展示 |\n| [使用可微函数近似器的离策略拟合 Q 评估：Z 估计与推断理论](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fzhang22al\u002Fzhang22al.pdf) | 海报展示 |\n| [基于时间差分的策略梯度估计方法](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Ftosatto22a\u002Ftosatto22a.pdf) | 海报展示 |\n| [深度强化学习中的首要性偏差](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.07802.pdf) | 海报展示 |\n| [利用深度强化学习优化序列实验设计](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00821.pdf) | 海报展示 |\n| [鲁棒价值函数的几何结构](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fwang22k\u002Fwang22k.pdf) | 海报展示 |\n| 通过约束强化学习直接指定行为 | 海报展示 |\n| [马尔可夫序列决策中的效用理论](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.13637.pdf) | 海报展示 |\n| [通过深度网络集成降低时间差分价值估计的方差](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fliang22c\u002Fliang22c.pdf) | 海报展示 |\n| 统一策略优化的近似梯度更新 | 海报展示 |\n| [EqR：用于数据高效强化学习的等变表示](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fmondal22a\u002Fmondal22a.pdf) | 海报展示 |\n| [带有短期记忆的可证明强化学习](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fefroni22a\u002Fefroni22a.pdf) | 海报展示 |\n| 通过双重拟合迭代最优估计离策略策略梯度 | 海报展示 |\n| 悬崖跳水：探索强化学习环境中的奖励曲面 | 海报展示 |\n| 用于 Q 函数学习的拉格朗日方法（及其在机器翻译中的应用） | 海报展示 |\n| 利用大规模结构化强化学习学习组装 | 海报展示 |\n| 解决强化学习序列建模中的乐观偏差 | 海报展示 |\n| 带延迟奖励的离策略强化学习 | 海报展示 |\n| 可达性约束强化学习 | 海报展示 |\n| [基于流的 POMDP 循环信念状态学习](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fchen22q\u002Fchen22q.pdf) | 海报展示 |\n| [通过嵌入技术对大型动作空间进行离策略评估](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06317.pdf) | 海报展示 |\n| [双稳健且分布鲁棒的离策略评估与学习](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fkallus22a\u002Fkallus22a.pdf) | 海报展示 |\n| [关于离策略评估中非参数 Q 函数估计的良好适定性及极小极大最优率的研究](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fchen22u\u002Fchen22u.pdf) | 海报展示 |\n| 通过最大熵强化学习进行沟通 | 海报展示 |\n\n\u003Ca id='NeurIPS22'>\u003C\u002Fa>\n\n## NeurIPS22\n| 论文 | 类型 |\n| ---- | ---- |\n| [多游戏决策变换器](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.15241.pdf) | 海报 |\n| [基于变分因果推理的目标条件强化学习泛化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.09081.pdf) | 海报 |\n| 
[重新思考强化学习中的值函数学习以实现泛化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.09960.pdf) | 海报 |\n| [用于可扩展和可泛化的决策制定的掩码自编码](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.12740.pdf) | 海报 |\n| [用于可泛化视觉强化学习的预训练图像编码器](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08860.pdf) | 海报 |\n| [GALOIS：通过可泛化的逻辑综合提升深度强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13728.pdf) | 海报 |\n| [看你想看的地方！面向视觉强化学习任务的显著性引导Q网络](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.09203.pdf) | 海报 |\n| [一种适用于具有分段稳定环境的非平稳环境的自适应深度强化学习方法](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.12735.pdf) | 海报 |\n| [基于模型的悲观主义调制动力学信念的离线强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06692.pdf) | 海报 |\n| [交替进行离线模型训练和策略学习的统一框架](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05922.pdf) | 海报 |\n| [用于离线无限宽度模型优化的双向学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07507.pdf) | 海报 |\n| DASCO：双生成器对抗支持约束下的离线强化学习 | 海报 |\n| [用于离线强化学习的支持策略优化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.06239.pdf) | 海报 |\n| [为何如此悲观？通过集成估计离线强化学习中的不确定性，以及它们独立性的重要性](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13703.pdf) | 海报 |\n| [用于离线强化学习的温和保守Q学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.04745.pdf) | 海报 |\n| [一种用于离线强化学习的策略引导模仿方法](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08323.pdf) | 海报 |\n| [用于离线强化学习的自举变换器](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.08569.pdf) | 海报 |\n| [LobsDICE：通过稳态分布校正估计进行观察式离线学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.13536.pdf) | 海报 |\n| [用于离线强化学习的潜在变量优势加权策略优化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.08949.pdf) | 海报 |\n| [我能走多远？通过f-优势回归实现离线目标条件强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.03023.pdf) | 海报 |\n| [NeoRL：一个接近真实世界的离线强化学习基准](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2102.00714.pdf) | 海报 |\n| [回报条件监督学习何时适用于离线强化学习？](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01079.pdf) | 海报 |\n| [用于离线强化学习的贝尔曼残差正交化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2203.12786.pdf) | 海报 |\n| [离线强化学习中模型选择的Oracle不等式](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.02016.pdf) | 海报 |\n| [不再错配：基于模型强化学习的联合模型-策略优化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.02758.pdf) | 海报 |\n| 何时更新你的模型：约束下的基于模型的强化学习 | 海报 |\n| 贝叶斯乐观优化：基于模型强化学习的乐观探索 | 海报 |\n| [具有贝叶斯探索的基于模型终身强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11579.pdf) | 海报 |\n| 计划以预测：为基于模型强化学习学习一个预见不确定性的模型 | 海报 |\n| 基于数据驱动的、通过不变表示学习进行的基于模型优化 | 海报 |\n| [非指数贴现的强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.13413.pdf) | 海报 |\n| [使用神经辐射场的强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01634.pdf) | 海报 |\n| [递归强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.11430.pdf) | 海报 |\n| [挑战凸强化学习中的常见假设](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.01511.pdf) | 海报 |\n| 可解释的策略搜索 | 海报 |\n| [关于强化学习与分布匹配在无灾难性遗忘情况下微调语言模型的应用](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00761.pdf) | 海报 |\n| [何时寻求帮助：自主强化学习中的主动干预](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10765.pdf) | 海报 |\n| 使用深度强化学习的自适应仿生鱼类模拟 | 海报 |\n| 生灭过程中的强化学习：打破对状态空间的依赖 | 海报 |\n| [发现式策略优化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05639.pdf) | 海报 |\n| 探索引导的奖励塑造：用于稀疏奖励下的强化学习 | 海报 |\n| [强化学习中的大规模检索](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.05314.pdf) | 海报 |\n| [用于自动出价的可持续在线强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.07006.pdf) | 海报 |\n| [LECO：用于特定任务内在奖励的学习型周期计数](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.05409.pdf) | 海报 |\n| [DNA：具有双重网络架构的近端策略优化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.10027.pdf) | 海报 |\n| 
[用较慢的在线网络加速深度强化学习](https:\u002F\u002Fassets.amazon.science\u002F31\u002Fca\u002F0c09418b4055a7536ced1b218d72\u002Ffaster-deep-reinforcement-learning-with-slower-online-network.pdf) | 海报 |\n| [混合策略范围的在线强化学习](https:\u002F\u002Fcausalai.net\u002Fr84.pdf) | 海报 |\n| [ProtoX：通过原型设计解释强化学习智能体](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.03162.pdf) | 海报 |\n| [马尔可夫决策过程中的难度：理论与实践](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13075.pdf) | 海报 |\n| [鲁棒的Phi散度MDP](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14202.pdf) | 海报 |\n| [关于策略梯度方法在一般随机博弈中收敛到纳什均衡的问题](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.08857.pdf) | 海报 |\n| [针对通用价值函数的统一离策略评估方法](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2107.02711.pdf) | 海报 |\n| [用于强化学习中数据高效策略评估的鲁棒在策略采样](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2111.14552.pdf) | 海报 |\n| 最优控制问题中的连续深度Q学习：归一化优势函数分析 | 海报 |\n| [参数化可重定向的决策者倾向于追求权力](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.13477.pdf) | 海报 |\n| [策略优化中的批次大小不变性](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2110.00641.pdf) | 海报 |\n| [具有最优运输差异的信任区域策略优化：连续动作下的对偶性和算法](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11137.pdf) | 海报 |\n| 用于强调式强化学习的自适应兴趣 | 海报 |\n| [多步分布强化学习中时间差分误差的本质](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.07570.pdf) | 海报 |\n| [重生的强化学习：重用先前计算以加速进展](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.01626.pdf) | 海报 |\n| [贝叶斯风险马尔可夫决策过程](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2106.02558.pdf) | 海报 |\n| [通过模型转换实现可解释的强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12006.pdf) | 海报 |\n| PDSketch：集成规划领域编程与学习 | 海报 |\n| [对比学习作为目标条件强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.07568.pdf) | 海报 |\n| [自监督学习真的能改善基于像素的强化学习吗？](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.05266.pdf) | 海报 |\n| [带有自动化辅助损失搜索的强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06041.pdf) | 海报 |\n| [基于掩码的强化学习潜在重建](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2201.12096.pdf) | 海报 |\n| [Iso-Dream：在世界模型中隔离不可控的视觉动态](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.13817.pdf) | 海报 |\n| [在少数无奖励部署中学习通用世界模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12719.pdf) | 海报 |\n| [通过变分稀疏门控学习鲁棒动力学](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.11698.pdf) | 海报 |\n| [用于无监督强化学习的惊喜混合](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06702.pdf) | 海报 |\n| [带有对比内在控制的无监督强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.00161.pdf) | 海报 |\n| [通过循环技能训练进行无监督技能发现](https:\u002F\u002Fopenreview.net\u002Fpdf?id=sYDX_OxNNjh) | 海报 |\n| 离策略TD学习中正则化的陷阱 | 海报 |\n| 针对行动依赖型非平稳环境的离策略评估 | 海报 |\n| [用于上下文赌博机离策略评估的局部度量学习，适用于连续动作](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13373.pdf) | 海报 |\n| [具有策略依赖型优化响应的离策略评估](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.12958.pdf) | 海报 |\n\n\u003Ca id='ICLR23'>\u003C\u002Fa>\n\n## ICLR23\n| 论文 | 类型 |\n| ---- | ---- |\n| 控制的二分法：区分你能控制与不能控制的事物 | 口头报告 |\n| [基于算法蒸馏的上下文强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14215.pdf) | 口头报告 |\n| 条件生成建模是否足以应对决策任务？ | 口头报告 |\n| 在多样化多任务数据上的离线Q学习：兼具规模性和泛化能力 | 口头报告 |\n| 针对离线强化学习的置信度条件值函数 | 口头报告 |\n| 极端Q学习：无需熵项的最大熵强化学习 | 口头报告 |\n| 稀疏Q学习：带有隐式值函数正则化的离线强化学习 | 口头报告 |\n| [Transformer是样本高效的世界模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.00588.pdf) | 口头报告 |\n| [通过突破重放缓冲区比例限制实现样本高效强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=OpC-9aBBVJe) | 口头报告 |\n| [利用不完美在线示范进行约束策略优化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.01728.pdf) | 展示报告 |\n| [面向人类友好的原型：迈向可解释的深度强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=hWwY_Jq0xsN) | 展示报告 |\n| 粉红噪声就够了：深度强化学习中的彩色噪声探索 | 展示报告 |\n| 
[DEP-RL：用于过度驱动和肌肉骨骼系统的具身强化学习探索](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2206.00484.pdf) | 展示报告 |\n| 离线强化学习中的样本内Softmax | 展示报告 |\n| 基于真实机器人硬件的离线强化学习基准测试 | 展示报告 |\n| [编舞者：在想象中学习与适应技能](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13350.pdf) | 展示报告 |\n| [通过值隐式预训练迈向通用视觉奖励与表征](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.00030.pdf) | 展示报告 |\n| 随机丢帧下的决策 Transformer | 海报展示 |\n| 用于高效在线策略适应的超决策 Transformer | 海报展示 |\n| 偏好 Transformer：利用Transformer建模人类偏好以应用于强化学习 | 海报展示 |\n| 对比图像变换在强化学习中的数据效率 | 海报展示 |\n| 智能体能否与陌生人接力赛跑？强化学习对分布外轨迹的泛化能力 | 海报展示 |\n| [隐藏参数MDP中模型与策略迁移的性能界](https:\u002F\u002Fopenreview.net\u002Fpdf?id=sSt9fROSZRO) | 海报展示 |\n| [探究强化学习中的多任务预训练与泛化](https:\u002F\u002Fopenreview.net\u002Fpdf?id=sSt9fROSZRO) | 海报展示 |\n| 强化学习中技能迁移的先验、层次结构与信息不对称 | 海报展示 |\n| 观测扰动下安全强化学习的鲁棒性 | 海报展示 |\n| 分布式元梯度强化学习 | 海报展示 |\n| 保守贝叶斯基于模型的值扩展用于离线策略优化 | 海报展示 |\n| 值记忆图：一种面向离线强化学习的图结构世界模型 | 海报展示 |\n| 基于学习模型的高效离线策略优化 | 海报展示 |\n| [扩散策略作为离线强化学习中富有表现力的策略类](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.06193.pdf) | 海报展示 |\n| [通过高保真生成行为建模实现离线强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.14548.pdf) | 海报展示 |\n| 决策S4：基于状态空间层的高效序列型强化学习 | 海报展示 |\n| 行为邻近策略优化 | 海报展示 |\n| 在稀疏奖励环境中学习成就结构以实现结构化探索 | 海报展示 |\n| 用轨迹解释强化学习决策 | 海报展示 |\n| 用户交互式离线强化学习 | 海报展示 |\n| 面向离线多目标强化学习的帕累托最优决策代理 | 海报展示 |\n| 带有隐式语言Q学习的自然语言生成离线强化学习 | 海报展示 |\n| 离线强化学习中的样本内演员评论家 | 海报展示 |\n| 通过轨迹加权整合混合离线强化学习数据集 | 海报展示 |\n| 关注差距：针对不完美奖励的离线策略优化 | 海报展示 |\n| [当数据几何遇上深度函数：泛化离线强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.11027.pdf) | 海报展示 |\n| MAHALO：统一基于观测的离线强化学习与模仿学习 | 海报展示 |\n| [基于Transformer的世界模型仅需10万次交互即可满足要求](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.07109.pdf) | 海报展示 |\n| [动态更新数据比例：最小化世界模型过拟合](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.10144.pdf) | 海报展示 |\n| [在3D迷宫中评估长期记忆](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13383.pdf) | 海报展示 |\n| 通过直接规划连续控制做出更好决策 | 海报展示 |\n| HiT-MDP：在具有隐藏时间嵌入的MDP上学习SMDP选项框架 | 海报展示 |\n| 基于模型的强化学习中值扩展方法的边际收益递减 | 海报展示 |\n| [简化基于模型的强化学习：以单一目标同时学习表征、潜在空间模型和策略](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.08466.pdf) | 海报展示 |\n| [SpeedyZero：用有限的数据和时间掌握Atari游戏](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Mg5CLXZgvLJ) | 海报展示 |\n| [高效的深度强化学习需要调控统计过拟合](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2304.10466.pdf) | 海报展示 |\n| 回放内存作为经验MDP：结合保守估计与经验回放 | 海报展示 |\n| [贪婪演员评论家：一种用于策略改进的新条件交叉熵方法](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1810.09103.pdf) | 海报展示 |\n| [利用语言模型设计奖励](https:\u002F\u002Fopenreview.net\u002Fpdf?id=10uNUgI5Kl) | 海报展示 |\n| [通过Q学习解决连续控制问题](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.12566.pdf) | 海报展示 |\n| [Wasserstein自编码MDP：以多方保证形式正式验证高效蒸馏的强化学习策略](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.12558.pdf) | 海报展示 |\n| 基于群体的强化学习实现质量相似的多样性 | 海报展示 |\n| [人类水平的Atari游戏快200倍](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.07550.pdf) | 海报展示 |\n| 策略扩展用于连接离线与在线强化学习 | 海报展示 |\n| [通过值函数搜索改进深度策略梯度](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.10145.pdf) | 海报展示 |\n| [Memory Gym：面向部分可观测智能体的记忆挑战](https:\u002F\u002Fopenreview.net\u002Fpdf?id=jHc8dCx6DDr) | 海报展示 |\n| [混合强化学习：同时使用离线和在线数据可使强化学习更高效](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.06718.pdf) | 海报展示 |\n| [POPGym：部分可观测强化学习的基准测试](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2303.01859.pdf) | 海报展示 |\n| [评论家顺序蒙特卡洛](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.15460.pdf) | 海报展示 |\n| 具有亲和力正则化的可撤销深度强化学习，用于对抗异常值的稳健图匹配 | 海报展示 |\n| 面向离线强化学习的可证明无监督数据共享 | 海报展示 |\n| 使用DOMiNO发现策略：保持接近最优的多样性优化 | 海报展示 |\n| [强化学习中的潜在变量表示](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.08765.pdf) | 海报展示 |\n| 强化学习中的谱分解表示 | 海报展示 |\n| 面向离线强化学习的行为先验表征学习 | 海报展示 |\n| 
[仅通过观看纯视频即可在有限数据下成为熟练玩家](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Sy-o2N0hF4f) | 海报展示 |\n| 用于离策略评估的变分潜在分支模型 | 海报展示 |\n\n\u003Ca id='ICML23'>\u003C\u002Fa>\n\n## ICML23\n| 论文 | 类型 |\n| ---- | ---- |\n| [预训练在强化学习泛化中的力量：可证明的优势与困难](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.10464.pdf) | 口头报告 |\n| [AdaptDiffuser：作为自适应自我进化规划器的扩散模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.01877.pdf) | 口头报告 |\n| [用于多模态轨迹优化的重参数化策略学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.10710.pdf) | 口头报告 |\n| [从像素中掌握无监督强化学习基准](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2209.12016.pdf) | 口头报告 |\n| [深度强化学习中的休眠神经元现象](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.12902.pdf) | 口头报告 |\n| [通过解耦环境与智能体表征实现高效强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=kWS8mpioS9) | 口头报告 |\n| [时序差分学习的统计优势研究](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.13289.pdf) | 口头报告 |\n| 热启动演员-评论家：从近似误差到次优性差距 | 口头报告 |\n| 基于潜在意图的被动数据强化学习 | 口头报告 |\n| 三维环境中的子等变图强化学习 | 口头报告 |\n| 基于多步逆运动学的表征学习：一种高效且最优的丰富观测强化学习方法 | 口头报告 |\n| 抛硬币估计强化学习探索中的伪计数 | 口头报告 |\n| [奖励假设的最终解答](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.10420.pdf) | 口头报告 |\n| 多视角强化学习的信息论状态空间模型 | 口头报告 |\n| [用于部分可观测深度强化学习的信任表征学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=4IzEmHLono) | 海报展示 |\n| [内部奖励强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.00270.pdf) | 海报展示 |\n| 基于多个黑盒oracle的主动策略改进 | 海报展示 |\n| 在什么情况下，可实现性足以支持离策略强化学习？ | 海报展示 |\n| 分位数时序差分学习在价值估计中的统计优势 | 海报展示 |\n| [强化学习中的超参数及其调优方法](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.01324.pdf) | 海报展示 |\n| 对数通信下的朗之万汤普森采样：多臂老虎机与强化学习 | 海报展示 |\n| [纠正同策略策略梯度方法中的折扣因子不匹配问题](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.13284.pdf) | 海报展示 |\n| 用于预测、表征和控制的掩码轨迹模型 | 海报展示 |\n| 具有确定性策略搜索的离策略平均奖励演员-评论家 | 海报展示 |\n| TGRL：教师引导强化学习算法 | 海报展示 |\n| LIV：面向机器人控制的语言-图像表征与奖励 | 海报展示 |\n| 斯坦因变分目标生成用于多目标强化学习中的自适应探索 | 海报展示 |\n| 深度强化学习中自适应昼夜节律的涌现 | 海报展示 |\n| 使用夏普利值解释强化学习 | 海报展示 |\n| [通过多重奖励使强化学习更高效](https:\u002F\u002Fopenreview.net\u002Fpdf?id=skDVsmXjPR) | 海报展示 |\n| [表演性强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2207.00046.pdf) | 海报展示 |\n| 蒙特卡洛强化学习中的轨迹截断 | 海报展示 |\n| ReLOAD：具有乐观上升-下降机制的强化学习，用于约束马尔可夫决策过程中的最后一轮收敛 | 海报展示 |\n| 具有在线敏感性采样的低切换策略梯度探索 | 海报展示 |\n| 双曲扩散嵌入与距离用于层次化表征学习 | 海报展示 |\n| 通过放松的状态对抗性策略优化重新审视领域随机化 | 海报展示 |\n| 并行$Q$-学习：大规模并行仿真下的离策略强化学习扩展 | 海报展示 |\n| LESSON：基于选项框架学习整合强化学习探索策略 | 海报展示 |\n| 基于双层优化的网络控制图强化学习 | 海报展示 |\n| 随机策略梯度方法：针对费舍尔非退化策略的样本复杂度改进 | 海报展示 |\n| [具有历史依赖动态上下文的强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02061.pdf) | 海报展示 |\n| 利用离线数据进行高效的在线强化学习 | 海报展示 |\n| 分布式强化学习中的方差控制 | 海报展示 |\n| 面向具有外生输入MDP的回溯学习 | 海报展示 |\n| RLang：一种用于向强化学习智能体描述部分世界知识的声明式语言 | 海报展示 |\n| 基于蒙特卡洛树搜索的可扩展安全策略改进 | 海报展示 |\n| 基于能量模型的奖励条件强化学习的贝叶斯重参数化 | 海报展示 |\n| 理解单任务强化学习在课程学习中的复杂度收益 | 海报展示 |\n| PPG重装上阵：关于相位策略梯度中关键因素的实证研究 | 海报展示 |\n| [关于多动作策略梯度的研究](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.13011.pdf) | 海报展示 |\n| 多任务层次化对抗性逆强化学习 | 海报展示 |\n| 无单元格的潜在式Go-Explore | 海报展示 |\n| 基于反事实无害准则的可信策略学习 | 海报展示 |\n| 强化学习中的可达性感知拉普拉斯表征 | 海报展示 |\n| 基于强化学习的交互式物体放置 | 海报展示 |\n| 在线强化学习中利用离线数据 | 海报展示 |\n| 具有通用效用函数的强化学习：更简单的方差缩减与更大的状态-动作空间 | 海报展示 |\n| DoMo-AC：双重多步离策略演员-评论家算法 | 海报展示 |\n| [奖励模型过度优化的规模定律](https:\u002F\u002Fopenreview.net\u002Fattachment?id=bBLjms8nZE&name=pdf) | 海报展示 |\n| SNeRL：面向强化学习的语义感知神经辐射场 | 海报展示 |\n| 基于集合成员信念状态的POMDP强化学习 | 海报展示 |\n| 鲁棒的满足型MDP | 海报展示 |\n| 基于联合效应建模的大动作空间离策略评估 | 海报展示 |\n| 具有优化动作解码的量子策略梯度算法 | 海报展示 |\n| 对于用于运动控制的预训练视觉模型，不同的策略学习方法并非等价 | 海报展示 |\n| 无模型鲁棒平均奖励强化学习 | 海报展示 |\n| 公平且鲁棒地估计异质性治疗效应以用于策略学习 | 海报展示 |\n| 面向离策略强化学习的轨迹感知资格迹 | 海报展示 |\n| 基于成对或K人比较的人类反馈原则性强化学习 | 海报展示 |\n| 
社会学习通过深度强化学习搜索最优启发式自发涌现 | 海报展示 |\n| [更大、更好、更快：以人类水平的效率实现人类水平的Atari游戏](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.19452.pdf) | 海报展示 |\n| [面向深度强化学习的后验采样](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.00477.pdf) | 海报展示 |\n| [基于模型的强化学习，采用可扩展的复合策略梯度估计器](https:\u002F\u002Fopenreview.net\u002Fpdf?id=rDMAJECBM2) | 海报展示 |\n| [超越想象：利用世界模型最大化剧集可达性](https:\u002F\u002Fopenreview.net\u002Fpdf?id=JsAMuzA9o2) | 海报展示 |\n| [简化的时序一致性强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.09466.pdf) | 海报展示 |\n| [具身智能体是否梦见像素化的羊：使用语言引导的世界建模进行具身决策](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12050.pdf) | 海报展示 |\n| 无需示范的自主强化学习，通过隐式及双向课程进行 | 海报展示 |\n| [基于模型适应的奇思妙想回放](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.15934.pdf) | 海报展示 |\n| [用于视觉机器人操作的多视角掩码世界模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02408.pdf) | 海报展示 |\n| [面向深度强化学习探索的自动内在奖励塑造](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.10886.pdf) | 海报展示 |\n| [事后的好奇心：随机环境中的内在探索](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.10515.pdf) | 海报展示 |\n| 基于奇异值分解的深度强化学习表征与探索 | 海报展示 |\n| [将大型语言模型置于交互环境中，结合在线强化学习进行接地](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.02662.pdf) | 海报展示 |\n| 将互联网规模的视觉-语言模型提炼为具身智能体 | 海报展示 |\n| VIMA：基于多模态提示的机器人操作 | 海报展示 |\n| 面向决策Transformer的未来条件无监督预训练 | 海报展示 |\n| 由一系列事后经验涌现的代理型Transformer | 海报展示 |\n| [基于模型的强化学习泛化优势](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Vue1ulwlPD) | 海报展示 |\n| [多环境预训练实现向动作受限数据集的迁移](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.13337.pdf) | 海报展示 |\n| [关于视觉-运动控制的预训练：重温从零开始的学习基线](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.05749.pdf) | 海报展示 |\n| 无监督技能发现，用于学习跨变化环境的共享结构 | 海报展示 |\n| 关于为强化学习预训练对象中心表征的调查 | 海报展示 |\n| 使用大型语言模型指导强化学习中的预训练 | 海报展示 |\n| 对于离线目标条件强化学习而言，实现未见目标泛化的关键是什么？ | 海报展示 |\n| 面向少样本策略迁移的在线原型对齐 | 海报展示 |\n| 检测深度强化学习中的对抗方向，以做出鲁棒决策 | 海报展示 |\n| 面对情境扰动的鲁棒情境强化学习 | 海报展示 |\n| 分布式强化学习的对抗性学习 | 海报展示 |\n| 朝着鲁棒且安全的强化学习迈进，利用良性离策略数据 | 海报展示 |\n| 作为元强化学习副产物的简单具身语言学习 | 海报展示 |\n| [ContraBAR：对比贝叶斯自适应深度强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.02418.pdf) | 海报展示 |\n| 基于模型的离线强化学习，采用基于计数的保守主义 | 海报展示 |\n| 基于模型的贝尔曼不一致，用于离线强化学习 | 海报展示 |\n| [无需在线实验即可学习时间抽象的世界模型](https:\u002F\u002Fopenreview.net\u002Fpdf?id=YeTYJz7th5) | 海报展示 |\n| 对比能量预测，用于离线强化学习中精确的能量引导扩散采样 | 海报展示 |\n| MetaDiffuser：作为离线元强化学习条件规划器的扩散模型 | 海报展示 |\n| 演员-评论家对齐，用于离线到在线强化学习 | 海报展示 |\n| 半监督离线强化学习，采用无动作轨迹 | 海报展示 |\n| 在丰富的外生信息存在下进行的原则性离线强化学习 | 海报展示 |\n| 具有分布内在线适应性的离线元强化学习 | 海报展示 |\n| 针对离线强化学习的数据集约束策略正则化 | 海报展示 |\n| 支持信任区域优化的离线强化学习 | 海报展示 |\n| 面向离线安全强化学习的约束型决策Transformer | 海报展示 |\n| 具有保证的PAC贝叶斯离线上下文多臂老虎机 | 海报展示 |\n| 超越奖励：基于偏好指导的离线策略优化 | 海报展示 |\n| 具有闭式策略改进算子的离线强化学习 | 海报展示 |\n| ChiPFormer：通过离线决策Transformer实现可转移的芯片布局 | 海报展示 |\n| 通过动作偏好查询提升离线强化学习效果 | 海报展示 |\n| [快速启动强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.02372.pdf) | 海报展示 |\n| 探究基于模型的学习在探索与迁移中的作用 | 海报展示 |\n| [STEERING：面向基于模型强化学习的斯坦因信息导向探索](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.12038.pdf) | 海报展示 |\n| [面向无监督基于模型强化学习的可预测MDP抽象](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.03921.pdf) | 海报展示 |\n| 基于模型强化学习中的懒惰美德：统一的目标与算法 | 海报展示 |\n| [关于特征去相关性在强化学习无监督表征学习中的重要性](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.05637.pdf) | 海报展示 |\n| CLUTR：通过无监督任务表征学习进行课程学习 | 海报展示 |\n| [面向可控性的无监督技能发现](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.05103.pdf) | 海报展示 |\n| [面向无监督技能发现的行为对比学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.04477.pdf) | 海报展示 |\n| 用于无监督技能发现的变分课程强化学习 | 海报展示 |\n| [强化学习中的自举表征](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.10171.pdf) | 海报展示 |\n| [表征驱动的强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.19922.pdf) | 海报展示 |\n| 针对算法资源分配随机试验的策略评估改进 | 海报展示 |\n| 面向混杂离策略评估的工具变量方法 | 海报展示 |\n| 
线性马尔可夫决策过程中半参数高效的离策略评估 | 海报展示 |\n\n\u003Ca id='NeurIPS23'>\u003C\u002Fa>\n\n## NeurIPS23\n| 论文 | 类型 |\n| ---- | ---- |\n| 通过显著性引导的特征去相关学习可泛化智能体 | 口头报告 |\n| 通过示范理解专业知识：一种用于离线逆强化学习的最大似然框架 | 口头报告 |\n| 当示范遇到生成式世界模型时：一种用于离线逆强化学习的最大似然框架 | 口头报告 |\n| DiffuseBot：利用物理增强的生成扩散模型培育软体机器人 | 口头报告 |\n| Transformer 在强化学习中何时大放异彩？将记忆与信用分配解耦 | 口头报告 |\n| 用有效视界连接强化学习理论与实践 | 口头报告 |\n| SwiftSage：一种具备快慢思维的生成式智能体，适用于复杂交互任务 | 聚光灯报告 |\n| RePo：通过正则化后验预测性实现稳健的基于模型的强化学习 | 聚光灯报告 |\n| 最大化以探索：一个融合估计、规划与探索的目标函数 | 聚光灯报告 |\n| [条件互信息在强化学习中的解耦表征应用](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.14133.pdf) | 聚光灯报告 |\n| 乐观自然策略梯度：一种简单高效的在线强化学习策略优化框架 | 聚光灯报告 |\n| 双重Gumbel Q学习 | 聚光灯报告 |\n| POMDPs中未来依赖的价值函数离线策略评估 | 聚光灯报告 |\n| 监督预训练可以学习上下文强化学习 | 聚光灯报告 |\n| 一次训练，获得一族策略：面向离线到在线强化学习的状态自适应平衡 | 聚光灯报告 |\n| 面向多场景安全强化学习的约束条件策略优化 | 海报展示 |\n| 在零样本强化学习中通过探索实现泛化 | 海报展示 |\n| 基于适应性上下文感知策略的强化学习动态泛化 | 海报展示 |\n| 通过表征区分提升离线强化学习的泛化能力 | 海报展示 |\n| 对比回顾：聚焦关键步骤以加速强化学习中的快速学习与泛化 | 海报展示 |\n| 元强化学习中的双重稳健增强迁移 | 海报展示 |\n| 循环超网络在元强化学习中表现惊人地强大 | 海报展示 |\n| 通过子任务分解参数化非参数化的元强化学习任务 | 海报展示 |\n| 统一的风险度量：一种基于风险敏感性的基于模型的离线强化学习视角 | 海报展示 |\n| 面向离线强化学习的有效扩散策略 | 海报展示 |\n| 利用离线强化学习学习如何影响人类行为 | 海报展示 |\n| 从策略出发设计：面向离线策略优化的保守测试时适应 | 海报展示 |\n| SafeDICE：使用非首选示范进行离线安全模仿学习 | 海报展示 |\n| 面向离线强化学习的显式行为密度约束策略优化 | 海报展示 |\n| 面向离线强化学习的保守状态价值估计 | 海报展示 |\n| 在POMDPs中使用离散代理表征提升泛化能力的离线强化学习 | 海报展示 |\n| 降低离线元强化学习中的情境漂移 | 海报展示 |\n| 基于互信息正则化的离线强化学习 | 海报展示 |\n| 通过离线强化学习中的逆动力学恢复未见状态 | 海报展示 |\n| 离线强化学习中的分位数准则优化 | 海报展示 |\n| 语言模型与世界模型相遇：具身经验增强语言模型 | 海报展示 |\n| 通过最大化证据进行动作推理：利用世界模型从观察中零样本模仿 | 海报展示 |\n| [世界模型骨干架构对决：RNN、Transformer和S4](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.02064.pdf) | 海报展示 |\n| 连续时间基于模型的强化学习中的高效探索 | 海报展示 |\n| 基于模型的再参数化策略梯度方法：理论与实用算法 | 海报展示 |\n| [通过指导学习发现技能](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.20178.pdf) | 海报展示 |\n| 在强化学习中创建多层级技能树 | 海报展示 |\n| 基于随机意图先验的无监督行为提取 | 海报展示 |\n| [MIMEx：来自掩码输入建模的内在奖励](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.08932.pdf) | 海报展示 |\n| f-策略梯度：一种使用f-散度的面向目标条件强化学习通用框架 | 海报展示 |\n| 持续强化学习中的预测与控制 | 海报展示 |\n| 残差Q学习：无需价值函数即可实现离线和在线策略定制 | 海报展示 |\n| 小批量深度强化学习 | 海报展示 |\n| 针对约束马尔可夫决策过程的最后迭代收敛策略梯度原对偶方法 | 海报展示 |\n| RLHF是否比标准RL更困难？一种理论视角 | 海报展示 |\n| Reflexion：具有语言强化学习能力的语言智能体 | 海报展示 |\n| 在强化学习中对具有任意约束的随机动作进行生成建模 | 海报展示 |\n| 扩散模型是多任务强化学习的有效规划者和数据合成器 | 海报展示 |\n| 无需奖励建模的直接偏好驱动策略优化 | 海报展示 |\n| 学习在强化学习中调制预训练模型 | 海报展示 |\n| 无知即幸福：通过信息门控实现鲁棒控制 | 海报展示 |\n| 用于上下文老虎机离策略评估的边际密度比 | 海报展示 |\n| 使用决策估计系数进行无模型强化学习 | 海报展示 |\n| 最优且公平的鼓励政策评估与学习 | 海报展示 |\n| BIRD：面向深度强化学习的可泛化后门检测与移除 | 海报展示 |\n| 正确实现强化学习中的概率推断 | 海报展示 |\n| 基于参考的POMDPs | 海报展示 |\n| 在MDPs中说服有远见的接收者：诚实的力量 | 海报展示 |\n| 分布式策略评估：一种基于最大熵的表征学习方法 | 海报展示 |\n| 面向上下文强化学习的结构化状态空间模型 | 海报展示 |\n| 方差的替代方案：针对风险厌恶型策略梯度的基尼偏差 | 海报展示 |\n| 面向风险敏感型强化学习的分布等价性 | 海报展示 |\n| PLASTIC：提升输入与标签可塑性以实现样本高效的强化学习 | 海报展示 |\n| 基于不完美示范的混合策略优化 | 海报展示 |\n| 噪音环境下的策略优化：关于连续控制中的回报景观 | 海报展示 |\n| 语义HELM：一种人类可读的强化学习记忆 | 海报展示 |\n| 持续强化学习的定义 | 海报展示 |\n| 面向Wasserstein分布鲁棒MDPs的快速贝尔曼更新 | 海报展示 |\n| 面向矩形鲁棒马尔可夫决策过程的策略梯度 | 海报展示 |\n| 通过对比学习发现强化学习中的层次化成就 | 海报展示 |\n| 蒙特卡洛策略评估中的轨迹截断：一种自适应方法 | 海报展示 |\n| 强化学习中的无模型主动探索 | 海报展示 |\n| 利用随机特征进行迁移的自监督强化学习 | 海报展示 |\n| FlowPG：基于归一化流的动作约束策略梯度 | 海报展示 |\n| 基于注意力的灵活多策略融合，用于高效深度强化学习 | 海报展示 |\n| 基于ODE的循环无模型强化学习，适用于POMDPs | 海报展示 |\n| 利用强化学习建议圆柱代数分解的变量顺序 | 海报展示 |\n| SPQR：利用尖峰随机模型控制Q集合独立性，用于强化学习 | 海报展示 |\n| CaMP：面向多房间场景交互导航的因果多策略规划 | 海报展示 |\n| 面向部分未知环境中的物体搜索的POMDP规划 | 海报展示 |\n| 统一的离线排序学习：一种强化学习视角 | 海报展示 |\n| 面向鲁棒强化学习的自然演员-评论家，结合函数近似 | 海报展示 |\n| 面向深度强化学习的长$N$步代理阶段奖励 | 海报展示 |\n| 
基于状态-动作相似性的表征用于离线策略评估 | 海报展示 |\n| 弱耦合的深度Q网络 | 海报展示 |\n| 大型语言模型是半参数化的强化学习智能体 | 海报展示 |\n| 成为分布式的益处：强化学习中的小损失界 | 海报展示 |\n| 在线非随机无模型强化学习 | 海报展示 |\n| 何时无模型强化学习在统计上可行？ | 海报展示 |\n| 带有流式观测的贝叶斯风险厌恶Q学习 | 海报展示 |\n| 重置深度强化学习中的优化器：一项实证研究 | 海报展示 |\n| 利用符号模型估计进行强化学习中的乐观探索 | 海报展示 |\n| 基于策略的平均奖励强化学习算法的性能边界 | 海报展示 |\n| 规律性作为自由玩耍的内在奖励 | 海报展示 |\n| TACO：面向视觉强化学习的时序潜在动作驱动对比损失 | 海报展示 |\n| 面向连续强化学习的策略优化 | 海报展示 |\n| 连续时间控制中的主动观察 | 海报展示 |\n| 可复现的强化学习 | 海报展示 |\n| 探索对于强化学习中泛化的重要性 | 海报展示 |\n| 带有玻尔兹曼探索的蒙特卡洛树搜索 | 海报展示 |\n| 用于安全强化学习的迭代可达性估计 | 海报展示 |\n| 通过对抗性环境设计发现通用强化学习算法 | 海报展示 |\n| 我们在寻找用于具身智能的人工视觉皮层方面进展如何？ | 海报展示 |\n| 逆动力学预训练能够学习适用于多任务模仿的良好表征 | 海报展示 |\n| 强化学习中可解释的奖励再分配：一种因果方法 | 海报展示 |\n| 带有时序注意力的对比模块，用于多任务强化学习 | 海报展示 |\n| 通过重置深度集成智能体实现样本高效且安全的深度强化学习 | 海报展示 |\n| 分布式帕累托最优的多目标强化学习 | 海报展示 |\n| 面向具身智能体的对比提示集成，用于高效策略适应 | 海报展示 |\n| 利用逆动力学双曲度量，基于潜力的强化学习高效探索 | 海报展示 |\n| 结合状态距离信息，迭代学习多样策略 | 海报展示 |\n| 利用价值条件下的状态熵探索加速强化学习 | 海报展示 |\n| 梯度引导的近端策略优化 | 海报展示 |\n| 带有生成模型的强化学习中分布鲁棒性的代价之谜 | 海报展示 |\n| [面向序列决策中高效策略评估的最佳治疗分配](https:\u002F\u002Fopenreview.net\u002Fattachment?id=EcReRm7q9p&name=pdf) | 海报展示 |\n| [Thinker：学习计划与行动](https:\u002F\u002Fopenreview.net\u002Fattachment?id=mumEBl0arj&name=pdf) | 海报展示 |\n| [用更少学习得更好：面向样本高效视觉强化学习的有效增广](https:\u002F\u002Fopenreview.net\u002Fattachment?id=jRL6ErxMVB&name=pdf) | 海报展示 |\n| [使用简单序列先验进行强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=qxF8Pge6vM&name=pdf) | 海报展示 |\n| [预训练的文本到图像模型能否为强化学习生成视觉目标？](https:\u002F\u002Fopenreview.net\u002Fattachment?id=kChEBODIx9&name=pdf) | 海报展示 |\n| [超越均匀采样：不平衡数据集上的离线强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=TW99HrZCJU&name=pdf) | 海报展示 |\n| [CQM：基于量化世界模型的课程式强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=tcotyjon2a&name=pdf) | 海报展示 |\n| [H-InDex：面向灵巧操作的、由人工指导表征的视觉强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=lvvaNwnP6M&name=pdf) | 海报展示 |\n| [Cal-QL：为高效在线微调而设计的校准离线强化学习预训练](https:\u002F\u002Fopenreview.net\u002Fattachment?id=GcEIvidYSw&name=pdf) | 海报展示 |\n| [随时具备竞争力的基于策略先验的强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=FCwfZj1bQl&name=pdf) | 海报展示 |\n| [为离线强化学习编制反事实预算](https:\u002F\u002Fopenreview.net\u002Fattachment?id=1MUxtSBUox&name=pdf) | 海报展示 |\n| [策略优化中的分形景观](https:\u002F\u002Fopenreview.net\u002Fattachment?id=QQidjdmyPp&name=pdf) | 海报展示 |\n| [面向离线强化学习的目标条件预测编码](https:\u002F\u002Fopenreview.net\u002Fattachment?id=IJblKO45YU&name=pdf) | 海报展示 |\n| [SALE：面向深度强化学习的状态-动作表征学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=xZvGrzRq17&name=pdf) | 海报展示 |\n| [采用平均奖励准则的逆强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=YFSrf8aciU&name=pdf) | 海报展示 |\n| [重温离线强化学习的极简主义方法](https:\u002F\u002Fopenreview.net\u002Fattachment?id=vqGWslLeEw&name=pdf) | 海报展示 |\n| [用于离线强化学习的对抗性模型](https:\u002F\u002Fopenreview.net\u002Fattachment?id=6UCMa0Qgej&name=pdf) | 海报展示 |\n| [支持的离线强化学习价值正则化](https:\u002F\u002Fopenreview.net\u002Fpdf?id=fze7P9oy6l) | 海报展示 |\n| [受PID启发的归纳偏置，用于部分可观测控制任务中的深度强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=pKnhUWqZTJ&name=pdf) | 海报展示 |\n| [如何微调模型：统一的模型转移与模型偏置策略优化](https:\u002F\u002Fopenreview.net\u002Fattachment?id=d7a5TpePV7&name=pdf) | 海报展示 |\n| [通过离线预训练的 State-to-Go Transformer，从视觉观察中学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=E58gaxJN1d&name=pdf) | 海报展示 |\n| [描述、解释、计划与选择：LLMs的交互式规划使开放世界多任务智能体成为可能](https:\u002F\u002Fopenreview.net\u002Fattachment?id=KtvPdGb31Z&name=pdf) | 海报展示 |\n| [在层次强化学习中实现稳健的知识转移](https:\u002F\u002Fopenreview.net\u002Fattachment?id=1WMdoiVMov&name=pdf) | 海报展示 |\n| [苦练以求易战：稳健的元强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=JvOZ4IIjwP&name=pdf) | 海报展示 |\n| 
[通过双层优化进行元加权的任务感知世界模型学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=IN3hQx1BrC&name=pdf) | 海报展示 |\n| [视频预测模型作为强化学习的奖励](https:\u002F\u002Fopenreview.net\u002Fattachment?id=HWNl9PAYIP&name=pdf) | 海报展示 |\n| [合成经验回放](https:\u002F\u002Fopenreview.net\u002Fattachment?id=6jNQ1AY1Uf&name=pdf) | 海报展示 |\n| [利用离线数据进行实验设计，对强化学习策略进行微调](https:\u002F\u002Fopenreview.net\u002Fattachment?id=fjXTcUUgaC&name=pdf) | 海报展示 |\n| [学习动态属性因子化的世界模型，以实现高效的多目标强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=bsNslV3Ahe&name=pdf) | 海报展示 |\n| [学习可识别因子分解的世界模型](https:\u002F\u002Fopenreview.net\u002Fattachment?id=6JJq5TW9Mc&name=pdf) | 海报展示 |\n| [利用野外视频为强化学习预训练情境化的世界模型](https:\u002F\u002Fopenreview.net\u002Fattachment?id=8GuEVzAUQS&name=pdf) | 海报展示 |\n| [逆偏好学习：无需奖励函数的基于偏好强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=gAP52Z2dar&name=pdf) | 海报展示 |\n| [理解、预测并更好地解决离线强化学习中的Q值分歧](https:\u002F\u002Fopenreview.net\u002Fattachment?id=71P7ugOGCV&name=pdf) | 海报展示 |\n| [面向强化学习的潜在探索](https:\u002F\u002Fopenreview.net\u002Fattachment?id=JSVXZKqfLU&name=pdf) | 海报展示 |\n| [大型语言模型可以实现策略迭代](https:\u002F\u002Fopenreview.net\u002Fattachment?id=LWxjWoBTsr&name=pdf) | 海报展示 |\n| [推广加权路径一致性以精通Atari游戏](https:\u002F\u002Fopenreview.net\u002Fattachment?id=vHRLS8HhK1&name=pdf) | 海报展示 |\n| [学习环境感知的可供性，以在遮挡条件下操纵3D关节物体](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Re2NHYoZ5l&name=pdf) | 海报展示 |\n| [通过锚定加速价值迭代](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Wn82NbmvJy&name=pdf) | 海报展示 |\n| [针对具有硬约束的连续控制，减少策略优化规模](https:\u002F\u002Fopenreview.net\u002Fattachment?id=fKVEMNmWqU&name=pdf) | 海报展示 |\n| [基于状态正则化的策略优化，应用于存在动态变化的数据](https:\u002F\u002Fopenreview.net\u002Fattachment?id=I8t9RKDnz2&name=pdf) | 海报展示 |\n| [具有差分隐私的离线强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=YVMc3KiWBQ&name=pdf) | 海报展示 |\n| [理解并应对基于双曲度量表征在离线强化学习中的陷阱](https:\u002F\u002Fopenreview.net\u002Fattachment?id=sQyRQjun46&name=pdf) | 海报展示 |\n\n\u003Ca id='ICLR24'>\u003C\u002Fa>\n\n## ICLR24\n| 论文 | 类型 |\n| ---- | ---- |\n| [深度强化学习中的预测性辅助目标模仿大脑中的学习机制](https:\u002F\u002Fopenreview.net\u002Fpdf?id=agPpmEgf8C) | 口头报告 |\n| [基于目标的预训练模型用于样本高效的强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=o2IEmeLL9r) | 口头报告 |\n| [METRA：具有度量感知抽象的可扩展无监督强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=c5pwL0Soay) | 口头报告 |\n| [ASID：机器人操作中用于系统辨识与重建的主动探索](https:\u002F\u002Fopenreview.net\u002Fpdf?id=jNR6s6OSBT) | 口头报告 |\n| [利用世界模型掌握记忆任务](https:\u002F\u002Fopenreview.net\u002Fpdf?id=1vDArHJ68h) | 口头报告 |\n| [基于张量近似的广义策略迭代用于混合控制](https:\u002F\u002Fopenreview.net\u002Fpdf?id=csukJcpYDe) | 展示报告 |\n| [选择性视觉表征提升具身智能的收敛性和泛化能力](https:\u002F\u002Fopenreview.net\u002Fpdf?id=kC5nZDU5zf) | 展示报告 |\n| [AMAGO：面向自适应智能体的可扩展上下文强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=M6XWoEdmwf) | 展示报告 |\n| [通过约束式RLHF应对奖励模型过优化问题](https:\u002F\u002Fopenreview.net\u002Fpdf?id=gkfUvn0fLU) | 展示报告 |\n| [用于质量多样性强化学习的近端策略梯度树状结构](https:\u002F\u002Fopenreview.net\u002Fpdf?id=TFKIfhvdmZ) | 展示报告 |\n| [通过混合启发式方法改进离线强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=MCl0TLboP1) | 展示报告 |\n| [决策ConvFormer：MetaFormer中的局部滤波足以进行决策](https:\u002F\u002Fopenreview.net\u002Fpdf?id=af2c8EaKl8) | 展示报告 |\n| [工具增强的奖励建模](https:\u002F\u002Fopenreview.net\u002Fpdf?id=d94x0gWTUX) | 展示报告 |\n| [奖励一致的动力学模型对离线强化学习具有强大的泛化能力](https:\u002F\u002Fopenreview.net\u002Fpdf?id=GSBHKiw19c) | 展示报告 |\n| [双强化学习：强化学习与模仿学习的统一及新方法](https:\u002F\u002Fopenreview.net\u002Fpdf?id=xt9Bu66rqv) | 展示报告 |\n| [稳定对比强化学习：利用离线数据实现机器人目标达成的技术](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Xkf2EBj4w3) | 展示报告 |\n| 
[安全RLHF：基于人类反馈的安全强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=TyFrPOKYXw) | 展示报告 |\n| [Cross$Q$：深度强化学习中的批归一化以提高样本效率和简洁性](https:\u002F\u002Fopenreview.net\u002Fpdf?id=PczQtTsTIX) | 展示报告 |\n| [结合模仿学习与强化学习以稳健地改进策略](https:\u002F\u002Fopenreview.net\u002Fpdf?id=eJ0dzPJq1F) | 展示报告 |\n| [释放长期新颖性驱动探索中表征的力量](https:\u002F\u002Fopenreview.net\u002Fpdf?id=OwtMhMSybu) | 展示报告 |\n| [面向具身智能体的空间感知Transformer](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Ts95eXsPBc) | 展示报告 |\n| [无需动作即可学会行动](https:\u002F\u002Fopenreview.net\u002Fpdf?id=rvUq3cxpDF) | 展示报告 |\n| [迈向基于视频的原则性表征学习，用于强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=3mnWvUZIXt) | 展示报告 |\n| [TorchRL：PyTorch的数据驱动决策库](https:\u002F\u002Fopenreview.net\u002Fpdf?id=QxItoEAVMb) | 展示报告 |\n| [迈向在多样化数据损坏下的稳健离线强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=5hAMmCU0bK) | 展示报告 |\n| [从离散潜在动力学中学习具有自适应时间抽象的分层世界模型](https:\u002F\u002Fopenreview.net\u002Fpdf?id=TjCDNssXKU) | 展示报告 |\n| [Text2Reward：利用语言模型为强化学习设计奖励](https:\u002F\u002Fopenreview.net\u002Fpdf?id=tUM39YTRxH) | 展示报告 |\n| [通过事后轨迹草图实现机器人任务泛化](https:\u002F\u002Fopenreview.net\u002Fpdf?id=F1TKzG8LJO) | 展示报告 |\n| [次模强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=loYSzjSaAK) | 展示报告 |\n| [偏好型强化学习中的查询-策略不匹配](https:\u002F\u002Fopenreview.net\u002Fpdf?id=UoBymIwPJR) | 展示报告 |\n| [核度量学习用于确定性强化学习策略的样本内离策略评估](https:\u002F\u002Fopenreview.net\u002Fpdf?id=plebgsdiiV) | 展示报告 |\n| [GenSim：利用大型语言模型生成机器人仿真任务](https:\u002F\u002Fopenreview.net\u002Fpdf?id=OI3RoHoWAN) | 展示报告 |\n| [以实体为中心的强化学习：基于像素的对象操作](https:\u002F\u002Fopenreview.net\u002Fpdf?id=uDxeSZ1wdI) | 展示报告 |\n| [幻觉攻击：序列决策者对抗攻击中的可检测性至关重要](https:\u002F\u002Fopenreview.net\u002Fpdf?id=F5dhGCdyYh) | 展示报告 |\n| [解决深度强化学习中的信号延迟问题](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Z8UfDs4J46) | 展示报告 |\n| [DrM：通过最小化休眠比率掌握视觉强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=MSe8YFbhUE) | 展示报告 |\n| [从技能到任务适应：信息几何、解耦与无监督强化学习的新目标](https:\u002F\u002Fopenreview.net\u002Fpdf?id=zSxpnKh1yS) | 展示报告 |\n| [$\\mathcal{B}$-Coder：关于基于价值的深度强化学习在程序合成中的应用](https:\u002F\u002Fopenreview.net\u002Fpdf?id=fLf589bx1f) | 展示报告 |\n| [受物理规律约束的深度强化学习：不变嵌入](https:\u002F\u002Fopenreview.net\u002Fpdf?id=5Dwqu5urzs) | 展示报告 |\n| [Retroformer：带有策略梯度优化的回顾性大型语言模型智能体](https:\u002F\u002Fopenreview.net\u002Fpdf?id=KOZu91CzbK) | 展示报告 |\n| [通过密集对应关系从无动作视频中学习行动](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Mhb5fpA1T0) | 展示报告 |\n| [CivRealm：文明中的学习与推理奥德赛，用于决策智能体](https:\u002F\u002Fopenreview.net\u002Fpdf?id=UBVNwD3hPN) | 展示报告 |\n| [TD-MPC2：适用于连续控制的可扩展、鲁棒的世界模型](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Oxh5CstDJU) | 展示报告 |\n| [面向基于物理控制的通用人形运动表征](https:\u002F\u002Fopenreview.net\u002Fpdf?id=OrOd8PxOO2) | 展示报告 |\n| [自适应理性激活函数助力深度强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=g90ysX1sVs) | 展示报告 |\n| [通过有界理性课程实现稳健的对抗性强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=pFOoOdaiue) | 展示报告 |\n| [局部敏感稀疏编码用于在线学习世界模型](https:\u002F\u002Fopenreview.net\u002Fpdf?id=i8PjQT3Uig) | 海报展示 |\n| [关于基于模型与无模型强化学习的表征复杂性](https:\u002F\u002Fopenreview.net\u002Fpdf?id=3K3s9qxSn7) | 海报展示 |\n| [策略排练：训练可泛化的强化学习策略](https:\u002F\u002Fopenreview.net\u002Fpdf?id=m3xVPaZp6Z) | 海报展示 |\n| [对你来说什么重要？迈向机器人学习的视觉表征对齐](https:\u002F\u002Fopenreview.net\u002Fpdf?id=CTlUHIKF71) | 海报展示 |\n| [利用优势导向的离策略策略梯度改进语言模型](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ZDGKPbF0VQ) | 海报展示 |\n| [用强化学习训练扩散模型](https:\u002F\u002Fopenreview.net\u002Fpdf?id=YCWjhGrJFD) | 海报展示 |\n| [奖励不一致性对RLHF的涓滴效应](https:\u002F\u002Fopenreview.net\u002Fpdf?id=MeHmwCDifc) | 海报展示 |\n| [强化学习中的最大熵模型修正](https:\u002F\u002Fopenreview.net\u002Fpdf?id=kNpSUN0uCc) | 海报展示 |\n| 
[基于树搜索的策略优化，考虑随机执行延迟](https:\u002F\u002Fopenreview.net\u002Fpdf?id=RaqZX9LSGA) | 海报展示 |\n| [使用观测历史的离线强化学习：分析并改进样本复杂度](https:\u002F\u002Fopenreview.net\u002Fpdf?id=GnOLWS4Llt) | 海报展示 |\n| [理解偏好学习中的隐藏背景：对RLHF的影响](https:\u002F\u002Fopenreview.net\u002Fpdf?id=0tWTxYYPnW) | 海报展示 |\n| [Eureka：通过大型语言模型编程实现人类水平的奖励设计](https:\u002F\u002Fopenreview.net\u002Fpdf?id=IEduRUO55F) | 海报展示 |\n| [检索引导的强化学习用于布尔电路最小化](https:\u002F\u002Fopenreview.net\u002Fpdf?id=0t1O8ziRZp) | 海报展示 |\n| [用于离线目标条件强化学习的分数模型](https:\u002F\u002Fopenreview.net\u002Fpdf?id=oXjnwQLcTA) | 海报展示 |\n| [对比差异预测编码](https:\u002F\u002Fopenreview.net\u002Fpdf?id=0akLDTFR9x) | 海报展示 |\n| [基于人类偏好学习奖励的 hindsight PRIORs](https:\u002F\u002Fopenreview.net\u002Fpdf?id=NLevOah0CJ) | 海报展示 |\n| [奖励模型集成有助于缓解过优化问题](https:\u002F\u002Fopenreview.net\u002Fpdf?id=dcjtMYkpXx) | 海报展示 |\n| [利用可行性引导的扩散模型实现安全的离线强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=j5JvZCaDM0) | 海报展示 |\n| [组合保守主义：一种转导式的离线强化学习方法](https:\u002F\u002Fopenreview.net\u002Fpdf?id=HRkyLbBRHI) | 海报展示 |\n| [流向更好：通过生成偏好轨迹进行基于偏好离线强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=EG68RSznLT) | 海报展示 |\n| [PAE：从外部知识中获取强化学习信息以高效探索](https:\u002F\u002Fopenreview.net\u002Fpdf?id=R7rZUSGOPD) | 海报展示 |\n| [识别策略梯度子空间](https:\u002F\u002Fopenreview.net\u002Fpdf?id=iPWxqnt2ke) | 海报展示 |\n| [PARL：强化学习中策略对齐的统一框架](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ByR3NdDSZB) | 海报展示 |\n| [SafeDreamer：基于世界模型的安全强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=tsE5HLYtYg) | 海报展示 |\n| [语言模型强化微调中的梯度消失现象](https:\u002F\u002Fopenreview.net\u002Fpdf?id=IcVNBR7qZi) | 海报展示 |\n| [强化学习中的古德哈特定律](https:\u002F\u002Fopenreview.net\u002Fpdf?id=5o9G4XF1LI) | 海报展示 |\n| [通过扩散行为进行得分正则化的策略优化](https:\u002F\u002Fopenreview.net\u002Fpdf?id=xCRr9DrolJ) | 海报展示 |\n| [通过随机化使基于偏好反馈的强化学习更高效](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Pe2lo3QOvo) | 海报展示 |\n| [一致性模型作为强化学习丰富且高效的策略类别](https:\u002F\u002Fopenreview.net\u002Fpdf?id=v8jdwkUNXb) | 海报展示 |\n| [对比偏好学习：无需强化学习即可从人类反馈中学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=iX1RjVQODj) | 海报展示 |\n| [特权感知支撑强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=EpVe8jAjdx) | 海报展示 |\n| [从语言中学习规划抽象](https:\u002F\u002Fopenreview.net\u002Fpdf?id=3UWuFoksGb) | 海报展示 |\n| [CrossLoco：通过引导式无监督强化学习，以人类运动驱动足式机器人的控制](https:\u002F\u002Fopenreview.net\u002Fpdf?id=UCfz492fM8) | 海报展示 |\n| [利用Koopman理论在交互环境中高效建模动力学](https:\u002F\u002Fopenreview.net\u002Fpdf?id=fkrYDQaHOJ) | 海报展示 |\n| [Jumanji：JAX中多样化的可扩展强化学习环境套件](https:\u002F\u002Fopenreview.net\u002Fpdf?id=C4CxQmp9wc) | 海报展示 |\n| [利用强化学习和Transformer寻找高价值分子](https:\u002F\u002Fopenreview.net\u002Fpdf?id=nqlymMx42E) | 海报展示 |\n| [通过强化学习私下对齐语言模型](https:\u002F\u002Fopenreview.net\u002Fpdf?id=3d0OmYTNui) | 海报展示 |\n| [Legged Locomotion的HIM解决方案：最少传感器、高效学习和卓越敏捷性](https:\u002F\u002Fopenreview.net\u002Fpdf?id=93LoCyww8o) | 海报展示 |\n| [S$2$AC：基于能量的强化学习，采用Stein Soft Actor Critic](https:\u002F\u002Fopenreview.net\u002Fpdf?id=rAHcTCMaLc) | 海报展示 |\n| [跨实验回放：离策略强化学习的自然延伸](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Nf4Lm6fXN8) | 海报展示 |\n| [策略的分段线性参数化：迈向可解释的深度强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=hOMVq57Ce0) | 海报展示 |\n| [利用随机状态策略实现时间高效的强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=5liV2xUdJL) | 海报展示 |\n| [打开黑箱：基于步骤的策略更新，用于时序相关的周期性强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=mnipav175N) | 海报展示 |\n| [关于离策略评估的轨迹增广](https:\u002F\u002Fopenreview.net\u002Fpdf?id=eMNN0wIyVw) | 海报展示 |\n| [释放大规模视频生成式预训练在视觉机器人操作中的潜力](https:\u002F\u002Fopenreview.net\u002Fpdf?id=NxoFmGgWC9) | 海报展示 |\n| [理解RLHF对LLM泛化和多样性的影响](https:\u002F\u002Fopenreview.net\u002Fpdf?id=PXD3FAVHJT) | 海报展示 |\n| 
[德尔菲式离线强化学习，在不可识别的隐藏混杂因素下](https:\u002F\u002Fopenreview.net\u002Fpdf?id=lUYY2qsRTI) | 海报展示 |\n| [优先级软Q分解用于词典式强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=c0MyyXyGfn) | 海报展示 |\n| [基于集成的探索中的多样性诅咒](https:\u002F\u002Fopenreview.net\u002Fpdf?id=M3QXCOTTk4) | 海报展示 |\n| [离策略原对偶安全强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=vy42bYs1Wo) | 海报展示 |\n| [STARC：量化奖励函数之间差异的通用框架](https:\u002F\u002Fopenreview.net\u002Fpdf?id=wPhbtwlCDa) | 海报展示 |\n| [释放预训练语言模型在离线强化学习中的力量](https:\u002F\u002Fopenreview.net\u002Fpdf?id=AY6aM13gGF) | 海报展示 |\n| [发现时间感知的强化学习算法](https:\u002F\u002Fopenreview.net\u002Fpdf?id=MJJcs3zbmi) | 海报展示 |\n| [重新审视深度强化学习中的数据增强](https:\u002F\u002Fopenreview.net\u002Fpdf?id=EGQBpkIEuu) | 海报展示 |\n| [无奖励课程用于训练稳健的世界模型](https:\u002F\u002Fopenreview.net\u002Fpdf?id=eCGpNGDeNu) | 海报展示 |\n| [CPPO：结合人类反馈的持续强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=86zAUE80pP) | 海报展示 |\n| [关于离线强化学习泛化的一次研究](https:\u002F\u002Fopenreview.net\u002Fpdf?id=3w6xuXDOdY) | 海报展示 |\n| [RLIF：交互式模仿学习即强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=oLLZhbBSOU) | 海报展示 |\n| [逆向约束强化学习中不确定性驱动的约束推断](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ILYjDvUM6U) | 海报展示 |\n| [迈向为MIP分支的模仿学习：一种基于混合强化学习的样本增广方法](https:\u002F\u002Fopenreview.net\u002Fpdf?id=NdcQQ82mfy) | 海报展示 |\n| [迈向评估和基准测试离策略评估的风险-收益权衡](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ycF7mKfVGO) | 海报展示 |\n| [Uni-O4：通过多步在线策略优化统一线上和线下深度强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=tbFBh3LMKi) | 海报展示 |\n| [摆脱贝尔曼完备性：基于模型的回报条件监督学习实现轨迹拼接](https:\u002F\u002Fopenreview.net\u002Fpdf?id=7zY781bMDO) | 海报展示 |\n| [重新审视视觉强化学习中的可塑性：数据、模块和训练阶段](https:\u002F\u002Fopenreview.net\u002Fpdf?id=0aR1s9YxoL) | 海报展示 |\n| [通过策略合并实现机器人舰队学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=IL71c1z7et) | 海报展示 |\n| [通过创建静止目标改善内在探索](https:\u002F\u002Fopenreview.net\u002Fpdf?id=YbZxT0SON4) | 海报展示 |\n| [Motif：来自人工智能反馈的内在动机](https:\u002F\u002Fopenreview.net\u002Fpdf?id=tmBKIecDE9) | 海报展示 |\n| [理解何时动力学不变的数据增广有利于无模型强化学习更新](https:\u002F\u002Fopenreview.net\u002Fpdf?id=sVEu295o70) | 海报展示 |\n| [RLCD：基于对比蒸馏的强化学习，用于语言模型对齐](https:\u002F\u002Fopenreview.net\u002Fpdf?id=v3XXtxWKi6) | 海报展示 |\n| [在离线强化学习中利用潜扩散进行推理](https:\u002F\u002Fopenreview.net\u002Fpdf?id=tGQirjzddO) | 海报展示 |\n| [信念增强的悲观Q学习，抵御对手的状态扰动](https:\u002F\u002Fopenreview.net\u002Fpdf?id=7gDENzTzw1) | 海报展示 |\n| [为正当的序列决策设计奖励](https:\u002F\u002Fopenreview.net\u002Fpdf?id=OUkZXbbwQr) | 海报展示 |\n| [MAMBA：一种有效的元强化学习世界模型方法](https:\u002F\u002Fopenreview.net\u002Fpdf?id=1RE0H6mU7M) | 海报展示 |\n| [LOQA：具备对手Q学习意识的学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=FDQF6A1s6M) | 海报展示 |\n| [用于免重置强化学习的智能切换](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Nq45xeghcL) | 海报展示 |\n| [真正的知识源于实践：通过强化学习将大型语言模型与具身环境对齐](https:\u002F\u002Fopenreview.net\u002Fpdf?id=hILVmJ4Uvu) | 海报展示 |\n| [技能机器：强化学习中的时序逻辑技能组合](https:\u002F\u002Fopenreview.net\u002Fpdf?id=qiduMcw3CU) | 海报展示 |\n| [Uni-RLHF：一个通用平台和基准套件，用于接受多样化人类反馈的强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=WesY0H9ghM) | 海报展示 |\n| [视觉-语言模型是零样本强化学习的奖励模型](https:\u002F\u002Fopenreview.net\u002Fpdf?id=N0I2RtD8je) | 海报展示 |\n| [DittoGym：学习控制柔软变形机器人](https:\u002F\u002Fopenreview.net\u002Fpdf?id=MpyFAhH9CK) | 海报展示 |\n| [将正则化与动作空间解耦](https:\u002F\u002Fopenreview.net\u002Fpdf?id=UaMgmoKEBj) | 海报展示 |\n| [Plan-Seq-Learn：语言模型引导的强化学习，用于解决长期机器人任务](https:\u002F\u002Fopenreview.net\u002Fpdf?id=hQVCCxQrYN) | 海报展示 |\n| [利用$\\mathcal{L}_1$自适应控制实现稳健的基于模型强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=GaLCLvJaoF) | 海报展示 |\n| [DMBP：基于扩散模型的预测器，用于抵御状态观测扰动的稳健离线强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ZULjcYLWKe) | 海报展示 |\n| 
[LoTa-Bench：面向具身智能体的语言导向任务规划者基准测试](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ADSxCpCu9s) | 海报展示 |\n| [通过自动归纳任务子结构将规划与深度强化学习相结合](https:\u002F\u002Fopenreview.net\u002Fpdf?id=PR6RMsxuW7) | 海报展示 |\n| [从泛化角度弥合TD学习与监督学习之间的差距](https:\u002F\u002Fopenreview.net\u002Fpdf?id=qg5JENs0N4) | 海报展示 |\n| [COPlanner：为基于模型的强化学习制定保守推进但乐观探索的计划](https:\u002F\u002Fopenreview.net\u002Fpdf?id=jnFcKjtUPN) | 海报展示 |\n| [π2vec：利用后继特征表示策略](https:\u002F\u002Fopenreview.net\u002Fpdf?id=o5Bqa4o5Mi) | 海报展示 |\n| [部分可观测条件下视觉房间重新布局的任务规划](https:\u002F\u002Fopenreview.net\u002Fpdf?id=jJvXNpvOdM) | 海报展示 |\n| [DreamSmooth：通过奖励平滑改进基于模型的强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=GruDNzQ4ux) | 海报展示 |\n| [元逆向约束强化学习：收敛保证与泛化分析](https:\u002F\u002Fopenreview.net\u002Fpdf?id=bJ3gFiwRgi) | 海报展示 |\n| [Cleanba：一个可重复且高效的分布式强化学习平台](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Diq6urt3lS) | 海报展示 |\n| [受意识启发的时空抽象，用于提升强化学习的泛化能力](https:\u002F\u002Fopenreview.net\u002Fpdf?id=eo9dHwtTFt) | 海报展示 |\n| [用于序列学习的神经网络函数空间参数化](https:\u002F\u002Fopenreview.net\u002Fpdf?id=2dhxxIKhqz) | 海报展示 |\n| [我们何时应优先选择决策Transformer用于离线强化学习？](https:\u002F\u002Fopenreview.net\u002Fpdf?id=vpV7fOFQy4) | 海报展示 |\n| [连接状态与历史表征：理解自我预测式强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=ms0VgzSGF2) | 海报展示 |\n| [具身主动防御：利用循环反馈对抗敌方补丁](https:\u002F\u002Fopenreview.net\u002Fpdf?id=uXjfOmTiDt) | 海报展示 |\n| [风格化的离线强化学习：从异质数据集中提取多样且高质量的行为](https:\u002F\u002Fopenreview.net\u002Fpdf?id=rnHNDihrIT) | 海报展示 |\n| [使用合成数据进行预训练有助于离线强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=PcxQgtHGj2) | 海报展示 |\n| [基于查询的提示评估与优化，结合离线逆向强化学习](https:\u002F\u002Fopenreview.net\u002Fpdf?id=N6o0ZtPzTg) | 海报展示 |\n| [一种简单的解决方案，用于从可能不完整的轨迹观测和示例中进行离线模仿](https:\u002F\u002Fopenreview.net\u002Fattachment?id=805CW5w2CY&name=pdf) | 海报展示 |\n| [利用变分反事实推理进行离线模仿学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=6d9Yxttb3w&name=pdf) | 海报展示 |\n| [阅读并收获成果：借助说明书学习玩Atari游戏](https:\u002F\u002Fopenreview.net\u002Fattachment?id=qP0Drg2HuH&name=pdf) | 海报展示 |\n| [具有快速遗忘记忆的强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=KTfAtro6vP&name=pdf) | 海报展示 |\n| [在有限视觉可观测性下的主动视觉强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=j2oYaFpbrB&name=pdf) | 海报展示 |\n| [用于高效人类反馈强化学习的顺序偏好排序](https:\u002F\u002Fopenreview.net\u002Fattachment?id=MIYBTjCVjR&name=pdf) | 海报展示 |\n| [用于多模态视觉强化学习的层次化自适应价值估计](https:\u002F\u002Fopenreview.net\u002Fattachment?id=jB4wsc1DQW&name=pdf) | 海报展示 |\n| [弹性决策Transformer](https:\u002F\u002Fopenreview.net\u002Fattachment?id=RMeQjexaRj&name=pdf) | 海报展示 |\n| [重视程度导向的协同教学，用于离线基于模型的优化](https:\u002F\u002Fopenreview.net\u002Fattachment?id=OvPnc5kVsb&name=pdf) | 海报展示 |\n| [平行辅导用于离线基于模型的优化](https:\u002F\u002Fopenreview.net\u002Fattachment?id=tJwyg9Zg9G&name=pdf) | 海报展示 |\n| [离线强化学习中的问责制：用实例语料库解释决策](https:\u002F\u002Fopenreview.net\u002Fattachment?id=kmbG9iBRIb&name=pdf) | 海报展示 |\n\n\u003Ca id='ICML24'>\u003C\u002Fa>\n\n## ICML24\n| 论文 | 类型 |\n| ---- | ---- |\n| [停止回归：通过分类训练价值函数以实现可扩展的深度强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=dVpFKfqF3R&name=pdf) | 口头报告 |\n| [立场：自动环境塑造是强化学习的下一个前沿](https:\u002F\u002Fopenreview.net\u002Fattachment?id=dslUyy1rN4&name=pdf) | 口头报告 |\n| [ACE：具有因果感知熵正则化的离策略 Actor-Critic 算法](https:\u002F\u002Fopenreview.net\u002Fattachment?id=1puvYh729M&name=pdf) | 口头报告 |\n| [DPO 是否优于 PPO 用于 LLM 对齐？一项全面研究](https:\u002F\u002Fopenreview.net\u002Fattachment?id=6XH8R7YrSk&name=pdf) | 口头报告 |\n| [SAPG：拆分与聚合策略梯度](https:\u002F\u002Fopenreview.net\u002Fattachment?id=4dOJAfXhNV&name=pdf) | 口头报告 |\n| [逆向强化学习中的环境设计](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Ar0dsOMStE&name=pdf) 
| 口头报告 |\n| [OMPO：一种统一的框架，用于处理策略和动态变化下的强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=R83VIZtHXA&name=pdf) | 口头报告 |\n| [用语言学习建模世界](https:\u002F\u002Fopenreview.net\u002Fattachment?id=7dP6Yq9Uwv&name=pdf) | 口头报告 |\n| [离策略 Actor-Critic 强化学习可扩展至大型模型](https:\u002F\u002Fopenreview.net\u002Fattachment?id=tl2qmO5kpD&name=pdf) | 口头报告 |\n| [用于可扩展持续强化学习的自组合策略](https:\u002F\u002Fopenreview.net\u002Fattachment?id=f5gtX2VWSB&name=pdf) | 口头报告 |\n| [Genie：生成式交互环境](https:\u002F\u002Fopenreview.net\u002Fattachment?id=bJbSbJskOS&name=pdf) | 口头报告 |\n| [基于功能奖励编码的无监督零样本强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=a6wCNfIj8E&name=pdf) | 展示报告 |\n| [Craftax：一个超快速的开放式强化学习基准测试](https:\u002F\u002Fopenreview.net\u002Fattachment?id=hg4wXlrQCV&name=pdf) | 展示报告 |\n| [专家混合模型解锁深度强化学习的参数扩展性](https:\u002F\u002Fopenreview.net\u002Fattachment?id=X9VMhfFxwn&name=pdf) | 展示报告 |\n| [RICE：利用解释突破强化学习的训练瓶颈](https:\u002F\u002Fopenreview.net\u002Fattachment?id=PKJqsZD5nQ&name=pdf) | 展示报告 |\n| [代码即奖励：利用视觉语言模型赋能强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=6P88DMUDvH&name=pdf) | 展示报告 |\n| [EfficientZero V2：在数据有限的情况下掌握离散和连续控制](https:\u002F\u002Fopenreview.net\u002Fattachment?id=LHGMXcr6zx&name=pdf) | 展示报告 |\n| [基于潜在动作的行为生成](https:\u002F\u002Fopenreview.net\u002Fattachment?id=hoVwecMqV5&name=pdf) | 展示报告 |\n| [Actor-Critic 中的过估计、过拟合与可塑性：强化学习的惨痛教训](https:\u002F\u002Fopenreview.net\u002Fattachment?id=5vZzmCeTYu&name=pdf) | 海报展示 |\n| [先难后易：通过任务调度进行多任务强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=haUOhXo70o&name=pdf) | 海报展示 |\n| [通过强化学习实现检索增强型大型语言模型的可信对齐](https:\u002F\u002Fopenreview.net\u002Fattachment?id=XwnABAdH5y&name=pdf) | 海报展示 |\n| [跨多个环境的部分识别治疗效应的元学习者](https:\u002F\u002Fopenreview.net\u002Fattachment?id=s5PLISyNyP&name=pdf) | 海报展示 |\n| [如何用信念探索：部分可观测马尔可夫决策过程中的状态熵最大化](https:\u002F\u002Fopenreview.net\u002Fattachment?id=LbcNAIgNnB&name=pdf) | 海报展示 |\n| [PIPER：基于事后重标记的原语信息偏好式层次强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=l6Hef6FVd0&name=pdf) | 海报展示 |\n| [使用不完美示范的迭代正则化策略优化](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Gp5F6qzwGK&name=pdf) | 海报展示 |\n| [用于具身学习实时决策的傅里叶控制器网络](https:\u002F\u002Fopenreview.net\u002Fpdf?id=nDps3Q8j2l) | 海报展示 |\n| [通过反向课程强化学习训练大型语言模型进行推理](https:\u002F\u002Fopenreview.net\u002Fattachment?id=t82Y3fmRtk&name=pdf) | 海报展示 |\n| [AD3：隐式动作是世界模型区分多样化视觉干扰的关键](https:\u002F\u002Fopenreview.net\u002Fattachment?id=ZwrfsrCduj&name=pdf) | 海报展示 |\n| [DRED：通过数据驱动的环境设计实现强化学习的零样本迁移](https:\u002F\u002Fopenreview.net\u002Fattachment?id=uku9r6RROl&name=pdf) | 海报展示 |\n| [使用卷积注入器调整预训练 ViT 模型以用于视觉-运动控制](https:\u002F\u002Fopenreview.net\u002Fattachment?id=CuiRGtVI55&name=pdf) | 海报展示 |\n| [无退化策略优化：无需退化的语言模型强化学习微调](https:\u002F\u002Fopenreview.net\u002Fattachment?id=lwTshcWlmB&name=pdf) | 海报展示 |\n| [能量引导的扩散采样用于离线到在线强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=hunSEjeCPE&name=pdf) | 海报展示 |\n| [RVI-SAC：平均奖励离策略深度强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=xB6YJZOKyT&name=pdf) | 海报展示 |\n| [基于对比能量学习的离线转移建模](https:\u002F\u002Fopenreview.net\u002Fattachment?id=dqpg8jdA2w&name=pdf) | 海报展示 |\n| [针对混杂 POMDP 的基于模型的强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=DlR8fWgJRl&name=pdf) | 海报展示 |\n| [重新审视强化学习应用中的可扩展海森矩阵对角近似](https:\u002F\u002Fopenreview.net\u002Fattachment?id=yrFUJzcTsk&name=pdf) | 海报展示 |\n| [绝对策略优化：以高置信度提升性能下界](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Ss3h1ixJAU&name=pdf) | 海报展示 |\n| [鲁棒于分布偏移的元强化学习：通过终身上下文学习实现](https:\u002F\u002Fopenreview.net\u002Fattachment?id=laIOUtstMs&name=pdf) | 海报展示 |\n| 
[DIDI：离线行为生成中的扩散引导多样性](https:\u002F\u002Fopenreview.net\u002Fattachment?id=8296yUBoXr&name=pdf) | 海报展示 |\n| [技能何时有助于强化学习？时间抽象的理论分析](https:\u002F\u002Fopenreview.net\u002Fattachment?id=39UqOkTjFn&name=pdf) | 海报展示 |\n| [BeigeMaps：基于图像的强化学习行为特征映射](https:\u002F\u002Fopenreview.net\u002Fattachment?id=myCgfQZzbc&name=pdf) | 海报展示 |\n| [物理信息神经网络策略迭代：算法、收敛性和验证](https:\u002F\u002Fopenreview.net\u002Fattachment?id=sZla6SnooP&name=pdf) | 海报展示 |\n| [RL-VLM-F：基于视觉语言基础模型反馈的强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=YSoMmNWZZx&name=pdf) | 海报展示 |\n| [RoboDreamer：为机器人想象学习组合式世界模型](https:\u002F\u002Fopenreview.net\u002Fattachment?id=kHjOmAUfVe&name=pdf) | 海报展示 |\n| [探究基于视觉的强化学习中用于泛化的预训练目标](https:\u002F\u002Fopenreview.net\u002Fattachment?id=OiI12sNbgD&name=pdf) | 海报展示 |\n| [RoboGen：通过生成式仿真释放无限数据以实现自动化机器人学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=SQIDlJd3hN&name=pdf) | 海报展示 |\n| [协处理器 Actor-Critic：一种基于模型的强化学习方法，用于自适应脑刺激](https:\u002F\u002Fopenreview.net\u002Fattachment?id=t3SEfoTaYQ&name=pdf) | 海报展示 |\n| [通过策略梯度训练 GFlowNet](https:\u002F\u002Fopenreview.net\u002Fattachment?id=G1igwiBBUj&name=pdf) | 海报展示 |\n| [基于价值进化论的强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=XobPpcN4yZ&name=pdf) | 海报展示 |\n| [PEARL：零样本跨任务偏好对齐及鲁棒奖励学习，用于机器人操作](https:\u002F\u002Fopenreview.net\u002Fattachment?id=0urN0PnNDj&name=pdf) | 海报展示 |\n| [面向安全强化学习的一致可行性表征学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=JNHK11bAGl&name=pdf) | 海报展示 |\n| [蒸馏形态条件超网络以实现高效通用形态控制](https:\u002F\u002Fopenreview.net\u002Fattachment?id=WjvEvYTy3w&name=pdf) | 海报展示 |\n| [模型不匹配下的约束强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=GcW9pg4P9x&name=pdf) | 海报展示 |\n| [在离线强化学习中从单个任务发现多种解](https:\u002F\u002Fopenreview.net\u002Fattachment?id=j6rG1ETRyu&name=pdf) | 海报展示 |\n| [学习在无界状态空间中稳定在线强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=64fdhmogiD&name=pdf) | 海报展示 |\n| [在令牌世界中学习玩 Atari](https:\u002F\u002Fopenreview.net\u002Fattachment?id=w8BnKGFIYN&name=pdf) | 海报展示 |\n| [突破壁垒：平滑 DRL 智能体的效用与鲁棒性提升](https:\u002F\u002Fopenreview.net\u002Fattachment?id=WJ5fJhwvCl&name=pdf) | 海报展示 |\n| [具有形式可解释性的概率约束强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Zo9zXdVhW2&name=pdf) | 海报展示 |\n| [Hieros：基于结构化状态序列的世界模型的层次想象](https:\u002F\u002Fopenreview.net\u002Fattachment?id=IUBhvyJ9Sr&name=pdf) | 海报展示 |\n| [深度强化学习中的随机潜在探索](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Y9qzwNlKVU&name=pdf) | 海报展示 |\n| [面向参数化动作空间的基于模型的强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=xW79geE0RA&name=pdf) | 海报展示 |\n| [置信感知的逆向约束强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=6TCeizkLJV&name=pdf) | 海报展示 |\n| [n-step 回报的平均降低了强化学习的方差](https:\u002F\u002Fopenreview.net\u002Fattachment?id=jM9A3Kz6Ki&name=pdf) | 海报展示 |\n| [立场：呼吁发展具身人工智能](https:\u002F\u002Fopenreview.net\u002Fattachment?id=e5admkWKgV&name=pdf) | 海报展示 |\n| [直接聚类：利用聚类和预训练表征进行高维探索的方法](https:\u002F\u002Fopenreview.net\u002Fattachment?id=cXBPPfNUZJ&name=pdf) | 海报展示 |\n| [多目标强化学习的极大极小公式：从理论到无模型算法](https:\u002F\u002Fopenreview.net\u002Fattachment?id=cY9g0bwiZx&name=pdf) | 海报展示 |\n| [技能集优化：通过可迁移技能强化语言模型行为](https:\u002F\u002Fopenreview.net\u002Fattachment?id=9laB7ytoMp&name=pdf) | 海报展示 |\n| [离线增强的 Actor-Critic：在深度离策略强化学习中自适应融合最优历史行为](https:\u002F\u002Fopenreview.net\u002Fattachment?id=7joG3i2pUR&name=pdf) | 海报展示 |\n| [序列压缩加速了强化学习中的信用分配](https:\u002F\u002Fopenreview.net\u002Fattachment?id=NlM4gp8hyO&name=pdf) | 海报展示 |\n| [把握偶然性：在离策略 Actor-Critic 中利用过去成功的价值](https:\u002F\u002Fopenreview.net\u002Fattachment?id=9Tq4L3Go9f&name=pdf) | 海报展示 |\n| 
[通过上下文学习将泛化能力扩展到新的序列决策任务](https:\u002F\u002Fopenreview.net\u002Fattachment?id=lVQ4FUZ6dp&name=pdf) | 海报展示 |\n| [离线强化学习的简单要素](https:\u002F\u002Fopenreview.net\u002Fattachment?id=japBn31gXC&name=pdf) | 海报展示 |\n| [具有上下文感知标记的高效世界模型](https:\u002F\u002Fopenreview.net\u002Fattachment?id=BiWIERWBFX&name=pdf) | 海报展示 |\n| [在基于价值的深度强化学习中，剪枝后的网络才是好网络](https:\u002F\u002Fopenreview.net\u002Fattachment?id=seo9V9QRZp&name=pdf) | 海报展示 |\n| [面向层次强化学习的概率子目标表征](https:\u002F\u002Fopenreview.net\u002Fattachment?id=b6AwZauZPV&name=pdf) | 海报展示 |\n| [Premier-TACO 是一种少样本策略学习器：通过时间驱动的动作对比损失预训练多任务表征](https:\u002F\u002Fopenreview.net\u002Fattachment?id=KSNl7VgeVr&name=pdf) | 海报展示 |\n| [理解和诊断深度强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=s9RKqT7jVM&name=pdf) | 海报展示 |\n| [追求极致：重塑强化学习中的奖励](https:\u002F\u002Fopenreview.net\u002Fattachment?id=4KQ0VwqPg8&name=pdf) | 海报展示 |\n| [ReLU 来救场：用积极优势改进您的在策略 Actor-Critic](https:\u002F\u002Fopenreview.net\u002Fattachment?id=3umNqxjFad&name=pdf) | 海报展示 |\n| [适用于大型离散动作空间的随机 Q 学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=HPQaMmABgK&name=pdf) | 海报展示 |\n| [可行可达策略迭代](https:\u002F\u002Fopenreview.net\u002Fattachment?id=ks8qSwkkuZ&name=pdf) | 海报展示 |\n| [立场：视频是现实世界决策的新语言](https:\u002F\u002Fopenreview.net\u002Fattachment?id=EZH4CsKV6O&name=pdf) | 海报展示 |\n| [学习用于世界模型的稳健潜在动态表征](https:\u002F\u002Fopenreview.net\u002Fattachment?id=C4jkx6AgWc&name=pdf) | 海报展示 |\n| [Reinformer：面向离线 RL 的最大回报序列建模](https:\u002F\u002Fopenreview.net\u002Fattachment?id=mBc8Pestd5&name=pdf) | 海报展示 |\n| [重新思考 Transformer 在解决 POMDP 问题中的作用](https:\u002F\u002Fopenreview.net\u002Fpdf?id=SyY7ScNpGL) | 海报展示 |\n| [单轨迹分布鲁棒强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=3B6vmW2L80&name=pdf) | 海报展示 |\n| [信任模型所信任的地方——基于模型的 Actor-Critic 与不确定性感知的滚动预测适应](https:\u002F\u002Fopenreview.net\u002Fattachment?id=N0ntTjTfHb&name=pdf) | 海报展示 |\n| [基于最小化极大值原则的人工智能反馈强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=5kVgd2MwMY&name=pdf) | 海报展示 |\n| [EvoRainbow：结合进化强化学习的改进以进行策略搜索](https:\u002F\u002Fopenreview.net\u002Fattachment?id=75Hes6Zse4&name=pdf) | 海报展示 |\n| [SeMOPO：从低质量离线视觉数据集中学习高质量模型和策略](https:\u002F\u002Fopenreview.net\u002Fattachment?id=ZtOXZCTgBa&name=pdf) | 海报展示 |\n| [自适应梯度策略优化：提升非光滑可微模拟中的策略学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=S9DV6ZP4eE&name=pdf) | 海报展示 |\n| [从人类反馈中免费获得密集奖励](https:\u002F\u002Fopenreview.net\u002Fattachment?id=eyxVRMrZ4m&name=pdf) | 海报展示 |\n| [可配置镜像下降：迈向决策统一](https:\u002F\u002Fopenreview.net\u002Fpdf?id=U841CrDUx9) | 海报展示 |\n| [平衡短期与长期奖励的策略学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=7Qf1uHTahP&name=pdf) | 海报展示 |\n| [通过辅助奖励智能体进行强化学习的奖励塑造](https:\u002F\u002Fopenreview.net\u002Fattachment?id=a3XFF0PGLU&name=pdf) | 海报展示 |\n| [基于均值嵌入的分布贝尔曼算子](https:\u002F\u002Fopenreview.net\u002Fattachment?id=j2pLfsBm4J&name=pdf) | 海报展示 |\n| [SiT：面向泛化的对称不变 Transformer](https:\u002F\u002Fopenreview.net\u002Fattachment?id=SWrwurHAeq&name=pdf) | 海报展示 |\n| [马尔可夫决策过程中的几何主动探索：抽象的好处](https:\u002F\u002Fopenreview.net\u002Fattachment?id=2JYOxcGlRe&name=pdf) | 海报展示 |\n| [通过 Q 分数匹配从奖励中学习扩散模型策略](https:\u002F\u002Fopenreview.net\u002Fattachment?id=35ahHydjXo&name=pdf) | 海报展示 |\n| [ACPO：一种用于平均 MDP 且带约束的策略优化算法](https:\u002F\u002Fopenreview.net\u002Fattachment?id=dmfvHU1LNF&name=pdf) | 海报展示 |\n| [立场：基准测试在强化学习研究中存在局限性](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Xe7n2ZqpBP&name=pdf) | 海报展示 |\n| [通过优越的分布校正估计从离线演示中学习约束](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Ax90jQPbgF&name=pdf) | 海报展示 |\n| 
[在强化学习中用假设增强决策](https:\u002F\u002Fopenreview.net\u002Fattachment?id=NeO2hoSexj&name=pdf) | 海报展示 |\n| [SHINE：保护深度强化学习中的后门](https:\u002F\u002Fopenreview.net\u002Fattachment?id=nMWxLnSBGW&name=pdf) | 海报展示 |\n| [利用深度强化学习在未知环境中学习覆盖路径](https:\u002F\u002Fopenreview.net\u002Fattachment?id=nCZYRBK1J4&name=pdf) | 海报展示 |\n| [通过并行观测预测改进基于令牌的世界模型](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Lfp5Dk1xb6&name=pdf) | 海报展示 |\n| [通过信息性奖励学习在 POMDP 中探索](https:\u002F\u002Fopenreview.net\u002Fattachment?id=oTD3WoQyFR&name=pdf) | 海报展示 |\n| [隐蔽模仿：奖励引导的无环境策略窃取](https:\u002F\u002Fopenreview.net\u002Fattachment?id=H5FDHzrWe2&name=pdf) | 海报展示 |\n| [FuRL：视觉语言模型作为模糊奖励用于强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=BmPWtzL7Eq&name=pdf) | 海报展示 |\n| [在离线强化学习中通过一阶状态-动作动力学提升价值函数估计](https:\u002F\u002Fopenreview.net\u002Fattachment?id=nSGnx8lNJ6&name=pdf) | 海报展示 |\n| [面向可变动作空间的上下文强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=pp3v2ch5Sd&name=pdf) | 海报展示 |\n| [面向离线强化学习的信息导向悲观主义](https:\u002F\u002Fopenreview.net\u002Fattachment?id=JOKOsJHSao&name=pdf) | 海报展示 |\n| [PcLast：发现可规划的连续潜在状态](https:\u002F\u002Fopenreview.net\u002Fattachment?id=AaTYLZQPyC&name=pdf) | 海报展示 |\n| [面向离线到在线强化学习的贝叶斯设计原则](https:\u002F\u002Fopenreview.net\u002Fattachment?id=HLHQxMydFk&name=pdf) | 海报展示 |\n| [面向离线强化学习的自适应优势引导策略正则化](https:\u002F\u002Fopenreview.net\u002Fattachment?id=FV3kY9FBW6&name=pdf) | 海报展示 |\n| [ArCHer：通过层次多轮强化学习训练语言模型智能体](https:\u002F\u002Fopenreview.net\u002Fattachment?id=b6rA0kAHT1&name=pdf) | 海报展示 |\n| [RLAIF 与 RLHF：利用 AI 反馈扩大从人类反馈中受益的强化学习规模](https:\u002F\u002Fopenreview.net\u002Fattachment?id=uydQ2W41KO&name=pdf) | 海报展示 |\n| [朗之万策略用于安全强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=xgoilgLPGD&name=pdf) | 海报展示 |\n| [反思式策略优化](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Cs0Xy6WETl&name=pdf) | 海报展示 |\n| [从人类反馈中进行迭代偏好学习：弥合 KL 约束下 RLHF 的理论与实践](https:\u002F\u002Fopenreview.net\u002Fattachment?id=c1AKcA6ry1&name=pdf) | 海报展示 |\n| [用于跨域离线强化学习数据过滤的对比表征](https:\u002F\u002Fopenreview.net\u002Fattachment?id=rReWhol66R&name=pdf) | 海报展示 |\n| [立场：基础智能体是决策范式的转变](https:\u002F\u002Fopenreview.net\u002Fattachment?id=jzHmElqpPe&name=pdf) | 海报展示 |\n| [基于惩罚的原则性方法用于双层强化学习和 RLHF](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Xb3IXEBYuw&name=pdf) | 海报展示 |\n| [Transformer 世界模型是否能提供更好的策略梯度？](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Uoved2xD81&name=pdf) | 海报展示 |\n| [通过辅助短延迟增强具有强烈延迟反馈的强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=0IDaPnY5d5&name=pdf) | 海报展示 |\n| [通过函数编码器实现零样本强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=tHBLwSYnLf&name=pdf) | 海报展示 |\n| [3D-VLA：一个 3D 视觉-语言-行动生成式世界模型](https:\u002F\u002Fopenreview.net\u002Fattachment?id=EZcFK8HupF&name=pdf) | 海报展示 |\n| [SF-DQN：利用后继特征实现可证明的知识迁移的深度强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=pDoAjdrMf0&name=pdf) | 海报展示 |\n| [上下文决策 Transformer：通过层次化思维链进行强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=jmmji1EU3g&name=pdf) | 海报展示 |\n| [质量-多样性 Actor-Critic：通过价值和后继特征批评家学习高性能且多样化的行为](https:\u002F\u002Fopenreview.net\u002Fattachment?id=ISG3l8nXrI&name=pdf) | 海报展示 |\n| [面向离线偏好式强化学习的列表式奖励估计](https:\u002F\u002Fopenreview.net\u002Fattachment?id=If6Q9OYfoJ&name=pdf) | 海报展示 |\n| [立场：扩大仿真规模既不是野外机器人操作的必要条件，也不是充分条件](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Jtjurj7oIJ&name=pdf) | 海报展示 |\n| [仅凭离线观察的混合强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=c6rVlTKpb5&name=pdf) | 海报展示 |\n| [逆向强化学习是否比标准强化学习更难？一个理论视角](https:\u002F\u002Fopenreview.net\u002Fattachment?id=6dKUu2EkZy&name=pdf) | 海报展示 |\n| [通过稳健平均进行正则化 Q 
学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=07f24ya6eX&name=pdf) | 海报展示 |\n| [通过捕捉表征差异实现跨域策略适应](https:\u002F\u002Fopenreview.net\u002Fattachment?id=3uPSQmjXzd&name=pdf) | 海报展示 |\n| [HarmoDT：面向离线强化学习的和谐多任务决策 Transformer](https:\u002F\u002Fopenreview.net\u002Fattachment?id=2Asakozn3Z&name=pdf) | 海报展示 |\n| [具有希尔伯特表征的基础策略](https:\u002F\u002Fopenreview.net\u002Fattachment?id=LhNsSaAKub&name=pdf) | 海报展示 |\n| [3D 多实体物理环境中的子等变强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=hQpUhySEJi&name=pdf) | 海报展示 |\n| [LLM 赋能的状态表示用于强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=xJMZbdiQnf&name=pdf) | 海报展示 |\n| [基于提示的视觉对齐用于零样本策略迁移](https:\u002F\u002Fopenreview.net\u002Fattachment?id=PPoQz8K4GZ&name=pdf) | 海报展示 |\n| [3D 世界中的具身通才智能体](https:\u002F\u002Fopenreview.net\u002Fattachment?id=V4qV08Vk6S&name=pdf) | 海报展示 |\n| [Q 值正则化的 Transformer 用于离线强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=ojtddicekd&name=pdf) | 海报展示 |\n| [高速公路价值迭代网络](https:\u002F\u002Fopenreview.net\u002Fattachment?id=rORsGuE2hV&name=pdf) | 海报展示 |\n| [在模型误设情况下的鲁棒逆向约束强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=pkUl39b0in&name=pdf) | 海报展示 |\n| [利用分布随机网络蒸馏进行探索与反探索](https:\u002F\u002Fopenreview.net\u002Fattachment?id=rIrpzmqRBk&name=pdf) | 海报展示 |\n| [策略条件环境模型更具泛化能力](https:\u002F\u002Fopenreview.net\u002Fattachment?id=g9mYBdooPA&name=pdf) | 海报展示 |\n| [用于无监督技能发现的约束集合式探索](https:\u002F\u002Fopenreview.net\u002Fattachment?id=AOJCCFTlfJ&name=pdf) | 海报展示 |\n| [DiffStitch：通过基于扩散的轨迹拼接提升离线强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=phGHQOKmaU&name=pdf) | 海报展示 |\n| [通过层次强化学习重新思考决策 Transformer](https:\u002F\u002Fopenreview.net\u002Fattachment?id=WsM4TVsZpJ&name=pdf) | 海报展示 |\n| [从 Transformer 表征中学习认知地图，以实现在部分可观测环境中的高效规划](https:\u002F\u002Fopenreview.net\u002Fattachment?id=JUa5XNXuoT&name=pdf) | 海报展示 |\n| [HarmonyDream：世界模型内部的任务协调](https:\u002F\u002Fopenreview.net\u002Fattachment?id=x0yIaw2fgk&name=pdf) | 海报展示 |\n| [推进商业格斗游戏中 DRL 智能体的发展：训练、集成和智能体与人类的对齐](https:\u002F\u002Fopenreview.net\u002Fattachment?id=eN1T7I7OpZ&name=pdf) | 海报展示 |\n| [通过原始 Wasserstein 状态占用匹配从观察中进行离线模仿](https:\u002F\u002Fopenreview.net\u002Fattachment?id=4Zr7T6UrBS&name=pdf) | 海报展示 |\n| [通过量化进行细粒度因果动力学学习，以提高强化学习的鲁棒性](https:\u002F\u002Fopenreview.net\u002Fpdf?id=mrd4e8ZJjm) | 海报展示 |\n| [在批量强化学习中切换损失可以降低成本](https:\u002F\u002Fopenreview.net\u002Fattachment?id=7PXSc5fURu&name=pdf) | 海报展示 |\n| [三思而后行：带有工作记忆的决策 Transformer](https:\u002F\u002Fopenreview.net\u002Fattachment?id=PSQ5Z920M8&name=pdf) | 海报展示 |\n\n\u003Ca id='NeurIPS24'>\u003C\u002Fa>\n\n## NeurIPS24\n| 论文 | 类型 |\n| ---- | ---- |\n| [基于能量模型的扩散模型的最大熵逆强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=V0oJaLqY4E&name=pdf) | 口头报告 |\n| [改进环境新颖性量化以实现高效的无监督环境设计](https:\u002F\u002Fopenreview.net\u002Fattachment?id=UdxpjKO2F9&name=pdf) | 口头报告 |\n| [RL-GPT：将强化学习与代码即策略相结合](https:\u002F\u002Fopenreview.net\u002Fattachment?id=LEzx6QRkRH&name=pdf) | 口头报告 |\n| [利用深度强化学习优化自动微分](https:\u002F\u002Fopenreview.net\u002Fattachment?id=hVmi98a0ki&name=pdf) | 聚光灯报告 |\n| [更大、正则化、更乐观：面向计算和样本高效的连续控制扩展](https:\u002F\u002Fopenreview.net\u002Fattachment?id=fu0xdh4aEJ&name=pdf) | 聚光灯报告 |\n| [学习型优化能否使强化学习变得更简单？](https:\u002F\u002Fopenreview.net\u002Fattachment?id=YbxFwaSA9Z&name=pdf) | 聚光灯报告 |\n| [通过环路移除进行目标简化，加速强化学习并模拟目标导向学习中的人类大脑活动](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Y0EfJJeb4V&name=pdf) | 聚光灯报告 |\n| [BricksRL：一个利用乐高积木普及机器人学和强化学习研究与教育的平台](https:\u002F\u002Fopenreview.net\u002Fattachment?id=8iytZCnXIu&name=pdf) | 聚光灯报告 |\n| 
[类人机器人运动作为下一个标记预测](https:\u002F\u002Fopenreview.net\u002Fattachment?id=GrMczQGTlA&name=pdf) | 聚光灯报告 |\n| [利用变分偏好学习从人类反馈中个性化强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=gRG6SzbW9p&name=pdf) | 聚光灯报告 |\n| [预训练的文本到图像扩散模型是用于控制的多功能表征学习器](https:\u002F\u002Fopenreview.net\u002Fattachment?id=KY07A73F3Y&name=pdf) | 聚光灯报告 |\n| [关于在策略深度强化学习中可塑性损失的研究](https:\u002F\u002Fopenreview.net\u002Fattachment?id=MsUf8kpKTF&name=pdf) | 聚光灯报告 |\n| [用于世界建模的扩散：Atari 游戏中视觉细节至关重要](https:\u002F\u002Fopenreview.net\u002Fattachment?id=NadTwTODgC&name=pdf) | 聚光灯报告 |\n| [专为离线强化学习设计的惩罚式 Q 学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=2bdSnxeQcW&name=pdf) | 聚光灯报告 |\n| [DiffTORI：面向深度强化学习和模仿学习的可微轨迹优化](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Mwj57TcHWX&name=pdf) | 聚光灯报告 |\n| [变分延迟策略优化](https:\u002F\u002Fopenreview.net\u002Fattachment?id=DAtNDZHbqj&name=pdf) | 聚光灯报告 |\n| [利用有效的基于度量的探索奖励重新思考强化学习中的探索问题](https:\u002F\u002Fopenreview.net\u002Fattachment?id=QpKWFLtZKi&name=pdf) | 聚光灯报告 |\n| [迈向基于信息论的上下文离线元强化学习框架](https:\u002F\u002Fopenreview.net\u002Fattachment?id=QFUsZvw9mx&name=pdf) | 聚光灯报告 |\n| [强化学习梯度作为在线微调决策 Transformer 的“维生素”](https:\u002F\u002Fopenreview.net\u002Fattachment?id=5l5bhYexYO&name=pdf) | 聚光灯报告 |\n| [强化学习中奖励前瞻的重要性](https:\u002F\u002Fopenreview.net\u002Fattachment?id=URyeU8mwz1&name=pdf) | 聚光灯报告 |\n| [PEAC：面向跨具身强化学习的无监督预训练](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.14073) | 海报展示 |\n| [用于噪声和不确定性环境中的深度 RL 的奖励机器](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Cc0ckJlJF2&name=pdf) | 海报展示 |\n| [具有特权信息的可证明部分可观测强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=o3i1JEfzKw&name=pdf) | 海报展示 |\n| [人工世代智能：强化学习中的文化积累](https:\u002F\u002Fopenreview.net\u002Fattachment?id=pMaCRgu8GV&name=pdf) | 海报展示 |\n| [SimPO：一种无需参考奖励的简单偏好优化方法](https:\u002F\u002Fopenreview.net\u002Fattachment?id=3Tzcot1LKb&name=pdf) | 海报展示 |\n| [子词即技能：稀疏奖励强化学习中的分词方法](https:\u002F\u002Fopenreview.net\u002Fattachment?id=WfpvtH7oC1&name=pdf) | 海报展示 |\n| [基于模型的扩散用于轨迹优化](https:\u002F\u002Fopenreview.net\u002Fattachment?id=BJndYScO6o&name=pdf) | 海报展示 |\n| [面向强化学习的算子世界模型](https:\u002F\u002Fopenreview.net\u002Fattachment?id=kbBjVMcJ7G&name=pdf) | 海报展示 |\n| [多元分布强化学习的基础](https:\u002F\u002Fopenreview.net\u002Fattachment?id=aq3I5B6GLG&name=pdf) | 海报展示 |\n| [通过可扩展的逆强化学习模仿语言](https:\u002F\u002Fopenreview.net\u002Fattachment?id=5d2eScRiRC&name=pdf) | 海报展示 |\n| [超越乐观：部分可观测奖励下的探索](https:\u002F\u002Fopenreview.net\u002Fattachment?id=k6ZHvF1vkg&name=pdf) | 海报展示 |\n| [SleeperNets：针对强化学习智能体的通用后门中毒攻击](https:\u002F\u002Fopenreview.net\u002Fattachment?id=HkC4OYee3Q&name=pdf) | 海报展示 |\n| [学习用于无约束目标导航的世界模型](https:\u002F\u002Fopenreview.net\u002Fattachment?id=aYqTwcDlCG&name=pdf) | 海报展示 |\n| [为目标条件强化学习探索潜在状态簇的边缘](https:\u002F\u002Fopenreview.net\u002Fattachment?id=9hKN99RNdR&name=pdf) | 海报展示 |\n| [通过领域适应和奖励增强的模仿学习进行非动力学强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=k2hS5Rt1N0&name=pdf) | 海报展示 |\n| [基于模型的离线强化学习中的受限隐动作策略](https:\u002F\u002Fopenreview.net\u002Fattachment?id=pEhvscmSgG&name=pdf) | 海报展示 |\n| [拆解 DPO 和 PPO：理清从偏好反馈中学习的最佳实践](https:\u002F\u002Fopenreview.net\u002Fattachment?id=JMBWTlazjW&name=pdf) | 海报展示 |\n| [重新思考逆强化学习：从数据对齐到任务对齐](https:\u002F\u002Fopenreview.net\u002Fattachment?id=VFRyS7Wx08&name=pdf) | 海报展示 |\n| [强化学习中的归一化与有效学习率](https:\u002F\u002Fopenreview.net\u002Fattachment?id=ZbjJE6Nq5k&name=pdf) | 海报展示 |\n| [ReST-MCTS*：通过过程奖励引导的树搜索进行 LLM 自我训练](https:\u002F\u002Fopenreview.net\u002Fattachment?id=8rcFOqEud5&name=pdf) | 海报展示 |\n| 
[无需批量更新、目标网络或回放缓冲区的深度策略梯度方法](https:\u002F\u002Fopenreview.net\u002Fattachment?id=DX5GUwMFFb&name=pdf) | 海报展示 |\n| [面向策略学习的文本感知扩散](https:\u002F\u002Fopenreview.net\u002Fattachment?id=nK6OnCpd3n&name=pdf) | 海报展示 |\n| [离线 RL 的一种可处理推理视角](https:\u002F\u002Fopenreview.net\u002Fattachment?id=UZIHW8eFRp&name=pdf) | 海报展示 |\n| [通过行动分解进行策略优化来强化 LLM 智能体](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Hz6cSigMyU&name=pdf) | 海报展示 |\n| [帕塞瓦尔正则化用于持续强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=RB1F2h5YEx&name=pdf) | 海报展示 |\n| [令人惊讶的是，预训练的视觉表征对于基于模型的强化学习几乎无效](https:\u002F\u002Fopenreview.net\u002Fattachment?id=LvAy07mCxU&name=pdf) | 海报展示 |\n| [推测式蒙特卡洛树搜索](https:\u002F\u002Fopenreview.net\u002Fattachment?id=g1HxCIc0wi&name=pdf) | 海报展示 |\n| [在约束强化学习中通过反馈确保安全](https:\u002F\u002Fopenreview.net\u002Fattachment?id=WSsht66fbC&name=pdf) | 海报展示 |\n| [在决策至关重要的地方进行测试：面向深度强化学习的重要性驱动测试](https:\u002F\u002Fopenreview.net\u002Fattachment?id=TwrnhZfD6a&name=pdf) | 海报展示 |\n| [面向零样本泛化的技能感知互信息优化](https:\u002F\u002Fopenreview.net\u002Fattachment?id=GtbwJ6mruI&name=pdf) | 海报展示 |\n| [带有 Q 集成的熵正则化扩散策略，用于离线强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=hWRVbdAWiS&name=pdf) | 海报展示 |\n| [多目标强化学习中效用函数的分析性研究](https:\u002F\u002Fopenreview.net\u002Fattachment?id=K3h2kZFz8h&name=pdf) | 海报展示 |\n| [扩散-DICE：离线强化学习中的样本内扩散指导](https:\u002F\u002Fopenreview.net\u002Fattachment?id=EIl9qmMmvy&name=pdf) | 海报展示 |\n| [高效的循环式离策略 RL 需要特定于上下文编码器的学习率](https:\u002F\u002Fopenreview.net\u002Fattachment?id=tSWoT8ttkO&name=pdf) | 海报展示 |\n| [基于不确定性的离线变分贝叶斯强化学习，用于应对多样化的数据损坏以提高鲁棒性](https:\u002F\u002Fopenreview.net\u002Fattachment?id=rTxCIWsfsD&name=pdf) | 海报展示 |\n| [Any2Policy：使用任意模态学习视觉运动策略](https:\u002F\u002Fopenreview.net\u002Fattachment?id=8lcW9ltJx9&name=pdf) | 海报展示 |\n| [采用适应性正则化的强化学习，用于关键系统的安全控制](https:\u002F\u002Fopenreview.net\u002Fattachment?id=MRO2QhydPF&name=pdf) | 海报展示 |\n| [局部时间上的 Adam：通过相对 Adam 时间步长解决 RL 中的非平稳性问题](https:\u002F\u002Fopenreview.net\u002Fattachment?id=biAqUbAuG7&name=pdf) | 海报展示 |\n| [ROIDICE：用于高效决策的离线投资回报最大化](https:\u002F\u002Fopenreview.net\u002Fattachment?id=6Kg26g1quR&name=pdf) | 海报展示 |\n| [边预测边行动：通过联合去噪过程进行视觉策略学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=teVxVdy8R2&name=pdf) | 海报展示 |\n| [通过将扩散行为与 Q 函数对齐实现高效的连续控制] | 海报展示 |\n\n\u003Ca id='ICLR25'>\u003C\u002Fa>\n\n## ICLR25\n| 论文 | 类型 |\n| ---- | ---- |\n| [RM-Bench：以细微差别与风格为基准评估语言模型的奖励模型](https:\u002F\u002Fopenreview.net\u002Fattachment?id=QEHrmQPBdd&name=pdf) | 口头报告 |\n| [基于扩散的自动驾驶灵活引导规划](https:\u002F\u002Fopenreview.net\u002Fattachment?id=wM2sfVgMDH&name=pdf) | 口头报告 |\n| [从演示序列中学习搜索](https:\u002F\u002Fopenreview.net\u002Fattachment?id=v593OaNePQ&name=pdf) | 口头报告 |\n| [偏好标注扩展：用于高效LLM对齐的直接偏好判断](https:\u002F\u002Fopenreview.net\u002Fattachment?id=BPgK5XW1Nb&name=pdf) | 口头报告 |\n| [无模型强化学习中涌现式规划的解释](https:\u002F\u002Fopenreview.net\u002Fattachment?id=DzGe40glxs&name=pdf) | 口头报告 |\n| [Kinetix：通过开放式的物理控制任务探究通用智能体的训练](https:\u002F\u002Fopenreview.net\u002Fattachment?id=zCxGCdzreM&name=pdf) | 口头报告 |\n| [OptionZero：基于学习到的选项进行规划](https:\u002F\u002Fopenreview.net\u002Fattachment?id=3IFRygQKGL&name=pdf) | 口头报告 |\n| [预测性逆动力学模型是可扩展的机器人操作学习器](https:\u002F\u002Fopenreview.net\u002Fattachment?id=meRCKuUpmc&name=pdf) | 口头报告 |\n| [机器人操作模仿学习中的数据缩放规律](https:\u002F\u002Fopenreview.net\u002Fattachment?id=pISLZG7ktL&name=pdf) | 口头报告 |\n| [更多的RLHF，更多的信任？关于偏好对齐对可信度的影响](https:\u002F\u002Fopenreview.net\u002Fattachment?id=FpiCLJrSW8&name=pdf) | 口头报告 |\n| [面向形状多变及可变形物体操作的几何感知强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=7BLXhmWvwF&name=pdf) | 口头报告 |\n| 
[DeepLTL：学习高效满足多任务强化学习中复杂LTL规范](https:\u002F\u002Fopenreview.net\u002Fattachment?id=9pW2J49flQ&name=pdf) | 口头报告 |\n| [通过强化学习训练语言模型实现自我修正](https:\u002F\u002Fopenreview.net\u002Fattachment?id=CjwERcAU7w&name=pdf) | 口头报告 |\n| [优先级生成式回放](https:\u002F\u002Fopenreview.net\u002Fattachment?id=5IkDAfabuo&name=pdf) | 口头报告 |\n| [策略参数空间中的平坦奖励意味着稳健的强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=4OaO3GjP7k&name=pdf) | 口头报告 |\n| [基于长短时想象的开放世界强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=vzItLaEoDa&name=pdf) | 口头报告 |\n| [基于计数的探索实现语言模型的在线偏好对齐](https:\u002F\u002Fopenreview.net\u002Fattachment?id=cfKZ5VrhXt&name=pdf) | 聚光灯报告 |\n| [结合示范与人类反馈的联合奖励与策略学习提升对齐效果](https:\u002F\u002Fopenreview.net\u002Fattachment?id=VCbqXtS5YY&name=pdf) | 聚光灯报告 |\n| [相关代理：奖励欺骗的新定义及其改进缓解方法](https:\u002F\u002Fopenreview.net\u002Fattachment?id=msEr27EejF&name=pdf) | 聚光灯报告 |\n| [非平稳情境驱动环境下的在线强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=l6QnSQizmN&name=pdf) | 聚光灯报告 |\n| [DataEnvGym：在教师环境中利用学生反馈生成数据的智能体](https:\u002F\u002Fopenreview.net\u002Fattachment?id=00SnKBGTsz&name=pdf) | 聚光灯报告 |\n| [纠正KL正则化的迷思：通过卡方偏好优化实现无过度优化的直接对齐](https:\u002F\u002Fopenreview.net\u002Fattachment?id=hXm0Wu2U9K&name=pdf) | 聚光灯报告 |\n| [TOP-ERL：基于Transformer的离策略分幕强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=N4NhVN30ph&name=pdf) | 聚光灯报告 |\n| [VisualPredicator：利用神经符号谓词学习抽象世界模型用于机器人规划](https:\u002F\u002Fopenreview.net\u002Fattachment?id=QOfswj7hij&name=pdf) | 聚光灯报告 |\n| [基于扩散模型的多机器人运动规划](https:\u002F\u002Fopenreview.net\u002Fattachment?id=AUCYptvAf3&name=pdf) | 聚光灯报告 |\n| [简化深度时序差分学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=7IzeL0kflu&name=pdf) | 聚光灯报告 |\n| [基于ODE的平滑神经网络用于强化学习任务](https:\u002F\u002Fopenreview.net\u002Fattachment?id=S5Yo6w3n3f&name=pdf) | 聚光灯报告 |\n| [MAD-TD：模型增强数据稳定高更新率强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=6RtRsg8ZV1&name=pdf) | 聚光灯报告 |\n| [在可微多物理仿真中稳定强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=DRiLWb8bJg&name=pdf) | 聚光灯报告 |\n| [不要展平，要分词！解锁SoftMoE在深度强化学习中高效性的关键](https:\u002F\u002Fopenreview.net\u002Fattachment?id=8oCrlOaYcc&name=pdf) | 聚光灯报告 |\n| [基于Transformer的世界模型学习与对比预测编码](https:\u002F\u002Fopenreview.net\u002Fattachment?id=YK9G4Htdew&name=pdf) | 聚光灯报告 |\n| [迈向通用无模型强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=R1hIXdST22&name=pdf) | 聚光灯报告 |\n| [重新思考奖励模型评估：我们是否找错了方向？](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Cnwz9jONi5&name=pdf) | 聚光灯报告 |\n| [加速目标条件强化学习算法与研究](https:\u002F\u002Fopenreview.net\u002Fattachment?id=4gaySj8kvX&name=pdf) | 聚光灯报告 |\n| [SimBa：简单性偏差助力深度强化学习参数规模扩大](https:\u002F\u002Fopenreview.net\u002Fattachment?id=jXLiDKsuDo&name=pdf) | 聚光灯报告 |\n| [无需奖励过度优化的扩散模型测试时对齐](https:\u002F\u002Fopenreview.net\u002Fattachment?id=vi3DjUhFVm&name=pdf) | 聚光灯报告 |\n| [通过直接优化缓解树状强化学习中的信息损失](https:\u002F\u002Fopenreview.net\u002Fattachment?id=qpXctF2aLZ&name=pdf) | 聚光灯报告 |\n| [什么样的扩散规划器适合决策？](https:\u002F\u002Fopenreview.net\u002Fattachment?id=7BQkXXM8Fy&name=pdf) | 聚光灯报告 |\n| [ADAM：开放世界环境中的具身因果智能体](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Ouu3HnIVBc&name=pdf) | 海报展示 |\n| [如何评估用于RLHF的奖励模型](https:\u002F\u002Fopenreview.net\u002Fattachment?id=cbttLtO94Q&name=pdf) | 海报展示 |\n| [SafeDiffuser：使用扩散概率模型的安全规划](https:\u002F\u002Fopenreview.net\u002Fattachment?id=ig2wk7kK9J&name=pdf) | 海报展示 |\n| [高效的在线强化学习微调无需保留离线数据](https:\u002F\u002Fopenreview.net\u002Fattachment?id=HN0CYZbAPw&name=pdf) | 海报展示 |\n| [利用大型语言模型先验进行高效强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=e2NRNQ0sZe&name=pdf) | 海报展示 |\n| 
[朗之万软演员-评论家：通过不确定性驱动的评论家学习实现高效探索](https:\u002F\u002Fopenreview.net\u002Fattachment?id=FvQsk3la17&name=pdf) | 海报展示 |\n| [带有安全约束的强化学习高效策略评估](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Dem5LyVk8R&name=pdf) | 海报展示 |\n| [模型编辑作为DPO的鲁棒去噪变体：以毒性问题为例](https:\u002F\u002Fopenreview.net\u002Fattachment?id=lOi6FtIwR8&name=pdf) | 海报展示 |\n| [面向约束强化学习的安全优先课程](https:\u002F\u002Fopenreview.net\u002Fattachment?id=f3QR9TEERH&name=pdf) | 海报展示 |\n| [用于不确定性感知的离线强化学习的神经随机微分方程](https:\u002F\u002Fopenreview.net\u002Fattachment?id=hxUMQ4fic3&name=pdf) | 海报展示 |\n| [MaxInfoRL：通过最大化信息增益提升强化学习中的探索](https:\u002F\u002Fopenreview.net\u002Fattachment?id=R4q3cY3kQf&name=pdf) | 海报展示 |\n| [SEMDICE：通过稳态分布校正估计实现离策略状态熵最大化](https:\u002F\u002Fopenreview.net\u002Fattachment?id=rJ5g8ueQaI&name=pdf) | 海报展示 |\n| [Strategist：通过双层树搜索实现LLM决策的自我改进](https:\u002F\u002Fopenreview.net\u002Fattachment?id=gfI9v7AbFg&name=pdf) | 海报展示 |\n\n\u003Ca id='ICML25'>\u003C\u002Fa>\n\n\n## ICML25\n| 论文 | 类型 |\n| ---- | ---- |\n| [EmbodiedBench：面向视觉驱动具身智能体的多模态大语言模型综合基准测试](https:\u002F\u002Fopenreview.net\u002Fattachment?id=DgGF2LEBPS&name=pdf) | 口头报告 |\n| [网络稀疏性释放深度强化学习的可扩展潜力](https:\u002F\u002Fopenreview.net\u002Fattachment?id=mIomqOskaa&name=pdf) | 口头报告 |\n| [通过单步奖励实现多轮代码生成](https:\u002F\u002Fopenreview.net\u002Fattachment?id=aJeLhLcsh0&name=pdf) | 聚光灯报告 |\n| [策略标注的偏好学习：偏好是否足以用于RLHF？](https:\u002F\u002Fopenreview.net\u002Fattachment?id=qLfo1sef50&name=pdf) | 聚光灯报告 |\n| [用于系统2规划的蒙特卡洛树扩散](https:\u002F\u002Fopenreview.net\u002Fattachment?id=XrCbBdycDc&name=pdf) | 聚光灯报告 |\n| [RLEF：基于强化学习的执行反馈，使代码LLM落地](https:\u002F\u002Fopenreview.net\u002Fattachment?id=PzSG5nKe1q&name=pdf) | 聚光灯报告 |\n| [指数族下的决策：基于贝叶斯模糊集的分布鲁棒优化](https:\u002F\u002Fopenreview.net\u002Fattachment?id=r9HlTuCQfr&name=pdf) | 聚光灯报告 |\n| [超球面归一化用于可扩展的深度强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=kfYxyvCYQ4&name=pdf) | 聚光灯报告 |\n| [LLM与RL的协同作用：利用低质量数据解锁可泛化语言条件策略的离线学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=5hyfZ2jYfI&name=pdf) | 聚光灯报告 |\n| [在使用离线数据的强化学习中惩罚不可行动作及奖励缩放](https:\u002F\u002Fopenreview.net\u002Fattachment?id=FSVdEzR4To&name=pdf) | 聚光灯报告 |\n| [隐私与鲁棒离线对齐的统一理论分析：从RLHF到DPO](https:\u002F\u002Fopenreview.net\u002Fattachment?id=XEyGcrhxB8&name=pdf) | 聚光灯报告 |\n| [视频预测策略：一种具有预测性视觉表征的通用机器人策略](https:\u002F\u002Fopenreview.net\u002Fattachment?id=c0dhw1du33&name=pdf) | 聚光灯报告 |\n| [通过在线世界模型规划实现持续强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=mQeZEsdODh&name=pdf) | 聚光灯报告 |\n| [跨动力学强化学习中对全局可达状态的策略正则化](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Go0DdhjATH&name=pdf) | 聚光灯报告 |\n| [用于模仿学习的潜在扩散规划](https:\u002F\u002Fopenreview.net\u002Fattachment?id=vhACnRfuYh&name=pdf) | 聚光灯报告 |\n| [利用世界模型进行强化学习中的新奇性检测](https:\u002F\u002Fopenreview.net\u002Fattachment?id=xtlixzbcfV&name=pdf) | 聚光灯报告 |\n| [DPO与PPO相遇：用于RLHF的强化标记优化](https:\u002F\u002Fopenreview.net\u002Fattachment?id=IfWKVF6LfY&name=pdf) | 聚光灯报告 |\n\n\n\u003Ca id='NeurIPS25'>\u003C\u002Fa>\n\n## NeurIPS25\n| 论文 | 类型 |\n| ---- | ---- |\n| [用于鲁棒强化学习的状态熵正则化](https:\u002F\u002Fopenreview.net\u002Fattachment?id=rtG7n93Ru8&name=pdf) | 口头报告 |\n| [PRIMT：基于偏好、多模态反馈以及由基础模型生成轨迹的强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=4xvE6Iy77Y&name=pdf) | 口头报告 |\n| [离线强化学习的新起点](https:\u002F\u002Fopenreview.net\u002Fattachment?id=8P3QNSckMp&name=pdf) | 口头报告 |\n| [用于自监督RL的1000层网络：扩大深度可实现新的目标达成能力](https:\u002F\u002Fopenreview.net\u002Fattachment?id=s0JVsx3bx1&name=pdf) | 口头报告 |\n| [QoQ-Med：通过领域感知的GRPO训练构建多模态临床基础模型](https:\u002F\u002Fopenreview.net\u002Fattachment?id=ZwCVFBFUFb&name=pdf) | 口头报告 |\n| 
[强化学习是否真的能在基础模型之外激励大语言模型的推理能力？](https:\u002F\u002Fopenreview.net\u002Fattachment?id=4OsgYD7em5&name=pdf) | 口头报告 |\n| [AceSearcher：通过强化自我博弈为大语言模型启动推理与搜索](https:\u002F\u002Fopenreview.net\u002Fattachment?id=jSgCM0uZn3&name=pdf) | 聚光灯报告 |\n| [Pass@K策略优化：解决更难的强化学习问题](https:\u002F\u002Fopenreview.net\u002Fattachment?id=W6WC6047X2&name=pdf) | 聚光灯报告 |\n| [非平稳环境下的离线强化学习预测](https:\u002F\u002Fopenreview.net\u002Fattachment?id=24UJqxw1kv&name=pdf) | 聚光灯报告 |\n| [对抗性RL：重新思考高效且可扩展的深度强化学习的核心原则](https:\u002F\u002Fopenreview.net\u002Fattachment?id=qaHrpITIvB&name=pdf) | 聚光灯报告 |\n| [利用强化学习逆向工程人类偏好](https:\u002F\u002Fopenreview.net\u002Fattachment?id=heY0zzGvYm&name=pdf) | 聚光灯报告 |\n| [SafeVLA：通过约束学习实现视觉-语言-行动模型的安全对齐](https:\u002F\u002Fopenreview.net\u002Fattachment?id=dt940loCBT&name=pdf) | 聚光灯报告 |\n| [d1：通过强化学习扩展扩散式大型语言模型的推理能力](https:\u002F\u002Fopenreview.net\u002Fattachment?id=7ZVRlBFuEv&name=pdf) | 聚光灯报告 |\n| [SoFar：以语言锚定的方向感连接空间推理与物体操作](https:\u002F\u002Fopenreview.net\u002Fattachment?id=kmv7yg6QXv&name=pdf) | 聚光灯报告 |\n| [Memo：利用强化学习训练内存高效的具身智能体](https:\u002F\u002Fopenreview.net\u002Fattachment?id=9eIntNc69t&name=pdf) | 聚光灯报告 |\n| [DeepDiver：通过强化学习自适应调整网络搜索强度](https:\u002F\u002Fopenreview.net\u002Fattachment?id=CqLWckpTbG&name=pdf) | 聚光灯报告 |\n| [DenseDPO：针对视频扩散模型的细粒度时间偏好优化](https:\u002F\u002Fopenreview.net\u002Fattachment?id=YFa7eULIeN&name=pdf) | 聚光灯报告 |\n| [用于具有偏移动力学数据的强化学习的复合流匹配](https:\u002F\u002Fopenreview.net\u002Fattachment?id=7cPDOBWTbM&name=pdf) | 聚光灯报告 |\n| [利用不完美的转移预测进行强化学习：一种贝尔曼-延森方法](https:\u002F\u002Fopenreview.net\u002Fattachment?id=DYuPwwDy9n&name=pdf) | 聚光灯报告 |\n| [深度强化学习中大规模稳定学习的稳定梯度](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Vqj65VeDOu&name=pdf) | 聚光灯报告 |\n| [AlphaZero神经缩放与齐普夫定律：棋类游戏与幂律的故事](https:\u002F\u002Fopenreview.net\u002Fattachment?id=IMmkDMqFMU&name=pdf) | 聚光灯报告 |\n| [DAPO：通过直接优势驱动的策略优化提升大语言模型的多步推理能力](https:\u002F\u002Fopenreview.net\u002Fattachment?id=77eEDRhPkQ&name=pdf) | 聚光灯报告 |\n| [VL-Rethinker：利用强化学习激励视觉-语言模型的自我反思](https:\u002F\u002Fopenreview.net\u002Fattachment?id=4oYxzssbVg&name=pdf) | 聚光灯报告 |\n| [CURE：通过强化学习协同进化编码员和单元测试员](https:\u002F\u002Fopenreview.net\u002Fattachment?id=wPdBe9zxNr&name=pdf) | 聚光灯报告 |\n| [蒸馏还是决策？理解部分可观测强化学习中的算法权衡](https:\u002F\u002Fopenreview.net\u002Fattachment?id=iEgaS6wbLa&name=pdf) | 聚光灯报告 |\n| [面向离线强化学习的自适应邻域约束Q学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=qgi5TfBXBw&name=pdf) | 聚光灯报告 |\n| [思考还是不思考？规则驱动的视觉强化微调中思考行为的研究](https:\u002F\u002Fopenreview.net\u002Fattachment?id=YexxvBGwQM&name=pdf) | 聚光灯报告 |\n| [DexGarmentLab：具有可泛化策略的灵巧服装操作环境](https:\u002F\u002Fopenreview.net\u002Fattachment?id=ZZ09oX2Xpo&name=pdf) | 聚光灯报告 |\n| [Q-Insight：通过视觉强化学习理解图像质量](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Bds54EfR9x&name=pdf) | 聚光灯报告 |\n| [通过正交性实现新颖探索](https:\u002F\u002Fopenreview.net\u002Fattachment?id=yJS1eZSNUv&name=pdf) | 海报展示 |\n| [Router-R1：通过强化学习教导大语言模型多轮路由与聚合](https:\u002F\u002Fopenreview.net\u002Fattachment?id=DWf4vroKWJ&name=pdf) | 海报展示 |\n| [带有回溯反馈的强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=14B5d6NEaH&name=pdf) | 海报展示 |\n| [安全RLHF-V：来自多模态人类反馈的安全强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=OIH3T5ZPBW&name=pdf) | 海报展示 |\n| [世界感知规划叙事增强大型视觉-语言模型规划器](https:\u002F\u002Fopenreview.net\u002Fattachment?id=fggSyPPk0K&name=pdf) | 海报展示 |\n| [STAR：通过双重正则化实现高效的基于偏好的强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=E9EwDc45f8&name=pdf) | 海报展示 |\n| [FairDICE：以公平为导向的离线多目标强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=2jQJ7aNdT1&name=pdf) | 海报展示 |\n| 
[Robot-R1：用于增强机器人具身推理的强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=N2bLuwofZ0&name=pdf) | 海报展示 |\n| [关于评估鲁棒POMDPs策略的方法](https:\u002F\u002Fopenreview.net\u002Fattachment?id=l2Wl77TSYY&name=pdf) | 海报展示 |\n| [周期性技能发现](https:\u002F\u002Fopenreview.net\u002Fattachment?id=BPSU46emit&name=pdf) | 海报展示 |\n| [带有动作分块的强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=XUks1Y96NR&name=pdf) | 海报展示 |\n| [ReSearch：通过强化学习让大语言模型学会用搜索进行推理](https:\u002F\u002Fopenreview.net\u002Fattachment?id=OuGAwwAT8G&name=pdf) | 海报展示 |\n| [UFO-RL：以不确定性为中心的优化，用于高效选择强化学习数据](https:\u002F\u002Fopenreview.net\u002Fattachment?id=sH0ZwzDJZn&name=pdf) | 海报展示 |\n| [段落策略优化：在大语言模型的强化学习中实现有效的段级信用分配](https:\u002F\u002Fopenreview.net\u002Fattachment?id=9osvTOYbT4&name=pdf) | 海报展示 |\n| [DISCOVER：稀疏奖励强化学习的自动化课程](https:\u002F\u002Fopenreview.net\u002Fpdf?id=guZBnsKPsw) | 海报展示 |\n| [EnerVerse：设想机器人操作的具身未来空间](https:\u002F\u002Fopenreview.net\u002Fattachment?id=igtjRQfght&name=pdf) | 海报展示 |\n| [上下文世界模型中与动力学对齐的潜在想象，用于零样本泛化](https:\u002F\u002Fopenreview.net\u002Fattachment?id=41bIzD5sit&name=pdf) | 海报展示 |\n| [基于模型的探索增强的离策略强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=JGkZgEEjiM&name=pdf) | 海报展示 |\n| [IOSTOM：通过状态转移占用率匹配从观察中进行离线模仿学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=OEp1J4V2fN&name=pdf) | 海报展示 |\n| [树引导的扩散规划器](https:\u002F\u002Fopenreview.net\u002Fattachment?id=I1C0a01BZu&name=pdf) | 海报展示 |\n| [动作分块流策略的实时执行](https:\u002F\u002Fopenreview.net\u002Fattachment?id=UkR2zO5uww&name=pdf) | 海报展示 |\n| [提示式策略搜索：通过大语言模型中的语言和数值推理进行强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=95plu1Mo20&name=pdf) | 海报展示 |\n| [行为注入：为语言模型的强化学习做准备](https:\u002F\u002Fopenreview.net\u002Fattachment?id=mzlwDAQkgJ&name=pdf) | 海报展示 |\n| [ExPO：通过自我解释引导的强化学习解锁高难度推理](https:\u002F\u002Fopenreview.net\u002Fattachment?id=D1PeGJtVEu&name=pdf) | 海报展示 |\n| [具有动作序列的粗细结合Q网络，用于数据高效的强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=VoFXUNc9Zh&name=pdf) | 海报展示 |\n| [采用多轮强化学习持续模拟人类角色](https:\u002F\u002Fopenreview.net\u002Fattachment?id=A0T3piHiis&name=pdf) | 海报展示 |\n| [深度RL需要深度行为分析：探索无模型智能体在开放环境中进行的隐式规划](https:\u002F\u002Fopenreview.net\u002Fattachment?id=QD06Qv7O0P&name=pdf) | 海报展示 |\n\n\u003Ca id='ICLR26'>\u003C\u002Fa>\n\n\n## ICLR26\n| 论文 | 类型 |\n| ---- | ---- |\n| [用于无监督强化学习的探索性扩散模型](https:\u002F\u002Fopenreview.net\u002Fpdf?id=k0Kb1ynFbt) | 口头报告 |\n| [通过离线奖励评估和策略搜索增强生成式自动出价](https:\u002F\u002Fopenreview.net\u002Fattachment?id=kMuQBgPIdg&name=pdf) | 口头报告 |\n| [为什么DPO是一种误设的估计量以及如何修复它](https:\u002F\u002Fopenreview.net\u002Fattachment?id=btEiAfnLsX&name=pdf) | 口头报告 |\n| [SafeDPO：一种简单且安全性更强的直接偏好优化方法](https:\u002F\u002Fopenreview.net\u002Fattachment?id=PJdw4VBsXD&name=pdf) | 口头报告 |\n| [基于引导搜索的组合扩散模型用于长时程规划](https:\u002F\u002Fopenreview.net\u002Fattachment?id=b8avf4F2hn&name=pdf) | 口头报告 |\n| [LoongRL：面向长上下文的高级推理强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=o29E01Q6bv&name=pdf) | 口头报告 |\n| [GEPA：反思式提示进化可超越强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=RQm2KQTM5r&name=pdf) | 口头报告 |\n| [无需训练的推理：你的基础模型比你想象的更聪明](https:\u002F\u002Fopenreview.net\u002Fattachment?id=Vsgq2ldr4K&name=pdf) | 口头报告 |\n| [用于学习机器人动作的罗德里格斯网络](https:\u002F\u002Fopenreview.net\u002Fattachment?id=IZHk6BXBST&name=pdf) | 口头报告 |\n| [具有瞬时速度约束的均流策略用于单步动作生成]() | 口头报告 |\n| [TD-JEPA：用于零样本强化学习的潜在预测表征](https:\u002F\u002Fopenreview.net\u002Fattachment?id=SzXDuBN8M1&name=pdf) | 口头报告 |\n| [WoW！：闭环世界中的世界模型](https:\u002F\u002Fopenreview.net\u002Fattachment?id=yDmb7xAfeb&name=pdf) | 口头报告 |\n| 
[DiffusionNFT：带有前向过程的在线扩散强化学习](https:\u002F\u002Fopenreview.net\u002Fattachment?id=VJZ477R89F&name=pdf) | 口头报告 |\n| [通过预训练模型和深度强化学习掌握稀疏CUDA生成](https:\u002F\u002Fopenreview.net\u002Fattachment?id=VdLEaGPYWT&name=pdf) | 口头报告 |\n| [LongWriter-Zero：通过强化学习掌握超长文本生成](https:\u002F\u002Fopenreview.net\u002Fattachment?id=JWx4DI2N8k&name=pdf) | 口头报告 |\n| [MomaGraph：结合视觉-语言模型的状态感知统一场景图用于具身任务规划](https:\u002F\u002Fopenreview.net\u002Fattachment?id=mIeKe74W43&name=pdf) | 口头报告 |","# Reinforcement-Learning-Papers 快速上手指南\n\n**项目简介**：\n`Reinforcement-Learning-Papers` 是一个精选的强化学习（RL）论文清单，主要聚焦于单智能体领域。它涵盖了从经典算法（如 DQN, PPO, SAC）到前沿研究（如基于 Transformer 的 RL、离线 RL、扩散模型结合等）的广泛内容，并持续更新包括 ICLR、ICML、NeurIPS 等顶级会议的最新论文。本项目主要作为**文献索引和知识库**使用，帮助开发者快速定位高质量论文及其核心贡献。\n\n## 环境准备\n\n由于本项目本质上是论文列表和文档集合，**无需安装复杂的深度学习框架或 GPU 环境**即可浏览核心内容。\n\n*   **系统要求**：Windows \u002F macOS \u002F Linux 均可。\n*   **前置依赖**：\n    *   Git（用于克隆仓库）\n    *   Markdown 阅读器（推荐 VS Code 或 GitHub 网页版）\n    *   （可选）Python 3.x：如果你打算运行列表中部分论文提供的官方代码链接，需根据具体论文的要求配置相应的 PyTorch 或 TensorFlow 环境。\n\n## 安装步骤\n\n### 1. 克隆仓库\n打开终端（Terminal 或 CMD），执行以下命令将项目下载到本地：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fyingchengyang\u002FReinforcement-Learning-Papers.git\n```\n\n> **国内加速建议**：\n> 如果直接克隆速度较慢，可以使用 Gitee 镜像（如果有）或通过代理加速。若无特定镜像，可尝试指定深度为 1 以加快下载：\n> ```bash\n> git clone --depth 1 https:\u002F\u002Fgithub.com\u002Fyingchengyang\u002FReinforcement-Learning-Papers.git\n> ```\n\n### 2. 进入目录\n```bash\ncd Reinforcement-Learning-Papers\n```\n\n### 3. （可选）安装依赖\n本项目主要是 Markdown 文档，通常不需要 `pip install`。但如果需要复现列表中某些论文的代码，请查看对应论文标题链接指向的原始仓库，并按照其特定的 `requirements.txt` 进行安装。\n\n## 基本使用\n\n本项目的主要使用方式是**查阅文献索引**。你可以通过本地 Markdown 文件或直接在 GitHub 页面上浏览分类好的论文列表。\n\n### 方式一：本地浏览（推荐）\n使用支持 Markdown 预览的编辑器（如 VS Code）打开 `README.md` 文件。\n\n1.  使用 VS Code 打开项目文件夹。\n2.  点击 `README.md` 文件。\n3.  点击右上角的“打开侧边预览”图标（或按 `Ctrl+Shift+V` \u002F `Cmd+Shift+V`）。\n4.  利用目录（Contents）中的锚点快速跳转至感兴趣的领域，例如：\n    *   **Model Free (Online) RL**：在线无模型方法（含 DQN, PPO, SAC 等经典与最新算法）。\n    *   **Model Based (Online) RL**：在线基于模型的方法（含 World Models, Dreamer 等）。\n    *   **Offline RL**：离线强化学习（含结合扩散模型的最新研究）。\n    *   **RL with Transformer\u002FLLM**：结合大语言模型的强化学习。\n    *   **Conference Lists**：按年份和会议（ICLR, ICML, NeurIPS 2022-2026）查找最新论文。\n\n### 方式二：在线查阅\n直接访问 GitHub 仓库页面，利用右侧的目录导航栏快速定位。点击表格中的论文标题即可直接跳转到 PDF 原文或 ArXiv 页面。\n\n### 使用示例：查找最新 PPO 改进或 SAC 相关论文\n1.  在 `README.md` 中找到 **[Model Free (Online) RL](#Model-Free-Online)** 章节。\n2.  向下滚动至 **[Classic Methods](#model-free-classic)** 表格。\n3.  查找 `PPO` 或 `SAC` 行，阅读其 `Description` 列了解核心思想（如 PPO 通过裁剪系数替代硬约束，SAC 基于最大熵理论）。\n4.  点击标题链接下载论文全文。\n5.  若需查找最新进展，可直接跳转至文档底部的 **[NeurIPS25](#NeurIPS25)** 或 **[ICLR26](#ICLR26)** 等章节查看最新收录的论文（命令行检索示例见下）。
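\n\n> **补充示例（关键词检索，非项目自带功能）**：\n> 除了在编辑器中滚动浏览，也可以用 `grep` 在本地克隆的 `README.md` 中直接检索论文标题。以下命令仅为示意，假定你已完成上文的克隆步骤并位于仓库根目录：\n> ```bash\n> # -i 忽略大小写，-n 显示行号，便于在编辑器中定位\n> grep -in SAC README.md\n> # -E 启用扩展正则，可一次检索多个关键词（如 PPO 或 SAC）\n> grep -inE 'PPO|SAC' README.md\n> ```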
\n\n---\n*注：本指南仅用于指导如何查阅该论文列表。若需复现具体算法，请务必前往论文原文中提供的官方代码仓库获取详细实现细节。*","某自动驾驶初创公司的算法工程师正在研发基于强化学习的决策模块，急需从海量顶会论文中筛选出适合连续动作空间的最新模型基线。\n\n### 没有 Reinforcement-Learning-Papers 时\n- **检索效率极低**：面对每年数万篇 RL 新论文，工程师需在 Google Scholar 和 arXiv 上盲目关键词搜索，耗费数天才能拼凑出零散的文献列表。\n- **经典与前沿割裂**：难以快速厘清技术演进脉络，往往找到了最新的 ICLR 2024 论文，却遗漏了支撑该方法的 Double DQN 或 Rainbow 等经典基石，导致复现时缺乏理论根基。\n- **关键信息缺失**：下载论文后需逐篇阅读摘要才能确认是否支持“离线训练”或“连续动作空间”，无法预先通过结构化表格快速过滤不匹配的方法。\n- **领域覆盖盲区**：容易忽略如“RL 结合扩散模型”或“Meta RL”等交叉领域的突破性进展，导致技术方案选型局限在传统框架内。\n\n### 使用 Reinforcement-Learning-Papers 后\n- **一站式精准导航**：直接利用按会议（ICLR\u002FNeurIPS 等）和方法类型（Model-Free\u002FOffline）分类的目录，10 分钟内即可锁定针对连续动作空间的 SOTA 算法清单。\n- **脉络清晰可视**：通过“Classic Methods”到“Current methods”的结构化梳理，迅速掌握从 DQN 到最新 Transformer 结合方案的技术迭代路径，夯实实验设计基础。\n- **核心属性速查**：借助包含策略类型、动作空间、在线\u002F离线标记的详细表格，无需阅读全文即可判断论文适用性，大幅缩短预研周期。\n- **前沿动态同步**：及时获取直至 ICLR 2026 的最新收录论文，确保团队能第一时间将“离线 RL 结合扩散模型”等前沿思路融入自动驾驶决策系统。\n\nReinforcement-Learning-Papers 将原本耗时数周的文献调研工作压缩至小时级，让研发团队能更专注于算法落地而非信息搜集。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fyingchengyang_Reinforcement-Learning-Papers_109a29d9.png","yingchengyang","Chengyang Ying","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fyingchengyang_84a1c24a.jpg","PhD Student @ Tsinghua University, Machine Learning, Reinforcement Learning, Embodied AI","Tsinghua University","Beijing, China",null,"https:\u002F\u002Fyingchengyang.github.io\u002F","https:\u002F\u002Fgithub.com\u002Fyingchengyang",559,38,"2026-04-17T17:11:19","MIT","","未说明",{"notes":92,"python":90,"dependencies":93},"该仓库是一个强化学习论文列表（Awesome List），主要包含论文标题、方法、会议信息及链接，并非可执行的软件代码库，因此没有特定的运行环境、依赖库或硬件需求。文中仅在部分论文描述中提及某些方法基于 PyTorch 实现，但这属于被引用论文的范畴，而非本仓库本身的运行要求。",[],[18],[96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114],"reinforcement-learning","model-free-rl","model-based-rl","offline-rl","meta-reinforcement-learning","unsupervised-reinforcement-learning","reinforcement-learning-papers","generalization-reinforcement-learning","iclr23","iclr24","icml22","icml23","neurips22","neurips23","icml24","neurips24","iclr25","neurips25","icml25","2026-03-27T02:49:30.150509","2026-04-19T03:03:09.112880",[],[]]