[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-eleurent--phd-bibliography":3,"tool-eleurent--phd-bibliography":65},[4,18,32,40,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4292,"Deep-Live-Cam","hacksider\u002FDeep-Live-Cam","Deep-Live-Cam 是一款专注于实时换脸与视频生成的开源工具，用户仅需一张静态照片，即可通过“一键操作”实现摄像头画面的即时变脸或制作深度伪造视频。它有效解决了传统换脸技术流程繁琐、对硬件配置要求极高以及难以实时预览的痛点，让高质量的数字内容创作变得触手可及。\n\n这款工具不仅适合开发者和技术研究人员探索算法边界，更因其极简的操作逻辑（仅需三步：选脸、选摄像头、启动），广泛适用于普通用户、内容创作者、设计师及直播主播。无论是为了动画角色定制、服装展示模特替换，还是制作趣味短视频和直播互动，Deep-Live-Cam 都能提供流畅的支持。\n\n其核心技术亮点在于强大的实时处理能力，支持口型遮罩（Mouth Mask）以保留使用者原始的嘴部动作，确保表情自然精准；同时具备“人脸映射”功能，可同时对画面中的多个主体应用不同面孔。此外，项目内置了严格的内容安全过滤机制，自动拦截涉及裸露、暴力等不当素材，并倡导用户在获得授权及明确标注的前提下合规使用，体现了技术发展与伦理责任的平衡。",88924,3,"2026-04-06T03:28:53",[13,14,15,16],"开发框架","图像","Agent","视频","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":24,"last_commit_at":25,"category_tags":26,"status":17},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85092,2,"2026-04-10T11:13:16",[14,27,16,28,15,29,30,13,31],"数据工具","插件","其他","语言模型","音频",{"id":33,"name":34,"github_repo":35,"description_zh":36,"stars":37,"difficulty_score":10,"last_commit_at":38,"category_tags":39,"status":17},3833,"MoneyPrinterTurbo","harry0703\u002FMoneyPrinterTurbo","MoneyPrinterTurbo 是一款利用 AI 大模型技术，帮助用户一键生成高清短视频的开源工具。只需输入一个视频主题或关键词，它就能全自动完成从文案创作、素材匹配、字幕合成到背景音乐搭配的全过程，最终输出完整的竖屏或横屏短视频。\n\n这款工具主要解决了传统视频制作流程繁琐、门槛高以及素材版权复杂等痛点。无论是需要快速产出内容的自媒体创作者，还是希望尝试视频生成的普通用户，无需具备专业的剪辑技能或昂贵的硬件配置（普通电脑即可运行），都能轻松上手。同时，其清晰的 MVC 架构和对多种主流大模型（如 DeepSeek、Moonshot、通义千问等）的广泛支持，也使其成为开发者进行二次开发或技术研究的理想底座。\n\nMoneyPrinterTurbo 的独特亮点在于其高度的灵活性与本地化友好性。它不仅支持中英文双语及多种语音合成，允许用户精细调整字幕样式和画面比例，还特别优化了国内网络环境下的模型接入方案，让用户无需依赖 VPN 即可使用高性能国产大模型。此外，工具提供批量生成模式，可一次性产出多个版本供用户择优，极大地提升了内容创作的效率与质量。",54991,"2026-04-05T12:23:02",[13,30,15,16,14],{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":24,"last_commit_at":46,"category_tags":47,"status":17},2179,"oh-my-openagent","code-yeongyu\u002Foh-my-openagent","oh-my-openagent（简称 omo）是一款强大的开源智能体编排框架，前身名为 oh-my-opencode。它致力于打破单一模型供应商的生态壁垒，解决开发者在构建 AI 应用时面临的“厂商锁定”难题。不同于仅依赖特定模型的封闭方案，omo 倡导开放市场理念，支持灵活调度多种主流大模型：利用 Claude、Kimi 或 GLM 进行任务编排，调用 GPT 处理复杂推理，借助 Minimax 提升响应速度，或发挥 Gemini 的创意优势。\n\n这款工具特别适合希望摆脱平台限制、追求极致性能与成本平衡的开发者及研究人员使用。通过统一接口，用户可以轻松组合不同模型的长处，构建更高效、更具适应性的智能体系统。其独特的技术亮点在于“全模型兼容”架构，让用户不再受制于某一家公司的策略变动或定价调整，真正实现对前沿模型资源的自由驾驭。无论是构建自动化编码助手，还是开发多步骤任务处理流程，oh-my-openagent 都能提供灵活且稳健的基础设施支持，助力用户在快速演进的 AI 生态中保持技术主动权。",50701,"2026-04-12T11:29:54",[16,30,13,14,15],{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":10,"last_commit_at":54,"category_tags":55,"status":17},5295,"tabby","TabbyML\u002Ftabby","Tabby 是一款可私有化部署的开源 AI 编程助手，旨在为开发团队提供 GitHub Copilot 的安全替代方案。它核心解决了代码辅助过程中的数据隐私顾虑与云端依赖问题，让企业能够在完全掌控数据的前提下享受智能代码补全、聊天问答及上下文理解带来的效率提升。\n\n这款工具特别适合注重代码安全的企业开发团队、希望本地化运行大模型的科研机构，以及拥有消费级显卡的个人开发者。Tabby 的最大亮点在于其“开箱即用”的自包含架构，无需配置复杂的数据库或依赖云服务即可快速启动。同时，它对硬件十分友好，支持在普通的消费级 GPU 上流畅运行，大幅降低了部署门槛。此外，Tabby 提供了标准的 OpenAPI 接口，能轻松集成到现有的云 IDE 或内部开发流程中，并支持通过 REST API 
# eleurent/phd-bibliography

References on Optimal Control, Reinforcement Learning and Motion Planning.

phd-bibliography is an open-source bibliography built for the fields of optimal control, reinforcement learning, and motion planning. It systematically organizes the classic theory, from dynamic programming and linear programming to model predictive control, and digs into frontier directions such as safe control, game theory, sequential learning, and imitation learning.

Against the pain points of an ocean of literature and a sprawling body of knowledge, phd-bibliography uses a structured table of contents to consolidate scattered papers, classic books, and key techniques (MCTS, REPS, Actor-Critic, and so on) into a single clear learning map. It does more than list resources: entries are threaded together by algorithmic family and application scenario (e.g., autonomous driving), helping readers locate core material quickly and trace how the techniques evolved.

The collection is aimed at AI researchers, graduate students, and algorithm engineers. Newcomers can use it to build out a knowledge framework; seasoned experts can use it to fill gaps and track the latest progress. Its distinguishing traits are depth and fine-grained classification, down to specific topics such as derivative-free optimization and hierarchical temporal abstraction, making it an indispensable desk reference for research in intelligent decision-making and robot control.

# Bibliography

# Table of contents

* [Optimal Control](#optimal-control-dart)
  * [Dynamic Programming](#dynamic-programming)
  * [Linear Programming](#linear-programming)
  * [Tree-Based Planning](#tree-based-planning)
  * [Control Theory](#control-theory)
  * [Model Predictive Control](#model-predictive-control)
* [Safe Control](#safe-control-lock)
  * [Robust Control](#robust-control)
  * [Risk-Averse Control](#risk-averse-control)
  * [Value-Constrained Control](#value-constrained-control)
  * [State-Constrained Control and Stability](#state-constrained-control-and-stability)
  * [Uncertain Dynamical Systems](#uncertain-dynamical-systems)
* [Game Theory](#game-theory-spades)
* [Sequential Learning](#sequential-learning-shoe)
  * [Multi-Armed Bandit](#multi-armed-bandit-slot_machine)
    * [Best Arm Identification](#best-arm-identification-muscle)
    * [Black-box Optimization](#black-box-optimization-black_large_square)
  * [Reinforcement Learning](#reinforcement-learning-robot)
    * [Theory](#theory-books)
    * [Value-based](#value-based-chart_with_upwards_trend)
    * [Policy-based](#policy-based-muscle)
      * [Policy Gradient](#policy-gradient)
      * [Actor-critic](#actor-critic)
      * [Derivative-free](#derivative-free)
    * [Model-based](#model-based-world_map)
    * [Exploration](#exploration-tent)
    * [Hierarchy and Temporal Abstraction](#hierarchy-and-temporal-abstraction-clock2)
    * [Partial Observability](#partial-observability-eye)
    * [Transfer](#transfer-earth_americas)
    * [Multi-agent](#multi-agent-two_men_holding_hands)
    * [Representation Learning](#representation-learning)
    * [Offline](#offline)
* [Learning from Demonstrations](#learning-from-demonstrations-mortar_board)
  * [Imitation Learning](#imitation-learning)
    * [Applications to Autonomous Driving](#applications-to-autonomous-driving-car)
  * [Inverse Reinforcement Learning](#inverse-reinforcement-learning)
    * [Applications to Autonomous Driving](#applications-to-autonomous-driving-taxi)
* [Motion Planning](#motion-planning-running_man)
  * [Search](#search)
  * [Sampling](#sampling)
  * [Optimization](#optimization)
  * [Reactive](#reactive)
  * [Architecture and applications](#architecture-and-applications)

![RL Diagram](https://rawgit.com/eleurent/phd-bibliography/master/reinforcement-learning.svg)
# Optimal Control :dart:

## Dynamic Programming

* (book) [Dynamic Programming](https://press.princeton.edu/titles/9234.html), Bellman R. (1957).
* (book) [Dynamic Programming and Optimal Control, Volumes 1 and 2](http://web.mit.edu/dimitrib/www/dpchapter.html), Bertsekas D. (1995).
* (book) [Markov Decision Processes - Discrete Stochastic Dynamic Programming](http://eu.wiley.com/WileyCDA/WileyTitle/productCd-1118625870.html), Puterman M. (1995).
* [An Upper Bound on the Loss from Approximate Optimal-Value Functions](https://www.cis.upenn.edu/~mkearns/teaching/cis620/papers/SinghYee.pdf), Singh S., Yee R. (1994).
* [Stochastic optimization of sailing trajectories in an upwind regatta](https://link.springer.com/article/10.1057%2Fjors.2014.40), Dalang R. et al. (2015).
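A one-line anchor for this section: all of the references above study solutions of the Bellman optimality equation for a discounted MDP,

$$V^*(s) = \max_{a \in \mathcal{A}} \Big[ r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \Big],$$

which value iteration solves by applying the right-hand side as a fixed-point operator, and policy iteration by alternating evaluation and greedy improvement.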
## Linear Programming

* (book) [Markov Decision Processes - Discrete Stochastic Dynamic Programming](http://eu.wiley.com/WileyCDA/WileyTitle/productCd-1118625870.html), Puterman M. (1995).
* **`REPS`** [Relative Entropy Policy Search](https://www.aaai.org/ocs/index.php/AAAI/AAAI10/paper/viewFile/1851/2264), Peters J. et al. (2010).

## Tree-Based Planning

* **`ExpectiMinimax`** [Optimal strategy in games with chance nodes](http://www.inf.u-szeged.hu/actacybernetica/edb/vol18n2/pdf/Melko_2007_ActaCybernetica.pdf), Melkó E., Nagy B. (2007).
* **`Sparse sampling`** [A sparse sampling algorithm for near-optimal planning in large Markov decision processes](https://www.cis.upenn.edu/~mkearns/papers/sparsesampling-journal.pdf), Kearns M. et al. (2002).
* **`MCTS`** [Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search](https://hal.inria.fr/inria-00116992/document), Coulom R. (2006).
* **`UCT`** [Bandit based Monte-Carlo Planning](http://ggp.stanford.edu/readings/uct.pdf), Kocsis L., Szepesvári C. (2006).
* [Bandit Algorithms for Tree Search](https://hal.inria.fr/inria-00136198v2), Coquelin P-A., Munos R. (2007).
* **`OPD`** [Optimistic Planning for Deterministic Systems](https://hal.inria.fr/hal-00830182), Hren J., Munos R. (2008).
* **`OLOP`** [Open Loop Optimistic Planning](http://sbubeck.com/COLT10_BM.pdf), Bubeck S., Munos R. (2010).
* **`SOOP`** [Optimistic Planning for Continuous-Action Deterministic Systems](http://researchers.lille.inria.fr/munos/papers/files/adprl13-soop.pdf), Buşoniu L. et al. (2011).
* **`OPSS`** [Optimistic planning for sparsely stochastic systems](https://www.dcsc.tudelft.nl/~bdeschutter/pub/rep/11_007.pdf), Buşoniu L., Munos R., De Schutter B., Babuska R. (2011).
* **`HOOT`** [Sample-Based Planning for Continuous Action Markov Decision Processes](https://www.aaai.org/ocs/index.php/ICAPS/ICAPS11/paper/viewFile/2679/3175), Mansley C., Weinstein A., Littman M. (2011).
* **`HOLOP`** [Bandit-Based Planning and Learning in Continuous-Action Markov Decision Processes](https://pdfs.semanticscholar.org/a445/d8cc503781c481c3f3c4ee1758b862b3e869.pdf), Weinstein A., Littman M. (2012).
* **`BRUE`** [Simple Regret Optimization in Online Planning for Markov Decision Processes](https://www.jair.org/index.php/jair/article/view/10905/26003), Feldman Z., Domshlak C. (2014).
* **`LGP`** [Logic-Geometric Programming: An Optimization-Based Approach to Combined Task and Motion Planning](https://ipvs.informatik.uni-stuttgart.de/mlr/papers/15-toussaint-IJCAI.pdf), Toussaint M. (2015). [🎞️](https://www.youtube.com/watch?v=B2s85xfo2uE)
* **`AlphaGo`** [Mastering the game of Go with deep neural networks and tree search](https://www.nature.com/articles/nature16961), Silver D. et al. (2016).
* **`AlphaGo Zero`** [Mastering the game of Go without human knowledge](https://www.nature.com/articles/nature24270), Silver D. et al. (2017).
* **`AlphaZero`** [Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm](https://arxiv.org/abs/1712.01815), Silver D. et al. (2017).
* **`TrailBlazer`** [Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning](https://papers.nips.cc/paper/6253-blazing-the-trails-before-beating-the-path-sample-efficient-monte-carlo-planning.pdf), Grill J. B., Valko M., Munos R. (2017).
* **`MCTSnets`** [Learning to search with MCTSnets](https://arxiv.org/abs/1802.04697), Guez A. et al. (2018).
* **`ADI`** [Solving the Rubik's Cube Without Human Knowledge](https://arxiv.org/abs/1805.07470), McAleer S. et al. (2018).
* **`OPC/SOPC`** [Continuous-action planning for discounted infinite-horizon nonlinear optimal control with Lipschitz values](http://busoniu.net/files/papers/aut18.pdf), Buşoniu L., Pall E., Munos R. (2018).
* [Real-time tree search with pessimistic scenarios: Winning the NeurIPS 2018 Pommerman Competition](http://proceedings.mlr.press/v101/osogami19a.html), Osogami T., Takahashi T. (2019).
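A reading aid for the bandit-based planners above (UCT and its many descendants): at each tree node they select the child maximizing a UCB-style index that trades the empirical value estimate against a visit-count bonus,

$$a = \operatorname*{arg\,max}_{a'} \; \hat{Q}(s, a') + c \sqrt{\frac{\ln N(s)}{N(s, a')}},$$

where $N(s)$ counts visits to the node, $N(s, a')$ visits to the child, and $c$ tunes exploration.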
## Control Theory

* (book) [The Mathematical Theory of Optimal Processes](https://books.google.fr/books?id=kwzq0F4cBVAC&printsec=frontcover&redir_esc=y#v=onepage&q&f=false), Pontryagin L. S., Boltyanskii V. G., Gamkrelidze R. V., Mishchenko E. F. (1962).
* (book) [Constrained Control and Estimation](http://www.springer.com/gp/book/9781852335489), Goodwin G. (2005).
* **`PI²`** [A Generalized Path Integral Control Approach to Reinforcement Learning](http://www.jmlr.org/papers/volume11/theodorou10a/theodorou10a.pdf), Theodorou E. et al. (2010).
* **`PI²-CMA`** [Path Integral Policy Improvement with Covariance Matrix Adaptation](https://arxiv.org/abs/1206.4621), Stulp F., Sigaud O. (2012).
* **`iLQG`** [A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems](http://maeresearch.ucsd.edu/skelton/publications/weiwei_ilqg_CDC43.pdf), Todorov E. (2005). [:octocat:](https://github.com/neka-nat/ilqr-gym)
* **`iLQG+`** [Synthesis and stabilization of complex behaviors through online trajectory optimization](https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf), Tassa Y. (2012).

## Model Predictive Control

* (book) [Model Predictive Control](http://een.iust.ac.ir/profs/Shamaghdari/MPC/Resources/), Camacho E. (1995).
* (book) [Predictive Control With Constraints](https://books.google.fr/books/about/Predictive_Control.html?id=HV_Y58c7KiwC&redir_esc=y), Maciejowski J. M. (2002).
* [Linear Model Predictive Control for Lane Keeping and Obstacle Avoidance on Low Curvature Roads](http://ieeexplore.ieee.org/document/6728261/), Turri V. et al. (2013).
* **`MPCC`** [Optimization-based autonomous racing of 1:43 scale RC cars](https://arxiv.org/abs/1711.07300), Liniger A. et al. (2014). [🎞️](https://www.youtube.com/watch?v=mXaElWYQKC4) | [🎞️](https://www.youtube.com/watch?v=JoHfJ6LEKVo)
* **`MIQP`** [Optimal trajectory planning for autonomous driving integrating logical constraints: An MIQP perspective](https://hal.archives-ouvertes.fr/hal-01342358v1/document), Qian X., Altché F., Bender P., Stiller C., de La Fortelle A. (2016).
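The MPC references above share the receding-horizon template: at each time step, solve a finite-horizon optimal control problem

$$\min_{u_0, \dots, u_{H-1}} \; \sum_{k=0}^{H-1} \ell(x_k, u_k) + \ell_f(x_H) \quad \text{s.t.} \quad x_{k+1} = f(x_k, u_k), \; x_k \in \mathcal{X}, \; u_k \in \mathcal{U},$$

apply only the first input $u_0$, then re-measure the state and re-solve, which restores feedback despite the open-loop plan.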
# Safe Control :lock:

## Robust Control

* [Minimax analysis of stochastic problems](https://www2.isye.gatech.edu/~anton/MinimaxSP.pdf), Shapiro A., Kleywegt A. (2002).
* **`Robust DP`** [Robust Dynamic Programming](https://www.researchgate.net/publication/220442530/download), Iyengar G. (2005).
* [Robust Planning and Optimization](https://www.researchgate.net/profile/Francisco_Perez-Galarce/post/can_anyone_recommend_a_report_or_article_on_two_stage_robust_optimization/attachment/59d62578c49f478072e9a500/AS%3A272164542976002%401441900491330/download/2011+-+Robust+planning+and+optimization.pdf), Laumanns M. (2011). (lecture notes)
* [Robust Markov Decision Processes](https://pubsonline.informs.org/doi/pdf/10.1287/moor.1120.0566), Wiesemann W., Kuhn D., Rustem B. (2012).
* [Safe and Robust Learning Control with Gaussian Processes](http://www.dynsyslab.org/wp-content/papercite-data/pdf/berkenkamp-ecc15.pdf), Berkenkamp F., Schoellig A. (2015). [🎞️](https://www.youtube.com/watch?v=YqhLnCm0KXY)
* **`Tube-MPPI`** [Robust Sampling Based Model Predictive Control with Sparse Objective Information](http://www.roboticsproceedings.org/rss14/p42.pdf), Williams G. et al. (2018). [🎞️](https://www.youtube.com/watch?v=32v-e3dptjo)
* [Safe Learning in Robotics: From Learning-Based Control to Safe Reinforcement Learning](https://arxiv.org/abs/2108.06266), Brunke L. et al. (2021). [:octocat:](https://github.com/utiasDSL/safe-control-gym)
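The robust formulations above share a worst-case objective over an uncertainty set $\mathcal{P}$ of admissible dynamics:

$$\max_{\pi} \; \min_{P \in \mathcal{P}} \; \mathbb{E}^{\pi}_{P} \Big[ \sum_{t} \gamma^{t} r_t \Big].$$

Robust dynamic programming (Iyengar, 2005; Wiesemann et al., 2012) keeps this tractable when $\mathcal{P}$ is rectangular, i.e., factorizes across state-action pairs.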
## Risk-Averse Control

* [A Comprehensive Survey on Safe Reinforcement Learning](http://jmlr.org/papers/v16/garcia15a.html), García J., Fernández F. (2015).
* **`RA-QMDP`** [Risk-averse Behavior Planning for Autonomous Driving under Uncertainty](https://arxiv.org/abs/1812.01254), Naghshvar M. et al. (2018).
* **`StoROO`** [X-Armed Bandits: Optimizing Quantiles and Other Risks](https://arxiv.org/abs/1904.08205), Torossian L., Garivier A., Picheny V. (2019).
* [Worst Cases Policy Gradients](https://arxiv.org/abs/1911.03618), Tang Y. C. et al. (2019).
* [Model-Free Risk-Sensitive Reinforcement Learning](https://arxiv.org/abs/2111.02907), Delétang G. et al. (2021).
* [Optimal Thompson Sampling strategies for support-aware CVaR bandits](https://proceedings.mlr.press/v139/baudry21a.html), Baudry D., Gautron R., Kaufmann E., Maillard O. (2021).

## Value-Constrained Control

* **`ICS`** [Will the Driver Seat Ever Be Empty?](https://hal.inria.fr/hal-00965176), Fraichard T. (2014).
* **`SafeOPT`** [Safe Controller Optimization for Quadrotors with Gaussian Processes](https://arxiv.org/abs/1509.01066), Berkenkamp F., Schoellig A., Krause A. (2015). [🎞️](https://www.youtube.com/watch?v=GiqNQdzc5TI) [:octocat:](https://github.com/befelix/SafeOpt)
* **`SafeMDP`** [Safe Exploration in Finite Markov Decision Processes with Gaussian Processes](https://arxiv.org/abs/1606.04753), Turchetta M., Berkenkamp F., Krause A. (2016). [:octocat:](https://github.com/befelix/SafeMDP)
* **`RSS`** [On a Formal Model of Safe and Scalable Self-driving Cars](https://arxiv.org/abs/1708.06374), Shalev-Shwartz S. et al. (2017).
* **`CPO`** [Constrained Policy Optimization](https://arxiv.org/abs/1705.10528), Achiam J., Held D., Tamar A., Abbeel P. (2017). [:octocat:](https://github.com/jachiam/cpo)
* **`RCPO`** [Reward Constrained Policy Optimization](https://arxiv.org/abs/1805.11074), Tessler C., Mankowitz D., Mannor S. (2018).
* **`BFTQ`** [A Fitted-Q Algorithm for Budgeted MDPs](https://hal.archives-ouvertes.fr/hal-01867353), Carrara N. et al. (2018).
* **`SafeMPC`** [Learning-based Model Predictive Control for Safe Exploration](https://arxiv.org/abs/1803.08287), Koller T., Berkenkamp F., Turchetta M., Krause A. (2018).
* **`CCE`** [Constrained Cross-Entropy Method for Safe Reinforcement Learning](https://papers.nips.cc/paper/7974-constrained-cross-entropy-method-for-safe-reinforcement-learning), Wen M., Topcu U. (2018). [:octocat:](https://github.com/liuzuxin/safe-mbrl)
* **`LTL-RL`** [Reinforcement Learning with Probabilistic Guarantees for Autonomous Driving](https://arxiv.org/abs/1904.07189), Bouton M. et al. (2019).
* [Safe Reinforcement Learning with Scene Decomposition for Navigating Complex Urban Environments](https://arxiv.org/abs/1904.11483v1), Bouton M. et al. (2019). [:octocat:](https://github.com/sisl/AutomotivePOMDPs.jl)
* [Batch Policy Learning under Constraints](https://arxiv.org/abs/1903.08738), Le H., Voloshin C., Yue Y. (2019).
* [Value constrained model-free continuous control](https://arxiv.org/abs/1902.04623), Bohez S. et al. (2019). [🎞️](https://sites.google.com/view/successatanycost)
* [Safely Learning to Control the Constrained Linear Quadratic Regulator](https://ieeexplore.ieee.org/abstract/document/8814865), Dean S. et al. (2019).
* [Learning to Walk in the Real World with Minimal Human Effort](https://arxiv.org/abs/2002.08550), Ha S. et al. (2020). [🎞️](https://youtu.be/cwyiq6dCgOc)
* [Responsive Safety in Reinforcement Learning by PID Lagrangian Methods](https://arxiv.org/abs/2007.03964), Stooke A., Achiam J., Abbeel P. (2020). [:octocat:](https://github.com/astooke/rlpyt/tree/master/rlpyt/projects/safe)
* **`Envelope MOQ-Learning`** [A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation](https://arxiv.org/abs/1908.08342), Yang R. et al. (2019).
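A compact statement of the constrained-MDP objective optimized by most methods above (CPO, RCPO, BFTQ, the Lagrangian approaches):

$$\max_{\pi} \; \mathbb{E}^{\pi} \Big[ \sum_t \gamma^t r_t \Big] \quad \text{s.t.} \quad \mathbb{E}^{\pi} \Big[ \sum_t \gamma^t c_t \Big] \le \beta.$$

Lagrangian methods (e.g., RCPO, PID Lagrangian) relax the constraint into a penalty $r_t - \lambda c_t$ and adapt the multiplier $\lambda \ge 0$ alongside the policy.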
## State-Constrained Control and Stability

* **`HJI-reachability`** [Safe learning for control: Combining disturbance estimation, reachability analysis and reinforcement learning with systematic exploration](http://kth.diva-portal.org/smash/get/diva2:1140173/FULLTEXT01.pdf), Heidenreich C. (2017).
* **`MPC-HJI`** [On Infusing Reachability-Based Safety Assurance within Probabilistic Planning Frameworks for Human-Robot Vehicle Interactions](https://stanfordasl.github.io/wp-content/papercite-data/pdf/Leung.Schmerling.Chen.ea.ISER18.pdf), Leung K. et al. (2018).
* [A General Safety Framework for Learning-Based Control in Uncertain Robotic Systems](https://arxiv.org/abs/1705.01292), Fisac J. et al. (2017). [🎞️](https://www.youtube.com/watch?v=WAAxyeSk2bw&feature=youtu.be)
* [Safe Model-based Reinforcement Learning with Stability Guarantees](https://arxiv.org/abs/1705.08551), Berkenkamp F. et al. (2017).
* **`Lyapunov-Net`** [Safe Interactive Model-Based Learning](https://arxiv.org/abs/1911.06556), Gallieri M. et al. (2019).
* [Enforcing robust control guarantees within neural network policies](https://arxiv.org/abs/2011.08105), Donti P. et al. (2021). [:octocat:](https://github.com/locuslab/robust-nn-control)
* **`ATACOM`** [Robot Reinforcement Learning on the Constraint Manifold](https://openreview.net/forum?id=zwo1-MdMl1P), Liu P. et al. (2021).

## Uncertain Dynamical Systems

* [Simulation of Controlled Uncertain Nonlinear Systems](https://www.sciencedirect.com/science/article/pii/009630039400112H), Tibken B., Hofer E. (1995).
* [Trajectory computation of dynamic uncertain systems](https://ieeexplore.ieee.org/iel5/8969/28479/01272787.pdf), Adrot O., Flaus J-M. (2002).
* [Simulation of Uncertain Dynamic Systems Described By Interval Models: a Survey](https://www.sciencedirect.com/science/article/pii/S1474667016362206), Puig V. et al. (2005).
* [Design of interval observers for uncertain dynamical systems](https://hal.inria.fr/hal-01276439/file/Interval_Survey.pdf), Efimov D., Raïssi T. (2016).

# Game Theory :spades:

* [Hierarchical Game-Theoretic Planning for Autonomous Vehicles](https://arxiv.org/abs/1810.05766), Fisac J. et al. (2018).
* [Efficient Iterative Linear-Quadratic Approximations for Nonlinear Multi-Player General-Sum Differential Games](https://arxiv.org/abs/1909.04694), Fridovich-Keil D. et al. (2019). [🎞️](https://www.youtube.com/watch?v=KPEPk-QrkQ8&feature=youtu.be)

# Sequential Learning :shoe:

* [Prediction, Learning and Games](https://www.ii.uni.wroc.pl/~lukstafi/pmwiki/uploads/AGT/Prediction_Learning_and_Games.pdf), Cesa-Bianchi N., Lugosi G. (2006).
## Multi-Armed Bandit :slot_machine:

* **`TS`** [On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples](https://www.jstor.org/stable/pdf/2332286.pdf), Thompson W. (1933).
* [Exploration and Exploitation in Organizational Learning](https://www3.nd.edu/~ggoertz/abmir/march1991.pdf), March J. (1991).
* **`UCB1 / UCB2`** [Finite-time Analysis of the Multiarmed Bandit Problem](https://homes.di.unimi.it/~cesabian/Pubblicazioni/ml-02.pdf), Auer P., Cesa-Bianchi N., Fischer P. (2002).
* **`Empirical Bernstein / UCB-V`** [Exploration-exploitation tradeoff using variance estimates in multi-armed bandits](https://hal.inria.fr/hal-00711069/), Audibert J-Y., Munos R., Szepesvari C. (2009).
* [Empirical Bernstein Bounds and Sample Variance Penalization](https://arxiv.org/abs/0907.3740), Maurer A., Pontil M. (2009).
* [An Empirical Evaluation of Thompson Sampling](https://papers.nips.cc/paper/4321-an-empirical-evaluation-of-thompson-sampling), Chapelle O., Li L. (2011).
* **`kl-UCB`** [The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond](https://arxiv.org/abs/1102.2490), Garivier A., Cappé O. (2011).
* **`KL-UCB`** [Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation](https://projecteuclid.org/euclid.aos/1375362558), Cappé O. et al. (2013).
* **`IDS`** [Information Directed Sampling and Bandits with Heteroscedastic Noise](https://arxiv.org/abs/1801.09667), Kirschner J., Krause A. (2018).
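For readers who want the mechanics rather than the regret proofs, here is a minimal sketch of the UCB1 index of Auer et al. (2002) on Bernoulli arms. It is plain illustrative Python, not code from any cited repository, and the names (`ucb1`, `arms`) are invented for the example:

```python
import math
import random

def ucb1(n_rounds, arms):
    """Play Bernoulli arms with the UCB1 index: mean + sqrt(2 ln t / n_a)."""
    counts = [0] * len(arms)   # number of pulls per arm
    sums = [0.0] * len(arms)   # cumulative reward per arm
    for t in range(1, n_rounds + 1):
        if t <= len(arms):     # pull each arm once to initialize
            a = t - 1
        else:
            a = max(range(len(arms)),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if random.random() < arms[a] else 0.0
        counts[a] += 1
        sums[a] += reward
    return counts

# Pulls should concentrate on the best arm (p = 0.8) as t grows.
print(ucb1(10_000, arms=[0.2, 0.5, 0.8]))
```

Thompson sampling replaces the index with posterior sampling: draw a mean for each arm from its posterior and pull the argmax.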
#### Contextual

* **`LinUCB`** [A Contextual-Bandit Approach to Personalized News Article Recommendation](https://arxiv.org/abs/1003.0146), Li L. et al. (2010).
* **`OFUL`** [Improved Algorithms for Linear Stochastic Bandits](https://papers.nips.cc/paper/4417-improved-algorithms-for-linear-stochastic-bandits), Abbasi-Yadkori Y., Pal D., Szepesvári C. (2011).
* [Contextual Bandits with Linear Payoff Functions](http://proceedings.mlr.press/v15/chu11a.html), Chu W. et al. (2011).
* [Self-normalization techniques for streaming confident regression](https://hal.archives-ouvertes.fr/hal-01349727v2), Maillard O.-A. (2017).
* [Learning from Delayed Outcomes via Proxies with Applications to Recommender Systems](https://arxiv.org/abs/1807.09387), Mann T. et al. (2018). (prediction setting)
* [Weighted Linear Bandits for Non-Stationary Environments](https://arxiv.org/abs/1909.09146), Russac Y. et al. (2019).
* [Linear bandits with Stochastic Delayed Feedback](http://proceedings.mlr.press/v119/vernade20a.html), Vernade C. et al. (2020).

### Best Arm Identification :muscle:

* **`Successive Elimination`** [Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems](http://jmlr.csail.mit.edu/papers/volume7/evendar06a/evendar06a.pdf), Even-Dar E. et al. (2006).
* **`LUCB`** [PAC Subset Selection in Stochastic Multi-armed Bandits](https://www.cse.iitb.ac.in/~shivaram/papers/ktas_icml_2012.pdf), Kalyanakrishnan S. et al. (2012).
* **`UGapE`** [Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence](https://hal.archives-ouvertes.fr/hal-00747005), Gabillon V., Ghavamzadeh M., Lazaric A. (2012).
* **`Sequential Halving`** [Almost Optimal Exploration in Multi-Armed Bandits](http://proceedings.mlr.press/v28/karnin13.pdf), Karnin Z. et al. (2013).
* **`M-LUCB / M-Racing`** [Maximin Action Identification: A New Bandit Framework for Games](https://arxiv.org/abs/1602.04676), Garivier A., Kaufmann E., Koolen W. (2016).
* **`Track-and-Stop`** [Optimal Best Arm Identification with Fixed Confidence](https://arxiv.org/abs/1602.04589), Garivier A., Kaufmann E. (2016).
* **`LUCB-micro`** [Structured Best Arm Identification with Fixed Confidence](https://arxiv.org/abs/1706.05198), Huang R. et al. (2017).

### Black-box Optimization :black_large_square:

* **`GP-UCB`** [Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design](https://arxiv.org/abs/0912.3995), Srinivas N., Krause A., Kakade S., Seeger M. (2009).
* **`HOO`** [X-Armed Bandits](https://arxiv.org/abs/1001.4475), Bubeck S., Munos R., Stoltz G., Szepesvari C. (2009).
* **`DOO/SOO`** [Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness](https://papers.nips.cc/paper/4304-optimistic-optimization-of-a-deterministic-function-without-the-knowledge-of-its-smoothness), Munos R. (2011).
* **`StoOO`** [From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning](https://hal.archives-ouvertes.fr/hal-00747575v4/), Munos R. (2014).
* **`StoSOO`** [Stochastic Simultaneous Optimistic Optimization](http://proceedings.mlr.press/v28/valko13.pdf), Valko M., Carpentier A., Munos R. (2013).
* **`POO`** [Black-box optimization of noisy functions with unknown smoothness](https://hal.inria.fr/hal-01222915v4/), Grill J-B., Valko M., Munos R. (2015).
* **`EI-GP`** [Bayesian Optimization in AlphaGo](https://arxiv.org/abs/1812.06855), Chen Y. et al. (2018).
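For the Bayesian-optimization line above (GP-UCB and relatives), the acquisition rule is the Gaussian-process posterior mean inflated by a scaled posterior standard deviation:

$$x_t = \operatorname*{arg\,max}_{x} \; \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x),$$

where $(\mu_{t-1}, \sigma_{t-1})$ summarize the posterior after $t-1$ evaluations and $\beta_t$ is a confidence parameter set by the regret analysis.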
# Reinforcement Learning :robot:

* [Reinforcement learning: A survey](https://www.jair.org/media/301/live-301-1562-jair.pdf), Kaelbling L. et al. (1996).

## Theory :books:

* [Expected mistake bound model for on-line reinforcement learning](https://pdfs.semanticscholar.org/13b8/1dd08aab636c3761c5eb4337dbe43aedaf31.pdf), Fiechter C-N. (1997).
* **`UCRL2`** [Near-optimal Regret Bounds for Reinforcement Learning](http://www.jmlr.org/papers/volume11/jaksch10a/jaksch10a.pdf), Jaksch T. (2010). ![Setting](https://img.shields.io/badge/setting-average-green)![Setting](https://img.shields.io/badge/communicating-orange)![Bound](https://img.shields.io/badge/DS(AT)^0.5-orange)
* **`PSRL`** [Why is Posterior Sampling Better than Optimism for Reinforcement Learning?](https://arxiv.org/abs/1607.00215), Osband I., Van Roy B. (2016). ![Setting](https://img.shields.io/badge/setting-episodic-green) ![Bayesian](https://img.shields.io/badge/bayesian-green)
* **`UCBVI`** [Minimax Regret Bounds for Reinforcement Learning](http://proceedings.mlr.press/v70/azar17a.html), Azar M., Osband I., Munos R. (2017). ![Setting](https://img.shields.io/badge/setting-episodic-green)![Bound](https://img.shields.io/badge/(H²SAT)^0.5-orange)
* **`Q-Learning-UCB`** [Is Q-Learning Provably Efficient?](https://papers.nips.cc/paper/7735-is-q-learning-provably-efficient), Jin C., Allen-Zhu Z., Bubeck S., Jordan M. (2018). ![Setting](https://img.shields.io/badge/setting-episodic-green) ![Bound](https://img.shields.io/badge/(H^3SAT)^0.5-orange)
* **`LSVI-UCB`** [Provably Efficient Reinforcement Learning with Linear Function Approximation](https://arxiv.org/abs/1907.05388), Jin C., Yang Z., Wang Z., Jordan M. (2019). ![Setting](https://img.shields.io/badge/setting-episodic-green) ![Spaces](https://img.shields.io/badge/approximation-linear-green)
* [Lipschitz Continuity in Model-based Reinforcement Learning](https://arxiv.org/abs/1804.07193), Asadi K. et al. (2018).
* [On Function Approximation in Reinforcement Learning: Optimism in the Face of Large State Spaces](https://arxiv.org/abs/2011.04622), Yang Z., Jin C., Wang Z., Wang M., Jordan M. (2021). ![Setting](https://img.shields.io/badge/setting-episodic-green) ![Spaces](https://img.shields.io/badge/approximation-kernel/nn-green) ![Bound](https://img.shields.io/badge/delta.H^2(T)^0.5-orange)
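A reading key for the badges above: in the episodic setting, the regret after $K$ episodes compares the optimal value with the learner's,

$$\mathrm{Regret}(K) = \sum_{k=1}^{K} \Big[ V^{*}_{1}(s^{k}_{1}) - V^{\pi_k}_{1}(s^{k}_{1}) \Big],$$

so that, for instance, UCBVI's $\tilde{O}(\sqrt{H^2 SAT})$ badge bounds this sum; UCRL2's bound plays the same role in the average-reward setting.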
### Generative Model

* **`QVI`** [On the Sample Complexity of Reinforcement Learning with a Generative Model](https://arxiv.org/abs/1206.6461), Azar M., Munos R., Kappen B. (2012).
* [Model-Based Reinforcement Learning with a Generative Model is Minimax Optimal](https://arxiv.org/abs/1906.03804), Agarwal A. et al. (2019).

### Policy Gradient

* [Policy Gradient Methods for Reinforcement Learning with Function Approximation](https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation), Sutton R. et al. (2000).
* [Approximately Optimal Approximate Reinforcement Learning](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/KakadeLangford-icml2002.pdf), Kakade S., Langford J. (2002).
* [On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift](https://arxiv.org/abs/1908.00261), Agarwal A. et al. (2019).
* [PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning](https://arxiv.org/abs/2007.08459), Agarwal A. et al. (2020).
* [Is the Policy Gradient a Gradient?](https://arxiv.org/abs/1906.07073), Nota C., Thomas P. S. (2020).

### Linear Systems

* [PAC Adaptive Control of Linear Systems](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.49.339&rep=rep1&type=pdf), Fiechter C.-N. (1997).
* **`OFU-LQ`** [Regret Bounds for the Adaptive Control of Linear Quadratic Systems](http://proceedings.mlr.press/v19/abbasi-yadkori11a/abbasi-yadkori11a.pdf), Abbasi-Yadkori Y., Szepesvari C. (2011).
* **`TS-LQ`** [Improved Regret Bounds for Thompson Sampling in Linear Quadratic Control Problems](http://proceedings.mlr.press/v80/abeille18a.html), Abeille M., Lazaric A. (2018).
* [Exploration-Exploitation with Thompson Sampling in Linear Systems](https://tel.archives-ouvertes.fr/tel-01816069/), Abeille M. (2017). (phd thesis)
* **`Coarse-Id`** [On the Sample Complexity of the Linear Quadratic Regulator](https://arxiv.org/abs/1710.01688), Dean S., Mania H., Matni N., Recht B., Tu S. (2017).
* [Regret Bounds for Robust Adaptive Control of the Linear Quadratic Regulator](http://papers.nips.cc/paper/7673-regret-bounds-for-robust-adaptive-control-of-the-linear-quadratic-regulator), Dean S. et al. (2018).
* [Robust exploration in linear quadratic reinforcement learning](https://papers.nips.cc/paper/9668-robust-exploration-in-linear-quadratic-reinforcement-learning), Umenberger J. et al. (2019).
* [Online Control with Adversarial Disturbances](https://arxiv.org/abs/1902.08721), Agarwal N. et al. (2019). ![Noise](https://img.shields.io/badge/noise-adversarial-red)![Costs](https://img.shields.io/badge/costs-convex-green)
* [Logarithmic Regret for Online Control](https://arxiv.org/abs/1909.05062), Agarwal N. et al. (2019). ![Noise](https://img.shields.io/badge/noise-adversarial-red)![Costs](https://img.shields.io/badge/costs-convex-green)

## Value-based :chart_with_upwards_trend:

* **`NFQ`** [Neural fitted Q iteration - First experiences with a data efficient neural Reinforcement Learning method](http://ml.informatik.uni-freiburg.de/former/_media/publications/rieecml05.pdf), Riedmiller M. (2005).
* **`DQN`** [Playing Atari with Deep Reinforcement Learning](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf), Mnih V. et al. (2013). [🎞️](https://www.youtube.com/watch?v=iqXKQf2BOSE)
* **`DDQN`** [Deep Reinforcement Learning with Double Q-learning](https://arxiv.org/abs/1509.06461), van Hasselt H., Silver D. et al. (2015).
* **`DDDQN`** [Dueling Network Architectures for Deep Reinforcement Learning](https://arxiv.org/abs/1511.06581), Wang Z. et al. (2015). [🎞️](https://www.youtube.com/watch?v=qJd3yaEN9Sw)
* **`PDDDQN`** [Prioritized Experience Replay](https://arxiv.org/abs/1511.05952), Schaul T. et al. (2015).
* **`NAF`** [Continuous Deep Q-Learning with Model-based Acceleration](https://arxiv.org/abs/1603.00748), Gu S. et al. (2016).
* **`Rainbow`** [Rainbow: Combining Improvements in Deep Reinforcement Learning](https://arxiv.org/abs/1710.02298), Hessel M. et al. (2017).
* **`Ape-X DQfD`** [Observe and Look Further: Achieving Consistent Performance on Atari](https://arxiv.org/abs/1805.11593), Pohlen T. et al. (2018). [🎞️](https://www.youtube.com/watch?v=-0xOdnoxAFo&index=4&list=PLnZpNNVLsMmOfqMwJLcpLpXKLr3yKZ8Ak)
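The value-based family above descends from fitted Q iteration; DQN's per-transition regression target, using a slowly updated target network $\theta^-$ and replayed experience, is

$$y_t = r_t + \gamma \max_{a'} Q_{\theta^-}(s_{t+1}, a'), \qquad \mathcal{L}(\theta) = \big( y_t - Q_{\theta}(s_t, a_t) \big)^2.$$

DDQN decouples action selection from evaluation, choosing $a' = \arg\max_{a'} Q_{\theta}(s_{t+1}, a')$ but scoring it with $Q_{\theta^-}$, to curb overestimation.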
## Policy-based :muscle:

### Policy gradient

* **`REINFORCE`** [Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf), Williams R. (1992).
* **`Natural Gradient`** [A Natural Policy Gradient](https://papers.nips.cc/paper/2073-a-natural-policy-gradient.pdf), Kakade S. (2002).
* [Policy Gradient Methods for Robotics](http://www.kyb.mpg.de/fileadmin/user_upload/files/publications/attachments/IROS2006-Peters_%5b0%5d.pdf), Peters J., Schaal S. (2006).
* **`TRPO`** [Trust Region Policy Optimization](https://arxiv.org/abs/1502.05477), Schulman J. et al. (2015). [🎞️](https://www.youtube.com/watch?v=KJ15iGGJFvQ)
* **`PPO`** [Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347), Schulman J. et al. (2017). [🎞️](https://www.youtube.com/watch?v=bqdjsmSoSgI)
* **`DPPO`** [Emergence of Locomotion Behaviours in Rich Environments](https://arxiv.org/abs/1707.02286), Heess N. et al. (2017). [🎞️](https://www.youtube.com/watch?v=hx_bgoTF7bs)
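The common ancestor of this subsection is the likelihood-ratio gradient: with returns $G_t$ and a baseline $b$ for variance reduction (REINFORCE),

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta} \Big[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \big( G_t - b(s_t) \big) \Big].$$

TRPO and PPO keep the same gradient direction but constrain (or clip) how far each update moves the policy.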
### Actor-critic

* **`AC`** [Policy Gradient Methods for Reinforcement Learning with Function Approximation](https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf), Sutton R. et al. (1999).
* **`NAC`** [Natural Actor-Critic](https://homes.cs.washington.edu/~todorov/courses/amath579/reading/NaturalActorCritic.pdf), Peters J. et al. (2005).
* **`DPG`** [Deterministic Policy Gradient Algorithms](http://proceedings.mlr.press/v32/silver14.pdf), Silver D. et al. (2014).
* **`DDPG`** [Continuous Control With Deep Reinforcement Learning](https://arxiv.org/abs/1509.02971), Lillicrap T. et al. (2015). [🎞️ 1](https://www.youtube.com/watch?v=lV5JhxsrSH8) | [2](https://www.youtube.com/watch?v=8CNck-hdys8) | [3](https://www.youtube.com/watch?v=xw73qehvSRQ) | [4](https://www.youtube.com/watch?v=vWxBmHRnQMI)
* **`MACE`** [Terrain-Adaptive Locomotion Skills Using Deep Reinforcement Learning](https://www.cs.ubc.ca/~van/papers/2016-TOG-deepRL/index.html), Peng X., Berseth G., van de Panne M. (2016). [🎞️](https://www.youtube.com/watch?v=KPfzRSBzNX4) | [🎞️](https://www.youtube.com/watch?v=A0BmHoujP9k)
* **`A3C`** [Asynchronous Methods for Deep Reinforcement Learning](https://arxiv.org/abs/1602.01783), Mnih V. et al. (2016). [🎞️ 1](https://www.youtube.com/watch?v=Ajjc08-iPx8) | [2](https://www.youtube.com/watch?v=0xo1Ldx3L5Q) | [3](https://www.youtube.com/watch?v=nMR5mjCFZCw)
* **`SAC`** [Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor](https://arxiv.org/abs/1801.01290), Haarnoja T. et al. (2018). [🎞️](https://vimeo.com/252185258)
* **`MPO`** [Maximum a Posteriori Policy Optimisation](https://arxiv.org/abs/1806.06920), Abdolmaleki A. et al. (2018).
* [A Deeper Look at Discounting Mismatch in Actor-Critic Algorithms](https://arxiv.org/abs/2010.01069), Zhang S., Laroche R. et al. (2020).

### Derivative-free

* **`CEM`** [Learning Tetris Using the Noisy Cross-Entropy Method](http://iew3.technion.ac.il/CE/files/papers/Learning%20Tetris%20Using%20the%20Noisy%20Cross-Entropy%20Method.pdf), Szita I., Lörincz A. (2006). [🎞️](https://www.youtube.com/watch?v=UZnDYGk1j2c)
* **`CMAES`** [Completely Derandomized Self-Adaptation in Evolution Strategies](https://dl.acm.org/citation.cfm?id=1108843), Hansen N., Ostermeier A. (2001).
* **`NEAT`** [Evolving Neural Networks through Augmenting Topologies](http://nn.cs.utexas.edu/downloads/papers/stanley.ec02.pdf), Stanley K. (2002). [🎞️](https://www.youtube.com/watch?v=5lJuEW-5vr8)
* **`iCEM`** [Sample-efficient Cross-Entropy Method for Real-time Planning](https://arxiv.org/abs/2008.06389), Pinneri C. et al. (2020).
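A minimal cross-entropy-method loop in the spirit of the derivative-free entries above, written against a toy quadratic objective; every name here is illustrative rather than taken from a cited codebase:

```python
import numpy as np

def cem(objective, dim, n_iters=50, pop=100, elite_frac=0.1, seed=0):
    """Cross-entropy method: refit a Gaussian to the elite samples each round."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(n_iters):
        samples = rng.normal(mean, std, size=(pop, dim))
        scores = np.array([objective(x) for x in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]  # best `n_elite` samples
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

# Maximize -||x - 3||^2: the returned mean should approach (3, 3, 3).
print(cem(lambda x: -np.sum((x - 3.0) ** 2), dim=3))
```

CMA-ES refines the same idea with a full covariance matrix and evolution paths; iCEM adds colored noise and warm starts for real-time planning.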
## Model-based :world_map:

* **`Dyna`** [Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.6983&rep=rep1&type=pdf), Sutton R. (1990).
* **`PILCO`** [PILCO: A Model-Based and Data-Efficient Approach to Policy Search](http://mlg.eng.cam.ac.uk/pub/pdf/DeiRas11.pdf), Deisenroth M., Rasmussen C. (2011). ([talk](https://www.youtube.com/watch?v=f7y60SEZfXc))
* **`DBN`** [Probabilistic MDP-behavior planning for cars](http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6082928), Brechtel S. et al. (2011).
* **`GPS`** [End-to-End Training of Deep Visuomotor Policies](https://arxiv.org/abs/1504.00702), Levine S. et al. (2015). [🎞️](https://www.youtube.com/watch?v=Q4bMcUk6pcw)
* **`DeepMPC`** [DeepMPC: Learning Deep Latent Features for Model Predictive Control](https://www.cs.stanford.edu/people/asaxena/papers/deepmpc_rss2015.pdf), Lenz I. et al. (2015). [🎞️](https://www.youtube.com/watch?v=BwA90MmkvPU)
* **`SVG`** [Learning Continuous Control Policies by Stochastic Value Gradients](https://arxiv.org/abs/1510.09142), Heess N. et al. (2015). [🎞️](https://www.youtube.com/watch?v=PYdL7bcn_cM)
* **`FARNN`** [Nonlinear Systems Identification Using Deep Dynamic Neural Networks](https://arxiv.org/abs/1610.01439), Ogunmolu O. et al. (2016). [:octocat:](https://github.com/lakehanne/FARNN)
* [Optimal control with learned local models: Application to dexterous manipulation](https://homes.cs.washington.edu/~todorov/papers/KumarICRA16.pdf), Kumar V. et al. (2016). [🎞️](https://www.youtube.com/watch?v=bD5z1I1TU3w)
* **`BPTT`** [Long-term Planning by Short-term Prediction](https://arxiv.org/abs/1602.01580), Shalev-Shwartz S. et al. (2016). [🎞️ 1](https://www.youtube.com/watch?v=Nqmv1anUaF4) | [2](https://www.youtube.com/watch?v=UgGZ9lMvey8)
* [Deep visual foresight for planning robot motion](https://arxiv.org/abs/1610.00696), Finn C., Levine S. (2016). [🎞️](https://www.youtube.com/watch?v=6k7GHG4IUCY)
* **`VIN`** [Value Iteration Networks](https://arxiv.org/abs/1602.02867), Tamar A. et al. (2016). [🎞️](https://www.youtube.com/watch?v=RcRkog93ZRU)
* **`VPN`** [Value Prediction Network](https://arxiv.org/abs/1707.03497), Oh J. et al. (2017).
* **`DistGBP`** [Model-Based Planning with Discrete and Continuous Actions](https://arxiv.org/abs/1705.07177), Henaff M. et al. (2017). [🎞️ 1](https://www.youtube.com/watch?v=9Xh2TRQ_4nM) | [2](https://www.youtube.com/watch?v=XLdme0TTjiw)
* [Prediction and Control with Temporal Segment Models](https://arxiv.org/abs/1703.04070), Mishra N. et al. (2017).
* **`Predictron`** [The Predictron: End-To-End Learning and Planning](https://arxiv.org/abs/1612.08810), Silver D. et al. (2017). [🎞️](https://www.youtube.com/watch?v=BeaLdaN2C3Q)
* **`MPPI`** [Information Theoretic MPC for Model-Based Reinforcement Learning](https://ieeexplore.ieee.org/document/7989202/), Williams G. et al. (2017). [:octocat:](https://github.com/ferreirafabio/mppi_pendulum) [🎞️](https://www.youtube.com/watch?v=f2at-cqaJMM)
* [Learning Real-World Robot Policies by Dreaming](https://arxiv.org/abs/1805.07813), Piergiovanni A. et al. (2018).
* [Coupled Longitudinal and Lateral Control of a Vehicle using Deep Learning](https://arxiv.org/abs/1810.09365), Devineau G., Polack P., Altché F., Moutarde F. (2018). [🎞️](https://www.youtube.com/watch?v=yyWy1uavlXs)
* **`PlaNet`** [Learning Latent Dynamics for Planning from Pixels](https://planetrl.github.io/), Hafner D. et al. (2018). [🎞️](https://www.youtube.com/watch?v=tZk1eof_VNA)
* **`NeuralLander`** [Neural Lander: Stable Drone Landing Control using Learned Dynamics](https://arxiv.org/abs/1811.08027), Shi G. et al. (2018). [🎞️](https://www.youtube.com/watch?v=FLLsG0S78ik)
* **`DBN+POMCP`** [Towards Human-Like Prediction and Decision-Making for Automated Vehicles in Highway Scenarios](https://tel.archives-ouvertes.fr/tel-02184362), Sierra Gonzalez D. (2019).
* [Planning with Goal-Conditioned Policies](https://sites.google.com/view/goal-planning), Nasiriany S. et al. (2019). [🎞️](https://sites.google.com/view/goal-planning#h.p_0m-H0QfKVj4n)
* **`MuZero`** [Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model](https://arxiv.org/abs/1911.08265), Schrittwieser J. et al. (2019). [:octocat:](https://github.com/werner-duvaud/muzero-general)
* **`BADGR`** [BADGR: An Autonomous Self-Supervised Learning-Based Navigation System](https://sites.google.com/view/badgr), Kahn G., Abbeel P., Levine S. (2020). [🎞️](https://www.youtube.com/watch?v=EMV0zEXbcc4) [:octocat:](https://github.com/gkahn13/badgr)
* **`H-UCRL`** [Efficient Model-Based Reinforcement Learning through Optimistic Policy Search and Planning](https://proceedings.neurips.cc/paper_files/paper/2020/hash/a36b598abb934e4528412e5a2127b931-Abstract.html), Curi S., Berkenkamp F., Krause A. (2020). [:octocat:](https://github.com/sebascuri/hucrl)
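A compact statement of the pattern most model-based entries above share, fit a one-step dynamics model to experience and then plan through it:

$$\hat{f} = \operatorname*{arg\,min}_{f} \sum_{(s_t, a_t, s_{t+1})} \big\| f(s_t, a_t) - s_{t+1} \big\|^2,$$

after which control comes either from trajectory optimization through $\hat{f}$ (MPPI, PlaNet) or from value and policy updates on imagined transitions (Dyna).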
## Exploration :tent:

* [Combating Reinforcement Learning's Sisyphean Curse with Intrinsic Fear](https://arxiv.org/abs/1611.01211), Lipton Z. et al. (2016).
* **`Pseudo-count`** [Unifying Count-Based Exploration and Intrinsic Motivation](https://arxiv.org/abs/1606.01868), Bellemare M. et al. (2016). [🎞️](https://www.youtube.com/watch?v=0yI2wJ6F8r0)
* **`HER`** [Hindsight Experience Replay](https://arxiv.org/abs/1707.01495), Andrychowicz M. et al. (2017). [🎞️](https://www.youtube.com/watch?v=Dz_HuzgMxzo)
* **`VHER`** [Visual Hindsight Experience Replay](https://arxiv.org/abs/1901.11529), Sahni H. et al. (2019).
* **`RND`** [Exploration by Random Network Distillation](https://arxiv.org/abs/1810.12894), Burda Y. et al. (OpenAI) (2018). [🎞️](https://openai.com/blog/reinforcement-learning-with-prediction-based-rewards/)
* **`Go-Explore`** [Go-Explore: a New Approach for Hard-Exploration Problems](https://arxiv.org/abs/1901.10995), Ecoffet A. et al. (Uber) (2018). [🎞️](https://www.youtube.com/watch?v=gnGyUPd_4Eo)
* **`C51-IDS`** [Information-Directed Exploration for Deep Reinforcement Learning](https://arxiv.org/abs/1812.07544), Nikolov N., Kirschner J., Berkenkamp F., Krause A. (2019). [:octocat:](https://github.com/nikonikolov/rltf)
* **`Plan2Explore`** [Planning to Explore via Self-Supervised World Models](https://ramanans1.github.io/plan2explore/), Sekar R. et al. (2020). [🎞️](https://www.youtube.com/watch?v=GftqnPWsCWw&feature=emb_title) [:octocat:](https://github.com/ramanans1/plan2explore)
* **`RIDE`** [RIDE: Rewarding Impact-Driven Exploration for Procedurally-Generated Environments](https://openreview.net/pdf?id=rkg-TJBFPB), Raileanu R., Rocktäschel T. (2020). [:octocat:](https://github.com/facebookresearch/impact-driven-exploration)
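A representative form of the intrinsic bonuses used above: augment the extrinsic reward with a count-based novelty term,

$$r^{+}_t = r_t + \frac{\beta}{\sqrt{N(s_t)}},$$

where pseudo-counts (Bellemare et al., 2016) derive $N(s)$ from a density model when states cannot be enumerated, and RND swaps the bonus for the prediction error against a fixed random network.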
## Hierarchy and Temporal Abstraction :clock2:

* [Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning](http://www-anw.cs.umass.edu/~barto/courses/cs687/Sutton-Precup-Singh-AIJ99.pdf), Sutton R. et al. (1999).
* [Intrinsically motivated learning of hierarchical collections of skills](http://www-anw.cs.umass.edu/pubs/2004/barto_sc_ICDL04.pdf), Barto A. et al. (2004).
* **`OC`** [The Option-Critic Architecture](https://arxiv.org/abs/1609.05140), Bacon P-L., Harb J., Precup D. (2016).
* [Learning and Transfer of Modulated Locomotor Controllers](https://arxiv.org/abs/1610.05182), Heess N. et al. (2016). [🎞️](https://www.youtube.com/watch?v=sboPYvhpraQ&feature=youtu.be)
* [Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving](https://arxiv.org/abs/1610.03295), Shalev-Shwartz S. et al. (2016).
* **`FuNs`** [FeUdal Networks for Hierarchical Reinforcement Learning](https://arxiv.org/abs/1703.01161), Vezhnevets A. et al. (2017).
* [Combining Neural Networks and Tree Search for Task and Motion Planning in Challenging Environments](https://arxiv.org/abs/1703.07887), Paxton C. et al. (2017). [🎞️](https://www.youtube.com/watch?v=MM2U_SGMtk8)
* **`DeepLoco`** [DeepLoco: Dynamic Locomotion Skills Using Hierarchical Deep Reinforcement Learning](https://www.cs.ubc.ca/~van/papers/2017-TOG-deepLoco/), Peng X. et al. (2017). [🎞️](https://www.youtube.com/watch?v=hd1yvLWm6oA) | [🎞️](https://www.youtube.com/watch?v=x-HrYko_MRU)
* [Hierarchical Policy Design for Sample-Efficient Learning of Robot Table Tennis Through Self-Play](https://arxiv.org/abs/1811.12927), Mahjourian R. et al. (2018). [🎞️](https://sites.google.com/view/robottabletennis)
* **`DAC`** [DAC: The Double Actor-Critic Architecture for Learning Options](https://arxiv.org/abs/1904.12691), Zhang S., Whiteson S. (2019).
* [Multi-Agent Manipulation via Locomotion using Hierarchical Sim2Real](https://arxiv.org/abs/1908.05224), Nachum O. et al. (2019). [🎞️](https://sites.google.com/view/manipulation-via-locomotion)
* [SoftCon: Simulation and Control of Soft-Bodied Animals with Biomimetic Actuators](http://mrl.snu.ac.kr/publications/ProjectSoftCon/SoftCon.html), Min S. et al. (2020). [🎞️](https://www.youtube.com/watch?v=I2ylkhPSkT4) [:octocat:](https://github.com/seiing/SoftCon)
* **`H-REIL`** [Reinforcement Learning based Control of Imitative Policies for Near-Accident Driving](http://iliad.stanford.edu/pdfs/publications/cao2020reinforcement.pdf), Cao Z. et al. (2020). [🎞️ 1](https://www.youtube.com/watch?v=CY24zlC_HdI&feature=youtu.be), [2](https://www.youtube.com/watch?v=envT7b5YRts&feature=youtu.be)

## Partial Observability :eye:

* **`PBVI`** [Point-based Value Iteration: An anytime algorithm for POMDPs](https://www.ri.cmu.edu/pub_files/pub4/pineau_joelle_2003_3/pineau_joelle_2003_3.pdf), Pineau J. et al. (2003).
* **`cPBVI`** [Point-Based Value Iteration for Continuous POMDPs](http://www.jmlr.org/papers/volume7/porta06a/porta06a.pdf), Porta J. et al. (2006).
* **`POMCP`** [Monte-Carlo Planning in Large POMDPs](https://papers.nips.cc/paper/4031-monte-carlo-planning-in-large-pomdps), Silver D., Veness J. (2010).
* [A POMDP Approach to Robot Motion Planning under Uncertainty](http://users.isr.ist.utl.pt/~mtjspaan/POMDPPractioners/pomdp2010_submission_5.pdf), Du Y. et al. (2010).
* [Probabilistic Online POMDP Decision Making for Lane Changes in Fully Automated Driving](https://users.cs.duke.edu/~pdinesh/sources/06728533.pdf), Ulbrich S., Maurer M. (2013).
* [Solving Continuous POMDPs: Value Iteration with Incremental Learning of an Efficient Space Representation](http://proceedings.mlr.press/v28/brechtel13.pdf), Brechtel S. et al. (2013).
* [Probabilistic Decision-Making under Uncertainty for Autonomous Driving using Continuous POMDPs](http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6957722), Brechtel S. et al. (2014).
* **`MOMDP`** [Intention-Aware Motion Planning](http://ares.lids.mit.edu/fm/documents/intentionawaremotionplanning.pdf), Bandyopadhyay T. et al. (2013).
* **`DNC`** [Hybrid computing using a neural network with dynamic external memory](https://www.nature.com/articles/nature20101), Graves A. et al. (2016). [🎞️](https://www.youtube.com/watch?v=B9U8sI7TcMY)
* [The value of inferring the internal state of traffic participants for autonomous freeway driving](https://arxiv.org/abs/1702.00858), Sunberg Z. et al. (2017).
* [Belief State Planning for Autonomously Navigating Urban Intersections](https://arxiv.org/abs/1704.04322), Bouton M., Cosgun A., Kochenderfer M. (2017).
* [Scalable Decision Making with Sensor Occlusions for Autonomous Driving](https://ieeexplore.ieee.org/document/8460914), Bouton M. et al. (2018).
* [Probabilistic Decision-Making at Road Intersections: Formulation and Quantitative Evaluation](https://hal.inria.fr/hal-01940392), Barbier M., Laugier C., Simonin O., Ibanez J. (2018).
* [Beauty and the Beast: Optimal Methods Meet Learning for Drone Racing](https://arxiv.org/abs/1810.06224), Kaufmann E. et al. (2018). [🎞️](https://www.youtube.com/watch?v=UuQvijZcUSc)
* **`social perception`** [Behavior Planning of Autonomous Cars with Social Perception](https://arxiv.org/abs/1905.00988), Sun L. et al. (2019).
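The shared machinery behind the POMDP planners above is Bayesian belief filtering: after taking action $a$ in belief $b$ and observing $o$,

$$b'(s') \propto O(o \mid s', a) \sum_{s} T(s' \mid s, a) \, b(s).$$

Point-based methods (PBVI) back up values over a sampled set of beliefs, while POMCP runs UCT in the belief tree with particle filters standing in for $b$.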
[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=DKe8FumoD4E)\n* **`GrBAL \u002F ReBAL`** [Learning to Adapt in Dynamic, Real-World Environments Through Meta-Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1803.11347), Nagabandi A. et al. (2018). [🎞️](https:\u002F\u002Fsites.google.com\u002Fberkeley.edu\u002Fmetaadaptivecontrol)\n* [Learning agile and dynamic motor skills for legged robots](https:\u002F\u002Frobotics.sciencemag.org\u002Fcontent\u002F4\u002F26\u002Feaau5872), Hwangbo J. et al. (ETH Zurich \u002F Intel ISL) (2019). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=ITfBKjBH46E)\n* [Robust Recovery Controller for a Quadrupedal Robot using Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1901.07517), Lee J., Hwangbo J., Hutter M. (ETH Zurich RSL) (2019)\n* **`IT&E`** [Learning and adapting quadruped gaits with the \"Intelligent Trial & Error\" algorithm](https:\u002F\u002Fhal.inria.fr\u002Fhal-02084619), Dalin E., Desreumaux P., Mouret J-B. (2019). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=v90CWJ_HsnM)\n* **`FAMLE`** [Fast Online Adaptation in Robotics through Meta-Learning Embeddings of Simulated Priors](https:\u002F\u002Farxiv.org\u002Fabs\u002F2003.04663), Kaushik R., Anne T., Mouret J-B. (2020). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=QIY1Sm7wHhE)\n* [Robust Deep Reinforcement Learning against Adversarial Perturbations on Observations](https:\u002F\u002Farxiv.org\u002Fabs\u002F2003.08938), Zhang H. et al (2020). [:octocat:](https:\u002F\u002Fgithub.com\u002Fchenhongge\u002FStateAdvDRL)\n* [Learning quadrupedal locomotion over challenging terrain](https:\u002F\u002Frobotics.sciencemag.org\u002Fcontent\u002F5\u002F47\u002Feabc5986), Lee J. et al. (2020). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=9j2a1oAHDL8)\n* **`PACOH`** [PACOH: Bayes-Optimal Meta-Learning with PAC-Guarantees](https:\u002F\u002Farxiv.org\u002Fabs\u002F2002.05551), Rothfuss J., Fortuin V., Josifoski M., Krause A. (2021).\n* [Model-Based Domain Generalization](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.11436), Robey A. et al. (2021).\n* **`SimGAN`** [SimGAN: Hybrid Simulator Identification for Domain Adaptation via Adversarial Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2101.06005), Jiang Y. et al. (2021). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=McKOGllO7nc&feature=youtu.be) [:octocat:](https:\u002F\u002Fgithub.com\u002Fjyf588\u002FSimGAN)\n* [Learning robust perceptive locomotion for quadrupedal robots in the wild](https:\u002F\u002Fleggedrobotics.github.io\u002Frl-perceptiveloco\u002F), Miki T. et al. (2022).\n\n## Multi-agent :two_men_holding_hands:\n\n* **`Minimax-Q`** [Markov games as a framework for multi-agent reinforcement learning](https:\u002F\u002Fwww.cs.rutgers.edu\u002F~mlittman\u002Fpapers\u002Fml94-final.pdf), M. Littman (1994).\n* [Autonomous Agents Modelling Other Agents: A Comprehensive Survey and Open Problems](https:\u002F\u002Farxiv.org\u002Fabs\u002F1709.08071), Albrecht S., Stone P. (2017).\n* **`MILP`** [Time-optimal coordination of mobile robots along specified paths](https:\u002F\u002Farxiv.org\u002Fabs\u002F1603.04610), Altché F. et al. (2016). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=RiW2OFsdHOY)\n* **`MIQP`** [An Algorithm for Supervised Driving of Cooperative Semi-Autonomous Vehicles](https:\u002F\u002Farxiv.org\u002Fabs\u002F1706.08046), Altché F. et al. (2017). 
[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=JJZKfHMUeCI)\n* **`SA-CADRL`** [Socially Aware Motion Planning with Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1703.08862), Chen Y. et al. (2017). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=CK1szio7PyA)\n* [Multipolicy decision-making for autonomous driving via changepoint-based behavior prediction: Theory and experiment](https:\u002F\u002Flink.springer.com\u002Farticle\u002F10.1007\u002Fs10514-017-9619-z), Galceran E. et al. (2017).\n* [Online decision-making for scalable autonomous systems](https:\u002F\u002Fwww.ijcai.org\u002Fproceedings\u002F2017\u002F664), Wray K. et al. (2017).\n* **`MAgent`** [MAgent: A Many-Agent Reinforcement Learning Platform for Artificial Collective Intelligence](https:\u002F\u002Farxiv.org\u002Fabs\u002F1712.00600), Zheng L. et al. (2017). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=HCSm0kVolqI)\n* [Cooperative Motion Planning for Non-Holonomic Agents with Value Iteration Networks](https:\u002F\u002Farxiv.org\u002Fabs\u002F1709.05273), Rehder E. et al. (2017).\n* **`MPPO`** [Towards Optimally Decentralized Multi-Robot Collision Avoidance via Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1709.10082), Long P. et al. (2017). [🎞️](https:\u002F\u002Fsites.google.com\u002Fview\u002Fdrlmaca)\n* **`COMA`** [Counterfactual Multi-Agent Policy Gradients](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.08926), Foerster J. et al. (2017).\n* **`MADDPG`** [Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments](https:\u002F\u002Farxiv.org\u002Fabs\u002F1706.02275), Lowe R. et al. (2017). [:octocat:](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmaddpg)\n* **`FTW`** [Human-level performance in first-person multiplayer games with population-based deep reinforcement learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1807.01281), Jaderberg M. et al. (2018). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=dltN4MxV1RI)\n* [Towards Learning Multi-agent Negotiations via Self-Play](https:\u002F\u002Farxiv.org\u002Fabs\u002F2001.10208), Tang Y. C. (2020).\n* **`MAPPO`** [The Surprising Effectiveness of MAPPO in Cooperative, Multi-Agent Games](https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.01955), Yu C. et al. (2021). [:octocat:](https:\u002F\u002Fgithub.com\u002Fmarlbenchmark\u002Fon-policy)\n* [Many-agent Reinforcement Learning](https:\u002F\u002Fdiscovery.ucl.ac.uk\u002Fid\u002Feprint\u002F10124273\u002F), Yang Y. (2021).\n\n\n## Representation Learning\n\n* [Variable Resolution Discretization in Optimal Control](https:\u002F\u002Frd.springer.com\u002Fcontent\u002Fpdf\u002F10.1023%2FA%3A1017992615625.pdf), Munos R., Moore A. (2002). [🎞️](http:\u002F\u002Fresearchers.lille.inria.fr\u002F~munos\u002Fvariable\u002Findex.html)\n* **`DeepDriving`** [DeepDriving: Learning Affordance for Direct Perception in Autonomous Driving](http:\u002F\u002Fdeepdriving.cs.princeton.edu\u002Fpaper.pdf), Chen C. et al. (2015). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=5hFvoXV9gII)\n* [On the Sample Complexity of End-to-end Training vs. Semantic Abstraction Training](https:\u002F\u002Farxiv.org\u002Fabs\u002F1604.06915), Shalev-Shwartz S. et al. (2016).\n* [Learning sparse representations in reinforcement learning with sparse coding](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.08316), Le L., Kumaraswamy M., White M. (2017).\n* [World Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F1803.10122), Ha D., Schmidhuber J. (2018).
[🎞️](https:\u002F\u002Fworldmodels.github.io\u002F) [:octocat:](https:\u002F\u002Fgithub.com\u002Fctallec\u002Fworld-models)\n* [Learning to Drive in a Day](https:\u002F\u002Farxiv.org\u002Fabs\u002F1807.00412), Kendall A. et al. (2018). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=eRwTbRtnT1I)\n* **`MERLIN`** [Unsupervised Predictive Memory in a Goal-Directed Agent](https:\u002F\u002Farxiv.org\u002Fabs\u002F1803.10760), Wayne G. et al. (2018). [🎞️ 1](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=YFx-D4eEs5A) | [2](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=IiR_NOomcpk) | [3](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=dQMKJtLScmk) | [4](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=xrYDlTXyC6Q) | [5](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=04H28-qA3f8) | [6](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=3iA19h0Vvq0)\n* [Variational End-to-End Navigation and Localization](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.10119), Amini A. et al. (2018). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=aXI4a_Nvcew)\n* [Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1810.10191.pdf), Lee M. et al. (2018). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=TjwDJ_R2204)\n* [Deep Neuroevolution of Recurrent and Discrete World Models](http:\u002F\u002Fsebastianrisi.com\u002Fwp-content\u002Fuploads\u002Frisi_gecco19.pdf), Risi S., Stanley K.O. (2019). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=a-tcsnZe-yE) [:octocat:](https:\u002F\u002Fgithub.com\u002Fsebastianrisi\u002Fga-world-models)\n* **`FERM`** [A Framework for Efficient Robotic Manipulation](https:\u002F\u002Fsites.google.com\u002Fview\u002Fefficient-robotic-manipulation), Zhan A., Zhao R. et al. (2021). [:octocat:](https:\u002F\u002Fgithub.com\u002FPhilipZRH\u002Fferm)\n* **`S4RL`** [S4RL: Surprisingly Simple Self-Supervision for Offline Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.06326), Sinha S. et al (2021).\n\n## Offline\n\n* **`SPI-BB`** [Safe Policy Improvement with Baseline Bootstrapping](https:\u002F\u002Farxiv.org\u002Fabs\u002F1712.06924), Laroche R. et al (2019).\n* **`AWAC`** [AWAC: Accelerating Online Reinforcement Learning with Offline Datasets](https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.09359), Nair A. et al (2020).\n* **`CQL`** [Conservative Q-Learning for Offline Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.04779), Kumar A. et al. (2020).\n* [Decision Transformer: Reinforcement Learning via Sequence Modeling](https:\u002F\u002Fsites.google.com\u002Fberkeley.edu\u002Fdecision-transformer), Chen L., Lu K. et al. (2021). [:octocat:](https:\u002F\u002Fgithub.com\u002Fkzl\u002Fdecision-transformer)\n* [Reinforcement Learning as One Big Sequence Modeling Problem](https:\u002F\u002Ftrajectory-transformer.github.io\u002F), Janner M., Li Q., Levine S. (2021).\n\n## Other\n\n* [Is the Bellman residual a bad proxy?](https:\u002F\u002Farxiv.org\u002Fabs\u002F1606.07636), Geist M., Piot B., Pietquin O. (2016).\n* [Deep Reinforcement Learning that Matters](https:\u002F\u002Farxiv.org\u002Fabs\u002F1709.06560), Henderson P. et al. (2017).\n* [Automatic Bridge Bidding Using Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1607.03290), Yeh C. and Lin H. (2016).\n* [Shared Autonomy via Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1802.01744), Reddy S. et al. (2018). 
[🎞️](https:\u002F\u002Fsites.google.com\u002Fview\u002Fdeep-assist)\n* [Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review](https:\u002F\u002Farxiv.org\u002Fabs\u002F1805.00909), Levine S. (2018).\n* [The Value Function Polytope in Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1901.11524), Dadashi R. et al. (2019).\n* [On Value Functions and the Agent-Environment Boundary](https:\u002F\u002Farxiv.org\u002Fabs\u002F1905.13341), Jiang N. (2019).\n* [How to Train Your Robot with Deep Reinforcement Learning; Lessons We've Learned](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.02915), Ibarz J. et al. (2021).\n\n\n# Learning from Demonstrations :mortar_board:\n\n## Imitation Learning\n\n* **`DAgger`** [A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning](https:\u002F\u002Fwww.cs.cmu.edu\u002F~sross1\u002Fpublications\u002FRoss-AIStats11-NoRegret.pdf), Ross S., Gordon G., Bagnell J. A. (2011).\n* **`QMDP-RCNN`** [Reinforcement Learning via Recurrent Convolutional Neural Networks](https:\u002F\u002Farxiv.org\u002Fabs\u002F1701.02392), Shankar T. et al. (2016). ([talk](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=gpwA3QNTPOQ))\n* **`DQfD`** [Learning from Demonstrations for Real World Reinforcement Learning](https:\u002F\u002Fpdfs.semanticscholar.org\u002Fa7fb\u002F199f85943b3fb6b5f7e9f1680b2e2a445cce.pdf), Hester T. et al. (2017). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=JR6wmLaYuu4&list=PLdjpGm3xcO-0aqVf--sBZHxCKg-RZfa5T)\n* [Find Your Own Way: Weakly-Supervised Segmentation of Path Proposals for Urban Autonomy](https:\u002F\u002Farxiv.org\u002Fabs\u002F1610.01238), Barnes D., Maddern W., Posner I. (2016). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=rbZ8ck_1nZk)\n* **`GAIL`** [Generative Adversarial Imitation Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1606.03476), Ho J., Ermon S. (2016).\n* [From perception to decision: A data-driven approach to end-to-end motion planning for autonomous ground robots](https:\u002F\u002Farxiv.org\u002Fabs\u002F1609.07910), Pfeiffer M. et al. (2017). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=ZedKmXzwdgI)\n* **`Branched`** [End-to-end Driving via Conditional Imitation Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.02410), Codevilla F. et al. (2017). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=cFtnflNe5fM) | [talk](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=KunVjVHN3-U)\n* **`UPN`** [Universal Planning Networks](https:\u002F\u002Farxiv.org\u002Fabs\u002F1804.00645), Srinivas A. et al. (2018). [🎞️](https:\u002F\u002Fsites.google.com\u002Fview\u002Fupn-public\u002Fhome)\n* **`DeepMimic`** [DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills](https:\u002F\u002Fxbpeng.github.io\u002Fprojects\u002FDeepMimic\u002Findex.html), Peng X. B. et al. (2018). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=vppFvq2quQ0&feature=youtu.be)\n* **`R2P2`** [Deep Imitative Models for Flexible Inference, Planning, and Control](https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.06544), Rhinehart N. et al. (2018). [🎞️](https:\u002F\u002Fsites.google.com\u002Fview\u002Fimitativeforecastingcontrol)\n* [Learning Agile Robotic Locomotion Skills by Imitating Animals](https:\u002F\u002Fxbpeng.github.io\u002Fprojects\u002FRobotic_Imitation\u002Findex.html), Peng X. B. et al. (2020).
[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=lKYh6uuCwRY)\n* [Deep Imitative Models for Flexible Inference, Planning, and Control](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Skl4mRNYDr), Rhinehart N., McAllister R., Levine S. (2020).\n\n### Applications to Autonomous Driving :car:\n\n* [ALVINN, an autonomous land vehicle in a neural network](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F95-alvinn-an-autonomous-land-vehicle-in-a-neural-network), Pomerleau D. (1989).\n* [End to End Learning for Self-Driving Cars](https:\u002F\u002Farxiv.org\u002Fabs\u002F1604.07316), Bojarski M. et al. (2016). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=qhUvQiKec2U)\n* [End-to-end Learning of Driving Models from Large-scale Video Datasets](https:\u002F\u002Farxiv.org\u002Fabs\u002F1612.01079), Xu H., Gao Y. et al. (2016). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=jxlNfUzbGAY)\n* [End-to-End Deep Learning for Steering Autonomous Vehicles Considering Temporal Dependencies](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.03804), Eraqi H. et al. (2017).\n* [Driving Like a Human: Imitation Learning for Path Planning using Convolutional Neural Networks](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FDriving-Like-a-Human%3A-Imitation-Learning-for-Path-Rehder-Quehl\u002Fa1150417083918c3f5f88b7ddad8841f2ce88188), Rehder E. et al. (2017).\n* [Imitating Driver Behavior with Generative Adversarial Networks](https:\u002F\u002Farxiv.org\u002Fabs\u002F1701.06699), Kuefler A. et al. (2017).\n* **`PS-GAIL`** [Multi-Agent Imitation Learning for Driving Simulation](https:\u002F\u002Farxiv.org\u002Fabs\u002F1803.01044), Bhattacharyya R. et al. (2018). [🎞️](https:\u002F\u002Fgithub.com\u002Fsisl\u002Fngsim_env\u002Fblob\u002Fmaster\u002Fmedia\u002Fsingle_multi_model_2_seed_1.gif) [:octocat:](https:\u002F\u002Fgithub.com\u002Fsisl\u002Fngsim_env)\n* [Deep Imitation Learning for Autonomous Driving in Generic Urban Scenarios with Enhanced Safety](https:\u002F\u002Farxiv.org\u002Fabs\u002F1903.00640), Chen J. et al. (2019).
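\n\nMost of the end-to-end driving entries above start from plain behavioral cloning: fit a policy to expert state-action pairs by supervised regression. A minimal sketch (the synthetic data, linear policy and NumPy least-squares fit are illustrative assumptions, not the setup of any cited paper):\n\n```python\nimport numpy as np\n\n# Synthetic expert demonstrations: states (e.g. lane features) and steering commands.\nrng = np.random.default_rng(0)\nstates = rng.normal(size=(1000, 4))         # N x state_dim\nexpert_w = np.array([0.8, -0.5, 0.1, 0.3])  # hidden expert weights, used only to generate data\nactions = states @ expert_w + 0.01 * rng.normal(size=1000)\n\n# Behavioral cloning with a linear policy: least-squares fit of actions to states.\nw, *_ = np.linalg.lstsq(states, actions, rcond=None)\nprint('cloned policy weights:', np.round(w, 2))\n```\n\nDAgger-style methods (cited in the Imitation Learning list above) then address the compounding errors of such cloned policies by repeatedly querying the expert on the states the learner itself visits.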
\n\n## Inverse Reinforcement Learning\n\n* **`Projection`** [Apprenticeship learning via inverse reinforcement learning](http:\u002F\u002Fai.stanford.edu\u002F~ang\u002Fpapers\u002Ficml04-apprentice.pdf), Abbeel P., Ng A. (2004).\n* **`MMP`** [Maximum margin planning](https:\u002F\u002Fwww.ri.cmu.edu\u002Fpub_files\u002Fpub4\u002Fratliff_nathan_2006_1\u002Fratliff_nathan_2006_1.pdf), Ratliff N. et al. (2006).\n* **`BIRL`** [Bayesian inverse reinforcement learning](https:\u002F\u002Fwww.aaai.org\u002FPapers\u002FIJCAI\u002F2007\u002FIJCAI07-416.pdf), Ramachandran D., Amir E. (2007).\n* **`MEIRL`** [Maximum Entropy Inverse Reinforcement Learning](https:\u002F\u002Fwww.aaai.org\u002FPapers\u002FAAAI\u002F2008\u002FAAAI08-227.pdf), Ziebart B. et al. (2008).\n* **`LEARCH`** [Learning to search: Functional gradient techniques for imitation learning](https:\u002F\u002Fwww.ri.cmu.edu\u002Fpub_files\u002F2009\u002F7\u002Flearch.pdf), Ratliff N., Silver D., Bagnell A. (2009).\n* **`CIOC`** [Continuous Inverse Optimal Control with Locally Optimal Examples](http:\u002F\u002Fgraphics.stanford.edu\u002Fprojects\u002Fcioc\u002F), Levine S., Koltun V. (2012). [🎞️](http:\u002F\u002Fgraphics.stanford.edu\u002Fprojects\u002Fcioc\u002Fcioc.mp4)\n* **`MEDIRL`** [Maximum Entropy Deep Inverse Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1507.04888), Wulfmeier M. (2015).\n* **`GCL`** [Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization](https:\u002F\u002Farxiv.org\u002Fabs\u002F1603.00448), Finn C. et al. (2016). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=hXxaepw0zAw)\n* **`RIRL`** [Repeated Inverse Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.05427), Amin K. et al. (2017).\n* [Bridging the Gap Between Imitation Learning and Inverse Reinforcement Learning](http:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F7464854\u002F), Piot B. et al. (2017).\n\n### Applications to Autonomous Driving :taxi:\n\n* [Apprenticeship Learning for Motion Planning, with Application to Parking Lot Navigation](http:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F4651222\u002F), Abbeel P. et al. (2008).\n* [Navigate like a cabbie: Probabilistic reasoning from observed context-aware behavior](http:\u002F\u002Fwww.cs.cmu.edu\u002F~bziebart\u002Fpublications\u002Fnavigate-bziebart.pdf), Ziebart B. et al. (2008).\n* [Planning-based Prediction for Pedestrians](http:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F5354147\u002F), Ziebart B. et al. (2009). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=XOZ69Bg4JKg)\n* [Learning for autonomous navigation](https:\u002F\u002Fwww.ri.cmu.edu\u002Fpub_files\u002F2010\u002F6\u002FLearning%20for%20Autonomous%20Navigation-%20Advances%20in%20Machine%20Learning%20for%20Rough%20Terrain%20Mobility.pdf), Bagnell A. et al. (2010).\n* [Learning Autonomous Driving Styles and Maneuvers from Expert Demonstration](https:\u002F\u002Fwww.ri.cmu.edu\u002Fpub_files\u002F2012\u002F6\u002Fiser12.pdf), Silver D. et al. (2012).\n* [Learning Driving Styles for Autonomous Vehicles from Demonstration](http:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F7139555\u002F), Kuderer M. et al. (2015).\n* [Learning to Drive using Inverse Reinforcement Learning and Deep Q-Networks](https:\u002F\u002Farxiv.org\u002Fabs\u002F1612.03653), Sharifzadeh S. et al. (2016).\n* [Watch This: Scalable Cost-Function Learning for Path Planning in Urban Environments](https:\u002F\u002Farxiv.org\u002Fabs\u002F1607.02329), Wulfmeier M. (2016). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Sdfir_1T-UQ)\n* [Planning for Autonomous Cars that Leverage Effects on Human Actions](https:\u002F\u002Frobotics.eecs.berkeley.edu\u002F~sastry\u002Fpubs\u002FPdfs%20of%202016\u002FSadighPlanning2016.pdf), Sadigh D. et al. (2016).\n* [A Learning-Based Framework for Handling Dilemmas in Urban Automated Driving](http:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F7989172\u002F), Lee S., Seo S. (2017).\n* [Learning Trajectory Prediction with Continuous Inverse Optimal Control via Langevin Sampling of Energy-Based Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F1904.05453), Xu Y. et al. (2019).\n* [Analyzing the Suitability of Cost Functions for Explaining and Imitating Human Driving Behavior based on Inverse Reinforcement Learning](https:\u002F\u002Fras.papercept.net\u002Fproceedings\u002FICRA20\u002F0320.pdf), Naumann M. et al. (2020).
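\n\nThe apprenticeship-learning and maximum-entropy entries above recover a reward that is linear in features by matching the expert's feature expectations. A toy sketch of the max-ent gradient over an enumerable set of candidate trajectories (all quantities synthetic and purely illustrative):\n\n```python\nimport numpy as np\n\n# Feature vector of each candidate trajectory, plus a few observed expert choices.\nrng = np.random.default_rng(1)\nfeats = rng.normal(size=(50, 3))\nmu_expert = feats[[3, 7, 7, 12]].mean(axis=0)\n\ntheta = np.zeros(3)  # linear reward weights\nfor _ in range(200):\n    p = np.exp(feats @ theta)            # max-ent trajectory distribution, proportional to exp(reward)\n    p = np.divide(p, p.sum())\n    theta += 0.1 * (mu_expert - p @ feats)  # gradient: expert minus model feature expectations\nprint('recovered reward weights:', np.round(theta, 2))\n```\n\nIn the papers above the model's feature expectation is computed on a real MDP (by soft value iteration or sampling) rather than by enumerating trajectories.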
\n\n\n\n# Motion Planning :running_man:\n\n## Search\n\n* **`Dijkstra`** [A Note on Two Problems in Connexion with Graphs](http:\u002F\u002Fwww-m3.ma.tum.de\u002Ffoswiki\u002Fpub\u002FMN0506\u002FWebHome\u002Fdijkstra.pdf), Dijkstra E. W. (1959).\n* **`A*`** [A Formal Basis for the Heuristic Determination of Minimum Cost Paths](http:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F4082128\u002F), Hart P. et al. (1968).\n* [Planning Long Dynamically-Feasible Maneuvers For Autonomous Vehicles](https:\u002F\u002Fwww.cs.cmu.edu\u002F~maxim\u002Ffiles\u002Fplanlongdynfeasmotions_rss08.pdf), Likhachev M., Ferguson D. (2008).\n* [Optimal Trajectory Generation for Dynamic Street Scenarios in a Frenet Frame](https:\u002F\u002Fwww.researchgate.net\u002Fprofile\u002FMoritz_Werling\u002Fpublication\u002F224156269_Optimal_Trajectory_Generation_for_Dynamic_Street_Scenarios_in_a_Frenet_Frame\u002Flinks\u002F54f749df0cf210398e9277af.pdf), Werling M., Kammel S. (2010). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Cj6tAQe7UCY)\n* [3D perception and planning for self-driving and cooperative automobiles](http:\u002F\u002Fwww.mrt.kit.edu\u002Fz\u002Fpubl\u002Fdownload\u002F2012\u002FStillerZiegler2012SSD.pdf), Stiller C., Ziegler J. (2012).\n* [Motion Planning under Uncertainty for On-Road Autonomous Driving](https:\u002F\u002Fwww.ri.cmu.edu\u002Fpub_files\u002F2014\u002F6\u002FICRA14_0863_Final.pdf), Xu W. et al. (2014).\n* [Monte Carlo Tree Search for Simulated Car Racing](http:\u002F\u002Fjulian.togelius.com\u002FFischer2015Monte.pdf), Fischer J. et al. (2015). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=GbUMssvolvU)\n\n## Sampling\n\n* **`RRT*`** [Sampling-based Algorithms for Optimal Motion Planning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1105.1186), Karaman S., Frazzoli E. (2011). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=p3nZHnOWhrg)\n* **`LQG-MP`** [LQG-MP: Optimized Path Planning for Robots with Motion Uncertainty and Imperfect State Information](https:\u002F\u002Fpeople.eecs.berkeley.edu\u002F~pabbeel\u002Fpapers\u002FvandenBergAbbeelGoldberg_RSS2010.pdf), van den Berg J. et al. (2010).\n* [Motion Planning under Uncertainty using Differential Dynamic Programming in Belief Space](http:\u002F\u002Frll.berkeley.edu\u002F~sachin\u002Fpapers\u002FBerg-ISRR2011.pdf), van den Berg J. et al. (2011).\n* [Rapidly-exploring Random Belief Trees for Motion Planning Under Uncertainty](https:\u002F\u002Fgroups.csail.mit.edu\u002Frrg\u002Fpapers\u002Fabry_icra11.pdf), Bry A., Roy N. (2011).\n* **`PRM-RL`** [PRM-RL: Long-range Robotic Navigation Tasks by Combining Reinforcement Learning and Sampling-based Planning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.03937), Faust A. et al. (2017).\n\n## Optimization\n\n* [Learning Attractor Landscapes for Learning Motor Primitives](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F2140-learning-attractor-landscapes-for-learning-motor-primitives.pdf), Ijspeert A. et al. (2002).\n* [Trajectory planning for Bertha - A local, continuous method](https:\u002F\u002Fpdfs.semanticscholar.org\u002Fbdca\u002F7fe83f8444bb4e75402a417053519758d36b.pdf), Ziegler J. et al. (2014).\n* [Online Motion Planning based on Nonlinear Model Predictive Control with Non-Euclidean Rotation Groups](https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.03534), Rösmann C. et al. (2020). [:octocat:](https:\u002F\u002Fgithub.com\u002Frst-tu-dortmund\u002Fmpc_local_planner)
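\n\nThe optimization-based planners above share one pattern: start from an initial guess and descend a smoothness-plus-constraints cost. A minimal gradient-descent sketch on a 1-D path with fixed endpoints (the quadratic smoothness cost and step size are illustrative assumptions):\n\n```python\nimport numpy as np\n\n# A noisy initial 1-D path; the endpoints act as boundary constraints.\nrng = np.random.default_rng(2)\npath = np.linspace(0.0, 1.0, 20) + 0.1 * rng.normal(size=20)\npath[0], path[-1] = 0.0, 1.0\n\n# Minimize the sum of squared second differences (a discrete acceleration penalty)\n# by gradient descent on the interior points only.\nfor _ in range(500):\n    acc = path[:-2] - 2 * path[1:-1] + path[2:]\n    grad = np.zeros_like(path)\n    grad[:-2] += acc\n    grad[1:-1] -= 2 * acc\n    grad[2:] += acc\n    path[1:-1] -= 0.05 * grad[1:-1]\nprint('smoothed path:', np.round(path, 3))\n```\n\nReal planners such as those cited above add obstacle, curvature and dynamics terms to the same kind of objective.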
\n\n## Reactive\n\n* **`PF`** [Real-time obstacle avoidance for manipulators and mobile robots](http:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F1087247\u002F), Khatib O. (1986).\n* **`VFH`** [The Vector Field Histogram - Fast Obstacle Avoidance For Mobile Robots](http:\u002F\u002Fieeexplore.ieee.org\u002Fstamp\u002Fstamp.jsp?arnumber=88137), Borenstein J. (1991).\n* **`VFH+`** [VFH+: Reliable Obstacle Avoidance for Fast Mobile Robots](http:\u002F\u002Fciteseerx.ist.psu.edu\u002Fviewdoc\u002Fdownload?doi=10.1.1.438.3464&rep=rep1&type=pdf), Ulrich I., Borenstein J. (1998).\n* **`Velocity Obstacles`** [Motion planning in dynamic environments using velocity obstacles](http:\u002F\u002Fciteseerx.ist.psu.edu\u002Fviewdoc\u002Fdownload?doi=10.1.1.56.6352&rep=rep1&type=pdf), Fiorini P., Shiller Z. (1998).\n\n## Architecture and applications\n\n* [A Review of Motion Planning Techniques for Automated Vehicles](http:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F7339478\u002F), González D. et al. (2016).\n* [A Survey of Motion Planning and Control Techniques for Self-driving Urban Vehicles](https:\u002F\u002Farxiv.org\u002Fabs\u002F1604.07446), Paden B. et al. (2016).\n* [Autonomous driving in urban environments: Boss and the Urban Challenge](https:\u002F\u002Fwww.ri.cmu.edu\u002Fpublications\u002Fautonomous-driving-in-urban-environments-boss-and-the-urban-challenge\u002F), Urmson C. et al. (2008).\n* [The MIT-Cornell collision and why it happened](http:\u002F\u002Fonlinelibrary.wiley.com\u002Fdoi\u002F10.1002\u002Frob.20266\u002Fpdf), Fletcher L. et al. (2008).\n* [Making Bertha drive - an autonomous journey on a historic route](http:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F6803933\u002F), Ziegler J. et al. (2014).\n","# 参考文献\n\n# 目录\n\n* [最优控制](#optimal-control-dart)\n  * [动态规划](#dynamic-programming)\n  * [线性规划](#linear-programming)\n  * [基于树的规划](#tree-based-planning)\n  * [控制理论](#control-theory)\n  * [模型预测控制](#model-predictive-control)\n* [安全控制](#safe-control-lock)\n  * [鲁棒控制](#robust-control)\n  * [风险规避控制](#risk-averse-control)\n  * [价值约束控制](#value-constrained-control)\n  * [状态约束控制与稳定性](#state-constrained-control-and-stability)\n  * [不确定动力系统](#uncertain-dynamical-systems)\n* [博弈论](#game-theory-spades)\n* [序列学习](#sequential-learning-shoe)\n  * [多臂老虎机](#multi-armed-bandit-slot_machine)\n    * [最优臂识别](#best-arm-identification-muscle)\n    * [黑箱优化](#black-box-optimization-black_large_square)\n  * [强化学习](#reinforcement-learning-robot)\n    * [理论](#theory-books)\n    * [基于价值的方法](#value-based-chart_with_upwards_trend)\n    * [基于策略的方法](#policy-based-muscle)\n      * [策略梯度](#policy-gradient)\n      * [演员-评论家](#actor-critic)\n      * [无导数方法](#derivative-free)\n    * [基于模型的方法](#model-based-world_map)\n    * [探索](#exploration-tent)\n    * [层次结构与时间抽象](#hierarchy-and-temporal-abstraction-clock2)\n    * [部分可观测性](#partial-observability-eye)\n    * [迁移学习](#transfer-earth_americas)\n    * [多智能体](#multi-agent-two_men_holding_hands)\n    * [表示学习](#representation-learning)\n    * [离线强化学习](#offline)\n* [从演示中学习](#learning-from-demonstrations-mortar_board)\n  * [模仿学习](#imitation-learning)\n    * [在自动驾驶中的应用](#applications-to-autonomous-driving-car)\n  * [逆向强化学习](#inverse-reinforcement-learning)\n    * [在自动驾驶中的应用](#applications-to-autonomous-driving-taxi)\n* [运动规划](#motion-planning-running_man)\n  * [搜索](#search)\n  * [采样](#sampling)\n  * [优化](#optimization)\n  * [反应式规划](#reactive)\n  * [架构与应用](#architecture-and-applications)\n\n![强化学习图](https:\u002F\u002Frawgit.com\u002Feleurent\u002Fphd-bibliography\u002Fmaster\u002Freinforcement-learning.svg)\n\n# 最优控制 :dart:\n\n## 动态规划\n\n* (书) [动态规划](https:\u002F\u002Fpress.princeton.edu\u002Ftitles\u002F9234.html), 贝尔曼 R. (1957).\n* (书) [动态规划与最优控制，第1卷和第2卷](http:\u002F\u002Fweb.mit.edu\u002Fdimitrib\u002Fwww\u002Fdpchapter.html), 伯特塞卡斯 D.
(1995).\n* (书) [马尔可夫决策过程——离散随机动态规划](http:\u002F\u002Feu.wiley.com\u002FWileyCDA\u002FWileyTitle\u002FproductCd-1118625870.html), 普特曼 M. (1995).\n* [近似最优值函数带来的损失上界](https:\u002F\u002Fwww.cis.upenn.edu\u002F~mkearns\u002Fteaching\u002Fcis620\u002Fpapers\u002FSinghYee.pdf), 辛格 S., 耶 R. (1994).\n* [迎风帆船赛中航行轨迹的随机优化](https:\u002F\u002Flink.springer.com\u002Farticle\u002F10.1057%2Fjors.2014.40), 达朗 R. 等 (2015).\n\n## 线性规划\n\n* (书) [马尔可夫决策过程——离散随机动态规划](http:\u002F\u002Feu.wiley.com\u002FWileyCDA\u002FWileyTitle\u002FproductCd-1118625870.html), 普特曼 M. (1995).\n* **`REPS`** [相对熵策略搜索](https:\u002F\u002Fwww.aaai.org\u002Focs\u002Findex.php\u002FAAAI\u002FAAAI10\u002Fpaper\u002FviewFile\u002F1851\u002F2264), 彼得斯 J. 等 (2010).\n\n## 基于树的规划\n\n* **`ExpectiMinimax`** [具有随机节点的游戏中的最优策略](http:\u002F\u002Fwww.inf.u-szeged.hu\u002Factacybernetica\u002Fedb\u002Fvol18n2\u002Fpdf\u002FMelko_2007_ActaCybernetica.pdf)，Melkó E.、Nagy B.（2007年）。\n* **`Sparse sampling`** [大型马尔可夫决策过程中的近似最优规划的稀疏采样算法](https:\u002F\u002Fwww.cis.upenn.edu\u002F~mkearns\u002Fpapers\u002Fsparsesampling-journal.pdf)，Kearns M. 等（2002年）。\n* **`MCTS`** [蒙特卡洛树搜索中的高效选择与备份算子](https:\u002F\u002Fhal.inria.fr\u002Finria-00116992\u002Fdocument)，Rémi Coulom，*SequeL*（2006年）。\n* **`UCT`** [基于赌博机的蒙特卡洛规划](http:\u002F\u002Fggp.stanford.edu\u002Freadings\u002Fuct.pdf)，Kocsis L.、Szepesvári C.（2006年）。\n* [用于树搜索的赌博机算法](https:\u002F\u002Fhal.inria.fr\u002Finria-00136198v2)，Coquelin P-A.、Munos R.（2007年）。\n* **`OPD`** [确定性系统的乐观规划](https:\u002F\u002Fhal.inria.fr\u002Fhal-00830182)，Hren J.、Munos R.（2008年）。\n* **`OLOP`** [开环乐观规划](http:\u002F\u002Fsbubeck.com\u002FCOLT10_BM.pdf)，Bubeck S.、Munos R.（2010年）。\n* **`SOOP`** [连续动作确定性系统的乐观规划](http:\u002F\u002Fresearchers.lille.inria.fr\u002Fmunos\u002Fpapers\u002Ffiles\u002Fadprl13-soop.pdf)，Buşoniu L. 等（2011年）。\n* **`OPSS`** [稀疏随机系统的乐观规划](https:\u002F\u002Fwww.dcsc.tudelft.nl\u002F~bdeschutter\u002Fpub\u002Frep\u002F11_007.pdf)，L. Buşoniu、R. Munos、B. De Schutter 和 R. Babuska（2011年）。\n* **`HOOT`** [连续动作马尔可夫决策过程中的基于样本的规划](https:\u002F\u002Fwww.aaai.org\u002Focs\u002Findex.php\u002FICAPS\u002FICAPS11\u002Fpaper\u002FviewFile\u002F2679\u002F3175)，Mansley C.、Weinstein A.、Littman M.（2011年）。\n* **`HOLOP`** [连续动作马尔可夫决策过程中的基于赌博机的规划与学习](https:\u002F\u002Fpdfs.semanticscholar.org\u002Fa445\u002Fd8cc503781c481c3f3c4ee1758b862b3e869.pdf)，Weinstein A.、Littman M.（2012年）。\n* **`BRUE`** [马尔可夫决策过程在线规划中的简单遗憾优化](https:\u002F\u002Fwww.jair.org\u002Findex.php\u002Fjair\u002Farticle\u002Fview\u002F10905\u002F26003)，Feldman Z. 和 Domshlak C.（2014年）。\n* **`LGP`** [逻辑几何规划：一种基于优化的组合任务与运动规划方法](https:\u002F\u002Fipvs.informatik.uni-stuttgart.de\u002Fmlr\u002Fpapers\u002F15-toussaint-IJCAI.pdf)，Toussaint M.（2015年）。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=B2s85xfo2uE)\n* **`AlphaGo`** [利用深度神经网络和树搜索掌握围棋游戏](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fnature16961)，Silver D. 等（2016年）。\n* **`AlphaGo Zero`** [无需人类知识即可掌握围棋游戏](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fnature24270)，Silver D. 等（2017年）。\n* **`AlphaZero`** [通过通用强化学习算法的自我对弈掌握国际象棋和将棋](https:\u002F\u002Farxiv.org\u002Fabs\u002F1712.01815)，Silver D. 等（2017年）。\n* **`TrailBlazer`** [在开辟道路之前先铺路：高效的蒙特卡洛规划](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F6253-blazing-the-trails-before-beating-the-path-sample-efficient-monte-carlo-planning.pdf)，Grill J. B.、Valko M.、Munos R.（2017年）。\n* **`MCTSnets`** [使用 MCTSnets 学习搜索](https:\u002F\u002Farxiv.org\u002Fabs\u002F1802.04697)，Guez A. 等（2018年）。\n* **`ADI`** [无需人类知识解决魔方问题](https:\u002F\u002Farxiv.org\u002Fabs\u002F1805.07470)，McAleer S. 
等（2018年）。\n* **`OPC\u002FSOPC`** [具有 Lipschitz 值的折扣无限时域非线性最优控制的连续动作规划](http:\u002F\u002Fbusoniu.net\u002Ffiles\u002Fpapers\u002Faut18.pdf)，Buşoniu L.、Pall E.、Munos R.（2018年）。\n* [带有悲观情景的实时树搜索：赢得 2018 年 NeurIPS Pommerman 比赛](http:\u002F\u002Fproceedings.mlr.press\u002Fv101\u002Fosogami19a.html)，Osogami T.、Takahashi T.（2019年）。\n\n## 控制理论\n\n* （书籍）[最优过程的数学理论](https:\u002F\u002Fbooks.google.fr\u002Fbooks?id=kwzq0F4cBVAC&printsec=frontcover&redir_esc=y#v=onepage&q&f=false)，L. S. Pontryagin、Boltyanskii V. G.、Gamkrelidze R. V. 和 Mishchenko E. F.（1962年）。\n* （书籍）[约束控制与估计](http:\u002F\u002Fwww.springer.com\u002Fgp\u002Fbook\u002F9781852335489)，Goodwin G.（2005年）。\n* **`PI²`** [强化学习的广义路径积分控制方法](http:\u002F\u002Fwww.jmlr.org\u002Fpapers\u002Fvolume11\u002Ftheodorou10a\u002Ftheodorou10a.pdf)，Theodorou E. 等（2010年）。\n* **`PI²-CMA`** [协方差矩阵自适应的路径积分策略改进](https:\u002F\u002Farxiv.org\u002Fabs\u002F1206.4621)，Stulp F.、Sigaud O.（2010年）。\n* **`iLQG`** [用于约束非线性随机系统局部最优反馈控制的广义迭代 LQG 方法](http:\u002F\u002Fmaeresearch.ucsd.edu\u002Fskelton\u002Fpublications\u002Fweiwei_ilqg_CDC43.pdf)，Todorov E.（2005年）。[:octocat:](https:\u002F\u002Fgithub.com\u002Fneka-nat\u002Filqr-gym)\n* **`iLQG+`** [通过在线轨迹优化合成与稳定复杂行为](https:\u002F\u002Fhomes.cs.washington.edu\u002F~todorov\u002Fpapers\u002FTassaIROS12.pdf)，Tassa Y.（2012年）。\n\n## 模型预测控制\n\n* （书籍）[模型预测控制](http:\u002F\u002Feen.iust.ac.ir\u002Fprofs\u002FShamaghdari\u002FMPC\u002FResources\u002F)，Camacho E.（1995年）。\n* （书籍）[具有约束的预测控制](https:\u002F\u002Fbooks.google.fr\u002Fbooks\u002Fabout\u002FPredictive_Control.html?id=HV_Y58c7KiwC&redir_esc=y)，Maciejowski J. M.（2002年）。\n* [低曲率道路上的车道保持与障碍物避让的线性模型预测控制](http:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F6728261\u002F)，Turri V. 等（2013年）。\n* **`MPCC`** [基于优化的 1:43 比例遥控车自主竞速](https:\u002F\u002Farxiv.org\u002Fabs\u002F1711.07300)，Liniger A. 等（2014年）。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=mXaElWYQKC4) | [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=JoHfJ6LEKVo)\n* **`MIQP`** [整合逻辑约束的自动驾驶最优轨迹规划：MIQP 视角](https:\u002F\u002Fhal.archives-ouvertes.fr\u002Fhal-01342358v1\u002Fdocument)，Qian X.、Altché F.、Bender P.、Stiller C.、de La Fortelle A.（2016年）。
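\n\n上面模型预测控制各条目的共同思想是滚动时域：每个时刻只对有限时域内的轨迹求优，执行第一步控制后前移并重新求解。下面给出一个最小示意（双积分器模型与随机采样求解器均为示例性假设，并非上述任一论文的实现）：\n\n```python\nimport numpy as np\n\n# 双积分器：状态 = (位置, 速度)，控制 = 加速度（示例性假设）\ndef step(x, u, dt=0.1):\n    return np.array([x[0] + dt * x[1], x[1] + dt * u])\n\ndef mpc_action(x, rng, horizon=15, samples=256):\n    # 随机采样候选控制序列，取总代价最小者的第一步（代价 = 状态偏差 + 控制量）\n    best_u, best_cost = 0.0, np.inf\n    for _ in range(samples):\n        us = rng.uniform(-1.0, 1.0, size=horizon)\n        xi, cost = x, 0.0\n        for u in us:\n            xi = step(xi, u)\n            cost += xi[0] ** 2 + xi[1] ** 2 + 0.01 * u ** 2\n        if cost < best_cost:\n            best_u, best_cost = us[0], cost\n    return best_u\n\nrng = np.random.default_rng(3)\nx = np.array([2.0, 0.0])\nfor _ in range(50):          # 滚动时域：每步重新优化，只执行第一个控制量\n    x = step(x, mpc_action(x, rng))\nprint('终端状态:', np.round(x, 3))\n```\n\n上文的 MPCC、MIQP 等工作将这里的随机采样替换为带约束的数值优化器，但滚动执行的外层结构是相同的。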
\n\n\n# 安全控制 :lock:\n\n## 鲁棒控制\n\n* [随机问题的极小极大分析](https:\u002F\u002Fwww2.isye.gatech.edu\u002F~anton\u002FMinimaxSP.pdf), Shapiro A., Kleywegt A. (2002)。\n* **`Robust DP`** [鲁棒动态规划](https:\u002F\u002Fwww.researchgate.net\u002Fpublication\u002F220442530\u002Fdownload), Iyengar G. (2005)。\n* [鲁棒规划与优化](https:\u002F\u002Fwww.researchgate.net\u002Fprofile\u002FFrancisco_Perez-Galarce\u002Fpost\u002Fcan_anyone_recommend_a_report_or_article_on_two_stage_robust_optimization\u002Fattachment\u002F59d62578c49f478072e9a500\u002FAS%3A272164542976002%401441900491330\u002Fdownload\u002F2011+-+Robust+planning+and+optimization.pdf), Laumanns M. (2011)。（讲义）\n* [鲁棒马尔可夫决策过程](https:\u002F\u002Fpubsonline.informs.org\u002Fdoi\u002Fpdf\u002F10.1287\u002Fmoor.1120.0566), Wiesemann W., Kuhn D., Rustem B. (2012)。\n* [基于高斯过程的安全鲁棒学习控制](http:\u002F\u002Fwww.dynsyslab.org\u002Fwp-content\u002Fpapercite-data\u002Fpdf\u002Fberkenkamp-ecc15.pdf), Berkenkamp F., Schoellig A. (2015)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=YqhLnCm0KXY)\n* **`Tube-MPPI`** [稀疏目标信息下的鲁棒采样型模型预测控制](http:\u002F\u002Fwww.roboticsproceedings.org\u002Frss14\u002Fp42.pdf), Williams G. 等 (2018)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=32v-e3dptjo)\n* [机器人中的安全学习：从基于学习的控制到安全强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2108.06266), Brunke L. 等 (2021)。[:octocat:](https:\u002F\u002Fgithub.com\u002FutiasDSL\u002Fsafe-control-gym)\n\n## 风险规避控制\n\n* [安全强化学习综合综述](http:\u002F\u002Fjmlr.org\u002Fpapers\u002Fv16\u002Fgarcia15a.html), García J., Fernández F. (2015)。\n* **`RA-QMDP`** [不确定性下自动驾驶的风险规避行为规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F1812.01254), Naghshvar M. 等 (2018)。\n* **`StoROO`** [X-臂老虎机：优化分位数及其他风险](https:\u002F\u002Farxiv.org\u002Fabs\u002F1904.08205), Torossian L., Garivier A., Picheny V. (2019)。\n* [最坏情况策略梯度](https:\u002F\u002Farxiv.org\u002Fabs\u002F1911.03618), Tang Y. C. 等 (2019)。\n* [无模型风险敏感强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.02907), Delétang G. 等 (2021)。\n* [支持感知 CVaR 老虎机的最优汤普森采样策略](https:\u002F\u002Fproceedings.mlr.press\u002Fv139\u002Fbaudry21a.html), Baudry D., Gautron R., Kaufmann E., Maillard O. (2021)。\n\n## 价值约束控制\n\n* **`ICS`** [驾驶座会空着吗？](https:\u002F\u002Fhal.inria.fr\u002Fhal-00965176), Fraichard T. (2014)。\n* **`SafeOPT`** [基于高斯过程的四旋翼无人机安全控制器优化](https:\u002F\u002Farxiv.org\u002Fabs\u002F1509.01066), Berkenkamp F., Schoellig A., Krause A. (2015)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=GiqNQdzc5TI) [:octocat:](https:\u002F\u002Fgithub.com\u002Fbefelix\u002FSafeOpt)\n* **`SafeMDP`** [使用高斯过程在有限马尔可夫决策过程中进行安全探索](https:\u002F\u002Farxiv.org\u002Fabs\u002F1606.04753), Turchetta M., Berkenkamp F., Krause A. (2016)。[:octocat:](https:\u002F\u002Fgithub.com\u002Fbefelix\u002FSafeMDP)\n* **`RSS`** [关于安全且可扩展的自动驾驶汽车的形式化模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F1708.06374), Shalev-Shwartz S. 等 (2017)。\n* **`CPO`** [约束策略优化](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.10528), Achiam J., Held D., Tamar A., Abbeel P. (2017)。[:octocat:](https:\u002F\u002Fgithub.com\u002Fjachiam\u002Fcpo)\n* **`RCPO`** [奖励约束策略优化](https:\u002F\u002Farxiv.org\u002Fabs\u002F1805.11074), Tessler C., Mankowitz D., Mannor S. (2018)。\n* **`BFTQ`** [用于预算型MDP的拟合Q算法](https:\u002F\u002Fhal.archives-ouvertes.fr\u002Fhal-01867353), Carrara N. 等 (2018)。\n* **`SafeMPC`** [基于学习的模型预测控制用于安全探索](https:\u002F\u002Farxiv.org\u002Fabs\u002F1803.08287), Koller T., Berkenkamp F., Turchetta M., Krause A. (2018)。\n* **`CCE`** [用于安全强化学习的约束交叉熵方法](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F7974-constrained-cross-entropy-method-for-safe-reinforcement-learning), Wen M., Topcu U. (2018)。[:octocat:](https:\u002F\u002Fgithub.com\u002Fliuzuxin\u002Fsafe-mbrl)\n* **`LTL-RL`** [具有概率保证的自动驾驶强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1904.07189), Bouton M. 等 (2019)。\n* [通过场景分解进行复杂城市环境导航的安全强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1904.11483v1), Bouton M. 等 (2019)。[:octocat:](https:\u002F\u002Fgithub.com\u002Fsisl\u002FAutomotivePOMDPs.jl)\n* [约束条件下的批量策略学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1903.08738), Le H., Voloshin C., Yue Y. (2019)。\n* [价值约束的无模型连续控制](https:\u002F\u002Farxiv.org\u002Fabs\u002F1902.04623), Bohez S. 等 (2019)。[🎞️](https:\u002F\u002Fsites.google.com\u002Fview\u002Fsuccessatanycost)\n* [安全学习控制受约束的线性二次调节器](https:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F8814865), Dean S. 等 (2019)。\n* [以最小的人工干预在现实世界中学会行走](https:\u002F\u002Farxiv.org\u002Fabs\u002F2002.08550), Ha S. 等 (2020)。[🎞️](https:\u002F\u002Fyoutu.be\u002Fcwyiq6dCgOc)\n* [通过PID拉格朗日方法实现强化学习中的响应式安全性](https:\u002F\u002Farxiv.org\u002Fabs\u002F2007.03964), Stooke A., Achiam J., Abbeel P.
(2020)。[:octocat:](https:\u002F\u002Fgithub.com\u002Fastooke\u002Frlpyt\u002Ftree\u002Fmaster\u002Frlpyt\u002Fprojects\u002Fsafe)\n* **`Envelope MOQ-Learning`** [多目标强化学习和策略适应的通用算法](https:\u002F\u002Farxiv.org\u002Fabs\u002F1908.08342), Yang R. 等 (2019)。\n\n## 状态约束控制与稳定性\n\n* **`HJI-reachability`** [控制领域的安全学习：结合扰动估计、可达性分析和强化学习，辅以系统性探索](http:\u002F\u002Fkth.diva-portal.org\u002Fsmash\u002Fget\u002Fdiva2:1140173\u002FFULLTEXT01.pdf), Heidenreich C. (2017)。\n* **`MPC-HJI`** [关于将基于可达性的安全保证融入人机交互的概率规划框架](https:\u002F\u002Fstanfordasl.github.io\u002Fwp-content\u002Fpapercite-data\u002Fpdf\u002FLeung.Schmerling.Chen.ea.ISER18.pdf), Leung K. 等 (2018)。\n* [不确定机器人系统中基于学习控制的一般安全框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.01292), Fisac J. 等 (2017)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=WAAxyeSk2bw&feature=youtu.be)\n* [具有稳定性保证的安全基于模型的强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.08551), Berkenkamp F. 等 (2017)。\n* **`Lyapunov-Net`** [安全的交互式基于模型学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1911.06556), Gallieri M. 等 (2019)。\n* [在神经网络策略中强制执行鲁棒控制保证](https:\u002F\u002Farxiv.org\u002Fabs\u002F2011.08105), Donti P. 等 (2021)。[:octocat:](https:\u002F\u002Fgithub.com\u002Flocuslab\u002Frobust-nn-control)\n* **`ATACOM`** [约束流形上的机器人强化学习](https:\u002F\u002Fopenreview.net\u002Fforum?id=zwo1-MdMl1P), Liu P. 等 (2021)。\n\n## 不确定动力系统\n\n* [受控不确定非线性系统的仿真](https:\u002F\u002Fwww.sciencedirect.com\u002Fscience\u002Farticle\u002Fpii\u002F009630039400112H)，Tibken B.，Hofer E.（1995）。\n* [动态不确定系统的轨迹计算](https:\u002F\u002Fieeexplore.ieee.org\u002Fiel5\u002F8969\u002F28479\u002F01272787.pdf)，Adrot O.，Flaus J-M.（2002）。\n* [基于区间模型的不确定动态系统仿真：综述](https:\u002F\u002Fwww.sciencedirect.com\u002Fscience\u002Farticle\u002Fpii\u002FS1474667016362206)，Puig V.等（2005）。\n* [不确定动态系统的区间观测器设计](https:\u002F\u002Fhal.inria.fr\u002Fhal-01276439\u002Ffile\u002FInterval_Survey.pdf)，Efimov D.，Raïssi T.（2016）。\n\n\n\n\n# 博弈论 :spades:\n\n* [自动驾驶车辆的分层博弈论规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.05766)，Fisac J.等（2018）。\n* [非线性多玩家一般和微分博弈的高效迭代线性二次近似](https:\u002F\u002Farxiv.org\u002Fabs\u002F1909.04694)，Fridovich-Keil D.等（2019）。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=KPEPk-QrkQ8&feature=youtu.be)\n\n\n\n\n# 序列学习 :shoe:\n\n* [预测、学习与博弈](https:\u002F\u002Fwww.ii.uni.wroc.pl\u002F~lukstafi\u002Fpmwiki\u002Fuploads\u002FAGT\u002FPrediction_Learning_and_Games.pdf)，Cesa-Bianchi N.，Lugosi G.（2006）。\n\n## 多臂老虎机 :slot_machine:\n\n* **`TS`** [根据两个样本的证据，一个未知概率超过另一个的概率研究](https:\u002F\u002Fwww.jstor.org\u002Fstable\u002Fpdf\u002F2332286.pdf)，Thompson W.（1933）。\n* [组织学习中的探索与利用](https:\u002F\u002Fwww3.nd.edu\u002F~ggoertz\u002Fabmir\u002Fmarch1991.pdf)，March J.（1991）。\n* **`UCB1 \u002F UCB2`** [多臂老虎机问题的有限时间分析](https:\u002F\u002Fhomes.di.unimi.it\u002F~cesabian\u002FPubblicazioni\u002Fml-02.pdf)，Auer P.，Cesa-Bianchi N.，Fischer P.（2002）。\n* **`经验伯恩斯坦 \u002F UCB-V`** [利用方差估计解决多臂老虎机中的探索-利用权衡问题](https:\u002F\u002Fhal.inria.fr\u002Fhal-00711069\u002F)，Audibert J-Y，Munos R.，Szepesvari C.（2009）。\n* [经验伯恩斯坦界与样本方差惩罚](https:\u002F\u002Farxiv.org\u002Fabs\u002F0907.0467)，Maurer A.，Pontil M.（2009）。\n* [汤普森采样的经验评估](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F4321-an-empirical-evaluation-of-thompson-sampling)，Chapelle O.，Li L.（2011）。\n* **`kl-UCB`** [有界随机多臂老虎机及其扩展的KL-UCB算法](https:\u002F\u002Farxiv.org\u002Fabs\u002F1102.2490)，Garivier A.，Cappé O.（2011）。\n* **`KL-UCB`** [库尔巴克-莱布勒上置信界用于最优顺序分配](https:\u002F\u002Fprojecteuclid.org\u002Feuclid.aos\u002F1375362558)，Cappé O.等（2013）。\n* **`IDS`**
[信息导向采样与异方差噪声下的多臂老虎机](https:\u002F\u002Farxiv.org\u002Fabs\u002F1801.09667)，Kirschner J.，Krause A.（2018）。\n\n### 上下文相关\n\n* **`LinUCB`** [基于上下文的老虎机方法在个性化新闻文章推荐中的应用](https:\u002F\u002Farxiv.org\u002Fabs\u002F1003.0146)，Li L.等（2010）。\n* **`OFUL`** [线性随机多臂老虎机的改进算法](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F4417-improved-algorithms-for-linear-stochastic-bandits)，Abbasi-yadkori Y.，Pal D.，Szepesvári C.（2011）。\n* [具有线性收益函数的上下文多臂老虎机](http:\u002F\u002Fproceedings.mlr.press\u002Fv15\u002Fchu11a.html)，Chu W.等（2011）。\n* [流式置信回归的自归一化技术](https:\u002F\u002Fhal.archives-ouvertes.fr\u002Fhal-01349727v2)，Maillard O.-A.（2017）。\n* [通过代理从延迟结果中学习及其在推荐系统中的应用](https:\u002F\u002Farxiv.org\u002Fabs\u002F1807.09387)，Mann T.等（2018）。（预测场景）\n* [非平稳环境下的加权线性多臂老虎机](https:\u002F\u002Farxiv.org\u002Fabs\u002F1909.09146)，Russac Y.等（2019）。\n* [具有随机延迟反馈的线性多臂老虎机](http:\u002F\u002Fproceedings.mlr.press\u002Fv119\u002Fvernade20a.html)，Vernade C.等（2020）。\n\n\n### 最优臂识别 :muscle:\n\n* **`连续淘汰法`** [多臂老虎机和强化学习问题中的动作淘汰与停止条件](http:\u002F\u002Fjmlr.csail.mit.edu\u002Fpapers\u002Fvolume7\u002Fevendar06a\u002Fevendar06a.pdf)，Even-Dar E.等（2006）。\n* **`LUCB`** [随机多臂老虎机中的PAC子集选择](https:\u002F\u002Fwww.cse.iitb.ac.in\u002F~shivaram\u002Fpapers\u002Fktas_icml_2012.pdf)，Kalyanakrishnan S.等（2012）。\n* **`UGapE`** [最优臂识别：固定预算与固定置信度的统一方法](https:\u002F\u002Fhal.archives-ouvertes.fr\u002Fhal-00747005)，Gabillon V.，Ghavamzadeh M.，Lazaric A.（2012）。\n* **`序列二分法`** [多臂老虎机中的近乎最优探索](http:\u002F\u002Fproceedings.mlr.press\u002Fv28\u002Fkarnin13.pdf)，Karnin Z.等（2013）。\n* **`M-LUCB \u002F M-Racing`** [极大极小动作识别：一种用于博弈的新老虎机框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F1602.04676)，Garivier A.，Kaufmann E.，Koolen W.（2016）。\n* **`跟踪并停止`** [固定置信度下的最优臂识别优化](https:\u002F\u002Farxiv.org\u002Fabs\u002F1602.04589)，Garivier A.，Kaufmann E.（2016）。\n* **`LUCB-micro`** [固定置信度下的结构化最优臂识别](https:\u002F\u002Farxiv.org\u002Fabs\u002F1706.05198)，Huang R.等（2017）。\n\n### 黑箱优化 :black_large_square:\n\n* **`GP-UCB`** [老虎机情境下的高斯过程优化：无遗憾与实验设计](https:\u002F\u002Farxiv.org\u002Fabs\u002F0912.3995)，Srinivas N.，Krause A.，Kakade S.，Seeger M.（2009）。\n* **`HOO`** [X-臂老虎机](https:\u002F\u002Farxiv.org\u002Fabs\u002F1001.4475)，Bubeck S.，Munos R.，Stoltz G.，Szepesvari C.（2009）。\n* **`DOO\u002FSOO`** [无需了解函数光滑度的确定性函数乐观优化](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F4304-optimistic-optimization-of-a-deterministic-function-without-the-knowledge-of-its-smoothness)，Munos R.（2011）。\n* **`StoOO`** [从老虎机到蒙特卡洛树搜索：乐观原则在优化与规划中的应用](https:\u002F\u002Fhal.archives-ouvertes.fr\u002Fhal-00747575v4\u002F)，Munos R.（2014）。\n* **`StoSOO`** [随机同时乐观优化](http:\u002F\u002Fproceedings.mlr.press\u002Fv28\u002Fvalko13.pdf)，Valko M.，Carpentier A.，Munos R.（2013）。\n* **`POO`** [黑箱环境下噪声函数的优化，且未知其光滑度](https:\u002F\u002Fhal.inria.fr\u002Fhal-01222915v4\u002F)，Grill J-B.，Valko M.，Munos R.（2015）。\n* **`EI-GP`** [AlphaGo中的贝叶斯优化](https:\u002F\u002Farxiv.org\u002Fabs\u002F1812.06855)，Chen Y.等（2018）。
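\n\n上文多臂老虎机各条目反复出现的“上置信界”思想，可以用 UCB1 的几行代码说明：对每个臂维护经验均值与置信半径，每轮乐观地选择两者之和最大的臂。下面是一个最小示意（伯努利奖励仅为示例性假设）：\n\n```python\nimport numpy as np\n\nrng = np.random.default_rng(4)\ntrue_means = np.array([0.2, 0.5, 0.7])  # 各臂真实期望，仅用于仿真产生奖励\ncounts = np.ones(3)                     # 初始化：每个臂先各拉一次\nsums = rng.binomial(1, true_means).astype(float)\n\nfor t in range(3, 3000):\n    # UCB1 指标：经验均值 + sqrt(2 ln t ÷ n_i)\n    ucb = np.divide(sums, counts) + np.sqrt(np.divide(2.0 * np.log(t), counts))\n    arm = int(np.argmax(ucb))           # 乐观地选择指标最高的臂\n    sums[arm] += rng.binomial(1, true_means[arm])\n    counts[arm] += 1\n\nprint('各臂被选次数:', counts.astype(int))\n```\n\n运行后绝大多数选择集中在期望最高的臂上，次优臂只获得对数级的尝试次数，这正是上述有限时间分析所刻画的后悔界的来源。\n\n\n\n# 强化学习 :robot:\n\n* [强化学习：综述](https:\u002F\u002Fwww.jair.org\u002Fmedia\u002F301\u002Flive-301-1562-jair.pdf)，Kaelbling L.等（1996）。\n\n## 理论 :books:\n\n* [在线强化学习的期望误差界模型](https:\u002F\u002Fpdfs.semanticscholar.org\u002F13b8\u002F1dd08aab636c3761c5eb4337dbe43aedaf31.pdf)，Fiechter C-N. (1997)。\n* **`UCRL2`** [强化学习的近似最优后悔界](http:\u002F\u002Fwww.jmlr.org\u002Fpapers\u002Fvolume11\u002Fjaksch10a\u002Fjaksch10a.pdf)，Jaksch T.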
(2010)。![设置](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fsetting-average-green)![通信](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcommunicating-orange)![界](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDS(AT)^0.5-orange)\n* **`PSRL`** [为什么后验采样比乐观主义更适合强化学习？](https:\u002F\u002Farxiv.org\u002Fabs\u002F1607.00215)，Osband I., Van Roy B. (2016)。![设置](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fsetting-episodic-green) ![贝叶斯](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fbayesian-green)\n* **`UCBVI`** [强化学习的极小极大后悔界](http:\u002F\u002Fproceedings.mlr.press\u002Fv70\u002Fazar17a.html)，Azar M., Osband I., Munos R. (2017)。![设置](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fsetting-episodic-green)![界](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F(H²SAT)^0.5-orange)\n* **`Q-Learning-UCB`** [Q学习是否可证明高效？](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F7735-is-q-learning-provably-efficient)，Jin C., Allen-Zhu Z., Bubeck S., Jordan M. (2018)。![设置](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fsetting-episodic-green) ![界](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F(H^3SAT)^0.5-orange)\n* **`LSVI-UCB`** [具有线性函数逼近的可证明高效强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1907.05388)，Jin C., Yang Z., Wang Z., Jordan M. (2019)。![设置](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fsetting-episodic-green) ![空间](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fapproximation-linear-green)\n* [基于模型的强化学习中的利普希茨连续性](https:\u002F\u002Farxiv.org\u002Fabs\u002F1804.07193)，Asadi K. 等 (2018)。\n* [关于强化学习中的函数逼近：面对大规模状态空间时的乐观主义](https:\u002F\u002Farxiv.org\u002Fabs\u002F2011.04622)，Yang Z., Jin C., Wang Z., Wang M., Jordan M. (2021) ![设置](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fsetting-episodic-green) ![空间](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fapproximation-kernel\u002Fnn-green) ![界](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdelta.H^2(T)^0.5-orange)\n\n### 生成模型\n\n*  **`QVI`** [关于具有生成模型的强化学习的样本复杂度](https:\u002F\u002Farxiv.org\u002Fabs\u002F1206.6461)，Azar M., Munos R., Kappen B. (2012)。\n* [基于生成模型的强化学习在极小极大意义上是最优的](https:\u002F\u002Farxiv.org\u002Fabs\u002F1906.03804)，Agarwal A. 等 (2019)。\n\n### 策略梯度\n\n* [具有函数逼近的强化学习中的策略梯度方法](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation)，Sutton R. 等 (2000)。\n* [近似最优的近似强化学习](https:\u002F\u002Fpeople.eecs.berkeley.edu\u002F~pabbeel\u002Fcs287-fa09\u002Freadings\u002FKakadeLangford-icml2002.pdf)，Kakade S., Langford J. (2002)。\n* [关于策略梯度方法的理论：最优性、近似与分布偏移](https:\u002F\u002Farxiv.org\u002Fabs\u002F1908.00261)，Agarwal A. 等 (2019)\n* [PC-PG：策略覆盖引导探索以实现可证明的策略梯度学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2007.08459)，Agarwal A. 等 (2020) \n* [策略梯度真的是梯度吗？](https:\u002F\u002Farxiv.org\u002Fabs\u002F1906.07073)，Nota C., Thomas P. S. (2020)。\n\n### 线性系统\n\n* [线性系统的 PAC 自适应控制](https:\u002F\u002Fciteseerx.ist.psu.edu\u002Fviewdoc\u002Fdownload?doi=10.1.1.49.339&rep=rep1&type=pdf)，Fiechter C.-N. (1997)\n* **`OFU-LQ`** [线性二次系统的自适应控制的后悔界](http:\u002F\u002Fproceedings.mlr.press\u002Fv19\u002Fabbasi-yadkori11a\u002Fabbasi-yadkori11a.pdf)，Abbasi-Yadkori Y., Szepesvari C. (2011)。\n* **`TS-LQ`** [线性二次控制问题中汤普森采样的改进后悔界](http:\u002F\u002Fproceedings.mlr.press\u002Fv80\u002Fabeille18a.html)，Abeille M., Lazaric A. (2018)。\n* [线性系统中使用汤普森采样的探索-利用](https:\u002F\u002Ftel.archives-ouvertes.fr\u002Ftel-01816069\u002F)，Abeille M. (2017)。（博士论文）\n* **`Coarse-Id`** [线性二次调节器的样本复杂度研究](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.01688)，Dean S., Mania H., Matni N., Recht B., Tu S. 
(2017)。\n* [线性二次调节器鲁棒自适应控制的后悔界](http:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F7673-regret-bounds-for-robust-adaptive-control-of-the-linear-quadratic-regulator)，Dean S. 等 (2018)。\n* [线性二次强化学习中的鲁棒探索](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F9668-robust-exploration-in-linear-quadratic-reinforcement-learning)，Umenberger J. 等 (2019)。\n* [带有对抗性干扰的在线控制](https:\u002F\u002Farxiv.org\u002Fabs\u002F1902.08721)，Agarwal N. 等 (2019)。![噪声](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fnoise-adversarial-red)![成本](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcosts-convex-green)\n* [在线控制的对数后悔](https:\u002F\u002Farxiv.org\u002Fabs\u002F1909.05062)，Agarwal N. 等 (2019)。![噪声](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fnoise-adversarial-red)![成本](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcosts-convex-green)\n\n## 基于价值 :chart_with_upwards_trend:\n\n* **`NFQ`** [神经拟合 Q 迭代——一种数据高效的神经强化学习方法的首次尝试](http:\u002F\u002Fml.informatik.uni-freiburg.de\u002Fformer\u002F_media\u002Fpublications\u002Frieecml05.pdf)，Riedmiller M. (2005)。\n* **`DQN`** [用深度强化学习玩雅达利游戏](https:\u002F\u002Fwww.cs.toronto.edu\u002F~vmnih\u002Fdocs\u002Fdqn.pdf)，Mnih V. 等 (2013)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=iqXKQf2BOSE)\n* **`DDQN`** [采用双重 Q 学习的深度强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1509.06461)，van Hasselt H., Silver D. 等 (2015)。\n* **`DDDQN`** [用于深度强化学习的决斗网络架构](https:\u002F\u002Farxiv.org\u002Fabs\u002F1511.06581)，Wang Z. 等 (2015)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=qJd3yaEN9Sw)\n* **`PDDDQN`** [优先经验回放](https:\u002F\u002Farxiv.org\u002Fabs\u002F1511.05952)，Schaul T. 等 (2015)。\n* **`NAF`** [基于模型加速的连续深度 Q 学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1603.00748)，Gu S. 等 (2016)。\n* **`Rainbow`** [彩虹：结合深度强化学习中的多项改进](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.02298)，Hessel M. 等 (2017)。\n* **`Ape-X DQfD`** [观察并看得更远：在雅达利游戏中实现稳定表现](https:\u002F\u002Farxiv.org\u002Fabs\u002F1805.11593)，Pohlen T. 等 (2018)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=-0xOdnoxAFo&index=4&list=PLnZpNNVLsMmOfqMwJLcpLpXKLr3yKZ8Ak)\n\n## 基于策略 :muscle:\n\n### 策略梯度\n\n* **`REINFORCE`** [用于连接主义强化学习的简单统计梯度跟随算法](http:\u002F\u002Fwww-anw.cs.umass.edu\u002F~barto\u002Fcourses\u002Fcs687\u002Fwilliams92simple.pdf)，威廉姆斯 R.（1992）。\n* **`自然梯度`** [一种自然策略梯度](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F2073-a-natural-policy-gradient.pdf)，卡卡德 S.（2002）。\n* [用于机器人的策略梯度方法](http:\u002F\u002Fwww.kyb.mpg.de\u002Ffileadmin\u002Fuser_upload\u002Ffiles\u002Fpublications\u002Fattachments\u002FIROS2006-Peters_%5b0%5d.pdf)，彼得斯 J.、沙尔 S.（2006）。\n* **`TRPO`** [信任域策略优化](https:\u002F\u002Farxiv.org\u002Fabs\u002F1502.05477)，舒尔曼 J. 等（2015）。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=KJ15iGGJFvQ)\n* **`PPO`** [近端策略优化算法](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.06347)，舒尔曼 J. 等（2017）。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=bqdjsmSoSgI)\n* **`DPPO`** [丰富环境中的运动行为涌现](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.02286)，希斯 N. 等（2017）。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=hx_bgoTF7bs)\n\n### 演员-评论家\n\n* **`AC`** [具有函数逼近的强化学习中的策略梯度方法](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf)，萨顿 R. 等（1999）。\n* **`NAC`** [自然演员-评论家](https:\u002F\u002Fhomes.cs.washington.edu\u002F~todorov\u002Fcourses\u002Famath579\u002Freading\u002FNaturalActorCritic.pdf)，彼得斯 J. 等（2005）。\n* **`DPG`** [确定性策略梯度算法](http:\u002F\u002Fproceedings.mlr.press\u002Fv32\u002Fsilver14.pdf)，西尔弗 D. 
等（2014）。\n* **`DDPG`** [深度强化学习下的连续控制](https:\u002F\u002Farxiv.org\u002Fabs\u002F1509.02971)，利利克拉普 T. 等（2015）。[🎞️ 1](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=lV5JhxsrSH8) | [2](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=8CNck-hdys8) | [3](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=xw73qehvSRQ) | [4](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=vWxBmHRnQMI)\n* **`MACE`** [利用深度强化学习实现地形自适应运动技能](https:\u002F\u002Fwww.cs.ubc.ca\u002F~van\u002Fpapers\u002F2016-TOG-deepRL\u002Findex.html)，彭 X.、伯塞斯 G.、范德潘内 M.（2016）。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=KPfzRSBzNX4) | [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=A0BmHoujP9k)\n* **`A3C`** [深度强化学习的异步方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F1602.01783)，米尼 V. 等（2016）。[🎞️ 1](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Ajjc08-iPx8) | [2](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=0xo1Ldx3L5Q) | [3](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=nMR5mjCFZCw)\n* **`SAC`** [软演员-评论家：带有随机演员的离策略最大熵深度强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1801.01290)，哈尔诺亚 T. 等（2018）。[🎞️](https:\u002F\u002Fvimeo.com\u002F252185258)\n* **`MPO`** [最大后验策略优化](https:\u002F\u002Farxiv.org\u002Fabs\u002F1806.06920)，阿卜杜勒马莱基 A. 等（2018）。\n* [对演员-评论家算法中折扣率不匹配的深入研究](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.01069)，张 S.、拉罗什 R. 等（2020）。\n\n### 无导数方法\n\n* **`CEM`** [使用噪声交叉熵法学习俄罗斯方块](http:\u002F\u002Fiew3.technion.ac.il\u002FCE\u002Ffiles\u002Fpapers\u002FLearning%20Tetris%20Using%20the%20Noisy%20Cross-Entropy%20Method.pdf)，斯齐塔 I.、洛林茨 A.（2006）。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=UZnDYGk1j2c)\n* **`CMAES`** [进化策略中的完全去随机化自适应](https:\u002F\u002Fdl.acm.org\u002Fcitation.cfm?id=1108843)，汉森 N.、奥斯特迈尔 A.（2001）。\n* **`NEAT`** [通过拓扑增殖进化神经网络](http:\u002F\u002Fnn.cs.utexas.edu\u002Fdownloads\u002Fpapers\u002Fstanley.ec02.pdf)，斯坦利 K.（2002）。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=5lJuEW-5vr8)\n* **`iCEM`** [用于实时规划的高效样本交叉熵法](https:\u002F\u002Farxiv.org\u002Fabs\u002F2008.06389)，平内里 C. 等（2020）。\n\n## 基于模型的 :world_map:\n\n* **`Dyna`** [基于近似动态规划的学习、规划和反应的集成架构](http:\u002F\u002Fciteseerx.ist.psu.edu\u002Fviewdoc\u002Fdownload?doi=10.1.1.84.6983&rep=rep1&type=pdf), Sutton R. (1990).\n* **`PILCO`** [PILCO：一种基于模型且数据高效的策略搜索方法](http:\u002F\u002Fmlg.eng.cam.ac.uk\u002Fpub\u002Fpdf\u002FDeiRas11.pdf), Deisenroth M., Rasmussen C. (2011). ([演讲](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=f7y60SEZfXc))\n* **`DBN`** [用于汽车的概率MDP行为规划](http:\u002F\u002Fieeexplore.ieee.org\u002Fstamp\u002Fstamp.jsp?arnumber=6082928), Brechtel S. 等 (2011).\n* **`GPS`** [深度视觉-运动策略的端到端训练](https:\u002F\u002Farxiv.org\u002Fabs\u002F1504.00702), Levine S. 等 (2015). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Q4bMcUk6pcw)\n* **`DeepMPC`** [DeepMPC：为模型预测控制学习深层潜在特征](https:\u002F\u002Fwww.cs.stanford.edu\u002Fpeople\u002Fasaxena\u002Fpapers\u002Fdeepmpc_rss2015.pdf), Lenz I. 等 (2015). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=BwA90MmkvPU)\n* **`SVG`** [通过随机价值梯度学习连续控制策略](https:\u002F\u002Farxiv.org\u002Fabs\u002F1510.09142), Heess N. 等 (2015). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=PYdL7bcn_cM)\n* **`FARNN`** [使用深度动态神经网络进行非线性系统辨识](https:\u002F\u002Farxiv.org\u002Fabs\u002F1610.01439), Ogunmolu O. 等 (2016). [:octocat:](https:\u002F\u002Fgithub.com\u002Flakehanne\u002FFARNN)\n* [利用学习到的局部模型进行最优控制：应用于灵巧操作](https:\u002F\u002Fhomes.cs.washington.edu\u002F~todorov\u002Fpapers\u002FKumarICRA16.pdf), Kumar V. 等 (2016). 
[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=bD5z1I1TU3w)\n* **`BPTT`** [通过短期预测实现长期规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F1602.01580), Shalev-Shwartz S. 等 (2016). [🎞️ 1](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Nqmv1anUaF4) | [2](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=UgGZ9lMvey8)\n* [用于规划机器人运动的深度视觉预见](https:\u002F\u002Farxiv.org\u002Fabs\u002F1610.00696), Finn C., Levine S. (2016). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=6k7GHG4IUCY)\n* **`VIN`** [价值迭代网络](https:\u002F\u002Farxiv.org\u002Fabs\u002F1602.02867), Tamar A. 等 (2016). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=RcRkog93ZRU)\n* **`VPN`** [价值预测网络](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.03497), Oh J. 等 (2017).\n* **`DistGBP`** [基于模型的离散与连续动作规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.07177), Henaff M. 等 (2017). [🎞️ 1](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=9Xh2TRQ_4nM) | [2](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=XLdme0TTjiw)\n* [基于时间片段模型的预测与控制](https:\u002F\u002Farxiv.org\u002Fabs\u002F1703.04070), Mishra N. 等 (2017).\n* **`Predictron`** [Predictron：端到端学习与规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F1612.08810), Silver D. 等 (2017). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=BeaLdaN2C3Q)\n* **`MPPI`** [面向基于模型强化学习的信息论MPC](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F7989202\u002F), Williams G. 等 (2017). [:octocat:](https:\u002F\u002Fgithub.com\u002Fferreirafabio\u002Fmppi_pendulum) [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=f2at-cqaJMM)\n* [通过“梦想”学习真实世界机器人策略](https:\u002F\u002Farxiv.org\u002Fabs\u002F1805.07813), Piergiovanni A. 等 (2018).\n* [利用深度学习对车辆进行纵向与横向联合控制](https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.09365), Devineau G., Polack P., Altché F., Moutarde F. (2018). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=yyWy1uavlXs)\n* **`PlaNet`** [从像素中学习潜在动力学以进行规划](https:\u002F\u002Fplanetrl.github.io\u002F), Hafner D. 等 (2018). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=tZk1eof_VNA)\n* **`NeuralLander`** [Neural Lander：利用学习到的动力学实现稳定的无人机着陆控制](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.08027), Shi G. 等 (2018). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=FLLsG0S78ik)\n* **`DBN+POMCP`** [面向高速公路场景下自动驾驶车辆的人类化预测与决策](https:\u002F\u002Ftel.archives-ouvertes.fr\u002Ftel-02184362), Sierra Gonzalez D. (2019).\n* [基于目标条件的策略规划](https:\u002F\u002Fsites.google.com\u002Fview\u002Fgoal-planning), Nasiriany S. 等 (2019). [🎞️](https:\u002F\u002Fsites.google.com\u002Fview\u002Fgoal-planning#h.p_0m-H0QfKVj4n)\n* **`MuZero`** [通过基于学习模型的规划掌握Atari、围棋、国际象棋和将棋](https:\u002F\u002Farxiv.org\u002Fabs\u002F1911.08265), Schrittwieser J. 等 (2019). [:octocat:](https:\u002F\u002Fgithub.com\u002Fwerner-duvaud\u002Fmuzero-general)\n* **`BADGR`** [BADGR：一种自主的自监督学习导航系统](https:\u002F\u002Fsites.google.com\u002Fview\u002Fbadgr), Kahn G., Abbeel P., Levine S. (2020). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=EMV0zEXbcc4) [:octocat:](https:\u002F\u002Fgithub.com\u002Fgkahn13\u002Fbadgr)\n* **`H-UCRL`** [通过乐观策略搜索与规划实现高效的基于模型强化学习](https:\u002F\u002Fproceedings.neurips.cc\u002F\u002Fpaper_files\u002Fpaper\u002F2020\u002Fhash\u002Fa36b598abb934e4528412e5a2127b931-Abstract.html), Curi S., Berkenkamp F., Krause A. (2020). [:octocat:](https:\u002F\u002Fgithub.com\u002Fsebascuri\u002Fhucrl)
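\n\n上面“基于模型的”各条目的共同套路是：先从交互数据中学习动力学模型，再在学到的模型上做规划。下面给出一个最小示意（线性模型与最小二乘拟合为示例性假设，真实系统矩阵仅用于产生数据）：\n\n```python\nimport numpy as np\n\nrng = np.random.default_rng(5)\nA_true = np.array([[1.0, 0.1], [0.0, 1.0]])  # 仅用于仿真的真实动力学\nB_true = np.array([0.0, 0.1])\n\n# 1) 用随机控制收集转移数据 (x, u, x')\nX, U, Xn = [], [], []\nx = np.zeros(2)\nfor _ in range(200):\n    u = rng.uniform(-1.0, 1.0)\n    xn = A_true @ x + B_true * u + 0.01 * rng.normal(size=2)\n    X.append(x); U.append([u]); Xn.append(xn)\n    x = xn\n\n# 2) 最小二乘拟合线性模型 x_next ≈ [A B] @ [x; u]\nZ = np.hstack([np.array(X), np.array(U)])\nW, *_ = np.linalg.lstsq(Z, np.array(Xn), rcond=None)\nA_hat, B_hat = W.T[:, :2], W.T[:, 2]\nprint('学到的 A:', np.round(A_hat, 2))\nprint('学到的 B:', np.round(B_hat, 2))\n```\n\n学到 (A_hat, B_hat) 之后，即可像前文 MPC 示意那样在该模型上滚动规划；PILCO、PlaNet、MuZero 等条目分别把这里的线性模型换成高斯过程、潜变量模型或价值等价模型。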
\n\n## 探索 :tent:\n\n* [用内在恐惧对抗强化学习的西西弗斯式诅咒](https:\u002F\u002Farxiv.org\u002Fabs\u002F1611.01211), Lipton Z. 等 (2016).\n* **`Pseudo-counts`** [统一基于计数的探索与内在动机](https:\u002F\u002Farxiv.org\u002Fabs\u002F1606.01868), Bellemare M. 等 (2016). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=0yI2wJ6F8r0)\n* **`HER`** [事后经验回放](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.01495), Andrychowicz M. 等 (2017). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Dz_HuzgMxzo)\n* **`VHER`** [视觉事后经验回放](https:\u002F\u002Farxiv.org\u002Fabs\u002F1901.11529), Sahni H. 等 (2019).\n* **`RND`** [通过随机网络蒸馏进行探索](https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.12894), Burda Y. 等 (OpenAI) (2018). [🎞️](https:\u002F\u002Fopenai.com\u002Fblog\u002Freinforcement-learning-with-prediction-based-rewards\u002F)\n* **`Go-Explore`** [Go-Explore：一种解决困难探索问题的新方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F1901.10995), Ecoffet A. 等 (Uber) (2018). [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=gnGyUPd_4Eo)\n* **`C51-IDS`** [面向深度强化学习的信息导向探索](https:\u002F\u002Farxiv.org\u002Fabs\u002F1812.07544), Nikolov N., Kirschner J., Berkenkamp F., Krause A. (2019). [:octocat:](https:\u002F\u002Fgithub.com\u002Fnikonikolov\u002Frltf)\n* **`Plan2Explore`** [通过自监督世界模型规划探索](https:\u002F\u002Framanans1.github.io\u002Fplan2explore\u002F), Sekar R. 等 (2020). [演示](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=GftqnPWsCWw&feature=emb_title) [:octocat:](https:\u002F\u002Fgithub.com\u002Framanans1\u002Fplan2explore)\n* **`RIDE`** [RIDE：在程序化生成环境中奖励影响驱动的探索](https:\u002F\u002Fopenreview.net\u002Fpdf?id=rkg-TJBFPB), Raileanu R., Rocktäschel T. (2020). [:octocat:](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fimpact-driven-exploration)\n\n## 层次结构与时间抽象 :clock2:\n\n* [在马尔可夫决策过程与半马尔可夫决策过程之间：强化学习中的时间抽象框架](http:\u002F\u002Fwww-anw.cs.umass.edu\u002F~barto\u002Fcourses\u002Fcs687\u002FSutton-Precup-Singh-AIJ99.pdf)，Sutton R. 等 (1999)。\n* [内在动机驱动的层次化技能集合学习](http:\u002F\u002Fwww-anw.cs.umass.edu\u002Fpubs\u002F2004\u002Fbarto_sc_ICDL04.pdf)，Barto A. 等 (2004)。\n* **`OC`** [选项评论家架构](https:\u002F\u002Farxiv.org\u002Fabs\u002F1609.05140)，Bacon P-L., Harb J., Precup D. (2016)。\n* [调制型运动控制器的学习与迁移](https:\u002F\u002Farxiv.org\u002Fabs\u002F1610.05182)，Heess N. 等 (2016)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=sboPYvhpraQ&feature=youtu.be)\n* [面向自动驾驶的安全多智能体强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1610.03295)，Shalev-Shwartz S. 等 (2016)。\n* **`FuNs`** [用于层次强化学习的封建网络](https:\u002F\u002Farxiv.org\u002Fabs\u002F1703.01161)，Vezhnevets A. 等 (2017)。\n* [结合神经网络与树搜索的复杂环境任务与运动规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F1703.07887)，Paxton C. 等 (2017)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=MM2U_SGMtk8)\n* **`DeepLoco`** [DeepLoco：基于层次深度强化学习的动态运动技能](https:\u002F\u002Fwww.cs.ubc.ca\u002F~van\u002Fpapers\u002F2017-TOG-deepLoco\u002F)，Peng X. 等 (2017)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=hd1yvLWm6oA) | [🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=x-HrYko_MRU)\n* [通过自我博弈实现机器人乒乓球样本高效学习的层次策略设计](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.12927)，Mahjourian R. 等 (2018)。[🎞️](https:\u002F\u002Fsites.google.com\u002Fview\u002Frobottabletennis)\n* **`DAC`** [DAC：用于学习选项的双演员-评论家架构](https:\u002F\u002Farxiv.org\u002Fabs\u002F1904.12691)，Zhang S., Whiteson S. (2019)。\n* [基于层次Sim2Real的运动式多智能体操作](https:\u002F\u002Farxiv.org\u002Fabs\u002F1908.05224)，Nachum O. 等 (2019)。[🎞️](https:\u002F\u002Fsites.google.com\u002Fview\u002Fmanipulation-via-locomotion)\n* [SoftCon：具有仿生执行器的软体动物仿真与控制](http:\u002F\u002Fmrl.snu.ac.kr\u002Fpublications\u002FProjectSoftCon\u002FSoftCon.html)，Min S. 等 (2020)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=I2ylkhPSkT4) [:octocat:](https:\u002F\u002Fgithub.com\u002Fseiing\u002FSoftCon)\n* **`H-REIL`** [基于强化学习的近事故驾驶模仿策略控制](http:\u002F\u002Filiad.stanford.edu\u002Fpdfs\u002Fpublications\u002Fcao2020reinforcement.pdf)，Cao Z. 等 (2020)。[🎞️ 1](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=CY24zlC_HdI&feature=youtu.be), [2](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=envT7b5YRts&feature=youtu.be)
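\n\n本节的起点是 Sutton、Precup 与 Singh (1999) 的 options 框架：把一段带终止条件的子策略封装成上层可选择的“宏动作”。下面用一个与任何论文实现无关的玩具执行循环示意这一语义，`ToyChain` 环境与各参数均为演示假设：\n\n```python\nimport random\nfrom dataclasses import dataclass\nfrom typing import Callable\n\n@dataclass\nclass Option:\n    # 时间抽象意义下的 option：一个子策略加一个状态相关的终止概率\n    policy: Callable  # state -> action\n    beta: Callable    # state -> 终止概率（0 到 1）\n\ndef run_episode(env, choose_option, max_steps=100):\n    # 上层只在 option 终止时重新决策，底层动作由当前 option 的子策略给出\n    state = env.reset()\n    option = choose_option(state)\n    total = 0.0\n    for _ in range(max_steps):\n        state, r, done = env.step(option.policy(state))\n        total += r\n        if done:\n            break\n        if random.random() < option.beta(state):\n            option = choose_option(state)  # option 终止，重新选择\n    return total\n\nclass ToyChain:\n    # 玩具环境：从 0 出发走到 10 即结束，每步奖励为当前位置\n    def reset(self):\n        self.x = 0\n        return self.x\n    def step(self, a):\n        self.x += a\n        return self.x, float(self.x), self.x >= 10\n\ngo_right = Option(policy=lambda s: 1, beta=lambda s: 0.1)\nprint(run_episode(ToyChain(), choose_option=lambda s: go_right))\n```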
\n\n## 部分可观测性 :eye:\n\n* **`PBVI`** [基于点的价值迭代：POMDPs 的随时可用算法](https:\u002F\u002Fwww.ri.cmu.edu\u002Fpub_files\u002Fpub4\u002Fpineau_joelle_2003_3\u002Fpineau_joelle_2003_3.pdf)，Pineau J. 等 (2003)。\n* **`cPBVI`** [连续 POMDPs 的基于点的价值迭代](http:\u002F\u002Fwww.jmlr.org\u002Fpapers\u002Fvolume7\u002Fporta06a\u002Fporta06a.pdf)，Porta J. 等 (2006)。\n* **`POMCP`** [大型 POMDPs 中的蒙特卡洛规划](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F4031-monte-carlo-planning-in-large-pomdps)，Silver D., Veness J. (2010)。\n* [不确定性下的机器人运动规划的 POMDP 方法](http:\u002F\u002Fusers.isr.ist.utl.pt\u002F~mtjspaan\u002FPOMDPPractioners\u002Fpomdp2010_submission_5.pdf)，Du Y. 等 (2010)。\n* [全自动驾驶中变道的基于概率的在线 POMDP 决策](https:\u002F\u002Fusers.cs.duke.edu\u002F~pdinesh\u002Fsources\u002F06728533.pdf)，Ulbrich S., Maurer M. (2013)。\n* [求解连续 POMDPs：带有高效空间表示增量学习的价值迭代](http:\u002F\u002Fproceedings.mlr.press\u002Fv28\u002Fbrechtel13.pdf)，Brechtel S. 等 (2013)。\n* [使用连续 POMDPs 进行自动驾驶的不确定性下概率决策](http:\u002F\u002Fieeexplore.ieee.org\u002Fstamp\u002Fstamp.jsp?arnumber=6957722)，Brechtel S. 等 (2014)。\n* **`MOMDP`** [意图感知的运动规划](http:\u002F\u002Fares.lids.mit.edu\u002Ffm\u002Fdocuments\u002Fintentionawaremotionplanning.pdf)，Bandyopadhyay T. 等 (2013)。\n* **`DNC`** [利用具有动态外部记忆的神经网络进行混合计算](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fnature20101)，Graves A. 等 (2016)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=B9U8sI7TcMY)\n* [推断交通参与者内部状态对自动驾驶高速公路行驶的价值](https:\u002F\u002Farxiv.org\u002Fabs\u002F1702.00858)，Sunberg Z. 等 (2017)。\n* [用于自主导航城市交叉口的信念状态规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F1704.04322)，Bouton M., Cosgun A., Kochenderfer M. (2017)。\n* [针对自动驾驶的传感器遮挡情况下的可扩展决策](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F8460914)，Bouton M. 等 (2018)。\n* [道路交叉口的概率决策：公式化与定量评估](https:\u002F\u002Fhal.inria.fr\u002Fhal-01940392)，Barbier M., Laugier C., Simonin O., Ibañez J. (2018)。\n* [美女与野兽：无人机竞速中的最优方法与学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.06224)，Kaufmann E. 等 (2018)。[演示视频](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=UuQvijZcUSc)\n* **`Social Perception`** [具备社会感知能力的自动驾驶汽车行为规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F1905.00988)，Sun L. 等 (2019)。
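\n\n本节方法（PBVI、POMCP 等）的共同基础是信念状态：对隐藏状态维护一个后验分布，并在每次“行动—观测”后做贝叶斯更新。下面是离散 POMDP 信念更新的极简示意，转移矩阵与观测模型均为演示假设：\n\n```python\nimport numpy as np\n\ndef belief_update(b, a, o, T, Z):\n    # 离散 POMDP 的贝叶斯信念更新：b'(s') ∝ Z[a][s', o] * Σ_s T[a][s, s'] * b(s)\n    pred = b @ T[a]            # 预测步：按转移模型传播信念\n    new_b = Z[a][:, o] * pred  # 校正步：乘以观测似然\n    return new_b * (new_b.sum() ** -1.0)\n\n# 两状态、两观测的玩具模型（转移矩阵与观测似然均为演示假设）\nT = {0: np.array([[0.9, 0.1], [0.2, 0.8]])}\nZ = {0: np.array([[0.7, 0.3], [0.1, 0.9]])}\nprint(belief_update(np.array([0.5, 0.5]), a=0, o=1, T=T, Z=Z))\n```\n\nPOMCP 一类方法正是用粒子集合近似这一更新，从而把规划扩展到大型状态空间。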
\n\n## 迁移学习 :earth_americas:\n\n* **`IT&E`** [能够像动物一样适应的机器人](https:\u002F\u002Farxiv.org\u002Fabs\u002F1407.3501)，Cully A., Clune J., Tarapore D., Mouret J-B. (2014)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=T-c17RKh3uE)\n* **`MAML`** [用于深度网络快速适应的模型无关元学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1703.03400)，Finn C., Abbeel P., Levine S. (2017)。[🎞️](https:\u002F\u002Fsites.google.com\u002Fview\u002Fmaml)\n* [自动驾驶中的虚拟到现实强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1704.03952)，Pan X. 等 (2017)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Bce2ZSlMuqY)\n* [从仿真到现实：四足机器人的敏捷运动学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1804.10332)，Tan J. 等 (2018)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=lUZUr7jxoqM)\n* **`ME-TRPO`** [模型集成信任区域策略优化](https:\u002F\u002Farxiv.org\u002Fabs\u002F1802.10592)，Kurutach T. 等 (2018)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=tpS8qj7yhoU)\n* [深度强化学习的启动](https:\u002F\u002Farxiv.org\u002Fabs\u002F1803.03835)，Schmitt S. 等 (2018)。\n* [学习灵巧的手部操作](https:\u002F\u002Fblog.openai.com\u002Flearning-dexterity\u002F)，OpenAI (2018)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=DKe8FumoD4E)\n* **`GrBAL \u002F ReBAL`** [通过元强化学习在动态的真实环境中学习适应](https:\u002F\u002Farxiv.org\u002Fabs\u002F1803.11347)，Nagabandi A. 等 (2018)。[🎞️](https:\u002F\u002Fsites.google.com\u002Fberkeley.edu\u002Fmetaadaptivecontrol)\n* [为足式机器人学习敏捷且动态的运动技能](https:\u002F\u002Frobotics.sciencemag.org\u002Fcontent\u002F4\u002F26\u002Feaau5872)，Hwangbo J. 等 (ETH Zurich \u002F Intel ISL) (2019)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=ITfBKjBH46E)\n* [基于深度强化学习的四足机器人鲁棒恢复控制器](https:\u002F\u002Farxiv.org\u002Fabs\u002F1901.07517)，Lee J., Hwangbo J., Hutter M. (ETH Zurich RSL) (2019)。\n* **`IT&E`** [使用“智能试错”算法学习和适应四足步态](https:\u002F\u002Fhal.inria.fr\u002Fhal-02084619)，Dalin E., Desreumaux P., Mouret J-B. (2019)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=v90CWJ_HsnM)\n* **`FAMLE`** [通过模拟先验的元学习嵌入实现机器人领域的快速在线适应](https:\u002F\u002Farxiv.org\u002Fabs\u002F2003.04663)，Kaushik R., Anne T., Mouret J-B. (2020)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=QIY1Sm7wHhE)\n* [针对观测值对抗性扰动的鲁棒深度强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2003.08938)，Zhang H. 等 (2020)。[:octocat:](https:\u002F\u002Fgithub.com\u002Fchenhongge\u002FStateAdvDRL)\n* [在复杂地形上学习四足运动](https:\u002F\u002Frobotics.sciencemag.org\u002Fcontent\u002F5\u002F47\u002Feabc5986)，Lee J. 等 (2020)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=9j2a1oAHDL8)\n* **`PACOH`** [PACOH：具有PAC保证的贝叶斯最优元学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2002.05551)，Rothfuss J., Fortuin V., Josifoski M., Krause A. (2021)。\n* [基于模型的领域泛化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.11436)，Robey A. 等 (2021)。\n* **`SimGAN`** [SimGAN：通过对抗性强化学习进行领域适应的混合模拟器识别](https:\u002F\u002Farxiv.org\u002Fabs\u002F2101.06005)，Jiang Y. 等 (2021)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=McKOGllO7nc&feature=youtu.be) [:octocat:](https:\u002F\u002Fgithub.com\u002Fjyf588\u002FSimGAN)\n* [为野外四足机器人学习鲁棒的感知运动](https:\u002F\u002Fleggedrobotics.github.io\u002Frl-perceptiveloco\u002F)，Miki T. 等 (2022)。\n\n## 多智能体 :two_men_holding_hands:\n\n* **`Minimax-Q`** [马尔可夫博弈作为多智能体强化学习的框架](https:\u002F\u002Fwww.cs.rutgers.edu\u002F~mlittman\u002Fpapers\u002Fml94-final.pdf)，M. Littman (1994)。\n* [自主智能体对其他智能体的建模：综合综述与开放问题](https:\u002F\u002Farxiv.org\u002Fabs\u002F1709.08071)，Albrecht S., Stone P. (2017)。\n* **`MILP`** [沿指定路径的移动机器人时间最优协调](https:\u002F\u002Farxiv.org\u002Fabs\u002F1603.04610)，Altché F. 等 (2016)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=RiW2OFsdHOY)\n* **`MIQP`** [用于协同半自主车辆监督驾驶的算法](https:\u002F\u002Farxiv.org\u002Fabs\u002F1706.08046)，Altché F. 等 (2017)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=JJZKfHMUeCI)\n* **`SA-CADRL`** [基于深度强化学习的社会意识运动规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F1703.08862)，Chen Y. 等 (2017)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=CK1szio7PyA)\n* [基于变点的行为预测的自动驾驶多策略决策：理论与实验](https:\u002F\u002Flink.springer.com\u002Farticle\u002F10.1007\u002Fs10514-017-9619-z)，Galceran E. 等 (2017)。\n* [面向可扩展自治系统的在线决策](https:\u002F\u002Fwww.ijcai.org\u002Fproceedings\u002F2017\u002F664)，Wray K. 等 (2017)。\n* **`MAgent`** [MAgent：用于人工群体智能的多智能体强化学习平台](https:\u002F\u002Farxiv.org\u002Fabs\u002F1712.00600)，Zheng L. 等 (2017)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=HCSm0kVolqI)\n* [利用价值迭代网络进行非完整约束智能体的协作运动规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F1709.05273)，Rehder E. 等 (2017)。
\n* **`MPPO`** [通过深度强化学习实现最优去中心化的多机器人避障](https:\u002F\u002Farxiv.org\u002Fabs\u002F1709.10082)，Long P. 等 (2017)。[🎞️](https:\u002F\u002Fsites.google.com\u002Fview\u002Fdrlmaca)\n* **`COMA`** [反事实多智能体策略梯度](https:\u002F\u002Farxiv.org\u002Fabs\u002F1709.05273)，Foerster J. 等 (2017)。\n* **`MADDPG`** [用于混合合作—竞争环境的多智能体演员—评论家](https:\u002F\u002Farxiv.org\u002Fabs\u002F1706.02275)，Lowe R. 等 (2017)。[:octocat:](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fmaddpg)\n* **`FTW`** [基于群体的深度强化学习在第一人称多人游戏中达到人类水平的表现](https:\u002F\u002Farxiv.org\u002Fabs\u002F1807.01281)，Jaderberg M. 等 (2018)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=dltN4MxV1RI)\n* [迈向通过自我博弈学习多智能体谈判](https:\u002F\u002Farxiv.org\u002Fabs\u002F2001.10208)，Tang Y. C. (2020)。\n* **`MAPPO`** [MAPPO在合作性多智能体游戏中的惊人效果](https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.01955)，Yu C. 等 (2021)。[:octocat:](https:\u002F\u002Fgithub.com\u002Fmarlbenchmark\u002Fon-policy)\n* [多智能体强化学习](https:\u002F\u002Fdiscovery.ucl.ac.uk\u002Fid\u002Feprint\u002F10124273\u002F)，Yang Y. (2021)。\n\n## 表征学习\n\n* [最优控制中的可变分辨率离散化](https:\u002F\u002Frd.springer.com\u002Fcontent\u002Fpdf\u002F10.1023%2FA%3A1017992615625.pdf)，Munos R., Moore A. (2002)。[🎞️](http:\u002F\u002Fresearchers.lille.inria.fr\u002F~munos\u002Fvariable\u002Findex.html)\n* **`DeepDriving`** [DeepDriving：在自动驾驶中学习直接感知的可供性](http:\u002F\u002Fdeepdriving.cs.princeton.edu\u002Fpaper.pdf)，Chen C. 等 (2015)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=5hFvoXV9gII)\n* [端到端训练与语义抽象训练的样本复杂度比较](https:\u002F\u002Farxiv.org\u002Fabs\u002F1604.06915)，Shalev-Shwartz S. 等 (2016)。\n* [利用稀疏编码在强化学习中学习稀疏表征](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.08316)，Le L., Kumaraswamy M., White M. (2017)。\n* [世界模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F1803.10122)，Ha D., Schmidhuber J. (2018)。[🎞️](https:\u002F\u002Fworldmodels.github.io\u002F) [:octocat:](https:\u002F\u002Fgithub.com\u002Fctallec\u002Fworld-models)\n* [一天学会驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F1807.00412)，Kendall A. 等 (2018)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=eRwTbRtnT1I)\n* **`MERLIN`** [目标导向智能体中的无监督预测性记忆](https:\u002F\u002Farxiv.org\u002Fabs\u002F1803.10760)，Wayne G. 等 (2018)。[🎞️ 1](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=YFx-D4eEs5A) | [2](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=IiR_NOomcpk) | [3](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=dQMKJtLScmk) | [4](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=xrYDlTXyC6Q) | [5](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=04H28-qA3f8) | [6](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=3iA19h0Vvq0)\n* [变分端到端导航与定位](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.10119)，Amini A. 等 (2018)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=aXI4a_Nvcew)\n* [理解视觉与触觉：面向接触密集型任务的多模态表征自监督学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1810.10191.pdf)，Lee M. 等 (2018)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=TjwDJ_R2204)\n* [递归与离散世界模型的深度神经进化](http:\u002F\u002Fsebastianrisi.com\u002Fwp-content\u002Fuploads\u002Frisi_gecco19.pdf)，Risi S., Stanley K.O. (2019)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=a-tcsnZe-yE) [:octocat:](https:\u002F\u002Fgithub.com\u002Fsebastianrisi\u002Fga-world-models)\n* **`FERM`** [高效机器人操作框架](https:\u002F\u002Fsites.google.com\u002Fview\u002Fefficient-robotic-manipulation)，Zhan A., Zhao R. 等 (2021)。[:octocat:](https:\u002F\u002Fgithub.com\u002FPhilipZRH\u002Fferm)\n* **`S4RL`** [S4RL：离线强化学习中出人意料的简单自监督方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.06326)，Sinha S. 
等 (2021)。\n\n## 离线\n\n* **`SPI-BB`** [基于基线自助法的安全策略改进](https:\u002F\u002Farxiv.org\u002Fabs\u002F1712.06924)，Laroche R. 等 (2019)。\n* **`AWAC`** [AWAC：利用离线数据集加速在线强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.09359)，Nair A. 等 (2020)。\n* **`CQL`** [用于离线强化学习的保守Q学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.04779)，Kumar A. 等 (2020)。\n* [决策变换器：通过序列建模进行强化学习](https:\u002F\u002Fsites.google.com\u002Fberkeley.edu\u002Fdecision-transformer)，Chen L., Lu K. 等 (2021)。[:octocat:](https:\u002F\u002Fgithub.com\u002Fkzl\u002Fdecision-transformer)\n* [将强化学习视为一个大型序列建模问题](https:\u002F\u002Ftrajectory-transformer.github.io\u002F)，Janner M., Li Q., Levine S. (2021)。\n\n## 其他\n\n* [贝尔曼残差是一个糟糕的代理吗？](https:\u002F\u002Farxiv.org\u002Fabs\u002F1606.07636)，Geist M., Piot B., Pietquin O. (2016)。\n* [深度强化学习中真正重要的因素](https:\u002F\u002Farxiv.org\u002Fabs\u002F1709.06560)，Henderson P. 等 (2017)。\n* [利用深度强化学习进行自动桥牌叫牌](https:\u002F\u002Farxiv.org\u002Fabs\u002F1607.03290)，Yeh C., Lin H. (2016)。\n* [通过深度强化学习实现共享自主](https:\u002F\u002Farxiv.org\u002Fabs\u002F1802.01744)，Reddy S. 等 (2018)。[🎞️](https:\u002F\u002Fsites.google.com\u002Fview\u002Fdeep-assist)\n* [强化学习与控制作为概率推理：教程与综述](https:\u002F\u002Farxiv.org\u002Fabs\u002F1805.00909)，Levine S. (2018)。\n* [强化学习中的值函数多面体](https:\u002F\u002Farxiv.org\u002Fabs\u002F1901.11524)，Dadashi R. 等 (2019)。\n* [关于值函数与智能体-环境边界](https:\u002F\u002Farxiv.org\u002Fabs\u002F1905.13341)，Jiang N. (2019)。\n* [如何用深度强化学习训练你的机器人；我们学到的经验](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.02915)，Ibarz J. 等 (2021)。\n\n# 示范学习 :mortar_board:\n\n## 模仿学习\n\n* **`DAgger`** [模仿学习和结构化预测向无悔在线学习的约简](https:\u002F\u002Fwww.cs.cmu.edu\u002F~sross1\u002Fpublications\u002FRoss-AIStats11-NoRegret.pdf)，Ross S., Gordon G., Bagnell J. A. (2011)。\n* **`QMDP-RCNN`** [通过循环卷积神经网络进行强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1701.02392)，Shankar T. 等 (2016)。([演讲](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=gpwA3QNTPOQ))\n* **`DQfD`** [面向现实世界强化学习的示范学习](https:\u002F\u002Fpdfs.semanticscholar.org\u002Fa7fb\u002F199f85943b3fb6b5f7e9f1680b2e2a445cce.pdf)，Hester T. 等 (2017)。[演示](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=JR6wmLaYuu4&list=PLdjpGm3xcO-0aqVf--sBZHxCKg-RZfa5T)\n* [找到属于自己的路：城市自主驾驶中路径建议的弱监督分割](https:\u002F\u002Farxiv.org\u002Fabs\u002F1610.01238)，Barnes D., Maddern W., Posner I. (2016)。[演示](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=rbZ8ck_1nZk)\n* **`GAIL`** [生成对抗式模仿学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1606.03476)，Ho J., Ermon S. (2016)。\n* [从感知到决策：一种数据驱动的端到端运动规划方法，用于自主地面机器人](https:\u002F\u002Farxiv.org\u002Fabs\u002F1609.07910)，Pfeiffer M. 等 (2017)。[演示](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=ZedKmXzwdgI)\n* **`Branched`** [通过条件模仿学习实现端到端驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.02410)，Codevilla F. 等 (2017)。[演示](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=cFtnflNe5fM) | [演讲](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=KunVjVHN3-U)\n* **`UPN`** [通用规划网络](https:\u002F\u002Farxiv.org\u002Fabs\u002F1804.00645)，Srinivas A. 等 (2018)。[演示](https:\u002F\u002Fsites.google.com\u002Fview\u002Fupn-public\u002Fhome)\n* **`DeepMimic`** [DeepMimic：基于物理的角色技能的示例引导式深度强化学习](https:\u002F\u002Fxbpeng.github.io\u002Fprojects\u002FDeepMimic\u002Findex.html)，Peng X. B. 等 (2018)。[演示](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=vppFvq2quQ0&feature=youtu.be)\n* **`R2P2`** [用于灵活推理、规划和控制的深度模仿模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.06544), Rhinehart N. 等 (2018)。[演示](https:\u002F\u002Fsites.google.com\u002Fview\u002Fimitativeforecastingcontrol)\n* [通过模仿动物学习敏捷的机器人运动技能](https:\u002F\u002Fxbpeng.github.io\u002Fprojects\u002FRobotic_Imitation\u002Findex.html)，Peng X. B. 等 (2020)。[演示](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=lKYh6uuCwRY)\n* [用于灵活推理、规划和控制的深度模仿模型](https:\u002F\u002Fopenreview.net\u002Fpdf?id=Skl4mRNYDr)，Rhinehart N., McAllister R., Levine S. (2020)。
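\n\n模仿学习最朴素的基线是行为克隆：把专家的（状态，动作）对当作监督数据直接拟合策略，上文的 DAgger 正是针对这种做法的分布偏移问题提出的改进。下面用一个与上述论文无关的线性策略最小二乘示意，数据为玩具构造：\n\n```python\nimport numpy as np\n\ndef behavior_cloning(states, actions, lr=0.1, epochs=200):\n    # 用最小二乘意义下的梯度下降拟合线性策略 a = W s，仅为行为克隆的最小示意\n    n = states.shape[0]\n    W = np.zeros((actions.shape[1], states.shape[1]))\n    for _ in range(epochs):\n        pred = states @ W.T\n        grad = 2.0 * (pred - actions).T @ states * (n ** -1.0)\n        W -= lr * grad\n    return W\n\n# 玩具“专家演示”：最优动作恰为 a = -s\nS = np.random.default_rng(0).normal(size=(128, 3))\nA = -S\nprint(np.round(behavior_cloning(S, A), 2))  # 应接近负的单位矩阵\n```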
\n\n### 自动驾驶应用 :car:\n\n* [ALVINN：基于神经网络的自主陆地车辆](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F95-alvinn-an-autonomous-land-vehicle-in-a-neural-network)，Pomerleau D. (1989)。\n* [面向自动驾驶汽车的端到端学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1604.07316)，Bojarski M. 等 (2016)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=qhUvQiKec2U)\n* [基于大规模视频数据集的驾驶模型端到端学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1612.01079)，Xu H., Gao Y. 等 (2016)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=jxlNfUzbGAY)\n* [考虑时间依赖性的自动驾驶车辆转向端到端深度学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.03804)，Eraqi H. 等 (2017)。\n* [像人类一样驾驶：使用卷积神经网络进行路径规划的模仿学习](https:\u002F\u002Fwww.semanticscholar.org\u002Fpaper\u002FDriving-Like-a-Human%3A-Imitation-Learning-for-Path-Rehder-Quehl\u002Fa1150417083918c3f5f88b7ddad8841f2ce88188)，Rehder E. 等 (2017)。\n* [利用生成对抗网络模仿驾驶员行为](https:\u002F\u002Farxiv.org\u002Fabs\u002F1701.06699)，Kuefler A. 等 (2017)。\n* **`PS-GAIL`** [用于驾驶模拟的多智能体模仿学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1803.01044)，Bhattacharyya R. 等 (2018)。[🎞️](https:\u002F\u002Fgithub.com\u002Fsisl\u002Fngsim_env\u002Fblob\u002Fmaster\u002Fmedia\u002Fsingle_multi_model_2_seed_1.gif) [:octocat:](https:\u002F\u002Fgithub.com\u002Fsisl\u002Fngsim_env)\n* [在通用城市场景中增强安全性的自动驾驶深度模仿学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1903.00640)，Chen J. 等 (2019)。\n\n## 逆强化学习\n\n* **`Projection`** [通过逆强化学习进行学徒式学习](http:\u002F\u002Fai.stanford.edu\u002F~ang\u002Fpapers\u002Ficml04-apprentice.pdf)，Abbeel P., Ng A. (2004)。\n* **`MMP`** [最大间隔规划](https:\u002F\u002Fwww.ri.cmu.edu\u002Fpub_files\u002Fpub4\u002Fratliff_nathan_2006_1\u002Fratliff_nathan_2006_1.pdf)，Ratliff N. 等 (2006)。\n* **`BIRL`** [贝叶斯逆强化学习](https:\u002F\u002Fwww.aaai.org\u002FPapers\u002FIJCAI\u002F2007\u002FIJCAI07-416.pdf)，Ramachandran D., Amir E. (2007)。\n* **`MEIRL`** [最大熵逆强化学习](https:\u002F\u002Fwww.aaai.org\u002FPapers\u002FAAAI\u002F2008\u002FAAAI08-227.pdf)，Ziebart B. 等 (2008)。\n* **`LEARCH`** [学习搜索：用于模仿学习的函数梯度技术](https:\u002F\u002Fwww.ri.cmu.edu\u002Fpub_files\u002F2009\u002F7\u002Flearch.pdf)，Ratliff N., Silver D., Bagnell A. (2009)。\n* **`CIOC`** [具有局部最优示例的连续逆最优控制](http:\u002F\u002Fgraphics.stanford.edu\u002Fprojects\u002Fcioc\u002F)，Levine S., Koltun V. (2012)。[🎞️](http:\u002F\u002Fgraphics.stanford.edu\u002Fprojects\u002Fcioc\u002Fcioc.mp4)\n* **`MEDIRL`** [最大熵深度逆强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1507.04888)，Wulfmeier M. (2015)。\n* **`GCL`** [引导成本学习：通过策略优化实现的深度逆最优控制](https:\u002F\u002Farxiv.org\u002Fabs\u002F1603.00448)，Finn C. 等 (2016)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=hXxaepw0zAw)\n* **`RIRL`** [重复逆强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.05427)，Amin K. 等 (2017)。\n* [弥合模仿学习与逆强化学习之间的差距](http:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F7464854\u002F)，Piot B. 等 (2017)。
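\n\n上面多数逆强化学习方法（从 Abbeel 与 Ng 的投影法到最大熵 IRL）都围绕“特征期望”展开：用 μ = E[Σ_t γ^t φ(s_t)] 概括一组轨迹的行为，再让学习者匹配专家的 μ。下面是该统计量的极简计算示意，特征函数与轨迹均为演示假设：\n\n```python\nimport numpy as np\n\ndef feature_expectations(trajs, phi, gamma=0.99):\n    # μ = E[Σ_t γ^t φ(s_t)]：学徒式学习与最大熵 IRL 中刻画行为的核心统计量\n    mus = []\n    for traj in trajs:\n        mu = np.zeros_like(phi(traj[0]))\n        for t, s in enumerate(traj):\n            mu += (gamma ** t) * phi(s)\n        mus.append(mu)\n    return np.mean(mus, axis=0)\n\n# 玩具示例：一维状态，特征取 [s, s 的平方]\nphi = lambda s: np.array([s, s * s], dtype=float)\nexpert_trajs = [[0.0, 1.0, 2.0], [0.0, 1.0, 1.5]]\nprint(feature_expectations(expert_trajs, phi))\n```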
\n\n### 自动驾驶应用 :taxi:\n\n* [用于运动规划的学徒式学习及其在停车场导航中的应用](http:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F4651222\u002F)，Abbeel P. 等 (2008)。\n* [像出租车司机一样导航：基于观察到的情境感知行为的概率推理](http:\u002F\u002Fwww.cs.cmu.edu\u002F~bziebart\u002Fpublications\u002Fnavigate-bziebart.pdf)，Ziebart B. 等 (2008)。\n* [基于规划的行人预测](http:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F5354147\u002F)，Ziebart B. 等 (2009)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=XOZ69Bg4JKg)\n* [自主导航的学习](https:\u002F\u002Fwww.ri.cmu.edu\u002Fpub_files\u002F2010\u002F6\u002FLearning%20for%20Autonomous%20Navigation-%20Advances%20in%20Machine%20Learning%20for%20Rough%20Terrain%20Mobility.pdf)，Bagnell A. 等 (2010)。\n* [从专家演示中学习自动驾驶风格和操作](https:\u002F\u002Fwww.ri.cmu.edu\u002Fpub_files\u002F2012\u002F6\u002Fiser12.pdf)，Silver D. 等 (2012)。\n* [从演示中学习自动驾驶车辆的驾驶风格](http:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F7139555\u002F)，Kuderer M. 等 (2015)。\n* [利用逆强化学习和深度Q网络学习驾驶](https:\u002F\u002Farxiv.org\u002Fabs\u002F1612.03653)，Sharifzadeh S. 等 (2016)。\n* [请关注：城市环境中路径规划的可扩展成本函数学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1607.02329)，Wulfmeier M. (2016)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Sdfir_1T-UQ)\n* [为能够利用对人类行为影响的自动驾驶汽车进行规划](https:\u002F\u002Frobotics.eecs.berkeley.edu\u002F~sastry\u002Fpubs\u002FPdfs%20of%202016\u002FSadighPlanning2016.pdf)，Sadigh D. 等 (2016)。\n* [用于处理城市自动驾驶中困境的学习框架](http:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F7989172\u002F)，Lee S., Seo S. (2017)。\n* [利用基于能量模型的朗之万采样进行连续逆最优控制的轨迹预测学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1904.05453)，Xu Y. 等 (2019)。\n* [基于逆强化学习分析成本函数在解释和模仿人类驾驶行为中的适用性](https:\u002F\u002Fras.papercept.net\u002Fproceedings\u002FICRA20\u002F0320.pdf)，Naumann M. 等 (2020)。\n\n# 运动规划 :running_man:\n\n## 搜索\n\n* **`Dijkstra`** [关于图论中两个问题的一则注记](http:\u002F\u002Fwww-m3.ma.tum.de\u002Ffoswiki\u002Fpub\u002FMN0506\u002FWebHome\u002Fdijkstra.pdf)，Dijkstra E. W. (1959)。\n* **`A*`** [启发式确定最小成本路径的正式基础](http:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F4082128\u002F)，Hart P. 等 (1968)。\n* [为自动驾驶车辆规划长距离动态可行的操作](https:\u002F\u002Fwww.cs.cmu.edu\u002F~maxim\u002Ffiles\u002Fplanlongdynfeasmotions_rss08.pdf)，Likhachev M., Ferguson D. (2008)。\n* [在弗雷内坐标系下为动态街道场景生成最优轨迹](https:\u002F\u002Fwww.researchgate.net\u002Fprofile\u002FMoritz_Werling\u002Fpublication\u002F224156269_Optimal_Trajectory_Generation_for_Dynamic_Street_Scenarios_in_a_Frenet_Frame\u002Flinks\u002F54f749df0cf210398e9277af.pdf)，Werling M., Kammel S. (2010)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Cj6tAQe7UCY)\n* [面向自动驾驶和协同汽车的3D感知与规划](http:\u002F\u002Fwww.mrt.kit.edu\u002Fz\u002Fpubl\u002Fdownload\u002F2012\u002FStillerZiegler2012SSD.pdf)，Stiller C., Ziegler J. (2012)。\n* [面向道路自动驾驶的不确定性下的运动规划](https:\u002F\u002Fwww.ri.cmu.edu\u002Fpub_files\u002F2014\u002F6\u002FICRA14_0863_Final.pdf)，Xu W. 等 (2014)。\n* [蒙特卡洛树搜索用于模拟赛车](http:\u002F\u002Fjulian.togelius.com\u002FFischer2015Monte.pdf)，Fischer J. 等 (2015)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=GbUMssvolvU)\n\n## 采样\n\n* **`RRT*`** [用于最优运动规划的基于采样的算法](https:\u002F\u002Farxiv.org\u002Fabs\u002F1105.1186)，Karaman S., Frazzoli E. (2011)。[🎞️](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=p3nZHnOWhrg)\n* **`LQG-MP`** [LQG-MP：针对具有运动不确定性及不完全状态信息的机器人优化路径规划](https:\u002F\u002Fpeople.eecs.berkeley.edu\u002F~pabbeel\u002Fpapers\u002FvandenBergAbbeelGoldberg_RSS2010.pdf)，van den Berg J. 等 (2010)。\n* [在信念空间中使用微分动态规划进行不确定性下的运动规划](http:\u002F\u002Frll.berkeley.edu\u002F~sachin\u002Fpapers\u002FBerg-ISRR2011.pdf)，van den Berg J. 等 (2011)。\n* [用于不确定性下运动规划的快速探索随机信念树](https:\u002F\u002Fgroups.csail.mit.edu\u002Frrg\u002Fpapers\u002Fabry_icra11.pdf)，Bry A., Roy N. (2011)。\n* **`PRM-RL`** [PRM-RL：结合强化学习与基于采样的规划实现长距离机器人导航任务](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.03937)，Faust A. 等 (2017)。
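\n\n本节基于采样的规划器（RRT*、信念树一类）都围绕同一骨架：随机采样、在树中找最近节点、向样本方向扩展一步。下面是一个基础版 RRT 的极简示意（不含 RRT* 的重连优化），空间范围、步长与障碍物均为演示假设：\n\n```python\nimport math, random\n\ndef rrt(start, goal, is_free, step=0.5, iters=5000, goal_tol=0.5, seed=0):\n    # RRT 骨架：随机采样 -> 找树中最近节点 -> 向样本方向扩展一步\n    random.seed(seed)\n    nodes, parent = [start], {0: None}\n    for _ in range(iters):\n        q = (random.uniform(0, 10), random.uniform(0, 10))  # 演示用的 10x10 空间\n        i = min(range(len(nodes)), key=lambda k: math.dist(nodes[k], q))\n        x, y = nodes[i]\n        d = math.dist((x, y), q)\n        if d == 0:\n            continue\n        t = min(1.0, step * (d ** -1))\n        new = (x + (q[0] - x) * t, y + (q[1] - y) * t)\n        if not is_free(new):\n            continue\n        nodes.append(new)\n        parent[len(nodes) - 1] = i\n        if math.dist(new, goal) < goal_tol:  # 足够接近目标，回溯路径\n            path, k = [], len(nodes) - 1\n            while k is not None:\n                path.append(nodes[k])\n                k = parent[k]\n            return path[::-1]\n    return None\n\n# 在 10x10 空间中绕过一个圆形障碍，从 (1,1) 到 (9,9)\nobstacle = lambda p: math.dist(p, (5.0, 5.0)) < 1.5\npath = rrt((1.0, 1.0), (9.0, 9.0), is_free=lambda p: not obstacle(p))\nprint(len(path) if path else '未找到路径')\n```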
\n\n## 优化\n\n* [为“伯莎”号规划轨迹——一种局部、连续的方法](https:\u002F\u002Fpdfs.semanticscholar.org\u002Fbdca\u002F7fe83f8444bb4e75402a417053519758d36b.pdf)，Ziegler J. 等 (2014)。\n* [学习吸引子景观以获取运动基元](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F2140-learning-attractor-landscapes-for-learning-motor-primitives.pdf)，Ijspeert A. 等 (2002)。\n* [基于非欧几里得旋转群的非线性模型预测控制的在线运动规划](https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.03534)，Rösmann C. 等 (2020)。[:octocat:](https:\u002F\u002Fgithub.com\u002Frst-tu-dortmund\u002Fmpc_local_planner)\n\n## 反应式\n\n* **`PF`** [机械臂与移动机器人实时避障](http:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F1087247\u002F)，Khatib O. (1986)。\n* **`VFH`** [矢量场直方图——移动机器人快速避障方法](http:\u002F\u002Fieeexplore.ieee.org\u002Fstamp\u002Fstamp.jsp?arnumber=88137)，Borenstein J. (1991)。\n* **`VFH+`** [VFH+：适用于高速移动机器人的可靠避障方法](http:\u002F\u002Fciteseerx.ist.psu.edu\u002Fviewdoc\u002Fdownload?doi=10.1.1.438.3464&rep=rep1&type=pdf)，Ulrich I., Borenstein J. (1998)。\n* **`Velocity Obstacles`** [利用速度障碍在动态环境中进行运动规划](http:\u002F\u002Fciteseerx.ist.psu.edu\u002Fviewdoc\u002Fdownload?doi=10.1.1.56.6352&rep=rep1&type=pdf)，Fiorini P., Shiller Z. (1998)。\n\n## 架构与应用\n\n* [自动驾驶车辆运动规划技术综述](http:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F7339478\u002F)，González D. 等 (2016)。\n* [城市自动驾驶车辆运动规划与控制技术综述](https:\u002F\u002Farxiv.org\u002Fabs\u002F1604.07446)，Paden B. 等 (2016)。\n* [城市环境中的自动驾驶：Boss 与城市挑战赛](https:\u002F\u002Fwww.ri.cmu.edu\u002Fpublications\u002Fautonomous-driving-in-urban-environments-boss-and-the-urban-challenge\u002F)，Urmson C. 等 (2008)。\n* [麻省理工学院—康奈尔大学碰撞事件及其原因分析](http:\u002F\u002Fonlinelibrary.wiley.com\u002Fdoi\u002F10.1002\u002Frob.20266\u002Fpdf)，Fletcher L. 等 (2008)。\n* [让“伯莎”号行驶——一次沿历史路线的自动驾驶之旅](http:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F6803933\u002F)，Ziegler J. 等 (2014)。","# phd-bibliography 快速上手指南\n\n`phd-bibliography` 并非一个可执行的软件工具或代码库，而是一个由社区维护的**学术文献索引清单**。它汇集了最优控制、安全控制、博弈论、强化学习、模仿学习及运动规划等领域的经典书籍与核心论文。\n\n本指南旨在帮助开发者快速浏览、检索并利用该资源构建自己的知识库。\n\n## 环境准备\n\n由于该项目本质上是 Markdown 格式的文档列表，无需复杂的系统环境或编译依赖。\n\n*   **操作系统**：Windows, macOS, Linux 均可。\n*   **前置依赖**：\n    *   Web 浏览器（推荐 Chrome, Edge 或 Firefox）。\n    *   Git（可选，用于克隆仓库到本地）。\n    *   Markdown 编辑器（可选，如 VS Code, Typora，用于本地阅读）。\n\n## 安装\u002F获取步骤\n\n你可以通过以下两种方式获取文献列表：\n\n### 方式一：在线浏览（推荐）\n直接访问 GitHub 仓库页面查看渲染后的目录和链接：\n```text\nhttps:\u002F\u002Fgithub.com\u002Feleurent\u002Fphd-bibliography\n```\n\n### 方式二：克隆到本地\n如果你希望离线阅读或贡献内容，可以使用 Git 克隆仓库。国内用户若直接克隆速度较慢，可改用镜像站点或加速代理；基础克隆命令如下：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Feleurent\u002Fphd-bibliography.git\n```\n\n进入目录：\n```bash\ncd phd-bibliography\n```\n\n## 基本使用\n\n该项目的核心用法是**按主题检索文献**。打开 `README.md` 文件或在 GitHub 网页端浏览，利用目录跳转至感兴趣的研究领域。\n\n### 1. 浏览核心领域\n项目主要涵盖以下六大板块，点击对应标题即可展开：\n\n*   **Optimal Control (最优控制)**: 包含动态规划、线性规划、基于树的规划（如 MCTS, AlphaGo）、控制理论及模型预测控制 (MPC)。\n*   **Safe Control (安全控制)**: 涵盖鲁棒控制、风险规避控制、值约束控制及状态约束稳定性。\n*   **Game Theory (博弈论)**: 相关理论基础。\n*   **Sequential Learning (序列学习)**: 重点包含多臂老虎机 (Multi-Armed Bandit) 和强化学习 (RL)，细分为基于值、基于策略、基于模型、分层强化学习等子方向。\n*   **Learning from Demonstrations (从演示中学习)**: 包含模仿学习 (Imitation Learning) 和逆向强化学习 (IRL)，特别是自动驾驶领域的应用。\n*   **Motion Planning (运动规划)**: 涉及搜索、采样、优化及反应式规划方法。\n\n### 2. 查找特定算法文献\n每个条目通常标记了算法名称（加粗显示），例如寻找 **MCTS** 或 **SafeOPT** 的相关论文（批量检索可参考本节列表后的示例脚本）：\n\n1.  在页面使用浏览器搜索功能 (`Ctrl+F` 或 `Cmd+F`)。\n2.  输入算法关键词，例如：`AlphaZero` 或 `iLQG`。\n3.  点击对应的标题链接，直接跳转到论文原文、书籍页面或代码仓库。
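\n\n若已将仓库克隆到本地，也可以用一段简单脚本在 `README.md` 中批量检索条目。以下脚本并非本项目提供，仅作演示，关键词可自行替换：\n\n```python\nfrom pathlib import Path\n\n# 在本地克隆的 phd-bibliography 目录下运行：按关键词列出匹配的文献条目\nkeyword = 'MCTS'\ntext = Path('README.md').read_text(encoding='utf-8')\nfor line in text.splitlines():\n    # 只匹配以 * 开头的文献条目行，忽略大小写\n    if keyword.lower() in line.lower() and line.lstrip().startswith('*'):\n        print(line.strip())\n```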
\n\n**示例：查找蒙特卡洛树搜索 (MCTS) 相关文献**\n*   定位到 `Tree-Based Planning` 章节。\n*   找到 **`MCTS`**: *Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search* (Rémi Coulom, 2006)。\n*   找到 **`UCT`**: *Bandit based Monte-Carlo Planning* (Kocsis L., 2006)。\n*   点击链接即可阅读原始论文。\n\n### 3. 利用可视化图表\n项目中包含一张强化学习知识图谱 (`reinforcement-learning.svg`)，可在仓库根目录直接查看，帮助理清各子领域（如 Model-based, Policy-based, Offline RL 等）之间的逻辑关系。\n\n> **提示**：大部分链接指向学术论文数据库（如 arXiv, IEEE, Springer）或官方代码库。国内访问部分外文学术站点可能较慢，建议配合学术加速工具或使用机构网络访问。","某自动驾驶初创公司的算法工程师正在研发一套基于强化学习的城市道路决策系统，急需梳理最优控制与运动规划领域的理论基石以突破瓶颈。\n\n### 没有 phd-bibliography 时\n- **文献检索如大海捞针**：在 Google Scholar 上搜索\"Safe Control\"或\"Model-based RL\"时，返回成千上万篇论文，难以区分哪些是奠基性经典，哪些是过时方法。\n- **知识体系支离破碎**：手动整理的参考文献缺乏系统性，容易遗漏动态规划、博弈论或多智能体协同等关键分支的理论联系，导致算法架构设计存在盲区。\n- **复现基准难以确立**：面对黑盒优化或逆强化学习等细分方向，找不到权威的开源实现参考或标准测试用例（如 REPS 算法的具体出处），浪费大量时间验证基础理论。\n- **跨领域融合困难**：难以快速定位将“风险规避控制”与“分层时序抽象”结合的前沿交叉研究，限制了复杂场景下的策略创新能力。\n\n### 使用 phd-bibliography 后\n- **精准锁定核心经典**：直接查阅其分类清晰的目录，瞬间获取从 Bellman 动态规划到现代 MCTS 的必读清单，大幅缩短文献调研周期。\n- **构建完整知识图谱**：依托其结构化的大纲（涵盖从理论推导到自动驾驶应用的全链路），快速建立起从状态约束控制到不确定性系统处理的系统化认知框架。\n- **高效复现权威算法**：通过列表中提供的精确论文链接（如 Peters 的相对熵策略搜索），迅速找到算法源头与数学证明，为代码复现提供坚实依据。\n- **激发交叉创新灵感**：利用其独特的分类视角（如将模仿学习与逆向强化学习并列对比），快速发现多智能体博弈与表示学习的结合点，加速新策略原型的诞生。\n\nphd-bibliography 将原本散乱的学术海洋浓缩为一张精准的导航图，让研发人员能从繁琐的文献筛选中解放出来，专注于核心算法的突破与创新。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Feleurent_phd-bibliography_5af895cc.png","eleurent","Edouard Leurent","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Feleurent_6eb9fe3a.png","Research Scientist @google-deepmind","Google DeepMind","London","eleurent@gmail.com",null,"edouardleurent.com","https:\u002F\u002Fgithub.com\u002Feleurent",976,208,"2026-03-30T05:49:00",1,"","未说明",{"notes":93,"python":91,"dependencies":94},"该项目并非可执行的软件工具或代码库，而是一个学术文献综述列表（Bibliography），涵盖了最优控制、安全控制、博弈论、强化学习等领域的经典论文和书籍。因此，它没有特定的操作系统、硬件配置、Python 版本或依赖库要求。用户只需通过提供的链接阅读相关文献即可。",[],[16],[97,98,99,100,101],"bibliography","reinforcement-learning","optimal-control","motion-planning","robust-control","2026-03-27T02:49:30.150509","2026-04-13T13:38:14.282607",[],[]]