[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-Rafael1s--Deep-Reinforcement-Learning-Algorithms":3,"tool-Rafael1s--Deep-Reinforcement-Learning-Algorithms":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",150037,2,"2026-04-10T23:33:47",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 
都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":74,"owner_avatar_url":75,"owner_bio":76,"owner_company":76,"owner_location":77,"owner_email":78,"owner_twitter":76,"owner_website":76,"owner_url":79,"languages":80,"stars":89,"forks":90,"last_commit_at":91,"license":76,"difficulty_score":10,"env_os":92,"env_gpu":93,"env_ram":93,"env_deps":94,"category_tags":104,"github_topics":106,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":126,"updated_at":127,"faqs":128,"releases":156},4974,"Rafael1s\u002FDeep-Reinforcement-Learning-Algorithms","Deep-Reinforcement-Learning-Algorithms","32 projects in the framework of Deep Reinforcement Learning algorithms: Q-learning, DQN, PPO, DDPG, TD3, SAC, A2C and others. Each project is provided with a detailed training log.","Deep-Reinforcement-Learning-Algorithms 是一个专注于深度强化学习算法的开源项目集合，旨在为开发者提供从理论到实践的完整参考。该项目涵盖了 Q-learning、DQN、PPO、DDPG、TD3、SAC 及 A2C 等 32 个主流算法实现，并针对 Ant、BipedalWalker、LunarLander 等 20 多种经典仿真环境提供了具体的解决方案。\n\n它主要解决了强化学习领域“入门难、复现难”的痛点。通过将每个项目以 Jupyter Notebook 的形式呈现，并附带详细的训练日志，用户可以直接观察算法在不同环境下的收敛过程与性能表现，无需从零搭建实验框架。无论是验证蒙特卡洛方法、时序差分学习，还是探索基于策略梯度或 Actor-Critic 架构的高级模型，这里都提供了清晰的代码范例。\n\n该资源特别适合人工智能领域的研究人员、算法工程师以及高校学生使用。对于希望深入理解神经网络函数逼近能力、掌握异步优势演员 - 评论家（A3C）等复杂机制的学习者而言，这是一份极具价值的实战指南。其独特的“环境×模型”矩阵式布局，让用户能直观对比同一任务下","Deep-Reinforcement-Learning-Algorithms 是一个专注于深度强化学习算法的开源项目集合，旨在为开发者提供从理论到实践的完整参考。该项目涵盖了 Q-learning、DQN、PPO、DDPG、TD3、SAC 及 A2C 等 32 个主流算法实现，并针对 Ant、BipedalWalker、LunarLander 等 20 多种经典仿真环境提供了具体的解决方案。\n\n它主要解决了强化学习领域“入门难、复现难”的痛点。通过将每个项目以 Jupyter Notebook 的形式呈现，并附带详细的训练日志，用户可以直接观察算法在不同环境下的收敛过程与性能表现，无需从零搭建实验框架。无论是验证蒙特卡洛方法、时序差分学习，还是探索基于策略梯度或 Actor-Critic 架构的高级模型，这里都提供了清晰的代码范例。\n\n该资源特别适合人工智能领域的研究人员、算法工程师以及高校学生使用。对于希望深入理解神经网络函数逼近能力、掌握异步优势演员 - 评论家（A3C）等复杂机制的学习者而言，这是一份极具价值的实战指南。其独特的“环境×模型”矩阵式布局，让用户能直观对比同一任务下不同算法的表现差异，从而更高效地选择或优化适合特定场景的策略模型。","## Deep Reinforcement Learning Algorithms\n\nHere you can find several projects dedicated to the Deep Reinforcement Learning methods.     \nThe projects are deployed in the matrix form: **[env x model]**, where **env** is the environment   \nto be solved, and **model** is the model\u002Falgorithm which solves this environment. In some cases,    \nthe same environment is resolved by several algorithms. All projects are presented as   \na **jupyter notebook** containing **training log**.  
\n\nThe following environments are supported:  \n\n__AntBulletEnv__,  __BipedalWalker__, __BipedalWalkerHardcore__, __CarRacing__, __CartPole__, __Crawler__, __HalfCheetahBulletEnv__,   \n__HopperBulletEnv__,  __LunarLander__,  __LunarLanderContinuous__,  __Markov Decision 6x6__,  __Minitaur__, __Minitaur with Duck__,      \n__MountainCar__, __MountainCarContinuous__, __Pong__, __Navigation__, __Reacher__,  __Snake__,  __Tennis__, __Walker2DBulletEnv__.   \n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRafael1s_Deep-Reinforcement-Learning-Algorithms_readme_b8f8a164d91e.png)\n\nFour environments (__Navigation__,  __Crawler__, __Reacher__,  __Tennis__) are solved in the framework of the   \n[**_Udacity Deep Reinforcement Learning Nanodegree Program_**](https:\u002F\u002Fwww.udacity.com\u002Fcourse\u002Fdeep-reinforcement-learning-nanodegree--nd893).  \n \n* [_Monte-Carlo Methods_](https:\u002F\u002Fmedium.com\u002F@zsalloum\u002Fmonte-carlo-in-reinforcement-learning-the-easy-way-564c53010511)       \nIn Monte Carlo (MC), we play episodes of the game until we reach the end, we grab the rewards     \ncollected on the way and move backward to the start of the episode. We repeat this method   \na sufficient number of times and we average the value of each state.   \n* [_Temporal Difference Methods and Q-learning_](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FTemporal_difference_learning)\n* [_Reinforcement Learning in Continuous Space (Deep Q-Network)_](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FReinforcement_learning)\n* [_Function Approximation and Neural Network_](https:\u002F\u002Fmedium.com\u002Fbiffures\u002Frl-course-by-david-silver-lectures-5-to-7-576188d3b033)    \nThe [Universal Approximation Theorem (UAT) states](https:\u002F\u002Ftowardsdatascience.com\u002Fthe-approximation-power-of-neural-networks-with-python-codes-ddfc250bdb58) that feed-forward _neural networks_ containing a     \n_single hidden layer_ with a finite number of nodes can be used to approximate any continuous function     \nprovided rather mild assumptions about the form of the activation function are satisfied.\n* [_Policy-Based Methods_](https:\u002F\u002Ftowardsdatascience.com\u002Fpolicy-based-reinforcement-learning-the-easy-way-8de9a3356083), [_Hill-Climbing_](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHill_climbing), [_Simulated Annealing_](https:\u002F\u002Fmedium.com\u002F@macromoltek\u002Fmachine-learning-and-simulated-annealing-588b2e70d0cc)     \nRandom-restart _hill-climbing_ is a surprisingly effective algorithm in many cases.  _Simulated annealing_ is a good    \nprobabilistic technique because it does not accidentally mistake a local extremum for a global extremum.\n* [_Policy-Gradient Methods_](https:\u002F\u002Flilianweng.github.io\u002Flil-log\u002F2018\u002F04\u002F08\u002Fpolicy-gradient-algorithms.html), [_REINFORCE_](https:\u002F\u002Fmedium.com\u002Fsamkirkiles\u002Freinforce-policy-gradients-from-scratch-in-numpy-6a09ae0dfe12), [_PPO_](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.06347)    \nDefine a performance measure _J(\\theta)_ to maximize. Learn policy parameter \\theta through _approximate gradient ascent_.    
\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRafael1s_Deep-Reinforcement-Learning-Algorithms_readme_c471837a7f39.jpg)\n* [_Actor-Critic Methods_](https:\u002F\u002Ftowardsdatascience.com\u002Fsoft-actor-critic-demystified-b8427df61665), [_A3C_](https:\u002F\u002Fmedium.com\u002Femergent-future\u002Fsimple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2), [_A2C_](https:\u002F\u002Fhackernoon.com\u002Fintuitive-rl-intro-to-advantage-actor-critic-a2c-4ff545978752), [_DDPG_](https:\u002F\u002Fmedium.com\u002F@amitpatel.gt\u002Fpolicy-gradients-1edbbbc8de6b), [_TD3_](https:\u002F\u002Farxiv.org\u002Fabs\u002F1802.09477), [_SAC_](https:\u002F\u002Ftowardsdatascience.com\u002Fsoft-actor-critic-demystified-b8427df61665)    \nThe key difference from A2C is the Asynchronous part. A3C consists of multiple independent agents (networks) with   \ntheir own weights, which interact with a different copy of the environment in parallel. Thus, they can explore    \na bigger part of the state-action space in much less time.  \n* [_Forward-Looking Actor or FORK_](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.01652)    \nModel-based reinforcement learning uses the model in a sophisticated way, often based   \non deterministic or stochastic optimal control theory to optimize the policy based   \non the model. FORK only uses the _system network_ as a black box to forecast future states,   \nand does not use it as a mathematical model for optimizing control actions.     \nWith this key distinction, any model-free Actor-Critic algorithm with FORK remains   \nmodel-free.  \n\n\n### Projects, models and methods\n\n[AntBulletEnv, Soft Actor-Critic (SAC)](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FAnt-PyBulletEnv-Soft-Actor-Critic)    \n\n[BipedalWalker, Twin Delayed DDPG (TD3)](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FBipedalWalker-TwinDelayed-DDPG%20(TD3))     \n\n[BipedalWalker, PPO, Vectorized Environment](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Fblob\u002Fmaster\u002FBipedalWalker-PPO-VectorizedEnv)\n\n[BipedalWalker, Soft Actor-Critic (SAC)](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FBipedalWalker-Soft-Actor-Critic)\n\n[BipedalWalker, A2C, Vectorized Environment](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FBipedalWalker-A2C-VectorizedEnv)\n\n[CarRacing with PPO, Learning from Raw Pixels](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Fblob\u002Fmaster\u002FCarRacing-From-Pixels-PPO)\n\n[CartPole, Policy Based Methods, Hill Climbing](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FCartPole-Policy-Based-Hill-Climbing)    \n\n[CartPole, Policy Gradient Methods, REINFORCE](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FCartPole-Policy-Gradient-Reinforce)   \n\n[Cartpole, DQN](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FCartpole-Deep-Q-Learning)  \n\n[Cartpole, Double 
DQN](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FCartpole-Double-Deep-Q-Learning)   \n\n[HalfCheetahBulletEnv, Twin Delayed DDPG (TD3)](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FHalfCheetahBulletEnv-TD3)   \n\n[HopperBulletEnv, Twin Delayed DDPG (TD3)](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FHopperBulletEnv_v0-TD3)  \n\n[HopperBulletEnv, Soft Actor-Critic (SAC)](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FHopperBulletEnv-v0-SAC)  \n\n[LunarLander-v2, DQN](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FLunarLander-v2-DQN)\n\n[LunarLanderContinuous-v2, DDPG](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FLunarLanderContinuous-v2-DDPG)\n\n[Markov Decision Process, Monte-Carlo, Gridworld 6x6](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMarkov-Decision-Process_6x6)  \n\n[MinitaurBulletEnv, Soft Actor-Critic (SAC)](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMinitaur-Soft-Actor-Critic)\n\n[MinitaurBulletDuckEnv, Soft Actor-Critic (SAC)](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMinitaurDuck-Soft-Actor-Critic)   \n\n[MountainCar, Q-learning](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMountainCar-Q-Learning)    \n\n[MountainCar, DQN](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMountainCar-DQN)   \n\n[MountainCarContinuous, Twin Delayed DDPG (TD3)](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMountainCarContinuous-TD3)   \n\n[MountainCarContinuous, PPO, Vectorized Environment](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMountainCarContinuous_PPO)   \n\n[Pong, Policy Gradient Methods, PPO](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FPong-Policy-Gradient-PPO)      \n\n[Pong, Policy Gradient Methods, REINFORCE](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FPong-Policy-Gradient-REINFORCE)   \n\n[Snake, DQN, Pygame](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FSnake-Pygame-DQN)\n\n[Udacity Project 1: Navigation, DQN, ReplayBuffer](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FProject-1_Navigation-DQN)   \n\n[Udacity Project 2: Continuous Control-Reacher, DDPG](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FProject-2_Continuous-Control-Reacher-DDPG), environment [Reacher (Double-Jointed-Arm)](https:\u002F\u002Fgithub.com\u002FUnity-Technologies\u002Fml-agents\u002Fblob\u002Fmaster\u002Fdocs\u002FLearning-Environment-Examples.md#reacher)    \n\n[Udacity Project 2: Continuous 
Control-Crawler, PPO](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FProject-2_Continuous-Control-Crawler-PPO), environment [Crawler](https:\u002F\u002Fgithub.com\u002FUnity-Technologies\u002Fml-agents\u002Fblob\u002Fmaster\u002Fdocs\u002FLearning-Environment-Examples.md#crawler)    \n     \n[Udacity Project 3: Collaboration_Competition-Tennis, Multi-agent DDPG](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FProject-3_Collaboration_Competition-Tennis-Maddpg), environment [Tennis](https:\u002F\u002Fgithub.com\u002FUnity-Technologies\u002Fml-agents\u002Fblob\u002Fmaster\u002Fdocs\u002FLearning-Environment-Examples.md#tennis)     \n\n[Walker2DBulletEnv, Twin Delayed DDPG (TD3)](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FWalker2DBulletEnv-v0_TD3)   \n\n[Walker2DBulletEnv, Soft Actor-Critic (SAC)](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FWalker2DBulletEnv-v0_SAC)\n\n### Projects with DQN and Double DQN\n\n* [Cartpole, DQN](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FCartpole-Deep-Q-Learning)    \n* [Cartpole, Double DQN](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FCartpole-Double-Deep-Q-Learning)  \n* [LunarLander-v2, DQN](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FLunarLander-v2-DQN)   \n* [MountainCar, DQN](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMountainCar-DQN)\n* [Navigation, DQN](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FProject-1_Navigation-DQN)      \n* [Snake, DQN, Pygame](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FSnake-Pygame-DQN)\n  \n### Projects with PPO\n  * [BipedalWalker](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002F\u002FBipedalWalker-PPO-VectorizedEnv),  16 environments   \n  * [CarRacing](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FCarRacing-From-Pixels-PPO),  Single environment, Learning from pixels   \n  * [Crawler](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FProject-2_Continuous-Control-Crawler-PPO), 12 environments      \n  * [MountainCarContinuous](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMountainCarContinuous_PPO), 16 environments\n  * [Pong](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FPong-Policy-Gradient-PPO), 8 environments    \n\n### Projects with TD3\n  * [BipedalWalker](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FBipedalWalker-TwinDelayed-DDPG%20(TD3))    \n  * [HalfCheetahBulletEnv](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FHalfCheetahBulletEnv-TD3)     \n  * 
[HopperBulletEnv](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FHopperBulletEnv_v0-TD3)    \n  * [MountainCarContinuous](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMountainCarContinuous-TD3)   \n  * [Walker2DBulletEnv](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FWalker2DBulletEnv-v0_TD3)   \n  \n ### Projects with Soft Actor-Critic (SAC)\n * [AntBulletEnv](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FAnt-PyBulletEnv-Soft-Actor-Critic)   \n * [BipedalWalker](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FBipedalWalker-Soft-Actor-Critic)   \n * [HopperBulletEnv](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FHopperBulletEnv-v0-SAC)   \n * [MinitaurBulletEnv](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMinitaur-Soft-Actor-Critic)   \n * [MinitaurBulletDuckEnv](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMinitaurDuck-Soft-Actor-Critic)\n * [Walker2dBulletEnv](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FWalker2DBulletEnv-v0_SAC)   \n \n  \n ###  BipedalWalker, different models\n  \n* [BipedalWalker, Twin Delayed DDPG (TD3)](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FBipedalWalker-TwinDelayed-DDPG%20(TD3))     \n* [BipedalWalker, PPO, Vectorized Environment](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Fblob\u002Fmaster\u002FBipedalWalker-PPO-VectorizedEnv)   \n* [BipedalWalker, Soft-Actor-Critic (SAC)](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FBipedalWalker-Soft-Actor-Critic)    \n* [BipedalWalker, A2C, Vectorized Environment](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FBipedalWalker-A2C-VectorizedEnv)  \n\n### CartPole, different models\n\n* [CartPole, Policy Based Methods, Hill Climbing](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FCartPole-Policy-Based-Hill-Climbing)    \n* [CartPole, Policy Gradient Methods, REINFORCE](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FCartPole-Policy-Gradient-Reinforce)   \n* [Cartpole with Deep Q-Learning](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FCartpole-Deep-Q-Learning)   \n* [Cartpole with Double Deep Q-Learning](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FCartpole-Double-Deep-Q-Learning)    \n\n### For more links \n\n  * on _Policy-Gradient Methods_, see [1](https:\u002F\u002Fmedium.com\u002F@jonathan_hui\u002Frl-policy-gradients-explained-9b13b688b146), [2](https:\u002F\u002Ftowardsdatascience.com\u002Fan-intuitive-explanation-of-policy-gradient-part-1-reinforce-aa4392cbfd3c), 
[3](https:\u002F\u002Ftowardsdatascience.com\u002Fpolicy-gradients-in-a-nutshell-8b72f9743c5d).\n  * on _REINFORCE_, see [1](https:\u002F\u002Ftowardsdatascience.com\u002Fan-intuitive-explanation-of-policy-gradient-part-1-reinforce-aa4392cbfd3c),\n  [2](http:\u002F\u002Fkarpathy.github.io\u002F2016\u002F05\u002F31\u002Frl\u002F), [3](https:\u002F\u002Fmedium.com\u002Fmini-distill\u002Fdiscrete-optimization-beyond-reinforce-5ca171bebf17).       \n  * on _PPO_,  see [1](https:\u002F\u002Fmedium.com\u002Farxiv-bytes\u002Fsummary-proximal-policy-optimization-ppo-86e41b557a8b), [2](https:\u002F\u002Fopenai.com\u002Fblog\u002Fopenai-baselines-ppo\u002F), [3](https:\u002F\u002Ftowardsdatascience.com\u002Fthe-pursuit-of-robotic-happiness-how-trpo-and-ppo-stabilize-policy-gradient-methods-545784094e3b), [4](https:\u002F\u002Fmedium.com\u002F@jonathan_hui\u002Frl-proximal-policy-optimization-ppo-explained-77f014ec3f12), [5](https:\u002F\u002Ftowardsdatascience.com\u002Fintroduction-to-various-reinforcement-learning-algorithms-part-ii-trpo-ppo-87f2c5919bb9).        \n  * on _DDPG_, see [1](https:\u002F\u002Ftowardsdatascience.com\u002Fintroduction-to-various-reinforcement-learning-algorithms-i-q-learning-sarsa-dqn-ddpg-72a5e0cb6287), [2](https:\u002F\u002Fspinningup.openai.com\u002Fen\u002Flatest\u002Falgorithms\u002Fddpg.html#the-q-learning-side-of-ddpg).        \n  * on _Actor-Critic Methods_, and _A3C_, see [1](https:\u002F\u002Ftowardsdatascience.com\u002Fadvanced-reinforcement-learning-6d769f529eb3), [2](https:\u002F\u002Fblog.goodaudience.com\u002Fa3c-what-it-is-what-i-built-6b91fe5ec09c), [3](https:\u002F\u002Ftowardsdatascience.com\u002Funderstanding-actor-critic-methods-931b97b6df3f), [4](http:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F1786-actor-critic-algorithms.pdf).          
\n   * on _TD3_, see [1](https:\u002F\u002Farxiv.org\u002Fabs\u002F1802.09477), [2](https:\u002F\u002Fspinningup.openai.com\u002Fen\u002Flatest\u002Falgorithms\u002Ftd3.html), [3](https:\u002F\u002Fstable-baselines.readthedocs.io\u002Fen\u002Fmaster\u002Fmodules\u002Ftd3.html)    \n   * on _SAC_, see [1](https:\u002F\u002Farxiv.org\u002Fabs\u002F1801.01290), [2](https:\u002F\u002Ftowardsdatascience.com\u002Fsoft-actor-critic-demystified-b8427df61665), [3](https:\u002F\u002Fstable-baselines.readthedocs.io\u002Fen\u002Fmaster\u002Fmodules\u002Fsac.html), [4](https:\u002F\u002Fspinningup.openai.com\u002Fen\u002Flatest\u002Falgorithms\u002Fsac.html), [5](https:\u002F\u002Fsites.google.com\u002Fview\u002Fsac-and-applications)     \n   * on _A2C_,  see [1](https:\u002F\u002Ftowardsdatascience.com\u002Funderstanding-actor-critic-methods-931b97b6df3f), [2](https:\u002F\u002Fopenai.com\u002Fblog\u002Fbaselines-acktr-a2c\u002F), [3](https:\u002F\u002Fsergioskar.github.io\u002FActor_critics\u002F), [4](https:\u002F\u002Fstable-baselines.readthedocs.io\u002Fen\u002Fmaster\u002Fmodules\u002Fa2c.html), [5](https:\u002F\u002Fhackernoon.com\u002Fintuitive-rl-intro-to-advantage-actor-critic-a2c-4ff545978752)      \n\n### My articles on TowardsDataScience\n\n* [How does the Bellman equation work in Deep Reinforcement Learning?](https:\u002F\u002Ftowardsdatascience.com\u002Fhow-the-bellman-equation-works-in-deep-reinforcement-learning-5301fe41b25a)  \n* [A pair of interrelated neural networks in Deep Q-Network](https:\u002F\u002Ftowardsdatascience.com\u002Fa-pair-of-interrelated-neural-networks-in-dqn-f0f58e09b3c4)    \n* [Three aspects of Deep Reinforcement Learning: noise, overestimation and exploration](https:\u002F\u002Ftowardsdatascience.com\u002Fthree-aspects-of-deep-rl-noise-overestimation-and-exploration-122ffb4bb92b)      \n* [Entropy in Soft Actor-Critic (Part 1)](https:\u002F\u002Ftowardsdatascience.com\u002Fentropy-in-soft-actor-critic-part-1-92c2cd3a3515)   \n* [Entropy in Soft Actor-Critic (Part 2)](https:\u002F\u002Ftowardsdatascience.com\u002Fentropy-in-soft-actor-critic-part-2-59821bdd5671)\n\n### Videos I have developed within the above projects\n* [Four BipedalWalker Gaits](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=PFixqZEYKh4)      \n* [BipedalWalker by Training Stages](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=g01mIFbxVns)  \n* [CarRacing by Training Stages](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=55buBR2pPdc)\n* [Lucky Hopper](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Ipctq89yLB0)\n* [Martian Ant](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=s7aMZ1bbQgk)\n* [Lunar Armada](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=6O6g9LCWvIs)\n* [Wooden Snake](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=-T4wQirNDRo)\n* [Walking through the chess fields](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=qUT3TznKWAk)\n* [Artificial snake on the way](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=-jNfUrVniNg)\n* [Learned Long Snake](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Tt1rqWTR8ZA)\n* [Such a fast cheetah](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Q-FchLEZKRk)\n* [Four stages of Minitaur training](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=uEAqyEwvi54)\n* [Chessboard chase with four Pybullet actors](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=NXX4GTim_NM)\n* [You can sleep while I drive, Minitaur with Duck](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=_7_Rke5R3JQ)\n\n\n","## 深度强化学习算法\n\n在这里，您可以找到多个专注于深度强化学习方法的项目。  \n这些项目以矩阵形式呈现：**[环境 × 模型]**，其中 **环境** 
是需要求解的任务场景，而 **模型** 则是用于解决该任务的算法或模型。在某些情况下，同一个环境会由多种算法来求解。所有项目均以 **Jupyter Notebook** 的形式呈现，包含 **训练日志**。\n\n支持的环境如下：\n\n__AntBulletEnv__、__BipedalWalker__、__BipedalWalkerHardcore__、__CarRacing__、__CartPole__、__Crawler__、__HalfCheetahBulletEnv__、  \n__HopperBulletEnv__、__LunarLander__、__LunarLanderContinuous__、__Markov决策6×6__、__Minitaur__、__Minitaur with Duck__、  \n__MountainCar__、__MountainCarContinuous__、__Pong__、__Navigation__、__Reacher__、__Snake__、__Tennis__、__Walker2DBulletEnv__。\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRafael1s_Deep-Reinforcement-Learning-Algorithms_readme_b8f8a164d91e.png)\n\n其中四个环境（__Navigation__、__Crawler__、__Reacher__、__Tennis__）是在 [**Udacity深度强化学习纳米学位课程**](https:\u002F\u002Fwww.udacity.com\u002Fcourse\u002Fdeep-reinforcement-learning-nanodegree--nd893) 的框架下完成的。\n\n* [_蒙特卡洛方法_](https:\u002F\u002Fmedium.com\u002F@zsalloum\u002Fmonte-carlo-in-reinforcement-learning-the-easy-way-564c53010511)  \n在蒙特卡洛（MC）方法中，我们通过完整地进行多轮游戏直到结束，记录沿途获得的奖励，然后从终点回溯到起点。重复这一过程足够多次后，我们对每个状态的价值取平均值。\n\n* [_时序差分方法与Q-Learning_](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FTemporal_difference_learning)\n* [_连续空间中的强化学习（深度Q网络）_](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FReinforcement_learning)\n* [_函数逼近与神经网络_](https:\u002F\u002Fmedium.com\u002Fbiffures\u002Frl-course-by-david-silver-lectures-5-to-7-576188d3b033)  \n根据**通用逼近定理（UAT）**，只要激活函数的形式满足较为宽松的条件，含有单个隐藏层且节点数有限的前馈式_神经网络_就能够逼近任意连续函数。\n* [_基于策略的方法_](https:\u002F\u002Ftowardsdatascience.com\u002Fpolicy-based-reinforcement-learning-the-easy-way-8de9a3356083)、[_爬山法_](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHill_climbing)、[_模拟退火_](https:\u002F\u002Fmedium.com\u002F@macromoltek\u002Fmachine-learning-and-simulated-annealing-588b2e70d0cc)  \n随机重启的_爬山法_在许多情况下表现得非常有效。而_模拟退火_则是一种优秀的概率性优化技术，因为它不会将局部极值误判为全局极值。\n* [_策略梯度方法_](https:\u002F\u002Flilianweng.github.io\u002Flil-log\u002F2018\u002F04\u002F08\u002Fpolicy-gradient-algorithms.html)、[_REINFORCE_](https:\u002F\u002Fmedium.com\u002Fsamkirkiles\u002Freinforce-policy-gradients-from-scratch-in-numpy-6a09ae0dfe12)、[_PPO_](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.06347)  \n定义一个需要最大化的性能指标 _J(θ)_，并通过_近似梯度上升_来学习策略参数 θ。  \n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRafael1s_Deep-Reinforcement-Learning-Algorithms_readme_c471837a7f39.jpg)\n* [_演员-评论家方法_](https:\u002F\u002Ftowardsdatascience.com\u002Fsoft-actor-critic-demystified-b8427df61665)、[_A3C_](https:\u002F\u002Fmedium.com\u002Femergent-future\u002Fsimple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2)、[_A2C_](https:\u002F\u002Fhackernoon.com\u002Fintuitive-rl-intro-to-advantage-actor-critic-a2c-4ff545978752)、[_DDPG_](https:\u002F\u002Fmedium.com\u002F@amitpatel.gt\u002Fpolicy-gradients-1edbbbc8de6b)、[_TD3_](https:\u002F\u002Farxiv.org\u002Fabs\u002F1802.09477)、[_SAC_](https:\u002F\u002Ftowardsdatascience.com\u002Fsoft-actor-critic-demystified-b8427df61665)  \n与A2C的主要区别在于其异步特性。A3C由多个独立的智能体（网络）组成，各自拥有独立的权重，并并行地与不同副本的环境交互。因此，它们能够在更短的时间内探索更大的状态-动作空间。\n* [_前瞻型演员或FORK_](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.01652)  \n基于模型的强化学习会以复杂的方式利用模型，通常依托确定性或随机最优控制理论来优化策略。而FORK仅将_系统网络_作为一个黑箱来预测未来状态，而不将其用作数学模型来进行控制动作的优化。正是由于这一关键区别，任何无模型的演员-评论家算法结合FORK后，仍然保持无模型的特性。\n\n### 项目、模型和方法\n\n[AntBulletEnv，软演员-评论家（SAC）](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FAnt-PyBulletEnv-Soft-Actor-Critic)    
\n\n[BipedalWalker，双延迟DDPG（TD3）](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FBipedalWalker-TwinDelayed-DDPG%20(TD3))     \n\n[BipedalWalker，PPO，向量化环境](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Fblob\u002Fmaster\u002FBipedalWalker-PPO-VectorizedEnv)\n\n[BipedalWalker，软演员-评论家（SAC）](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FBipedalWalker-Soft-Actor-Critic)\n\n[BipedalWalker，A2C，向量化环境](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FBipedalWalker-A2C-VectorizedEnv)\n\n[CarRacing与PPO，从原始像素学习](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Fblob\u002Fmaster\u002FCarRacing-From-Pixels-PPO)\n\n[CartPole，基于策略的方法，爬山法](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FCartPole-Policy-Based-Hill-Climbing)    \n\n[CartPole，策略梯度方法，REINFORCE](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FCartPole-Policy-Gradient-Reinforce)   \n\n[Cartpole，DQN](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FCartpole-Deep-Q-Learning)  \n\n[Cartpole，双DQN](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FCartpole-Double-Deep-Q-Learning)   \n\n[HalfCheetahBulletEnv，双延迟DDPG（TD3）](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FHalfCheetahBulletEnv-TD3)   \n\n[HopperBulletEnv，双延迟DDPG（TD3）](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FHopperBulletEnv_v0-TD3)  \n\n[HopperBulletEnv，软演员-评论家（SAC）](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FHopperBulletEnv-v0-SAC)  \n\n[LunarLander-v2，DQN](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FLunarLander-v2-DQN)\n\n[LunarLanderContinuous-v2，DDPG](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FLunarLanderContinuous-v2-DDPG)\n\n[马尔可夫决策过程，蒙特卡洛，6x6网格世界](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMarkov-Decision-Process_6x6)  \n\n[MinitaurBulletEnv，软演员-评论家（SAC）](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMinitaur-Soft-Actor-Critic)\n\n[MinitaurBulletDuckEnv，软演员-评论家（SAC）](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMinitaurDuck-Soft-Actor-Critic)   \n\n[MountainCar，Q-learning](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMountainCar-Q-Learning)    \n\n[MountainCar，DQN](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMountainCar-DQN)   \n\n[MountainCarContinuous，双延迟DDPG（TD3）](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMountainCarContinuous-TD3)   
\n\n[MountainCarContinuous，PPO，向量化环境](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMountainCarContinuous_PPO)   \n\n[Pong，策略梯度方法，PPO](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FPong-Policy-Gradient-PPO)      \n\n[Pong，策略梯度方法，REINFORCE](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FPong-Policy-Gradient-REINFORCE)   \n\n[Snake，DQN，Pygame](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FSnake-Pygame-DQN)\n\n[Udacity项目1：导航，DQN，回放缓冲区](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FProject-1_Navigation-DQN)   \n\n[Udacity项目2：连续控制-机械臂，DDPG](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FProject-2_Continuous-Control-Reacher-DDPG)，环境 [Reacher（双关节臂）](https:\u002F\u002Fgithub.com\u002FUnity-Technologies\u002Fml-agents\u002Fblob\u002Fmaster\u002Fdocs\u002FLearning-Environment-Examples.md#reacher)    \n\n[Udacity项目2：连续控制-爬行者，PPO](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FProject-2_Continuous-Control-Crawler-PPO)，环境 [Crawler（爬行者）](https:\u002F\u002Fgithub.com\u002FUnity-Technologies\u002Fml-agents\u002Fblob\u002Fmaster\u002Fdocs\u002FLearning-Environment-Examples.md#crawler)    \n     \n[Udacity项目3：协作-竞争-网球，多智能体DDPG](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FProject-3_Collaboration_Competition-Tennis-Maddpg)，环境 [Tennis（网球）](https:\u002F\u002Fgithub.com\u002FUnity-Technologies\u002Fml-agents\u002Fblob\u002Fmaster\u002Fdocs\u002FLearning-Environment-Examples.md#tennis)     \n\n[Walker2DBulletEnv，双延迟DDPG（TD3）](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FWalker2DBulletEnv-v0_TD3)   \n\n[Walker2DBulletEnv，软演员-评论家（SAC）](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FWalker2DBulletEnv-v0_SAC)\n\n### 使用DQN和双DQN的项目\n\n* [Cartpole，DQN](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FCartpole-Deep-Q-Learning)    \n* [Cartpole，双DQN](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FCartpole-Double-Deep-Q-Learning)  \n* [LunarLander-v2，DQN](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FLunarLander-v2-DQN)   \n* [MountainCar，DQN](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMountainCar-DQN)\n* [导航，DQN](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FProject-1_Navigation-DQN)      \n* [Snake，DQN，Pygame](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FSnake-Pygame-DQN)\n  \n### 使用PPO的项目\n  * [BipedalWalker](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002F\u002FBipedalWalker-PPO-VectorizedEnv)，16个环境   \n  * 
[CarRacing](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FCarRacing-From-Pixels-PPO)，单个环境，从像素学习   \n  * [Crawler](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FProject-2_Continuous-Control-Crawler-PPO)，12个环境      \n  * [MountainCarContinuous](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMountainCarContinuous_PPO)，16个环境\n  * [Pong](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FPong-Policy-Gradient-PPO)，8个环境\n\n### 使用TD3的项目\n  * [BipedalWalker](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FBipedalWalker-TwinDelayed-DDPG%20(TD3))    \n  * [HalfCheetahBulletEnv](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FHalfCheetahBulletEnv-TD3)     \n  * [HopperBulletEnv](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FHopperBulletEnv_v0-TD3)    \n  * [MountainCarContinuous](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMountainCarContinuous-TD3)   \n  * [Walker2DBulletEnv](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FWalker2DBulletEnv-v0_TD3)   \n  \n ### 使用软演员-评论家（SAC）的项目\n * [AntBulletEnv](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FAnt-PyBulletEnv-Soft-Actor-Critic)   \n * [BipedalWalker](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FBipedalWalker-Soft-Actor-Critic)   \n * [HopperBulletEnv](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FHopperBulletEnv-v0-SAC)   \n * [MinitaurBulletEnv](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMinitaur-Soft-Actor-Critic)   \n * [MinitaurBulletDuckEnv](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FMinitaurDuck-Soft-Actor-Critic)\n * [Walker2dBulletEnv](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FWalker2DBulletEnv-v0_SAC)   \n \n  \n ### BipedalWalker，不同模型\n  \n* [BipedalWalker，双延迟DDPG（TD3）](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FBipedalWalker-TwinDelayed-DDPG%20(TD3))     \n* [BipedalWalker，PPO，向量化环境](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Fblob\u002Fmaster\u002FBipedalWalker-PPO-VectorizedEnv)   \n* [BipedalWalker，软演员-评论家（SAC）](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FBipedalWalker-Soft-Actor-Critic)    \n* [BipedalWalker，A2C，向量化环境](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FBipedalWalker-A2C-VectorizedEnv)  \n\n### CartPole，不同模型\n\n* 
[CartPole，基于策略的方法，爬山法](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FCartPole-Policy-Based-Hill-Climbing)    \n* [CartPole，策略梯度方法，REINFORCE](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FCartPole-Policy-Gradient-Reinforce)   \n* [使用深度Q学习的Cartpole](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FCartpole-Deep-Q-Learning)   \n* [使用双重深度Q学习的Cartpole](https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Ftree\u002Fmaster\u002FCartpole-Double-Deep-Q-Learning)    \n\n### 更多链接 \n\n  * 关于_策略梯度方法_，参见[1](https:\u002F\u002Fmedium.com\u002F@jonathan_hui\u002Frl-policy-gradients-explained-9b13b688b146), [2](https:\u002F\u002Ftowardsdatascience.com\u002Fan-intuitive-explanation-of-policy-gradient-part-1-reinforce-aa4392cbfd3c), [3](https:\u002F\u002Ftowardsdatascience.com\u002Fpolicy-gradients-in-a-nutshell-8b72f9743c5d)。\n  * 关于_REINFORCE_，参见[1](https:\u002F\u002Ftowardsdatascience.com\u002Fan-intuitive-explanation-of-policy-gradient-part-1-reinforce-aa4392cbfd3c),\n  [2](http:\u002F\u002Fkarpathy.github.io\u002F2016\u002F05\u002F31\u002Frl\u002F), [3](https:\u002F\u002Fmedium.com\u002Fmini-distill\u002Fdiscrete-optimization-beyond-reinforce-5ca171bebf17)。       \n  * 关于_PPO_，参见[1](https:\u002F\u002Fmedium.com\u002Farxiv-bytes\u002Fsummary-proximal-policy-optimization-ppo-86e41b557a8b), [2](https:\u002F\u002Fopenai.com\u002Fblog\u002Fopenai-baselines-ppo\u002F), [3](https:\u002F\u002Ftowardsdatascience.com\u002Fthe-pursuit-of-robotic-happiness-how-trpo-and-ppo-stabilize-policy-gradient-methods-545784094e3b), [4](https:\u002F\u002Fmedium.com\u002F@jonathan_hui\u002Frl-proximal-policy-optimization-ppo-explained-77f014ec3f12), [5](https:\u002F\u002Ftowardsdatascience.com\u002Fintroduction-to-various-reinforcement-learning-algorithms-part-ii-trpo-ppo-87f2c5919bb9)。        \n  * 关于_DDPG_，参见[1](https:\u002F\u002Ftowardsdatascience.com\u002Fintroduction-to-various-reinforcement-learning-algorithms-i-q-learning-sarsa-dqn-ddpg-72a5e0cb6287), [2](https:\u002F\u002Fspinningup.openai.com\u002Fen\u002Flatest\u002Falgorithms\u002Fddpg.html#the-q-learning-side-of-ddpg)。        \n  * 关于_演员-评论家方法_和_A3C_，参见[1](https:\u002F\u002Ftowardsdatascience.com\u002Fadvanced-reinforcement-learning-6d769f529eb3), [2](https:\u002F\u002Fblog.goodaudience.com\u002Fa3c-what-it-is-what-i-built-6b91fe5ec09c), [3](https:\u002F\u002Ftowardsdatascience.com\u002Funderstanding-actor-critic-methods-931b97b6df3f), [4](http:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F1786-actor-critic-algorithms.pdf)。          \n   * 关于_TD3_，参见[1](https:\u002F\u002Farxiv.org\u002Fabs\u002F1802.09477), [2](https:\u002F\u002Fspinningup.openai.com\u002Fen\u002Flatest\u002Falgorithms\u002Ftd3.html), [3](https:\u002F\u002Fstable-baselines.readthedocs.io\u002Fen\u002Fmaster\u002Fmodules\u002Ftd3.html)    \n   * 关于_SAC_，参见[1](https:\u002F\u002Farxiv.org\u002Fabs\u002F1801.01290), [2](https:\u002F\u002Ftowardsdatascience.com\u002Fsoft-actor-critic-demystified-b8427df61665), [3](https:\u002F\u002Fstable-baselines.readthedocs.io\u002Fen\u002Fmaster\u002Fmodules\u002Fsac.html), [4](https:\u002F\u002Fspinningup.openai.com\u002Fen\u002Flatest\u002Falgorithms\u002Fsac.html), [5](https:\u002F\u002Fsites.google.com\u002Fview\u002Fsac-and-applications)     \n   * 
关于_A2C_，参见[1](https:\u002F\u002Ftowardsdatascience.com\u002Funderstanding-actor-critic-methods-931b97b6df3f), [2](https:\u002F\u002Fopenai.com\u002Fblog\u002Fbaselines-acktr-a2c\u002F), [3](https:\u002F\u002Fsergioskar.github.io\u002FActor_critics\u002F), [4](https:\u002F\u002Fstable-baselines.readthedocs.io\u002Fen\u002Fmaster\u002Fmodules\u002Fa2c.html), [5](https:\u002F\u002Fhackernoon.com\u002Fintuitive-rl-intro-to-advantage-actor-critic-a2c-4ff545978752)      \n\n### 我在TowardsDataScience上的文章\n\n* [贝尔曼方程在深度强化学习中是如何工作的？](https:\u002F\u002Ftowardsdatascience.com\u002Fhow-the-bellman-equation-works-in-deep-reinforcement-learning-5301fe41b25a)  \n* [深度Q网络中的两组相互关联的神经网络](https:\u002F\u002Ftowardsdatascience.com\u002Fa-pair-of-interrelated-neural-networks-in-dqn-f0f58e09b3c4)    \n* [深度强化学习的三个方面：噪声、过估计和探索](https:\u002F\u002Ftowardsdatascience.com\u002Fthree-aspects-of-deep-rl-noise-overestimation-and-exploration-122ffb4bb92b)      \n* [软演员-评论家中的熵（第1部分）](https:\u002F\u002Ftowardsdatascience.com\u002Fentropy-in-soft-actor-critic-part-1-92c2cd3a3515)   \n* [软演员-评论家中的熵（第2部分）](https:\u002F\u002Ftowardsdatascience.com\u002Fentropy-in-soft-actor-critic-part-2-59821bdd5671)\n\n### 我在上述项目中开发的视频\n* [四种双足行走者步态](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=PFixqZEYKh4)      \n* [按训练阶段展示的双足行走者](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=g01mIFbxVns)  \n* [按训练阶段展示的赛车](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=55buBR2pPdc)\n* [幸运跳跃者](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Ipctq89yLB0)\n* [火星蚂蚁](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=s7aMZ1bbQgk)\n* [月球舰队](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=6O6g9LCWvIs)\n* [木制蛇](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=-T4wQirNDRo)\n* [穿越国际象棋棋盘](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=qUT3TznKWAk)\n* [正在前进的人工蛇](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=-jNfUrVniNg)\n* [学会行走的长蛇](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Tt1rqWTR8ZA)\n* [如此迅捷的猎豹](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=Q-FchLEZKRk)\n* [Minitaur的四个训练阶段](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=uEAqyEwvi54)\n* [四名Pybullet角色在棋盘上的追逐](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=NXX4GTim_NM)\n* [我可以开车，你尽管睡吧——带鸭子的Minitaur](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=_7_Rke5R3JQ)","# Deep-Reinforcement-Learning-Algorithms 快速上手指南\n\n本仓库汇集了多种深度强化学习（DRL）算法的实现，采用 **[环境 x 模型]** 的矩阵形式组织项目。每个项目均包含完整的 **Jupyter Notebook** 代码及训练日志，涵盖从经典的 CartPole 到复杂的 BipedalWalker 等多种环境。\n\n## 环境准备\n\n在开始之前，请确保您的系统满足以下要求：\n\n*   **操作系统**: Linux, macOS 或 Windows\n*   **Python 版本**: 推荐 Python 3.6 - 3.8 (部分旧版 Gym 环境可能对新版 Python 兼容性有限)\n*   **硬件建议**: 虽然部分简单环境可在 CPU 上运行，但推荐使用 NVIDIA GPU 以加速深度学习模型的训练。\n*   **前置依赖**:\n    *   `pip` 包管理工具\n    *   `git` 版本控制工具\n    *   Jupyter Notebook (`jupyter` 或 `jupyterlab`)\n\n## 安装步骤\n\n### 1. 克隆仓库\n首先将项目代码克隆到本地：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms.git\ncd Deep-Reinforcement-Learning-Algorithms\n```\n\n### 2. 创建虚拟环境 (推荐)\n为避免依赖冲突，建议使用 `conda` 或 `venv` 创建独立环境：\n\n```bash\n# 使用 conda\nconda create -n drl_env python=3.7\nconda activate drl_env\n\n# 或使用 venv\npython -m venv drl_env\nsource drl_env\u002Fbin\u002Factivate  # Linux\u002FmacOS\n# drl_env\\Scripts\\activate   # Windows\n```\n\n### 3. 
安装依赖\n进入具体的项目文件夹后，通常会有一个 `requirements.txt` 文件。由于不同项目（如基于像素的 CarRacing 或基于物理引擎的 BulletEnv）依赖略有不同，建议按需安装。\n\n通用基础依赖安装命令：\n```bash\npip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\npip install gymnasium pygame matplotlib numpy jupyter\n```\n\n> **注意**：部分特定环境（如 `BipedalWalkerHardcore` 或 `BulletEnv`）可能需要额外安装 `swig` 和 `box2d-py`。\n> *   Linux: `sudo apt-get install swig`\n> *   Mac: `brew install swig`\n> *   Python 包: `pip install box2d-py pybullet`\n\n如果遇到下载速度慢的问题，可使用国内镜像源：\n```bash\npip install -r requirements.txt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n## 基本使用\n\n本仓库的核心内容是 **Jupyter Notebook**。每个子文件夹对应一个特定的“环境 + 算法”组合。\n\n### 1. 选择项目\n根据你想尝试的环境和算法，进入对应的目录。例如，想要运行 **CartPole** 环境的 **DQN** 算法：\n\n```bash\ncd Cartpole-Deep-Q-Learning\n```\n\n### 2. 启动 Jupyter Notebook\n在项目目录下启动 Notebook 服务：\n\n```bash\njupyter notebook\n```\n\n### 3. 运行示例\n在浏览器打开的界面中，找到对应的 `.ipynb` 文件（通常命名为 `Train_Agent.ipynb` 或类似名称）。\n\n*   **查看代码**: 单元格中包含了环境初始化、模型定义、训练循环等完整逻辑。\n*   **执行训练**: 依次点击单元格运行（Shift+Enter）。\n*   **观察结果**: Notebook 会实时输出训练奖励（Reward）曲线，并在训练结束后展示代理（Agent）在环境中的表现视频或截图。\n\n### 典型项目结构示例\n以 `Cartpole-Deep-Q-Learning` 为例，其核心逻辑通常如下（伪代码示意，具体请以 Notebook 为准）：\n\n```python\nimport torch\nimport gym\nfrom dqn_agent import Agent # 假设的代理类\n\n# 1. 初始化环境\nenv = gym.make('CartPole-v1')\n\n# 2. 初始化代理\nagent = Agent(state_size=4, action_size=2, seed=0)\n\n# 3. 训练循环\nscores = []\nfor episode in range(2000):\n    state = env.reset()\n    score = 0\n    for t in range(1000):\n        action = agent.act(state)\n        next_state, reward, done, _ = env.step(action)\n        agent.step(state, action, reward, next_state, done)\n        state = next_state\n        score += reward\n        if done:\n            break\n    scores.append(score)\n    # 打印平均得分...\n```\n\n### 常用项目推荐\n*   **入门首选**: `Cartpole-Deep-Q-Learning` (经典平衡杆问题)\n*   **连续控制**: `LunarLanderContinuous-v2-DDPG` (登月舱着陆)\n*   **视觉输入**: `CarRacing-From-Pixels-PPO` (直接从像素学习赛车)\n*   **多智能体**: `Project-3_Collaboration_Competition-Tennis-Maddpg` (网球双打协作)\n\n运行完 Notebook 后，您可以根据需要修改超参数或网络结构，探索不同算法在相同环境下的表现差异。","某机器人初创团队正在研发一款能在复杂地形中自主行走的双足机器人，急需验证不同强化学习算法在仿真环境中的控制效果。\n\n### 没有 Deep-Reinforcement-Learning-Algorithms 时\n- 工程师需从零搭建 PPO、SAC 等主流算法的代码框架，耗时数周且极易引入底层逻辑错误。\n- 缺乏标准的训练日志和基准对比，难以判断模型不收敛是代码问题还是超参数设置不当。\n- 面对 BipedalWalkerHardcore 等高难度环境，无法快速复用已验证的成功策略，导致试错成本极高。\n- 团队成员对蒙特卡洛、时序差分等理论理解不一，代码风格混乱，协作效率低下。\n\n### 使用 Deep-Reinforcement-Learning-Algorithms 后\n- 直接调用项目中现成的 32 个 Jupyter Notebook，涵盖从 DQN 到 TD3 的全套算法，将环境配置时间从数周缩短至几小时。\n- 每个项目均附带详细的训练日志，团队可立即比对自身实验数据，快速定位性能瓶颈并调整策略。\n- 针对双足行走场景，直接参考 BipedalWalker 环境下多种算法的成熟解法，大幅提升了机器人步态的稳定性。\n- 统一的代码矩阵结构（环境 x 模型）让团队成员能清晰理解不同算法在同一任务上的表现差异，加速技术决策。\n\nDeep-Reinforcement-Learning-Algorithms 通过提供开箱即用的算法实现与详尽的训练基准，将研发团队从重复造轮子中解放出来，专注于核心控制策略的优化与创新。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRafael1s_Deep-Reinforcement-Learning-Algorithms_b8f8a164.png","Rafael1s","Rafael Stekolshchik","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FRafael1s_a86ee3c6.jpg",null,"Petah Tikva,  Israel","klivlend1@yahoo.com","https:\u002F\u002Fgithub.com\u002FRafael1s",[81,85],{"name":82,"color":83,"percentage":84},"Jupyter Notebook","#DA5B0B",98.4,{"name":86,"color":87,"percentage":88},"Python","#3572A5",1.6,999,231,"2026-04-06T22:06:58","","未说明",{"notes":95,"python":93,"dependencies":96},"该项目包含多个基于不同环境（如 PyBullet, Unity ML-Agents, Pygame）和算法（DQN, PPO, SAC, TD3 等）的独立 Jupyter Notebook 项目。具体依赖可能因运行的特定子项目而异，部分环境（如 CarRacing, 
BipedalWalker）需要额外的物理引擎或游戏渲染支持。README 中未提供统一的安装脚本或具体的版本约束，建议根据所选的具体子项目查看其内部代码或单独的配置说明。",[97,98,99,100,101,102,103],"gym","pybullet","pygame","numpy","tensorflow","torch","unity-ml-agents",[14,105],"其他",[107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125],"deep-rl-algorithms","github-udacity","dqn-ppo-ddpg","dqn","td3","cartpole","bipedalwalker","deep-reinforcement-learning","sac","carracing","hopperbulletenv","lunarlander","ddpg","ppo","a2c","antbulletenv","soft-actor-critic","halfcheetahbulletenv","walker2dbulletenv","2026-03-27T02:49:30.150509","2026-04-11T18:31:38.657538",[129,134,139,144,148,152],{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},22591,"为什么在解决 CarRacing 任务时移除了失败时的死亡惩罚（-100 奖励）？","这是为了公平性调整。在 OpenAI 的 CarRacing 原始环境中，如果车辆驶出赛道会受到 -100 的惩罚，但当车辆成功跑完整个赛道时，同样会受到 -100 的惩罚。为了抵消成功完成赛道时的不合理惩罚，代码在 Wrapper.step() 中通过 +100 来恢复奖励值。这是一个需要长期调试的参数，移除或调整该惩罚有助于避免智能体性能崩溃。","https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Fissues\u002F1",{"id":135,"question_zh":136,"answer_zh":137,"source_url":138},22592,"如何去除 CarRacing 环境在控制台输出的大量赛道生成日志（如 Track generation...）？","在创建环境时将 verbose 参数设置为 0 即可屏蔽这些日志。具体代码为：env = gym.make('CarRacing-v0', verbose=0)。","https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Fissues\u002F2",{"id":140,"question_zh":141,"answer_zh":142,"source_url":143},22593,"为什么 README 中的评估分数超过 1000，这在实际中可能吗？","这是因为评估代码中未移除奖励塑造（reward shaping）导致的虚高分数。CarRacing 的设计使得原始分数很难超过 900。如果在评估阶段保留了训练时用于抵消失败惩罚的 +100 奖励，会导致结果不准确。正确的做法是区分训练和评估版本的 Wrapper.step()：训练时可以添加奖励塑造（如对驶入草地的惩罚），但在评估时应移除这些额外奖励，以反映真实的环境表现。","https:\u002F\u002Fgithub.com\u002FRafael1s\u002FDeep-Reinforcement-Learning-Algorithms\u002Fissues\u002F3",{"id":145,"question_zh":146,"answer_zh":147,"source_url":143},22594,"在训练 CarRacing 时，应该如何处理车辆驶入绿色草地（非赛道区域）的惩罚？","需要在训练版本的 Wrapper.step() 中添加针对驶入绿色区域的惩罚机制。维护者指出，可能需要将绿色的阈值设定得低于 185，或者将惩罚奖励设定得比 -0.05 更精确，以确保智能体能有效学习避开非赛道区域。",{"id":149,"question_zh":150,"answer_zh":151,"source_url":143},22595,"CarRacing 环境中奖励塑造（Reward Shaping）在训练和评估阶段应如何区别使用？","奖励塑造应仅用于训练阶段。例如，可以在训练时对车辆驶入草地或表现不佳时添加额外惩罚，以加速收敛。但在评估阶段（如验证是否达到平均分>900 的解决标准），必须使用原始环境奖励，不能包含任何人为添加的奖励塑造（如提前结束回合或人为加分），否则评估结果将不公平且无效。",{"id":153,"question_zh":154,"answer_zh":155,"source_url":133},22596,"使用 PPO 算法训练 CarRacing 时遇到性能突然崩溃怎么办？","这通常与奖励函数的设置有关，特别是关于环境结束时的惩罚处理。建议参考相关实现（如 pytorch_car_caring），检查是否在环境未完成（失败）时给予了过大的负奖励。尝试移除或调整 Wrapper 中对失败状态的 -100 惩罚偏移量，这可能是一个需要精细调整的超参数。",[]]