[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-sweetice--Deep-reinforcement-learning-with-pytorch":3,"tool-sweetice--Deep-reinforcement-learning-with-pytorch":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",153609,2,"2026-04-13T11:34:59",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":74,"owner_avatar_url":75,"owner_bio":76,"owner_company":77,"owner_location":78,"owner_email":79,"owner_twitter":77,"owner_website":80,"owner_url":81,"languages":82,"stars":87,"forks":88,"last_commit_at":89,"license":90,"difficulty_score":10,"env_os":91,"env_gpu":91,"env_ram":91,"env_deps":92,"category_tags":100,"github_topics":101,"view_count":32,"oss_zip_url":77,"oss_zip_packed_at":77,"status":17,"created_at":120,"updated_at":121,"faqs":122,"releases":148},7134,"sweetice\u002FDeep-reinforcement-learning-with-pytorch","Deep-reinforcement-learning-with-pytorch","PyTorch implementation of DQN, AC,  ACER, A2C, A3C, PG,  DDPG, TRPO, PPO, SAC, TD3 and ....","Deep-reinforcement-learning-with-pytorch 是一个基于 PyTorch 框架的开源项目，致力于提供经典及前沿深度强化学习算法的清晰代码实现。它涵盖了从基础的 DQN、策略梯度（PG），到先进的 A3C、PPO、SAC、TD3 等多种主流算法，旨在帮助学习者深入理解算法原理与工程细节。\n\n该项目主要解决了强化学习领域代码复现难、参考实现不透明的问题。通过提供结构规范、注释清晰的源码，它让复杂的数学公式转化为可运行的程序，降低了入门门槛，并方便开发者对比不同算法在如 CartPole、MountainCar 及双足行走等标准环境中的表现。此外，项目还整合了 TensorBoard 
可视化支持，便于用户直观监控训练过程中的损失变化与智能体行为。\n\nDeep-reinforcement-learning-with-pytorch 特别适合人工智能研究人员、算法工程师以及高校学生使用。对于希望从零开始掌握强化学习，或需要快速验证新想法的开发者而言，这是一个极佳的学习库和实验基准。虽然项目目前处于活跃开发阶段，可能伴随版本迭代带来的调整，但其持续更新的特性确保","Deep-reinforcement-learning-with-pytorch 是一个基于 PyTorch 框架的开源项目，致力于提供经典及前沿深度强化学习算法的清晰代码实现。它涵盖了从基础的 DQN、策略梯度（PG），到先进的 A3C、PPO、SAC、TD3 等多种主流算法，旨在帮助学习者深入理解算法原理与工程细节。\n\n该项目主要解决了强化学习领域代码复现难、参考实现不透明的问题。通过提供结构规范、注释清晰的源码，它让复杂的数学公式转化为可运行的程序，降低了入门门槛，并方便开发者对比不同算法在如 CartPole、MountainCar 及双足行走等标准环境中的表现。此外，项目还整合了 TensorBoard 可视化支持，便于用户直观监控训练过程中的损失变化与智能体行为。\n\nDeep-reinforcement-learning-with-pytorch 特别适合人工智能研究人员、算法工程师以及高校学生使用。对于希望从零开始掌握强化学习，或需要快速验证新想法的开发者而言，这是一个极佳的学习库和实验基准。虽然项目目前处于活跃开发阶段，可能伴随版本迭代带来的调整，但其持续更新的特性确保了技术内容的时效性，是探索智能决策系统不可多得的实用资源。","**Status:** Active (under active development, breaking changes may occur)\n\nThis repository will implement the classic and state-of-the-art deep reinforcement learning algorithms. The aim of this repository is to provide clear pytorch code for people to learn the deep reinforcement learning algorithm. 
\n\nIn the future, more state-of-the-art algorithms will be added and the existing code will continue to be maintained.\n\n![demo](https:\u002F\u002Fgithub.com\u002Fsweetice\u002FDeep-reinforcement-learning-with-pytorch\u002Fblob\u002Fmaster\u002Ffigures\u002Fgrid.gif)\n\n## Requirements\n- python \u003C=3.6 \n- tensorboardX\n- gym >= 0.10\n- pytorch >= 0.4\n\n**Note that tensorflow 1.12 does not support python3.7** \n\n## Installation\n\n```\npip install -r requirements.txt\n```\n\nIf that fails, install the dependencies one by one:  \n\n- Install gym\n\n```\npip install gym\n```\n\n\n\n- Install pytorch\n```bash\nplease go to the official website to install it: https:\u002F\u002Fpytorch.org\u002F\n\nAn Anaconda virtual environment is recommended for managing your packages\n\n```\n\n- Install tensorboardX\n```bash\npip install tensorboardX\npip install tensorflow==1.12\n```\n\n- Test \n```\ncd Char10\\ TD3\u002F\npython TD3_BipedalWalker-v2.py --mode test\n```\n\nIf the installation succeeded, you should see a BipedalWalker.\n\nBipedalWalker: \n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsweetice_Deep-reinforcement-learning-with-pytorch_readme_dc616bb70eff.png)\n\n- Install openai-baselines (**Optional**)\n\n```bash\n# clone the openai baselines\ngit clone https:\u002F\u002Fgithub.com\u002Fopenai\u002Fbaselines.git\ncd baselines\npip install -e .\n\n```\n\n## DQN\n\nHere I uploaded two DQN models, trained on CartPole-v0 and MountainCar-v0.\n\n### Tips for MountainCar-v0\n\nThis is a sparse binary reward task. Only when the car reaches the top of the mountain is there a non-zero reward. In general it may take 1e5 steps under a stochastic policy. You can add a reward term, for example one that is positively related to the current position of the car. 
Of course, a more advanced approach is inverse reinforcement learning.\n\n![value_loss](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsweetice_Deep-reinforcement-learning-with-pytorch_readme_026a0c89e768.jpg)   \n![step](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsweetice_Deep-reinforcement-learning-with-pytorch_readme_f9501c516bc2.jpg) \nThis is the value loss for DQN. The loss increased to 1e13, yet the network still works well. As training goes on, target_net and act_net diverge considerably, so the computed loss grows large. The earlier loss was small because the reward was very sparse, which resulted in only small updates to the two networks.\n\n### Papers Related to the DQN\n\n\n  1. Playing Atari with Deep Reinforcement Learning [[arxiv]](https:\u002F\u002Fwww.cs.toronto.edu\u002F~vmnih\u002Fdocs\u002Fdqn.pdf) [[code]](https:\u002F\u002Fgithub.com\u002Fhiggsfield\u002FRL-Adventure\u002Fblob\u002Fmaster\u002F1.dqn.ipynb)\n  2. Deep Reinforcement Learning with Double Q-learning [[arxiv]](https:\u002F\u002Farxiv.org\u002Fabs\u002F1509.06461) [[code]](https:\u002F\u002Fgithub.com\u002Fhiggsfield\u002FRL-Adventure\u002Fblob\u002Fmaster\u002F2.double%20dqn.ipynb)\n  3. Dueling Network Architectures for Deep Reinforcement Learning [[arxiv]](https:\u002F\u002Farxiv.org\u002Fabs\u002F1511.06581) [[code]](https:\u002F\u002Fgithub.com\u002Fhiggsfield\u002FRL-Adventure\u002Fblob\u002Fmaster\u002F3.dueling%20dqn.ipynb)\n  4. Prioritized Experience Replay [[arxiv]](https:\u002F\u002Farxiv.org\u002Fabs\u002F1511.05952) [[code]](https:\u002F\u002Fgithub.com\u002Fhiggsfield\u002FRL-Adventure\u002Fblob\u002Fmaster\u002F4.prioritized%20dqn.ipynb)\n  5. Noisy Networks for Exploration [[arxiv]](https:\u002F\u002Farxiv.org\u002Fabs\u002F1706.10295) [[code]](https:\u002F\u002Fgithub.com\u002Fhiggsfield\u002FRL-Adventure\u002Fblob\u002Fmaster\u002F5.noisy%20dqn.ipynb)\n  6. 
A Distributional Perspective on Reinforcement Learning [[arxiv]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1707.06887.pdf) [[code]](https:\u002F\u002Fgithub.com\u002Fhiggsfield\u002FRL-Adventure\u002Fblob\u002Fmaster\u002F6.categorical%20dqn.ipynb)\n  7. Rainbow: Combining Improvements in Deep Reinforcement Learning [[arxiv]](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.02298) [[code]](https:\u002F\u002Fgithub.com\u002Fhiggsfield\u002FRL-Adventure\u002Fblob\u002Fmaster\u002F7.rainbow%20dqn.ipynb)\n  8. Distributional Reinforcement Learning with Quantile Regression [[arxiv]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1710.10044.pdf) [[code]](https:\u002F\u002Fgithub.com\u002Fhiggsfield\u002FRL-Adventure\u002Fblob\u002Fmaster\u002F8.quantile%20regression%20dqn.ipynb)\n  9. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation  [[arxiv]](https:\u002F\u002Farxiv.org\u002Fabs\u002F1604.06057) [[code]](https:\u002F\u002Fgithub.com\u002Fhiggsfield\u002FRL-Adventure\u002Fblob\u002Fmaster\u002F9.hierarchical%20dqn.ipynb)\n  10. 
Neural Episodic Control [[arxiv]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1703.01988.pdf) [[code]](#)\n\n\n## Policy Gradient\n\n\nUse the following command to run a saved model\n\n\n```\npython Run_Model.py\n```\n\n\nUse the following command to train the model\n\n\n```\npython pytorch_MountainCar-v0.py\n```\n\n\n\n> policyNet.pkl\n\nThis is a model that I have trained.\n\n\n## Actor-Critic\n\nThis is an algorithmic framework, and the classic REINFORCE method is stored under Actor-Critic.\n \n## DDPG  \nEpisode reward in Pendulum-v0:  \n\n![ep_r](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsweetice_Deep-reinforcement-learning-with-pytorch_readme_b9b6d8a6bfee.jpg)  \n\n\n## PPO  \n\n- Original paper: https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.06347\n- OpenAI Baselines blog post: https:\u002F\u002Fblog.openai.com\u002Fopenai-baselines-ppo\u002F\n\n\n## A2C\n\nAdvantage Policy Gradient. A paper in 2017 pointed out that the difference in performance between A2C and A3C is not obvious.\n\nThe Asynchronous Advantage Actor Critic method (A3C) has been very influential since the paper was published. 
The algorithm combines a few key ideas:\n\n- An updating scheme that operates on fixed-length segments of experience (say, 20 timesteps) and uses these segments to compute estimators of the returns and advantage function.\n- Architectures that share layers between the policy and value function.\n- Asynchronous updates.\n\n## A3C\n\nOriginal paper: https:\u002F\u002Farxiv.org\u002Fabs\u002F1602.01783\n\n## SAC\n\n**This is not the paper authors' implementation!!!**\n\nEpisode reward in Pendulum-v0:\n\n![ep_r](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsweetice_Deep-reinforcement-learning-with-pytorch_readme_e0af472e4a00.png)\n\n## TD3\n\n**This is not the paper authors' implementation!!!**  \n\nEpisode reward in Pendulum-v0:  \n\n![ep_r](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsweetice_Deep-reinforcement-learning-with-pytorch_readme_d608d19ef0f5.png)  \n\nEpisode reward in BipedalWalker-v2:  \n![ep_r](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsweetice_Deep-reinforcement-learning-with-pytorch_readme_651d842abd1b.png)  \n\nTo test your model:\n\n```\npython TD3_BipedalWalker-v2.py --mode test\n```\n\n## Papers Related to the Deep Reinforcement Learning\n[01] [A Brief Survey of Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1708.05866)  \n[02] [The Beta Policy for Continuous Control Reinforcement Learning](https:\u002F\u002Fwww.ri.cmu.edu\u002Fwp-content\u002Fuploads\u002F2017\u002F06\u002Fthesis-Chou.pdf)  \n[03] [Playing Atari with Deep Reinforcement Learning](https:\u002F\u002Fwww.cs.toronto.edu\u002F~vmnih\u002Fdocs\u002Fdqn.pdf)  \n[04] [Deep Reinforcement Learning with Double Q-learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1509.06461)  \n[05] [Dueling Network Architectures for Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1511.06581)  \n[06] [Continuous control with deep reinforcement 
learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1509.02971)  \n[07] [Continuous Deep Q-Learning with Model-based Acceleration](https:\u002F\u002Farxiv.org\u002Fabs\u002F1603.00748)  \n[08] [Asynchronous Methods for Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1602.01783)  \n[09] [Trust Region Policy Optimization](https:\u002F\u002Farxiv.org\u002Fabs\u002F1502.05477)  \n[10] [Proximal Policy Optimization Algorithms](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.06347)  \n[11] [Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation](https:\u002F\u002Farxiv.org\u002Fabs\u002F1708.05144)  \n[12] [High-Dimensional Continuous Control Using Generalized Advantage Estimation](https:\u002F\u002Farxiv.org\u002Fabs\u002F1506.02438)  \n[13] [Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor](https:\u002F\u002Farxiv.org\u002Fabs\u002F1801.01290)  \n[14] [Addressing Function Approximation Error in Actor-Critic Methods](https:\u002F\u002Farxiv.org\u002Fabs\u002F1802.09477)  \n\n## TO DO\n- [x] DDPG\n- [x] SAC\n- [x] TD3\n\n\n# Best RL courses\n- [OpenAI's spinning up](https:\u002F\u002Fspinningup.openai.com\u002F)  \n- [David Silver's course](http:\u002F\u002Fwww0.cs.ucl.ac.uk\u002Fstaff\u002Fd.silver\u002Fweb\u002FTeaching.html)  \n- [Berkeley deep RL](http:\u002F\u002Frll.berkeley.edu\u002Fdeeprlcourse\u002F)  \n- [Practical RL](https:\u002F\u002Fgithub.com\u002Fyandexdataschool\u002FPractical_RL)  \n- [Deep Reinforcement Learning by Hung-yi Lee](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLJV_el3uVTsODxQFgzMzPLa16h6B8kWM_)   \n","**状态:** 活跃（正在积极开发中，可能会有破坏性变更）\n\n本仓库将实现经典及最前沿的深度强化学习算法。其目标是提供清晰易懂的 PyTorch 代码，供开发者学习深度强化学习算法。\n\n未来，我们将不断添加更多最先进的算法，并持续维护现有代码。\n\n![demo](https:\u002F\u002Fgithub.com\u002Fsweetice\u002FDeep-reinforcement-learning-with-pytorch\u002Fblob\u002Fmaster\u002Ffigures\u002Fgrid.gif)\n\n## 需求\n- Python \u003C=3.6 \n- 
tensorboardX\n- gym >= 0.10\n- PyTorch >= 0.4\n\n**请注意，TensorFlow 不支持 Python 3.7。**\n\n## 安装\n\n```\npip install -r requirements.txt\n```\n\n如果安装失败：\n\n- 安装 gym\n\n```\npip install gym\n```\n\n\n\n- 安装 PyTorch\n```bash\n请前往官方网址进行安装：https:\u002F\u002Fpytorch.org\u002F\n\n建议使用 Anaconda 虚拟环境来管理你的包。\n```\n\n- 安装 tensorboardX\n```bash\npip install tensorboardX\npip install tensorflow==1.12\n```\n\n- 测试\n```\ncd Char10\\ TD3\u002F\npython TD3_BipedalWalker-v2.py --mode test\n```\n\n如果安装成功，你将看到一个双足行走的机器人。\n\nBipedalWalker:\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsweetice_Deep-reinforcement-learning-with-pytorch_readme_dc616bb70eff.png)\n\n- 4. 安装 openai-baselines（可选）\n\n```bash\n# 克隆 openai baselines\ngit clone https:\u002F\u002Fgithub.com\u002Fopenai\u002Fbaselines.git\ncd baselines\npip install -e .\n```\n\n## DQN\n\n在这里，我上传了两个 DQN 模型，分别用于训练 CartPole-v0 和 MountainCar-v0。\n\n### MountainCar-v0 的提示\n\n这是一个稀疏的二元奖励任务。只有当车到达山顶时，才会获得非零奖励。通常情况下，在随机策略下可能需要 1e5 步才能完成。你可以添加一个奖励项，例如使奖励与当前车辆位置成正比。当然，还有更高级的方法，比如逆向强化学习。\n\n![value_loss](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsweetice_Deep-reinforcement-learning-with-pytorch_readme_026a0c89e768.jpg)   \n![step](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsweetice_Deep-reinforcement-learning-with-pytorch_readme_f9501c516bc2.jpg) \n这是 DQN 的价值损失。我们可以看到损失增加到了 1e13，然而网络仍然工作得很好。这是因为随着训练的进行，target_net 和 act_net 之间的差异越来越大，导致计算出的损失累积得非常大。之前的损失较小，是因为奖励非常稀疏，从而导致两个网络的更新幅度也很小。\n\n### 与 DQN 相关的论文\n\n\n  1. 使用深度强化学习玩 Atari 游戏 [[arxiv]](https:\u002F\u002Fwww.cs.toronto.edu\u002F~vmnih\u002Fdocs\u002Fdqn.pdf) [[代码]](https:\u002F\u002Fgithub.com\u002Fhiggsfield\u002FRL-Adventure\u002Fblob\u002Fmaster\u002F1.dqn.ipynb)\n  2. 带有双重 Q 学习的深度强化学习 [[arxiv]](https:\u002F\u002Farxiv.org\u002Fabs\u002F1509.06461) [[代码]](https:\u002F\u002Fgithub.com\u002Fhiggsfield\u002FRL-Adventure\u002Fblob\u002Fmaster\u002F2.double%20dqn.ipynb)\n  3. 
用于深度强化学习的对决网络架构 [[arxiv]](https:\u002F\u002Farxiv.org\u002Fabs\u002F1511.06581) [[代码]](https:\u002F\u002Fgithub.com\u002Fhiggsfield\u002FRL-Adventure\u002Fblob\u002Fmaster\u002F3.dueling%20dqn.ipynb)\n  4. 优先级经验回放 [[arxiv]](https:\u002F\u002Farxiv.org\u002Fabs\u002F1511.05952) [[代码]](https:\u002F\u002Fgithub.com\u002Fhiggsfield\u002FRL-Adventure\u002Fblob\u002Fmaster\u002F4.prioritized%20dqn.ipynb)\n  5. 用于探索的噪声网络 [[arxiv]](https:\u002F\u002Farxiv.org\u002Fabs\u002F1706.10295) [[代码]](https:\u002F\u002Fgithub.com\u002Fhiggsfield\u002FRL-Adventure\u002Fblob\u002Fmaster\u002F5.noisy%20dqn.ipynb)\n  6. 强化学习的分布视角 [[arxiv]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1707.06887.pdf) [[代码]](https:\u002F\u002Fgithub.com\u002Fhiggsfield\u002FRL-Adventure\u002Fblob\u002Fmaster\u002F6.categorical%20dqn.ipynb)\n  7. 彩虹：结合深度强化学习的改进 [[arxiv]](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.02298) [[代码]](https:\u002F\u002Fgithub.com\u002Fhiggsfield\u002FRL-Adventure\u002Fblob\u002Fmaster\u002F7.rainbow%20dqn.ipynb)\n  8. 基于分位数回归的分布强化学习 [[arxiv]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1710.10044.pdf) [[代码]](https:\u002F\u002Fgithub.com\u002Fhiggsfield\u002FRL-Adventure\u002Fblob\u002Fmaster\u002F8.quantile%20regression%20dqn.ipynb)\n  9. 分层深度强化学习：整合时间抽象与内在动机 [[arxiv]](https:\u002F\u002Farxiv.org\u002Fabs\u002F1604.06057) [[代码]](https:\u002F\u002Fgithub.com\u002Fhiggsfield\u002FRL-Adventure\u002Fblob\u002Fmaster\u002F9.hierarchical%20dqn.ipynb)\n  10. 
神经情景控制 [[arxiv]](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1703.01988.pdf) [[代码]](#)\n\n\n## 策略梯度\n\n\n使用以下命令运行已保存的模型\n\n\n```\npython Run_Model.py\n```\n\n\n使用以下命令训练模型\n\n\n```\npython pytorch_MountainCar-v0.py\n```\n\n\n\n> policyNet.pkl\n\n这是我训练好的模型。\n\n\n## 演员-评论家\n\n这是一种算法框架，经典的 REINFORCE 方法就属于演员-评论家方法。\n\n## DDPG  \n在 Pendulum-v0 中的每集奖励：\n\n![ep_r](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsweetice_Deep-reinforcement-learning-with-pytorch_readme_b9b6d8a6bfee.jpg)  \n\n\n## PPO  \n\n- 原始论文：https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.06347\n- OpenAI Baselines 博客文章：https:\u002F\u002Fblog.openai.com\u002Fopenai-baselines-ppo\u002F\n\n\n## A2C\n\n优势策略梯度，一篇 2017 年的论文指出，A2C 和 A3C 在性能上的差异并不明显。\n\n异步优势演员评论家方法（A3C）自论文发表以来一直具有很大的影响力。该算法结合了几种关键思想：\n\n- 一种基于固定长度经验片段（例如 20 个时间步）的更新机制，利用这些片段来计算回报和优势函数的估计值。\n- 策略网络和价值网络共享层的架构。\n- 异步更新。\n\n## A3C\n\n原始论文：https:\u002F\u002Farxiv.org\u002Fabs\u002F1602.01783\n\n## SAC\n\n**这并不是论文作者的实现！！！**\n\n在 Pendulum-v0 中的每集奖励：\n\n![ep_r](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsweetice_Deep-reinforcement-learning-with-pytorch_readme_e0af472e4a00.png)\n\n## TD3\n\n**这并不是论文作者的实现！！！**  \n\n在 Pendulum-v0 中的每集奖励：\n\n![ep_r](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsweetice_Deep-reinforcement-learning-with-pytorch_readme_d608d19ef0f5.png)  \n\n在 BipedalWalker-v2 中的每集奖励：\n\n![ep_r](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsweetice_Deep-reinforcement-learning-with-pytorch_readme_651d842abd1b.png)  \n\n如果你想测试你的模型：\n\n```\npython TD3_BipedalWalker-v2.py --mode test\n```\n\n## 与深度强化学习相关的论文\n[01] [深度强化学习简述](https:\u002F\u002Farxiv.org\u002Fabs\u002F1708.05866)  \n[02] [连续控制强化学习中的Beta策略](https:\u002F\u002Fwww.ri.cmu.edu\u002Fwp-content\u002Fuploads\u002F2017\u002F06\u002Fthesis-Chou.pdf)  \n[03] [使用深度强化学习玩Atari游戏](https:\u002F\u002Fwww.cs.toronto.edu\u002F~vmnih\u002Fdocs\u002Fdqn.pdf)  \n[04] [基于双重Q-learning的深度强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1509.06461)  \n[05] 
[用于深度强化学习的决斗网络架构](https:\u002F\u002Farxiv.org\u002Fabs\u002F1511.06581)  \n[06] [深度强化学习中的连续控制](https:\u002F\u002Farxiv.org\u002Fabs\u002F1509.02971)  \n[07] [基于模型加速的连续深度Q-learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1603.00748)  \n[08] [深度强化学习的异步方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F1602.01783)  \n[09] [信任域策略优化](https:\u002F\u002Farxiv.org\u002Fabs\u002F1502.05477)  \n[10] [近端策略优化算法](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.06347)  \n[11] [利用克罗内克分解近似法的可扩展信任域深度强化学习方法](https:\u002F\u002Farxiv.org\u002Fabs\u002F1708.05144)  \n[12] [使用广义优势估计进行高维连续控制](https:\u002F\u002Farxiv.org\u002Fabs\u002F1506.02438)  \n[13] [软演员-评论家：带有随机演员的离策略最大熵深度强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F1801.01290)  \n[14] [解决演员-评论家方法中的函数逼近误差](https:\u002F\u002Farxiv.org\u002Fabs\u002F1802.09477)  \n\n## 待办事项\n- [x] DDPG\n- [x] SAC\n- [x] TD3\n\n\n# 最佳强化学习课程\n- [OpenAI的Spinning Up](https:\u002F\u002Fspinningup.openai.com\u002F)  \n- [戴维·西尔弗的课程](http:\u002F\u002Fwww0.cs.ucl.ac.uk\u002Fstaff\u002Fd.silver\u002Fweb\u002FTeaching.html)  \n- [伯克利深度强化学习课程](http:\u002F\u002Frll.berkeley.edu\u002Fdeeprlcourse\u002F)  \n- [实用强化学习](https:\u002F\u002Fgithub.com\u002Fyandexdataschool\u002FPractical_RL)  \n- [李宏毅的深度强化学习课程](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLJV_el3uVTsODxQFgzMzPLa16h6B8kWM_)","# Deep-reinforcement-learning-with-pytorch 快速上手指南\n\n本项目旨在提供清晰、易读的 PyTorch 代码，帮助开发者学习经典及前沿的深度强化学习（Deep RL）算法。\n\n## 1. 环境准备\n\n在开始之前，请确保您的系统满足以下要求：\n\n*   **Python 版本**：\u003C= 3.6（注意：TensorFlow 1.x 不支持 Python 3.7+，项目依赖中包含 TensorFlow）\n*   **核心框架**：PyTorch >= 0.4\n*   **仿真环境**：Gym >= 0.10\n*   **可视化工具**：tensorboardX\n\n**推荐方案**：建议使用 **Anaconda** 创建虚拟环境来管理依赖包，以避免版本冲突。\n\n## 2. 安装步骤\n\n### 第一步：安装基础依赖\n克隆项目后，尝试一键安装 requirements：\n```bash\npip install -r requirements.txt\n```\n\n### 第二步：处理安装失败情况\n如果上述命令失败，请按顺序手动安装以下组件：\n\n1.  **安装 Gym**\n    ```bash\n    pip install gym\n    ```\n    *(国内加速建议：`pip install gym -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple`)*\n\n2.  
**安装 PyTorch**\n    请访问官网获取适合您环境的安装命令：https:\u002F\u002Fpytorch.org\u002F\n    *推荐使用 Anaconda 环境进行安装。*\n\n3.  **安装 TensorboardX 及兼容版 TensorFlow**\n    ```bash\n    pip install tensorboardX\n    pip install tensorflow==1.12\n    ```\n    *(国内加速建议：添加 `-i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple` 参数)*\n\n4.  **(可选) 安装 OpenAI Baselines**\n    如果需要对比或参考官方基线实现：\n    ```bash\n    git clone https:\u002F\u002Fgithub.com\u002Fopenai\u002Fbaselines.git\n    cd baselines\n    pip install -e .\n    ```\n    *(国内加速建议：使用 `git clone https:\u002F\u002Fgitee.com\u002Fmirror\u002Fopenai-baselines.git` 或其他镜像源)*\n\n## 3. 基本使用\n\n项目包含多种算法（如 DQN, DDPG, PPO, A3C, SAC, TD3 等），以下是验证安装和运行测试的最简示例。\n\n### 验证安装 (TD3 算法测试)\n运行以下命令测试双足行走机器人（BipedalWalker）环境。如果安装成功，您将看到机器人运动的演示画面。\n\n```bash\ncd Char10\\ TD3\u002F\npython TD3_BipedalWalker-v2.py --mode test\n```\n\n### 训练模型示例\n\n*   **策略梯度 (Policy Gradient) - 训练山地车任务**\n    ```bash\n    python pytorch_MountainCar-v0.py\n    ```\n\n*   **策略梯度 (Policy Gradient) - 运行已保存的模型**\n    ```bash\n    python Run_Model.py\n    ```\n\n*   **DQN 算法**\n    项目 `Char01 DQN` 目录下提供了针对 `CartPole-v0` 和 `MountainCar-v0` 的训练脚本，直接运行对应目录下的主程序即可。\n\n> **提示**：对于稀疏奖励任务（如 MountainCar-v0），随机策略可能需要 1e5 步才能收敛。若训练困难，可尝试修改奖励函数（使其与位置正相关）或参考项目中提供的预训练模型 `policyNet.pkl`。","某机器人初创公司的算法工程师正在为双足行走机器人开发平衡控制系统，需要快速验证多种深度强化学习算法在复杂物理环境中的表现。\n\n### 没有 Deep-reinforcement-learning-with-pytorch 时\n- **重复造轮子耗时严重**：工程师需从零手写 DQN、PPO、TD3 等算法的基础架构，仅搭建代码框架就耗费数周时间。\n- **调试陷阱难以规避**：面对稀疏奖励任务（如 MountainCar），缺乏现成的损失函数监控与网络更新参考，难以判断是策略失效还是代码逻辑错误。\n- **算法对比成本高昂**：想测试不同算法（如从 DDPG 切换到 SAC）的效果，必须重构大量代码，无法在同一框架下公平对比性能。\n- **环境配置繁琐易错**：手动整合 PyTorch、Gym 和 TensorBoard 时常遇到版本冲突（如 Python 3.7 兼容性问题），导致环境迟迟无法跑通。\n\n### 使用 Deep-reinforcement-learning-with-pytorch 后\n- **开箱即用加速研发**：直接调用库中已实现的 TD3 或 PPO 代码，将原本数周的搭建工作缩短至几小时，迅速进入模型调优阶段。\n- **直观诊断训练问题**：参考库中提供的价值损失（Value Loss）变化图和稀疏奖励处理技巧，快速定位网络不收敛的原因并调整奖励函数。\n- **灵活切换算法验证**：只需修改少量配置即可在 DQN、A3C、SAC 等十几种算法间无缝切换，高效筛选出最适合双足机器人的控制策略。\n- **标准化环境部署**：依托清晰的依赖说明和 
Anaconda 虚拟环境指南，一次性解决版本兼容问题，确保团队开发环境高度一致。\n\nDeep-reinforcement-learning-with-pytorch 通过提供清晰、全栈的算法实现，将研究人员从繁琐的代码工程中解放出来，使其能专注于策略优化与场景落地。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsweetice_Deep-reinforcement-learning-with-pytorch_6f9520b6.png","sweetice","Johnny He","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fsweetice_281e2af9.png","PhD @ Ruhr University Bochum",null,"Tuebingen, Germany","johnyhe1997@gmail.com","sweetice.github.io","https:\u002F\u002Fgithub.com\u002Fsweetice",[83],{"name":84,"color":85,"percentage":86},"Python","#3572A5",100,4608,898,"2026-04-11T17:16:33","MIT","未说明",{"notes":93,"python":94,"dependencies":95},"注意：README 明确指出 TensorFlow 不支持 Python 3.7，因此必须使用 Python 3.6 或更低版本。建议使用 Anaconda 虚拟环境管理包。可选安装 openai-baselines。该项目处于活跃开发中，可能会发生破坏性变更。","\u003C=3.6",[96,97,98,99],"pytorch>=0.4","gym>=0.10","tensorboardX","tensorflow==1.12",[14],[102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119],"policy-gradient","pytorch","actor-critic-algorithm","alphago","deep-reinforcement-learning","a2c","dqn","sarsa","ppo","a3c","resnet","algorithm","deep-learning","reinforce","actor-critic","sac","td3","trpo","2026-03-27T02:49:30.150509","2026-04-14T00:12:35.392103",[123,128,133,138,143],{"id":124,"question_zh":125,"answer_zh":126,"source_url":127},32253,"DDPG 代码中缺失的参数 `args.max_length_of_trajectory` 通常设置为多少？","该参数指的是最大步长（maximum step length），通常设置为 199。","https:\u002F\u002Fgithub.com\u002Fsweetice\u002FDeep-reinforcement-learning-with-pytorch\u002Fissues\u002F32",{"id":129,"question_zh":130,"answer_zh":131,"source_url":132},32254,"TD3 算法中的策略延迟（policy delay）判断逻辑是否正确？","原代码中检查 `num_iterations % policy_decay` 是错误的。正确的做法应该是检查当前迭代步数 `i` 对延迟参数的取模，即修改为：`if i % args.policy_delay == 0:`。此问题已在后续版本中修复。","https:\u002F\u002Fgithub.com\u002Fsweetice\u002FDeep-reinforcement-learning-with-pytorch\u002Fissues\u002F8",{"id":134,"question_zh":135,"answer_zh":136,"source_url":137},32255,"在 DDPG 代码中，外层循环变量 `i` 和内层循环变量 `t` 分别代表什么含义？","变量 `i` 
代表回合数（episode number），变量 `t` 代表单个轨迹的长度（即一步一步的步进）。当 `t > args.max_episode` 时，代理会停止 rollout。虽然原代码注释可能引起误解，但逻辑上 `i` 用于计数回合，该混淆问题已在新版本中修复。","https:\u002F\u002Fgithub.com\u002Fsweetice\u002FDeep-reinforcement-learning-with-pytorch\u002Fissues\u002F5",{"id":139,"question_zh":140,"answer_zh":141,"source_url":142},32256,"DDPG 实现中是否缺少了动作探索噪声（exploration noise）？","是的，原始脚本遗漏了 DDPG 论文中提到的动作噪声添加步骤。修复方法是在生成动作后添加高斯噪声并裁剪到动作空间范围内。具体代码如下：\n`action = (action + np.random.normal(0, args.exploration_noise, size=env.action_space.shape[0])).clip(env.action_space.low, env.action_space.high)`","https:\u002F\u002Fgithub.com\u002Fsweetice\u002FDeep-reinforcement-learning-with-pytorch\u002Fissues\u002F3",{"id":144,"question_zh":145,"answer_zh":146,"source_url":147},32257,"运行 Char05 DDPG 时提示找不到 `utils` 脚本或缺少 TD3\u002FOurDDPG 模块怎么办？","这是因为上传的文件有误，导致部分依赖脚本缺失。该问题已被修复，请拉取最新版本的代码即可解决。该文件夹原本应包含完整实现，而非仅部分片段。","https:\u002F\u002Fgithub.com\u002Fsweetice\u002FDeep-reinforcement-learning-with-pytorch\u002Fissues\u002F2",[]]