[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-quantumiracle--Popular-RL-Algorithms":3,"tool-quantumiracle--Popular-RL-Algorithms":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",153609,2,"2026-04-13T11:34:59",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 
都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":77,"owner_email":78,"owner_twitter":76,"owner_website":79,"owner_url":80,"languages":81,"stars":90,"forks":91,"last_commit_at":92,"license":93,"difficulty_score":10,"env_os":94,"env_gpu":95,"env_ram":95,"env_deps":96,"category_tags":102,"github_topics":104,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":108,"updated_at":109,"faqs":110,"releases":146},7218,"quantumiracle\u002FPopular-RL-Algorithms","Popular-RL-Algorithms","PyTorch implementation of Soft Actor-Critic (SAC), Twin Delayed DDPG (TD3), Actor-Critic (AC\u002FA2C), Proximal Policy Optimization (PPO), QT-Opt, PointNet..","Popular-RL-Algorithms 是一个基于 PyTorch 构建的无模型强化学习算法代码合集，涵盖了从基础的 Q-learning、SARSA 到前沿的 SAC、TD3、PPO 及多智能体算法 QMIX 等十余种主流方法。该项目旨在为开发者和研究人员提供一套可直接参考与对比的实现方案，解决了在学习复杂强化学习理论时缺乏高质量、多样化代码示例的痛点。\n\n与其他追求高度封装的官方库不同，Popular-RL-Algorithms 的独特之处在于其“研究笔记”式的定位。作者特意保留了同一算法的多个实现版本（例如包含两种不同架构的 SAC 实现，以及支持优先经验回放的离散版 SAC），方便用户直观对比不同技术路线的差异与优劣。虽然代码结构未做过度美化，但这种原汁原味的呈现方式非常适合需要深入理解算法底层逻辑、进行二次开发或学术研究的工程师与学者。如果你希望透过代码细节掌握强化学习的核心机制，或在 TensorFlow 与 PyTorch 之间进行技术迁移参考，这个项目将是一个极具价值的学习资源。","# Popular Model-free Reinforcement Learning Algorithms  \n\u003C!-- [![Tweet](https:\u002F\u002Fimg.shields.io\u002Ftwitter\u002Furl\u002Fhttp\u002Fshields.io.svg?style=social)](https:\u002F\u002Ftwitter.com\u002Fintent\u002Ftweet?text=State-of-the-art-Model-free-Reinforcement-Learning-Algorithms%20&url=hhttps:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FSTOA-RL-Algorithms&hashtags=RL) -->\n\n\n**PyTorch** and **Tensorflow 2.0** implementation of state-of-the-art model-free reinforcement learning algorithms on both Openai gym environments and a self-implemented Reacher environment. 
\n\nAlgorithms include:\n* **Q-learning**;\n* **SARSA**;\n* **Monte-Carlo Regression**;\n* **Actor-Critic (AC\u002FA2C)**;\n* **Soft Actor-Critic (SAC)**;\n* **Deep Deterministic Policy Gradient (DDPG)**;\n* **Twin Delayed DDPG (TD3)**; \n* **Proximal Policy Optimization (PPO)**;\n* **QT-Opt (including Cross-entropy (CE) Method)**;\n* **PointNet**;\n* **Transporter**;\n* **Recurrent Policy Gradient**;\n* **Soft Decision Tree**;\n* **Probabilistic Mixture-of-Experts**;\n* **QMIX**\n* etc.\n\nPlease note that this repo is more of a personal collection of algorithms I implemented and tested during my research and study period, rather than an official open-source library\u002Fpackage for usage. However, I think it could be helpful to share it with others and I'm expecting useful discussions on my implementations. But I didn't spend much time on cleaning or structuring the code. As you may notice that there may be several versions of implementation for each algorithm, I intentionally show all of them here for you to refer and compare. Also, this repo contains only **PyTorch** Implementation.\n\nFor official libraries of RL algorithms, I provided the following two with **TensorFlow 2.0 + TensorLayer 2.0**:\n\n* [**RL Tutorial**](https:\u002F\u002Fgithub.com\u002Ftensorlayer\u002Ftensorlayer\u002Ftree\u002Freinforcement-learning\u002Fexamples\u002Freinforcement_learning) (*Status: Released*) contains RL algorithms implementation as tutorials with simple structures. \n\n* [**RLzoo**](https:\u002F\u002Fgithub.com\u002Ftensorlayer\u002FRLzoo) (*Status: Released*) is a baseline implementation with high-level API supporting a variety of popular environments, with more hierarchical structures for simple usage.\n\nFor multi-agent RL, a new repository is built (**PyTorch**):\n* [**MARS**](https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FMARS) (*Status: WIP*) is a library for multi-agent RL on games, like PettingZoo Atari, SlimeVolleyBall, etc.\n\nSince Tensorflow 2.0 has already incorporated the dynamic graph construction instead of the static one, it becomes a trivial work to transfer the RL code between TensorFlow and PyTorch.\n\n## Contents:\n\n* Multiple versions of **Soft Actor-Critic (SAC)** are implemented.\n\n  **SAC Version 1**:\n\n     `sac.py`: using state-value function.\n\n     paper: https:\u002F\u002Farxiv.org\u002Fpdf\u002F1801.01290.pdf\n\n  **SAC Version 2**:\n\n   `sac_v2.py`: using target Q-value function instead of state-value function.\n\n    paper: https:\u002F\u002Farxiv.org\u002Fpdf\u002F1812.05905.pdf\n    \n  **SAC Discrete**\n  \n   `sac_discrete.py`: for discrete action space.\n\n    paper (the author is actually one of my classmates at IC): https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.07207\n    \n  **SAC Discrete PER**\n  \n   `sac_discrete_per.py`: for discrete action space, and with prioritized experience replay (PER).\n\n* **Deep Deterministic Policy Gradient (DDPG)**:\n\n  `ddpg.py`: implementation of DDPG.\n\n* **Twin Delayed DDPG (TD3)**:\n\n   `td3.py`: implementation of TD3.\n\n   paper: https:\u002F\u002Farxiv.org\u002Fpdf\u002F1802.09477.pdf\n\n* **Proximal Policy Optimization (PPO)**:\n  \n  For continuous environments, two versions are implemented:\n  \n  Version 1: `ppo_continuous.py` and `ppo_continuous_multiprocess.py` \n  \n  Version 2: `ppo_continuous2.py` and `ppo_continuous_multiprocess2.py` \n  \n  For discrete environment:\n  \n  `ppo_gae_discrete.py`: with Generalized Advantage Estimation (GAE)\n\n* **Actor-Critic (AC) \u002F A2C**:\n\n  `ac.py`: 
extensible AC\u002FA2C, easy to change to DDPG, etc.\n\n   A very extensible version of vanilla AC\u002FA2C, supporting all continuous\u002Fdiscrete, deterministic\u002Fnon-deterministic cases.\n  \n* **Q-learning**, **SARSA**, **Monte-Carlo Regression**:\n\n  `qlearning_sarsa_mc.ipynb`: comparison of the three.\n \n* **DQN**:\n\n  `dqn.py`: a simple DQN.\n\n* **QT-Opt**:\n\n   Two versions are implemented [here](https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FQT_Opt).\n\n* **PointNet** for landmark generation from images with unsupervised learning is implemented [here](https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPointNet_Landmarks_from_Image\u002Ftree\u002Fmaster). This method is also used for image-based reinforcement learning as a SOTA algorithm, called **Transporter**.\n\n  original paper: [Unsupervised Learning of Object Landmarks through Conditional Image Generation](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F7657-unsupervised-learning-of-object-landmarks-through-conditional-image-generation.pdf)\n\n  paper for RL: [Unsupervised Learning of Object Keypoints for Perception and Control](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1906.11883.pdf)\n\n* **Recurrent Policy Gradient**:\n\n  `rdpg.py`: DDPG with LSTM policy.\n\n  `td3_lstm.py`: TD3 with LSTM policy.\n\n  `sac_v2_lstm.py`: SAC with LSTM policy.\n  \n  `sac_v2_gru.py`: SAC with GRU policy.\n\n  References:\n\n  [Memory-based control with recurrent neural networks](https:\u002F\u002Farxiv.org\u002Fabs\u002F1512.04455)\n\n  [Sim-to-Real Transfer of Robotic Control with Dynamics Randomization](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.06537)\n  \n * **Soft Decision Tree** as function approximator for PPO:\n \n   `sdt_ppo_gae_discrete.py`: replaces the policy network layers in PPO with a [Soft Decision Tree](https:\u002F\u002Farxiv.org\u002Fabs\u002F1711.09784), for achieving explainable RL.\n   \n   paper: [CDT: Cascading Decision Trees for Explainable Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2011.07553)\n   \n * **Probabilistic Mixture-of-Experts (PMOE)**:\n \n   PMOE uses a differentiable multi-modal Gaussian distribution to replace the standard unimodal Gaussian distribution for policy representation.\n   \n   `pmoe_sac.py`: based on off-policy SAC.\n   \n   `pmoe_ppo.py`: based on on-policy PPO.\n   \n   paper: [Probabilistic Mixture-of-Experts for Efficient Deep Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2104.09122)\n\n * **QMIX**:\n\n     `qmix.py`: a fully cooperative multi-agent RL algorithm, demo environment using [pettingzoo](https:\u002F\u002Fwww.pettingzoo.ml\u002Fatari\u002Fentombed_cooperative).\n\n     paper: http:\u002F\u002Fproceedings.mlr.press\u002Fv80\u002Frashid18a.html\n     \n * **Phasic Policy Gradient (PPG)**:\n\n   todo\n\n   paper: [Phasic Policy Gradient](http:\u002F\u002Fproceedings.mlr.press\u002Fv139\u002Fcobbe21a.html)\n     \n \n * **Maximum a Posteriori Policy Optimisation (MPO)**:\n \n    todo\n\n    paper: [Maximum a Posteriori Policy Optimisation](https:\u002F\u002Farxiv.org\u002Fabs\u002F1806.06920)\n \n * **Advantage-Weighted Regression (AWR)**:\n\n    todo \n\n    paper: [Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1910.00177.pdf)\n\n## Usage:\n`python ***.py --train` \n\n`python ***.py --test` \n\n## Troubleshooting:\n\nIf you encounter a *\"NotImplementedError\"*, it may be due to the wrong gym version. 
The newest gym==0.14 won't work. Install gym==0.7 or gym==0.10 with `pip install -r requirements.txt`.\n\n## Undervalued Tricks:\n\nAs is well known, there are various tricks in empirical RL algorithm implementations to support the performance in practice, including hyper-parameters, normalization, network architecture, or even the hidden activation function, etc. I summarize some that I encountered in the programs in this repo here:\n\n * Environment specific: \n    *  For the `Pendulum-v0` environment in Gym, a reward pre-processing of `(r+8)\u002F8` usually improves the learning efficiency, as shown [here](https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPopular-RL-Algorithms\u002Fblob\u002F7f2bb74a51cf9cbde92a6ccfa42e97dc129dd145\u002Fppo_continuous2.py#L376).\n    Also, this environment needs the [maximum episode length](https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPopular-RL-Algorithms\u002Fblob\u002F7f2bb74a51cf9cbde92a6ccfa42e97dc129dd145\u002Fsac_v2.py#L364) to be at least 150 to learn well; episodes that are too short make it hard to learn.\n    * The `MountainCar-v0` environment in Gym has a very sparse reward (only when reaching the flag), so learning curves will generally be noisy; some environment-specific processing may therefore also be needed.\n\n * Normalization:\n    * [Reward normalization](https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPopular-RL-Algorithms\u002Fblob\u002F7f2bb74a51cf9cbde92a6ccfa42e97dc129dd145\u002Fsac_v2.py#L262) or [advantage normalization](https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPopular-RL-Algorithms\u002Fblob\u002F881903e4aa22921f142daedfcf3dd266488405d8\u002Fppo_gae_discrete.py#L79) in a batch can sometimes greatly improve performance (learning efficiency, stability), although theoretically on-policy algorithms like PPO should not apply data normalization during training due to distribution shift. For an in-depth look at this problem, we should treat it differently: (1) when normalizing the direct input data like observations, actions, rewards, etc.; (2) when normalizing the estimates of values (state value, state-action value, advantage, etc.). For (1), a more reasonable way to normalize is to keep a moving average of the previous mean and standard deviation, to achieve an effect similar to normalizing over the full dataset during RL agent learning (which is not possible, since in RL the data comes from the interaction of agents and environments). For (2), we can simply conduct normalization on the value estimates (rather than keeping the historical average), since we do not want the estimated values to have distribution shift, so we treat them like a static distribution.\n\n* Multiprocessing:\n    * Is the multiprocessing update based on `torch.multiprocessing` the right\u002Fsafe way to parallelize the code? \n It can be seen that the official instruction (the Hogwild example) for using `torch.multiprocessing` applies no explicit locks, which means it can be potentially unsafe when multiple processes generate gradients and update the shared model at the same time. See more discussions [here](https:\u002F\u002Fdiscuss.pytorch.org\u002Ft\u002Fsynchronization-for-sharing-updating-shared-model-state-dict-across-multi-process\u002F50102\u002F2) and some [tests](https:\u002F\u002Fdiscuss.pytorch.org\u002Ft\u002Fmodel-update-with-share-memory-need-lock-protection\u002F72857) and [answers](https:\u002F\u002Fdiscuss.pytorch.org\u002Ft\u002Fgrad-sharing-problem-in-a3c\u002F10635). 
In general, the drawback of unsafe updates may be outweighed by the speed-up from multiprocessing (and RL training itself already has huge variance and noise).\n\n   * Although I provide multiprocessing versions of several algorithms ([SAC](https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPopular-RL-Algorithms\u002Fblob\u002Fmaster\u002Fsac_v2_multiprocess.py), [PPO](https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPopular-RL-Algorithms\u002Fblob\u002Fmaster\u002Fppo_continuous_multiprocess2.py), etc.), for small-scale environments in Gym this is usually unnecessary or even inefficient. A vectorized environment wrapper for parallel environment sampling may be a more appropriate solution for learning these environments, since the bottleneck in learning efficiency mainly lies in the interaction with environments rather than in the model learning (back-propagation) process.\n   * A quick note on multiprocess usage:\n      \u003Cp align=\"center\">\n      \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fquantumiracle_Popular-RL-Algorithms_readme_7c6a9a3949ac.png\" width=\"40%\">\n      \u003C\u002Fp>\n      Sharing a class instance with its states across multiple processes requires putting the instance inside a multiprocessing manager:\n      \u003Cp align=\"center\">\n      \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fquantumiracle_Popular-RL-Algorithms_readme_2bd23d2bca84.png\" width=\"40%\">\n      \u003C\u002Fp>\n* PPO Details:\n\n    * [Here](https:\u002F\u002Fdocs.google.com\u002Fdocument\u002Fd\u002F19VucQYtiCubFt6IIfzO-Gsguvs8BfnXTxp76RXUPDNA\u002Fedit?usp=sharing) I summarized a list of implementation details for the PPO algorithm on continuous action spaces, corresponding to the scripts `ppo_gae_continuous.py`, `ppo_gae_continuous2.py` and `ppo_gae_continuous3.py`.\n\nFor more discussion of **implementation tricks**, see this [chapter](https:\u002F\u002Flink.springer.com\u002Fchapter\u002F10.1007\u002F978-981-15-4095-0_18) in our book.\n\n## Performance:\n\n* **SAC** for gym Pendulum-v0:\n\nSAC with automatically updated entropy variable alpha:\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fquantumiracle_Popular-RL-Algorithms_readme_26af11de988f.png\" width=\"100%\">\n\u003C\u002Fp>\nSAC without automatically updated entropy variable alpha:\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fquantumiracle_Popular-RL-Algorithms_readme_1681545ef49e.png\" width=\"100%\">\n\u003C\u002Fp>\n\nIt shows that the automatic entropy update helps the agent learn faster.\n\n* **TD3** for gym Pendulum-v0:\n\nTD3 with deterministic policy:\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fquantumiracle_Popular-RL-Algorithms_readme_acf2c9aea2fa.png\" width=\"100%\">\n\u003C\u002Fp>\nTD3 with non-deterministic\u002Fstochastic policy:\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fquantumiracle_Popular-RL-Algorithms_readme_d180e5f6b27a.png\" width=\"100%\">\n\u003C\u002Fp>\n\nTD3 with the deterministic policy seems to work a little better, but the results are basically similar.\n\n* **AC** for gym CartPole-v0:\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fquantumiracle_Popular-RL-Algorithms_readme_f33a31e3fec3.png\" width=\"100%\">\n\u003C\u002Fp>\n\n   However, vanilla AC\u002FA2C cannot handle the continuous case like gym 
Pendulum-v0 well.\n   \n* **PPO** for gym LunarLanderContinuous-v2:\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fquantumiracle_Popular-RL-Algorithms_readme_fbae67e9a4ea.png\" width=\"100%\">\n\u003C\u002Fp>\n\nUse `ppo_continuous_multiprocess2.py`.\n\n## Citation:\nTo cite this repository:\n```\n@misc{rlalgorithms,\n  author = {Zihan Ding},\n  title = {Popular-RL-Algorithms},\n  year = {2019},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPopular-RL-Algorithms}},\n}\n```\n\n## Other Resources:\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fquantumiracle_Popular-RL-Algorithms_readme_001e0467bef7.png\" width=\"20%\">\n\u003C\u002Fp>\n\n**Deep Reinforcement Learning: Foundamentals, Research and Applications** *Springer Nature 2020* \n\nis the book I edited with [Dr. Hao Dong](https:\u002F\u002Fgithub.com\u002Fzsdonghao) and [Dr. Shanghang Zhang](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=voqw10cAAAAJ&hl=en), which provides a wide coverage of topics in deep reinforcement learning. Details see [website](https:\u002F\u002Fdeepreinforcementlearningbook.org\u002F) and [Springer webpage](https:\u002F\u002Fwww.springer.com\u002Fgp\u002Fbook\u002F9789811540943). To cite the book:\n```\n@book{deepRL-2020,\n title={Deep Reinforcement Learning: Fundamentals, Research, and Applications},\n editor={Hao Dong, Zihan Ding, Shanghang Zhang},\n author={Hao Dong, Zihan Ding, Shanghang Zhang, Hang Yuan, Hongming Zhang, Jingqing Zhang, Yanhua Huang, Tianyang Yu, Huaqing Zhang, Ruitong Huang},\n publisher={Springer Nature},\n note={\\url{http:\u002F\u002Fwww.deepreinforcementlearningbook.org}},\n year={2020}\n}\n```\n","# 流行的无模型强化学习算法  \n\u003C!-- [![Tweet](https:\u002F\u002Fimg.shields.io\u002Ftwitter\u002Furl\u002Fhttp\u002Fshields.io.svg?style=social)](https:\u002F\u002Ftwitter.com\u002Fintent\u002Ftweet?text=State-of-the-art-Model-free-Reinforcement-Learning-Algorithms%20&url=hhttps:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FSTOA-RL-Algorithms&hashtags=RL) -->\n\n\n基于 **PyTorch** 和 **TensorFlow 2.0** 的、针对 OpenAI Gym 环境以及自定义 Reacher 环境的最先进无模型强化学习算法实现。\n\n算法包括：\n* **Q-learning**；\n* **SARSA**；\n* **蒙特卡洛回归**；\n* **Actor-Critic (AC\u002FA2C)**；\n* **Soft Actor-Critic (SAC)**；\n* **深度确定性策略梯度 (DDPG)**；\n* **双延迟 DDPG (TD3)**；\n* **近端策略优化 (PPO)**；\n* **QT-Opt（包含交叉熵方法）**；\n* **PointNet**；\n* **Transporter**；\n* **循环策略梯度**；\n* **软决策树**；\n* **概率混合专家 (PMOE)**；\n* **QMIX**\n等。\n\n请注意，这个仓库更像是我在研究和学习期间实现并测试过的算法个人集合，而非用于实际使用的官方开源库或软件包。不过，我认为与大家分享这些内容可能会有所帮助，并期待大家对我的实现进行有益的讨论。但我在代码的清理和结构化方面并没有投入太多精力。正如你可能注意到的那样，每种算法可能有多个版本的实现，我特意将它们全部展示出来，供你参考和比较。此外，该仓库仅包含 **PyTorch** 实现。\n\n对于 RL 算法的官方库，我提供了以下两个基于 **TensorFlow 2.0 + TensorLayer 2.0** 的库：\n\n* [**RL 教程**](https:\u002F\u002Fgithub.com\u002Ftensorlayer\u002Ftensorlayer\u002Ftree\u002Freinforcement-learning\u002Fexamples\u002Freinforcement_learning) (*状态：已发布*) 包含以简单结构呈现的 RL 算法教程式实现。\n\n* [**RLzoo**](https:\u002F\u002Fgithub.com\u002Ftensorlayer\u002FRLzoo) (*状态：已发布*) 是一个具有高级 API 的基准实现，支持多种流行环境，结构层次分明，便于使用。\n\n针对多智能体 RL，我还建立了一个新的仓库（**PyTorch**）：\n* [**MARS**](https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FMARS) (*状态：开发中*) 是一个用于游戏类多智能体 RL 的库，例如 PettingZoo Atari、SlimeVolleyBall 等。\n\n由于 TensorFlow 2.0 已经采用了动态图构建方式，取代了静态图，因此在 TensorFlow 和 PyTorch 之间迁移 RL 代码变得非常容易。\n\n## 内容：\n\n* 多个版本的 **Soft Actor-Critic (SAC)** 已被实现。\n\n  **SAC 版本 1**：\n\n     `sac.py`：使用状态值函数。\n\n    
 论文：https:\u002F\u002Farxiv.org\u002Fpdf\u002F1801.01290.pdf\n\n  **SAC 版本 2**：\n\n   `sac_v2.py`：使用目标 Q 值函数代替状态值函数。\n\n    论文：https:\u002F\u002Farxiv.org\u002Fpdf\u002F1812.05905.pdf\n    \n  **SAC 离散版**\n  \n   `sac_discrete.py`：适用于离散动作空间。\n\n    论文（作者实际上是我的 IC 同学之一）：https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.07207\n    \n  **SAC 离散版 PER**\n  \n   `sac_discrete_per.py`：适用于离散动作空间，并采用优先经验回放（PER）。\n\n* **深度确定性策略梯度 (DDPG)**：\n\n  `ddpg.py`：DDPG 的实现。\n\n* **双延迟 DDPG (TD3)**：\n\n   `td3.py`：TD3 的实现。\n\n   论文：https:\u002F\u002Farxiv.org\u002Fpdf\u002F1802.09477.pdf\n\n* **近端策略优化 (PPO)**：\n  \n  对于连续环境，实现了两个版本：\n  \n  版本 1：`ppo_continuous.py` 和 `ppo_continuous_multiprocess.py` \n  \n  版本 2：`ppo_continuous2.py` 和 `ppo_continuous_multiprocess2.py` \n  \n  对于离散环境：\n  \n  `ppo_gae_discrete.py`：采用广义优势估计（GAE）\n\n* **Actor-Critic (AC) \u002F A2C**：\n\n  `ac.py`：可扩展的 AC\u002FA2C，易于修改为 DDPG 等。\n\n   这是一个非常可扩展的普通 AC\u002FA2C 版本，支持所有连续\u002F离散、确定性\u002F非确定性情况。\n  \n* **Q-learning**、**SARSA**、**蒙特卡洛回归**：\n\n  `qlearning_sarsa_mc.ipynb`：三者的对比。\n \n* **DQN**：\n\n  `dqn.py`：一个简单的 DQN。\n\n* **QT-Opt**：\n\n   在 [这里](https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FQT_Opt) 实现了两个版本。\n\n* **PointNet** 用于通过无监督学习从图像中生成地标，已在 [这里](https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPointNet_Landmarks_from_Image\u002Ftree\u002Fmaster) 实现。这种方法也被用作基于图像的强化学习中的 SOTA 算法，称为 **Transporter**。\n\n  原始论文：[通过条件图像生成进行对象地标的无监督学习](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F7657-unsupervised-learning-of-object-landmarks-through-conditional-image-generation.pdf)\n\n  强化学习相关论文：[用于感知和控制的对象关键点的无监督学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1906.11883.pdf)\n\n* **循环策略梯度**：\n\n  `rdpg.py`：带有 LSTM 策略的 DDPG。\n\n  `td3_lstm.py`：带有 LSTM 策略的 TD3。\n\n  `sac_v2_lstm.py`：带有 LSTM 策略的 SAC。\n\n  `sac_v2_gru.py`：带有 GRU 策略的 SAC。\n\n  参考文献：\n\n  [基于记忆的递归神经网络控制](https:\u002F\u002Farxiv.org\u002Fabs\u002F1512.04455)\n\n  [通过动力学随机化实现机器人控制的模拟到现实迁移](https:\u002F\u002Farxiv.org\u002Fabs\u002F1710.06537)\n  \n * **软决策树** 作为 PPO 的函数逼近器：\n \n   `sdt_ppo_gae_discrete.py`：将 PPO 中的策略网络层替换为 [软决策树](https:\u002F\u002Farxiv.org\u002Fabs\u002F1711.09784)，以实现可解释的强化学习。\n   \n   论文：[CDT：用于可解释强化学习的级联决策树](https:\u002F\u002Farxiv.org\u002Fabs\u002F2011.07553)\n   \n * **概率混合专家 (PMOE)**：\n \n   PMOE 使用可微分的多模态高斯分布来替代标准的单模态高斯分布，用于策略表示。\n   \n   `pmoe_sac.py`：基于离线策略 SAC。\n   \n   `pmoe_ppo.py`：基于在线策略 PPO。\n   \n   论文：[用于高效深度强化学习的概率混合专家](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2104.09122)\n\n * **QMIX**：\n\n     `qmix.py`：一种完全协作式的多智能体强化学习算法，演示环境使用 [pettingzoo](https:\u002F\u002Fwww.pettingzoo.ml\u002Fatari\u002Fentombed_cooperative)。\n\n     论文：http:\u002F\u002Fproceedings.mlr.press\u002Fv80\u002Frashid18a.html\n     \n * **相位策略梯度 (PPG)**：\n\n   待办事项\n\n   论文：[相位策略梯度](http:\u002F\u002Fproceedings.mlr.press\u002Fv139\u002Fcobbe21a.html)\n     \n \n * **最大后验策略优化 (MPO)**：\n \n    待办事项\n\n    论文：[最大后验策略优化](https:\u002F\u002Farxiv.org\u002Fabs\u002F1806.06920)\n\n * **优势加权回归 (AWR)**：\n\n    待办事项 \n\n    论文：[优势加权回归：简单且可扩展的离线强化学习](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1910.00177.pdf)\n\n## 使用方法：\n`python ***.py --train` \n\n`python ***.py --test`\n\n## 故障排除：\n\n如果您遇到“未实现错误”，这可能是由于 Gym 版本不正确所致。最新的 gym==0.14 不适用。请使用 `pip install -r requirements.txt` 安装 gym==0.7 或 gym==0.10。\n\n## 一些被低估的技巧：\n\n众所周知，在强化学习算法的实际实现中，有许多技巧可以提升性能，包括超参数调整、归一化、网络架构，甚至是隐藏层的激活函数等。我在本仓库的代码中遇到的一些技巧总结如下：\n\n * 环境特定：\n    * 对于 Gym 中的 `Pendulum-v0` 环境，通常对奖励进行预处理 `(r+8)\u002F8` 可以提高学习效率，如 
[此处](https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPopular-RL-Algorithms\u002Fblob\u002F7f2bb74a51cf9cbde92a6ccfa42e97dc129dd145\u002Fppo_continuous2.py#L376) 所示。\n    此外，该环境还需要将[最大 episode 长度](https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPopular-RL-Algorithms\u002Fblob\u002F7f2bb74a51cf9cbde92a6ccfa42e97dc129dd145\u002Fsac_v2.py#L364)设置为至少 150，才能更好地学习；episode 过短会使得学习变得困难。\n    * Gym 中的 `MountainCar-v0` 环境奖励非常稀疏（只有到达终点旗时才有奖励），因此其学习曲线通常较为嘈杂；所以针对该环境可能也需要一些特殊的处理方法。\n\n * 归一化：\n    * 在批次中进行[奖励归一化](https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPopular-RL-Algorithms\u002Fblob\u002F7f2bb74a51cf9cbde92a6ccfa42e97dc129dd145\u002Fsac_v2.py#L262)或[优势归一化](https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPopular-RL-Algorithms\u002Fblob\u002F881903e4aa22921f142daedfcf3dd266488405d8\u002Fppo_gae_discrete.py#L79)，有时可以显著提升算法的性能（学习效率和稳定性）。尽管理论上像 PPO 这样的策略梯度算法在训练过程中不应应用数据归一化，因为这会导致分布偏移。要深入理解这个问题，我们需要区分以下两种情况：(1) 对直接输入数据（如观测值、动作、奖励等）进行归一化；(2) 对价值估计（状态值、状态-动作值、优势等）进行归一化。对于 (1)，更合理的归一化方式是维护一个移动平均的均值和标准差，从而达到类似于在 RL 代理学习过程中对整个数据集进行归一化的效果（然而在 RL 中这是不可能的，因为数据来源于智能体与环境的交互）。而对于 (2)，我们可以直接对价值估计进行归一化（而不是保留历史平均值），因为我们并不希望这些估计值发生分布偏移，因此可以将其视为一种静态分布。\n\n* 多进程：\n    * 基于 `torch.multiprocessing` 的多进程更新是否是并行化代码的正确且安全的方式？\n    从官方文档（例如 Hogwild 示例）可以看出，`torch.multiprocessing` 的使用并未采用任何显式锁机制，这意味着当多个进程同时计算梯度并更新共享模型时，可能会存在潜在的安全隐患。更多讨论请参见 [这里](https:\u002F\u002Fdiscuss.pytorch.org\u002Ft\u002Fsynchronization-for-sharing-updating-shared-model-state-dict-across-multi-process\u002F50102\u002F2)，以及一些 [测试](https:\u002F\u002Fdiscuss.pytorch.org\u002Ft\u002Fmodel-update-with-share-memory-need-lock-protection\u002F72857) 和 [解答](https:\u002F\u002Fdiscuss.pytorch.org\u002Ft\u002Fgrad-sharing-problem-in-a3c\u002F10635)。总体而言，虽然存在不安全更新的风险，但多进程带来的加速效果往往更为显著（而且 RL 训练本身就有很大的方差和噪声）。\n\n    * 尽管我提供了几种算法的多进程版本（如 [SAC](https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPopular-RL-Algorithms\u002Fblob\u002Fmaster\u002Fsac_v2_multiprocess.py)、[PPO](https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPopular-RL-Algorithms\u002Fblob\u002Fmaster\u002Fppo_continuous_multiprocess2.py) 等），但对于 Gym 中的小规模环境来说，通常并不需要甚至可能效率更低。在这种情况下，使用向量化环境包装器来进行并行环境采样可能是更合适的解决方案，因为学习效率的瓶颈主要在于与环境的交互，而非模型的学习（反向传播）过程。\n\n    * 关于多进程使用的一个小提示：\n      \u003Cp align=\"center\">\n      \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fquantumiracle_Popular-RL-Algorithms_readme_7c6a9a3949ac.png\" width=\"40%\">\n      \u003C\u002Fp>\n      要在多个进程之间共享类实例及其状态，必须将实例放入 `multiprocessing.manager` 中：\n      \u003Cp align=\"center\">\n      \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fquantumiracle_Popular-RL-Algorithms_readme_2bd23d2bca84.png\" width=\"40%\">\n      \u003C\u002Fp>\n* PPO 细节：\n\n    * [这里](https:\u002F\u002Fdocs.google.com\u002Fdocument\u002Fd\u002F19VucQYtiCubFt6IIfzO-Gsguvs8BfnXTxp76RXUPDNA\u002Fedit?usp=sharing) 我总结了连续动作空间下 PPO 算法的实现细节，对应脚本为 `ppo_gae_continuous.py`、`ppo_gae_continuous2.py` 和 `ppo_gae_continuous3.py`。\n\n关于 **实现技巧** 的更多讨论，请参阅我们书中这一章：[链接](https:\u002F\u002Flink.springer.com\u002Fchapter\u002F10.1007\u002F978-981-15-4095-0_18)。\n\n## 性能：\n\n* **SAC** 用于 Gym 的 Pendulum-v0 环境：\n\n带有自动更新熵系数 α 的 SAC：\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fquantumiracle_Popular-RL-Algorithms_readme_26af11de988f.png\" width=\"100%\">\n\u003C\u002Fp>\n不带自动更新熵系数 α 的 SAC：\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fquantumiracle_Popular-RL-Algorithms_readme_1681545ef49e.png\" 
width=\"100%\">\n\u003C\u002Fp>\n\n结果显示，自动更新熵有助于智能体更快地学习。\n\n* **TD3** 用于 Gym 的 Pendulum-v0 环境：\n\n采用确定性策略的 TD3：\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fquantumiracle_Popular-RL-Algorithms_readme_acf2c9aea2fa.png\" width=\"100%\">\n\u003C\u002Fp>\n采用非确定性\u002F随机策略的 TD3：\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fquantumiracle_Popular-RL-Algorithms_readme_d180e5f6b27a.png\" width=\"100%\">\n\u003C\u002Fp>\n\n看起来采用确定性策略的 TD3 表现稍好一些，但总体差异不大。\n\n* **AC** 用于 Gym 的 CartPole-v0 环境：\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fquantumiracle_Popular-RL-Algorithms_readme_f33a31e3fec3.png\" width=\"100%\">\n\u003C\u002Fp>\n\n不过，普通的 AC\u002FA2C 并不能很好地处理像 Gym 的 Pendulum-v0 这样的连续动作空间问题。\n\n* **PPO** 用于 Gym 的 LunarLanderContinuous-v2 环境：\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fquantumiracle_Popular-RL-Algorithms_readme_fbae67e9a4ea.png\" width=\"100%\">\n\u003C\u002Fp>\n\n请使用 `ppo_continuous_multiprocess2.py`。\n\n## 引用：\n若需引用此仓库，请使用以下格式：\n```\n@misc{rlalgorithms,\n  author = {Zihan Ding},\n  title = {Popular-RL-Algorithms},\n  year = {2019},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPopular-RL-Algorithms}},\n}\n```\n\n## 其他资源：\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fquantumiracle_Popular-RL-Algorithms_readme_001e0467bef7.png\" width=\"20%\">\n\u003C\u002Fp>\n\n**深度强化学习：基础、研究与应用** *施普林格自然出版社 2020年*\n\n是我与[董浩博士](https:\u002F\u002Fgithub.com\u002Fzsdonghao)和[张尚恒博士](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=voqw10cAAAAJ&hl=en)共同主编的书籍，全面涵盖了深度强化学习领域的多个主题。详细信息请参见[官方网站](https:\u002F\u002Fdeepreinforcementlearningbook.org\u002F)和[施普林格网页](https:\u002F\u002Fwww.springer.com\u002Fgp\u002Fbook\u002F9789811540943)。如需引用该书，请使用以下格式：\n```\n@book{deepRL-2020,\n title={Deep Reinforcement Learning: Fundamentals, Research, and Applications},\n editor={Hao Dong, Zihan Ding, Shanghang Zhang},\n author={Hao Dong, Zihan Ding, Shanghang Zhang, Hang Yuan, Hongming Zhang, Jingqing Zhang, Yanhua Huang, Tianyang Yu, Huaqing Zhang, Ruitong Huang},\n publisher={Springer Nature},\n note={\\url{http:\u002F\u002Fwww.deepreinforcementlearningbook.org}},\n year={2020}\n}\n```","# Popular-RL-Algorithms 快速上手指南\n\n本指南基于 `Popular-RL-Algorithms` 项目，旨在帮助开发者快速运行基于 PyTorch 的主流无模型强化学习算法（如 SAC, PPO, DDPG, TD3 等）。\n\n> **注意**：本项目主要为作者研究与学习期间的代码合集，包含多种实现版本以供对比参考，并非官方封装的标准库。代码结构可能较为灵活，适合学习与实验。\n\n## 环境准备\n\n*   **操作系统**: Linux \u002F macOS \u002F Windows\n*   **Python 版本**: 推荐 Python 3.6 - 3.8\n*   **核心框架**: PyTorch\n*   **关键依赖**: \n    *   `gym`: **必须安装特定版本**。最新版的 `gym==0.14` 会导致报错，请严格按照要求安装 `gym==0.7` 或 `gym==0.10`。\n    *   `pettingzoo` (可选，用于多智能体 QMIX 算法)\n\n## 安装步骤\n\n1.  **克隆仓库**\n    ```bash\n    git clone https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPopular-RL-Algorithms.git\n    cd Popular-RL-Algorithms\n    ```\n\n2.  **安装依赖**\n    建议使用国内镜像源加速安装，并严格锁定 gym 版本以避免兼容性错误。\n\n    ```bash\n    # 使用清华源安装依赖（自动处理 gym 版本限制）\n    pip install -r requirements.txt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n    ```\n\n    *如果 `requirements.txt` 未明确指定 gym 版本，请手动执行以下命令强制安装兼容版本：*\n    ```bash\n    pip install gym==0.10 -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n    ```\n\n3.  
**验证安装**\n    确保 PyTorch 和 Gym 能正常导入且无版本冲突。\n\n## 基本使用\n\n项目中的每个算法对应独立的 `.py` 文件（例如 `sac_v2.py`, `ppo_continuous.py`, `ddpg.py` 等）。\n\n### 1. 训练模型\n在命令行中运行具体算法脚本，并添加 `--train` 参数。\n\n```bash\n# 示例：训练 SAC 算法 (Version 2)\npython sac_v2.py --train\n\n# 示例：训练 PPO 算法 (连续动作空间)\npython ppo_continuous.py --train\n\n# 示例：训练 DDPG 算法\npython ddpg.py --train\n```\n\n### 2. 测试\u002F评估模型\n训练完成后，使用 `--test` 参数加载模型进行评估。\n\n```bash\n# 示例：测试 SAC 算法\npython sac_v2.py --test\n\n# 示例：测试 PPO 算法\npython ppo_continuous.py --test\n```\n\n### 3. 进阶提示\n*   **多进程加速**: 部分算法提供了多进程版本（文件名含 `multiprocess`），适用于计算密集型任务，但在小规模 Gym 环境中可能收益不明显。\n    ```bash\n    python ppo_continuous_multiprocess2.py --train\n    ```\n*   **离散动作空间**: 针对离散动作空间，请使用对应的 `_discrete` 版本脚本（如 `sac_discrete.py`, `ppo_gae_discrete.py`）。\n*   **环境奖励调整**: 对于 `Pendulum-v0` 等特定环境，代码中已内置了奖励预处理（如 `(r+8)\u002F8`）以提升学习效率，无需手动修改。","某机器人初创团队正在研发一款工业机械臂，需要让其通过强化学习自主掌握复杂环境下的精准抓取技能。\n\n### 没有 Popular-RL-Algorithms 时\n- **算法复现成本高昂**：团队成员需从零编写 SAC、TD3 等前沿算法代码，极易在数学公式转换中引入隐蔽 Bug，导致训练无法收敛。\n- **方案对比困难**：面对连续动作空间，难以快速在同一框架下对比 PPO、DDPG 和 SAC 的不同版本（如基于状态值函数或目标 Q 值函数）的性能差异。\n- **调试周期漫长**：缺乏经过验证的基准实现，研究人员花费数周时间排查是算法逻辑错误还是超参数设置不当，严重拖慢研发进度。\n- **离散场景支持缺失**：当任务切换到离散动作控制时，找不到现成的 SAC Discrete 或带优先经验回放（PER）的参考代码，需重新造轮子。\n\n### 使用 Popular-RL-Algorithms 后\n- **即插即用基线**：直接调用库中成熟的 PyTorch 版 SAC 和 TD3 实现，将算法搭建时间从数周缩短至几天，确保核心逻辑正确无误。\n- **多版本灵活切换**：轻松切换 PPO 的多进程版本或 SAC 的不同变体进行消融实验，快速确定最适合当前机械臂物理特性的算法架构。\n- **聚焦核心创新**：研究人员不再纠结于底层代码细节，而是将精力集中在奖励函数设计和仿真环境构建上，显著加速模型迭代。\n- **全场景覆盖**：利用内置的 SAC Discrete PER 等模块，无缝支持从连续轨迹规划到离散抓取指令的各类控制任务，无需额外开发。\n\nPopular-RL-Algorithms 通过提供多样且经过验证的算法实现，让研发团队从繁琐的代码复现中解放出来，专注于解决真实的机器人控制难题。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fquantumiracle_Popular-RL-Algorithms_8776c069.png","quantumiracle","Zihan Ding","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fquantumiracle_2450b96c.jpg","PhD at ECE Dept. Princeton University |\r\nMSc of ML at Imperial College London |\r\nIntern @ Meta GenAI&FAIR @Adobe   @ BorealisAI @ Tencent Robotics X @ Inspir.ai",null,"Princeton","zhding96@gmail.com","https:\u002F\u002Fquantumiracle.github.io\u002Fwebpage\u002F","https:\u002F\u002Fgithub.com\u002Fquantumiracle",[82,86],{"name":83,"color":84,"percentage":85},"Jupyter Notebook","#DA5B0B",71.1,{"name":87,"color":88,"percentage":89},"Python","#3572A5",28.9,1339,148,"2026-04-11T06:08:19","Apache-2.0","","未说明",{"notes":97,"python":95,"dependencies":98},"该仓库主要为个人研究代码集合，非官方库。特别注意：最新的 gym==0.14 版本会导致 'Not implemented Error'，必须安装 gym==0.7 或 gym==0.10。代码包含多种算法的不同实现版本供参考对比。多智能体强化学习部分需参考独立的 MARS 仓库。",[99,100,101],"PyTorch","gym==0.7 或 gym==0.10","pettingzoo",[14,103],"其他",[105,106,107],"reinforcement-learning","soft-actor-critic","state-of-the-art","2026-03-27T02:49:30.150509","2026-04-14T04:32:18.260367",[111,116,121,126,131,136,141],{"id":112,"question_zh":113,"answer_zh":114,"source_url":115},32410,"运行 SAC 算法时遇到 FileNotFoundError: '.\u002Fmodel\u002Fsac' 错误，是缺少模型文件吗？","这不是缺少模型文件，而是代码未自动创建保存模型的目录。解决方法有两种：1. 手动在项目根目录下创建 `model\u002Fsac` 文件夹；2. 
删除 `sac.py` 文件中第 371 行（或报错对应的行）关于路径检查的代码。维护者已更新脚本修复了此问题。","https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPopular-RL-Algorithms\u002Fissues\u002F41",{"id":117,"question_zh":118,"answer_zh":119,"source_url":120},32411,"在 BipedalWalker 环境中运行 SAC v2 LSTM 版本时出现 ValueError（数组长度不匹配），原因是什么？","这是因为该 LSTM 版本的实现要求所有回合（episode）的长度必须相同，以便将采样数据保存为张量。BipedalWalker 环境存在失败状态（如机器人倒地），导致回合提前终止，长度不一致，从而引发错误。目前代码未自动处理变长序列，建议仅在固定长度的环境（如 Reacher, Pendulum-v0, HalfCheetah-v2）中使用该 LSTM 版本，或自行修改代码以支持变长数据处理。","https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPopular-RL-Algorithms\u002Fissues\u002F34",{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},32412,"PPO 训练结果中奖励曲线（Reward Chart）每次看起来都一样，这是为什么？","这通常是因为随机种子（random seed）固定导致的。可以通过设置不同的随机种子来验证这一点。如果更改种子后曲线发生变化，则说明之前的重复是由于确定性初始化造成的。","https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPopular-RL-Algorithms\u002Fissues\u002F48",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},32413,"运行 dqn.py 时遇到 \"expected sequence of length 4 at dim 1\" 错误，如何修复状态初始化问题？","维护者无法复现该错误，默认情况下在 'CartPole-v1' 环境中输入状态 `x` 的形状应为 (4,)。如果遇到此错误，请首先检查依赖库版本（如 gym, torch）是否与项目要求一致。确保没有错误地修改状态初始化逻辑，通常不需要将 `current_frame_idx` 改为列表形式。","https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPopular-RL-Algorithms\u002Fissues\u002F80",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},32414,"该项目推荐的 Python 版本是多少？","项目推荐使用 Python 3.8 版本配合 `requirements.txt` 安装依赖。虽然其他版本可能也能运行，但 Python 3.8 是经过验证最稳定的环境。","https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPopular-RL-Algorithms\u002Fissues\u002F81",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},32415,"RDPG 算法是用于 MDP 还是 POMDP 环境？","根目录下的脚本主要针对 MDP（马尔可夫决策过程）环境。虽然 RDPG 理论上常用于处理 POMDP（部分可观测马尔可夫决策过程），但在本仓库的实现中，根目录代码主要是在 MDP 环境下运行的。如果需要处理 POMDP 环境，可以使用仓库中配合 pomdp-wrappers 使用的特定脚本。","https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPopular-RL-Algorithms\u002Fissues\u002F31",{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},32416,"在计算 SAC 的 Critic Loss 时，是否需要使用 `with torch.no_grad():` 来计算目标 Q 值？","不需要显式使用 `with torch.no_grad():` 包裹计算过程。代码中已经对 `target_q_value` 使用了 `.detach()` 方法（例如 `target_q_value.detach()`），这同样能阻止梯度回传到目标网络参数，效果与 `no_grad` 在此场景下是一致的。","https:\u002F\u002Fgithub.com\u002Fquantumiracle\u002FPopular-RL-Algorithms\u002Fissues\u002F16",[]]